I am having difficulty modeling HBase tables for the following requirement.
I have a table 'Store' that holds the store details (e.g. Pizza Hut).
I have a table 'Order' that holds the summary of each transaction (total transaction amount, etc.).
I have another table 'Order_Item' where every item ordered in a transaction is stored (item id, item name, item count, tax, etc.).
Example : Date Range - Last Week, Store - Pizza A, Item - A, Total Income - 120$
Example : Date Range - Last Week, Store - Pizza A, Item - A, Percentage of Income - 23%
I am really stuck on how to model the HBase tables, and the deadline is making me tense.
Could someone please help me with this?
In HBase, you want to design your tables around your typical queries. If you design them around some arbitrary notion of what "makes sense", you are going to see bad performance.
Since the major requirement is to query by date range / store / item, you want these to make up your key. If they do, your queries are going to be fast.
I suggest you make your key the concatenation of date + store + item, separated by some delimiter, e.g.:
20110103-PIZZAHUT-MEATLOVERS
20110103-PIZZAHUT-VEGETABLE
20110104-PIZZAHUT-MEATLOVERS
20110105-DOMINOS-HAWAIIAN
Then, store each item sold in the first column family as a column (ID:profit). ID here is something unique, such as a timestamp, a UUID, or a receipt ID.
For the first query, all you do is a key lookup on DATE-STORE-ITEM (one lookup per day in the date range), then sum all of the values you retrieve.
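The layout and the lookup described above can be sketched in plain Python, with a dict standing in for the HBase table. This is only an illustration under assumptions: the helper names (make_row_key, record_sale, total_for), the receipt IDs and the profit values are made up, and none of it is HBase client API.

```python
from collections import defaultdict

DELIM = "-"

def make_row_key(date, store, item):
    """Build a DATE-STORE-ITEM row key, e.g. 20110103-PIZZAHUT-MEATLOVERS."""
    return DELIM.join([date, store.upper(), item.upper()])

# Stand-in for the table: row key -> {column qualifier (sale ID) -> profit}.
table = defaultdict(dict)

def record_sale(date, store, item, sale_id, profit):
    """Store one sold item as a column (sale_id -> profit) in its row."""
    table[make_row_key(date, store, item)][sale_id] = profit

def total_for(date, store, item):
    """First query for a single day: one key lookup, then sum the columns."""
    row = table.get(make_row_key(date, store, item), {})
    return sum(row.values())

# Hypothetical sample data.
record_sale("20110103", "PizzaHut", "MeatLovers", "r-857283394", 12.5)
record_sale("20110103", "PizzaHut", "MeatLovers", "r-857283395", 7.5)
record_sale("20110103", "PizzaHut", "Vegetable", "r-859238494", 9.0)

print(total_for("20110103", "PizzaHut", "MeatLovers"))
```

For a multi-day range, the same lookup would be repeated for each date and the daily totals added together.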
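For the second query, do a range scan from 20110107-PIZZAHUT-! to 20110206-PIZZAHUT-~. Because the date leads the key, this range also covers rows for other stores on the days in between, so skip any row whose store is not PIZZAHUT. Sum the amounts for the item you are looking for and, separately, for all of the store's items. At the end, divide the first sum by the second to get the percentage.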
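A rough sketch of that range scan, again in plain Python rather than against a real cluster. The sorted key list imitates HBase's lexicographic row ordering; the data, the function names (scan, item_share) and the choice to treat the stop key as inclusive are assumptions made for the example (an HBase scan's stop row is exclusive by default).

```python
from bisect import bisect_left, bisect_right

# Hypothetical data: row key -> profit already summed per day/store/item.
rows = {
    "20110107-DOMINOS-HAWAIIAN": 50.0,
    "20110107-PIZZAHUT-MEATLOVERS": 30.0,
    "20110107-PIZZAHUT-VEGETABLE": 10.0,
    "20110108-DOMINOS-HAWAIIAN": 40.0,
    "20110108-PIZZAHUT-MEATLOVERS": 20.0,
    "20110108-PIZZAHUT-VEGETABLE": 20.0,
    "20110109-PIZZAHUT-MEATLOVERS": 99.0,  # falls outside the scanned range
}
sorted_keys = sorted(rows)  # HBase keeps rows sorted by key

def scan(start, stop):
    """Yield (key, value) for start <= key <= stop, in key order."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_right(sorted_keys, stop)
    for key in sorted_keys[lo:hi]:
        yield key, rows[key]

def item_share(start_date, end_date, store, item):
    """Percentage of the store's income in the range that came from item."""
    item_total = store_total = 0.0
    for key, profit in scan(f"{start_date}-{store}-!", f"{end_date}-{store}-~"):
        _, row_store, row_item = key.split("-", 2)
        if row_store != store:
            continue  # other stores sort between the two dates; skip them
        store_total += profit
        if row_item == item:
            item_total += profit
    return 100.0 * item_total / store_total if store_total else 0.0

print(item_share("20110107", "20110108", "PIZZAHUT", "MEATLOVERS"))
```

Note that the scan itself returns the 20110108-DOMINOS-HAWAIIAN row; it is the store check inside the loop that keeps it out of the totals.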
The approach suggested by orangeoctopus stores one row per day, per store, per item, with a column for every transaction. That's a good one. The other approach is to store each transaction in its own row, with the same key fields plus the unique ID as part of the key. Then there's a single column, in a single column family, for the amount.
20110103-PIZZAHUT-MEATLOVERS-857283394
20110103-PIZZAHUT-MEATLOVERS-857283395
20110103-PIZZAHUT-MEATLOVERS-857283396
20110103-PIZZAHUT-VEGETABLE-859238494
20110103-PIZZAHUT-VEGETABLE-859238495
etc.
The same logic applies in this design: both of your queries scan over a specific date range and get the data they need that way (and, if you want to restrict the scan to a single store, or a store/product combination, you can do that). The only difference is that now you're scanning over a bunch of rows, instead of many columns in one row per date/store/item combination.
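As a sketch of the one-row-per-transaction variant, the snippet below models the table as a sorted list of keys and sums amounts with a prefix scan. The amounts, the extra 20110104 row and the names (tall_rows, prefix_scan, total) are hypothetical; the seek-then-read loop only imitates what a row-key prefix scan does.

```python
from bisect import bisect_left

# Hypothetical tall table: one row per transaction, one amount per row.
tall_rows = {
    "20110103-PIZZAHUT-MEATLOVERS-857283394": 12.0,
    "20110103-PIZZAHUT-MEATLOVERS-857283395": 8.0,
    "20110103-PIZZAHUT-MEATLOVERS-857283396": 10.0,
    "20110103-PIZZAHUT-VEGETABLE-859238494": 9.0,
    "20110104-PIZZAHUT-MEATLOVERS-857283401": 11.0,
}
keys = sorted(tall_rows)  # rows are kept in key order

def prefix_scan(prefix):
    """Seek to the first key >= prefix, then read while the prefix matches."""
    i = bisect_left(keys, prefix)
    while i < len(keys) and keys[i].startswith(prefix):
        yield keys[i], tall_rows[keys[i]]
        i += 1

def total(date, store, item):
    """Sum the amounts of all transaction rows for one date/store/item."""
    # The trailing delimiter stops MEATLOVERS from also matching a
    # longer item name that merely begins with the same letters.
    return sum(amount for _, amount in prefix_scan(f"{date}-{store}-{item}-"))

print(total("20110103", "PIZZAHUT", "MEATLOVERS"))
```

Dropping the item (or the store and the item) from the prefix widens the scan to a whole store-day or a whole day, in the same way as described above.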
These are the two key design techniques in HBase: entities as rows, or entities as columns nested within a parent entity row. The advantage to the latter is that all columns within a row can be updated transactionally; the downside is that the code to retrieve it is a little more complicated (and, you pay a slight price for that transactionality if you have high concurrency).
FYI, what you can't do efficiently with this row key is a query that doesn't lead with the parts of your row key, in order. So, for example, if you wanted sales for Pizza Hut for all time, you'd have to scan every row in the table on the server side, which is presumably not desirable because you presumably have a LOT of data in this table (otherwise you wouldn't be using HBase).
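To make the difference concrete, here is a small Python illustration that counts how many rows each kind of query has to read. It uses the four example keys from the first answer; the function names and the idea of returning a "rows read" count are assumptions for the sake of the example, not an HBase metric.

```python
from bisect import bisect_left

keys = sorted([
    "20110103-PIZZAHUT-MEATLOVERS",
    "20110103-PIZZAHUT-VEGETABLE",
    "20110104-PIZZAHUT-MEATLOVERS",
    "20110105-DOMINOS-HAWAIIAN",
])

def by_date_prefix(date):
    """Leads with the key: seek to the date, read until the prefix ends."""
    i = bisect_left(keys, date)
    matches = []
    while i < len(keys) and keys[i].startswith(date):
        matches.append(keys[i])
        i += 1
    return matches, len(matches)  # rows read == rows matched

def by_store(store):
    """Does not lead with the key: every row has to be inspected."""
    matches = [k for k in keys if k.split("-")[1] == store]
    return matches, len(keys)  # rows read == whole table

print(by_date_prefix("20110103")[1], by_store("PIZZAHUT")[1])
```

With four rows the gap is trivial, but the second count grows with the size of the table while the first grows only with the size of the answer.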