In my HBase table, each row may be have different columns than other rows. For example;
ROW COLUMN
1-1040 cf:s1
1-1040 cf:s2
1-1043 cf:s2
2-1040 cf:s5
2-1045 cf:s99
3-1040 cf:s75
3-1042 cf:s135
As seen above, each row has different columns than other rows. So, when I run scan query like this;
scan 'tb', {COLUMNS=>'cf:s2', STARTROW=>'1-1040', ENDROW=>'1-1044'}
I want to get cf:s2 values using above query. But, does any performance issue occur due to each row has different columns?
Another option;
ROW COLUMN
1-1040-s1 cf:value
1-1040-s2 cf:value
1-1043-s2 cf:value
2-1040-s5 cf:value
2-1045-s99 cf:value
3-1040-s75 cf:value
3-1042-s135 cf:value
In this option, when I want to get s2 values between 1-1040 and 1-1044, I am running this query for this;
scan 'tb', {STARTROW=>'1-1040s2', ENDROW=>'1-1044', FILTER=>"RowFilter(=, 'substring:s2')"}
When I want to get s2 values, which option is better in read performance?
HBase stores all records for a given column family in the same file, and so the scan has to run over all key-value pairs, even if you apply a filter. This is true of both ways you suggest for storing the data.
For optimal performance of this specific scan, you should consider storing your s2 data in a different column family. Under-the-hood, HBase will store your data in the following way:
One file:
1-1040 cf1:s1
2-1040 cf1:s5
2-1045 cf1:s99
3-1040 cf1:s75
3-1042 cf1:s135
Another file:
1-1040 cf2:s2
1-1043 cf2:s2
Then you can run a scan over just cf2, and HBase will only read data containing s2, making the operation much faster.
scan 'tb', {COLUMNS => 'cf2', STARTROW=>'1-1040s2', ENDROW=>'1-1044'}
Considerations:
s5, s75 etc. In this case, your composite rowkey option is better as HBase only need look at the rowkey, and not column qualifiers.s2, but might not be fastest for other queries.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With