What's The Best Practice In Designing A Cassandra Data Model? [closed]

And what are the pitfalls to avoid? Are there any deal breakers for you? E.g., I've heard that exporting/importing Cassandra data is very difficult, which makes me wonder whether that will hinder syncing production data to a development environment.

BTW, it's very hard to find good tutorials on Cassandra. The only one I have, http://arin.me/code/wtf-is-a-supercolumn-cassandra-data-model, is still pretty basic.

Thanks.

311

asked Oct 01 '09 08:10

Jerry

1 Answer

For me, the main thing is deciding whether to use the OrderPreservingPartitioner or the RandomPartitioner.

If you use the RandomPartitioner, range scans are not possible. This means that you must know the exact key for any activity, INCLUDING CLEANING UP OLD DATA.

So if you've got a lot of churn and no magic way of knowing exactly which keys you've inserted stuff for, with the random partitioner you can easily "lose" stuff. That causes a disk space leak, which will eventually consume all storage.

On the other hand, you can ask the ordered partitioner "what keys do I have in Column Family X between A and B?" and it'll tell you. You can then clean them up.

However, there is a downside as well. As Cassandra doesn't do automatic load balancing, if you use the ordered partitioner, in all likelihood all your data will end up in just one or two nodes and none in the others, which means you'll waste resources.
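To make the skew concrete, here is a toy model of the two placement styles. It is only a sketch: the 4-node cluster, the "split the token space by first byte" rule, and the `user:NNNNN` key format are all assumptions made up for illustration, not how Cassandra actually assigns token ranges.

```python
import hashlib
from collections import Counter

NODES = 4  # hypothetical cluster size


def node_for_token(token: bytes) -> int:
    """Toy placement rule: split the token space evenly by the first byte."""
    return token[0] * NODES // 256


def ordered_placement(key: str) -> int:
    # Order-preserving style: the key itself acts as the token.
    return node_for_token(key.encode("utf-8"))


def random_placement(key: str) -> int:
    # Random-partitioner style: the token is an MD5 hash of the key.
    return node_for_token(hashlib.md5(key.encode("utf-8")).digest())


# Hypothetical keys that all share a common prefix, as real keys often do.
keys = [f"user:{i:05d}" for i in range(1000)]
ordered_counts = Counter(ordered_placement(k) for k in keys)
random_counts = Counter(random_placement(k) for k in keys)

print("ordered:", dict(ordered_counts))
print("hashed: ", dict(random_counts))
```

Because every key starts with the same character, the ordered placement puts all 1000 keys on a single node, while the hashed placement spreads them across the nodes.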

I don't have an easy answer for this, except that in some cases you can get the "best of both worlds" by putting a short hash value (of something you can enumerate easily from other data sources) at the beginning of your keys. For example, a 16-bit hash of the user ID gives you 4 hex digits, followed by whatever key you really wanted to use.

Then if you had a list of recently-deleted users, you can just hash their IDs and range scan to clean up anything related to them.
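A minimal sketch of the hashed-prefix idea, under some assumptions of mine: the key layout `prefix:user_id:suffix`, the use of MD5 for the 16-bit hash, and user IDs that never contain `:`. The range scan is simulated over a sorted Python list, standing in for what the ordered partitioner would give you; no real Cassandra client calls are shown.

```python
import bisect
import hashlib


def prefixed_key(user_id: str, suffix: str) -> str:
    """Build a row key: 4 hex digits (16 bits of an MD5 of the user ID),
    then the user ID, then the key you really wanted to use."""
    prefix = hashlib.md5(user_id.encode("utf-8")).hexdigest()[:4]
    return f"{prefix}:{user_id}:{suffix}"


def range_scan(sorted_keys, start, end):
    """Stand-in for an ordered range scan: keys k with start <= k < end."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]


def keys_for_user(sorted_keys, user_id):
    """Find every key belonging to one user by recomputing the prefix."""
    start = prefixed_key(user_id, "")
    # ";" is the character right after ":" in ASCII, so this is an
    # exclusive upper bound for all keys that begin with `start`.
    end = start[:-1] + ";"
    return range_scan(sorted_keys, start, end)


# Hypothetical data: two rows per user.
all_keys = sorted(
    prefixed_key(user, suffix)
    for user in ["alice", "bob", "carol"]
    for suffix in ["inbox", "profile"]
)

# Clean-up for a recently deleted user: hash the ID, then range scan.
for key in keys_for_user(all_keys, "bob"):
    print("would delete", key)
```

The prefix spreads users across the key space, while everything belonging to one user stays contiguous and can be found again from the user ID alone.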

The next tricky bit is secondary indexes - Cassandra doesn't have any - so if you need to look up X by Y, you need to insert the data under both keys, or have a pointer. Likewise, these pointers may need to be cleaned up when the thing they point to doesn't exist, but there's no easy way of querying stuff on this basis, so your app needs to Just Remember.
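As a rough sketch of what "insert a pointer and keep it in sync yourself" means, here is an in-memory stand-in. The `UserStore` class, the two dictionaries, and the email lookup are hypothetical names chosen for illustration; in a real deployment each dictionary would be a column family and the two writes would not be atomic.

```python
class UserStore:
    """Toy model of a primary table plus a hand-maintained lookup index."""

    def __init__(self):
        self.users_by_id = {}   # primary rows: user_id -> row
        self.ids_by_email = {}  # pointer rows: email -> user_id

    def put(self, user_id, email, name):
        old = self.users_by_id.get(user_id)
        # If the indexed value changed, the stale pointer must be removed
        # by the application; nothing does it for you.
        if old is not None and old["email"] != email:
            self.ids_by_email.pop(old["email"], None)
        self.users_by_id[user_id] = {"email": email, "name": name}
        self.ids_by_email[email] = user_id

    def get_by_email(self, email):
        user_id = self.ids_by_email.get(email)
        if user_id is None:
            return None
        return self.users_by_id.get(user_id)

    def delete(self, user_id):
        row = self.users_by_id.pop(user_id, None)
        if row is not None:
            # The app has to remember which pointers exist for this row.
            self.ids_by_email.pop(row["email"], None)


store = UserStore()
store.put("42", "jerry@example.com", "Jerry")
print(store.get_by_email("jerry@example.com"))
```

Every code path that writes or deletes a row has to go through this kind of bookkeeping, which is exactly where orphaned pointers come from when one path forgets.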

And application bugs may leave orphaned keys that you've forgotten about, and you'll have no way of easily detecting them, unless you write some garbage collector which periodically scans every single key in the db (this is going to take a while - but you can do it in chunks) to check for ones which aren't needed any more.
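A garbage collector along those lines could look roughly like this. It is a sketch over plain dictionaries: `index` and `primary` are hypothetical stand-ins for a pointer column family and the data it points to, and the chunking only illustrates walking the key space a slice at a time.

```python
from itertools import islice


def iter_chunks(keys, chunk_size):
    """Yield the keys in lists of at most chunk_size items."""
    it = iter(keys)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def find_orphans(index, primary, chunk_size=1000):
    """Return pointer keys in `index` whose target is missing from `primary`.

    index:   pointer key -> primary key
    primary: primary key -> row
    """
    orphans = []
    for chunk in iter_chunks(sorted(index), chunk_size):
        # Each chunk could be processed in a separate pass or job.
        for key in chunk:
            if index[key] not in primary:
                orphans.append(key)
    return orphans


# Hypothetical data: the pointer for user "2" was left behind by a bug.
index = {"a@example.com": "1", "b@example.com": "2", "c@example.com": "3"}
primary = {"1": {"name": "A"}, "3": {"name": "C"}}
print(find_orphans(index, primary, chunk_size=2))
```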

None of this is based on real usage, just what I've figured out during research. We don't use Cassandra in production.

EDIT: Cassandra now does have secondary indexes in trunk.

177

answered Sep 17 '22 18:09

MarkR

Tags: database-design, nosql, cassandra
