I've designed the database for the web-app i'm building "by the book". That is, I've:
Everything is going well (so far). I've denormalized before with great results, and am curently implementing a part of the app which will use data that hasn't been denormalized yet. Doing so for this particular part will, I predict, increase performance somewhat substantially (reading from 1 Column_Family ("table" in the relational world) instead of 7).
However, I fear that I may be denormalizing too much. If I were to to do so for the part in question, it would pretty much reduce the Column_Family/table count in my app by about 20%, and having that much of my database denormalized makes me nervous for some reason.
Should the app end up being enough of a success that I'm able to get a database designer or administrator on board, I'd like for him to be able determine that the denormalization I'm performing is necessary for the performance i'm seeking (best-case) or at the very least not harmful (worst-case).
Are there specific things I should look out for when making denormalization decisions that may indicate whether doing so would be bad, or does it always come down to speed vs. maintainability?
Designing a schema for cassandra is very different than designing a schema for a sql database. With a sql database your data fits on one machine, the database will maintain indexes for you, you can perform joins, and you can do complex queries with sql. These all make normalizing data practical.
In cassandra you data does not fit on one machine so you can't perform joins, the only query you can do efficiently is get a range of columns on a key, and cassandra will only maintain limited indexes for you. This makes normalizing your data impractical.
In cassandra, you typically design your schema to serve the queries that you are going to make, and you denormalize to do that. My favorite example of this is what twitter does for their stats for rainbird as explained in this post,
For example, say someone clicks on a t.co link to blog.example.com/foo at 11:41am on 1st Feb.
Rainbird would increment counters for:
t.co click: com (all time)
t.co click: com.example (all time)
t.co click: com.example.blog (all time)
t.co click: com.example.blog /foo (all time)
t.co click: com (1st Feb 2011)
t.co click: com.example (1st Feb 2011)
t.co click: com.example.blog (1st Feb 2011)
t.co click: com.example.blog /foo (1st Feb 2011)
t.co click: com (11am-12 on 1st Feb)
t.co click: com.example (11am-12 on 1st Feb)
t.co click: com.example.blog (11am-12 on 1st Feb)
t.co click: com.example.blog /foo (11am-12 on 1st Feb)
t.co click: com (11:41-42 on 1st Feb)
t.co click: com.example (11:41-42 on 1st Feb)
t.co click: com.example.blog (11:41-42 on 1st Feb)
t.co click: com.example.blog /foo (11:41-42 on 1st Feb)
This 1 click is copied 16 times to satisfy the 16 queries that can be done.
This is a good presentation on how to do indexing in cassandra.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With