I am researching cassandra as a possible solution for my up coming project. The more I research the more I keep hearing that it is a bad idea to sort on fields that is not setup for sorting when the table was created.
Is it possible to sort on any field? If there is a performance impact for sorting on fields not in the cluster what is that performance impact? I need to sort around or about 2 million records in the table.
I keep hearing that it is a bad idea to sort on fields that is not setup for sorting when the table was created.
It's not so much that it's a bad idea. It's just really not possible to make Cassandra sort your data by an arbitrary column. Cassandra requires a query-based modeling approach, and that goes for sort order as well. You have to decide ahead of time the kinds of queries you want Cassandra to support, and the order in which those queries return their data.
Is it possible to sort on any field?
Here's the thing with how Cassandra sorts result sets: it doesn't. Cassandra queries correspond to partition locations, and the data is read off of the disk and returned to you. If the data is read in the same order that it was sorted in on-disk, the result set will be sorted. On the other hand if you try a multi-key query or an index-based query where it has to jump around to different partitions, chances are that it will not be returned in any meaningful order.
But if you plan ahead, you can actually influence the on-disk sort order of your data, and then leverage that order in your queries. This can be done with a modeling mechanism called a "clustering column." Cassandra will allow you to specify multiple clustering columns, but they are only valid within a single partition.
So what does that mean? Take this example from the DataStax documentation.
CREATE TABLE playlists (
  id uuid,
  artist text,
  album text,
  title text,
  song_order int,
  song_id uuid,
  PRIMARY KEY ((id),song_order))
WITH CLUSTERING ORDER BY (song_order ASC);
With this table definition, I can query a particular playlist by id (the partition key).  Within each id, the data will be returned ordered by song_order:
SELECT id, song_order, album, artist, title 
FROM playlists WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204
ORDER BY song_order DESC;
id                                   | song_order | album                 | artist         | title
------------------------------------------------------------------------------------------------------------------
62c36092-82a1-3a00-93d1-46196ee77204 | 4          | No One Rides For Free |      Fu Manchu |             Ojo Rojo    
62c36092-82a1-3a00-93d1-46196ee77204 | 3          |             Roll Away | Back Door Slam |  Outside Woman Blues
62c36092-82a1-3a00-93d1-46196ee77204 | 2          |          We Must Obey |      Fu Manchu |     Moving in Stereo
62c36092-82a1-3a00-93d1-46196ee77204 | 1          |          Tres Hombres |         ZZ Top |            La Grange
In this example, if I only need to specify an ORDER BY if I want to switch the sort direction.  As the rows are stored in ASCending order, I need to specify DESC to see them in DESCending order.  If I was fine with getting the rows back in ASCending order, I don't need to specify ORDER BY at all.
But what if I want to order by artist? Or album? Or both? Since one artist can have many albums (for this example), we'll modify the PRIMARY KEY definition like this:
PRIMARY KEY ((id),artist,album,song_order)
Running the same query above (minus the ORDER BY) produces this output:
SELECT id, song_order, album, artist, title 
FROM playlists WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;
id                                   | song_order | album                 | artist         | title
------------------------------------------------------------------------------------------------------------------
62c36092-82a1-3a00-93d1-46196ee77204 | 3          |             Roll Away | Back Door Slam |  Outside Woman Blues
62c36092-82a1-3a00-93d1-46196ee77204 | 4          | No One Rides For Free |      Fu Manchu |             Ojo Rojo    
62c36092-82a1-3a00-93d1-46196ee77204 | 2          |          We Must Obey |      Fu Manchu |     Moving in Stereo
62c36092-82a1-3a00-93d1-46196ee77204 | 1          |          Tres Hombres |         ZZ Top |            La Grange
Notice that the rows are now ordered by artist, and then album.  If we had two songs from the same album, then song_order would be next.
So now you might ask "what if I just want to sort by album, and not artist?"  You can sort just by album, but not with this table.  You cannot skip clustering keys in your ORDER BY clause.  In order to sort only by album (and not artist) you'll need to design a different query table.  Sometimes Cassandra data modeling will have you duplicating your data a few times, to be able to serve different queries...and that's ok.
For more detail on how to build data models while leveraging clustering order, check out these two articles on PlanetCassandra:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With