What are the performance implications of sharding?

Question

I am new to sharding and wanted to know what implications sharding has for various queries. For the sample data set named "people":

person_id | person_fname | person_lname | person_dob
----------------------------------------------------
1         | John         | Smith        | 1972-03-04
2         | Sally        | Jones        | 1968-09-14
3         | Phil         | Forrester    | 1976-11-25
4         | Gwen         | Langley      | 1955-04-20
5         | Pedro        | Romero       | 1962-12-21
6         | Gene         | Halford      | 1978-01-11
7         | Juan         | Peza         | 1977-08-07
8         | Pierre       | Henry        | 1980-04-30

The data is sharded equally across four nodes by creating a hash of the surrogate identity "id". However, you need to perform read and write operations on records that potentially span all the nodes such as:

SELECT person_fname, 
       person_lname 
FROM   people 
WHERE  person_dob > '1970-01-01'

Or say you had a further table of "orders", which references "people" on the "person_id" column, and wanted to perform a join...

SELECT    order_id,
          order_amount,
          order_date,
          person_fname,
          person_lname
FROM      orders
LEFT JOIN people
WHERE     order_amount > 50

Is it the case that in effect all of the nodes will run the query in parallel? I am assuming that each server will have less work to do for each step as instead of one instance running the query over eight records, simultaneously, four instances will run the query over two(ish) records, with the further benefit that if the DBMS is able to perform shard selection then the other nodes need not continue executing any further instructions, is this assumption correct?

Are there any known performance implications with sharding and complex joins (beyond that of this simple example)?

Jon Hanna · Accepted Answer

It will indeed allow that to be done in parallel.

It can indeed make joins complex, and hence slower, if they have to cross different shards.

However, with a many-to-one, if you had e.g. orders sharded in such a way that all rows in the orders table where in the same shard as the related row in the people table, then this cross-shard problem doesn't happen.

You need to design your sharding approach so you get lots of cases like that and few (ideally none) where you end up crossing shards.

You also want to have your shard on the key you actually seek on the most. Eg. if you are finding people by username as your starting point to everything else, then you want to shard by username, not id, because when then finding them you already know which single shard to hit, rather than having to hit all of them just to get back zero rows from most.

usr · Answer

Yes, sharding introduces severe changes in performance. It never allows the application to be left unchanged.

The most sane way to shard is if the data model allows for data to be partitioned to be truly independently. Like in a multi-tenant situation where the tenants don't interact at all. In this case joins never span partitions and all is good.

This get very very nasty when sharding with cross-partition interaction. Writing a query that runs against all shards has a cost linear in the number of partitions. This means that you get zero speedup by adding nodes.

What are the performance implications of sharding?

Tags:

performance

database

join

sharding

Stuart Wakefield

2 Answers

Jon Hanna

usr

Recent Activity

Donate For Us

What are the performance implications of sharding?

Tags:

performance

database

join

sharding

Stuart Wakefield

2 Answers

Jon Hanna

usr

Related questions

Recent Activity

Donate For Us