I am new to sharding and wanted to know what implications sharding has for various queries. For the sample data set named "people":
person_id | person_fname | person_lname | person_dob
----------------------------------------------------
1 | John | Smith | 1972-03-04
2 | Sally | Jones | 1968-09-14
3 | Phil | Forrester | 1976-11-25
4 | Gwen | Langley | 1955-04-20
5 | Pedro | Romero | 1962-12-21
6 | Gene | Halford | 1978-01-11
7 | Juan | Peza | 1977-08-07
8 | Pierre | Henry | 1980-04-30
The data is sharded equally across four nodes by creating a hash of the surrogate identity "id". However, you need to perform read and write operations on records that potentially span all the nodes such as:
SELECT person_fname,
person_lname
FROM people
WHERE person_dob > '1970-01-01'
Or say you had a further table of "orders", which references "people" on the "person_id" column, and wanted to perform a join...
SELECT order_id,
order_amount,
order_date,
person_fname,
person_lname
FROM orders
LEFT JOIN people
WHERE order_amount > 50
Is it the case that in effect all of the nodes will run the query in parallel? I am assuming that each server will have less work to do for each step as instead of one instance running the query over eight records, simultaneously, four instances will run the query over two(ish) records, with the further benefit that if the DBMS is able to perform shard selection then the other nodes need not continue executing any further instructions, is this assumption correct?
Are there any known performance implications with sharding and complex joins (beyond that of this simple example)?
It will indeed allow that to be done in parallel.
It can indeed make joins complex, and hence slower, if they have to cross different shards.
However, with a many-to-one, if you had e.g. orders
sharded in such a way that all rows in the orders
table where in the same shard as the related row in the people
table, then this cross-shard problem doesn't happen.
You need to design your sharding approach so you get lots of cases like that and few (ideally none) where you end up crossing shards.
You also want to have your shard on the key you actually seek on the most. Eg. if you are finding people by username as your starting point to everything else, then you want to shard by username, not id, because when then finding them you already know which single shard to hit, rather than having to hit all of them just to get back zero rows from most.
Yes, sharding introduces severe changes in performance. It never allows the application to be left unchanged.
The most sane way to shard is if the data model allows for data to be partitioned to be truly independently. Like in a multi-tenant situation where the tenants don't interact at all. In this case joins never span partitions and all is good.
This get very very nasty when sharding with cross-partition interaction. Writing a query that runs against all shards has a cost linear in the number of partitions. This means that you get zero speedup by adding nodes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With