The JSON documents that we plan to ingest into DocumentDB look like this:
[
{"id":"id1","LastName": "user1", "GroupMembership":["g1","g2"]},
{"id":"id2","LastName": "user2", "GroupMembership":["g1","g4","g5"]},
{"id":"id3","LastName": "user3", "GroupMembership":["g3","g4","g2"]},
…
]
We want to answer queries such as "get the count of all users who are members of group g1 or g2". The number of users is very large (a few million). What is the best way to implement this query so that it uses the index and avoids scans? Should I use ARRAY_CONTAINS or JOIN? Does ARRAY_CONTAINS use the index internally, or does it do a scan?
Option 1)
SELECT VALUE COUNT(1) FROM Users WHERE ARRAY_CONTAINS(Users.GroupMembership, "g1") OR ARRAY_CONTAINS(Users.GroupMembership, "g2")
Option 2)
SELECT VALUE COUNT(1) FROM Users JOIN Membership IN Users.GroupMembership WHERE Membership = "g1" OR Membership = "g2"
Both queries should utilize the index the same way, but ARRAY_CONTAINS is likely to provide a better execution time compared to JOIN. You could profile both queries using the Query Metrics as per this article: https://learn.microsoft.com/en-us/azure/cosmos-db/documentdb-sql-query-metrics#query-execution-metrics
Both should give the same index utilization. However, JOIN can return duplicate results per document (one row for each matching array element), whereas ARRAY_CONTAINS will not. I think that difference is significant here, because it changes the count. See more about the duplication issue in the replies to the SO questions "Getting duplicate records in select query for the Azure DocumentDB" and "Cosmos db joins give duplicate results".
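To make the duplicate issue concrete, here is a minimal Python sketch that mimics the counting semantics of the two queries on the three sample documents from the question. It is a local simulation only and does not talk to DocumentDB; the helper names `count_array_contains` and `count_join` are made up for illustration.

```python
# Local simulation of the two queries' counting semantics.
# Assumption: this only models how rows are counted, not how the index is used.
users = [
    {"id": "id1", "LastName": "user1", "GroupMembership": ["g1", "g2"]},
    {"id": "id2", "LastName": "user2", "GroupMembership": ["g1", "g4", "g5"]},
    {"id": "id3", "LastName": "user3", "GroupMembership": ["g3", "g4", "g2"]},
]
wanted = {"g1", "g2"}


def count_array_contains(docs, groups):
    """Option 1: each document is counted at most once."""
    return sum(
        1 for d in docs if any(g in d["GroupMembership"] for g in groups)
    )


def count_join(docs, groups):
    """Option 2: one row per (document, matching array element) pair."""
    return sum(
        1 for d in docs for m in d["GroupMembership"] if m in groups
    )


array_contains_count = count_array_contains(users, wanted)
join_count = count_join(users, wanted)
print(array_contains_count, join_count)
```

On this sample the JOIN-style count comes out higher, because id1 belongs to both g1 and g2 and so contributes two rows. If the goal is a count of distinct users, the ARRAY_CONTAINS form matches that intent directly.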