
Mongodb: Is it a good idea to create a unique index on web URLs?

My document looks like:

{"url": "http://some-random-url.com/path/to/article",
"likes": 10
}

The url needs to be unique. Is it a good idea to have a unique index on the url? The URL can be long, resulting in a larger index, a bigger memory footprint, and slower overall performance. Would it be better to generate a hash from the URL (I am thinking about using murmur3) and create a unique index on that instead? I am assuming that the chances of collision are pretty low, as described here: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

Does anyone see any drawbacks to this approach? The new document will look like (with a unique index on u_hash instead of url):

{"url": "http://some-random-url.com/path/to/article",
"likes": 10,
"u_hash": "<murmur3 hash of url>"
}
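
As a rough check of the collision assumption above, here is a minimal sketch of the birthday-bound approximation for hashes of different widths. It assumes the hash output is uniformly distributed; the function name collisionProbability and the collection size of one million URLs are made up for illustration and are not part of the original question.

```javascript
// Birthday-bound approximation: probability that at least two of n
// uniformly distributed hashes of the given bit width collide.
// p ≈ 1 - exp(-n(n-1) / (2 * 2^bits))
function collisionProbability(n, bits) {
  const pairs = (n * (n - 1)) / 2;
  // -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x
  return -Math.expm1(-pairs / Math.pow(2, bits));
}

const urls = 1e6; // assumed collection size, purely illustrative
for (const bits of [32, 64, 128]) {
  const p = collisionProbability(urls, bits);
  console.log(`${bits}-bit hash, ${urls} URLs: p ≈ ${p.toExponential(2)}`);
}
```

Under these assumptions a 32-bit hash is practically certain to collide somewhere among a million URLs, while a 64-bit or 128-bit variant keeps the probability very small. So the width of the murmur3 variant matters a lot if the hash is the only thing enforcing uniqueness.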

UPDATE

I will not be doing regex queries on the url, only complete URL lookups. I am more concerned about the performance of this lookup, as I believe it will also be used internally by MongoDB to maintain the unique index, and hence affect write performance as well (plus a longer index). Additionally, my understanding is that MongoDB doesn't perform well with indexes on long strings, as it wasn't designed for that purpose. I may be wrong though, and it may depend only on whether or not that index fits into RAM. Any pointers?

amit_saxena asked Dec 07 '25 10:12


1 Answer

I'd like to expand on the answer of @AlexRyan. While he is right in general, there are some things which need to be taken into consideration for this use case.

First of all, we have to differentiate between a unique index and the _id field.

When the URL needs to be unique in your use case, there has to be a unique index. What we have to decide is whether to use the URL itself or a hashed value of it. The hashing itself would not help with the search, as the hash sum saved in a field would be treated as a string by MongoDB. It may save space (URLs may be longer than their hash value), thereby reducing the memory needed for the index. However, doing so takes away the possibility to search for parts of the URL in the index, for example with

db.collection.find({url:{$regex:/stackoverflow/}})

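With a unique index on url, this query can be answered by matching the pattern against the index keys, which is usually considerably faster than the alternative: without such a (unique) index, the query results in a comparatively slow collection scan. Note that an unanchored pattern like this one still has to examine every index key; only a case-sensitive prefix pattern (anchored with ^) lets MongoDB restrict the scan to a range of the index.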

Plus, creating the hash each and every time before querying, updating or inserting doesn't make these operations faster.

This leaves us with the fact that creating a hash sum and a unique index on it may save some RAM at the cost of making queries on the actual field slower by orders of magnitude. It also introduces the need to create a hash sum each and every time. Having an index on both the URL and its hashed value would not make sense at all.

Now to the question whether it is a good idea to use the URL as _id one way or the other. Since URLs usually are distinct by nature (the same URL is supposed to return the same content) and the likes are related to that uniqueness, I would tend to use the URL as the _id. Since you need the unique index on _id anyway, it serves several purposes here: you have your id for the document, you ensure uniqueness of the URL and, in case you use the natural representation of the URL, it will even be queryable in an efficient way.
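
To make the _id variant concrete, here is a minimal sketch, not a definitive implementation. The helper names addLike and findByUrl are hypothetical, and coll is assumed to be a collection handle from mongosh or the Node.js driver, both of which provide updateOne(filter, update, options) and findOne(filter).

```javascript
// `coll` is assumed to be a MongoDB collection handle (mongosh or the
// Node.js driver). The helper names below are made up for illustration.

// Count one like for a URL, creating the document on first sight.
// The built-in unique index on _id guarantees one document per URL.
function addLike(coll, url) {
  return coll.updateOne(
    { _id: url },              // the URL itself is the document id
    { $inc: { likes: 1 } },    // increment, or start at 1 on insert
    { upsert: true }           // insert the document if it does not exist
  );
}

// Exact-match lookup of a complete URL, served by the _id index.
function findByUrl(coll, url) {
  return coll.findOne({ _id: url });
}
```

In mongosh this could be called as addLike(db.collection, "http://some-random-url.com/path/to/article"). No separate url field, hash field or additional unique index is needed in this layout.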

Markus W Mahlberg answered Dec 10 '25 01:12


