 

How to insert quickly to a very large collection

Tags:

mongodb

I have a collection of over 70 million documents. Whenever I add new documents in batches (let's say 2K), the insert operation is really slow. I suspect this is because the Mongo engine is comparing the _ids of all the new documents with all 70 million existing ones to find any duplicate _id entries. Since the _id index is disk-resident, it makes the code a lot slower.

Is there any way to avoid this? I just want Mongo to take the new documents and insert them as they are, without doing this check. Is it even possible?

VaidAbhishek asked Oct 24 '25 14:10


1 Answer

Diagnosing "Slow" Performance

Your question includes a number of leading assumptions about how MongoDB works. I'll address those below, but I'd advise you to try to understand any performance issues based on facts such as database metrics (e.g. serverStatus, mongostat, mongotop), system resource monitoring, and information in the MongoDB log on slow queries. Metrics need to be monitored over time so you can identify what is "normal" for your deployment, so I would strongly recommend using a MongoDB-specific monitoring tool such as MMS Monitoring.
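
For example, a few of those serverStatus counters can be pulled programmatically. This is a minimal sketch using pymongo; the host and the specific sections printed are assumptions, and the full serverStatus document contains many more fields:

    from pymongo import MongoClient

    # Assumes a mongod running locally on the default port
    client = MongoClient("mongodb://localhost:27017")

    # serverStatus returns a large document of runtime metrics;
    # opcounters and mem are two sections commonly watched under insert load
    status = client.admin.command("serverStatus")
    print(status["opcounters"])   # counts of insert/query/update/delete ops
    print(status["mem"])          # resident/virtual memory usage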

A few interesting presentations that provide very relevant background material for performance troubleshooting and debugging are:

  • William Zola: The (Only) Three Reasons for Slow MongoDB Performance
  • Asya Kamsky: Diagnostics and Debugging with MongoDB

Improving efficiency of inserts

Aside from understanding where your actual performance challenges lie and tuning your deployment, you could also improve efficiency of inserts by:

  • removing any unused or redundant secondary indexes on this collection

  • using the Bulk API to insert documents in batches (see the sketch below)

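As a rough illustration of the second point, here is a minimal sketch using pymongo; the database/collection names and the document shape are made up, and insert_many is pymongo's batch-insert call (the older Bulk API style via bulk_write would look similar). Unordered execution simply lets the server keep going past individual failures:

    from pymongo import MongoClient
    from pymongo.errors import BulkWriteError

    coll = MongoClient()["mydb"]["events"]   # hypothetical database/collection

    # Build one batch of ~2K documents and send it in a single round trip
    batch = [{"sensor": i % 50, "value": i} for i in range(2000)]

    try:
        # ordered=False lets the server continue past any individual
        # duplicate-key failures instead of aborting the whole batch
        result = coll.insert_many(batch, ordered=False)
        print(len(result.inserted_ids), "documents inserted")
    except BulkWriteError as bwe:
        print(bwe.details["writeErrors"])
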
Assessing Assumptions

Whenever I add new documents in batches (let's say 2K), the insert operation is really slow. I suspect this is because the Mongo engine is comparing the _ids of all the new documents with all 70 million existing ones to find any duplicate _id entries. Since the _id index is disk-resident, it makes the code a lot slower.

If a collection has 70 million entries, that does not mean that an index lookup involves 70 million comparisons. The indexed values are stored in B-trees, which allow for a small number of efficient comparisons. The exact number will depend on the depth of the tree, how your indexes are built, and the value you're looking up, but it will be on the order of tens (not millions) of comparisons.
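
As a back-of-the-envelope check (the branching factor of roughly 100 keys per internal node is an assumption, not a measured value):

    import math

    documents = 70_000_000
    keys_per_node = 100          # assumed B-tree branching factor

    # Height of the tree, i.e. roughly how many node visits a lookup needs
    depth = math.ceil(math.log(documents, keys_per_node))
    print(depth)                 # -> 4, a handful of node reads per lookup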

If you're really curious about the internals, there are some experimental storage & index stats you can enable in a development environment: Storage-viz: Storage Visualizers and Commands for MongoDB.

Since the _id index is disk-resident, it makes the code a lot slower.

MongoDB loads your working set (portion of data & index entries recently accessed) into available memory.

If you are able to create your ids in an approximately ascending order (for example, the generated ObjectIds) then all the updates will occur at the right side of the B-tree and your working set will be much smaller (FAQ: "Must my working set fit in RAM").
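
A small sketch of that property, using the bson package that ships with pymongo: ObjectIds embed a leading timestamp, so ids generated over time sort in roughly ascending order.

    import time
    from bson import ObjectId

    ids = []
    for _ in range(3):
        ids.append(ObjectId())
        time.sleep(1)            # spread generation across seconds

    # The first 4 bytes of an ObjectId are a Unix timestamp, so later ids
    # compare greater than earlier ones
    print(ids == sorted(ids))                 # True
    print([oid.generation_time for oid in ids])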

Yes, I can let Mongo generate the _id for itself, but I don't want to waste a perfectly good index on it. Moreover, even if I let Mongo generate the _id itself, won't it still need to compare for duplicate key errors?

A unique _id is required for all documents in MongoDB. The default ObjectId is generated based on a formula that should ensure uniqueness (i.e. there is an extremely low chance of a collision, so your application will not get duplicate key exceptions and have to retry with a new _id).

If you have a better candidate for the unique _id in your documents, then feel free to use this field (or collection of fields) instead of relying on the generated _id. Note that the _id is immutable, so you shouldn't use any fields that you might want to modify later.
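
For instance, if a field such as an order number is already guaranteed unique, it can serve as the _id directly, avoiding a separate unique index on that field. A sketch with hypothetical names:

    from pymongo import MongoClient
    from pymongo.errors import DuplicateKeyError

    coll = MongoClient()["mydb"]["orders"]   # hypothetical collection

    try:
        # Use the natural key as _id instead of a generated ObjectId
        coll.insert_one({"_id": "ORD-2014-0001", "total": 42.0})
    except DuplicateKeyError:
        # A document with this _id already exists; handle or skip
        pass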

Stennie answered Oct 27 '25 05:10


