I'm trying to fully sort a collection with millions of rows by a single field. As far i know, ObjectId contains 4 bytes of timestamp. And my timestamp is 4 bytes integer indexed field. So i suppose sort by _id and timestamp should be simular, but here's results
db.coll.find().sort("_id", pymongo.ASCENDING)
# takes 25 minutes to run
and
db.coll.find().sort("timestamp", pymongo.ASCENDING)
# takes 2 hours to run
why is this happening, and is here the way to optimize that? Thanks
UPDATE
The timestamp field i'm trying to sort with is already indexed as i pointed
collection stats
"size" : 55881082188,
"count" : 126048972,
"avgObjSize" : 443,
"storageSize" : 16998031360,
"capped" : false,
"nindexes" : 2,
"totalIndexSize" : 2439606272,
and I dedicated to mongodb proccess 4gb of ram (tried to increase to 8gb but speed didn't increased)
UPDATE 2
It's turned out how much sorting on field order follows insertion (natural) order, so much the sorting speed is faster
I tried to
db.new_coll.create_index([("timestamp", pymongo.ASCENDING)])
for el in db.coll.find().sort("timestamp", pymongo.ASCENDING):
del el['_id']
db.new_coll.insert(el)
# and now
db.new_coll.find().sort("timestamp", pymongo.ASCENDING)
# takes 25 minutes vs 2 hours as in previous example
Sorting by _id is faster because of the way _id field value is generated.
Words from Documentation
One of the main reasons ObjectId’s are generated in the fashion mentioned above by the drivers is that is contains a useful behavior due to the way sorting works. Given that it contains a 4 byte timestamp (resolution of seconds) and an incrementing counter as well as some more unique identifiers such as the machine id once can use the _id field to sort documents in the order of creation just by simply sorting on the _id field. This can be useful to save the space needed by an additional timestamp if you wish to track the time of creation of a document.
I have also tried explaining the query and noticed that nscannedObjects and nscannedObjectsAllPlans is 0 when sorting is done using _id.
> db.coll.find({},{_id:1}).sort({_id:1}).explain();
{
"cursor" : "BtreeCursor _id_",
"isMultiKey" : false,
"n" : 353,
"nscannedObjects" : 0,
"nscanned" : 353,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 353,
"scanAndOrder" : false,
"indexOnly" : true,
"nYields" : 2,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"_id" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "server",
"filterSet" : false
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With