Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why mongodb sort by _id is much faster than sort by any other indexed field?

I'm trying to fully sort a collection with millions of rows by a single field. As far i know, ObjectId contains 4 bytes of timestamp. And my timestamp is 4 bytes integer indexed field. So i suppose sort by _id and timestamp should be simular, but here's results

db.coll.find().sort("_id", pymongo.ASCENDING)
# takes 25 minutes to run

and

db.coll.find().sort("timestamp", pymongo.ASCENDING)
# takes 2 hours to run

why is this happening, and is here the way to optimize that? Thanks

UPDATE

The timestamp field i'm trying to sort with is already indexed as i pointed

collection stats

"size" : 55881082188,
"count" : 126048972,
"avgObjSize" : 443,
"storageSize" : 16998031360,
"capped" : false,
"nindexes" : 2,
"totalIndexSize" : 2439606272,

and I dedicated to mongodb proccess 4gb of ram (tried to increase to 8gb but speed didn't increased)

UPDATE 2

It's turned out how much sorting on field order follows insertion (natural) order, so much the sorting speed is faster

I tried to

db.new_coll.create_index([("timestamp", pymongo.ASCENDING)])
for el in db.coll.find().sort("timestamp", pymongo.ASCENDING):
    del el['_id']
    db.new_coll.insert(el)

# and now
db.new_coll.find().sort("timestamp", pymongo.ASCENDING)
# takes 25 minutes vs 2 hours as in previous example
like image 786
sspprroo Avatar asked Oct 14 '25 06:10

sspprroo


1 Answers

Sorting by _id is faster because of the way _id field value is generated.

Words from Documentation

One of the main reasons ObjectId’s are generated in the fashion mentioned above by the drivers is that is contains a useful behavior due to the way sorting works. Given that it contains a 4 byte timestamp (resolution of seconds) and an incrementing counter as well as some more unique identifiers such as the machine id once can use the _id field to sort documents in the order of creation just by simply sorting on the _id field. This can be useful to save the space needed by an additional timestamp if you wish to track the time of creation of a document.

I have also tried explaining the query and noticed that nscannedObjects and nscannedObjectsAllPlans is 0 when sorting is done using _id.

> db.coll.find({},{_id:1}).sort({_id:1}).explain();
{
        "cursor" : "BtreeCursor _id_",
        "isMultiKey" : false,
        "n" : 353,
        "nscannedObjects" : 0,
        "nscanned" : 353,
        "nscannedObjectsAllPlans" : 0,
        "nscannedAllPlans" : 353,
        "scanAndOrder" : false,
        "indexOnly" : true,
        "nYields" : 2,
        "nChunkSkips" : 0,
        "millis" : 0,
        "indexBounds" : {
                "_id" : [
                        [
                                {
                                        "$minElement" : 1
                                },
                                {
                                        "$maxElement" : 1
                                }
                        ]
                ]
        },
        "server" : "server",
        "filterSet" : false
}
like image 123
Pardeep Singh Avatar answered Oct 17 '25 02:10

Pardeep Singh