Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to select only not null values when aggregating with first or last in mongodb?

My data represents a dictionary that receives a bunch of updates and potentially new fields (metadata being added to a post). So something like:

> db.collection.find()
{ _id: ..., 'A': 'apple', 'B': 'banana' },
{ _id: ..., 'A': 'artichoke' },
{ _id: ..., 'B': 'blueberry' },
{ _id: ..., 'C': 'cranberry' }

The challenge - I want to find the first (or last) value for each key ignoring blank values (i.e. I want some kind of conditional group by that works at a field not document level). (Equivalent to the starting or ending version of the metadata after updates).

The problem is that:

db.collection.aggregate([
  { $group: {
    _id: null,
    A: { $last: '$A' },
    B: { $last: '$B' }, 
    C: { $last: '$C' }
  }}
])

fills in the blanks with nulls (rather than skipping them in the result), so I get:

{ '_id': ..., 'A': null, 'B': null, 'C': 'cranberry' }

when I want:

{ '_id': ..., 'A': 'artichoke', 'B': 'blueberry', 'C': cranberry' }
like image 705
user2966041 Avatar asked Jan 25 '26 00:01

user2966041


2 Answers

I don't think this is what you really want, but it does solve the problem you are asking. The aggregation framework cannot really do this, as you are asking for "last results" of different columns from different documents. There is really only one way to do this and it is pretty insane:

db.collection.aggregate([
    { "$group": {
        "_id": null,
        "A": { "$push": "$A" },
        "B": { "$push": "$B" },
        "C": { "$push": "$C" }
    }},
    { "$unwind": "$A" },
    { "$group": {
        "_id": null,
        "A": { "$last": "$A" },
        "B": { "$last": "$B" },
        "C": { "$last": "$C" }
    }},
    { "$unwind": "$B" },
    { "$group": {
        "_id": null,
        "A": { "$last": "$A" },
        "B": { "$last": "$B" },
        "C": { "$last": "$C" }
    }},
    { "$unwind": "$C" },
    { "$group": {
        "_id": null,
        "A": { "$last": "$A" },
        "B": { "$last": "$B" },
        "C": { "$last": "$C" }
    }},
])

Essentially you compact down the documents pushing all of the found elements into arrays. Then each array is unwound and the $last element is taken from there. You need to do this for each field in order to get the last element of each array, which was the last match for that field.

Not real good and certain to explode the BSON 16MB limit on any meaningful collection.

So what you are really after is looking for a "last seen" value for each field. You could brute force this by iterating the collection and keeping values that are not null. You can even do this on the server like this with mapReduce:

db.collection.mapReduce(
   function () {
      if (start == 0)
        emit( 1, "A" );

      start++;
      current = this;

      Object.keys(store).forEach(function(key) {
        if ( current.hasOwnProperty(key) )
          store[key] = current[key];
      });

    },
    function(){},
    {
        "scope": { "start": 0, "store": { "A": null, "B": null, "C": null } },
        "finalize": function(){ return store },
        "out": { "inline": 1 }
    }
)

That will work as well, but iterating the whole collection is nearly as bad as mashing everything together with aggregate.

What you really want in this case is three queries, ideally in parallel to just get the discreet value last seen for each property:

> db.collection.find({ "A": { "$exists": true } }).sort({ "$natural": -1 }).limit(1)
{ "_id" : ObjectId("54b319cd6997a054ce4d71e7"), "A" : "artichoke" }
> db.collection.find({ "B": { "$exists": true } }).sort({ "$natural": -1 }).limit(1)
{ "_id" : ObjectId("54b319cd6997a054ce4d71e8"), "B" : "blueberry" }
> db.collection.find({ "C": { "$exists": true } }).sort({ "$natural": -1 }).limit(1)
{ "_id" : ObjectId("54b319cd6997a054ce4d71e9"), "C" : "cranberry" }

Acutally even better is to create a sparse index on each property and query via $gt and a blank string. This makes sure an index is used and as a sparse index it will only contain documents where the property is present. You'll need to .hint() this, but you still want $natural ordering for the sort:

db.collection.ensureIndex({ "A": -1 },{ "sparse": 1 })
db.collection.ensureIndex({ "B": -1 },{ "sparse": 1 })
db.collection.ensureIndex({ "C": -1 },{ "sparse": 1 })


> db.collection.find({ "A": { "$gt": "" } }).hint({ "A": -1 }).sort({ "$natural": -1 }).limit(1)
{ "_id" : ObjectId("54b319cd6997a054ce4d71e7"), "A" : "artichoke" }
> db.collection.find({ "B": { "$gt": "" } }).hint({ "B": -1 }).sort({ "$natural": -1 }).limit(1)
{ "_id" : ObjectId("54b319cd6997a054ce4d71e8"), "B" : "blueberry" }
> db.collection.find({ "C": { "$gt": "" } }).hint({ "C": -1 }).sort({ "$natural": -1 }).limit(1)
{ "_id" : ObjectId("54b319cd6997a054ce4d71e9"), "C" : "cranberry" }

That's the best way to solve what you are saying here. But as I said, this is how you think you need to solve it. Your real problem likely has another way to approach both storing and querying.

like image 64
Neil Lunn Avatar answered Jan 26 '26 14:01

Neil Lunn


Starting Mongo 3.6, for those using $first or $last as a way to get one value from grouped records (not necessarily the actual first or last), $group's $mergeObjects can be used as a way to find a non-null value from grouped items:

// { "A" : "apple", "B" : "banana" }
// { "A" : "artichoke" }
// { "B" : "blueberry" }
// { "C" : "cranberry" }
db.collection.aggregate([
  { $group: {
      _id: null,
      A: { $mergeObjects: { a: "$A" } },
      B: { $mergeObjects: { b: "$B" } },
      C: { $mergeObjects: { c: "$C" } }
  }}
])
// { _id: null, A: { a: "artichoke" }, B: { b: "blueberry" }, C: { c: "cranberry" } }

$mergeObjects accumulates an object based on each grouped record. And the thing to note is that $mergeObjects will merge in priority values that aren't null. But that requires to modify the accumulated field to an object, thus the "awkward" { a: "$A" }.

If the output format isn't exactly what you expect, one can always use an additional $project stage.

like image 25
Xavier Guihot Avatar answered Jan 26 '26 14:01

Xavier Guihot