Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sorting arrays by frequency inside documents in MongoDB collection

I'm trying to sort a couple of arrays by frequency inside every document of a MongoDB collection.

My documents right now look something like

{
    "_id": whatever,
    "color": "blue",
    "fruit": ["apple", "banana", "apple", "orange", "apple", "orange", ...],
    "vegetable": ["onion", "lettuce", "spinach", "lettuce", ...],
    "meat": ["pulled pork", "steak", "chicken wings", "pulled pork", "pulled pork", ...]
}

Note: this is not the real data but the properties of the documents are the exact same.

The end goal is to find for each color the most frequent fruit, vegetable and meat so I guessed if I could group by color and get the first element of every array ordered by frequency, that would give me what I need.

I've tried unwinding but my database is too big to unwind for every array (each one has around 50.000 elements so 50.000^3 doesn't seem ideal). I've also looked for a "mode" group function, as MongoDB has a "median" function, but it seems like there is none (v5.0.22). I've also looked at the map and the reduce function for pipelining (db.collection.aggregate({$map //or $reduce...})) but I honestly didn't get far with that since this is kind of new for me, although it seems to me that it's probable the way to go.

Has someone done something similar that might work here? Thanks!!

like image 756
Pablo Pinillos Avatar asked Dec 21 '25 14:12

Pablo Pinillos


1 Answers

The OP states that the docs we see in the question are the result of a $group (ostensibly on color) that likely $pushes the fruit, vegetable, and meat values onto ever-growing arrays (as much as 50,000). If the goal is to get the mode for each color, then $facet can be used as a "multigroup". Suppose each individual docs has this shape (NOTE: just using fruit and meat here to simplify; the approach extends to vegetable and any other field in the doc):

    {"color": "blue", "fruit": "A", "meat": "X"}

The following $facet pipeline will produce the modes we seek:

db.foo.aggregate([
    {$facet: {
        "most_fruit": [
            // Sum by color and fruit name:                                      
            {$group: {_id: {c:'$color', v:'$fruit'}, N: {$sum: 1}}}

            // Reorg by color only....                                           
            ,{$group: {_id: '$_id.c', X:{$push: {v:'$_id.v',N:'$N'}}}}

            // ...and now sort highest-to-lowest and take the highest one.       
            // Nice thing is if you really want, you are already set up to       
            // capture, for example, the highest *nd* the lowest.                
            ,{$project: {
                X: {$first: {$sortArray: {input: '$X', sortBy: {'N':-1} }} }
            }}
        ],

        // Same thing ... but for meat                                           
        "most_meat": [
            {$group: {_id: {c:'$color', v:'$meat'}, N: {$sum: 1}}}

            ,{$group: {_id: '$_id.c', X:{$push: {v:'$_id.v',N:'$N'}}}}
            ,{$project: {
                X: {$first: {$sortArray: {input: '$X', sortBy: {'N':-1} }} }
            }}
        ]
    }}
]);

yields something with this shape:

{                                                                                
  most_fruit: [                                                                  
    {_id: 'blue', X: {v: 'A', N: 2} },                                           
    {_id: 'green',X: {v: 'F', N: 3} }                                            
  ],                                                                             
  most_meat: [                                                                   
    {_id: 'green',X: {v: 'Z', N: 4} },                                           
    {_id: 'blue', X: {x: 'X', N: 3} }                                            
  ]                                                                              
}    

A single document with all information is returned. Although it is not organized by color, there is no further processing that could be done on the database side (using MQL) to make either the grouping or filtering of data more efficient; it's now up to the client-side to set up the information.

Here is a candidate client-side reorg:

var oneDoc = c.next();

function processItem(obj, fname) {
    for(var n = 0; n < obj[fname].length; n++) {
        var cn = oneDoc[fname][n]['_id'];
        if(undefined == color_major[cn]) {
            // Set up ALL the possible modes.  -1 is our way of                  
            // signalling it has not yet been set.                               
            color_major[cn] = {'most_fruit':-1,'most_meat':-1};
        }
        color_major[cn][fname] = oneDoc[fname][n]['X'];
    }
}

var color_major = {};
processItem(oneDoc, 'most_fruit');
processItem(oneDoc, 'most_meat');

print(color_major);

yields:

{
  green: { most_fruit: { v: 'F', N: 3 }, most_meat: { v: 'Z', N: 4 } },
  blue: { most_fruit: { v: 'A', N: 2 }, most_meat: { v: 'X', N: 3 } }
}

If you really want to beat up the DB..

(...but not really because the data coming of the $facet stage above is very small...)

Here is the extra pipeline to reformat the data:

    // Turn $facet field names (lval) into values (rval):                                         
    ,{$project: {X: {$objectToArray: '$$ROOT'} }}

    // Double unwind is OK because there is only #color X 2 (fruit and meat) entries.             
    // Even if 1000 colors and fruit meat and veg and whatevs, still quite doable:                
    ,{$unwind: '$X'}
    ,{$unwind: '$X.v'}

    // Reorg on color:                                                                            
    ,{$group: {_id: '$X.v._id', W: {$push: {k: '$X.k', v: '$X.v.X'}} }}

    // ...and put it all back together:                                                           
    ,{$replaceRoot: { newRoot: {$mergeObjects: [ {color:'$_id'}, {$arrayToObject: '$W'} ]}
     }}

yields:

{
  color: 'green',
  most_fruit: {
    v: 'F',
    N: 3
  },
  most_meat: {
    v: 'Z',
    N: 4
  }
}
{
  color: 'blue',
  most_fruit: {
    v: 'A',
    N: 2
  },
  most_meat: {
    v: 'X',
    N: 3
  }
}
like image 97
Buzz Moschetti Avatar answered Dec 23 '25 04:12

Buzz Moschetti



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!