Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I find the most common number in an input?

Tags:

c++

This is a very abstracted explanation of what I'm doing:

Say I have a text file full of numbers separated by newlines. Right now, I take these numbers and put them in a map<int, int>, where the key is the number, and the value is the frequency.

My end goal is a list of numbers sorted by frequency. How can I go about doing this? Note that a frequency can obviously show up more than once. The way I was thinking was to make a struct containing both a number and its frequency, and define < so that it never returns equality. I would then iterate through the map, putting all the elements into that struct and then into a set.

like image 205
fishman Avatar asked Dec 07 '25 09:12

fishman


2 Answers

Once you've built the frequency map, copy its pairs to a std::vector<std::pair<int, int> > then std::sort the latter with the 3-args version of std::sort, which takes the comparator as the third arg; as a comparator, you can use one that compares the .second fields of the pairs first, and the .first ones (if you want) only to disambiguate the ordering of pairs whose .second fields are equal (but I don't think you really need the last bit since you don't care about the ordering for cases with equal frequency, do you?).

like image 151
Alex Martelli Avatar answered Dec 08 '25 23:12

Alex Martelli


If this is simply an operation you wish to do on it's own (rather than a component you want to use), the most practical way for me would be to do something like this:

sort <filename> | uniq -c | sed 's/^[ \t]*//' | sort -rn

Of course it doesn't scale well compared to a linear algorithm, but it performs reasonably well for every day tasks, and it has the added advantage that you don't need to reinvent the wheel yourself.


EDIT: sort" sorts your file ; "uniq" groups all consecutive lines that are the same into a pair [frequency] [item] pair, for instance

12
12
24
25
25
25

becomes

    2 12
    1 24
    3 25

then I remove the trailing space with "sed" (if this doesn't work, it is because there is sometimes an issue with the way the tab character is inputted), then I sort the output according to frequency (the "n" option asks for a numerical sort, instead of lexicographic; the "r" option asks for a reverse sorting).

If you want to select another field for the sorting (say, because the number you are actually interested in counting the frequency are the THIRD field of a tab delimited line), then you can use "sort"'s selection feature:

sort -t'\t' -k3 <filename> | ...

This says that your input is a tab-delimited list of fields, and that you want to sort according to the third field.


EDIT 2: here is the line to do this according to the fourth field (and to remove all other fields)

cut -d'\t' -f4 <filename> | sort | uniq -c | sed 's/^[ \t]*//' | sort -rn
like image 44
Jérémie Avatar answered Dec 08 '25 21:12

Jérémie