How can I find the most common number in an input?

Question

This is a very abstracted explanation of what I'm doing:

Say I have a text file full of numbers separated by newlines. Right now, I take these numbers and put them in a map<int, int>, where the key is the number, and the value is the frequency.

My end goal is a list of numbers sorted by frequency. How can I go about doing this? Note that a frequency can obviously show up more than once. The way I was thinking was to make a struct containing both a number and its frequency, and define < so that it never returns equality. I would then iterate through the map, putting all the elements into that struct and then into a set.

Alex Martelli · Accepted Answer

Once you've built the frequency map, copy its pairs to a std::vector<std::pair<int, int> > then std::sort the latter with the 3-args version of std::sort, which takes the comparator as the third arg; as a comparator, you can use one that compares the .second fields of the pairs first, and the .first ones (if you want) only to disambiguate the ordering of pairs whose .second fields are equal (but I don't think you really need the last bit since you don't care about the ordering for cases with equal frequency, do you?).

Jérémie · Answer

If this is simply an operation you wish to do on it's own (rather than a component you want to use), the most practical way for me would be to do something like this:

sort <filename> | uniq -c | sed 's/^[ 	]*//' | sort -rn

Of course it doesn't scale well compared to a linear algorithm, but it performs reasonably well for every day tasks, and it has the added advantage that you don't need to reinvent the wheel yourself.

EDIT: sort" sorts your file ; "uniq" groups all consecutive lines that are the same into a pair [frequency] [item] pair, for instance

becomes

    2 12
    1 24
    3 25

then I remove the trailing space with "sed" (if this doesn't work, it is because there is sometimes an issue with the way the tab character is inputted), then I sort the output according to frequency (the "n" option asks for a numerical sort, instead of lexicographic; the "r" option asks for a reverse sorting).

If you want to select another field for the sorting (say, because the number you are actually interested in counting the frequency are the THIRD field of a tab delimited line), then you can use "sort"'s selection feature:

sort -t'	' -k3 <filename> | ...

This says that your input is a tab-delimited list of fields, and that you want to sort according to the third field.

EDIT 2: here is the line to do this according to the fourth field (and to remove all other fields)

cut -d'	' -f4 <filename> | sort | uniq -c | sed 's/^[ 	]*//' | sort -rn

How can I find the most common number in an input?

Tags:

c++

fishman

2 Answers

Alex Martelli

Jérémie

Recent Activity

Donate For Us

How can I find the most common number in an input?

Tags:

c++

fishman

2 Answers

Alex Martelli

Jérémie

Related questions

Recent Activity

Donate For Us