There's a txt file with a word in every line.
"word1"
"word1"
"word2"
"word2"
"word1"
I'd like to get which word occurs the most, but I have no idea how to get that, any ideas?
Note: See bottom for case-insensitive solutions.
A combination of sort, uniq, head, and cut calls is conceptually simplest, and also extensible, but here's a single-pass awk solution that is probably more efficient, though more complex; it is limited to finding only the "winner", with unpredictable ordering in the event of ties:
awk '{ if (++words[$0] > max) { max = words[$0]; maxW=$0 } } END { print maxW }' file
With the sample input, this returns "word1" (including the double quotes).
Use print max, maxW to also output the count.
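As a quick sketch, here is the same one-liner with print max, maxW, run against the sample input (the file name file matches the command above):

```shell
# Recreate the sample input: three "word1" lines, two "word2" lines.
printf '"word1"\n"word1"\n"word2"\n"word2"\n"word1"\n' > file

# Print both the count and the winning word.
awk '{ if (++words[$0] > max) { max = words[$0]; maxW=$0 } } END { print max, maxW }' file
# -> 3 "word1"
```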
In the event of a tie, among the words that share the maximum count, the one whose last occurrence comes first in the input file "wins" (is output).
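The tie behavior can be verified with a small hand-built input (tie.txt is an illustrative file name): "a" and "b" each occur twice, but "a"'s last occurrence (line 3) precedes "b"'s (line 4), so "a" wins.

```shell
# Two words tied at 2 occurrences each; "a" completes its count first.
printf '"b"\n"a"\n"a"\n"b"\n' > tie.txt
awk '{ if (++words[$0] > max) { max = words[$0]; maxW=$0 } } END { print maxW }' tie.txt
# -> "a"
```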
Here's the multi-utility equivalent, which allows extending the solution to the top N words and also offers predictable ordering among the winners in the event of a tie:
$ sort file | uniq -c | sort -k1,1nr -k2b | head -n 1 | cut -d\" -f2
word1
In the event of a tie, the alphabetically first word among those that share the maximum count is printed.
Note: For convenience, the above uses cut to extract the word without the enclosing double quotes.
To preserve the double quotes, use awk instead of cut:
$ sort file | uniq -c | sort -k1,1nr -k2b | head -n 1 | awk '{print $NF}'
"word1"
Omitting the last pipeline segment and changing the argument to head's -n option allows you to see how many occurrences of each word were found and to list the top N words (including double quotes); e.g., to see the top 10 (with the sample input, you only get 2):
$ sort file | uniq -c | sort -k1,1nr -k2b | head -n 10
3 "word1"
2 "word2"
A note on the sort call, sort -k1,1nr -k2b:
Explicitly stating the sort fields is good practice - both for efficiency and to avoid unexpected results:
-k1,1nr sorts primarily by 1st whitespace-separated field (k1,1), numerically (-n), in reverse order (r).
-k1,1 limits the primary key to the 1st field; just -k1 would sort everything starting from field 1 through the end of the line.
-k2b then sorts secondarily starting with the 2nd whitespace-separated field through the end of the line (-k2), ignoring leading whitespace (b; the whitespace that separates the fields) and performing lexical (alphabetic) sorting.
Newer versions of GNU sort (not the one on macOS, unfortunately) have a helpful --debug option that visualizes how each line is broken into keys during sorting.
Using just sort or sort -nr to sort the whole line is tempting, but doesn't necessarily yield the expected results:
Just sort sorts the whole line lexically (alphabetically), in ascending order; due to the padded fixed-width nature of the word counts in the 1st field, the results are still effectively numerically sorted, but in the event of a tie it is the alphabetically last word that is output.
Just sort -rn applies numerical sorting to the whole line, in descending order. With numerical sorting, key parsing stops at the longest prefix that can be interpreted as a number; an implicit feature called last-resort comparison (which can be turned off with -s) then sorts otherwise-equal lines lexically (in reverse order, in this case). It is therefore also the alphabetically last word that is output in the event of a tie.
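The difference between the two sort invocations shows up on a tie. The sketch below hand-builds a uniq -c-style file (counts.txt is an illustrative name) with two words tied at count 2 and compares the results:

```shell
# Two words tied at count 2, in uniq -c output format (right-aligned counts).
printf '      2 "a"\n      2 "b"\n' > counts.txt

# Field-based sort: the secondary lexical key puts the
# alphabetically FIRST word on top.
sort -k1,1nr -k2b counts.txt | head -n 1    # the "a" line wins

# Whole-line numeric sort: the reversed last-resort comparison puts the
# alphabetically LAST word on top.
sort -nr counts.txt | head -n 1             # the "b" line wins
```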
Case-insensitive variants:
Note that the input is transformed to all-lowercase for simplicity.
awk
awk '{ $0=tolower($0); if (++wds[$0] > max) { max = wds[$0]; maxW=$0 } } END { print maxW }' file
sort + uniq + head + cut
tr '[:upper:]' '[:lower:]' < file |
sort | uniq -c | sort -k1,1nr -k2b | head -n 1 | cut -d\" -f2
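A quick check of the case-insensitive pipeline on mixed-case input (mixed.txt is an illustrative file name); "Word1", "WORD1", and "word1" all count as one word, and the output is lowercase because the input is lowercased up front:

```shell
# Mixed-case sample: "word1" occurs 3 times in total, "word2" twice.
printf '"Word1"\n"word2"\n"WORD1"\n"word2"\n"word1"\n' > mixed.txt
tr '[:upper:]' '[:lower:]' < mixed.txt |
sort | uniq -c | sort -k1,1nr -k2b | head -n 1 | cut -d\" -f2
# -> word1
```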
Try something like this: sort test | uniq -c | sort -nr | head -n 1