What are elegant and effective ways to count the frequency of each "English" word in a file?
First of all, I define a letter_only facet (derived from std::ctype<char>) and install it in the stream's std::locale, so that the stream treats every character other than an English letter as whitespace. That way, the stream reads "ways", "ways." and "ways!" as the same word "ways", because punctuation such as "." and "!" is skipped as a separator.
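#include <algorithm>
#include <fstream>
#include <iostream>
#include <locale>
#include <map>
#include <string>
#include <vector>

struct letter_only: std::ctype<char>
{
    letter_only(): std::ctype<char>(get_table()) {}
    static std::ctype_base::mask const* get_table()
    {
        // Start with every character classified as whitespace...
        static std::vector<std::ctype_base::mask>
            rc(std::ctype<char>::table_size, std::ctype_base::space);
        // ...then mark only A-Z and a-z as letters. A single fill from 'A' to 'z'
        // would also mark the ASCII characters between them, such as '[' and '_'.
        std::fill(&rc['A'], &rc['Z'+1], std::ctype_base::alpha);
        std::fill(&rc['a'], &rc['z'+1], std::ctype_base::alpha);
        return &rc[0];
    }
};
int main()
{
    std::map<std::string, int> wordCount;
    std::ifstream input;
    input.imbue(std::locale(std::locale(), new letter_only())); // treat non-letters as whitespace
    input.open("filename.txt");
    std::string word;
    while(input >> word)
    {
        ++wordCount[word];
    }
    for (std::map<std::string, int>::iterator it = wordCount.begin(); it != wordCount.end(); ++it)
    {
        std::cout << it->first << " : " << it->second << std::endl;
    }
}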
// Alternative: the same count with std::for_each and a functor.
// Reuses letter_only and the includes from above.
#include <iterator> // std::istream_iterator

struct Counter
{
    std::map<std::string, int> wordCount;
    void operator()(const std::string & item) { ++wordCount[item]; }
    operator std::map<std::string, int>() { return wordCount; }
};
int main()
{
    std::ifstream input;
    input.imbue(std::locale(std::locale(), new letter_only())); // treat non-letters as whitespace
    input.open("filename.txt");
    std::istream_iterator<std::string> start(input);
    std::istream_iterator<std::string> end;
    // std::for_each returns the functor, which then converts to the map.
    std::map<std::string, int> wordCount = std::for_each(start, end, Counter());
    for (std::map<std::string, int>::iterator it = wordCount.begin(); it != wordCount.end(); ++it)
    {
        std::cout << it->first << " : " << it->second << std::endl;
    }
}
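To try the facet without a file, the same loop can be run over an in-memory stream. This is only a sketch: count_words and count_words_in are hypothetical helper names, not part of the answer above, and the table here marks only A-Z and a-z as letters.

```cpp
#include <algorithm>
#include <istream>
#include <locale>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Same idea as letter_only above: everything is whitespace except A-Z and a-z.
struct letter_only : std::ctype<char>
{
    letter_only() : std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table()
    {
        static std::vector<std::ctype_base::mask>
            rc(std::ctype<char>::table_size, std::ctype_base::space);
        std::fill(&rc['A'], &rc['Z' + 1], std::ctype_base::alpha);
        std::fill(&rc['a'], &rc['z' + 1], std::ctype_base::alpha);
        return &rc[0];
    }
};

// Hypothetical helper: counts the words read from any input stream.
std::map<std::string, int> count_words(std::istream& in)
{
    // The locale takes ownership of the facet, so no delete is needed.
    in.imbue(std::locale(std::locale(), new letter_only()));
    std::map<std::string, int> counts;
    std::string word;
    while (in >> word)
        ++counts[word];
    return counts;
}

// Hypothetical convenience wrapper for counting words in a string.
std::map<std::string, int> count_words_in(const std::string& text)
{
    std::istringstream in(text);
    return count_words(in);
}
```

Calling count_words_in("ways ways. ways! Ways") should report "ways" three times and "Ways" once, since the facet strips punctuation but does not fold case.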
Perl is arguably not so elegant, but very effective.
I posted a solution here: Processing huge text files
In a nutshell,
1) If needed, strip punctuation and convert uppercase to lowercase:
perl -pe "s/[^a-zA-Z \t\n']/ /g; tr/A-Z/a-z/" file_raw > file
2) Count the occurrence of each word. Print results sorted first by frequency, and then alphabetically:
perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a} || $a cmp $b} keys %h) {print "$h{$w}\t$w"}}' file > freq
I ran this code on a 3.3GB text file with 580,000,000 words.
Perl 5.22 completed in under 3 minutes.
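For readers following the C++ answers, the ordering used in step 2 (highest frequency first, ties broken alphabetically) could be sketched roughly as follows. The names WordFreq, by_frequency and sorted_by_frequency are made up for this illustration, and nothing is claimed here about how its speed compares with the Perl one-liner.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> WordFreq; // hypothetical alias: (word, count)

// Higher count first; equal counts are ordered alphabetically.
bool by_frequency(const WordFreq& a, const WordFreq& b)
{
    if (a.second != b.second)
        return a.second > b.second;
    return a.first < b.first;
}

// Copies the map entries into a vector and sorts them with by_frequency.
std::vector<WordFreq> sorted_by_frequency(const std::map<std::string, int>& counts)
{
    std::vector<WordFreq> result(counts.begin(), counts.end());
    std::sort(result.begin(), result.end(), by_frequency);
    return result;
}
```

The map itself stays sorted by word; the copy into a vector is needed because std::map cannot be reordered by value.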
Here is a working solution. This should work with real text (including punctuation):
#include <cctype>
#include <fstream>
#include <iostream>
#include <map>
#include <string>
// Returns the next run of letters, lowercased, or an empty string at end of input.
std::string getNextToken(std::istream &in)
{
    // get() returns an int so that EOF stays distinct from every valid character;
    // storing it in a char could pass a negative value to std::isalpha.
    int c = in.get();
    std::string ans;
    // Skip non-letter characters. Testing the stream itself (not just eof())
    // also ends the loop when the file could not be opened.
    while(in && !std::isalpha(c))
    {
        c = in.get();
    }
    while(std::isalpha(c))
    {
        ans.push_back(static_cast<char>(std::tolower(c)));
        c = in.get();
    }
    return ans;
}
int main()
{
    std::map<std::string,int> words;
    std::ifstream fin("input.txt");
    std::string s;
    while(!(s = getNextToken(fin)).empty())
        ++words[s];
    for(std::map<std::string,int>::iterator iter = words.begin(); iter != words.end(); ++iter)
        std::cout << iter->first << ' ' << iter->second << std::endl;
}
Edit: The code now calls tolower on every letter.