How to create a list of all words in a bunch of files?

I'm dealing with a colleague who has made an enormous number of copy-pasted spelling errors throughout an entire C# solution.

Instead of using a spelling checker on every individual file, I would like to create a list of all words in the entire solution, launch a spelling checker on that list, and do a complete "find-and-replace" for the found entries.

In order to find all words in a file, I had thought of doing something like:

grep -wo ".*" blabla.txt

But that seems not to be working: instead of showing every individual found word, it still shows the entire lines where the words are found, something like:

this is OK
this is NOK
OK it is
NOK it is
Everything is OK

While I was expecting something like:

this
is
OK
this
is
NOK
...

Once I have the list for one file, I can start working with find ./ -name "*.cs" -exec grep ... {} \; >>output_list and do some sort output_list | uniq in order to get the single words.

But first things first: as grep -ow ".*" does not show me the words, but the entire line, what can I do to show all words in a file using UNIX/Linux commandline? (I added awk as a tag, because this might be a solution? But I'm certainly no awk wizard :-) )

Edit after first answers:
tr indeed seems the way to go. I might simply use tr ' ' '\n', but there's a catch: I tried the following but it didn't work:

find ./ -name "*.cs" -exec cat {} | tr ' ' '\n' >>/mnt/c/Temp_Folder\output.txt \;

The command gives me a > prompt (as if I'm inside some code editor or so). What am I still doing wrong?

Dominique asked Oct 11 '25 12:10


2 Answers

How about using tr to replace every space/tab with a line break:

tr '[:blank:]' '\n' <file

this
is
OK
this
is
NOK
OK
it
is
NOK
it
is
Everything
is
OK
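
Source files rarely separate words with blanks alone: a line such as Console.WriteLine("Helo wrold"); would come out as the two tokens Console.WriteLine("Helo and wrold");. A minimal sketch of a variant that treats every run of non-letters as a separator, assuming that a "word" simply means a run of letters (the function name split_words is only illustrative):

```shell
# split_words FILE: print every run of letters in FILE on its own line.
#   -c  complement the set, i.e. act on everything that is NOT a letter
#   -s  squeeze each run of translated characters into a single newline
split_words() {
    tr -cs '[:alpha:]' '\n' < "$1"
}
```

Calling split_words blabla.txt prints one word per line. If the file starts with a non-letter, the first output line is empty.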

Based on your edited question, you may use this find + tr solution in bash shell:

while IFS= read -rd '' f; do
   tr ' ' '\n' < "$f"
done < <(find . -name '*.cs' -print0) >/mnt/c/Temp_Folder/output.txt
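
Two details of the command in the question are worth spelling out, independent of tr: the \; that terminates -exec is an argument to find and so has to appear before the pipe, and the unquoted backslash in Temp_Folder\output.txt is treated by the shell as an escape character rather than as a path separator. A sketch of a single pipeline that avoids both and also removes duplicates, assuming a find that supports -exec ... {} + and a tr that pads the shorter second set, as the GNU tools do (the function name unique_words is illustrative):

```shell
# unique_words DIR: list each distinct whitespace-separated token found in
# the *.cs files below DIR, one per line, sorted.
unique_words() {
    # "+" (like "\;") terminates -exec and belongs to find, so it comes
    # before the pipe. tr turns every run of whitespace into one newline,
    # and sort -u gives the same result as sort | uniq.
    find "$1" -type f -name '*.cs' -exec cat {} + |
        tr -s '[:space:]' '\n' |
        sort -u
}
```

For example, unique_words . >/mnt/c/Temp_Folder/output.txt would write the list to the file from the question. The [:space:] class also covers carriage returns, which may help if the files use Windows line endings.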

anubhava answered Oct 14 '25 03:10


Use \w instead of . to identify just word-constituent characters instead of any characters, and use {2,} instead of * to only look for strings of 2 or more such characters so your output isn't cluttered up with single characters like a, i, etc.:

$ grep -Eow '\w{2,}' file
this
is
OK
this
is
NOK
OK
it
is
NOK
it
is
Everything
is
OK

I'd suggest leaving 2-letter "words" alone as well, though: they're unlikely to be wrong and are easy to understand even when they are. Stick with words of 3 or more letters:

$ grep -Eow '\w{3,}' file
this
this
NOK
NOK
Everything
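
The same grep can be run over the whole solution in one go. A sketch, assuming GNU grep (for -r, --include and \w) and using a hypothetical function name word_counts, that also counts how often each word occurs so that frequently repeated misspellings stand out:

```shell
# word_counts DIR: count how often each word of 3 or more word characters
# occurs in the *.cs files below DIR, most frequent first.
# GNU grep options: -r recurse, -h omit file names, -o print only the match,
# -E extended regex, -w whole words, --include restricts the files searched.
word_counts() {
    grep -rhoEw --include='*.cs' '\w{3,}' "$1" |
        sort |
        uniq -c |
        sort -rn
}
```

Replacing uniq -c | sort -rn with sort -u gives a plain word list instead. Note that \w also matches digits and underscores, and that an identifier such as GetCustomerNmae is reported as one word.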

When you go to replace them, create a file named bad2good that maps each bad word to its good replacement, like this:

tish this
thsi this
ONK NOK

and then use this GNU awk (for \< and \> word boundaries) script:

awk '
    NR==FNR {
        b2g["\\<" $1 "\\>"] = $2
        next
    }
    {
        for ( bad in b2g ) {
            good = b2g[bad]
            gsub(bad,good)
        }
        print    # output the line, with any replacements applied
    }
' bad2good file
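
The awk script handles one file at a time and leaves the file on disk unchanged. A sketch of one way to apply the same bad2good map to every .cs file in place, swapping awk for GNU sed to do the replacing (an assumption: \<, \> and -i as used here are GNU extensions). The function name apply_fixes and the .bak suffix are illustrative, and the map is assumed to contain only plain words:

```shell
# apply_fixes MAP DIR: rewrite every *.cs file below DIR, replacing each bad
# word listed in MAP ("bad good" per line) and keeping the original as FILE.bak.
# Assumes GNU sed and that MAP contains only plain words (no "/" and no
# regex metacharacters).
apply_fixes() {
    script=$(mktemp) || return 1
    # Turn a line like "tish this" into the sed command  s/\<tish\>/this/g
    awk 'NF == 2 { printf "s/\\<%s\\>/%s/g\n", $1, $2 }' "$1" > "$script"
    find "$2" -type f -name '*.cs' -exec sed -i.bak -f "$script" {} +
    rm -f "$script"
}
```

After apply_fixes bad2good . each changed file can be compared with its .bak copy (or with version control) before the backups are deleted.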

Keep backups and be careful! In particular, have a human review the files after the changes and recompile them, as this is a dangerous exercise you're undertaking.
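
One way to support that review is to list every place a bad word occurs before changing anything. A sketch, assuming GNU grep and a bad2good file laid out as above (the function name show_occurrences is hypothetical):

```shell
# show_occurrences MAP DIR: print file:line:text for every whole-word
# occurrence of a bad word (first column of MAP) in the *.cs files below DIR.
# -F treats the patterns as fixed strings, -f - reads them from standard input.
show_occurrences() {
    awk '{ print $1 }' "$1" |
        grep -rnwF -f - --include='*.cs' "$2"
}
```

Running show_occurrences bad2good . before the replacement shows how many lines will be touched, and running it again afterwards should print nothing.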

Ed Morton answered Oct 14 '25 01:10