I'm dealing with a colleague who has made an enormous number of copy/pasted spelling errors throughout an entire C# solution.
Instead of running a spelling checker on every individual file, I would like to create a list of all words in the entire solution, run a spelling checker on that list, and then do a complete find-and-replace for the entries it flags.
In order to find all words in a file, I had thought of doing something like:
grep -wo ".*" blabla.txt
But that doesn't seem to work: instead of showing each individual word it finds, it still shows the entire lines containing those words, something like:
this is OK
this is NOK
OK it is
NOK it is
Everything is OK
While I was expecting something like:
this
is
OK
this
is
NOK
...
Once I have the list for one file, I can start working with

find ./ -name "*.cs" -exec grep ... {} \; >>output_list

and then do

sort output_list | uniq

in order to get the unique words.
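To make the plan concrete, here is a rough sketch with throwaway sample files standing in for my solution (the exact word-extraction pattern is still the open question; grep -Eo '\w+' is just a guess):

```shell
# Rough sketch of the plan, using temporary sample files.
tmpdir=$(mktemp -d)
printf 'this is OK\nthis is NOK\n' > "$tmpdir/a.cs"
printf 'OK it is\n' > "$tmpdir/b.cs"

# Collect one word per line from every .cs file...
find "$tmpdir" -name '*.cs' -exec grep -Eo '\w+' {} \; > "$tmpdir/output_list"
# ...then reduce the list to unique entries.
sort "$tmpdir/output_list" | uniq
```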
But first things first: as grep -ow ".*" does not show me the words but the entire line, what can I do to show all words in a file using the UNIX/Linux command line? (I added awk as a tag because it might be a solution, but I'm certainly no awk wizard :-) )
Edit after first answers:
tr indeed seems the way to go. I might simply use tr ' ' '\n', but there's a catch: I tried the following but it didn't work:

find ./ -name "*.cs" -exec cat {} | tr ' ' '\n' >>/mnt/c/Temp_Folder\output.txt \;

The command just gives me a > continuation prompt (as if I'm inside some line editor or so). What am I still doing wrong?
How about using tr to replace every space/tab with a line break:

tr '[:blank:]' '\n' <file

(Note that tr writes the class as [:blank:] on its own; the extra brackets in [[:blank:]] would make tr translate literal [ and ] characters as well.)
this
is
OK
this
is
NOK
OK
it
is
NOK
it
is
Everything
is
OK
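One small caveat with the command above: plain tr turns every single blank into a newline, so runs of spaces or tabs produce empty lines in the output. The -s (squeeze-repeats) flag collapses each run into one newline, e.g.:

```shell
tmp=$(mktemp)
printf 'this  is\tOK\n' > "$tmp"   # note the double space and the tab

# -s squeezes each run of blanks into a single newline.
tr -s '[:blank:]' '\n' < "$tmp"    # prints "this", "is", "OK" on three lines
```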
Based on your edited question, you may use this find + tr solution in a bash shell:
while IFS= read -rd '' f; do
tr ' ' '\n' < "$f"
done < <(find . -name '*.cs' -print0) >/mnt/c/Temp_Folder/output.txt
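As for why the attempt in the question printed a > prompt: most likely the pipe splits the command, so the pipe and the redirection are parsed by the interactive shell rather than by find, find never sees a terminating \; for its -exec, and the unquoted backslash in the output path mangles the file name. An alternative to the loop above (a sketch; the paths are examples) is to let -exec ... + concatenate all the files and run tr once on the combined stream:

```shell
tmpdir=$(mktemp -d)
printf 'this is OK\n' > "$tmpdir/a.cs"

# cat every .cs file in one go, then split into words once.
find "$tmpdir" -name '*.cs' -exec cat {} + | tr ' ' '\n' > "$tmpdir/output.txt"
cat "$tmpdir/output.txt"
```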
Use \w instead of . to match just word-constituent characters instead of any character, and use {2,} instead of * to only look for strings of 2 or more such characters, so your output isn't cluttered up with single characters like a, i, etc.:
$ grep -Eow '\w{2,}' file
this
is
OK
this
is
NOK
OK
it
is
NOK
it
is
Everything
is
OK
I'd suggest you don't try to find/modify 2-letter "words" either, though, as they're unlikely to be wrong and are easy to understand even when they are; stick with words of 3 or more letters:
$ grep -Eow '\w{3,}' file
this
this
NOK
NOK
Everything
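To go from per-file output to the single deduplicated list the question asks for, the same grep can be fed every .cs file through find and the result piped to sort -u (a sketch with sample files; -h suppresses the file-name prefixes grep adds when given multiple files):

```shell
tmpdir=$(mktemp -d)
printf 'this is NOK\nNOK it is\n' > "$tmpdir/a.cs"
printf 'Everything is OK\n' > "$tmpdir/b.cs"

# One word (3+ letters) per line from every file, deduplicated.
find "$tmpdir" -name '*.cs' -exec grep -Eohw '\w{3,}' {} + | sort -u
```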
When you go to replace them, create a file named bad2good like this, mapping bad words to good ones:
tish this
thsi this
ONK NOK
and then use this GNU awk script (GNU awk is needed for the \< and \> word boundaries):

awk '
NR==FNR {
    b2g["\\<" $1 "\\>"] = $2
    next
}
{
    for ( bad in b2g ) {
        good = b2g[bad]
        gsub(bad, good)
    }
    print
}
' bad2good file
Keep backups and be careful! In particular, make sure to do a human review of the files after the changes, as well as compiling them, as this is a dangerous exercise you're undertaking.
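If GNU awk isn't available, a portable variant of the same idea replaces whole whitespace-separated fields instead of relying on the \< and \> boundaries (a sketch with example file names; unlike the gawk version it misses words glued to punctuation):

```shell
tmpdir=$(mktemp -d)
printf 'tish this\nthsi this\nONK NOK\n' > "$tmpdir/bad2good"
printf 'tish is ONK\n' > "$tmpdir/file"

# Replace any field that appears verbatim as a key in the bad2good map.
awk '
NR==FNR { b2g[$1] = $2; next }
{
    for (i = 1; i <= NF; i++) if ($i in b2g) $i = b2g[$i]
    print
}
' "$tmpdir/bad2good" "$tmpdir/file" > "$tmpdir/file.fixed"
cat "$tmpdir/file.fixed"   # prints: this is NOK
```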