I have several hundred GB of data that I need to paste together using the Unix paste utility in Cygwin, but it won't work properly if the files contain Windows EOL characters. The data may or may not have Windows line endings, and I don't want to spend the time running dos2unix if I don't have to.
So my question is: in Cygwin, how can I figure out whether these files have Windows (CRLF) line endings?
I've tried creating some test data and running
sed -r 's/\r\n//' testdata.txt
But that appears to match regardless of whether dos2unix has been run or not.
Thanks.
Use a text editor like Notepad++ to inspect the line endings. It shows the line-ending format of the open file, either Unix (LF), Macintosh (CR), or Windows (CR LF), in the status bar. You can also go to View > Show Symbol > Show End of Line to display each line ending as LF, CR LF, or CR.
In Notepad++, go to the View > Show Symbol menu and select Show End of Line. The CR and LF characters are then displayed visually at the end of each line.
DOS uses carriage return and line feed ("\r\n") as a line ending, while Unix uses just line feed ("\n").
Whereas Windows follows the original convention of a carriage return plus a line feed (CRLF) for line endings, operating systems like Linux and macOS use only the line feed (LF) character. The history of these two control characters dates back to the era of the typewriter.
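Given that difference, one way to test a file from the shell is to look for a carriage return immediately before the end of a line. The sketch below is only an illustration: has_crlf is a hypothetical helper name, not a standard command, and it assumes grep passes carriage returns through unchanged, which is worth verifying on your Cygwin installation. It also hints at why the sed attempt in the question cannot distinguish the two cases: like grep, sed hands the pattern one line at a time with the trailing newline already removed, so a pattern containing "\n" has nothing to match against.

```shell
# Hypothetical helper: exit status 0 if the file contains at least one
# line that ends in CR LF, non-zero otherwise.
has_crlf() {
    # printf '\r' produces a literal carriage return. grep sees each line
    # without its trailing LF, so "CR at end of line" means the line was
    # terminated by CR LF. -q stops at the first match and prints nothing.
    grep -q "$(printf '\r')\$" "$1"
}

# Usage: report the status of a single file.
# if has_crlf testdata.txt; then echo "CRLF found"; else echo "no CRLF"; fi
```

Because grep -q stops at the first matching line, a file that really does use CRLF is usually identified after reading only its first line, while a clean file has to be read to the end.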
The file(1) utility knows the difference:
$ file * | grep ASCII
2:                                       ASCII text
3:                                       ASCII English text
a:                                       ASCII C program text
blah:                                    ASCII Java program text
foo.js:                                  ASCII C++ program text
openssh_5.5p1-4ubuntu5.dsc:              ASCII text, with very long lines
windows:                                 ASCII text, with CRLF line terminators
file(1) has been optimized to try to read as little of a file as possible, so you may be lucky and drastically reduce the amount of disk IO you need to perform when finding and fixing the CRLF terminators.
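To avoid running dos2unix on files that don't need it, the file(1) check can be wrapped in a small loop. This is a rough sketch rather than a tested recipe for your data: list_crlf_files is a hypothetical helper name, and it relies on file reporting the string "CRLF" in its description, as in the output above. Since file only examines an initial portion of each file, a file whose CRLF terminators appear only far from the start could be missed.

```shell
# Hypothetical helper: print the names of the given files that file(1)
# describes as having CRLF line terminators.
list_crlf_files() {
    for f in "$@"; do
        # -b (brief) omits the filename from file's output, so a filename
        # that happens to contain "CRLF" cannot cause a false match.
        if file -b -- "$f" | grep -q 'CRLF'; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage: convert only the files that were flagged.
# list_crlf_files *.txt | while IFS= read -r f; do dos2unix "$f"; done
```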
Note that some cases of CRLF should stay in place: captures of SMTP will use CRLF. But that's up to you. :)