I scratched my head for an hour over a Perl one-liner that failed because the file had CRLF line endings. The regex has a capture group at the end of the line, and the CR got included in the match, which garbled the output when the backreference was used in the replacement.
I ended up specifying the CRLF manually in the regex, but is there a way to get Perl to handle line endings automatically, whatever they are?
Original command is
perl -pe 's/foo bar(.*)$/foo $1 bar/g' file.txt
"Correct" command is
perl -pe 's/foo bar(.*)\r\n/foo $1 bar\r\n/g' file.txt
I know I can also convert the line endings before processing; I'm interested in how to get Perl to handle this case gracefully.
Example file (save with CRLF line endings!)
[19:06:57.033] foo barmy
[19:06:57.033] foo baryour
Expected output
[19:06:57.033] foo my bar
[19:06:57.033] foo your bar
Output with original command (bar goes at line beginning because it's matched together with carriage return):
bar:06:57.033] foo my
bar:06:57.033] foo your
Is there a way to get Perl to automatically handle platform-specific line endings?
Yes. It's actually the default.
The issue is that you're trying to handle Windows line endings on a unix platform.
This will definitely do it:
perl -pe'
BEGIN {
binmode STDIN, ":crlf";
binmode STDOUT, ":crlf";
}
s/foo bar(.*)$/foo $1 bar/g;
' <file.txt
Might I suggest you keep doing it manually?
Alternatively, you could convert the file to a Unix text file (LF line endings), process it, and convert it back.
<file.orig dos2unix | perl -pe'...' | unix2dos >file.new
In newer perls, you can use \R in your regex to strip off all end-of-line characters (it includes both \n and \r). See perldoc perlre.
The \R escape sequence (Perl v5.10+; see perldoc perlrebackslash or the documentation online), which matches "generic newlines" (platform-agnostically), can be made to work here (the example uses Bash to create the multi-line input string):
$ printf 'foo barmy\r\nfoo baryour\r\n' | perl -pe 's/foo bar(.*?)\R/foo $1 bar\n/gm'
foo my bar
foo your bar
Note that the only difference to Ether's answer is use of a non-greedy construct (.*? rather than just .*), which makes all the difference here.
Read on, if you want to know more.
Background:
It is an example of a pitfall associated with \R, which stems from the fact that it can match one or two characters - either \r\n or, typically, \n:[1]
With the greedy (.*) construct, "my\r" (including the \r) is captured: . matches \r but not \n, so .* consumes everything up to the \n, and \R is then satisfied by the remaining \n by itself.
By contrast, using the non-greedy (.*?) construct causes \R to match the \r\n sequence, as intended.
[1] \R matches MORE than just \r\n and \n: besides the two-character sequence \r\n, it matches any single character that is classified as vertical whitespace in Unicode terms (the set matched by \v), which also includes vertical tab (\x0B), \f (form feed), \r (by itself), and the following Unicode chars: U+0085 (NEXT LINE; decimal 133), U+2028 (LINE SEPARATOR; decimal 8232) and U+2029 (PARAGRAPH SEPARATOR; decimal 8233)
You can say:
perl -pe 's/foo bar([^\015]*)(\015?\012)/foo $1 bar$2/g' *.txt
The line endings would be preserved, i.e. would be the same as the input file.
You might also want to refer to perldoc perlport.