Why does grep treat \n and \\n the same way?
For example, both match hallo\nworld.
grep("hallo\nworld", pattern="\n")
[1] 1
grep("hallo\nworld", pattern="\\n")
[1] 1
I see that hallo\nworld is parsed into
hallo
world
that is, hallo on one line and world on one line.
So in grep("hallo\nworld", pattern="\n"), is the pattern="\n" a new line or \n literally?
Also note this happens with others; \a \f \n \t \r and \\a \\f \\n \\t \\r are all treated identically. But \d \w \s can't be used! Why not?
I tested several different strings and found that the answer lies in how regular expressions handle escapes.
There are two kinds of escaping. One is escaping in an ordinary string, which is simple to understand; the other is escaping in a regular expression pattern string. In an R pattern, as in grep(x, pattern="some string here"), \\n and \n both stand for a newline character. In an ordinary string, however, \\n and \n differ: the former is literally \n, the latter is a newline character. We can prove this with:
cat("\n")
cat("\\n")
\n>
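The cat() output is easy to misread, so here is a minimal sketch that makes the string-level difference explicit by counting characters and looking at the raw bytes. The variable names nl and literal are only illustrative.

```r
# String level: "\n" is one character, "\\n" is two (a backslash and an n)
nl      <- "\n"
literal <- "\\n"

nchar(nl)                 # 1
nchar(literal)            # 2
identical(nl, literal)    # FALSE

# The raw bytes make the difference visible
charToRaw(nl)             # 0a
charToRaw(literal)        # 5c 6e
```

So as plain strings the two are clearly different; they only behave alike once they are handed to grep as a pattern.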
How to prove this? I'll try with other characters, not just \n, to see if they match in the same way.
special1 <- c( "\a", "\f", "\n", "\t", "\r")
special2 <- c("\\a","\\f","\\n","\\t","\\r")
target <- paste("hallo", special1, "world", sep="")
for (i in seq_along(target)) {
  cat("i=", i, "\n")
  # grepl() returns TRUE/FALSE, so if() does not fail when there is no match
  if (grepl(pattern = special1[i], x = target[i]))
    print(paste(target[i], "match", special1[i], "succeed"))
  if (grepl(pattern = special2[i], x = target[i]))
    print(paste(target[i], "match", special2[i], "succeed"))
}
output:
i= 1
[1] "hallo\aworld match \a succeed"
[1] "hallo\aworld match \\a succeed"
i= 2
[1] "hallo\fworld match \f succeed"
[1] "hallo\fworld match \\f succeed"
i= 3
[1] "hallo\nworld match \n succeed"
[1] "hallo\nworld match \\n succeed"
i= 4
[1] "hallo\tworld match \t succeed"
[1] "hallo\tworld match \\t succeed"
i= 5
[1] "hallo\rworld match \r succeed"
[1] "hallo\rworld match \\r succeed"
Note that \a \f \n \t \r and \\a \\f \\n \\t \\r are all treated identically in an R regular expression pattern string!
Not only that, you cannot write \d \w \s directly in an R pattern string!
You can write any of these:
pattern="\a", pattern="\f", pattern="\n", pattern="\t", pattern="\r"
But you can't write any of these in grep:
pattern="\d", pattern="\w", pattern="\s"
At first this looked like a bug to me, as \d \w \s are treated differently from \a \f \n \t \r. The difference is that "\d", "\w" and "\s" are not valid escapes in an R string literal, so R rejects them before grep ever sees them; written as "\\d", "\\w" and "\\s", they work as patterns.
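To check what happens with \d \w \s, here is a small sketch. It assumes the default regex engine (no perl=TRUE or fixed=TRUE); the names bad_literal and parse_ok are made up for the example. parse() is used only so that the invalid literal can be caught rather than stopping the script.

```r
# With the backslash doubled, the character classes work as patterns
grepl("\\d", "hallo123")      # TRUE: contains a digit
grepl("\\w", "hallo")         # TRUE: contains a word character
grepl("\\s", "hallo world")   # TRUE: contains whitespace

# A single backslash never reaches grep: the string literal itself is invalid.
# bad_literal holds the source text of the literal "\d" (quote, backslash, d, quote).
bad_literal <- '"\\d"'
parse_ok <- tryCatch({ parse(text = bad_literal); TRUE },
                     error = function(e) FALSE)
parse_ok                      # FALSE
```

In other words, the failure happens when R reads the string, not inside grep.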
"\n" matches a newline character.
In Perl (unlike R, where single and double quotes behave the same), '\n' means a literal backslash followed by the letter n, whereas "\n" means the newline character. Also, Perl's special variable $/ is the input record separator, which is "\n" by default.
\n, \\n and \\\n all match because the search pattern is evaluated twice. I observed this by running a couple of examples:
grep("hello\nworld", pattern="\n")
[1] 1
grep("hello\nworld", pattern="\\n")
[1] 1
grep("hello\nworld", pattern="\\\n")
[1] 1
grep("hello\nworld", pattern="\\\\n")
integer(0)
grep("hello\\nworld", pattern="\\\\n")
[1] 1
Keep in mind the rules of evaluating backslash escape sequences:
\\ is replaced with a \
\n is replaced with a NEWLINE character
\ + NEWLINE is replaced with a NEWLINE character
(see ?regex for more details)
With this in mind, if you evaluate the pattern twice, you get:
\n => NEWLINE => NEWLINE
\\n => \n => NEWLINE
\\\n => \ + NEWLINE => NEWLINE
\\\\n => \\n => \n
\\\\\n => \\ + NEWLINE => \ + NEWLINE
\\\\\\n => \\\n => \ + NEWLINE
\\\\\\\n => \\\ + NEWLINE => \ + NEWLINE
\\\\\\\\n => \\\\n => \\n
And so on. Examples 1-3 all evaluate to a single NEWLINE, which is why these patterns match. (The string you're matching against the pattern, by contrast, is evaluated only once.)
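The two evaluation steps can be looked at separately. This is only a sketch with illustrative variable names: nchar() shows what is left after R's string parser has run (the first evaluation), and grepl() then shows what the regex engine makes of it (the second). The fixed=TRUE argument of grepl() switches the second step off.

```r
# Four ways of writing the pattern in source code
patterns <- c("\n", "\\n", "\\\n", "\\\\n")

# First evaluation (R's string parser): what the regex engine receives
nchar(patterns)    # 1 2 2 3

# Second evaluation (regex engine): does it match the one-newline string?
matches <- vapply(patterns, function(p) grepl(p, "hello\nworld"),
                  logical(1), USE.NAMES = FALSE)
matches

# fixed = TRUE turns the second evaluation off, so backslash + n
# is taken literally and no longer means NEWLINE
fixed_match <- grepl("\\n", "hello\nworld", fixed = TRUE)
fixed_match        # FALSE
```

The values in matches should line up with the grep session above: the first three patterns match and the four-backslash one does not.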
A discussion on the R mailing list posted by @Aaron explains the double evaluation like this:
There are two levels [of evaluation] because backslashes are escape characters both to R strings and regular expressions.
Note that other languages don't evaluate patterns like this when the pattern is written as a raw string or a regex literal. Take for example Python:
>>> import re
>>> re.search(r'\n', 'hello\nworld') is not None
True
>>> re.search(r'\\n', 'hello\nworld') is not None
False
Or Perl:
$ perl -e 'print "hello\nworld" =~ /\n/ || 0, "\n"'
1
$ perl -e 'print "hello\nworld" =~ /\\n/ || 0, "\n"'
0
And we could go on. So the double evaluation in R seems unusual. Why is it implemented this way? I think the ultimate answer lies with R-devel.
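For comparison with the Python session above, R can do something similar if you have R 4.0.0 or later, which added raw string literals of the form r"(...)". This is a sketch under that version assumption; inside a raw string the parser leaves backslashes alone, so only the regex engine interprets them.

```r
# Assumption: R >= 4.0.0 (raw string literals)
nchar(r"(\n)")                          # 2: a backslash and an n

grepl(r"(\n)",  "hello\nworld")         # TRUE: regex \n matches the newline
grepl(r"(\\n)", "hello\nworld")         # FALSE: regex \\n wants a literal backslash
grepl(r"(\\n)", "hello\\nworld")        # TRUE: the subject contains backslash + n
```

With raw strings the results follow the same pattern as the r'...' examples in Python.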
ACKNOWLEDGEMENTS
I thank @Aaron, whose critical comments helped improve this answer.
Note that the backslash itself is special: you have to escape the backslash with a backslash.
The \\n means "I really want to match a newline character, not a literal \n".
grep("hallo\nworld", pattern = "\\n")
[1] 1
grep("hallo\\nworld", pattern = "\\\\n")
[1] 1