Is the regex \L supported in R 3.5.0?

Question

I am experiencing difficulty with the perl expression \L\1 in very particular circumstances on R-dev (2017-06-06 and 2017-06-16 r72796 builds):

bib <- readLines("https://raw.githubusercontent.com/HughParsonage/TeXCheckR/master/tests/testthat/lint_bib_in.bib", encoding = "UTF-8")

leading_spaces <- 2

is_field <- grepl("=", bib, fixed = TRUE)
field_width <- nchar(trimws(gsub("[=].*$", "", bib, perl = TRUE)))

widest_field <- max(field_width[is_field])

out <- bib

# Vectorized gsub:
for (line in seq_along(bib)){
  # Replace every field line with
  # two spaces + field name + spaces required for widest field + space
  if (is_field[line]){
    spaces_req <- widest_field - field_width[line]
    out[line] <-
      gsub("^\\s*(\\w+)\\s*[=]\\s*\\{",
           paste0(paste0(rep(" ", leading_spaces), collapse = ""),
                  "\\L\\1",
                  paste0(rep(" ", spaces_req), collapse = ""),
                  " = {"),
           bib[line],
           perl = TRUE)
  }
}

# Add commas: 
out[is_field] <- gsub("\\}$", "\\},", out[is_field], perl = TRUE)

out[9]
#> R-dev   "  author"
#> R 3.4.0 "  author      = {Tony Wood and Amélie Hunter and Michael O'Toole and Prasana Venkataraman and Lucy Carter},"

To reproduce, all of the following appear to be necessary:

To readLines from a file, and specify the encoding. (Recreating the vector with dput does not reproduce the problem.)
To use \L or \U in the perl regex replacement.
To apply it to a character vector.
To have an element of that vector that requires UTF-8 (the é in Amélie above).
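
The conditions above can be sketched without the network as follows. This is only an illustration of the intended \L behaviour, not a guaranteed reproduction: the two-line vector is hypothetical, and the "\u00e9" escape is an assumption standing in for a line that readLines(..., encoding = "UTF-8") would mark as UTF-8. Since building the vector by hand (as with dput) reportedly does not trigger the problem, an affected R-dev build may still return the correct result here.

```r
# Hypothetical input; "\u00e9" yields a UTF-8-marked element, loosely
# imitating what readLines(..., encoding = "UTF-8") returns.
x <- c("  Title = {Report}", "  Author = {Am\u00e9lie Hunter}")

# \\L lower-cases the captured field name; \\E ends the case conversion
# so the remainder of the replacement is left untouched.
out <- gsub("^\\s*(\\w+)\\s*=\\s*\\{", "  \\L\\1\\E = {", x, perl = TRUE)
print(out)
```

On a build with working \L support, both elements should keep everything after the field name, including the text inside the braces.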

Is this a change in R 3.5.0, or have I been misusing \L in this instance?

Wiktor Stribiżew · Accepted Answer

UPDATE

The patch correcting this behaviour was applied in r74274.
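
A rough way to check whether a development build postdates that revision is sketched below. It assumes the build was made from the development trunk, where comparing against r74274 is meaningful; for release or patched branches the number says little, and some builds report a non-numeric revision.

```r
# Revision in which the fix landed, as stated in the update above.
fix_revision <- 74274L

# R.version[["svn rev"]] is a character string; it can be non-numeric
# (e.g. "unknown") on some builds, hence suppressWarnings().
build_revision <- suppressWarnings(as.integer(R.version[["svn rev"]]))

# TRUE only when a numeric revision is available and is at least r74274.
has_fix <- !is.na(build_revision) && build_revision >= fix_revision
print(has_fix)
```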

ORIGINAL ANSWER

There is clearly some unexpected behavior.

When the replacement refers to plain \1, it works, outputting:

[1] "  author      = {Tony Wood and Amélie Hunter and Michael O'Toole and Prasana Venkataraman and Lucy Carter},"

However, whenever \U or \L is used with \1, the second backreference gets removed:

"\\U\\1": [1] " AUTHOR"
"\\U\\1\\E\\2": [1] " AUTHOR"

A gsubfn solution still works (here, an example with toupper()):

library(gsubfn)
bib <- readLines("https://raw.githubusercontent.com/HughParsonage/TeXCheckR/master/tests/testthat/lint_bib_in.bib", encoding = "UTF-8")
leading_spaces <- 2
is_field <- grepl("=", bib, fixed = TRUE)
field_width <- nchar(trimws(gsub("[=].*$", "", bib, perl = TRUE)))
widest_field <- max(field_width[is_field])
out <- bib

# Vectorized gsub:
for (line in seq_along(bib)){
  # Replace every field line with
  # two spaces + field name + spaces required for widest field + space
  if (is_field[line]){
    spaces_req <- widest_field - field_width[line]
    out[line] <-
      gsubfn("^\\s*(\\w+)\\s*=\\s*\\{",
             function(y) paste0(
                  paste0(rep(" ", leading_spaces), collapse = ""),
                  toupper(y),
                  paste0(rep(" ", spaces_req), collapse = ""),
                  " = {"
             ),
           bib[line], engine="R"
      )
  }
}
# Add commas: 
out[is_field] <- gsub("\\}$", "},", out[is_field], perl = TRUE)

out[9]

Output:

[1] "  AUTHOR      = {Tony Wood and Amélie Hunter and Michael O'Toole and Prasana Venkataraman and Lucy Carter},"
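
If adding a gsubfn dependency is undesirable, a base-R variant that avoids \L and \U entirely might look like the sketch below. This is a swapped-in approach rather than the code from the question: align_fields is a hypothetical helper name, it lower-cases with tolower() and pads with formatC(), and it assumes field names are plain ASCII word characters.

```r
# Hypothetical helper: align "name = {" field lines without \\L or \\U.
align_fields <- function(lines, leading_spaces = 2L) {
  pattern <- "^\\s*(\\w+)\\s*=\\s*\\{"
  is_field <- grepl(pattern, lines, perl = TRUE)
  if (!any(is_field)) return(lines)

  # Extract the field name with a plain \\1 and lower-case it in R.
  name <- tolower(sub(paste0(pattern, ".*$"), "\\1", lines[is_field], perl = TRUE))
  # Everything after the opening brace is kept as-is.
  rest <- sub(pattern, "", lines[is_field], perl = TRUE)

  # Left-justify every name to the width of the widest one.
  padded <- formatC(name, width = max(nchar(name)), flag = "-")
  lines[is_field] <- paste0(strrep(" ", leading_spaces), padded, " = {", rest)
  lines
}

# Hypothetical sample entry for illustration.
sample_bib <- c("@Article{x,", "  Author = {Am\u00e9lie Hunter}", "title={A}", "}")
aligned <- align_fields(sample_bib)
print(aligned)
```

Because the case conversion happens outside the regex engine, this does not depend on how a given build handles \L together with UTF-8 input. The trailing commas can still be added afterwards with the same gsub call as above.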

My sessionInfo details:

> sessionInfo()
R Under development (unstable) (2017-06-19 r72808)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gsubfn_0.6-6 proto_1.0.0 

loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0    tcltk_3.5.0

Tags: regex, r

Asked by: Hugh