I have a column in a dataframe with street addresses like this:
df_col <- c("100 W 10th St", "200 Drury Ln 2a", "300 W 10th St", "400b Drury Ln")
I want to capitalize any single lower-case letters that immediately follow digits like this:
df_col <- c("100 W 10th St", "200 Drury Ln 2A", "300 W 10th St", "400B Drury Ln")
I have been able to use str_detect from the stringr package to detect substrings with digits followed by a single lower-case letter:
df %>%
filter(str_detect(df_col, "\\b\\d+[a-z]\\b"))
This is my first time writing a regex. My understanding of the pattern is as follows:
\\b matches the boundary of a word (or substring)
\\d matches any digit and the + is to match additional digits that follow the first digit if applicable
[a-z] matches one lower-case letter (any letter)
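To check what the pattern described above actually picks out, here is a minimal sketch. It assumes the stringr package is installed and reuses the sample vector from the question; the variable names pattern, hits and matched are only for illustration.

```r
library(stringr)

df_col <- c("100 W 10th St", "200 Drury Ln 2a", "300 W 10th St", "400b Drury Ln")
pattern <- "\\b\\d+[a-z]\\b"

# TRUE where a whole token is digits followed by exactly one lower-case letter
hits <- str_detect(df_col, pattern)
hits
# [1] FALSE  TRUE FALSE  TRUE

# The first matching substring in each element (NA where nothing matches)
matched <- str_extract(df_col, pattern)
matched
# [1] NA     "2a"   NA     "400b"
```

Note that "10th" is not matched: two letters follow the digits, so there is no word boundary right after the "t".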
However, I am struggling to figure out how to replace each of these substrings with the same substring but a capitalized letter.
I have tried using str_replace_all, but I cannot figure out the third argument. I thought I could do something like this, but it is replacing each substring with the literal regex.
df %>%
mutate(df_col = str_replace_all(df_col, "\\b\\d+[a-z]\\b", "\\b\\d+[A-Z]\\b"))
I tried using gsub with mutate but could not figure that out either. I would prefer to learn a solution for str_replace_all, but other ways of solving the problem are welcome.
Aiming for a simpler solution: this matches a digit followed by a single word character and a word boundary, then runs toupper() on the match to capitalize it. Since toupper() has no effect on the numeric part of the string, we don't have to worry about lookahead/lookbehind or anything more complicated.
library(stringr)
str_replace_all(
df_col,
pattern = "\\d\\w\\b",
replacement = toupper
)
# [1] "100 W 10th St" "200 Drury Ln 2A" "300 W 10th St" "400B Drury Ln"
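Passing toupper as the replacement also works with the stricter pattern from the question, which may matter if some addresses have digits in the middle of a token. The sketch below compares the two patterns; "Apt3b Main St" is a made-up address added only to show where they differ, and the names loose and strict are illustrative.

```r
library(stringr)

# "Apt3b Main St" is a hypothetical address, not from the original data
x <- c("400b Drury Ln", "200 Drury Ln 2a", "Apt3b Main St")

# Loose pattern: any digit + one word character before a word boundary
loose <- str_replace_all(x, "\\d\\w\\b", toupper)
# Stricter pattern: a whole token made of digits + one lower-case letter
strict <- str_replace_all(x, "\\b\\d+[a-z]\\b", toupper)

loose
# [1] "400B Drury Ln"   "200 Drury Ln 2A" "Apt3B Main St"
strict
# [1] "400B Drury Ln"   "200 Drury Ln 2A" "Apt3b Main St"
```

Which behaviour is preferable depends on the data; the stricter pattern leaves tokens such as "Apt3b" untouched.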
You can use gsub like this:
gsub("(\\d\\p{Ll})\\b", "\\U\\1", df_col, perl=TRUE)
## Or, if you must ensure the matches are whole words starting with a number and then a letter:
gsub("\\b(\\d+\\p{Ll})\\b", "\\U\\1", df_col, perl=TRUE)
See the online R demo.
Details
(\d\p{Ll})\b matches and captures into Group 1 a digit (\d) and then a lowercase letter (\p{Ll}) that is not followed by a letter, digit or underscore (\b)
perl=TRUE enables the \U operator in the replacement pattern (which converts the replacement text to upper case) and also the use of Unicode category classes in the regex (like \p{X})
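As a variation on the gsub approach, the digits and the letter can be captured in separate groups so that \U is applied to the letter only. This is a sketch in base R using the sample vector from the question; the two-group layout and the name result are illustrative choices, not part of the answer above.

```r
df_col <- c("100 W 10th St", "200 Drury Ln 2a", "300 W 10th St", "400b Drury Ln")

# Group 1 = the digits, group 2 = the single lower-case letter.
# With perl = TRUE, \\U upper-cases what follows in the replacement
# and \\E ends the case conversion.
result <- gsub("\\b(\\d+)(\\p{Ll})\\b", "\\1\\U\\2\\E", df_col, perl = TRUE)
result
# [1] "100 W 10th St"   "200 Drury Ln 2A" "300 W 10th St"   "400B Drury Ln"
```

The output is the same as upper-casing the whole match here, since digits have no case; splitting the groups only makes the intent more explicit.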