I'm trying to extract data from strings that always follow the same format (scraped from social sites with no API support).
Example strings:
53.2k Followers, 11 Following, 1,396 Posts
5m Followers, 83 Following, 1.1m Posts
I'm currently using the following regular expression: "[0-9]{1,5}([,.][0-9]{1,4})?" to get the numeric sections, preserving the comma and dot separators.
It yields results like:
53.2, 11, 1,396 
5, 83, 1.1
I need a regular expression that also grabs the character immediately after each numeric section, even if it is whitespace. For example:
53.2k, 11 , 1,396
5m, 83 , 1.1m
Any help is greatly appreciated.
R code for reproduction:
  library(stringr)
  # Sample strings in the scraped format
  string1 <- "53.2k Followers, 11 Following, 1,396 Posts"
  string2 <- "5m Followers, 83 Following, 1.1m Posts"
  # Digits, optionally followed by "," or "." and more digits
  pattern <- "[0-9]{1,5}([,.][0-9]{1,4})?"
  info <- str_extract_all(string1, pattern)
  info2 <- str_extract_all(string2, pattern)
  info
  info2
\s stands for “whitespace character”. Exactly which characters it includes depends on the regex flavor, but in the common flavors it includes [ \t\r\n\f]. That is: \s matches a space, a tab, a carriage return, a line feed, or a form feed.
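As a rough sketch of how \s could be applied to the question, the original number pattern can be extended with one trailing character class that accepts either a lowercase letter or whitespace. The variable names here are illustrative, and perl = TRUE is a choice made for this sketch rather than something the question requires.

```r
# Sample strings from the question.
x <- c("53.2k Followers, 11 Following, 1,396 Posts",
       "5m Followers, 83 Following, 1.1m Posts")

# The question's number pattern plus exactly one trailing character that is
# either a lowercase letter or whitespace (\s, written "\\s" in an R string).
pattern <- "[0-9]{1,5}([,.][0-9]{1,4})?[a-z\\s]"

# perl = TRUE selects the PCRE engine (an assumption of this sketch).
with_next_char <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

with_next_char[[1]]  # "53.2k" "11 "   "1,396 "
with_next_char[[2]]  # "5m"    "83 "   "1.1m"
```

Note that plain numbers keep their trailing space with this approach, so trimws() may be needed afterwards.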
To match a character that has special meaning in a regex, escape it with a backslash ( \ ). E.g., \. matches "." ; \+ matches "+" ; and \( matches "(" . To match a literal backslash, use \\ . In an R string literal each backslash must itself be doubled, which is why the pattern below writes \\. for a literal dot.
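A small base R illustration of the escaping rule, using made-up test strings: an unescaped dot matches any character, while an escaped dot matches only a literal ".".

```r
# Unescaped "." is a wildcard, so "1.1" also matches "1x1".
dot_any <- grepl("1.1", "1x1")

# Escaped dot: the R string "1\\.1" holds the regex 1\.1,
# which matches only a literal dot between the two digits.
dot_literal <- grepl("1\\.1", c("1x1", "1.1"))

# fixed = TRUE skips regex interpretation entirely.
dot_fixed <- grepl("1.1", c("1x1", "1.1"), fixed = TRUE)

# The R string "\\." contains two characters: a backslash and a dot.
escaped_length <- nchar("\\.")

dot_any         # TRUE
dot_literal     # FALSE TRUE
dot_fixed       # FALSE TRUE
escaped_length  # 2
```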
I would suggest the following regex pattern:
[0-9]{1,3}(?:,[0-9]{3})*(?:\\.[0-9]+)?[A-Za-z]*
This pattern returns each number together with its unit suffix (such as k or m) when one is present; it does not capture trailing whitespace. Here is an explanation:
[0-9]{1,3}      match 1 to 3 initial digits
(?:,[0-9]{3})*  followed by zero or more thousands groups
(?:\\.[0-9]+)?  followed by an optional decimal component
[A-Za-z]*       followed by an optional text unit
I tend to lean towards base R solutions whenever possible, and here is one using gregexpr and regmatches:
txt <- "53.2k Followers, 11 Following, 1,396 Posts"
m <- gregexpr("[0-9]{1,3}(?:,[0-9]{3})*(?:\\.[0-9]+)?[A-Za-z]*", txt)
regmatches(txt, m)
[[1]]
[1] "53.2k" "11"    "1,396"
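The same approach extends to several strings at once, since gregexpr and regmatches are vectorized. The following is a sketch only: the column names followers, following and posts are hypothetical labels based on the sample text, and perl = TRUE is an added choice that the call above does not use.

```r
# Both sample strings from the question.
strings <- c("53.2k Followers, 11 Following, 1,396 Posts",
             "5m Followers, 83 Following, 1.1m Posts")

pattern <- "[0-9]{1,3}(?:,[0-9]{3})*(?:\\.[0-9]+)?[A-Za-z]*"

# One character vector of matches per input string.
matches <- regmatches(strings, gregexpr(pattern, strings, perl = TRUE))

# Stack the matches into a matrix, one row per string.
# This assumes every string yields exactly three matches.
counts <- do.call(rbind, matches)
colnames(counts) <- c("followers", "following", "posts")  # hypothetical labels

counts
```

If a scraped string can contain a different number of numeric fields, the rbind step would need a check on lengths(matches) first.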