
Create new column based on last 2 digits of values in another column

Tags: string, regex, r

This should be simple enough, but it has become a difficult issue to solve. I have data that are grouped by their trailing decimals (a product of an upstream data source). For example, a value in group "3" looks like 0.00003, while a value in group "10" looks like 24.00010. However, when I run either my regexpr code or my str_sub code, R seems to ignore the trailing 0.


Example Data

df <- data.frame(a = c(0.00003, 0.00010, 24.00003, 24.00010))

print(df)
         a
1  0.00003
2  0.00010
3 24.00003
4 24.00010

Desired Output

         a   group
1  0.00003 group03
2  0.00010 group10
3 24.00003 group03
4 24.00010 group10

Failed Attempt 1

df %>% mutate(group = paste0("group", regmatches(a, regexpr("(\\d{2}$)", a))))         
         a   group
1  0.00003 group03
2  0.00010 group01
3 24.00003 group03
4 24.00010 group01

This failure is peculiar because the same pattern, (\d{2}$), works when I test it on https://regexr.com/.


Failed Attempt 2

df %>% mutate(group = paste0("group", str_sub(a, start = -2)))
         a   group
1  0.00003 group03
2  0.00010 group01
3 24.00003 group03
4 24.00010 group01
TheSciGuy asked Nov 29 '25 16:11


2 Answers

The key here is that when you substring or extract with a regex, R first converts the number to a string. That string, however, does not keep the format you are expecting: trailing zeros are dropped, and small values can come out in scientific notation.

library(tidyverse)

tibble(a = c(0.00003, 0.00010, 24.00003, 24.00010)) %>%
  mutate(group1 = paste0("group", str_extract(sprintf("%.5f", a), "\\d{2}$")),
         group2 = paste0("group", str_extract(a, "\\d{2}$")),
         sprint_char = sprintf("%.5f", a),
         char = as.character(a))
#> # A tibble: 4 x 5
#>          a group1  group2  sprint_char char    
#>      <dbl> <chr>   <chr>   <chr>       <chr>   
#> 1  0.00003 group03 group05 0.00003     3e-05   
#> 2  0.0001  group10 group04 0.00010     1e-04   
#> 3 24.0     group03 group03 24.00003    24.00003
#> 4 24.0     group10 group01 24.00010    24.0001

Note that as.character(a) does not preserve the way a is displayed: the trailing zero is gone and small values appear in scientific notation. Instead, fix the format with sprintf and then extract the digits you want.
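To see the same idea without any packages, here is a minimal base R sketch. It assumes every value carries exactly five decimal places, as in the question's example data; the variable names are only illustrative.

```r
# Base R only; the vector mirrors the example data from the question.
a <- c(0.00003, 0.00010, 24.00003, 24.00010)

# Default conversion keeps only the digits needed to represent each value,
# so the trailing zero is not preserved.
default_chr <- as.character(a)

# Fixed formatting with exactly five decimal places keeps the trailing zero.
# (Assumption: five decimals is the right width for this data.)
fixed_chr <- sprintf("%.5f", a)

# Take the last two characters of each fixed-format string.
n <- nchar(fixed_chr)
group <- paste0("group", substr(fixed_chr, n - 1, n))

# Compare the two string forms side by side.
print(data.frame(fixed = fixed_chr, default = default_chr, group = group))
```

Comparing the fixed and default columns shows where the last digit goes missing before any regex or substring is applied.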

AndS. answered Dec 02 '25 05:12


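We can format the number as a fixed-decimal string with sprintf and take the last two characters with str_sub. Also, make sure the options are set so that R favors fixed over scientific notation: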

options(scipen = 999)
library(stringr)
library(dplyr)
df %>% 
   mutate(group = paste0("group", str_sub(sprintf("%2.5f", a), start = -2)))
#        a   group
#1  0.00003 group03
#2  0.00010 group10
#3 24.00003 group03
#4 24.00010 group10
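
As a possible alternative that avoids string conversion altogether, the group code could be computed arithmetically. This is only a sketch and not part of the answer above; it assumes the values are non-negative and have at most five decimal places, and the names shifted and code are illustrative.

```r
# Same example data as in the question.
a <- c(0.00003, 0.00010, 24.00003, 24.00010)

# Shift the five decimal places into the integer part. round() absorbs
# floating-point error, since a * 1e5 is not guaranteed to be a whole number.
# (Assumption: at most five decimal places, non-negative values.)
shifted <- round(a * 1e5)

# The last two digits of the shifted value are the group code.
code <- as.integer(shifted %% 100)

# %02d zero-pads to two digits, so 3 becomes "03".
group <- sprintf("group%02d", code)
print(group)
```

The trade-off is that the number of decimal places is hard-coded in the multiplier, just as it is hard-coded in the sprintf format.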
akrun answered Dec 02 '25 05:12


