Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to separate and extract numerical values from 1 column in a dataframe?

Tags:

dataframe

r

I am new to working with data frames and R. I am looking for a way to manipulate and extract information from one of the columns. See below for an example data frame:

enter image description here

Column 3 "Info" contains AF, GF, and DT. I need the number from AF and the number after the comma in GF. Then I want to divide the number from GF by the number from AF to get a new variable XX which I would want to incorporate back into the DF as a new column. Maybe the visual below will help clarify what I need.

From A, I would take 20/10 =2, B would be 20/20 =1, C would be 7/10 = 0.7 etc. I want to generate a new column using these values called XX and add it to the original data frame:

enter image description here

I was approaching my issue more manually, like I said I am not familiar with R.

    library(dplyr)  
    library(tidyr)  
    library(stringr)  
    DF1 = data.frame(ID=c(A,B,C,D,E), TagNo=c(10,20,20,10,50),    Info=c(AF=10;GF=5,20;DT=10,AF=20;GF=4,20;DT=30,AF=10;GF=3,7;DT=40,AF=5;GF=2,15;DT=8,AF=2;GF=0,6;DT=10),Year=c(2022,2023,2023,2022,2022)  
    DF1_col3 = separate(DF1,c(1),into =c('AF', 'GF', 'DT'),sep=';')  

Once I do this, it starts getting really messy. My original file has many more columns and milIions of rows. I appreciate any ideas.

like image 446
user20330606 Avatar asked Dec 13 '25 11:12

user20330606


2 Answers

strcapture with regex capture groups and some basic vectorised operations for the final calculation:

DF1$XX <- with(strcapture("AF=(\\d+).+GF=.+,(\\d+)", 
                          DF1$Info, proto=list(x=1L,y=1L)), y/x)

#  ID TagNo                Info Year  XX
#1  A    10 AF=10;GF=5,20;DT=10 2022 2.0
#2  B    20 AF=20;GF=4,20;DT=30 2023 1.0
#3  C    20  AF=10;GF=3,7;DT=40 2023 0.7
#4  D    10   AF=5;GF=2,15;DT=8 2022 3.0
#5  E    50   AF=2;GF=0,6;DT=10 2022 3.0
like image 55
thelatemail Avatar answered Dec 16 '25 23:12

thelatemail


We can extract all numbers from info into a numeric vector we can subset with brackets, then do the calculations with elements #1 and #3. Must be a rowwise() operation

library(dplyr)
library(stringr)
df |>
    rowwise() |>
    mutate(xx = str_extract_all(Info,
                                '\\d+',
                                simplify = TRUE
                                ) |>
               as.numeric() |>
               {\(x) (x[3])/(x[1])}()
           ) |>
    ungroup()

# A tibble: 5 × 5
  ID    TagNo Info                 Year    xx
  <chr> <dbl> <chr>               <dbl> <dbl>
1 A        10 AF=10;GF=5,20;DT=10  2022   2  
2 B        20 AF=20;GF=4,20;DT=30  2023   1  
3 C        20 AF=10;GF=3,7;DT=40   2023   0.7
4 D        10 AF=5;GF=2,15;DT=8    2022   3  
5 E        50 AF=2;GF=0,6;DT=10    2022   3  

Alternatively, we may extract the values into proper variables, which may facilitate other downstream analyses. I recommend using the new tidyr::separate_wider_regex for that, as in the example below. This should be faster in large datasets, because the mutate() operation is vectorized (much faster than rowwise())

DF1 |>
    separate_wider_regex(cols = Info,
                         patterns = c('AF\\=',
                                      AF = '\\d+',
                                      ';',
                                      'GF\\=',
                                      GF = '\\d+',
                                      ',',
                                      other_number = '\\d+',
                                      ';',
                                      'DT\\=',
                                      DT = '\\d+'
                                      ),
                            
                         ) |>
    type.convert(as.is = TRUE) |>
    mutate(xx = other_number / AF)

# A tibble: 5 × 8
  ID    TagNo    AF    GF other_number    DT  Year    xx
  <chr> <dbl> <dbl> <dbl>        <dbl> <dbl> <dbl> <dbl>
1 A        10    10     5           20    10  2022   2  
2 B        20    20     4           20    30  2023   1  
3 C        20    10     3            7    40  2023   0.7
4 D        10     5     2           15     8  2022   3  
5 E        50     2     0            6    10  2022   3  

data

structure(list(ID = c("A", "B", "C", "D", "E"), TagNo = c(10, 
20, 20, 10, 50), Info = c("AF=10;GF=5,20;DT=10", "AF=20;GF=4,20;DT=30", 
"AF=10;GF=3,7;DT=40", "AF=5;GF=2,15;DT=8", "AF=2;GF=0,6;DT=10"
), Year = c(2022, 2023, 2023, 2022, 2022)), class = "data.frame", row.names = c(NA, 
-5L))

  ID TagNo                Info Year
1  A    10 AF=10;GF=5,20;DT=10 2022
2  B    20 AF=20;GF=4,20;DT=30 2023
3  C    20  AF=10;GF=3,7;DT=40 2023
4  D    10   AF=5;GF=2,15;DT=8 2022
5  E    50   AF=2;GF=0,6;DT=10 2022
like image 30
GuedesBF Avatar answered Dec 16 '25 21:12

GuedesBF