I am new to working with data frames and R. I am looking for a way to manipulate and extract information from one of the columns. See below for an example data frame:

Column 3 "Info" contains AF, GF, and DT. I need the number from AF and the number after the comma in GF. Then I want to divide the number from GF by the number from AF to get a new variable XX which I would want to incorporate back into the DF as a new column. Maybe the visual below will help clarify what I need.
From A, I would take 20/10 =2, B would be 20/20 =1, C would be 7/10 = 0.7 etc. I want to generate a new column using these values called XX and add it to the original data frame:

I was approaching my issue more manually, like I said I am not familiar with R.
library(dplyr)
library(tidyr)
library(stringr)
DF1 = data.frame(ID=c(A,B,C,D,E), TagNo=c(10,20,20,10,50), Info=c(AF=10;GF=5,20;DT=10,AF=20;GF=4,20;DT=30,AF=10;GF=3,7;DT=40,AF=5;GF=2,15;DT=8,AF=2;GF=0,6;DT=10),Year=c(2022,2023,2023,2022,2022)
DF1_col3 = separate(DF1,c(1),into =c('AF', 'GF', 'DT'),sep=';')
Once I do this, it starts getting really messy. My original file has many more columns and milIions of rows. I appreciate any ideas.
strcapture with regex capture groups and some basic vectorised operations for the final calculation:
DF1$XX <- with(strcapture("AF=(\\d+).+GF=.+,(\\d+)",
DF1$Info, proto=list(x=1L,y=1L)), y/x)
# ID TagNo Info Year XX
#1 A 10 AF=10;GF=5,20;DT=10 2022 2.0
#2 B 20 AF=20;GF=4,20;DT=30 2023 1.0
#3 C 20 AF=10;GF=3,7;DT=40 2023 0.7
#4 D 10 AF=5;GF=2,15;DT=8 2022 3.0
#5 E 50 AF=2;GF=0,6;DT=10 2022 3.0
We can extract all numbers from info into a numeric vector we can subset with brackets, then do the calculations with elements #1 and #3. Must be a rowwise() operation
library(dplyr)
library(stringr)
df |>
rowwise() |>
mutate(xx = str_extract_all(Info,
'\\d+',
simplify = TRUE
) |>
as.numeric() |>
{\(x) (x[3])/(x[1])}()
) |>
ungroup()
# A tibble: 5 × 5
ID TagNo Info Year xx
<chr> <dbl> <chr> <dbl> <dbl>
1 A 10 AF=10;GF=5,20;DT=10 2022 2
2 B 20 AF=20;GF=4,20;DT=30 2023 1
3 C 20 AF=10;GF=3,7;DT=40 2023 0.7
4 D 10 AF=5;GF=2,15;DT=8 2022 3
5 E 50 AF=2;GF=0,6;DT=10 2022 3
Alternatively, we may extract the values into proper variables, which may facilitate other downstream analyses. I recommend using the new tidyr::separate_wider_regex for that, as in the example below. This should be faster in large datasets, because the mutate() operation is vectorized (much faster than rowwise())
DF1 |>
separate_wider_regex(cols = Info,
patterns = c('AF\\=',
AF = '\\d+',
';',
'GF\\=',
GF = '\\d+',
',',
other_number = '\\d+',
';',
'DT\\=',
DT = '\\d+'
),
) |>
type.convert(as.is = TRUE) |>
mutate(xx = other_number / AF)
# A tibble: 5 × 8
ID TagNo AF GF other_number DT Year xx
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 10 10 5 20 10 2022 2
2 B 20 20 4 20 30 2023 1
3 C 20 10 3 7 40 2023 0.7
4 D 10 5 2 15 8 2022 3
5 E 50 2 0 6 10 2022 3
data
structure(list(ID = c("A", "B", "C", "D", "E"), TagNo = c(10,
20, 20, 10, 50), Info = c("AF=10;GF=5,20;DT=10", "AF=20;GF=4,20;DT=30",
"AF=10;GF=3,7;DT=40", "AF=5;GF=2,15;DT=8", "AF=2;GF=0,6;DT=10"
), Year = c(2022, 2023, 2023, 2022, 2022)), class = "data.frame", row.names = c(NA,
-5L))
ID TagNo Info Year
1 A 10 AF=10;GF=5,20;DT=10 2022
2 B 20 AF=20;GF=4,20;DT=30 2023
3 C 20 AF=10;GF=3,7;DT=40 2023
4 D 10 AF=5;GF=2,15;DT=8 2022
5 E 50 AF=2;GF=0,6;DT=10 2022
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With