Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Split long string in three variables

I have a data frame that looks like this:

df<-structure(list(string = c(" Thermionic, cold and photo-cathode valves, tubes, and parts .................................. E ....................... 16.3", 
" Automatic data processing machines and units thereof ............................................ E ....................... 15.0", 
" Parts of and accessories suitable for 751, 752 .......................................................... E ....................... 14.6", 
" Optical instruments and apparatus .............................................................................. E ....................... 14.1", 
" Perfumery, cosmetics and toilet preparations ............................................................. E ....................... 13.3", 
" Silk .................................................................................................................................. A ....................... 13.2", 
" Undergarments, knitted or crocheted .......................................................................... B ....................... 13.1", 
" Articles of materials described in division 58 ............................................................. D ....................... 13.1"
), id = c("1 ", "2 ", "3 ", "4 ", "5 ", "6 ", "7 ", "8 "), SH3 = c("776 ", 
"752 ", "759 ", "871 ", "553 ", "261 ", "846 ", "893 ")), row.names = c(NA, 
-8L), class = c("tbl_df", "tbl", "data.frame"))


# that looks like this

  string                                                                                                                                                                    id    SH3   
  <chr>                                                                                                                                                                     <chr> <chr> 
1 " Thermionic, cold and photo-cathode valves, tubes, and parts .................................. E ....................... 16.3"                                          "1 "  "776 "
2 " Automatic data processing machines and units thereof ............................................ E ....................... 15.0"                                       "2 "  "752 "
3 " Parts of and accessories suitable for 751, 752 .......................................................... E ....................... 14.6"                               "3 "  "759 "
4 " Optical instruments and apparatus .............................................................................. E ....................... 14.1"                        "4 "  "871 "
5 " Perfumery, cosmetics and toilet preparations ............................................................. E ....................... 13.3"                              "5 "  "553 "
6 " Silk .................................................................................................................................. A ....................... 13.2" "6 "  "261 "
7 " Undergarments, knitted or crocheted .......................................................................... B ....................... 13.1"                          "7 "  "846 "
8 " Articles of materials described in division 58 ............................................................. D ....................... 13.1"                            "8 "  "893 "

I would like to split the string variable into three separate variables. string has 3 parts separated by a series of dots (...)

1) The first part consists of some text: eg. in row 1 "Thermionic, cold and photo-cathode valves, tubes, and parts"

2) The second part is the capital letter: eg. in row 1: "E"

3) The last part is a number: e.g. in row 1 is "16.3".

I would like to split my string and to create three variables out of it. The issue is that the number of dots varies for every line. Does anyone know how an efficient way to do it?

An efficient manner to isolate the capital letter (part 2) would be sufficient towards my end.

Thank you very much in advance for your help

like image 638
Alex Avatar asked Mar 24 '26 14:03

Alex


1 Answers

You can use a regex that looks for dots [.] of length 2 or more {2,}:

strsplit(df$string, "[.]{2,}")[1:3]
# [[1]]
# [1] " Thermionic, cold and photo-cathode valves, tubes, and parts "
# [2] " E "                                                          
# [3] " 16.3"                                                        
# [[2]]
# [1] " Automatic data processing machines and units thereof " " E "                                                   
# [3] " 15.0"                                                 
# [[3]]
# [1] " Parts of and accessories suitable for 751, 752 " " E "                                             
# [3] " 14.6"                                           

With this, you can convert it into a frame with:

data.frame(do.call(rbind, strsplit(df$string, "[.]{2,}")), stringsAsFactors = FALSE)
#                                                              X1  X2    X3
# 1  Thermionic, cold and photo-cathode valves, tubes, and parts   E   16.3
# 2         Automatic data processing machines and units thereof   E   15.0
# 3               Parts of and accessories suitable for 751, 752   E   14.6
# 4                            Optical instruments and apparatus   E   14.1
# 5                 Perfumery, cosmetics and toilet preparations   E   13.3
# 6                                                         Silk   A   13.2
# 7                          Undergarments, knitted or crocheted   B   13.1
# 8               Articles of materials described in division 58   D   13.1

You'll want to rename and likely trimws and as.numeric certain columns, as the strsplit did not trim strings.

If all you need is the second column, then

trimws(sapply(strsplit(df$string, "[.]{2,}"), `[[`, 2))
# [1] "E" "E" "E" "E" "E" "A" "B" "D"
like image 171
r2evans Avatar answered Mar 26 '26 11:03

r2evans