Split long string in three variables

Question

I have a data frame that looks like this:

df<-structure(list(string = c(" Thermionic, cold and photo-cathode valves, tubes, and parts .................................. E ....................... 16.3", 
" Automatic data processing machines and units thereof ............................................ E ....................... 15.0", 
" Parts of and accessories suitable for 751, 752 .......................................................... E ....................... 14.6", 
" Optical instruments and apparatus .............................................................................. E ....................... 14.1", 
" Perfumery, cosmetics and toilet preparations ............................................................. E ....................... 13.3", 
" Silk .................................................................................................................................. A ....................... 13.2", 
" Undergarments, knitted or crocheted .......................................................................... B ....................... 13.1", 
" Articles of materials described in division 58 ............................................................. D ....................... 13.1"
), id = c("1 ", "2 ", "3 ", "4 ", "5 ", "6 ", "7 ", "8 "), SH3 = c("776 ", 
"752 ", "759 ", "871 ", "553 ", "261 ", "846 ", "893 ")), row.names = c(NA, 
-8L), class = c("tbl_df", "tbl", "data.frame"))


# that looks like this

  string                                                                                                                                                                    id    SH3   
  <chr>                                                                                                                                                                     <chr> <chr> 
1 " Thermionic, cold and photo-cathode valves, tubes, and parts .................................. E ....................... 16.3"                                          "1 "  "776 "
2 " Automatic data processing machines and units thereof ............................................ E ....................... 15.0"                                       "2 "  "752 "
3 " Parts of and accessories suitable for 751, 752 .......................................................... E ....................... 14.6"                               "3 "  "759 "
4 " Optical instruments and apparatus .............................................................................. E ....................... 14.1"                        "4 "  "871 "
5 " Perfumery, cosmetics and toilet preparations ............................................................. E ....................... 13.3"                              "5 "  "553 "
6 " Silk .................................................................................................................................. A ....................... 13.2" "6 "  "261 "
7 " Undergarments, knitted or crocheted .......................................................................... B ....................... 13.1"                          "7 "  "846 "
8 " Articles of materials described in division 58 ............................................................. D ....................... 13.1"                            "8 "  "893 "

I would like to split the string variable into three separate variables. string has 3 parts separated by a series of dots (...)

1) The first part consists of some text: eg. in row 1 "Thermionic, cold and photo-cathode valves, tubes, and parts"

2) The second part is the capital letter: eg. in row 1: "E"

3) The last part is a number: e.g. in row 1 is "16.3".

I would like to split my string and to create three variables out of it. The issue is that the number of dots varies for every line. Does anyone know how an efficient way to do it?

An efficient manner to isolate the capital letter (part 2) would be sufficient towards my end.

Thank you very much in advance for your help

r2evans · Accepted Answer

You can use a regex that looks for dots [.] of length 2 or more {2,}:

strsplit(df$string, "[.]{2,}")[1:3]
# [[1]]
# [1] " Thermionic, cold and photo-cathode valves, tubes, and parts "
# [2] " E "                                                          
# [3] " 16.3"                                                        
# [[2]]
# [1] " Automatic data processing machines and units thereof " " E "                                                   
# [3] " 15.0"                                                 
# [[3]]
# [1] " Parts of and accessories suitable for 751, 752 " " E "                                             
# [3] " 14.6"

With this, you can convert it into a frame with:

data.frame(do.call(rbind, strsplit(df$string, "[.]{2,}")), stringsAsFactors = FALSE)
#                                                              X1  X2    X3
# 1  Thermionic, cold and photo-cathode valves, tubes, and parts   E   16.3
# 2         Automatic data processing machines and units thereof   E   15.0
# 3               Parts of and accessories suitable for 751, 752   E   14.6
# 4                            Optical instruments and apparatus   E   14.1
# 5                 Perfumery, cosmetics and toilet preparations   E   13.3
# 6                                                         Silk   A   13.2
# 7                          Undergarments, knitted or crocheted   B   13.1
# 8               Articles of materials described in division 58   D   13.1

You'll want to rename and likely trimws and as.numeric certain columns, as the strsplit did not trim strings.

If all you need is the second column, then

trimws(sapply(strsplit(df$string, "[.]{2,}"), `[[`, 2))
# [1] "E" "E" "E" "E" "E" "A" "B" "D"

Split long string in three variables

Tags:

string

split

r

data-manipulation

Alex

1 Answers

r2evans

Recent Activity

Donate For Us

Split long string in three variables

Tags:

string

split

r

data-manipulation

Alex

1 Answers

r2evans

Related questions

Recent Activity

Donate For Us