I have a data frame that looks like this:
df<-structure(list(string = c(" Thermionic, cold and photo-cathode valves, tubes, and parts .................................. E ....................... 16.3",
" Automatic data processing machines and units thereof ............................................ E ....................... 15.0",
" Parts of and accessories suitable for 751, 752 .......................................................... E ....................... 14.6",
" Optical instruments and apparatus .............................................................................. E ....................... 14.1",
" Perfumery, cosmetics and toilet preparations ............................................................. E ....................... 13.3",
" Silk .................................................................................................................................. A ....................... 13.2",
" Undergarments, knitted or crocheted .......................................................................... B ....................... 13.1",
" Articles of materials described in division 58 ............................................................. D ....................... 13.1"
), id = c("1 ", "2 ", "3 ", "4 ", "5 ", "6 ", "7 ", "8 "), SH3 = c("776 ",
"752 ", "759 ", "871 ", "553 ", "261 ", "846 ", "893 ")), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
# that looks like this
string id SH3
<chr> <chr> <chr>
1 " Thermionic, cold and photo-cathode valves, tubes, and parts .................................. E ....................... 16.3" "1 " "776 "
2 " Automatic data processing machines and units thereof ............................................ E ....................... 15.0" "2 " "752 "
3 " Parts of and accessories suitable for 751, 752 .......................................................... E ....................... 14.6" "3 " "759 "
4 " Optical instruments and apparatus .............................................................................. E ....................... 14.1" "4 " "871 "
5 " Perfumery, cosmetics and toilet preparations ............................................................. E ....................... 13.3" "5 " "553 "
6 " Silk .................................................................................................................................. A ....................... 13.2" "6 " "261 "
7 " Undergarments, knitted or crocheted .......................................................................... B ....................... 13.1" "7 " "846 "
8 " Articles of materials described in division 58 ............................................................. D ....................... 13.1" "8 " "893 "
I would like to split the string variable into three separate variables. string has 3 parts separated by a series of dots (...)
1) The first part consists of some text: eg. in row 1 "Thermionic, cold and photo-cathode valves, tubes, and parts"
2) The second part is the capital letter: eg. in row 1: "E"
3) The last part is a number: e.g. in row 1 is "16.3".
I would like to split my string and to create three variables out of it. The issue is that the number of dots varies for every line. Does anyone know how an efficient way to do it?
An efficient manner to isolate the capital letter (part 2) would be sufficient towards my end.
Thank you very much in advance for your help
You can use a regex that looks for dots [.] of length 2 or more {2,}:
strsplit(df$string, "[.]{2,}")[1:3]
# [[1]]
# [1] " Thermionic, cold and photo-cathode valves, tubes, and parts "
# [2] " E "
# [3] " 16.3"
# [[2]]
# [1] " Automatic data processing machines and units thereof " " E "
# [3] " 15.0"
# [[3]]
# [1] " Parts of and accessories suitable for 751, 752 " " E "
# [3] " 14.6"
With this, you can convert it into a frame with:
data.frame(do.call(rbind, strsplit(df$string, "[.]{2,}")), stringsAsFactors = FALSE)
# X1 X2 X3
# 1 Thermionic, cold and photo-cathode valves, tubes, and parts E 16.3
# 2 Automatic data processing machines and units thereof E 15.0
# 3 Parts of and accessories suitable for 751, 752 E 14.6
# 4 Optical instruments and apparatus E 14.1
# 5 Perfumery, cosmetics and toilet preparations E 13.3
# 6 Silk A 13.2
# 7 Undergarments, knitted or crocheted B 13.1
# 8 Articles of materials described in division 58 D 13.1
You'll want to rename and likely trimws and as.numeric certain columns, as the strsplit did not trim strings.
If all you need is the second column, then
trimws(sapply(strsplit(df$string, "[.]{2,}"), `[[`, 2))
# [1] "E" "E" "E" "E" "E" "A" "B" "D"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With