I'm fairly new to R. I have a character vector containing the following:
> head(sampleVector)
[1] "| txt01 |   100 |         200 |       123.456 |           0.12345 |"
[2] "| txt02 |   300 |         400 |       789.012 |           0.06789 |"
I want to split each line into separate pieces, with one data value per piece.
I want to get a list, resultList, that would eventually print as follows:
> head(resultList)
[[1]]
[1] ""   "txt01"    "100"       "200"     "123.456"        "0.12345" 
[[2]]
[1] ""   "txt02"    "300"       "400"     "789.012"        "0.06789"
I am struggling with the strsplit() regex notation. This is the code I have so far:
resultList  <- strsplit(sampleVector,"\\s+[|] | [|]\\s+ | [\\s+]")
# would give me the following output
# [[1]]
# [1] "| txt01"    "100"       "200"     "123.456"        "0.12345 |" 
Is there any way I can get this output with a single strsplit call? I am guessing my notation for matching the delimiter plus whitespace is wrong. Any help would be appreciated.
Another strsplit option which I nearly missed:
strsplit(test,"[| ]+")
#[[1]]
#[1] ""        "txt01"   "100"     "200"     "123.456" "0.12345"
# 
#[[2]]
#[1] ""        "txt02"   "300"     "400"     "789.012" "0.06789"
...and my original answer because regmatches is my favourite function of late:
regmatches(test,gregexpr("[^| ]+",test))
#[[1]]
#[1] "txt01"   "100"     "200"     "123.456" "0.12345"
#
#[[2]]
#[1] "txt02"   "300"     "400"     "789.012" "0.06789"
To break it down as requested:
[| ]+ is a regex searching for single or repeated instances (+) of a space or a pipe (|).
[^| ]+ is a regex searching for single or repeated instances (+) of any character that is not (^) a space or a pipe (|).
gregexpr finds all the instances of this pattern and returns the start locations and lengths of the matches.
regmatches extracts all the substrings from test that are matched by gregexpr.
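As a rough illustration of what gregexpr hands to regmatches, here is a minimal sketch. The variable x is a hypothetical one-element input copied from the question's first row, and starts, lens and pieces are names made up for this example.

```r
# Hypothetical input: the first row from the question
x <- "| txt01 |   100 |         200 |       123.456 |           0.12345 |"

# gregexpr returns a list with one element per input string; each element
# holds the match start positions, and the match lengths are stored in the
# "match.length" attribute
m <- gregexpr("[^| ]+", x)
starts <- as.integer(m[[1]])
lens   <- attr(m[[1]], "match.length")

# Cutting the string at those positions by hand gives the same pieces
# that regmatches extracts
pieces <- substring(x, starts, starts + lens - 1)
identical(pieces, regmatches(x, m)[[1]])
```

Because the pattern only matches the runs of non-delimiter characters, there is no leading empty string here, unlike the strsplit results.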
Here's one way. This first removes the | characters from the vector with gsub, then uses strsplit to split on runs of one or more whitespace characters. Probably a bit easier that way.
strsplit(gsub("|", "", sampleVector, fixed=TRUE), "\\s+")
# [[1]]
# [1] ""        "txt01"   "100"     "200"     "123.456" "0.12345"
#
# [[2]]
# [1] ""        "txt02"   "300"     "400"     "789.012" "0.06789"
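To see the two steps separately, here is a small sketch on a single string. The names x, no_pipes, parts and clean are made up for this example, and the nzchar filter is an optional extra step that is not part of the answer above; it only applies if the leading empty string is unwanted.

```r
# Hypothetical input: the second row from the question
x <- "| txt02 |   300 |         400 |       789.012 |           0.06789 |"

# Step 1: drop the literal pipes; fixed = TRUE treats "|" as plain text,
# not as the regex alternation operator
no_pipes <- gsub("|", "", x, fixed = TRUE)

# Step 2: split on runs of whitespace; the string now starts with a space,
# which is why the first piece is an empty string
parts <- strsplit(no_pipes, "\\s+")[[1]]

# Optional: keep only the non-empty pieces
clean <- parts[nzchar(parts)]
```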
Here's an interesting alternative using scan that might be useful, and will probably be quite fast.
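lapply(sampleVector, function(y) {
    # read the fields between the pipes as character strings
    s <- scan(text = y, what = character(), sep = "|", quiet = TRUE)
    # strip the padding whitespace, then drop the last field
    g <- gsub("\\s+", "", s)
    g[-length(g)]
})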
# [[1]]
# [1] ""        "txt01"   "100"     "200"     "123.456" "0.12345"
#
# [[2]]
# [1] ""        "txt02"   "300"     "400"     "789.012" "0.06789"