I am trying to parse tab-delimited data, which has been saved as a text file with extraneous data. I would like this to be an R data.table/data.frame.
The tab-delimited format is the following:
A 1092 - 1093 + 1X
B 1093 HRDCPMRFYT
A 1093 + 1094 - 1X
B 1094 BSZSDFJRVF
A 1094 + 1095 + 1X
B 1095 SSTFCLEPVV
...
There are only two types of rows, A and B. A consistently has 5 columns, e.g. for the first row,
1092 - 1093 + 1X
B consistently has two columns:
1093 HRDCPMRFYT
Question: How do you parse a file with "alternating" rows with different formats?
Let's say that this was a text file which was only of this format, alternating rows of A and B, with 5 columns and 2 columns respectively. How do you parse this into an R data.table? My idea how be to create the following format:
1092 - 1093 + 1X 1093 HRDCPMRFYT
1093 + 1094 - 1X 1094 BSZSDFJRVF
1094 + 1095 + 1X 1095 SSTFCLEPVV
...
You can run shell commands using fread. In Win10, you can even run some linux utilities such as sed
Hence, you can simply do
fread("sed '$!N;s/\\n/ /' test.tab")
# V1 V2 V3 V4 V5 V6
# 1: 1092 - 1093 + 1X 1093 HRDCPMRFYT
# 2: 1093 + 1094 - 1X 1094 BSZSDFJRVF
# 3: 1094 + 1095 + 1X 1095 SSTFCLEPVV
(sed syntax taken from here)
Data
text <- "1092 - 1093 + 1X
1093 HRDCPMRFYT
1093 + 1094 - 1X
1094 BSZSDFJRVF
1094 + 1095 + 1X
1095 SSTFCLEPVV"
# Saving it as tab separated file on disk
write(gsub(" +", "\t", text), file = "test.tab")
One way to go is to read in your data with readLines, pull out the bits you want, and pass to read.table to form the dataframe. So if the rows are alternating then:
txt <-
'1092 - 1093 + 1X
1093 HRDCPMRFYT
1093 + 1094 - 1X
1094 BSZSDFJRVF
1094 + 1095 + 1X
1095 SSTFCLEPVV'
rd <- readLines(textConnection(txt))
data.frame(read.table(text=rd[c(TRUE, FALSE)]),
read.table(text=rd[c(FALSE, TRUE)]))
Change textConnection(txt) to your file path
Another way is to read in only once and then post-process
r <- read.table(text=txt, fill=TRUE, stringsAsFactors=FALSE, na.strings = "")
d <- cbind(r[c(TRUE, FALSE),], r[c(FALSE, TRUE),])
d[ colSums(is.na(d)) < nrow(d)]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With