Parse complex text file in R

I want to parse a text file in R and load it as a data.frame. I have a long text file of fixed-width data, separated into sections (ID) and subsections (SUB). The length of each section varies. I'd like to create two data frames, one for the ID sections and one for the SUB sections. The example data is as follows:

Header 1
METRIC       0.30    10.00
ID      K0107050 Aa
    0.06    15.24    14.40    14.40     7.13     0.13     0.19  1
    0.17    14.35    13.57    13.57     6.40     0.12     0.18  1

SUB
    1.000   1.000  0.093  0.11  0.11 301
    1.000   1.000  0.093  0.11  0.11  61
ID      K0129050 Aa
    0.06    26.35    24.90    24.90    10.88     0.62     0.88  1
    0.15    25.35    23.96    23.96    10.93     0.55     0.74  1

SUB
    3.000   3.000  0.506  0.53  0.53 102
    4.000   4.000  0.514  0.55  0.55 118

The dataframe(s) I would like are:

DF1

Header 1    K0107050    Aa    0.06    15.24    14.40    14.40     7.13     0.13     0.19  1
Header 1    K0107050    Aa    0.17    14.35    13.57    13.57     6.40     0.12     0.18  1
Header 1    K0129050    Aa    0.06    26.35    24.90    24.90    10.88     0.62     0.88  1
Header 1    K0129050    Aa    0.15    25.35    23.96    23.96    10.93     0.55     0.74  1

DF2

Header 1    K0107050    Aa  1.000   1.000  0.093  0.11  0.11 301
Header 1    K0107050    Aa  1.000   1.000  0.093  0.11  0.11  61
Header 1    K0129050    Aa  3.000   3.000  0.506  0.53  0.53 102
Header 1    K0129050    Aa  4.000   4.000  0.514  0.55  0.55 118

I've gotten as far as using readLines() but get stuck after that, given the different sections in the text file. Thank you.

user2325155 asked Dec 01 '25 06:12


1 Answer

Here is a start (sorry, time for bed...):

x <- readLines("myFile.txt")

library(dplyr)

bind_rows(
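  # Start a new chunk at every line containing a section keyword.
  # grepl() is case-sensitive, so "Metric" does not match "METRIC" and
  # that line stays inside the "Header 1" chunk.
  lapply(split(x, cumsum(grepl("Header|Metric|ID|SUB", x))), function(i){
    i1 <- i[ i != "" ]                            # drop blank lines
    # Split the rows after the first on single spaces; anything that is
    # not a number becomes NA and is dropped by na.omit() below
    nums <- unlist(strsplit(tail(i1, -1), " "))
    res <- cbind.data.frame(Grp = i1[1],          # first line labels the chunk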
                            matrix(na.omit(as.numeric(nums)),
                                   nrow = length(i1) - 1, byrow = TRUE),
                            stringsAsFactors = FALSE)

    res
  })
)

#                   Grp    1     2      3     4     5      6    7  8
# 1            Header 1 0.30 10.00     NA    NA    NA     NA   NA NA
# 2 ID      K0107050 Aa 0.06 15.24 14.400 14.40  7.13   0.13 0.19  1
# 3 ID      K0107050 Aa 0.17 14.35 13.570 13.57  6.40   0.12 0.18  1
# 4                 SUB 1.00  1.00  0.093  0.11  0.11 301.00   NA NA
# 5                 SUB 1.00  1.00  0.093  0.11  0.11  61.00   NA NA
# 6 ID      K0129050 Aa 0.06 26.35 24.900 24.90 10.88   0.62 0.88  1
# 7 ID      K0129050 Aa 0.15 25.35 23.960 23.96 10.93   0.55 0.74  1
# 8                 SUB 3.00  3.00  0.506  0.53  0.53 102.00   NA NA
# 9                 SUB 4.00  4.00  0.514  0.55  0.55 118.00   NA NA
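
One possible way to carry this start through to the two data frames asked for in the question, as a rough base-R sketch rather than a definitive solution. It inlines the sample data so it runs on its own. The helper names `carry` and `build`, and the column names `Header`, `ID` and `Type`, are made up for illustration. It assumes the file begins with a "Header" line, that data rows are indented, and that every data row within a section type has the same number of fields.

```r
# Sample input from the question, inlined so the sketch is self-contained;
# in practice this would be x <- readLines("myFile.txt").
x <- c(
  "Header 1",
  "METRIC       0.30    10.00",
  "ID      K0107050 Aa",
  "    0.06    15.24    14.40    14.40     7.13     0.13     0.19  1",
  "    0.17    14.35    13.57    13.57     6.40     0.12     0.18  1",
  "",
  "SUB",
  "    1.000   1.000  0.093  0.11  0.11 301",
  "    1.000   1.000  0.093  0.11  0.11  61",
  "ID      K0129050 Aa",
  "    0.06    26.35    24.90    24.90    10.88     0.62     0.88  1",
  "    0.15    25.35    23.96    23.96    10.93     0.55     0.74  1",
  "",
  "SUB",
  "    3.000   3.000  0.506  0.53  0.53 102",
  "    4.000   4.000  0.514  0.55  0.55 118"
)

x <- x[trimws(x) != ""]                     # drop blank lines

# Classify every remaining line.
is_header <- grepl("^Header", x)
is_id     <- grepl("^ID\\s", x)
is_sub    <- grepl("^SUB\\s*$", x)
is_data   <- grepl("^\\s+[-0-9.]", x)        # indented numeric rows

# Carry the most recent flagged line forward onto every later line
# (NA for lines that come before the first flagged line).
carry <- function(flag) c(NA, x[flag])[cumsum(flag) + 1]
header  <- carry(is_header)
id_line <- carry(is_id)

# Record whether each line sits under an "ID" or a "SUB" marker.
is_marker <- is_id | is_sub
section <- c(NA, ifelse(is_sub[is_marker], "SUB", "ID"))[cumsum(is_marker) + 1]

# Build one data frame from the data rows of a given section type.
build <- function(type) {
  rows <- which(is_data & section == type)   # which() also drops NA
  vals <- do.call(rbind, lapply(strsplit(trimws(x[rows]), "\\s+"), as.numeric))
  ids  <- do.call(rbind, strsplit(id_line[rows], "\\s+"))  # "ID", code, type
  data.frame(Header = header[rows], ID = ids[, 2], Type = ids[, 3],
             vals, stringsAsFactors = FALSE)
}

df1 <- build("ID")    # rows listed directly under each ID line
df2 <- build("SUB")   # rows listed under each SUB line
```

The numeric columns get the default names X1, X2, ... because the file carries no column labels. The METRIC line is ignored here, since it does not appear in the desired output.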
zx8754 answered Dec 03 '25 23:12


