
Extract chapters from text

Tags:

regex

dataframe

r

Similar to my question here, I want to extract character sequences from a string with a regex in R. Specifically, I want to extract sections from a text document into a data frame in which each sub-section gets its own row, for further text mining. Text under a third-level heading (such as 1.1.1) should stay part of its parent section. This is my sample data:

chapter_one <- c("One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.
1 Introduction
He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections. 
1.1 Futher
The bedding was hardly able to cover it and seemed ready to slide off any moment. 
1.1.1 This Should be Part of One Point One
His many legs, pitifully thin compared with the size of the rest of him, waved about helplessly as he looked.
1.2 Futher Fuhter
'What's happened to me?' he thought. It wasn't a dream. His room, a proper human room although a little too small, lay peacefully between its four familiar walls.")

This is my expected output:

chapter_id <- (c("1 Introduction", "1.1 Futher", "1.2 Futher Fuhter")) 
text <- (c("He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.", "The bedding was hardly able to cover it and seemed ready to slide off any moment. His many legs, pitifully thin compared with the size of the rest of him, waved about helplessly as he looked.", "'What's happened to me?' he thought. It wasn't a dream. His room, a proper human room although a little too small, lay peacefully between its four familiar walls."))

chapter_one_df <- data.frame(chapter_id, text)

What I tried so far is something like this:

library(stringr)

regex_chapter_heading <- regex("
          [:digit:]     # Digit number 
                        # MISSING: Optional dot and optional second digit number 
          \\s           # Space
          ([[:alpha:]]) # Alphabetic characters (MISSING: can also contain punctuation, as in 'Introduction - A short introduction')
                     ", comments = TRUE)

read.table(text=gsub(regex_chapter_heading,"\\1:",chapter_one),sep=":")

So far, this does not produce the expected output - because, as indicated, parts of the Regex are still missing. Any help is highly appreciated!

Paul-Simon asked Dec 18 '25 13:12


1 Answer

You may try the following approach: 1) replace every line that starts with three or more dot-separated numbers (together with the line breaks around it) with a space, since these lines are continuations of the preceding section, and 2) extract the sections with a pattern that treats a number plus an optional dot and number as a heading, capturing each heading line and the lines that follow it into separate capturing groups:

library(stringr)
# Replace lines starting with N.N.N+ with space
chapter_one <- gsub("\\R\\d+(?:\\.\\d+){2,}\\s+[A-Z].*\\R?", " ", chapter_one, perl=TRUE)
# Split into IDs and Texts
data <- str_match_all(chapter_one, "(?sm)^(\\d+(?:\\.\\d+)?\\s+[A-Z][^\r\n]*)\\R(.*?)(?=\\R\\d+(?:\\.\\d+)?\\s+[A-Z]|\\z)")
# Get the chapter ID column
chapter_id <- trimws(data[[1]][,2])
# Get the text column
text <- trimws(data[[1]][,3])
# Create the target DF
chapter_one_df <- data.frame(chapter_id, text)

Output:

         chapter_id
1    1 Introduction
2        1.1 Futher
3 1.2 Futher Fuhter
                                                                                                                                                                                              text
1                                       He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.
2 The bedding was hardly able to cover it and seemed ready to slide off any moment.  His many legs, pitifully thin compared with the size of the rest of him, waved about helplessly as he looked.
3                               'What's happened to me?' he thought. It wasn't a dream. His room, a proper human room although a little too small, lay peacefully between its four familiar walls.

The \R\d+(?:\.\d+){2,}\s+[A-Z].*\R? pattern is used to replace the lines you want to "exclude" with a space:

  • \R - line break
  • \d+ - 1+ digits
  • (?:\.\d+){2,} - two or more repetitions of . and 1+ digits
  • \s+ - 1+ whitespaces (replace with \h to match a single horizontal whitespace, or \h+ to match one or more of them)
  • [A-Z] - an uppercase letter
  • .* - any 0+ chars other than line break chars, as many as possible, up to the end of the line
  • \R? - an optional line break char sequence.
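
To see the first pattern in isolation, here is a minimal sketch on a made-up input (the string `x` and its headings are illustrative only, not taken from the question). It uses base R `gsub` with `perl = TRUE`, as in the code above:

```r
# Made-up input: an N.N heading, a nested N.N.N heading, then another N.N heading
x <- "1.1 Further\nFirst sentence.\n1.1.1 Nested Heading\nSecond sentence.\n1.2 Next\nThird sentence."

# Replace the N.N.N line, together with the line breaks around it, with one space
cleaned <- gsub("\\R\\d+(?:\\.\\d+){2,}\\s+[A-Z].*\\R?", " ", x, perl = TRUE)

cat(cleaned)
```

Only the `1.1.1` line is removed here, because `{2,}` requires at least two `.digits` groups after the first number; the `1.1` and `1.2` headings stay in place, and the text that followed the nested heading is joined onto the previous line.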

The second regex is rather complex:

(?sm)^(\d+(?:\.\d+)?\s+[A-Z][^\r\n]*)\R(.*?)(?=\R\d+(?:\.\d+)?\s+[A-Z]|\z)

See the regex demo.

Details

  • (?sm) - s makes . match any char, including line break chars, and m makes ^ match the start of a line
  • ^ - start of a line
  • (\d+(?:\.\d+)?\s+[A-Z][^\r\n]*) - Group 1: one or more digits, then 1 or 0 repetitions of . and 1+ digits, 1+ whitespaces, an uppercase letter, any 0+ chars other than CR and LF symbols, as many as possible
  • \R - line break
  • (.*?) - Group 2: any 0+ chars, as few as possible, up to the first position where the lookahead that follows matches
  • (?=\R\d+(?:\.\d+)?\s+[A-Z]|\z) - a positive lookahead that requires, without consuming any text, that one of the following comes next:
    • \R\d+(?:\.\d+)?\s+[A-Z] - line break, one or more digits, then 1 or 0 repetitions of . and 1+ digits, 1+ whitespaces, an uppercase letter
    • | - or
    • \z - end of string.
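
If you would rather not depend on stringr, the same pattern should also work with base R's PCRE engine. The following is only a sketch under that assumption: `extract_sections` is a hypothetical helper name, the input `x` is made up, and the capture groups are read from the `capture.start` and `capture.length` attributes that `gregexpr` returns when `perl = TRUE`:

```r
# Hypothetical helper: split a text into heading / body pairs with base R only
extract_sections <- function(x) {
  pattern <- "(?sm)^(\\d+(?:\\.\\d+)?\\s+[A-Z][^\r\n]*)\\R(.*?)(?=\\R\\d+(?:\\.\\d+)?\\s+[A-Z]|\\z)"
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  # No heading found: return an empty data frame with the same columns
  if (m[1] == -1) {
    return(data.frame(chapter_id = character(0), text = character(0),
                      stringsAsFactors = FALSE))
  }
  starts  <- attr(m, "capture.start")   # one row per match, one column per group
  lengths <- attr(m, "capture.length")
  # Cut out capture group i of every match and trim surrounding whitespace
  grab <- function(i) {
    trimws(substring(x, starts[, i], starts[, i] + lengths[, i] - 1))
  }
  data.frame(chapter_id = grab(1), text = grab(2), stringsAsFactors = FALSE)
}

# Made-up input with a preamble and three headings
x <- "Preamble.\n1 Introduction\nIntro text.\n1.1 Further\nMore text.\n1.2 Last\nFinal text."
res <- extract_sections(x)
print(res)
```

The preamble is skipped because it does not start with a heading number, and each body stops right before the next heading thanks to the lookahead. Lines starting with N.N.N should still be removed first, as in step 1.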
Wiktor Stribiżew answered Dec 21 '25 01:12