I need to get the first N sentences from a text where the last char of the sentence can be a period, a colon, or a semicolon. For example, given this text:
Lorem ipsum, dolor sit amet. consectetur adipisicing elit; sed do eiusmod tempor.
incididunt ut labore: et dolore magna aliqua. Ut enim ad. minim veniam.
The first 4 sentences would be:
Lorem ipsum, dolor sit amet. consectetur adipisicing elit; sed do eiusmod tempor.
incididunt ut labore:
Currently, my code splits the string using '.', ':', and ';' as the delimiters and then joins the results.
import re
# raw string, so that \. is not treated as an invalid string escape
sentences = re.split(r'\. |: |;', text)
summary = ' '.join(sentences[:4])
But this removes the delimiters from the result. I'm open to regex or basic string manipulation.
>>> import re
>>> text = "Lorem ipsum, dolor sit amet. consectetur adipisicing elit; sed do eiusmod tempor. incididunt ut labore: et dolore magna aliqua. Ut enim ad. minim veniam."
>>> ' '.join(re.split(r'(?<=[.:;])\s', text)[:4])
'Lorem ipsum, dolor sit amet. consectetur adipisicing elit; sed do eiusmod tempor. incididunt ut labore:'
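If you need this in more than one place, the one-liner above can be wrapped in a small helper. This is only a sketch: the names first_sentences and _BOUNDARY are made up for illustration, and it assumes a sentence ends at '.', ':' or ';' followed by whitespace, exactly as in the question. Passing maxsplit means the text is not split any further than needed.

```python
import re

# Split on whitespace that directly follows '.', ':' or ';'.
# The lookbehind is zero-width, so the punctuation stays attached
# to the sentence instead of being consumed by the split.
_BOUNDARY = re.compile(r'(?<=[.:;])\s+')


def first_sentences(text, n):
    """Return the first n sentences of text, joined by single spaces."""
    if n <= 0:
        return ''
    # maxsplit=n performs at most n splits, leaving any remainder as
    # one trailing piece that the slice below discards.
    parts = _BOUNDARY.split(text.strip(), maxsplit=n)
    return ' '.join(parts[:n])


text = ("Lorem ipsum, dolor sit amet. consectetur adipisicing elit; "
        "sed do eiusmod tempor. incididunt ut labore: et dolore magna "
        "aliqua. Ut enim ad. minim veniam.")
print(first_sentences(text, 4))
```

Note that joining with a single space means any line breaks between sentences in the input are replaced by spaces in the result.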
I know this question asks about using regex to find sentences, but for the same reason that regex is not the right choice for parsing HTML (they are different classes of grammars), regex is an even worse choice for problems that involve natural language.
If your goal is to actually delineate sentences, you need to look at other tools. Personally, I would recommend the Punkt sentence tokenizer provided by NLTK. Below is an example showing why this is a fundamentally better choice than regex for this task.
Punkt knows that the periods in Mr. Smith and Johann S. Bach do not mark
sentence boundaries. And sometimes sentences can start with non-capitalized
words. i is a good variable name.
See http://nltk.org/api/nltk.tokenize.html for more info.
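To make the problem concrete, here is a small sketch that runs the punctuation-based split from the question's domain over the sample text above, using only the standard library. The variable names are just for illustration. The text has three sentences, but the regex has no way to know that "Mr." and "S." are abbreviations, so it cuts there as well.

```python
import re

sample = ("Punkt knows that the periods in Mr. Smith and Johann S. Bach "
          "do not mark sentence boundaries. And sometimes sentences can "
          "start with non-capitalized words. i is a good variable name.")

# Same idea as the lookbehind split: break on whitespace after . : ;
pieces = re.split(r'(?<=[.:;])\s', sample)

for piece in pieces:
    print(repr(piece))
# 'Punkt knows that the periods in Mr.'
# 'Smith and Johann S.'
# 'Bach do not mark sentence boundaries.'
# 'And sometimes sentences can start with non-capitalized words.'
# 'i is a good variable name.'
```

With NLTK installed, the usual entry point is nltk.tokenize.sent_tokenize(sample); it relies on the Punkt model data, which may need to be downloaded separately depending on your NLTK version.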