How to get the first N sentences from text?

Question

I need to get the first N sentences from a text where the last char of the sentence can be a period, a colon, or a semicolon. For example, given this text:

Lorem ipsum, dolor sit amet. consectetur adipisicing elit; sed do eiusmod tempor.
incididunt ut labore: et dolore magna aliqua. Ut enim ad. minim veniam.

The first 4 sentences would be:

Lorem ipsum, dolor sit amet. consectetur adipisicing elit; sed do eiusmod tempor.
incididunt ut labore:

Currently, my code splits the string using ., :, and ; as delimiters and then joins the results.

import re
sentences = re.split(r'\. |: |;', text)
summary = ' '.join(sentences[:4])

But this removes the delimiters from the result. I'm open to regex or basic string manipulation.

jamylak · Accepted Answer

>>> import re
>>> text = "Lorem ipsum, dolor sit amet. consectetur adipisicing elit; sed do eiusmod tempor. incididunt ut labore: et dolore magna aliqua. Ut enim ad. minim veniam."
>>> ' '.join(re.split(r'(?<=[.:;])\s', text)[:4])
'Lorem ipsum, dolor sit amet. consectetur adipisicing elit; sed do eiusmod tempor. incididunt ut labore:'
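The lookbehind `(?<=[.:;])` only asserts that a delimiter precedes the whitespace, so the split consumes the whitespace and leaves the delimiter attached to its sentence. Here is a minimal sketch of the same idea as a reusable function. The name `first_sentences` is hypothetical, not part of the answer, and it slices the original string instead of re-joining with a space, which is an assumption about what you want: it keeps the newline from the question's sample text intact.

```python
import re

# Whitespace that directly follows a sentence-ending delimiter.
# The lookbehind keeps the delimiter out of the match itself.
_BOUNDARY = re.compile(r'(?<=[.:;])\s+')


def first_sentences(text, n):
    """Return the first n sentences of text, delimiters included.

    Hypothetical helper: a "sentence" here is only what the question
    defines, i.e. text ending in '.', ':' or ';' followed by whitespace.
    """
    if n <= 0:
        return ''
    for count, match in enumerate(_BOUNDARY.finditer(text), start=1):
        if count == n:
            # Cut just before the whitespace that ends the nth sentence.
            return text[:match.start()]
    # Fewer than n boundaries: return everything that is there.
    return text.strip()


text = ("Lorem ipsum, dolor sit amet. consectetur adipisicing elit; "
        "sed do eiusmod tempor.\nincididunt ut labore: et dolore magna "
        "aliqua. Ut enim ad. minim veniam.")
print(first_sentences(text, 4))
```

If you would rather normalize all whitespace to single spaces, the one-liner with `' '.join(...)` above does that instead.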

Slater Victoroff · Answer

I know this question asks about using regex to find sentences, but for the same reason that regex is the wrong tool for parsing HTML (the two belong to different classes of grammars), regex is an even worse choice for problems that involve natural language.

If your goal is to actually delineate sentences, you need to look at other tools. Personally, I would recommend the Punkt sentence tokenizer provided by NLTK. The sample text below shows why this is a fundamentally better choice than regex for this task:

Punkt knows that the periods in Mr. Smith and Johann S. Bach do not mark
sentence boundaries.  And sometimes sentences can start with non-capitalized
words.  i is a good variable name.

See http://nltk.org/api/nltk.tokenize.html for more info.
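To make the failure mode concrete, here is a small sketch that runs a delimiter-based regex split (the same lookbehind idea as the accepted answer, with `\s+` so the double spaces are handled) over the sample text above. The variable names are just for illustration. The text contains three sentences, but the regex cannot tell an abbreviation's period from a sentence-ending one:

```python
import re

sample = ("Punkt knows that the periods in Mr. Smith and Johann S. Bach "
          "do not mark sentence boundaries.  And sometimes sentences "
          "can start with non-capitalized words.  i is a good variable name.")

# Split on whitespace that follows '.', ':' or ';'.
pieces = re.split(r'(?<=[.:;])\s+', sample)

for piece in pieces:
    print(repr(piece))

# The regex yields 5 pieces for 3 real sentences, because it also
# breaks after "Mr." and after the initial "S.".
print(len(pieces))
```

With NLTK installed and its Punkt model downloaded, `nltk.tokenize.sent_tokenize(sample)` is the usual entry point for the tokenizer recommended here. It is not run in this sketch, so no output is shown for it.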

Tags: python, string, regex

flowfree