
How to get the first N sentences from text?

I need to get the first N sentences from a text, where a sentence can end with a period, a colon, or a semicolon. For example, given this text:

Lorem ipsum, dolor sit amet. consectetur adipisicing elit; sed do eiusmod tempor.
incididunt ut labore: et dolore magna aliqua. Ut enim ad. minim veniam.

The first 4 sentences would be:

Lorem ipsum, dolor sit amet. consectetur adipisicing elit; sed do eiusmod tempor.
incididunt ut labore:

Currently, my code splits the string using ., :, and ; as delimiters and then joins the results.

import re
# re.split consumes whatever the pattern matches, so the delimiters are lost
sentences = re.split(r'\. |: |;', text)
summary = ' '.join(sentences[:4])

But this removes the delimiters from the result. I'm open to regex or basic string manipulation.

flowfree asked Sep 07 '25 05:09



2 Answers

>>> import re
>>> text = "Lorem ipsum, dolor sit amet. consectetur adipisicing elit; sed do eiusmod tempor. incididunt ut labore: et dolore magna aliqua. Ut enim ad. minim veniam."
>>> ' '.join(re.split(r'(?<=[.:;])\s', text)[:4])
'Lorem ipsum, dolor sit amet. consectetur adipisicing elit; sed do eiusmod tempor. incididunt ut labore:'
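The one-liner works because `(?<=[.:;])` is a zero-width lookbehind: the split happens on the whitespace that follows a terminator, so the terminator itself is never consumed. Below is a minimal sketch that wraps the same idea in a reusable function. The name `first_sentences` and the constant `_BOUNDARY` are made up for illustration. It uses `\s+` instead of `\s` so that a run of spaces or a newline counts as one boundary, and it joins the pieces with a single space, which means the original line breaks are not preserved.

```python
import re

# Whitespace that directly follows '.', ':' or ';'. The lookbehind is
# zero-width, so the terminator stays attached to its sentence.
_BOUNDARY = re.compile(r'(?<=[.:;])\s+')


def first_sentences(text, n):
    """Return the first n sentences of text, terminators included."""
    if n <= 0:
        return ''
    # maxsplit=n stops scanning once n sentences have been cut off;
    # anything left over lands in the final element and is discarded.
    parts = _BOUNDARY.split(text.strip(), maxsplit=n)
    return ' '.join(parts[:n])


text = (
    "Lorem ipsum, dolor sit amet. consectetur adipisicing elit; "
    "sed do eiusmod tempor.\n"
    "incididunt ut labore: et dolore magna aliqua. Ut enim ad. minim veniam."
)
print(first_sentences(text, 4))
```

If the text has fewer than `n` sentences, the whole text comes back with its whitespace normalized at the boundaries.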
jamylak answered Sep 09 '25 16:09



I know this question asks about using regex to find sentences, but for the same reason that regex is the wrong choice for parsing HTML (different classes of grammars), it is an even worse choice for problems that involve natural language.

If your goal is to actually delineate sentences, you need other tools. Personally, I would recommend the Punkt sentence tokenizer provided by NLTK. The sample text below shows the kind of input that makes it a fundamentally better choice than regex for this task.

Punkt knows that the periods in Mr. Smith and Johann S. Bach do not mark 
sentence boundaries.  And sometimes sentences can start with non-capitalized 
words.  i is a good variable name.

See http://nltk.org/api/nltk.tokenize.html for more info.
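To make the difference concrete, here is a rough sketch that runs the sample passage through a delimiter regex and, when available, through Punkt. The helper names `regex_sentences` and `punkt_sentences` are hypothetical. `nltk.sent_tokenize` is NLTK's convenience wrapper around the Punkt model; it needs NLTK installed and the Punkt model data downloaded separately, so the sketch returns `None` instead of failing when either is missing.

```python
import re

SAMPLE = (
    "Punkt knows that the periods in Mr. Smith and Johann S. Bach do not mark "
    "sentence boundaries.  And sometimes sentences can start with "
    "non-capitalized words.  i is a good variable name."
)


def regex_sentences(text):
    """Naive split on whitespace that follows '.', ':' or ';'."""
    return re.split(r'(?<=[.:;])\s+', text.strip())


def punkt_sentences(text):
    """Split with NLTK's Punkt model; None if NLTK or its data is missing."""
    try:
        import nltk
        return nltk.sent_tokenize(text)
    except (ImportError, LookupError):
        # ImportError: NLTK is not installed.
        # LookupError: the Punkt model data has not been downloaded.
        return None


regex_result = regex_sentences(SAMPLE)
# The regex also cuts after "Mr." and "S.", so it yields 5 fragments.
print(len(regex_result), regex_result[:2])

punkt_result = punkt_sentences(SAMPLE)
if punkt_result is not None:
    print(len(punkt_result), punkt_result)
```

The regex has no way to tell an abbreviation from a sentence end, which is exactly the case the passage describes. With Punkt, taking the first N sentences is then just a slice of the returned list.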

Slater Victoroff answered Sep 09 '25 15:09
