
Splitting paragraph into sentences

I'm using the following Python code (which I found online a while ago) to split paragraphs into sentences.

def splitParagraphIntoSentences(paragraph):
  import re
  sentenceEnders = re.compile(r"""
      # Split sentences on whitespace between them.
      (?:               # Group for two positive lookbehinds.
        (?<=[.!?])      # Either an end of sentence punct,
      | (?<=[.!?]['"])  # or end of sentence punct and quote.
      )                 # End group of two positive lookbehinds.
      (?<!  Mr\.   )    # Don't end sentence on "Mr."
      (?<!  Mrs\.  )    # Don't end sentence on "Mrs."
      (?<!  Jr\.   )    # Don't end sentence on "Jr."
      (?<!  Dr\.   )    # Don't end sentence on "Dr."
      (?<!  Prof\. )    # Don't end sentence on "Prof."
      (?<!  Sr\.   )    # Don't end sentence on "Sr."
      \s+               # Split on whitespace between sentences.
    """, 
    re.IGNORECASE | re.VERBOSE)
  sentenceList = sentenceEnders.split(paragraph)
  return sentenceList

It works just fine for my purpose, but now I need the exact same regex in JavaScript (to make sure that the outputs are consistent), and I'm struggling to translate this Python regex into one compatible with JavaScript.

chrisvdb asked Nov 20 '25 09:11


1 Answer

This is not a regex you can split on directly, but a kind of workaround:

(?!Mrs?\.|Jr\.|Dr\.|Sr\.|Prof\.)(\b\S+[.?!]["']?)\s


You can replace each matched fragment with, for example, $1# (or any other character that does not occur in the text, instead of #), and then split the result on #. However, it is not a very elegant solution.
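A minimal sketch of that replace-then-split workaround in JavaScript. The function name splitSentences and the default "#" delimiter are illustrative assumptions, not part of the answer. Note that this pattern consumes only a single whitespace character and is case-sensitive, whereas the Python version splits on \s+ with re.IGNORECASE, so the outputs may differ on some inputs.

```javascript
// Sketch of the replace-then-split workaround.
// splitSentences and the "#" default are hypothetical choices for illustration.
function splitSentences(paragraph, delimiter = "#") {
  // The marker must not already occur in the text, or the split is ambiguous.
  if (paragraph.includes(delimiter)) {
    throw new Error("Delimiter occurs in the text; pick another one.");
  }
  // Regex from the answer, with the g flag so every sentence end is replaced.
  const sentenceEnd = /(?!Mrs?\.|Jr\.|Dr\.|Sr\.|Prof\.)(\b\S+[.?!]["']?)\s/g;
  // Keep the captured token, swap the whitespace after it for the marker,
  // then split on the marker.
  return paragraph
    .replace(sentenceEnd, (_match, token) => token + delimiter)
    .split(delimiter);
}

console.log(splitSentences("Dr. Who arrived. Prof. X left."));
// → [ 'Dr. Who arrived.', 'Prof. X left.' ]
```

If "#" can appear in your input, pass a different delimiter as the second argument, for example a control character such as "\u0000".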

m.cekiera answered Nov 21 '25 23:11


