Divide too long text in similar chunks considering punctuation

Question

I have a list of strings, each of which must be no longer than X characters. Each string can contain several sentences (separated by punctuation such as periods). I need to split any string longer than X characters using this logic:

Each one has to be divided into the minimum number of parts (starting from 2) such that every chunk stays within the X-character limit and the chunks are as similar in length as possible (ideally identical), while respecting punctuation. For example, Hello. How are you? must not be divided into Hello. Ho and w are you? but into Hello. and How are you?, because that is the closest to two equal parts that can be reached without losing the sense of the sentences.

max_length = 10
strings = ["Hello. How are you? I'm fine", "other string containing dots", "another string containing dots"]
for string in strings:
    if len(string) > max_length:
        pass  # algorithm to chunk it goes here

In this case, the first string, Hello. How are you? I'm fine, has to be divided into 3 parts, because with 2 parts one of the chunks would be longer than the 10-character maximum.

Is there a smart existing solution? Or does anyone know how to do that?

Ori Yarden · Accepted Answer

An example function that chunks strings at punctuation (e.g. ".", ",", ";", "?") while keeping each chunk between a minimum and a maximum number of characters; in other words, it prioritizes punctuation over character length:

def chunkingStringFunction(strings, charactersDefiningChunking = [".", ",", ";", "?"], numberOfMaximumCharactersPerChunk = None, numberOfMinimumCharactersPerChunk = None, **kwargs):
  if numberOfMaximumCharactersPerChunk is None:
    numberOfMaximumCharactersPerChunk = 100
  if numberOfMinimumCharactersPerChunk is None:
    numberOfMinimumCharactersPerChunk = 2
  storingChunksOfString = []
  for string in strings:
    chunkingStartingAtThisIndex = 0
    indexingCharactersInStrings = 0
    while indexingCharactersInStrings < len(string) - 1:
      indexingCharactersInStrings += 1
      # Candidate chunk: everything from the last cut up to the current character.
      currentChunk = string[chunkingStartingAtThisIndex:indexingCharactersInStrings + 1]
      if len(currentChunk) >= numberOfMinimumCharactersPerChunk and len(currentChunk) <= numberOfMaximumCharactersPerChunk:
        # find() returns -1 for an absent character, so this is the largest first-occurrence position among the chunking characters.
        indexesForStops = max(currentChunk.find(thisCharacter) for thisCharacter in charactersDefiningChunking) + chunkingStartingAtThisIndex
        addChunk = string[chunkingStartingAtThisIndex:indexesForStops + 1]
        if len(addChunk) > 1 and addChunk != " ":
          storingChunksOfString.append(addChunk)
          chunkingStartingAtThisIndex = indexesForStops + 1
          indexingCharactersInStrings = chunkingStartingAtThisIndex
    # Keep trailing text that was never cut instead of dropping it (it may exceed the maximum length).
    remainder = string[chunkingStartingAtThisIndex:]
    if remainder.strip():
      storingChunksOfString.append(remainder)
  return storingChunksOfString
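
The punctuation-first idea can also be sketched more compactly with a regular expression. This is only an illustration, not the function above: the name greedy_punctuation_chunks and its parameters are made up for this example, pieces are re-joined with a single space (so the original whitespace is normalized), and a piece that is longer than max_length on its own is kept as one oversized chunk rather than cut mid-sentence.

```python
import re

def greedy_punctuation_chunks(text, max_length=100, punctuation=".,;?"):
    """Pack punctuation-delimited pieces into chunks of at most max_length characters.

    Hypothetical helper for illustration; it never cuts inside a piece.
    """
    marks = re.escape(punctuation)
    # A piece is text ending in one or more punctuation marks, or trailing text without any.
    raw_pieces = re.findall(rf"[^{marks}]*[{marks}]+|[^{marks}]+", text)
    pieces = [piece.strip() for piece in raw_pieces if piece.strip()]

    chunks = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if current and len(candidate) > max_length:
            # Adding this piece would overflow, so close the current chunk first.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

print(greedy_punctuation_chunks("Hello. How are you? I'm fine", max_length=20))
# ['Hello. How are you?', "I'm fine"]
```

Because the packing is greedy, the chunks are as long as the limit allows rather than similar in length.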

Alternatively, to prioritize character length: start from a target (average) number of characters per chunk and, from each target position, search outward for one of the defined chunking characters:

def chunkingStringFunction(strings, charactersDefiningChunking = [".", ",", ";", "?"], averageNumberOfCharactersPerChunk = None, **kwargs):
  if averageNumberOfCharactersPerChunk is None:
    averageNumberOfCharactersPerChunk = 10
  storingChunksOfString = []
  for string in strings:
    lastIndexChunked = 0
    for indexingCharactersInString in range(1, len(string), 1):
      chunkStopsAtADefinedCharacter = False
      if indexingCharactersInString - lastIndexChunked == averageNumberOfCharactersPerChunk:
        indexingNumberOfCharactersAwayFromAverageChunk = 1
        while chunkStopsAtADefinedCharacter == False:
          indexingNumberOfCharactersAwayFromAverageChunk += 1
          # Clamp the search window to the unchunked text; a negative slice start would wrap around to the end of the string.
          windowStart = max(lastIndexChunked, indexingCharactersInString - indexingNumberOfCharactersAwayFromAverageChunk)
          windowStop = indexingCharactersInString + indexingNumberOfCharactersAwayFromAverageChunk + 1
          for thisCharacter in charactersDefiningChunking:
            findingAChunkCharacter = string[windowStart:windowStop].find(thisCharacter)
            if findingAChunkCharacter > -1:
              storingChunksOfString.append(string[lastIndexChunked:windowStart + findingAChunkCharacter + 1])
              lastIndexChunked = windowStart + findingAChunkCharacter + 1
              chunkStopsAtADefinedCharacter = True
              break  # one cut per window
          # The window already covers all remaining text and holds no chunking character: stop instead of looping forever.
          if windowStart == lastIndexChunked and windowStop >= len(string):
            break
    # Keep whatever is left after the last cut.
    if lastIndexChunked < len(string):
      storingChunksOfString.append(string[lastIndexChunked:])
  return storingChunksOfString
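
Neither function above looks for the minimum number of similar-sized parts that the question asks for. A minimal brute-force sketch of that logic might look like the following. The names split_at_punctuation and balanced_chunks are hypothetical, "as similar as possible" is interpreted here as the smallest difference between the longest and shortest chunk, and pieces are re-joined with a single space. Trying every cut combination is only practical for strings with a handful of sentences.

```python
import re
from itertools import combinations

def split_at_punctuation(text, punctuation=".,;?!"):
    """Return whitespace-trimmed pieces of text, each ending at punctuation where possible."""
    marks = re.escape(punctuation)
    raw_pieces = re.findall(rf"[^{marks}]*[{marks}]+|[^{marks}]+", text)
    return [piece.strip() for piece in raw_pieces if piece.strip()]

def balanced_chunks(text, max_length, punctuation=".,;?!"):
    """Split text into the fewest chunks of at most max_length characters.

    Among all splits with that fewest count, pick the one whose chunk lengths
    differ the least. Returns None if a single piece is already too long.
    """
    pieces = split_at_punctuation(text, punctuation)
    if not pieces:
        return []
    if any(len(piece) > max_length for piece in pieces):
        return None
    n = len(pieces)
    for k in range(1, n + 1):  # try 1 chunk, then 2, then 3, ...
        best = None
        for cuts in combinations(range(1, n), k - 1):
            bounds = (0, *cuts, n)
            chunks = [" ".join(pieces[a:b]) for a, b in zip(bounds, bounds[1:])]
            lengths = [len(chunk) for chunk in chunks]
            if max(lengths) > max_length:
                continue  # this split breaks the limit
            spread = max(lengths) - min(lengths)
            if best is None or spread < best[0]:
                best = (spread, chunks)
        if best is not None:
            return best[1]
    return None

# "How are you?" alone has 12 characters, so the limit here is 12 rather than 10.
print(balanced_chunks("Hello. How are you? I'm fine", 12))
# ['Hello.', 'How are you?', "I'm fine"]
print(balanced_chunks("Hello. How are you? I'm fine", 20))
# ['Hello. How are you?', "I'm fine"]
```

Returning None when one sentence exceeds the limit leaves it to the caller to decide whether to cut inside that sentence or to raise the limit.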

Tags: python, chunks, maxlength, punctuation, sentence

Paolo Magnani
