Split overly long text into similarly sized chunks at punctuation marks

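I have a list of strings, none of which may be longer than X characters. Each string can contain several sentences (separated by punctuation such as periods). Any string longer than X characters has to be split using the following logic:

It must be split into the smallest possible number of parts (starting from 2) such that every chunk is shorter than X and the chunks are as similar in length as possible (ideally identical), while respecting punctuation. For example, given Hello. How are you?, I cannot cut at an arbitrary position inside a sentence; it has to be split into Hello. and How are you?, because that is the closest to two equal parts that does not break the meaning of the sentences.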

max_len = 10  # named max_len so that the built-in max() is not shadowed
strings = ["Hello. How are you? I'm fine", "other string containing dots", "another string containing dots"]
for string in strings:
    if len(string) > max_len:
        pass  # algorithm to chunk it

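In this case I would have to split the first string, Hello. How are you? I'm fine, into 3 parts, because with only 2 parts one of the chunks would be longer than 10 characters (the maximum).
Is there a clever existing solution? Or does anyone know how to do this?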
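
As a quick check of the arithmetic in the example above, the sketch below computes the smallest number of parts that the length alone allows, and then looks at the individual sentences. The names `max_len`, `lower_bound` and `too_long` are illustrative, and treating ".", "?" and "!" followed by whitespace as the only sentence boundaries is an assumption.

```python
import math
import re

text = "Hello. How are you? I'm fine"
max_len = 10

# Length alone gives a lower bound on the number of parts: ceil(28 / 10) = 3.
lower_bound = math.ceil(len(text) / max_len)

# Assumed sentence boundaries: whitespace that follows ".", "?" or "!".
sentences = [s for s in re.split(r"(?<=[.?!])\s+", text) if s]

# Sentences that are longer than the limit on their own.
too_long = [s for s in sentences if len(s) > max_len]

print(lower_bound)  # 3
print(sentences)    # ['Hello.', 'How are you?', "I'm fine"]
print(too_long)     # ['How are you?']
```

Note that 3 is only a lower bound. "How are you?" is 12 characters by itself, so for this particular string a limit of 10 cannot be met by cutting at sentence boundaries alone; a real implementation has to decide what to do in that case (raise the limit, cut inside the sentence, or report an error).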

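Maybe https://www.nltk.org/ can help you. Apart from that, you can start by writing a function that cuts a given string at specific punctuation marks, checks whether the pieces can be fitted within the size limit, and returns a list or some other result. - Copperfield
1 Answer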


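An example function that chunks strings at punctuation marks (e.g. ".", ",", ";", "?") while keeping each chunk between a minimum and a maximum number of characters; in other words, punctuation takes priority over character length: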

import numpy as np
def chunkingStringFunction(strings, charactersDefiningChunking = [".", ",", ";", "?"], numberOfMaximumCharactersPerChunk = None, numberOfMinimumCharactersPerChunk = None, **kwargs):
  # Defaults: chunks of 2 to 100 characters
  if numberOfMaximumCharactersPerChunk is None:
    numberOfMaximumCharactersPerChunk = 100
  if numberOfMinimumCharactersPerChunk is None:
    numberOfMinimumCharactersPerChunk = 2
  storingChunksOfString = []
  for string in strings:
    chunkingStartingAtThisIndex = 0
    indexingCharactersInStrings = 0
    # Grow the candidate chunk one character at a time
    while indexingCharactersInStrings < len(string) - 1:
      indexingCharactersInStrings += 1
      currentChunk = string[chunkingStartingAtThisIndex:indexingCharactersInStrings + 1]
      if len(currentChunk) >= numberOfMinimumCharactersPerChunk and len(currentChunk) <= numberOfMaximumCharactersPerChunk:
        # First occurrence of each chunk-defining character in the candidate,
        # as an index into the full string (find returns -1 when the character is absent)
        indexesForStops = []
        for indexingCharacterDefiningChunking in range(len(charactersDefiningChunking)):
          indexesForStops.append(currentChunk.find(charactersDefiningChunking[indexingCharacterDefiningChunking]) + chunkingStartingAtThisIndex)
        # Cut at the furthest of those positions
        indexesForStops = np.max(indexesForStops, axis = None)
        addChunk = string[chunkingStartingAtThisIndex:indexesForStops + 1]
        # Skip empty or single-character chunks
        if len(addChunk) > 1 and addChunk != " ":
          storingChunksOfString.append(addChunk)
          chunkingStartingAtThisIndex = indexesForStops + 1
          indexingCharactersInStrings = chunkingStartingAtThisIndex
  # Text after the last chunk-defining character of a string is not returned
  return storingChunksOfString

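Alternatively, character length can be given priority: aim for an average number of characters per chunk and look for the nearest chunk-defining character around that position.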

def chunkingStringFunction(strings, charactersDefiningChunking = [".", ",", ";", "?"], averageNumberOfCharactersPerChunk = None, **kwargs):
  if averageNumberOfCharactersPerChunk is None:
    averageNumberOfCharactersPerChunk = 10
  storingChunksOfString = []
  for string in strings:
    lastIndexChunked = 0
    for indexingCharactersInString in range(1, len(string), 1):
      chunkStopsAtADefinedCharacter = False
      if indexingCharactersInString - lastIndexChunked == averageNumberOfCharactersPerChunk:
        # The current chunk has reached the average length: search an ever wider
        # window around this index for a chunk-defining character
        indexingNumberOfCharactersAwayFromAverageChunk = 1
        # The bound on the offset ends the search once the window covers the whole string,
        # so a string without any chunk-defining character does not loop forever
        while chunkStopsAtADefinedCharacter == False and indexingNumberOfCharactersAwayFromAverageChunk < len(string):
          indexingNumberOfCharactersAwayFromAverageChunk += 1
          # Clamp at 0 so that a negative start does not wrap around to the end of the string
          windowStart = max(indexingCharactersInString - indexingNumberOfCharactersAwayFromAverageChunk, 0)
          for thisCharacter in charactersDefiningChunking:
            findingAChunkCharacter = string[windowStart:indexingCharactersInString + (indexingNumberOfCharactersAwayFromAverageChunk + 1)].find(thisCharacter)
            if findingAChunkCharacter > -1 and len(string[lastIndexChunked:windowStart + findingAChunkCharacter + 1]) != 0:
              storingChunksOfString.append(string[lastIndexChunked:windowStart + findingAChunkCharacter + 1])
              lastIndexChunked = windowStart + findingAChunkCharacter + 1
              chunkStopsAtADefinedCharacter = True
      elif indexingCharactersInString == len(string) - 1 and lastIndexChunked != len(string) - 1 and len(string[lastIndexChunked:indexingCharactersInString + 1]) != 0:
        # End of the string: keep whatever is left after the last cut
        storingChunksOfString.append(string[lastIndexChunked:indexingCharactersInString + 1])
  return storingChunksOfString

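Thanks! I tried both. The average-based function works well, although the max-based function (the first one you provided) would probably suit my purpose better. Unfortunately, with that one each slice is exactly 1 sentence long (from one period to the next), so it may need some improvement, but it is a good starting point! - Paolo Magnani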
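
One possible way to get chunks that hold more than one sentence, and that follow the rule from the question (fewest parts first, then lengths as similar as possible), is sketched below. It is not taken from the answer. The names `split_sentences` and `balanced_chunks` are hypothetical, and three things are assumptions: measuring "similar" as the difference between the longest and the shortest chunk, re-joining sentences with a single space, and raising an error when one sentence alone exceeds the limit. The search is brute force, so it is only meant for strings with a handful of sentences.

```python
import re
from itertools import combinations

def split_sentences(text):
    """Cut after '.', ',', ';', '?' or '!' and drop surrounding whitespace."""
    parts = re.split(r"(?<=[.,;?!])\s*", text)
    return [p.strip() for p in parts if p.strip()]

def balanced_chunks(text, max_len):
    """Return the fewest chunks of whole sentences, each at most max_len
    characters long, preferring the most even lengths among equal counts."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    if any(len(s) > max_len for s in sentences):
        # Assumption: a sentence that cannot fit is reported, not cut.
        raise ValueError("a single sentence is longer than max_len")
    n = len(sentences)
    for k in range(1, n + 1):  # try the smallest number of chunks first
        best = None
        # Choose k - 1 cut positions between the n sentences.
        for cuts in combinations(range(1, n), k - 1):
            bounds = (0, *cuts, n)
            chunks = [" ".join(sentences[a:b]) for a, b in zip(bounds, bounds[1:])]
            lengths = [len(c) for c in chunks]
            if max(lengths) > max_len:
                continue
            spread = max(lengths) - min(lengths)
            if best is None or spread < best[0]:
                best = (spread, chunks)
        if best is not None:
            return best[1]

print(balanced_chunks("Hello. How are you?", 15))
# ['Hello.', 'How are you?']
print(balanced_chunks("Hello. How are you? I'm fine", 20))
# ['Hello. How are you?', "I'm fine"]
```

Because every sentence is checked against `max_len` up front, putting each sentence in its own chunk is always a valid fallback, so the loop over `k` is guaranteed to return. For long texts the number of cut combinations grows quickly, and a dynamic-programming search over cut positions would be a better fit.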

Content provided by Stack Overflow.