Merging or reversing n-grams to a single string

Question

3

How do I merge the bigrams below into a single string?

_bigrams=['the school', 'school boy', 'boy is', 'is reading']
_split=(' '.join(_bigrams)).split()
_newstr=[]
_filter=[_newstr.append(x) for x in _split if x not in _newstr]
_newstr=' '.join(_newstr)
print(_newstr)

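Output: 'the school boy is reading'. This is the desired output, but the approach is too long-winded and inefficient given how large my data is. It also cannot handle repeated words in the final string, e.g. 'the school boy is reading, is he?': only one 'is' would survive in the result.

Any suggestions on how to do this better? Thanks.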
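The repeated-word limitation can be reproduced with a small sketch. The input list here is hypothetical (it is not from the original data) and the loop is an equivalent spelling of the snippet above, not a different method:

```python
# Hypothetical bigrams for the sentence "the school boy is reading is he"
bigrams = ['the school', 'school boy', 'boy is', 'is reading',
           'reading is', 'is he']

words = ' '.join(bigrams).split()

seen = []
for word in words:
    # "not in" scans the whole list each time, so the loop is quadratic,
    # and any word that appeared earlier is dropped, not only overlaps.
    if word not in seen:
        seen.append(word)

result = ' '.join(seen)
print(result)  # the second "is" is lost: the school boy is reading he
```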

- Tiger1

3 Answers

2

If you really want a one-liner, something like this could work:

' '.join(val.split()[0] for val in (_bigrams)) + ' ' +  _bigrams[-1].split()[-1]
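A minimal sketch of the same idea wrapped in a function, assuming every entry is a two-word bigram and consecutive entries overlap by one word. The name `join_bigrams` and the empty-list guard are illustrative additions, not part of the answer:

```python
def join_bigrams(bigrams):
    """Rebuild a sentence from consecutive, overlapping bigrams (hypothetical helper)."""
    if not bigrams:
        # The bare one-liner would raise IndexError on an empty list.
        return ''
    # First word of every bigram, then the second word of the last one.
    firsts = [bigram.split()[0] for bigram in bigrams]
    return ' '.join(firsts + [bigrams[-1].split()[-1]])


print(join_bigrams(['the school', 'school boy', 'boy is', 'is reading']))
print(join_bigrams(['boy is', 'is reading', 'reading is', 'is he']))
```

Because it never compares words, this approach keeps repeated words such as the second 'is', but it silently produces wrong output if the bigrams are not consecutive.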

- M4rtini

1

Would this do it? It simply takes the first word of each entry, up to the last entry, which is kept whole.

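_bigrams = ['the school', 'school boy', 'boy is', 'is reading']

# First word of every bigram except the last; the last bigram is kept whole.
# Slicing by position avoids misfiring when an earlier bigram equals the last one.
clause = [a.split()[0] for a in _bigrams[:-1]] + _bigrams[-1:]

print(' '.join(clause))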

Output

the school boy is reading

When it comes to performance, however, Amber's solution might be a good option.

- embert

- Amber · Accepted Answer

# Multi-for generator expression allows us to create a flat iterable of words
all_words = (word for bigram in _bigrams for word in bigram.split())

def no_runs_of_words(words):
    """Takes an iterable of words and returns one with any runs condensed."""
    prev_word = None
    for word in words:
        if word != prev_word:
            yield word
        prev_word = word

final_string = ' '.join(no_runs_of_words(all_words))

This uses generators for lazy evaluation, so the full set of words is not held in memory until the final string is built.
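For comparison, the run-collapsing step can also be sketched with `itertools.groupby` from the standard library, which groups consecutive equal items. This is a swapped-in alternative to `no_runs_of_words`, not the answer's own code, and `merge_bigrams` is a hypothetical name:

```python
from itertools import groupby


def merge_bigrams(bigrams):
    """Flatten bigrams into words and collapse runs of identical neighbours."""
    all_words = (word for bigram in bigrams for word in bigram.split())
    # groupby yields one (key, group) pair per run of equal consecutive words.
    return ' '.join(word for word, _run in groupby(all_words))


# A later, non-adjacent "is" is preserved.
print(merge_bigrams(['the school', 'school boy', 'boy is',
                     'is reading', 'reading is', 'is he']))

# Caveat: a genuinely doubled word ("said that that is") collapses too.
print(merge_bigrams(['said that', 'that that', 'that is']))
```

Either version keeps words that repeat later in the sentence, but both treat any immediate repetition as bigram overlap, so text with legitimately doubled words would need the positional approach instead.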