How do I create paragraphs from Markov chain output?

6
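I'd like to modify the script below so that it combines a random number of the sentences it generates into paragraphs. In other words, join some number of sentences (say 1-5) before adding a line break.
The script currently works fine, but its output is short sentences separated by newlines. I'd like to group some of those sentences into paragraphs.
Any ideas on best practice? Thanks.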
"""
    from:  http://code.activestate.com/recipes/194364-the-markov-chain-algorithm/?in=lang-python
"""

import random
import sys

stopword = "\n" # Since we split on whitespace, this can never be a word
stopsentence = (".", "!", "?",) # Cause a "new sentence" if found at the end of a word
sentencesep  = "\n" # String used to separate sentences


# GENERATE TABLE
w1 = stopword
w2 = stopword
table = {}

for line in sys.stdin:
    for word in line.split():
        if word[-1] in stopsentence:
            table.setdefault( (w1, w2), [] ).append(word[0:-1])
            w1, w2 = w2, word[0:-1]
            word = word[-1]
        table.setdefault( (w1, w2), [] ).append(word)
        w1, w2 = w2, word
# Mark the end of the file
table.setdefault( (w1, w2), [] ).append(stopword)

# GENERATE SENTENCE OUTPUT
maxsentences  = 20

w1 = stopword
w2 = stopword
sentencecount = 0
sentence = []

while sentencecount < maxsentences:
    newword = random.choice(table[(w1, w2)])
    if newword == stopword: sys.exit()
    if newword in stopsentence:
        print ("%s%s%s" % (" ".join(sentence), newword, sentencesep))
        sentence = []
        sentencecount += 1
    else:
        sentence.append(newword)
    w1, w2 = w2, newword

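Edit 01:

OK, I've cobbled together a simple "paragraph wrapper" that groups the sentences into paragraphs nicely, but it interferes with the sentence generator's output: for example, I get far too many repeated first words, among other problems.

But the idea works; I just need to figure out why adding the paragraph loop changes the behavior of the sentence loop. If you can see the problem, please advise: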

###
#    usage: $ python markov_sentences.py < input.txt > output.txt
#    from:  http://code.activestate.com/recipes/194364-the-markov-chain-algorithm/?in=lang-python
###

import random
import sys

stopword = "\n" # Since we split on whitespace, this can never be a word
stopsentence = (".", "!", "?",) # Cause a "new sentence" if found at the end of a word
paragraphsep  = "\n\n" # String used to separate paragraphs


# GENERATE TABLE
w1 = stopword
w2 = stopword
table = {}

for line in sys.stdin:
    for word in line.split():
        if word[-1] in stopsentence:
            table.setdefault( (w1, w2), [] ).append(word[0:-1])
            w1, w2 = w2, word[0:-1]
            word = word[-1]
        table.setdefault( (w1, w2), [] ).append(word)
        w1, w2 = w2, word
# Mark the end of the file
table.setdefault( (w1, w2), [] ).append(stopword)

# GENERATE PARAGRAPH OUTPUT
maxparagraphs = 10
paragraphs = 0 # reset the outer 'while' loop counter to zero

while paragraphs < maxparagraphs: # start outer loop, until maxparagraphs is reached
    w1 = stopword
    w2 = stopword
    stopsentence = (".", "!", "?",)
    sentence = []
    sentencecount = 0 # reset the inner 'while' loop counter to zero
    maxsentences = random.randrange(1,5) # random sentences per paragraph

    while sentencecount < maxsentences: # start inner loop, until maxsentences is reached
        newword = random.choice(table[(w1, w2)]) # random word from word table
        if newword == stopword: sys.exit()
        elif newword in stopsentence:
            print ("%s%s" % (" ".join(sentence), newword), end=" ")
            sentencecount += 1 # increment the sentence counter
        else:
            sentence.append(newword)
        w1, w2 = w2, newword
    print (paragraphsep) # newline space
    paragraphs = paragraphs + 1 # increment the paragraph counter


# EOF

Edit 02:

Per the answer below, I added sentence = [] to the elif clause, like so:

        elif newword in stopsentence:
            print ("%s%s" % (" ".join(sentence), newword), end=" ")
            sentence = [] # reset here so the next sentence starts from an empty list
            sentencecount += 1 # increment the sentence counter

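Edit 03:

This is the final version of the script. Thanks to grieve for helping solve the problem. I hope others have as much fun with it as I know I will. ;)

By the way: there is one minor flaw. If you use this script, you may need to strip an extra trailing space at the end of each paragraph. Other than that, this is a perfect implementation of Markov chain text generation.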

###
#    usage: python markov_sentences.py < input.txt > output.txt
#    from:  http://code.activestate.com/recipes/194364-the-markov-chain-algorithm/?in=lang-python
###

import random
import sys

stopword = "\n" # Since we split on whitespace, this can never be a word
stopsentence = (".", "!", "?",) # Cause a "new sentence" if found at the end of a word
sentencesep  = "\n" # String used to separate sentences


# GENERATE TABLE
w1 = stopword
w2 = stopword
table = {}

for line in sys.stdin:
    for word in line.split():
        if word[-1] in stopsentence:
            table.setdefault( (w1, w2), [] ).append(word[0:-1])
            w1, w2 = w2, word[0:-1]
            word = word[-1]
        table.setdefault( (w1, w2), [] ).append(word)
        w1, w2 = w2, word
# Mark the end of the file
table.setdefault( (w1, w2), [] ).append(stopword)

# GENERATE SENTENCE OUTPUT
maxsentences  = 20

w1 = stopword
w2 = stopword
sentencecount = 0
sentence = []
paragraphsep = "\n"
count = random.randrange(1,5) # sentences left in the current paragraph: 1-4, since randrange excludes the upper bound

while sentencecount < maxsentences:
    newword = random.choice(table[(w1, w2)]) # random word from word table
    if newword == stopword: sys.exit()
    if newword in stopsentence:
        print ("%s%s" % (" ".join(sentence), newword), end=" ")
        sentence = []
        sentencecount += 1 # increment the sentence counter
        count -= 1
        if count == 0:
            count = random.randrange(1,5)
            print (paragraphsep) # newline space
    else:
        sentence.append(newword)
    w1, w2 = w2, newword


# EOF
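
The trailing space mentioned above comes from printing each sentence with end=" ". One possible way around it, shown here only as a rough sketch, is to collect the finished sentences in a list and join them afterwards. The helper name group_into_paragraphs and its parameters low, high and rng are hypothetical and not part of the original recipe. Note also that random.randrange(1,5) yields 1 to 4, whereas random.randint(1, 5) includes 5.

```python
import random

def group_into_paragraphs(sentences, low=1, high=5, rng=None):
    """Split a list of sentences into paragraphs of low..high sentences each.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    rng = rng or random.Random()
    paragraphs = []
    i = 0
    while i < len(sentences):
        size = rng.randint(low, high)  # randint includes both endpoints
        # " ".join puts spaces only between sentences, so no trailing space
        paragraphs.append(" ".join(sentences[i:i + size]))
        i += size
    return paragraphs

sentences = ["One.", "Two!", "Three?", "Four.", "Five.", "Six."]
print("\n\n".join(group_into_paragraphs(sentences)))
```

To use this with the script above, the while loop would append each finished sentence to a list instead of printing it, and the grouping would run once at the end.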
2 Answers

3

You need to copy

sentence = []

back into the

elif newword in stopsentence:

clause, so that the loop becomes:

while paragraphs < maxparagraphs: # start outer loop, until maxparagraphs is reached
    w1 = stopword
    w2 = stopword
    stopsentence = (".", "!", "?",)
    sentence = []
    sentencecount = 0 # reset the inner 'while' loop counter to zero
    maxsentences = random.randrange(1,5) # random sentences per paragraph

    while sentencecount < maxsentences: # start inner loop, until maxsentences is reached
        newword = random.choice(table[(w1, w2)]) # random word from word table
        if newword == stopword: sys.exit()
        elif newword in stopsentence:
            print ("%s%s" % (" ".join(sentence), newword), end=" ")
            sentence = [] # reset here so the next sentence starts from an empty list
            sentencecount += 1 # increment the sentence counter
        else:
            sentence.append(newword)
        w1, w2 = w2, newword
    print (paragraphsep) # newline space
    paragraphs = paragraphs + 1 # increment the paragraph counter
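
To see why the reset matters, here is a minimal sketch that is separate from the script itself. The function emit_sentences and its reset flag are made up for illustration. Without the reset, each new sentence is appended to the words of the previous ones.

```python
def emit_sentences(words, reset):
    """Build sentences from a word stream; `reset` toggles clearing the buffer.

    Hypothetical helper used only to illustrate the bug.
    """
    stopsentence = (".", "!", "?")
    sentence = []
    out = []
    for word in words:
        if word in stopsentence:
            out.append(" ".join(sentence) + word)
            if reset:
                sentence = []  # start the next sentence from an empty list
        else:
            sentence.append(word)
    return out

words = ["a", "b", ".", "c", "d", "!"]
print(emit_sentences(words, reset=False))  # ['a b.', 'a b c d!']
print(emit_sentences(words, reset=True))   # ['a b.', 'c d!']
```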

Edit

Here is a solution that does not use the outer loop.

"""
    from:  http://code.activestate.com/recipes/194364-the-markov-chain-algorithm/?in=lang-python
"""

import random
import sys

stopword = "\n" # Since we split on whitespace, this can never be a word
stopsentence = (".", "!", "?",) # Cause a "new sentence" if found at the end of a word
sentencesep  = "\n" # String used to separate sentences


# GENERATE TABLE
w1 = stopword
w2 = stopword
table = {}

for line in sys.stdin:
    for word in line.split():
        if word[-1] in stopsentence:
            table.setdefault( (w1, w2), [] ).append(word[0:-1])
            w1, w2 = w2, word[0:-1]
            word = word[-1]
        table.setdefault( (w1, w2), [] ).append(word)
        w1, w2 = w2, word
# Mark the end of the file
table.setdefault( (w1, w2), [] ).append(stopword)

# GENERATE SENTENCE OUTPUT
maxsentences  = 20

w1 = stopword
w2 = stopword
sentencecount = 0
sentence = []
paragraphsep = "\n\n"
count = random.randrange(1,5)

while sentencecount < maxsentences:
    newword = random.choice(table[(w1, w2)])
    if newword == stopword: sys.exit()
    if newword in stopsentence:
        print ("%s%s" % (" ".join(sentence), newword), end=" ")
        sentence = []
        sentencecount += 1
        count -= 1
        if count == 0:
            count = random.randrange(1,5)
            print (paragraphsep)
    else:
        sentence.append(newword)
    w1, w2 = w2, newword
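
A likely reason the single-loop version gives more varied openings than the outer-loop version: the outer loop resets w1 and w2 to stopword at the top of every paragraph, and the key (stopword, stopword) is only ever filled once, with the first word of the input. The sketch below wraps the table-building code from the script in a function to show this. The name build_table and the sample text are assumptions for illustration.

```python
def build_table(text, stopword="\n", stopsentence=(".", "!", "?")):
    """Same table-building logic as the script, wrapped in a function."""
    w1 = w2 = stopword
    table = {}
    for word in text.split():
        if word[-1] in stopsentence:
            # Split trailing punctuation off into its own "word"
            table.setdefault((w1, w2), []).append(word[:-1])
            w1, w2 = w2, word[:-1]
            word = word[-1]
        table.setdefault((w1, w2), []).append(word)
        w1, w2 = w2, word
    table.setdefault((w1, w2), []).append(stopword)  # mark the end of the input
    return table

table = build_table("The cat sat. A dog ran. The end.")
print(table[("\n", "\n")])  # ['The'] : the only possible opening after a reset
print(table[("sat", ".")])  # ['A']   : later sentences follow from the previous one
```

Because the single loop never resets w1 and w2, each new sentence is chosen from whatever followed the previous sentence's ending in the source text.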

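Oops! Yeah, I must have pulled it out at some point and then forgot to put it back. Thanks for the insight! That fixed the problem, almost. It seems the sentence loop reuses the same starting word for each sentence. Any ideas on how to vary the first word used to generate a sentence? - Spider M. Mann
I added a separate solution that does not need the outer loop. - grieve
I don't have Python 3 installed at the moment, so you may need to adjust the syntax of the second solution. - grieve
Excellent. Thank you, grieve! That solved the problem perfectly. It needed a few minor edits, but nothing major. See the original post for the final code. I can't thank you enough; I was pulling my hair out. Very well done. - Spider M. Mann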

1

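Do you understand this code? I bet you can find the part that prints a sentence and change it to print several sentences at once, without a line break. You could add another while loop around the sentence part to get multiple paragraphs.

Syntax hint: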

print 'hello'
print 'there'
hello
there

print 'hello',
print 'there'
hello there

print 'hello',
print 
print 'there'

The point is that a trailing comma at the end of a print statement suppresses the newline, while an empty print statement prints a newline.
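
The hints above use Python 2 print statements, while the script in the question calls print as a function with end=" ". A rough Python 3 equivalent might look like the following. The wrapper function show is only there to keep the example self-contained.

```python
def show():
    print("hello", end=" ")  # end=" " replaces the trailing newline with a space
    print("there")           # prints "there" followed by a newline
    print()                  # an empty print() emits just a newline
    print("next paragraph")

show()
```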

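Yes, I understand. The problem is that nothing I tried with the print statement helped collect the sentences into paragraphs (unless you count stripping all the newlines and making one giant paragraph). I thought of a while loop, but wasn't sure how to wrap the sentence part. Everything I tried led to various errors, so I figured I'd ask the experts. What's the best way to tell it to "generate x (e.g. 1-5) sentences, then insert a line break, and repeat until maxsentences is reached"? - Spider M. Mann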
