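import re

sections = []
current = []
with open("Aberdeen2005.txt") as f:
    for line in f:
        # A line such as "3 of 120 DOCUMENTS" marks the start of a new article.
        if re.search(r"(?i)\d+ of \d+ DOCUMENTS", line):
            sections.append("".join(current))
            current = [line]
        else:
            current.append(line)
# Without this, the last article in the file is never added.
sections.append("".join(current))
print(len(sections))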
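Now the articles are stored in the list sections.
What I want to do next is split the articles into two groups. Articles containing the words (economy OR economic) AND (uncertainty OR uncertain) AND (tax OR policy) should be labelled with the number 1.
Articles containing the words (economy OR economic) AND (uncertain OR uncertainty) AND (regulation OR spending) should be labelled with the number 2. This is what I have tried so far: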
for i in range(len(sections)):
    group1 = re.search(r"+[economic|economy].+[uncertainty|uncertain].+[tax|policy]", , sections[i])
    group2 = re.search(r"+[economic|economy].+[uncertainty|uncertain].+[regulation|spending]", , sections[i])
However, it does not seem to work. Any ideas why?
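To make the intended grouping concrete, here is a minimal sketch of the rule described above. It is not a definitive fix. The names `GROUP_TERMS` and `classify` are hypothetical, and it assumes matching should be case-insensitive, on whole words only (so "taxes" would not count as "tax"), and independent of the order in which the words appear. It uses `(?:a|b)` for alternation, since square brackets in a regex define a character class rather than a choice between words, and it runs one search per term instead of chaining them with `.+`.

```python
import re

# Shared terms for both groups (assumption: whole-word matches only).
ECON = r"\b(?:economy|economic)\b"
UNCERT = r"\b(?:uncertainty|uncertain)\b"

# Hypothetical mapping: group number -> patterns that must ALL be present.
GROUP_TERMS = {
    1: [ECON, UNCERT, r"\b(?:tax|policy)\b"],
    2: [ECON, UNCERT, r"\b(?:regulation|spending)\b"],
}


def classify(text):
    """Return the group numbers whose patterns all occur somewhere in text."""
    return [
        group
        for group, patterns in GROUP_TERMS.items()
        if all(re.search(p, text, re.IGNORECASE) for p in patterns)
    ]
```

With `sections` built as above, `labels = [classify(s) for s in sections]` would give one list of group numbers per article. An article can land in both groups, or in neither, because the two rules overlap.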