Regex to match the first and last word, or any word

Question

I have a file containing a large list of data, like this:

 #fabulous       7.526   2301    2
 #excellent      7.247   2612    3
 #superb 7.199   1660    2
 #perfection     7.099   3004    4
 #terrific       6.922   629     1

I have another file containing a series of sentences like these:

Terrific Theo Walcott is still shit, watch Rafa and Johnny deal with him on Saturday.
its not that I'm a GSP fan, fabulous
Iranian general says Israel's Iron Dome can't deal with their missiles 
with J Davlar 11th. Main rivals are team Poland.

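I want to use a regex to check the following:

Whether the first word of each sentence matches any word in the file. For example, whether Terrific, its, Iranian appear in the file or not.
Whether the last word of each sentence matches any word in the file. For example, whether saturday, fabulous, missiles, Poland appear in the file or not.
Whether the prefixes and suffixes (2 or 3 characters) of the words in a sentence match the prefixes and suffixes (2 or 3 characters) of words in the file. For example, whether Ter, its, Ira, wi match the 2- or 3-character prefix of any word in the file or not. The same goes for suffixes.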

I'm very new to regex. This is the approach I came up with, but it gives no results (term2.lower() is the first column of the file):

    wordanalysis["trail"] = found if re.match(sentence[-1],term2.lower()) else not(found)
    wordanalysis["lead"] = found  if re.match(sentence[0],term2.lower()) else not(found)

- fscore

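Hi @r3mus, please check my edit. - fscore

I want to check whether the first word matches the list of words in the file. Why is that a problem? I'm working on a project. - fscore

@r3mus Sorry, I did that. Yes, you're right. Please see my edit for an example. - fscore

Updated the answer; it now works (tested). - brandonscript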

1 Answer

- brandonscript · Accepted Answer

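Update: thanks to @justhalf's excellent suggestion, there is no need to use a regex to split the words. If you want case-sensitive matching, remove .lower() (and the re.I flag, which otherwise keeps the search case-insensitive).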

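This will match the first and last word of each sentence in your list (the captured last word excludes any punctuation or trailing whitespace):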

(^\s?\w+\b|(\b\w+)[\.?!\s]*$)

Match results:

MATCH 1-1. Terrific
MATCH 2-1. Saturday.
        2. Saturday
MATCH 3-1. its
MATCH 4-1. fabulous
        2. fabulous
MATCH 5-1. Iranian
MATCH 6-1. missiles 
        2. missiles
MATCH 7-1. with
MATCH 8-1. Poland. 
        2. Poland
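
As a rough sketch of how the pattern above could be applied line by line in Python: the helper name `first_and_last` is hypothetical, not part of the original answer, and it assumes each sentence sits on its own line, so `^` and `$` anchor to that single line.

```python
import re

# Same pattern as above: group 1 is the whole match, group 2 is the
# last word without trailing punctuation or whitespace.
PATTERN = re.compile(r"(^\s?\w+\b|(\b\w+)[\.?!\s]*$)")

def first_and_last(line):
    """Return (first_word, last_word) for one sentence, or (None, None)."""
    matches = list(PATTERN.finditer(line))
    if not matches:
        return None, None
    first = matches[0].group(1).strip()
    last_match = matches[-1]
    # Group 2 is only set when the end-of-line alternative matched;
    # for a one-word line, fall back to the single match.
    last = last_match.group(2) or last_match.group(1).strip()
    return first, last

print(first_and_last("its not that I'm a GSP fan, fabulous"))
# ('its', 'fabulous')
print(first_and_last("with J Davlar 11th. Main rivals are team Poland."))
# ('with', 'Poland')
```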

Implementation:

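import re, string

# Read the sentences (one per line) and the raw text of the data list.
with open("sentences.txt") as f:
    sentences = f.read().splitlines()
with open("data.txt") as f:
    data = f.read()

# Translation table that deletes every punctuation character (Python 3).
strip_punct = str.maketrans("", "", string.punctuation)

for line in sentences:
    words = line.strip().split()
    if not words:
        continue  # skip blank lines
    first = words[0].lower()
    last = words[-1].translate(strip_punct).lower()
    # Substring search in the whole file; re.escape keeps the word literal.
    if re.search(re.escape(first), data, re.I):
        print("Found " + first + " in data.txt")
    if re.search(re.escape(last), data, re.I):
        print("Found " + last + " in data.txt")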

This may not be the most elegant approach, but you get the idea.

The code has been tested and works; the output is:

Found terrific in data.txt
Found fabulous in data.txt
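
One caveat with the implementation above: `re.search` looks for the word anywhere in the raw file text, so a word such as "perfect" would also be reported because it is a substring of "#perfection". If whole-word matches are wanted, a minimal sketch is to collect the first column into a set and test membership. It assumes the columns are whitespace-separated and each term starts with `#`, as in the sample data; `load_terms` and `first_and_last_hits` are hypothetical helper names.

```python
import string

def load_terms(lines):
    """Collect the first column of each data line, lowercased, without '#'."""
    terms = set()
    for line in lines:
        columns = line.split()
        if columns:
            terms.add(columns[0].lstrip("#").lower())
    return terms

def first_and_last_hits(sentence, terms):
    """Return (first_word_in_terms, last_word_in_terms) for one sentence."""
    words = [w.strip(string.punctuation).lower() for w in sentence.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    if not words:
        return False, False
    return words[0] in terms, words[-1] in terms

# Two rows taken from the sample data in the question.
data_lines = [
    "#fabulous       7.526   2301    2",
    "#terrific       6.922   629     1",
]
terms = load_terms(data_lines)
print(first_and_last_hits("its not that I'm a GSP fan, fabulous", terms))
# (False, True)
```

A set lookup is also cheaper than scanning the whole file once per word, which may matter if the data list is large.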

Also, this does not address your third criterion; try testing it and see whether it works for you for now.
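
For the third criterion, one possible sketch (not part of the tested code above) is to precompute the 2- and 3-character prefixes and suffixes of every term and compare them with those of each word in a sentence. The function names are hypothetical, and the term list is assumed to be already lowercased with the leading `#` removed.

```python
import string

def affixes(word, sizes=(2, 3)):
    """Return (prefixes, suffixes) of the given lengths for one word."""
    prefixes = {word[:n] for n in sizes if len(word) >= n}
    suffixes = {word[-n:] for n in sizes if len(word) >= n}
    return prefixes, suffixes

def build_affix_sets(terms):
    """Union of all prefixes and all suffixes over the term list."""
    all_prefixes, all_suffixes = set(), set()
    for term in terms:
        prefixes, suffixes = affixes(term)
        all_prefixes |= prefixes
        all_suffixes |= suffixes
    return all_prefixes, all_suffixes

def affix_hits(sentence, term_prefixes, term_suffixes):
    """Map each word to (prefix_matches, suffix_matches)."""
    hits = {}
    for token in sentence.split():
        word = token.strip(string.punctuation).lower()
        if not word:
            continue
        prefixes, suffixes = affixes(word)
        hits[word] = (bool(prefixes & term_prefixes),
                      bool(suffixes & term_suffixes))
    return hits

# First column of the sample data, without '#'.
terms = ["fabulous", "excellent", "superb", "perfection", "terrific"]
term_prefixes, term_suffixes = build_affix_sets(terms)
hits = affix_hits("its not that I'm a GSP fan, fabulous",
                  term_prefixes, term_suffixes)
print(hits["fan"])       # (True, False): "fa" is a prefix of "fabulous"
print(hits["fabulous"])  # (True, True)
```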