Python: delete lines that contain a word from a list

Question

python

3

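I am writing a Python script, but I cannot seem to get the right result. It takes two inputs:

A data file
A stop-word file

The data file consists of 4 tab-separated columns and is already sorted. The stop-word file is a sorted list of words.

The goal of the script is:

If the string in column 1 of the data file matches a string in the stop-word file, delete the entire line.

Here is a sample of the data file: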

abandonment-n   after+n-the+n-a-j   stop-n  1
abandonment-n   against+n-the+ns    leave-n 1
cake-n  against+n-the+vg    rest-v  1
abandonment-n   as+n-a+vd   require-v   1
abandonment-n   as+n-a-j+vg-up  use-v   1

Here is a sample of the stop-word file:

apple-n
banana-n
cake-n
pigeon-n

Here is my code so far:

with open("input1", "rb") as oIndexFile:
        for line in oIndexFile: 
            lemma = line.split()
            #print lemma

with open ("input2", "rb") as oSenseFile:
    with open("output", "wb") as oOutFile:
        for line in oSenseFile:
            concept, slot, filler, freq = line.split()
            nounsInterest = [concept, slot, filler, freq]
            #print concept
            if concept != lemma:
                outstring = '\t'.join(nounsInterest)
                oOutFile.write(outstring + '\n')
            else: 
                pass

The desired output is:

abandonment-n   after+n-the+n-a-j-stop-n    1
abandonment-n   against+n-the+ns-leave-n    1
abandonment-n   as+n-a+vd-require-v 1
abandonment-n   as+n-a-j+vg-up-use-v    1

Any insights?

The output I currently get is shown below; it is essentially just the input printed back unchanged:

abandonment-n   after+n-the+n-a-j   stop-n  1
abandonment-n   against+n-the+ns    leave-n 1
cake-n  against+n-the+vg    rest-v  1
abandonment-n   as+n-a+vd   require-v   1
abandonment-n   as+n-a-j+vg-up  use-v   1

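What I have tried so far, without success:

Changing if concept != lemma: to if concept not in lemma:

This gives the same output as before.

I also suspected that the code was not actually using the first input file, but even when I incorporate it into the code: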

with open ("input2", "rb") as oSenseFile:
    with open("tinput1", "rb") as oIndexFile:
        for line in oIndexFile: 
            lemma = line.split()
            with open("out", "wb") as oOutFile:
                for line in oSenseFile:
                    concept, slot, filler, freq = line.split()
                    nounsInterest = [concept, slot, filler, freq]
                    if concept not in lemma:
                        outstring = '\t'.join(nounsInterest)
                        oOutFile.write(outstring + '\n')
                    else: 
                        pass

This code produces a blank output file.

I also tried a different approach, based on a linked post:

filename = "input1.txt" 
filename2 = "input2.txt"
filename3 = "output1"

def fixup(filename): 
    fin1 = open(filename) 
    fin2 = open(filename2, "r")
    fout = open(filename3, "w") 
    for word in filename: 
        words = word.split()
    for line in filename2:
        concept, slot, filler, freq = line.split()
        nounsInterest = [concept, slot, filler, freq]
        if True in [concept in line for word in toRemove]:
            pass
        else:
            outstring = '\t'.join(nounsInterest)
            fout.write(outstring + '\n')
    fin1.close() 
    fin2.close() 
    fout.close()

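This was adapted from a linked post, but I could not get it to work. In this case, no output is produced at all.

Could someone point me in the right direction for solving this task? The sample files are small, but I will have to run this on a large file.

Any help is appreciated.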

- owwoow14

1

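Every call to line.split() creates a new list, and each pass of the loop rebinds lemma to it. In your case, lemma is ["pigeon-n"] after the loop, which is why the output is not what you expect. - flyingfoxlee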
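
To make the comment above concrete, here is a minimal sketch using the stop words from the question as an in-memory list (the list stands in for the stop-word file; the name stop_words is introduced here for illustration only). It shows that rebinding lemma inside the loop keeps only the last word, while accumulating into a set keeps all of them.

```python
# Sample stop words from the question, standing in for the lines of the stop-word file.
stop_lines = ["apple-n\n", "banana-n\n", "cake-n\n", "pigeon-n\n"]

# Rebinding the name on every pass keeps only the result for the last line.
for line in stop_lines:
    lemma = line.split()

# Accumulating into a set (a name chosen for this sketch) keeps every stop word.
stop_words = set()
for line in stop_lines:
    stop_words.add(line.strip())

print(lemma)                     # ['pigeon-n']
print("cake-n" not in lemma)     # True, so the cake-n line is wrongly kept
print("cake-n" in stop_words)    # True, so the cake-n line can be dropped
```

Note that concept != lemma compares a string with a list, so it is always true, which matches the unfiltered output described in the question.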

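Possibly a duplicate of the question about searching for a list of words in a large file in Python. - moooeeeep

@moooeeeep, I have looked at that and incorporated some of its ideas, but I still cannot get the desired output. Thanks for the pointer! - owwoow14

3 Answers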

1

I have not checked your logic, but you are overwriting lemma on every line. Maybe you should append each word to a list instead?

lemma = []
for line in oIndexFile:
    lemma.append(line.strip())  # strip surrounding whitespace, including the trailing newline

Or, as @gnibbler suggested, you can use a set for a slight efficiency gain:

lemma = set()
for line in oIndexFile:
    lemma.add(line.strip())

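Edit: it looks like you do not want to split the line, just strip the newline. And yes, your logic is almost correct.

The second part should look like this: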

with open ("data_php.txt", "rb") as oSenseFile:
    with open("out_FILTER_LINES", "wb") as oOutFile:
        for line in oSenseFile:
            concept, slot, filler, freq = line.split()
            nounsInterest = [concept, slot, filler, freq]
            #print concept
            if concept not in lemma: #check if the concept exists in lemma
                outstring = '\t'.join(nounsInterest)
                oOutFile.write(outstring + '\n')
            else: 
                pass

- aIKid

1

It would be better to use a set to store lemma. - John La Rooy

1

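If you are sure that lines in the data file do not start with whitespace, then we do not need to split the line. This is a slight tweak of @gnibbler's answer.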

with open('input1', 'rb') as indexfile:
    lemma = {x.strip() for x in indexfile}

with open('input2', 'rb') as sensefile, open('output', 'wb') as outfile:
    for line in sensefile:
        if not any(line.startswith(x) for x in lemma):
            outfile.write(line)

- flyingfoxlee

1

The key point of @gnibbler's answer is the use of in with a set, which is efficient. - georg

On the sample data: I timed @flyingfoxlee's and @gnibbler's answers, and @gnibbler's is marginally faster. python flyingfoxlee.py: start 2013-11-13 11:50:43.533743, finish 2013-11-13 11:50:43.534602, difference 0.000859. python gnibbler.py: start 2013-11-13 11:51:21.671065, finish 2013-11-13 11:51:21.671921, difference 0.000856. This matters because I will be using it on fairly large files. I am running some tests on larger data. - owwoow14

1

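@gnibbler's answer is very good; I only wanted to offer an alternative for the case where lines in the data file do not start with whitespace. I am not sure which one is more efficient. - flyingfoxlee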
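
One difference between the two approaches is worth illustrating. The startswith variant tests a prefix against every stop word for each line, while the set lookup tests the whole first column in one hash lookup. The sketch below is only an illustration: the helper names and the cake-nut-n line are made up for this example and do not come from the thread.

```python
stop_words = {"cake-n", "pigeon-n"}

# The second line is invented: its first column merely begins with "cake-n".
lines = [
    "cake-n\tagainst+n-the+vg\trest-v\t1\n",
    "cake-nut-n\tas+n-a+vd\trequire-v\t1\n",
]

def dropped_by_prefix(line):
    """Prefix test: scans all stop words for every line."""
    return any(line.startswith(word) for word in stop_words)

def dropped_by_exact(line):
    """Exact test: compares the whole first column using one set lookup."""
    return line.split("\t", 1)[0] in stop_words

for line in lines:
    print(line.split("\t", 1)[0], dropped_by_prefix(line), dropped_by_exact(line))
# cake-n True True
# cake-nut-n True False
```

If stop words can be prefixes of other first-column values, the prefix test removes more lines than intended, so the exact comparison is the safer choice in that situation.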

- John La Rooy · Accepted Answer

I think you are trying to do something like this.

with open('input1', 'rb') as indexfile:
    lemma = {x.strip() for x in indexfile}

with open('input2', 'rb') as sensefile, open('output', 'wb') as outfile:
    for line in sensefile:
        nouns_interest = concept, slot, filler, freq = line.split()
        if concept not in lemma:
            outfile.write('\t'.join(nouns_interest) + '\n')

Your desired output appears to join slot and filler with a hyphen, so you may want to use this instead:

            outfile.write('{}\t{}-{}\t{}\n'.format(*nouns_interest))
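
Putting the pieces together, here is a minimal self-contained sketch of the same idea, written for Python 3 and run on the sample data from the question. The function name filter_lines and the decision to skip lines that do not have exactly four fields are assumptions of this sketch, not part of the answer above.

```python
def filter_lines(data_lines, stop_words):
    """Yield reformatted lines whose first column is not a stop word."""
    for line in data_lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # assumption: silently skip blank or malformed lines
        concept, slot, filler, freq = fields
        if concept not in stop_words:
            # Join slot and filler with a hyphen, as in the desired output.
            yield "{}\t{}-{}\t{}\n".format(concept, slot, filler, freq)

# Sample data and stop words copied from the question.
data = [
    "abandonment-n\tafter+n-the+n-a-j\tstop-n\t1\n",
    "abandonment-n\tagainst+n-the+ns\tleave-n\t1\n",
    "cake-n\tagainst+n-the+vg\trest-v\t1\n",
    "abandonment-n\tas+n-a+vd\trequire-v\t1\n",
    "abandonment-n\tas+n-a-j+vg-up\tuse-v\t1\n",
]
stop = {"apple-n", "banana-n", "cake-n", "pigeon-n"}

result = list(filter_lines(data, stop))
print("".join(result), end="")
```

Because the function accepts any iterable of lines, an open file object can be passed directly, which keeps memory use low on a large data file; only the stop-word set is held in memory. In Python 3 the files should be opened in text mode ("r" and "w") rather than "rb" and "wb", since joining and formatting strings as above does not work on bytes.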