Python: Delete lines that contain words from a list

3
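I am writing a Python script but can't seem to get the right result. It takes two inputs:
  1. A data file
  2. A stop-word file
The data file has four tab-separated columns and is already sorted. The stop-word file is a sorted list of words.
The goal of the script is:
  • If the string in column 1 of the data file matches a string in the stop-word file, delete the entire line.
Here is a sample of the data file: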
abandonment-n   after+n-the+n-a-j   stop-n  1
abandonment-n   against+n-the+ns    leave-n 1
cake-n  against+n-the+vg    rest-v  1
abandonment-n   as+n-a+vd   require-v   1
abandonment-n   as+n-a-j+vg-up  use-v   1

Here is a sample of the stop-word file:
apple-n
banana-n
cake-n
pigeon-n

Here is my code so far:

with open("input1", "rb") as oIndexFile:
        for line in oIndexFile: 
            lemma = line.split()
            #print lemma

with open ("input2", "rb") as oSenseFile:
    with open("output", "wb") as oOutFile:
        for line in oSenseFile:
            concept, slot, filler, freq = line.split()
            nounsInterest = [concept, slot, filler, freq]
            #print concept
            if concept != lemma:
                outstring = '\t'.join(nounsInterest)
                oOutFile.write(outstring + '\n')
            else: 
                pass

The desired output is:

abandonment-n   after+n-the+n-a-j-stop-n    1
abandonment-n   against+n-the+ns-leave-n    1
abandonment-n   as+n-a+vd-require-v 1
abandonment-n   as+n-a-j+vg-up-use-v    1

Any insights?

Currently the output is the following, which is essentially the input printed back unchanged:

abandonment-n   after+n-the+n-a-j   stop-n  1
abandonment-n   against+n-the+ns    leave-n 1
cake-n  against+n-the+vg    rest-v  1
abandonment-n   as+n-a+vd   require-v   1
abandonment-n   as+n-a-j+vg-up  use-v   1

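What I have tried so far, without success:

Changing if concept != lemma: to if concept not in lemma:

This gives the same output as before.

I also suspected that the script was not reading the first input file, but even when I incorporate it into the code: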

with open ("input2", "rb") as oSenseFile:
    with open("tinput1", "rb") as oIndexFile:
        for line in oIndexFile: 
            lemma = line.split()
            with open("out", "wb") as oOutFile:
                for line in oSenseFile:
                    concept, slot, filler, freq = line.split()
                    nounsInterest = [concept, slot, filler, freq]
                    if concept not in lemma:
                        outstring = '\t'.join(nounsInterest)
                        oOutFile.write(outstring + '\n')
                    else: 
                        pass

this code produces an empty output file.

I also tried a different approach, based on the link below:

filename = "input1.txt" 
filename2 = "input2.txt"
filename3 = "output1"

def fixup(filename): 
    fin1 = open(filename) 
    fin2 = open(filename2, "r")
    fout = open(filename3, "w") 
    for word in filename: 
        words = word.split()
    for line in filename2:
        concept, slot, filler, freq = line.split()
        nounsInterest = [concept, slot, filler, freq]
        if True in [concept in line for word in toRemove]:
            pass
        else:
            outstring = '\t'.join(nounsInterest)
            fout.write(outstring + '\n')
    fin1.close() 
    fin2.close() 
    fout.close()

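This was adapted from here, but it did not work. In this case no output was produced at all.

Could someone point me in the right direction for solving this task? The sample files are small, but I have to run it on a large file.

Any help is appreciated.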


1
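Each call to line.split() creates a new list, so after the loop lemma holds only the last line, ['pigeon-n']. That is why the output is not what you expect. - flyingfoxlee
Possible duplicate of the question on searching a large file for a list of words in Python. - moooeeeep
@moooeeeep, I have looked at that and incorporated some of its ideas, but I still can't produce the desired output. Thanks for the pointer! - owwoow14
3 Answers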

4
I think you are trying to do something like this.
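# Note: the 'rb'/'wb' modes are kept from the question (Python 2).
# On Python 3, open the files in text mode ('r'/'w') so that str.join works.
with open('input1', 'rb') as indexfile:
    # Read every stop word into a set for fast membership tests.
    lemma = {x.strip() for x in indexfile}

with open('input2', 'rb') as sensefile, open('output', 'wb') as outfile:
    for line in sensefile:
        # Unpack the four columns and keep the whole row in one step.
        nouns_interest = concept, slot, filler, freq = line.split()
        if concept not in lemma:
            outfile.write('\t'.join(nouns_interest) + '\n')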

Your desired output appears to join slot and filler with a hyphen, so you may want to use this instead:
            outfile.write('{}\t{}-{}\t{}\n'.format(*nouns_interest))
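
For anyone who wants to try this without creating files, here is a minimal sketch of the same idea in Python 3. It is not the answerer's exact code: the function name filter_lines, the in-memory io.StringIO stand-ins for the two files, and the choice to skip rows that do not have exactly four fields are all assumptions made for illustration.

```python
import io


def filter_lines(data_lines, stop_words):
    """Yield reformatted rows whose first column is not a stop word."""
    # A set gives constant-time membership tests on average.
    stop = {w.strip() for w in stop_words if w.strip()}
    for line in data_lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # assumption: silently skip blank or malformed rows
        concept, slot, filler, freq = fields
        if concept not in stop:
            # Join slot and filler with a hyphen, as in the desired output.
            yield '{}\t{}-{}\t{}\n'.format(concept, slot, filler, freq)


# In-memory stand-ins for the two input files (rows taken from the question).
data = io.StringIO(
    "abandonment-n\tafter+n-the+n-a-j\tstop-n\t1\n"
    "cake-n\tagainst+n-the+vg\trest-v\t1\n"
    "abandonment-n\tas+n-a+vd\trequire-v\t1\n"
)
stops = io.StringIO("apple-n\nbanana-n\ncake-n\npigeon-n\n")

result = list(filter_lines(data, stops))
print(''.join(result), end='')
```

With real files, the two StringIO objects would be replaced by files opened in text mode, and each yielded row written to the output file. Because the function is a generator, the data file is processed one line at a time, which should suit large inputs.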

1
I haven't checked your logic, but you are overwriting lemma on every line. Maybe you should append to a list instead?
lemma = []
for line in oIndexFile:
    lemma.append(line.strip())  # strip() removes surrounding whitespace, including the newline

Or, as @gnibbler suggested, you can use a set for more efficient lookups:
lemma = set()
for line in oIndexFile:
    lemma.add(line.strip())

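Edit: it looks like you don't want to split the line, just strip the newline. And yes, your logic is almost correct.
The second part should look like this: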
with open ("data_php.txt", "rb") as oSenseFile:
    with open("out_FILTER_LINES", "wb") as oOutFile:
        for line in oSenseFile:
            concept, slot, filler, freq = line.split()
            nounsInterest = [concept, slot, filler, freq]
            #print concept
            if concept not in lemma: #check if the concept exists in lemma
                outstring = '\t'.join(nounsInterest)
                oOutFile.write(outstring + '\n')
            else: 
                pass
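
To see why the original loop never filtered anything, here is a small sketch, assuming the four stop words from the question held in a plain list instead of a file. The variable names last_only and all_words are made up for this illustration.

```python
stop_lines = ["apple-n\n", "banana-n\n", "cake-n\n", "pigeon-n\n"]

# Pattern from the question: the name is rebound on every iteration,
# so only the last line survives the loop.
for line in stop_lines:
    last_only = line.split()

# Pattern from this answer: accumulate every word in a set.
all_words = set()
for line in stop_lines:
    all_words.add(line.strip())

# A string never equals a list, so `concept != lemma` was always true.
never_equal = "pigeon-n" != last_only

cake_in_last = "cake-n" in last_only
cake_in_all = "cake-n" in all_words
print(last_only, never_equal, cake_in_last, cake_in_all)
```

After the first loop, last_only is a one-element list holding the final stop word, so the row for cake-n passes both the != test and the not in test. Only the accumulated set contains cake-n.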

1
Better to use a set to store lemma - John La Rooy

1
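If you are sure that no line in the data file starts with whitespace, there is no need to split the line. This is a slight tweak of @gnibbler's answer.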
with open('input1', 'rb') as indexfile:
    lemma = {x.strip() for x in indexfile}

with open('input2', 'rb') as sensefile, open('output', 'wb') as outfile:
    for line in sensefile:
        # A generator expression lets any() stop at the first matching stop word.
        if not any(line.startswith(x) for x in lemma):
            outfile.write(line)
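
One caveat with startswith is that it is a prefix test, so a stop word that happens to be the beginning of a longer first-column entry would also remove that row. The sketch below illustrates this, assuming tab-separated columns as described in the question; the row cake-nut-n is hypothetical and does not appear in the sample data, and both helper functions are named only for this example.

```python
stop_words = {"cake-n"}
rows = [
    "cake-n\tagainst+n-the+vg\trest-v\t1\n",
    "cake-nut-n\tas+n-a+vd\trequire-v\t1\n",  # hypothetical row, not in the question
    "abandonment-n\tas+n-a+vd\tuse-v\t1\n",
]


def keep_by_prefix(row):
    # Prefix test, as in the answer above.
    return not any(row.startswith(w) for w in stop_words)


def keep_by_first_column(row):
    # Exact match on the text before the first tab.
    first, _, _ = row.partition("\t")
    return first not in stop_words


prefix_kept = [r.split("\t")[0] for r in rows if keep_by_prefix(r)]
column_kept = [r.split("\t")[0] for r in rows if keep_by_first_column(r)]
print(prefix_kept)
print(column_kept)
```

The prefix version drops cake-nut-n along with cake-n, while the first-column version keeps it. If such overlaps cannot occur in the data, the two approaches should select the same rows.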

1
The key point of @gnibbler's answer is the membership test against a set, which is efficient. - georg
On the sample data: I timed @flyingfoxlee's and @gnibbler's answers, and @gnibbler's is slightly faster. python flyingfoxlee.py: start 2013-11-13 11:50:43.533743, finish 2013-11-13 11:50:43.534602, difference 0.000859. python gnibbler.py: start 2013-11-13 11:51:21.671065, finish 2013-11-13 11:51:21.671921, difference 0.000856. This matters because I will be using it on fairly large files. I am running some tests on larger data. - owwoow14
1
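@gnibbler's answer is excellent; I only wanted to offer an alternative in case lines in the data file do not start with whitespace. I'm not sure which is more efficient. - flyingfoxlee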
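
The five-line sample is too small to show a meaningful difference between the two approaches. A rough way to compare them on larger input is sketched below with the standard timeit module. The synthetic data (500 stop words, 2,000 rows) and the function names by_set and by_startswith are assumptions for illustration, and the timings will vary by machine.

```python
import timeit

# Synthetic data (hypothetical): 500 stop words and 2,000 tab-separated rows.
stop_words = {"word{}-n".format(i) for i in range(500)}
rows = ["word{}-n\tslot\tfiller-n\t1\n".format(i * 3) for i in range(2000)]


def by_set(rows, stop_words):
    # One set lookup per row on the first column.
    return [r for r in rows if r.split("\t", 1)[0] not in stop_words]


def by_startswith(rows, stop_words):
    # One startswith call per (row, stop word) pair in the worst case.
    return [r for r in rows if not any(r.startswith(w) for w in stop_words)]


# Both approaches should select the same rows on this data.
same = by_set(rows, stop_words) == by_startswith(rows, stop_words)
kept = len(by_set(rows, stop_words))

t_set = timeit.timeit(lambda: by_set(rows, stop_words), number=3)
t_scan = timeit.timeit(lambda: by_startswith(rows, stop_words), number=3)
print("same result:", same, "rows kept:", kept)
print("set lookup: {:.4f}s  startswith scan: {:.4f}s".format(t_set, t_scan))
```

The set version does a constant amount of work per row on average, while the startswith version may compare each row against every stop word, so the gap is expected to widen as the stop-word list grows.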
