How do I create a new list from an existing list, but without the duplicates?

3

I have a list:

carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
 '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']

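Now I want to get rid of the duplicate items. The problem is that the duplicates are not identical as whole strings: they only match in one part of the string, i[38:].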

My idea was to write a for loop:

new_list = []
for i in carner_list:
       if i[38:] in new_list:
           print("found")
       else:
           new_list = new_list + [i]
           print("not")

But that didn't work.

Is something wrong with the syntax, or am I on the wrong track entirely?

Best, Russell


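Do you need the items with unique text, i.e. Damselfly, The Isle of Arran? - Rakesh
What incorrect output does your current code produce? - Prune
When you check whether an item from carner_list is also in new_list, this will always evaluate to False, because new_list is empty. - monsieuralfonse64
5 Answers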

1
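I wrote a small function called listContains that I think solves your problem. Your code doesn't work because you search new_list for the value i[38:], while what you append to new_list is the whole value i.
So you also need to apply the [38:] slice to each value already in the list.
I think the code below explains what I mean better: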
carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
 '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']
new_list = []

def listContains(myList, toSearch):
  for val in myList:
    if val[38:] == toSearch:
      return True
  return False

for i in carner_list:
  if listContains(new_list, i[38:]):
    print("found")
  else:
    new_list.append(i)
    print("not")
print(new_list)

If you want to try it out, you can test it here.
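
To make the mismatch described above concrete, here is a minimal sketch using two of the strings from the question. The variable names (first, second, suffix, whole_match, tail_match) are only for illustration and are not from the original code.

```python
first = '<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>'
second = '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>'

new_list = [first]      # what the original loop stores: whole strings
suffix = second[38:]    # what the original loop looks up: only the tail

print(suffix)  # Damselfly">Damselfly</a>

# `in` on a list tests equality with each element, so a tail never
# equals a whole string.
whole_match = suffix in new_list
# Comparing tail against tails is what listContains does.
tail_match = suffix in [s[38:] for s in new_list]

print(whole_match, tail_match)  # False True
```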


1
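Thank you very much, Giovanni. That was exactly the problem. I've solved it now. - Russgo
I'm glad I could solve your problem. Please mark my answer as accepted so that future users can find the solution right away. Thanks! - Giovanni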

1
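The part of the string from index 38 to the end (which is what determines a duplicate) is not what you actually store in the list, so the in operator doesn't find it.
Instead, you can use a dict to store the deduplicated strings, with the part of the string you care about as the key, so that the in operator works as intended: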
new = {}
for i in carner_list:
    key = i[38:]
    if key not in new:
        new[key] = i
print(list(new.values()))

This outputs:

['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>', '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>', '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']
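
A slightly more compact variant of the same idea, as a sketch: dict.setdefault only stores a value when the key is not present yet, so the first string seen for each key is the one kept. This relies on dicts preserving insertion order, which is guaranteed from Python 3.7 on. The names first_by_key and unique are just illustrative.

```python
carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
 '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']

first_by_key = {}
for item in carner_list:
    # Store the item only if its tail (index 38 onward) has not been seen.
    first_by_key.setdefault(item[38:], item)

# Values come back in the order their keys were first inserted.
unique = list(first_by_key.values())
print(unique)
```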

1

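The way you search right now checks whether the substring is equal to anything in new_list. That will never be true, because it is only a substring.

You can use filter with a lambda function to keep only the entries of new_list that contain the substring, convert the result to a list, and check that the list's length is not 0.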

len(list(filter(lambda x: i[38:] in x, new_list))) != 0

Final code:
carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
 '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']


new_list = []

for i in carner_list:
    if len(list(filter(lambda x: i[38:] in x, new_list))) != 0:
        print("found")
    else:
        new_list.append(i)
        print("not")
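
The same check can probably be written with any() and a generator expression, which stops at the first match instead of building a whole list. The sketch below also swaps the `in` substring test for str.endswith, on the assumption that the duplicate part is always the tail of the string; that swap is my choice, not part of the answer above.

```python
carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
 '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']

new_list = []
for item in carner_list:
    suffix = item[38:]
    # any() short-circuits as soon as one kept string ends with the suffix.
    if any(kept.endswith(suffix) for kept in new_list):
        continue  # duplicate, skip it
    new_list.append(item)

print(new_list)
```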

1
Parse the HTML with BeautifulSoup, then check the link text. Example:
from bs4 import BeautifulSoup

carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
 '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']

new_list = []
check_val = set()
for i in carner_list:
    s = BeautifulSoup(i, "html.parser")
    if s.text not in check_val:    #check for text
        new_list.append(i)
        check_val.add(s.text)
print(new_list)

Output:

['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of '
 'Arran</a>',
 '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the '
 'Morning</a>']
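
If installing bs4 is not an option, the standard library's html.parser module can pull out the same link text. This is a rough sketch of a different technique, not the BeautifulSoup approach itself; the class name LinkTextParser and the helper link_text are made up for illustration.

```python
from html.parser import HTMLParser

carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
 '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']

class LinkTextParser(HTMLParser):
    """Hypothetical helper: collects all text found between tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def link_text(html):
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

new_list = []
seen_text = set()
for item in carner_list:
    text = link_text(item)
    if text not in seen_text:   # keep only the first item per link text
        seen_text.add(text)
        new_list.append(item)

print(new_list)
```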

1
Why not use a regular expression?
import re
carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
 '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']

print({re.findall(r'"([^"]*)"', x)[0].split("/")[4]: x for x in carner_list })

#Below is the output generated 
'''
{'Damselfly': '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>', 'The+Isle+of+Arran': '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>', 'Mean+It+in+the+Morning': '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>'}
'''
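
Note that the dict comprehension above yields a dict rather than a list, and later duplicates overwrite earlier ones, which is why the Damselfly entry in the output is 37360958. If a list with the first occurrence of each song is wanted, something like the following sketch might do. The pattern SLUG and the helpers slug and dedupe_by_key are my own names and assume every href has the /lyric/<id>/<artist>/<title> shape seen in the question.

```python
import re

carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
 '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']

# Assumed href shape: /lyric/<numeric id>/<artist>/<title>
SLUG = re.compile(r'href="/lyric/\d+/[^/"]+/([^"]+)"')

def slug(item):
    """Return the title part of the href, or the whole item if no match."""
    match = SLUG.search(item)
    return match.group(1) if match else item

def dedupe_by_key(items, key):
    """Keep the first item for each distinct key, preserving order."""
    seen = set()
    result = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            result.append(item)
    return result

unique = dedupe_by_key(carner_list, slug)
print(unique)
```

Because the key is passed in as a function, the same helper would also work with `lambda s: s[38:]` from the question.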
