Can't seem to handle empty results from regex (re.search) in Python: I get either duplicate results or no results

Question


python regex pandas beautifulsoup python-requests-html

3

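I am trying to get the list of individuals from https://www.ourcommons.ca/Parliamentarians/en/members?view=List. Once I have the list, I go through each member's link and look up their email address.

The code fails because some members do not have an email address. I tried adding code to handle the case where the match is empty, but then I get duplicate results.

I am using the following logic to match: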

mat = re.search(r'mailto:\w*\.\w*@parl.gc.ca', ln1.get('href'))
if mat:
    email.append(mat.group())
else:
    email.append("No Email Found")

The if condition is the problem. When I use the else branch, every row shows "No Email Found".
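As a minimal sketch of the matching step on its own: `re.search` returns `None` when nothing matches, and raises `TypeError` if it is handed `None` instead of a string (which is what `ln1.get('href')` gives for a tag without an `href`). The helper name `extract_email` and its `default` parameter are hypothetical, not from the original post, and escaping the dots in the domain is a small tightening of the pattern.

```python
import re

# Same pattern as in the question, with the literal dots in the domain
# escaped so that "." cannot match an arbitrary character.
EMAIL_RE = re.compile(r"mailto:\w*\.\w*@parl\.gc\.ca")


def extract_email(href, default="No Email Found"):
    """Return the matched mailto link, or `default` when there is none.

    Hypothetical helper for illustration only.
    """
    if not href:  # href is None when the tag has no href attribute
        return default
    mat = EMAIL_RE.search(href)
    return mat.group() if mat else default


print(extract_email("mailto:Dan.Albas@parl.gc.ca"))
print(extract_email("https://www.example.com/party"))
print(extract_email(None))
```

Keeping the `None` handling inside one function means the calling loop only decides how many times to call it, which is where the duplicate entries come from.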

import re
import requests
from bs4 import BeautifulSoup

weblinks = []
email = []

page = requests.get('https://www.ourcommons.ca/Parliamentarians/en/members?view=ListAll')
soup = BeautifulSoup(page.content, 'lxml')


for ln in soup.select(".personName > a"):
    weblinks.append("https://www.ourcommons.ca" + ln.get('href'))
    if(len(weblinks)==10):
        break

Extract emails

for elnk in weblinks:
    pagedet = requests.get(elnk)
    soupdet = BeautifulSoup(pagedet.content, 'lxml')
    for ln1 in soupdet.select(".caucus > a"):
        mat = re.search(r'mailto:\w*\.\w*@parl.gc.ca',ln1.get('href'))
        if mat:
            email.append(mat.group())
        else:
            email.append("No Email Found")

print("Len Email:",len(email))

Expected result: show the email for pages that have one, and a blank for pages that do not.
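One way to get exactly one entry per page, sketched here under assumptions: gather all hrefs found on a page first, then keep the first match or fall back to a blank. The function name `first_email` and the sample hrefs are made up for illustration; in the real script the hrefs would come from the anchors selected on each member page.

```python
import re

EMAIL_RE = re.compile(r"mailto:\w*\.\w*@parl\.gc\.ca")


def first_email(hrefs, default=""):
    """Return the first mailto match among one page's hrefs, else `default`.

    Hypothetical helper for illustration only.
    """
    matches = (EMAIL_RE.search(h) for h in hrefs if h)
    return next((m.group() for m in matches if m), default)


# Made-up hrefs standing in for three member pages.
pages = [
    ["https://www.example.com/party", "mailto:Jane.Doe@parl.gc.ca"],
    ["https://www.example.com/party"],  # page without an email link
    [],                                 # page without any links
]

email = [first_email(hrefs) for hrefs in pages]
print(email)
print("Len Email:", len(email))
```

Because the list is built with one call per page, its length always equals the number of pages, whether or not a page contains an email link.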

- Mohit Nayar

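Your code seems to work for me. Which versions of Python and beautifulsoup are you using? - Matt Pitkin

What do you mean by duplicate results? Do you get the same email twice when the match succeeds, and "No Email Found" twice when it fails? - r.ook

1 Answer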


- KunduK · Answer 1

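If you inspect the page's DOM, you will see two similar elements, which is why you get multiple values. You need a condition to handle this. Please try the following code.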

import re
import requests
from bs4 import BeautifulSoup

weblinks = []
email = []

page = requests.get('https://www.ourcommons.ca/Parliamentarians/en/members?view=ListAll')
soup = BeautifulSoup(page.content, 'lxml')


for ln in soup.select(".personName > a"):
    weblinks.append("https://www.ourcommons.ca" + ln.get('href'))
    if(len(weblinks)==10):
        break


for elnk in weblinks:
    pagedet = requests.get(elnk)
    soupdet = BeautifulSoup(pagedet.content, 'lxml')
    if len(soupdet.select(".caucus > a")) > 1:
        for ln1 in soupdet.select(".caucus > :not(a[target])"):
            mat = re.search(r'mailto:\w*\.\w*@parl.gc.ca', ln1.get('href'))
            if mat:
                email.append(mat.group())
            else:
                email.append("No Email Found")
    else:
        for ln1 in soupdet.select(".caucus > a"):
            mat = re.search(r'mailto:\w*\.\w*@parl.gc.ca', ln1.get('href'))
            if mat:
                email.append(mat.group())
            else:
                email.append("No Email Found")

print(email)
print("Len Email:",len(email))

Output:

['mailto:Ziad.Aboultaif@parl.gc.ca', 'mailto:Dan.Albas@parl.gc.ca', 'mailto:harold.albrecht@parl.gc.ca', 'mailto:John.Aldag@parl.gc.ca', 'mailto:Omar.Alghabra@parl.gc.ca', 'mailto:Leona.Alleslev@parl.gc.ca', 'mailto:dean.allison@parl.gc.ca', 'No Email Found', 'No Email Found', 'mailto:Gary.Anand@parl.gc.ca']

Len Email: 10
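To see why two similar anchors in one block produce two list entries, here is a small offline sketch. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without a network connection, and the markup in `SAMPLE` is made up: it only imitates the situation described above and is not copied from the real page. The class name `AnchorCollector` is likewise hypothetical.

```python
import re
from html.parser import HTMLParser

EMAIL_RE = re.compile(r"mailto:\w*\.\w*@parl\.gc\.ca")


class AnchorCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


# Made-up markup: one block holding a non-email anchor and an email anchor.
SAMPLE = """
<div class="caucus">
  <a target="_blank" href="https://www.example.com/party">Party</a>
  <a href="mailto:Jane.Doe@parl.gc.ca">Jane.Doe@parl.gc.ca</a>
</div>
"""

collector = AnchorCollector()
collector.feed(SAMPLE)

# Pattern from the question: one list entry per anchor, so a single page
# contributes two entries.
per_anchor = []
for href in collector.hrefs:
    mat = EMAIL_RE.search(href or "")
    per_anchor.append(mat.group() if mat else "No Email Found")

# Alternative: look only at mailto anchors, then add a single fallback
# entry when the page has none.
mailto_hrefs = [h for h in collector.hrefs if h and h.startswith("mailto:")]
per_page = mailto_hrefs[:1] or ["No Email Found"]

print(per_anchor)
print(per_page)
```

The first list has two entries for one page, which matches the duplicate symptom in the question; the second has exactly one. Filtering on the `mailto:` prefix is an alternative to excluding anchors that carry a `target` attribute, and either choice depends on how the real page is structured.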