Parsing web pages with BeautifulSoup - skipping 404 error pages

I am using the following code (Python 2) to fetch the titles of websites.
from bs4 import BeautifulSoup
import urllib2

line_in_list = ['www.dailynews.lk','www.elpais.com','www.dailynews.co.zw']

for websites in line_in_list:
    url = "http://" + websites
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    site_title = soup.find_all("title")
    print site_title

If the list contains a "bad" (non-existent) site/page, or a site returns an error such as "404 Page Not Found", the script raises an exception and stops.
How can I make the script ignore/skip the "bad" (non-existent) and broken sites/pages?
1 Answer

line_in_list = ['www.dailynews.lk','www.elpais.com',"www.no.dede",'www.dailynews.co.zw']

for websites in line_in_list:
    url = "http://" + websites
    try:
        page = urllib2.urlopen(url)
    except Exception as e:
        print(e)
        continue

    soup = BeautifulSoup(page.read())
    site_title = soup.find_all("title")
    print(site_title)

[<title>Popular News Items | Daily News Online : Sri Lanka's National News</title>]
[<title>EL PAÍS: el periódico global</title>]
<urlopen error [Errno -2] Name or service not known>
[<title>
DailyNews - Telling it like it is
</title>]
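For reference, here is a minimal Python 3 sketch of the same idea using only the standard library (the `fetch_title` helper and `TitleParser` class are my own names, not from the original answer). It catches `URLError` (which covers `HTTPError` and DNS failures) rather than a bare `Exception`, so unrelated bugs still surface instead of being silently skipped:

```python
from html.parser import HTMLParser
from urllib.error import URLError
from urllib.request import urlopen

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Only start collecting for the first <title> encountered
        if tag == "title" and not self.title:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def fetch_title(url, timeout=10):
    """Return the page's <title> text, or None if the site is unreachable."""
    try:
        # Raises URLError/HTTPError for non-existent hosts, 404s, etc.
        page = urlopen(url, timeout=timeout)
    except (URLError, OSError) as e:
        print("skipping %s: %s" % (url, e))
        return None
    parser = TitleParser()
    parser.feed(page.read().decode("utf-8", errors="replace"))
    return parser.title.strip() or None

if __name__ == "__main__":
    for site in ["www.dailynews.lk", "www.no.dede"]:
        print(site, "->", fetch_title("http://" + site))
```

Catching the specific network exceptions keeps the loop robust against bad hosts while letting programming errors (e.g. a typo in the parsing code) still raise normally.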
