Beautiful Soup article scraping

Question

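I am trying to get all of the p tags in the body of an article. Can anyone explain what is wrong with my code and how I could improve it? The article URL and the relevant code are below. Thanks for any insight.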

URL: http://www.france24.com/en/20140310-libya-seize-north-korea-crude-oil-tanker-rebels-port-rebels/

import urllib2
from bs4 import BeautifulSoup

# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")

soup = BeautifulSoup(urllib2.urlopen(url).read())

# retrieve all of the paragraph tags
body = soup.find("div", {'class':'bd'}).get_text()
for tag in body:
    p = soup.find_all('p')
    print str(p) + '\n' + '\n'

- user3285763

1 Answer

- alecxe · Accepted Answer

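The problem is that the page contains multiple div tags with class "bd". It looks like you need the one that holds the actual article, which is inside the article tag: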

import urllib2
from bs4 import BeautifulSoup

# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")

soup = BeautifulSoup(urllib2.urlopen(url))

# retrieve all of the paragraph tags
paragraphs = soup.find('article').find("div", {'class': 'bd'}).find_all('p')
for paragraph in paragraphs:
    print paragraph.text

Prints:

Libyan government forces on Monday seized a North Korea-flagged tanker after...
...
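
For readers on Python 3, where urllib2 and the print statement are no longer available, the same lookup can be sketched roughly as follows. This is only an illustration under stated assumptions: SAMPLE_HTML is a made-up stand-in for the real page (it is not the France 24 markup), article_paragraphs is a hypothetical helper name, and the built-in "html.parser" backend is chosen simply so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page (an assumption, not the real site):
# two div.bd blocks, only one of which sits inside <article>.
SAMPLE_HTML = """
<html><body>
  <div class="bd"><p>Sidebar teaser</p></div>
  <article>
    <div class="bd">
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
  </article>
</body></html>
"""


def article_paragraphs(html):
    """Return the text of each <p> inside the article's div.bd, or [] if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Narrow the search to <article> first so other div.bd blocks are ignored.
    article = soup.find("article")
    if article is None:
        return []
    body = article.find("div", class_="bd")
    if body is None:
        return []
    return [p.get_text(strip=True) for p in body.find_all("p")]


if __name__ == "__main__":
    for text in article_paragraphs(SAMPLE_HTML):
        print(text)
```

With a live page, the HTML could be fetched with urllib.request.urlopen(url).read() and passed to the helper in place of SAMPLE_HTML. Returning an empty list when the article or div is missing avoids the AttributeError that a chained find(...).find(...) call raises when a tag is not found.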

Hope that helps.