Parsing HTML pages to get the contents of <p> and <b> tags

Question

5

There are many HTML pages whose structure is a sequence of groups like this:

<p>
   <b> Keywords/Category:</b>
   "keyword_a, keyword_b"
</p>

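The pages have addresses like https://some.page.org/year/0001, https://some.page.org/year/0002, and so on.

How can I extract the keywords from each of these pages separately? I tried using BeautifulSoup, without success. All I managed to write is a program that prints the group titles (the text between <b> and </b>).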

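from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 replacement for urllib2

html_doc = urlopen('https://some.page.org/2018/1234').read()
soup = BeautifulSoup(html_doc, 'html.parser')
# Print every link on the page as an absolute URL.
for link in soup.find_all('a', href=True):
    print('https://some.page.org' + link['href'])
# Print the text of every <b> tag (the group titles).
for node in soup.find_all('b'):
    print(''.join(node.find_all(string=True)))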

- Rinat Shakirov

It looks like the data is in the p tag, but your code selects the b tags. I think you should select the p tags instead. - t.m.adam

2

+1 for not using regular expressions! - Eb946207

4 Answers

1

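You need to split the string (the URL in this case) into parts on the / character.

Then you can pick the parts you want.

For example, if the URL is https://some.page.org/year/0001, I use the split function to break the URL apart at each / character.

That turns it into a list. I then select the items I need and convert them back into a string with the ''.join() method. You can read more about the split method at the link.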
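
A minimal sketch of the splitting described above, using only the example URL from the question. The variable names are illustrative, and the join uses "/" rather than an empty separator so that the result is a valid URL again; building the address of the next page is an assumed use case, not something the answer spells out.

```python
url = "https://some.page.org/year/0001"

# Splitting on "/" yields the scheme, an empty string (from "//"),
# the host, and then the path segments.
parts = url.split("/")
print(parts)  # ['https:', '', 'some.page.org', 'year', '0001']

# Pick the pieces of interest, here the last two path segments.
year, page_id = parts[-2], parts[-1]

# Build the address of the next page, keeping the zero padding.
next_id = str(int(page_id) + 1).zfill(len(page_id))
next_url = "/".join(parts[:-1] + [next_id])
print(next_url)  # https://some.page.org/year/0002
```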

- Mohammad Ansari

1

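There are different ways to parse the desired categories and keywords out of this kind of HTML structure, but here is one of them using BeautifulSoup:

find the b elements whose text ends with :
use .next_sibling to get the next text node, which contains the keywords

Working example: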

from bs4 import BeautifulSoup


data = """
<div>
    <p>
       <b> Category 1:</b>
       "keyword_a, keyword_b"
    </p>
    <p>
       <b> Category 2:</b>
       "keyword_c, keyword_d"
    </p>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

# string= is the current name of the older text= argument (bs4 4.4.0+)
for category in soup('b', string=lambda text: text and text.endswith(":")):
    keywords = category.next_sibling.strip('" \n').split(", ")

    print(category.get_text(strip=True), keywords)

Output:

Category 1: ['keyword_a', 'keyword_b']
Category 2: ['keyword_c', 'keyword_d']
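
To apply the same idea to many pages, the lookup could be wrapped in a function that takes one page's HTML and returns a dictionary. This is only a sketch: `extract_keywords` and `sample` are hypothetical names, fetching the pages is left out, and it assumes every label is a b element whose text ends with a colon and is directly followed by a text node.

```python
from bs4 import BeautifulSoup


def extract_keywords(html):
    """Map each '<b>Label:</b>' to the list of keywords that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for label in soup.find_all("b"):
        name = label.get_text(strip=True)
        sibling = label.next_sibling
        # Skip <b> tags that are not labels or have no text node after them.
        if not name.endswith(":") or not isinstance(sibling, str):
            continue
        keywords = [k.strip() for k in sibling.strip('" \n').split(",")]
        result[name.rstrip(":")] = keywords
    return result


sample = """
<p><b> Category 1:</b> "keyword_a, keyword_b"</p>
<p><b> Category 2:</b> "keyword_c"</p>
<p><b>Not a label</b></p>
"""
print(extract_keywords(sample))
```

The HTML of each page, however it is downloaded, could then be passed to the function one page at a time.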

- alecxe

0

Assuming each block looks like

<p>
   <b> Keywords/Category:</b>
   "keyword_a, keyword_b"
</p>

you want to extract keyword_a and keyword_b for each Keywords/Category. An example might be:

 <p>
    <b>Mammals</b>
    "elephant, rhino"
 </p>
 <p>
    <b>Birds</b>
    "hummingbird, ostrich"
 </p>

Once you have the HTML code, you can do the following:

from bs4 import BeautifulSoup

html = '''<p>
    <b>Mammals</b>
    "elephant, rhino"
    </p>
    <p>
    <b>Birds</b>
    "hummingbird, ostrich"
    </p>'''

soup = BeautifulSoup(html, 'html.parser')

p_elements = soup.find_all('p')
for p_element in p_elements:
    b_element = p_element.find('b')  # the <b> inside this <p>, not the first one in the whole document
    b_element.extract()
    category = b_element.text.strip()
    keywords = p_element.text.strip()
    keyword_a, keyword_b = keywords[1:-1].split(', ')
    print('Category:', category)
    print('Keyword A:', keyword_a)
    print('Keyword B:', keyword_b)

This prints:

Category: Mammals
Keyword A: elephant
Keyword B: rhino
Category: Birds
Keyword A: hummingbird
Keyword B: ostrich

- finefoot

Page content provided by Stack Overflow.

- Danielle M. · Accepted Answer

I can't test this without knowing the actual source format, but it looks like you want the text value of the <p> tags:

for node in soup.findAll('p'):
    print(node.text)
    # or: keywords = node.text.split(', ')
    # print(keywords)
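
Note that the text of a p tag also includes the text of the b tag nested inside it, so the label would end up mixed in with the keywords. The following sketch shows one possible way to separate the two using only the paragraph text. It assumes the markup from the question, where the label ends with a colon and the keywords are wrapped in double quotes.

```python
from bs4 import BeautifulSoup

html = """
<p>
   <b> Keywords/Category:</b>
   "keyword_a, keyword_b"
</p>
"""

soup = BeautifulSoup(html, "html.parser")
for node in soup.find_all("p"):
    # Collapse the paragraph to one line: 'Keywords/Category: "keyword_a, keyword_b"'
    text = node.get_text(" ", strip=True)
    # Everything before the first colon is the label, the rest is the keyword string.
    label, _, rest = text.partition(":")
    keywords = [k.strip() for k in rest.strip().strip('"').split(",")]
    print(label, keywords)
```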
