Parsing HTML with Python

Question

3

I'm using BeautifulSoup to extract some data from the search results on this site: http://www.cpso.on.ca/docsearch/default.aspx Here is a sample of the HTML after running it through .prettify():

<tr>
 <td>
  <a class="doctor" href="details.aspx?view=1&amp;id= 72374">
   Smith, Jane
  </a>
  (#72374)
 </td>
 <td>
  Suite 042
  <br />
  21 Jump St
  <br />
  Toronto&nbsp;ON&nbsp;&nbsp;M4C 5T2
  <br />
  Phone:&nbsp;(555) 555-5555
  <br />
  Fax:&nbsp;(555) 555-555
 </td>
 <td align="center">
 </td>
</tr>

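Basically, each tr block contains 3 td blocks.

I would like the output to be:

Smith, Jane Suite 042 21 Jump St Toronto ON M4C 5T2

I also need the entries separated by new lines.

I'm having trouble writing out the address stored in the second td block.

I'm also writing this to a file.

This is my code so far... it doesn't work :p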

for tr in soup.findAll('tr'):
    #td1 = tr.td
    td2 = tr.td.nextSibling.nextSibling 

    for a in tr.findAll('a'):
        target.write(a.string)
        target.write(" ")

    for i in range(len(td2.contents)):
        if i != None:
            target.write(td2.contents[i].string)
            target.write(" ")
    target.write("\n")

- KylePDM

1

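Your first for loop is missing a : and the inner loop isn't indented. Is that the actual code or a mistake made when posting? - Jacob

Yes, my mistake. Python is just a language I'm picking up to parse HTML quickly. - KylePDM

I'm also considering not looping over td and a at all, and instead just making 2 temporary td values while looping over tr. - KylePDM

You say it "doesn't work". What exactly does it do? - Michael Lorton

1

The text "Suite 042 ..." is not inside <a></a>, so why do you expect the code to print it? - Francis Avila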

3 Answers

1

This should get you most of what you need:

import os
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

with open('output.txt', 'wb') as stream:
    for tr in soup.findAll('tr')[1:]: # [1:] skips the header
        columns = tr.findAll('td')
        line = [columns[0].a.string.strip()]
        for item in (item.strip() for item in columns[1].findAll(text=True)):
            if (item and not item.startswith('Phone:')
                and not item.startswith('Fax:')):
                line.append(item)
        stream.write(' '.join(line).encode('utf-8'))
        stream.write(os.linesep)

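Update

Added some code showing how to write the names and addresses to a file.

Also changed the output so that the name and address appear on the same line, and the phone and fax numbers are omitted.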
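
For readers on Python 3, where the BeautifulSoup 3 import above is not available, the same idea (take the link text from the first td, then the text nodes of the second td minus the Phone and Fax lines) can be sketched with the standard library alone. This is not the code from the answer: it swaps BeautifulSoup for `html.parser.HTMLParser`, and the `RowParser` class and `sample` string are hypothetical names introduced only for illustration.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collect one 'name address' line per <tr> (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes &nbsp; to the character \xa0.
        super().__init__(convert_charrefs=True)
        self.rows = []        # finished output lines
        self.cells = None     # per-<td> lists of text chunks for the current row
        self.in_link = False  # True while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.cells = []
        elif tag == 'td' and self.cells is not None:
            self.cells.append([])
        elif tag == 'a':
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_link = False
        elif tag == 'tr' and self.cells is not None:
            # Rows without a linked name (such as a header row) are skipped.
            if len(self.cells) >= 2 and self.cells[0]:
                self.rows.append(' '.join(self.cells[0] + self.cells[1]))
            self.cells = None

    def handle_data(self, data):
        if not self.cells:
            return
        # split() treats \xa0 as whitespace, so this also normalises &nbsp;.
        text = ' '.join(data.split())
        if not text:
            return
        index = len(self.cells) - 1
        if index == 0 and self.in_link:
            self.cells[0].append(text)   # doctor's name
        elif index == 1 and not text.startswith(('Phone:', 'Fax:')):
            self.cells[1].append(text)   # one address line


sample = """
<tr>
 <td><a class="doctor" href="details.aspx?view=1&amp;id= 72374">Smith, Jane</a> (#72374)</td>
 <td>Suite 042<br />21 Jump St<br />Toronto&nbsp;ON&nbsp;&nbsp;M4C 5T2<br />
  Phone:&nbsp;(555) 555-5555<br />Fax:&nbsp;(555) 555-555</td>
 <td align="center"></td>
</tr>
"""

parser = RowParser()
parser.feed(sample)
parser.close()
print('\n'.join(parser.rows))
```

Each entry in `parser.rows` is one output line, so writing them to a file opened in text mode with an explicit UTF-8 encoding would give one doctor per line.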

- ekhumoro

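When I changed the print call to .write, I got a UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0'. - KylePDM

Still no luck. According to http://www.crummy.com/software/BeautifulSoup/documentation.html#Why can't Beautiful Soup print out the non-ASCII characters I gave it?, it could be a problem with either my Python installation or BeautifulSoup itself. Going by that, it looks like a BeautifulSoup problem. - KylePDM

@KylePDM. It is very unlikely to be a problem with either Python or BeautifulSoup. What does your code for reading the HTML and writing the output look like? - ekhumoro

Thanks for looking at my question! I'll try to remove the phone/fax and put each doctor's information on one line. - KylePDM

If I had enough reputation to upvote you, I would. For some reason nothing is being written to the output file. But I think I can debug that and fix it :D - KylePDM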
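
The UnicodeEncodeError reported in the comments above comes from the non-breaking space (`\xa0`) that `&nbsp;` decodes to, which has no ASCII representation. The sketch below, written for Python 3 rather than the Python 2 used in the thread, shows the failure and two possible ways around it; the `text` value is taken from the sample row and the other variable names are only illustrative.

```python
import io

# \xa0 is the non-breaking space produced by decoding &nbsp;.
text = 'Toronto\xa0ON\xa0\xa0M4C 5T2'

# Encoding to ASCII fails, which is what an implicit ASCII write amounts to.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Option 1: encode explicitly as UTF-8 before writing to a binary stream.
encoded = text.encode('utf-8')

# Option 2: replace the non-breaking spaces and collapse repeated spaces.
plain = ' '.join(text.replace('\xa0', ' ').split())

# A text stream with an explicit encoding accepts the original string as is.
# An in-memory buffer stands in for the output file here.
buffer = io.BytesIO()
writer = io.TextIOWrapper(buffer, encoding='utf-8', newline='\n')
writer.write(text + '\n')
writer.flush()

print(ascii_ok, plain)
```

In a real script the in-memory buffer would typically be replaced by `io.open('output.txt', 'w', encoding='utf-8')`, which behaves the same way on Python 2 and Python 3.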

1

I would try this:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(your_html, 
               convertEntities=BeautifulSoup.HTML_ENTITIES)

for tr in soup.findAll('tr'):
   td = tr.findAll('td')

   target.write(td[0].a.string)
   target.write(' ')

   # Join all text nodes except the last two (phone and fax) with single spaces.
   target.write(' '.join(text.strip() for text in td[1].findAll(text=True)[:-2]))
   target.write("\n")

- soulcheck

This just gives me a list of names separated by newlines. - KylePDM

- Derek Litz · Accepted Answer

In [243]: soup.getText(' ').replace('&nbsp;', ' ').strip()
Out[243]: u'Smith, Jane (#72374)  Suite 042 21 Jump St Toronto ON  M4C 5T2 Phone: (555) 555-5555 Fax: (555) 555-555'

To get what you want:

In [246]: address = soup.getText(' ').replace('&nbsp;', ' ').strip()
In [247]: import re
In [248]: address = re.sub(r' Phone.*$', '', address)
In [249]: address = address.replace('  ', ' ')
In [250]: address = re.sub(r' \(.*?\)', '', address)
In [251]: print address
Smith, Jane Suite 042 21 Jump St Toronto ON M4C 5T2
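
The interactive steps above can be folded into one reusable function. The following is a minimal Python 3 sketch under a few assumptions: the function name `clean_entry` is hypothetical, the input is the flattened text of a single row, and the registration number always has the form `(#digits)`. It also collapses whitespace with a regular expression instead of the single `replace('  ', ' ')` call used above.

```python
import re


def clean_entry(text):
    """Reduce the flattened text of one row to 'Name Address'."""
    # Treat the literal entity and the decoded non-breaking space as plain spaces.
    text = text.replace('&nbsp;', ' ').replace('\xa0', ' ')
    # Drop everything from the phone number onward. This must happen before
    # the parentheses are handled, because the phone number contains "(555)".
    text = re.sub(r'\s*Phone:.*$', '', text, flags=re.DOTALL)
    # Drop the parenthesised registration number, e.g. "(#72374)".
    text = re.sub(r'\s*\(#\d+\)', '', text)
    # Collapse any run of whitespace into a single space.
    return re.sub(r'\s+', ' ', text).strip()


raw = ('Smith, Jane (#72374)  Suite 042 21 Jump St '
       'Toronto&nbsp;ON&nbsp;&nbsp;M4C 5T2 '
       'Phone:&nbsp;(555) 555-5555 Fax:&nbsp;(555) 555-555')
print(clean_entry(raw))
```

Applied to the text of each tr in turn, this produces one line per doctor that can be written to the output file followed by a newline.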