Why doesn't a Python regular expression work on a formatted HTML string?

Question

4

from bs4 import BeautifulSoup
import urllib
import re

soup = urllib.urlopen("http://atlanta.craigslist.org/cto/")
soup = BeautifulSoup(soup)
souped = soup.p
print souped
m = re.search("\\$.",souped)
print m.group(0)

I can download and print the HTML successfully, but as soon as I add the last two lines, the script fails.

I get this error:

Traceback (most recent call last):
  File "C:\Python27\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 323, in RunScript
    debugger.run(codeObject, __main__.__dict__, start_stepping=0)
  File "C:\Python27\Lib\site-packages\pythonwin\pywin\debugger\__init__.py", line 60, in run
    _GetCurrentDebugger().run(cmd, globals,locals, start_stepping)
  File "C:\Python27\Lib\site-packages\pythonwin\pywin\debugger\debugger.py", line 655, in run
    exec cmd in globals, locals
  File "C:\Users\Zack\Documents\Scripto.py", line 1, in <module>
    from bs4 import BeautifulSoup
  File "C:\Python27\lib\re.py", line 142, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer

Many thanks!

- user1232812

3 Answers

1

You can pass a regular expression as the search criterion to the .find() method:

>>> from bs4 import BeautifulSoup
>>> from urllib2 import urlopen # from urllib.request import urlopen
>>> import re
>>> page = urlopen("http://atlanta.craigslist.org/cto/")
>>> soup = BeautifulSoup(page)
>>> soup.find('p', text=re.compile(r"\$."))
' -\n\t\t\t $7500'

soup.p returns a Tag object. You can convert it to a string with str() or unicode():

>>> p = soup.p
>>> str(p)
'<p class="row">\n<span class="ih" id="images:5Nb5I85J83N73p33H6
c2pd3447d5bff6d1757.jpg">\xa0</span>\n<a href="http://atlanta.cr
aigslist.org/nat/cto/2870295634.html">2000 Lexus RX 300</a> -\n\
t\t\t $7500<font size="-1"> (Buford)</font> <span class="p"> pic
\xa0img</span><br class="c" />\n</p>'
>>> re.search(r"\$.", str(p)).group(0)
'$7'
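
Note that the pattern \$. matches the dollar sign plus exactly one more character, which is why the result is '$7' rather than the whole price. If the full amount is wanted (an assumption about the goal, not something stated in the thread), a pattern such as \$\d+ is one option. A minimal sketch, using a shortened copy of the str(p) output above as the hypothetical input:

```python
import re

# Shortened from the str(p) output shown above; used here only as sample text.
text = '2000 Lexus RX 300</a> -\n\t\t\t $7500<font size="-1"> (Buford)</font>'

# "\$." = a literal dollar sign followed by any single character.
one_char = re.search(r"\$.", text).group(0)

# "\$\d+" = a literal dollar sign followed by one or more digits.
full_price = re.search(r"\$\d+", text).group(0)

print(one_char)    # $7
print(full_price)  # $7500
```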

- jfs

1

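Because souped is an object; print converts it to text for you. If you want to use it as text in any other context, as you do here, convert it first, e.g. with str(souped), or with unicode(souped) if it contains Unicode text.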
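
The difference between "printable" and "is a string" can be reproduced without BeautifulSoup. The sketch below uses FakeTag, a hypothetical stand-in class (not BeautifulSoup's real Tag), and Python 3 syntax; it only assumes that the real object, like this one, defines __str__:

```python
import re


class FakeTag:
    """Hypothetical stand-in for a parsed tag; not BeautifulSoup's real Tag."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        # print() and str() both call this to obtain a text form.
        return self.markup


souped = FakeTag('<p class="row">2000 Lexus RX 300 - $7500</p>')

print(souped)  # works: print() converts the object through __str__

try:
    re.search(r"\$.", souped)  # fails: re.search() needs an actual string
    error = None
except TypeError as exc:
    error = exc

# Converting explicitly first gives re.search() a real string to scan.
match = re.search(r"\$.", str(souped))
print(match.group(0))  # $7
```

The TypeError raised in the try block corresponds to the one in the question's traceback; the explicit str() call is the only change needed to make the search succeed.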

- Zsolt Botykai

- Roman Bodnarchuk · Accepted Answer

6

You probably want to use re.search("\\$.", str(souped)).

- Roman Bodnarchuk

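To elaborate: BeautifulSoup objects have a __str__() method that converts them to strings so they can be pretty-printed (print calls it automatically), but they are not actually strings, and re.search() needs a string. So you must explicitly convert the HTML to a string before you can search it. - kindall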

+1. I'd use unicode() instead of str() if possible, and add the re.U flag. - Bite code
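
For readers on Python 3 (an assumption; the thread itself uses Python 2), there is no unicode() builtin: str() already produces Unicode text, and patterns given as str are Unicode-aware by default, so re.U is implied. A small sketch with a made-up sample string, using re.ASCII to show what ASCII-only matching looks like by contrast:

```python
import re

# Made-up sample text containing a non-ASCII letter.
text = "caf\u00e9 $7500"

# Default for str patterns in Python 3: \w is Unicode-aware and matches "é".
default_word = re.search(r"\w+", text).group(0)

# re.ASCII restricts \w to [a-zA-Z0-9_], so the match stops before "é".
ascii_word = re.search(r"\w+", text, re.ASCII).group(0)

print(default_word)
print(ascii_word)
```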