Optimizing BeautifulSoup (Python) code

Question

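I have some code that uses the BeautifulSoup library for parsing, but it is very slow. The code is written in a way that does not allow threads to be used.

I parse with BeautifulSoup and then save to a database. If I comment out the save statement, it still takes a long time, so the database is not the problem.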

def parse(self,text):                
    soup = BeautifulSoup(text)
    arr = soup.findAll('tbody')                

    for i in range(0,len(arr)-1):
        data=Data()
        soup2 = BeautifulSoup(str(arr[i]))
        arr2 = soup2.findAll('td')

        c=0
        for j in arr2:                                       
            if str(j).find("<a href=") > 0:
                data.sourceURL = self.getAttributeValue(str(j),'<a href="')
            else:  
                if c == 2:
                    data.Hits=j.renderContents()

            #and few others...

            c = c+1

            data.save()

Any suggestions?

Note: I have already asked this question here, but it was closed because the information was incomplete.

- developer

1 Answer

- interjay · Accepted Answer

soup2 = BeautifulSoup(str(arr[i]))
arr2 = soup2.findAll('td')

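Don't do this: call arr2 = arr[i].findAll('td') directly. Each element returned by findAll is already a parsed, searchable tag, so converting it to a string and parsing it again repeats work that has already been done.

This will also be slow: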
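As a minimal sketch of that point, the snippet below searches inside each `tbody` tag directly instead of re-parsing its string form. It assumes the newer `bs4` package (where `find_all` is the current spelling of `findAll`) and the built-in `html.parser`; the sample markup and the name `cells_per_body` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original question.
HTML = """
<table>
  <tbody><tr><td>a</td><td>b</td></tr></tbody>
  <tbody><tr><td>c</td></tr></tbody>
</table>
"""

# Parse the document exactly once.
soup = BeautifulSoup(HTML, "html.parser")

# Each tbody is a Tag, so find_all can be called on it directly;
# no second BeautifulSoup(str(tbody)) parse is needed.
cells_per_body = [
    [td.get_text() for td in tbody.find_all("td")]
    for tbody in soup.find_all("tbody")
]
print(cells_per_body)
```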

if str(j).find("<a href=") > 0:
    data.sourceURL = self.getAttributeValue(str(j),'<a href="')

Assuming getAttributeValue returns the href attribute, use this instead:

a = j.find('a', href=True)       # find the first <a> that has an href attribute
if a:
    data.sourceURL = a['href']
else:
    pass  # ....
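To illustrate how `find('a', href=True)` behaves, here is a small sketch, again assuming `bs4` with `html.parser`. The markup and the names `cell`, `empty`, `href` and `missing` are hypothetical. The `href=True` filter skips `<a>` tags without an `href` attribute, and `find` returns `None` when nothing matches, which is why the result can be tested before indexing.

```python
from bs4 import BeautifulSoup

# A cell whose first <a> has no href; only the second one should match.
cell = BeautifulSoup(
    '<td><a name="top">anchor</a><a href="/page">link</a></td>',
    "html.parser",
).td

link = cell.find("a", href=True)          # first <a> that has an href
href = link["href"] if link is not None else None

# A cell with no link at all: find returns None rather than raising.
empty = BeautifulSoup("<td>42</td>", "html.parser").td
missing = empty.find("a", href=True)

print(href, missing)
```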

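In general, if all you want is to parse and extract values, you never need to convert BeautifulSoup objects back to strings. Because find and findAll return searchable objects, you can keep searching by calling find/findAll/etc. on the results.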
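Putting the advice together, the question's loop might be restructured roughly as follows. This is only a sketch under several assumptions: it uses `bs4` with `html.parser`, the `Row` dataclass is a hypothetical stand-in for the question's `Data` model (no database save is performed), `get_text` replaces `renderContents` and so returns text rather than inner markup, and every `tbody` is visited, whereas the original loop stops one short of the last.

```python
from dataclasses import dataclass
from typing import List, Optional

from bs4 import BeautifulSoup


@dataclass
class Row:
    """Hypothetical stand-in for the question's Data model."""
    source_url: Optional[str] = None
    hits: Optional[str] = None


def parse(text: str) -> List[Row]:
    soup = BeautifulSoup(text, "html.parser")  # the only parse of the document
    rows = []
    for tbody in soup.find_all("tbody"):
        row = Row()
        # enumerate replaces the manual counter c from the question.
        for index, cell in enumerate(tbody.find_all("td")):
            link = cell.find("a", href=True)
            if link is not None:
                row.source_url = link["href"]
            elif index == 2:
                row.hits = cell.get_text(strip=True)
        rows.append(row)  # a real version would save the record here
    return rows


# Hypothetical sample input for illustration.
SAMPLE = """
<table>
<tbody><tr><td><a href="http://example.com/a">A</a></td><td>x</td><td>42</td></tr></tbody>
<tbody><tr><td><a href="http://example.com/b">B</a></td><td>y</td><td>7</td></tr></tbody>
</table>
"""

result = parse(SAMPLE)
for row in result:
    print(row)
```

No string conversion or substring search happens inside the loop; all lookups go through the already-built parse tree.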