Is there a Python tool similar to readability.js?

Question


javascript python html-content-extraction heuristics

16

I'm looking for a Python package/module/function etc. that is roughly equivalent to Arc90's readability.js.

http://lab.arc90.com/experiments/readability

http://lab.arc90.com/experiments/readability/js/readability.js

I want to give it some input.html and get back a cleaned-up version of that HTML page's "main text". I need this so that I can use it on the server side (unlike the JS version, which runs only in the browser).

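Any ideas?

PS: I have tried Rhino + env.js. That combination works, but the performance is unacceptable: it takes minutes to clean up most HTML content :( (I still haven't found out why there is such a big performance difference).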

- Emre Sevinç

6 Answers

4

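We've launched a new natural language processing API at repustate.com. Using a REST API, you can clean any HTML or PDF and get back just the text parts. Our API is free, so feel free to use it. It is implemented in Python. Try it out and compare the results with readability.js - I think you'll find they are almost identical.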

- Martin

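Hmm, looks promising! ;-) I'll give it a try. Are there any hard limits? How many pages can I process per day, etc.? - Emre Sevinç

Wow, I just used your site to input some URLs, and it extracted the articles perfectly. - IgorGanapolsky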

2

hn.py, via the Readability blog. Readable Feeds, an App Engine application, makes use of it. I have bundled it as a pip-installable module here: http://github.com/srid/readability.

- Sridhar Ratnakumar

1

This appears to be a very old version of readability compared with what is available now: 0.4 vs. 1.7.1. Any chance of an update? - Emil Stenström

1

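I did some research on this in the past and ended up implementing this approach [pdf] in Python. The final version I implemented also did some cleanup, such as removing head/script/iframe elements, hidden elements, etc., but this was the core of it.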
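The cleanup pass mentioned above is not shown in this answer. As a rough illustration only, here is a minimal standard-library sketch of that kind of pre-filter. The names `SKIP_TAGS`, `is_hidden`, `VisibleText` and `visible_text` are hypothetical, the choice of which tags count as removable is an assumption, and the sketch assumes reasonably well-nested markup. It is not the answerer's implementation.

```python
from html.parser import HTMLParser

# Elements whose whole subtree is dropped; this particular set is an assumption.
SKIP_TAGS = {"head", "script", "style", "iframe"}
# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def is_hidden(attrs):
    """Return True for a `hidden` attribute or an inline display:none style."""
    attrs = dict(attrs)
    style = (attrs.get("style") or "").replace(" ", "").lower()
    return "hidden" in attrs or "display:none" in style


class VisibleText(HTMLParser):
    """Collect text that lies outside skipped or hidden subtrees."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a subtree that is being dropped
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a dropped subtree, every nested element deepens it.
        if self.skip_depth or tag in SKIP_TAGS or is_hidden(attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text left after dropping skipped and hidden subtrees."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

A real pipeline would more likely remove these nodes from a parsed tree (for example with lxml) so that the remaining markup survives; this sketch keeps only the text to stay short.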

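Here is a function with a (very) naive implementation of the "link list" discriminator, which attempts to remove elements that have a high ratio of link text to total text (i.e. navigation bars, menus, ads, etc.):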

def link_list_discriminator(html, min_links=2, ratio=0.5):
    """Remove blocks with a high link to text ratio.

    These are typically navigation elements.

    Based on an algorithm described in:
        http://www.psl.cs.columbia.edu/crunch/WWWJ.pdf

    :param html: lxml.html element tree (drop_tree() is an lxml.html method).
    :param min_links: Minimum number of links inside an element
                      before considering a block for deletion.
    :param ratio: Ratio of link text to all text before an element is considered
                  for deletion.
    """
    def collapse(strings):
        # Strip each text node and join the non-empty ones.
        return u''.join(filter(None, (text.strip() for text in strings)))

    # FIXME: This doesn't account for top-level text...
    for el in html.xpath('//*'):
        # The root element has no parent, so drop_tree() cannot remove it.
        if el.getparent() is None:
            continue
        anchor_text = el.xpath('.//a//text()')
        # Note: this counts text nodes inside links, not <a> elements.
        anchor_count = len(anchor_text)
        anchor_len = float(len(collapse(anchor_text)))
        text_len = float(len(collapse(el.xpath('.//text()'))))
        if anchor_count > min_links and text_len and anchor_len / text_len > ratio:
            el.drop_tree()

It actually worked quite well on the test corpus I used, but achieving high reliability will require a lot of tweaking.
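To make the ratio test concrete, here is a small stand-alone sketch that measures the share of link text in an HTML fragment using only the standard library. `LinkTextCounter` and `link_text_ratio` are hypothetical names introduced for illustration. This is not the lxml-based function above, only the same idea applied to a single fragment.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Count stripped text characters inside <a> elements and overall."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0   # > 0 while inside an <a> element
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.link_depth:
            self.link_chars += n


def link_text_ratio(fragment):
    """Return link text length divided by total text length (0.0 if no text)."""
    counter = LinkTextCounter()
    counter.feed(fragment)
    counter.close()
    if not counter.total_chars:
        return 0.0
    return counter.link_chars / counter.total_chars


nav = ('<ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li>'
       '<li><a href="/about">About</a></li></ul>')
article = ('<p>Readability scores blocks of text. See the '
           '<a href="/docs">docs</a> for details.</p>')

print(link_text_ratio(nav))            # all text is link text: 1.0
print(link_text_ratio(article) > 0.5)  # mostly plain text: False
```

With the default `ratio=0.5`, a block like `nav` would be a candidate for removal while `article` would be kept. Where to set the threshold is exactly the kind of tuning the answer refers to.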

- Alec Thomas

0

Why not try Google V8/Node.js instead of Rhino? It should be acceptably fast.

- Vinay Sajip

Can env.js run on V8/Node.js, so that I have a browser-like environment? - Emre Sevinç

-3

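I think BeautifulSoup is the best HTML parser for Python. But you still need to figure out what the "main" part of the site is.

If you are only parsing a single domain, that is fairly straightforward, but finding a pattern that works for any site is not so easy.

Maybe you could port the readability.js approach to Python?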

- eikes


- Yuri Baburov · Accepted Answer

Please try my fork, https://github.com/buriy/python-readability, which is fast and has all the features of the latest JavaScript version.