Don't parse the contents of <code> tags when parsing a document with BeautifulSoup

Question

10

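I am writing a blog application with Django. I want to let comment authors use a few tags (such as <strong>, <a>, etc.) but disable all others.

In addition, I want to let them put code in <code> tags and have pygments parse it.

For example, someone might write this comment: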

I like this article, but the third code example <em>could have been simpler</em>:

<code lang="c">
#include <stdbool.h>
#include <stdio.h>

int main()
{
    printf("Hello World\n");
}
</code>

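The problem is that when I parse the comment with BeautifulSoup to strip the disallowed HTML tags, it also parses the contents of the <code> block and treats <stdbool.h> and <stdio.h> as if they were HTML tags.

How can I tell BeautifulSoup not to parse <code> blocks? Or is there another HTML parser better suited to this job?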

- Dor

Please see the reference below. It deals with the same problem you are facing. - pyfunc

5 Answers

1

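The problem is that <code> is handled according to the normal rules for HTML markup: the content inside a <code> tag is still HTML (the tag exists mainly to drive CSS formatting, not to change the parsing rules).

What you are trying to create is a different markup language that is very similar to HTML but not identical. The simple solution is to assume certain rules, such as "<code> and </code> must each appear on a line by themselves", and do some preprocessing yourself.

One very simple, though not 100% reliable, trick is to replace ^<code>$ with <code><![CDATA[ and ^</code>$ with ]]></code>. It is not completely reliable because if the code block contains ]]>, things will go horribly wrong.
A safer option is to replace the dangerous characters inside code blocks (<, > and & are probably sufficient) with their equivalent character entity references (&lt;, &gt; and &amp;). You can do this by passing each code block you identify to cgi.escape(code_block).

Once you have finished preprocessing, submit the result to BeautifulSoup as usual.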
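
The preprocessing step described above could look roughly like the following. This is only a sketch under some assumptions: it targets Python 3, where `html.escape` replaces `cgi.escape` (which was removed in Python 3.8), it assumes `<code>` blocks are never nested, and the names `CODE_BLOCK` and `escape_code_blocks` are made up for illustration.

```python
import html
import re

# Matches one <code ...>...</code> block. The body match is non-greedy,
# so this assumes code blocks are not nested.
CODE_BLOCK = re.compile(r"(<code\b[^>]*>)(.*?)(</code>)", re.DOTALL | re.IGNORECASE)


def escape_code_blocks(comment):
    """Escape <, > and & inside every <code> block, leaving other markup alone."""
    def _escape(match):
        opening, body, closing = match.groups()
        # quote=False: only <, > and & are replaced, quotes stay as they are
        return opening + html.escape(body, quote=False) + closing

    return CODE_BLOCK.sub(_escape, comment)


comment = 'Nice <em>post</em>:\n<code lang="c">\n#include <stdio.h>\n</code>'
print(escape_code_blocks(comment))
```

The result can then be handed to BeautifulSoup, which should see `&lt;stdio.h&gt;` as text rather than as a tag.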

- Marcelo Cantos

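The second option seems to be the winner. How would I do that? With a regex, or some complicated string-processing algorithm? - Dor

@Dor: I've amended my answer to cover this. - Marcelo Cantos

I tried that, but apparently cgi.escape expects a string, not a BeautifulSoup tag object :) How can I escape the contents of the tag before parsing? - Dor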

1

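You should extract the text between <code> and </code> as per my original answer, escape it with cgi.escape, and join it all back together. Then (and only then) pass the whole thing to BeautifulSoup. - Marcelo Cantos

Marcelo Cantos: this is the main part of the question: how? - @Dor Oct 24 '10 at 15:47 - jfs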

0

Edit:

Use python-markdown2 to process the input, and have users indent their code sections.

>>> print html
I like this article, but the third code example <em>could have been simpler</em>:

    #include <stdbool.h>
    #include <stdio.h>

    int main()
    {
        printf("Hello World\n");
    }

>>> import markdown2
>>> marked = markdown2.markdown(html)
>>> marked
u'<p>I like this article, but the third code example <em>could have been simpler</em>:</p>\n\n<pre><code>#include &lt;stdbool.h&gt;\n#include &lt;stdio.h&gt;\n\nint main()\n{\n    printf("Hello World\\n");\n}\n</code></pre>\n'
>>> print marked
<p>I like this article, but the third code example <em>could have been simpler</em>:</p>

<pre><code>#include &lt;stdbool.h&gt;
#include &lt;stdio.h&gt;

int main()
{
    printf("Hello World\n");
}
</code></pre>

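If you still need to navigate and edit the result with BeautifulSoup, do the following. Include the entity conversion if you need "<" and ">" reinserted (instead of "&lt;" and "&gt;").

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(marked,
...                      convertEntities=BeautifulSoup.HTML_ENTITIES)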
>>> soup
<p>I like this article, but the third code example <em>could have been simpler</em>:</p>
<pre><code>#include <stdbool.h>
#include <stdio.h>

int main()
{
    printf("Hello World\n");
}
</code></pre>


import cgi  # Python 2; cgi.escape was removed in Python 3.8 (html.escape replaces it)

def thickened(soup):
    """
    Escape the contents of every <code> tag in the soup, e.g.

    <code>
    blah blah <entity> blah
        blah
    </code>
    """
    for code in soup.findAll('code'):  # get the code tags
        # take all the contents inside of the code tags and convert
        # them into a single string
        escape_me = ''.join(str(k) for k in code.contents)
        escaped = cgi.escape(escape_me)  # escape <, > and &
        # replace the Tag object with the escaped string
        code.replaceWith('<code>%s</code>' % escaped)
    return soup

- BenjaminGolder

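It produces artifacts such as </stdbool.h> and </stdio.h>. - jfs

@J.F.Sebastian: you are absolutely right. It worked for me, and I just realized the difference: I had already passed it through Markdown. Rewriting my answer. - BenjaminGolder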

0

Unfortunately, BeautifulSoup cannot be prevented from parsing the code blocks.

One solution for what you want to achieve is to:

1) Remove the code blocks.

soup = BeautifulSoup(unicode(content))
code_blocks = soup.findAll(u'code')
for block in code_blocks:
    block.replaceWith(u'<code class="removed"></code>')

2) Do the usual parsing to strip the disallowed tags.

3) Re-insert the code blocks and regenerate the HTML.

stripped_code = stripped_soup.findAll(u"code", u"removed")
# re-insert pygment formatted code
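
The three steps can also be sketched with one change: the code blocks are pulled out with a regular expression before anything is parsed, so the parser never sees their contents. This is a Python 3 sketch, not the approach from the linked blog post. The function names and the `[[code-block-N]]` placeholder format are invented for illustration, and it assumes `<code>` blocks are not nested. The default renderer only escapes the code; a pygments call could be passed as `render` instead.

```python
import html
import re

CODE_BLOCK = re.compile(r"<code\b[^>]*>(.*?)</code>", re.DOTALL | re.IGNORECASE)
PLACEHOLDER = re.compile(r"\[\[code-block-(\d+)\]\]")  # hypothetical marker format


def extract_code_blocks(comment):
    """Step 1: swap each <code> block for a numbered text placeholder."""
    blocks = []

    def _stash(match):
        blocks.append(match.group(1))
        return "[[code-block-%d]]" % (len(blocks) - 1)

    return CODE_BLOCK.sub(_stash, comment), blocks


def reinsert_code_blocks(sanitized, blocks, render=None):
    """Step 3: put the rendered code blocks back where the placeholders are."""
    if render is None:
        # Default: escape the raw code and wrap it in a plain <code> element.
        render = lambda source: "<code>" + html.escape(source, quote=False) + "</code>"

    def _restore(match):
        index = int(match.group(1))
        # Placeholders with no matching block are left untouched.
        return render(blocks[index]) if index < len(blocks) else match.group(0)

    return PLACEHOLDER.sub(_restore, sanitized)


text, blocks = extract_code_blocks('Try <code lang="c">#include <stdio.h></code> now')
# Step 2 (stripping disallowed tags from `text`) would run here.
print(reinsert_code_blocks(text, blocks))
```

Two limitations of this sketch: the `lang` attribute is discarded, so a real version would need to capture it for the highlighter, and a commenter who types a placeholder by hand could duplicate an existing block.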

I would have answered with some code, but I recently read a blog post that solves this elegantly.

http://iboris.com/page/add-source-code-syntax-highlighting-your-django-content-pygments.html

- pyfunc

2

When I first parse the string, BeautifulSoup inserts the closing tags </stdbool.h> and </stdio.h>. So even if I use this technique, I still end up with those closing tags in my code blocks. - Dor

0

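If the code inside a <code> element contains unescaped <, & or > characters, it is not valid HTML. BeautifulSoup will try to convert it into valid HTML, which is probably not what you want.

To convert the text into valid HTML, you can adapt a regex that strips tags from HTML so that it extracts the text from each <code> block and replaces it with its cgi.escape() version. It should work fine as long as there are no nested <code> tags. After that you can feed the sanitized HTML to BeautifulSoup.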
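
Once the code blocks are escaped, the whitelist pass itself is fairly mechanical. The sketch below uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs on Python 3 without extra packages. The whitelist contents, the class name and the helper name are assumptions made for illustration. It is not a complete sanitizer: for instance, it does not check `href` URL schemes.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"strong", "em", "a", "code"}          # assumed whitelist
ALLOWED_ATTRS = {"a": {"href"}, "code": {"lang"}}     # assumed per-tag attributes


class WhitelistSanitizer(HTMLParser):
    """Re-emit the input, dropping every tag that is not whitelisted."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &lt; as entities
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            kept = "".join(
                ' %s="%s"' % (name, escape(value or ""))
                for name, value in attrs
                if name in ALLOWED_ATTRS.get(tag, ())
            )
            self.parts.append("<%s%s>" % (tag, kept))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(escape(data, quote=False))

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)


def strip_disallowed(markup):
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(strip_disallowed('Hi <b>x</b> <em>y</em> <code lang="c">#include &lt;stdio.h&gt;</code>'))
```

Disallowed tags are dropped while their text is kept, and the already-escaped code passes through unchanged.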

- jfs

- N 1.1 · Accepted Answer

From the Python wiki:

>>> import cgi
>>> cgi.escape("<string.h>")
'&lt;string.h&gt;'

>>> BeautifulSoup('&lt;string.h&gt;',
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
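
On Python 3 the same round trip can be sketched with the standard library alone, since `cgi.escape` was removed in Python 3.8. `html.escape` takes its place here, and `html.unescape` stands in for the entity conversion that BeautifulSoup performs above.

```python
import html

# quote=False escapes only <, > and &; the default also escapes quotes
escaped = html.escape("<string.h>", quote=False)
print(escaped)                 # &lt;string.h&gt;
print(html.unescape(escaped))  # <string.h>
```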