Pandoc - HTML to Markdown - remove all attributes

Question

markdown pandoc

14

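This seems like it should be simple, but I haven't been able to find an answer. I'm using Pandoc to convert HTML to Markdown, and I'd like to strip all attributes from the HTML, such as "class" and "id".

Is there an option in Pandoc for this?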

- trajan

1

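You can write a Pandoc filter to do this. With panflute, the filter can set elem.identifier = '', elem.classes = [] and elem.attributes = {}. Since only a few elements have attributes, you can wrap this in a try clause (or use slots to check whether the element has attributes). - Sergio Correia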
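
A minimal sketch of the filter idea in the comment above, written against pandoc's JSON AST with only the Python standard library rather than panflute (a deliberate substitution so it runs without extra packages). The names ATTR_INDEX, strip_attrs and main are illustrative, and the mapping assumes the usual JSON layout in which Header stores its attributes at index 1 and the other listed elements at index 0. Attributes on Table, Figure and table cells are not handled here.

```python
import json
import sys

# Position of the [identifier, classes, key-values] triple inside each
# element's "c" list. Assumption: standard pandoc JSON layout; Table,
# Figure and table cells are deliberately left out of this sketch.
ATTR_INDEX = {
    "Header": 1,
    "Div": 0,
    "Span": 0,
    "Code": 0,
    "CodeBlock": 0,
    "Link": 0,
    "Image": 0,
}


def strip_attrs(node):
    """Recursively blank the attributes of known elements, in place."""
    if isinstance(node, list):
        for item in node:
            strip_attrs(item)
    elif isinstance(node, dict):
        tag = node.get("t")
        idx = ATTR_INDEX.get(tag) if isinstance(tag, str) else None
        content = node.get("c")
        if idx is not None and isinstance(content, list):
            # Empty identifier, no classes, no key-value pairs.
            content[idx] = ["", [], []]
        for value in node.values():
            strip_attrs(value)
    return node


def main(source=sys.stdin, sink=sys.stdout):
    """Read a pandoc JSON document, strip attributes, write it back."""
    json.dump(strip_attrs(json.load(source)), sink)
```

Saved as, say, strip_attrs.py with a call to main() appended at the end, it could be run as pandoc input.html -t markdown --filter strip_attrs.py.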

3

You could try disabling extensions: pandoc -t markdown-header_attributes-link_attributes-native_divs-native_spans and so on... Or, yes, write a pandoc filter. - mb21
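
The extension-disabling approach in the comment above can be scripted. This is a rough sketch only: the helper names output_format, pandoc_command and convert are made up for illustration, and the extension list contains just the four named in the comment, so it may need extending for other inputs.

```python
import subprocess

# Extensions named in the comment above. Disabling them stops the Markdown
# writer from emitting attribute blocks and native div/span wrappers.
ATTRIBUTE_EXTENSIONS = [
    "header_attributes",
    "link_attributes",
    "native_divs",
    "native_spans",
]


def output_format(base="markdown", disabled=ATTRIBUTE_EXTENSIONS):
    """Build a format spec such as 'markdown-header_attributes-...'."""
    return base + "".join("-" + ext for ext in disabled)


def pandoc_command(input_path, output_path):
    """Return the argument list for pandoc; nothing is executed here."""
    return [
        "pandoc", input_path,
        "-f", "html",
        "-t", output_format(),
        "-o", output_path,
    ]


def convert(input_path, output_path):
    """Run pandoc (assumed to be on PATH) and raise if it fails."""
    subprocess.run(pandoc_command(input_path, output_path), check=True)
```

Calling convert("input.html", "output.md") would then run the same command line as in the comment, with the input and output files added.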

3 Answers

6

You can use a Lua filter to remove all attributes and classes. Save the following to a file named remove-attr.lua and call pandoc with --lua-filter=remove-attr.lua.

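-- Replace an element's attributes with an empty Attr.
-- Elements without an attr field are left unchanged.
function remove_attr (x)
  if x.attr then
    x.attr = pandoc.Attr()
    return x
  end
end

-- Apply the function to every inline and block element.
return {{Inline = remove_attr, Block = remove_attr}}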

- tarleb

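I was trying to discard artifacts from HTML that originated in Microsoft Office products, which had <span style="font-family:&quot;Arial&quot;,sans-serif;color:black">value</span> around every cell value. This works just as well as -t gfm-raw_html. Thanks! - TheDudeAbides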

0

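I was also surprised that a web search turned up nothing for such a seemingly simple operation. In the end I wrote the code below, drawing on the BeautifulSoup documentation and usage examples from other SO answers.

The code below also removes script and style HTML tags. Apart from that, it keeps any src and href attributes. Those two should give you enough flexibility to adapt it to your needs (that is, adjust as required, then use pandoc to convert the returned HTML to Markdown).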

# https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree
from bs4 import BeautifulSoup, NavigableString

def unstyle_html(html):
    soup = BeautifulSoup(html, features="html.parser")

    # remove all attributes except for `src` and `href`
    for tag in soup.descendants:
        keys = []
        if not isinstance(tag, NavigableString):
            for k in tag.attrs.keys():
                if k not in ["src", "href"]:
                    keys.append(k)
            for k in keys:
                del tag[k]

    # remove all script and style tags
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()

    # return html text
    return soup.prettify()

- wiz_lee

- Clément · Accepted Answer

Consider the file input.html:

<h1 class="test">Hi!</h1>
<p><strong id="another">This is a test.</strong></p>

Then run pandoc input.html -t gfm-raw_html -o output.md, which produces the following output.md:

# Hi!

**This is a test.**

Without the -t gfm-raw_html option, you would get this instead:

# Hi! {#hi .test}

**This is a test.**

This question is actually similar to this one. I don't think pandoc preserves id attributes.