Convert an HTML list (<li>) to tabs (i.e. indentation)

Question

python html regex tabs

3

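I have coded in dozens of languages but am new to Python.

This is probably my first (or second) question here, so please be gentle...

I'm trying to efficiently convert HTML-like markup text to wiki format (specifically, Linux Tomboy/GNote notes to Zim), and I have gotten stuck on converting lists.

For a two-level unordered list like this...

- First level
  - Second level

Tomboy/GNote uses something like...

<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>

However, the Zim personal wiki wants that to be...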

* First level
  * Second level

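with leading tabs.

I have explored the regex module functions re.sub(), re.match(), re.search(), etc., and found the cool Python ability to code repeating text as...

 count * "text"

So it looks like there should be a way to do something like...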

 newnote = re.sub("<list>", LEVEL * "\t", oldnote)

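where LEVEL is the ordinal (occurrence number) of <list> in the note. The first <list> encountered has LEVEL 0, the second has LEVEL 1, and so on.

LEVEL would be decremented each time </list> is encountered.

<list-item> tags are converted to an asterisk for the bullet (preceded by a newline where appropriate), and </list-item> tags are dropped.

Finally... the question...

How do I get the value of LEVEL and use it as the multiplier for the tabs?

- DocSalvager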

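Off the top of my head: use an html/xml parser like BeautifulSoup or xml.dom.minidom, with a recursive function or a stack/queue to track opening/closing tags and compute the tab level. Basically, you want to convert the markup text into usable data, and then convert that code-friendly data into the other flavor of markup. - Joel Cornett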
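
The parser-plus-level-counter idea in this comment could be sketched with the standard library's event-based xml.sax parser. This is only a rough illustration written for Python 3: the names ListToZimHandler and lists_to_zim are made up for the example, and it assumes that each item's text comes before any nested list.

```python
import xml.sax

class ListToZimHandler(xml.sax.ContentHandler):
    """Hypothetical handler: tracks nesting depth while the parser walks the tags."""

    def __init__(self):
        super().__init__()
        self.level = -1   # becomes 0 at the outermost <list>
        self.lines = []

    def startElement(self, name, attrs):
        if name == "list":
            self.level += 1
        elif name == "list-item":
            # Start a new bullet line, indented by the current depth
            self.lines.append("\t" * self.level + "* ")

    def endElement(self, name):
        if name == "list":
            self.level -= 1

    def characters(self, content):
        # Assumption: text belongs to the most recently opened <list-item>
        if self.lines:
            self.lines[-1] += content

def lists_to_zim(note):
    handler = ListToZimHandler()
    xml.sax.parseString(note.encode("utf-8"), handler)
    return "\n".join(handler.lines)

note = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"
print(lists_to_zim(note))
```

The depth counter plays the role of LEVEL from the question: it goes up on each opening list tag and down on each closing one, so no regular expression has to understand the nesting.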

2

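Don't use re. It doesn't handle nested tags well. - Joel Cornett

Possibly relevant: https://dev59.com/X3I-5IYBdhLWcg3wq6do ;) - mensi

Have you tried http://www.aaronsw.com/2002/html2text/? - Katriel

I'll study the html2text.py program for its techniques, but what I'm converting isn't actually HTML. - DocSalvager

2 Answers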

2

Use Beautiful Soup. It lets you iterate over tags even when they are custom ones, which makes it very practical for this kind of operation.

from BeautifulSoup import BeautifulSoup
tags = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"
soup = BeautifulSoup(tags)
print [[ item.text for item in list_tag('list-item')]  for list_tag in soup('list')]

Output : [[u'First level'], [u'Second level']]

I used a nested list comprehension, but you can use nested for loops as well.

for list_tag in soup('list'):
     for item in list_tag('list-item'):
         print item.text

Hope that helps.

In my example I used BeautifulSoup 3, but the example should work with BeautifulSoup4 with only the import changed.

from bs4 import BeautifulSoup
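
To get from iterating over the tags to the indented Zim output, the nesting depth has to be carried along during the walk. Here is a minimal Python 3 sketch of that step. It swaps in the standard library's xml.etree.ElementTree instead of Beautiful Soup so that nothing needs to be installed, and the function name list_to_zim is made up for illustration. It assumes the note fragment is well-formed XML with a single outer <list>.

```python
import xml.etree.ElementTree as ET

def list_to_zim(list_el, level=0):
    """Return Zim bullet lines for one <list> element (hypothetical helper)."""
    lines = []
    # findall() with a plain tag name only looks at direct children
    for item in list_el.findall("list-item"):
        text = (item.text or "").strip()
        lines.append("\t" * level + "* " + text)
        # Nested lists inside this item are one level deeper
        for nested in item.findall("list"):
            lines.extend(list_to_zim(nested, level + 1))
    return lines

note = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"
print("\n".join(list_to_zim(ET.fromstring(note))))
```

Looking only at direct children keeps each item attached to its own list, so an outer list does not also pick up the items of the lists nested inside it.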

- Rachid

This looks great! I'd upvote it, but my reputation isn't high enough yet. I'll give it a try. - DocSalvager

- jadkik94 · Accepted Answer

You really should use an XML parser for this, but to answer your question:

import re

def next_tag(s, tag):
    # Yield the index of each occurrence of tag in s, left to right
    i = -1
    while True:
        try:
            i = s.index(tag, i+1)
        except ValueError:
            return
        yield i

a = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"

# Each item becomes a bullet
a = a.replace("<list-item>", "* ")

# The n-th <list> (counting from 0) becomes a newline plus n tabs
for LEVEL, _ in enumerate(next_tag(a, "<list>")):
    a = re.sub("<list>", "\n" + LEVEL * "\t", a, count=1)

# Closing tags are simply dropped
a = a.replace("</list-item>", "")
a = a.replace("</list>", "")

print(a)

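This will work for your example, and your example only. Use an XML parser. You can use xml.dom.minidom (it is included with Python, at least in 2.7, so there is nothing to download):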

import xml.dom.minidom

def parseList(el, lvl=0):
    txt = ""
    indent = "\t" * (lvl)
    for item in el.childNodes:
        # These are the <list-item>s: They can have text and nested <list> tag
        for subitem in item.childNodes:
            # Compare node types with ==, not "is": they are plain integers
            if subitem.nodeType == xml.dom.minidom.Node.TEXT_NODE:
                # This is the text before the next <list> tag
                txt += "\n" + indent + "* " + subitem.nodeValue
            else:
                # This is the next list tag, its indent level is incremented
                txt += parseList(subitem, lvl=lvl+1)
    return txt

def parseXML(s):
    doc = xml.dom.minidom.parseString(s)
    return parseList(doc.firstChild)

a = "<list><list-item>First level<list><list-item>Second level</list-item><list-item>Second level 2<list><list-item>Third level</list-item></list></list-item></list></list-item></list>"
print(parseXML(a))

Output:

* First level
    * Second level
    * Second level 2
        * Third level
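
For completeness, the LEVEL counter described in the question, including the decrement on each closing list tag, could also be kept in a replacement function passed to re.sub(). The sketch below is written for Python 3 and is still regex-based, so it only suits input whose tags are known to be well-formed; the name tomboy_lists_to_zim is made up for illustration.

```python
import re

def tomboy_lists_to_zim(note):
    """Hypothetical helper: rewrite list tags while tracking the nesting level."""
    level = -1  # becomes 0 at the first <list>

    def replace_tag(match):
        nonlocal level
        tag = match.group(0)
        if tag == "<list>":
            level += 1
            return ""
        if tag == "</list>":
            level -= 1
            return ""
        if tag == "<list-item>":
            # New bullet on its own line, indented by the current level
            return "\n" + level * "\t" + "* "
        return ""  # </list-item> is dropped

    # re.sub() calls replace_tag once per tag, in order from left to right
    converted = re.sub(r"</?list(?:-item)?>", replace_tag, note)
    return converted.lstrip("\n")

a = "<list><list-item>First level<list><list-item>Second level</list-item><list-item>Second level 2<list><list-item>Third level</list-item></list></list-item></list></list-item></list>"
print(tomboy_lists_to_zim(a))
```

Because the counter goes back down on each closing list tag, sibling lists at the same depth get the same indentation, which the enumerate-based loop earlier in this answer does not handle.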