How do I create a Python dictionary from unstructured text?

3

I have a set of broken-link checker results stored in a text file:

Getting links from: https://www.foo.com/
├───OK─── http://www.this.com/
├───OK─── http://www.is.com/
├─BROKEN─ http://www.broken.com/
├───OK─── http://www.set.com/
├───OK─── http://www.one.com/
5 links found. 0 excluded. 1 broken.

Getting links from: https://www.bar.com/
├───OK─── http://www.this.com/
├───OK─── http://www.is.com/
├─BROKEN─ http://www.broken.com/
3 links found. 0 excluded. 1 broken.

Getting links from: https://www.boo.com/
├───OK─── http://www.this.com/
├───OK─── http://www.is.com/
2 links found. 0 excluded. 0 broken.

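I am trying to write a script that reads the file and creates a dictionary in which each root link is a key and its child links (including the summary line) are the value.
The output I want to achieve looks like this: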
{"Getting links from: https://www.foo.com/": ["├───OK─── http://www.this.com/", "├───OK─── http://www.is.com/", "├─BROKEN─ http://www.broken.com/", "├───OK─── http://www.set.com/", "├───OK─── http://www.one.com/", "5 links found. 0 excluded. 1 broken."], 
"Getting links from: https://www.bar.com/": ["├───OK─── http://www.this.com/", "├───OK─── http://www.is.com/", "├─BROKEN─ http://www.broken.com/", "3 links found. 0 excluded. 1 broken."],
"Getting links from: https://www.boo.com/": ["├───OK─── http://www.this.com/", "├───OK─── http://www.is.com/", "2 links found. 0 excluded. 0 broken."] }

Here is what I have so far:

result_list = []

with open('link_checker_result.txt', 'r') as f:
    temp_list = f.readlines()
    for line in temp_list:
        result_list.append(line)

This gives me the output:

['Getting links from: https://www.foo.com/', '├───OK─── http://www.this.com/', '├───OK─── http://www.is.com/', '├─BROKEN─ http://www.broken.com/', '├───OK─── http://www.set.com/', '├───OK─── http://www.one.com/', '5 links found. 0 excluded. 1 broken.', 'Getting links from: https://www.bar.com/', '├───OK─── http://www.this.com/', '├───OK─── http://www.is.com/', '...'  ]

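I recognize that these sets share some common features, for example they are separated by a blank line, and each one starts with "Getting…". Should I try to split the text on one of those before writing it to a dictionary?

I am fairly new to Python, so I admit I am not even sure I am heading in the right direction. Any help from the experts is greatly appreciated. Thanks in advance!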


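I don't see how this is "unstructured". There is a header, there are data lines that start with ├─, and there is a summary, which does _not_ start with ├─. How could this be any more structured? - ForceBru
Can you modify the code that generates the text file? Loading the data into a dictionary inside that code would be more straightforward than loading it from that code's output. - Timothy Jannace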
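
The structure described in the first comment (header, data lines starting with ├, summary) can be sketched as a small line classifier. This is only an illustration: `classify` and `LINK_LINE` are hypothetical names, and the regular expression assumes every data line looks exactly like the ones in the sample file (├, a status word padded with ─, a space, then the URL). Blank lines should be skipped before calling it.

```python
import re

# Assumed line shape: "├", a status word padded with "─", a space, then the URL.
LINK_LINE = re.compile(r'^├─*(?P<status>[A-Z]+)─*\s+(?P<url>\S+)$')


def classify(line):
    """Return ('header' | 'link' | 'summary', payload) for one non-blank line."""
    if line.startswith('Getting links from:'):
        # Keep only the root URL that follows the fixed header text.
        return 'header', line.split(': ', 1)[1]
    match = LINK_LINE.match(line)
    if match:
        return 'link', (match.group('status'), match.group('url'))
    # Anything else is treated as the summary line of the block.
    return 'summary', line


for sample in ('Getting links from: https://www.foo.com/',
               '├─BROKEN─ http://www.broken.com/',
               '5 links found. 0 excluded. 1 broken.'):
    print(classify(sample))
```

Separating the status from the URL this way makes it easy to filter for broken links later, at the cost of depending on the exact line format.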
2 Answers

3
This can actually be very short, only about 4 lines of code:
finalDict = {}
with open('link_checker_result.txt', 'r', encoding='utf-8') as f:
    # strip() drops a trailing newline so the last block gets no empty item.
    lines = [block.split('\n') for block in f.read().strip().split('\n\n')]
    finalDict = {elem[0]: elem[1:] for elem in lines}
print(finalDict)

Output:

{'Getting links from: https://www.foo.com/': ['├───OK─── http://www.this.com/', '├───OK─── http://www.is.com/', '├─BROKEN─ http://www.broken.com/', '├───OK─── http://www.set.com/', '├───OK─── http://www.one.com/', '5 links found. 0 excluded. 1 broken.'], 'Getting links from: https://www.bar.com/': ['├───OK─── http://www.this.com/', '├───OK─── http://www.is.com/', '├─BROKEN─ http://www.broken.com/', '3 links found. 0 excluded. 1 broken.'], 'Getting links from: https://www.boo.com/': ['├───OK─── http://www.this.com/', '├───OK─── http://www.is.com/', '2 links found. 0 excluded. 0 broken.']}

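The code above reads the input file and splits it on two consecutive newline characters ("\n\n"), which gives one block per root URL together with its links.
It then splits each block into lines, takes the first line of each block as the key and the remaining lines as the value, and stores these key-value pairs in the finalDict dictionary.
A more readable version: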
finalDict = {}
with open('link_checker_result.txt', 'r', encoding='utf-8') as f:
    # Read everything and split on blank lines, so each root URL and its links form one block.
    # strip() removes a trailing newline that would otherwise leave an empty item in the last block.
    data = f.read().strip().split('\n\n')
    # Split each block into its individual lines (header, links, summary).
    links = [block.split('\n') for block in data]
    # The first line of each block becomes the key, the remaining lines become the value.
    finalDict = {elem[0]: elem[1:] for elem in links}
print(finalDict)
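
To try the blank-line splitting without a file on disk, here is a minimal sketch that wraps the same idea in a function and runs it on an in-memory string. `parse_link_report` and `SAMPLE` are hypothetical names, the sample is a shortened, made-up variant of the file in the question, and the function assumes blocks are separated by exactly one blank line.

```python
def parse_link_report(text):
    """Split a link-checker report into {header line: [remaining lines]}."""
    result = {}
    # strip() removes a trailing newline so the last block gets no empty item.
    for block in text.strip().split('\n\n'):
        lines = block.split('\n')
        result[lines[0]] = lines[1:]
    return result


# Hypothetical sample, shortened from the file shown in the question.
SAMPLE = """Getting links from: https://www.foo.com/
├───OK─── http://www.this.com/
├─BROKEN─ http://www.broken.com/
2 links found. 0 excluded. 1 broken.

Getting links from: https://www.boo.com/
├───OK─── http://www.this.com/
1 links found. 0 excluded. 0 broken.
"""

parsed = parse_link_report(SAMPLE)
for header, children in parsed.items():
    print(header, len(children))
```

With a real file, the same function could be called as `parse_link_report(f.read())` inside the `with open(...)` block.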

Nailed it! Thanks, comrade! - lane

0
This will produce the result you want:
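result = {}

with open('link_checker_result.txt', 'r', encoding='utf-8') as f:
    key = ''
    value = []
    for raw_line in f:
        # Remove the trailing newline; otherwise a blank line is '\n', which is truthy.
        line = raw_line.rstrip('\n')
        if not line:
            # A blank line ends the current block.
            if key:
                result[key] = value
            key = ''
            value = []
        elif not key:
            # The first line of a block is the root link.
            key = line
        else:
            value.append(line)

    # Store the last block if the file does not end with a blank line.
    if key:
        result[key] = value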
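
The question also mentions that every block starts with "Getting…". As an alternative sketch, the lines can be grouped by that prefix instead of by blank lines, which still works if a separating blank line is missing. `group_by_header` and `HEADER_PREFIX` are hypothetical names, and the sample lines are made up for illustration.

```python
HEADER_PREFIX = 'Getting links from:'  # header text taken from the sample file


def group_by_header(lines):
    """Group lines under the most recent header line; blank lines are skipped."""
    result = {}
    current = None
    for raw_line in lines:
        line = raw_line.rstrip('\n')
        if not line.strip():
            continue
        if line.startswith(HEADER_PREFIX):
            # Start a new block (or reuse it if the same header appears again).
            current = result.setdefault(line, [])
        elif current is not None:
            current.append(line)
    return result


# Made-up lines with no blank line between the two blocks.
sample_lines = [
    'Getting links from: https://www.foo.com/\n',
    '├───OK─── http://www.this.com/\n',
    '1 links found. 0 excluded. 0 broken.\n',
    'Getting links from: https://www.bar.com/\n',
    '├─BROKEN─ http://www.broken.com/\n',
    '1 links found. 0 excluded. 1 broken.\n',
]
grouped = group_by_header(sample_lines)
print(grouped)
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly.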
