How do I turn diary entries into a dictionary?

3
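I'm trying to match a regex against the diary entry dates and, if it matches, use the date as the key and the entry that follows as the value.
First, I planned to split the text into an array and use every odd index as a key and every even index as a value.
Source: https://archive.org/stream/AnneFrankTheDiaryOfAYoungGirl_201606/Anne-Frank-The-Diary-Of-A-Young-Girl_djvu.txt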
file = open(r"C:\Users\mmcgown\Desktop\School\MSDS452\FinalProject\TheDiaryOfAYoungGirl.txt","r")
s = file.read()

import re
r = '(SUNDAY|MONDAY|TUESDAY|WEDNESDAY|THURSDAY|FRIDAY|SATURDAY), (JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER) \d{1,2}, 19\d{2}\s*\n'
l = re.split(r,s)

l

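However, this only splits before and after each regex match, so split doesn't seem to be the right method... and for some reason it also puts the day and the month into the list.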
'',
 'SUNDAY',
 'JUNE',
 'I\'ll begin from the ...

What is the simplest way to split the diary entries up like this?
{ 'SUNDAY, JUNE 14, 1942' : 'I'll begin from the ...' },
{ 'MONDAY, JUNE 15, 1942' : 'I had my birthday ...'},
etc.

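By the way, I also tried processing the file line by line, but it kept getting uglier, so I'd like advice on the right solution (I didn't finish the code below).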

file = open(r"C:\Users\mmcgown\Desktop\School\MSDS452\FinalProject\TheDiaryOfAYoungGirl.txt","r")
dia = {}
for line in file:
    i = 0
    if re.match(r,line) and i == 0:
        dia = {line.rstrip() : ''}
    elif not re.match(r,line):
        line = last_line + line
    elif re.match(r,line) and (i != 0):
        dia.update({line: last_line})
    i = i + 1
    last_line = line

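Hi, why is the second approach so bad? - nonamer92
It can be done with a regex - your regex is the problem. I'm not a regex expert, so it would take me a while to give you the exact pattern, but the reason it splits out the day and the month instead of matching the date as a whole is that you asked it to. I'd suggest first getting the regex right on its own, so that it detects the various complete/incomplete day/month/year combinations, and only then worrying about the rest. - logicOnAbstractions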
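To make the comment above concrete: `re.split` includes the text of every capturing group in its result, which is why `'SUNDAY'` and `'JUNE'` show up as separate list items. A minimal sketch, assuming the alternations are made non-capturing with `(?:...)` and the whole date is wrapped in a single capturing group, so the list alternates date, entry, date, entry. The names `DATE_RE`, `split_diary` and `sample` are made up for illustration, and the sample text is a shortened stand-in for the real file.

```python
import re

# (?:...) groups do not appear in the re.split result; the single outer
# (...) group does, so each date comes back as one whole list item.
DATE_RE = re.compile(
    r'^((?:SUNDAY|MONDAY|TUESDAY|WEDNESDAY|THURSDAY|FRIDAY|SATURDAY), '
    r'(?:JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|SEPTEMBER'
    r'|OCTOBER|NOVEMBER|DECEMBER) '
    r'\d{1,2}, 19\d{2})\s*\n',
    flags=re.MULTILINE,
)


def split_diary(text):
    """Return {date: entry} built from the alternating re.split result."""
    parts = DATE_RE.split(text)
    # parts[0] is whatever precedes the first date; after that the list
    # alternates date, entry, date, entry, ...
    dates = parts[1::2]
    entries = parts[2::2]
    return {date: entry.strip() for date, entry in zip(dates, entries)}


# Shortened stand-in for the diary text (assumption, not the real file).
sample = (
    "Preface text\n"
    "SUNDAY, JUNE 14, 1942\n"
    "\n"
    "I'll begin from the moment I got you.\n"
    "\n"
    "MONDAY, JUNE 15, 1942\n"
    "\n"
    "I had my birthday party on Sunday afternoon.\n"
)
diary = split_diary(sample)
print(diary)
```

Note that if the same date heads more than one entry, a plain dict keeps only the last one; collecting the entries in a list per date would avoid that.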
2 Answers

1
You can use the following example (I used an OrderedDict to keep the dates in order in the dictionary; sample.txt is the text file from your question):
import re
from collections import OrderedDict

with open('sample.txt', 'r') as f_in:
    data = f_in.read()

data = re.findall(r'^([A-Z]+, [A-Z]+ \d+, \d+)(.*?)(?=(?:[A-Z]+, [A-Z]+ \d+, \d+)|(?:ANNE\'S DIARY ENDS HERE\.))', data, flags=re.M|re.DOTALL)

d = OrderedDict( data )

from pprint import pprint
pprint(d)

Output:

OrderedDict([('SUNDAY, JUNE 14, 1942',
              '\n'
              '\n'
              '\n'
              "I'll begin from the moment I got you, the moment I saw you "
              'lying on the table among\n'

...till

          "what I'd like to be and what I could be if ... if only there "
          'were no other people in\n'
          'the world.\n'
          '\n'
          'Yours, Anne M. Frank\n'
          '\n'
          '\n')])
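The pattern above relies on the closing line `ANNE'S DIARY ENDS HERE.` to terminate the last entry. As a hedged variant, the lookahead can fall back to the end of the string (`\Z`) instead, which may be handy for files without that marker. This is a sketch on a shortened inline sample rather than the real text; `DATE`, `ENTRY_RE` and `sample_text` are illustrative names. Since Python 3.7 a plain `dict` also keeps insertion order, so it is used here in place of `OrderedDict`.

```python
import re

DATE = r'[A-Z]+, [A-Z]+ \d+, \d+'

# Group 1: a date at the start of a line.
# Group 2: everything after it, matched lazily, up to (but not including)
# the next line-initial date or the end of the string.
ENTRY_RE = re.compile(
    r'^(' + DATE + r')(.*?)(?=^' + DATE + r'|\Z)',
    flags=re.MULTILINE | re.DOTALL,
)

# Shortened stand-in for the diary text (assumption, not the real file).
sample_text = (
    "SUNDAY, JUNE 14, 1942\n\n"
    "I'll begin from the moment I got you.\n\n"
    "MONDAY, JUNE 15, 1942\n\n"
    "I had my birthday party on Sunday afternoon.\n"
)

# findall returns (date, body) tuples; strip() drops the surrounding blank lines.
entries = {date: body.strip() for date, body in ENTRY_RE.findall(sample_text)}
print(entries)
```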

0

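How about this approach? (I didn't want to change your regex, so I used it as is.)

  1. Iterate over the lines to find the indexes of all lines that match your regex, and save the results in a list of tuples, where each tuple contains (the date you want, the line index).
  2. Iterate over the results found above and add them to the dictionary (dia).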

import re

from pprint import pprint

# Raw string, so that \d and \s reach the regex engine without invalid-escape warnings.
r = r'(SUNDAY|MONDAY|TUESDAY|WEDNESDAY|THURSDAY|FRIDAY|SATURDAY), (JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER) \d{1,2}, 19\d{2}\s*\n'
date_indexes = []
with open(r"your_file.txt", "r") as f:
    lines = f.readlines()
    for i, line in enumerate(lines):
        if re.match(r, line):
            date_indexes.append((line.strip(), i))

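    dia = {}
    for i, (date, idx) in enumerate(date_indexes):
        # An entry runs from the line after its date up to the next date line
        # (slice ends are exclusive), or to the end of the file for the last date.
        next_idx = date_indexes[i + 1][1] if i + 1 < len(date_indexes) else len(lines)
        dia[date] = ''.join(lines[idx + 1:next_idx])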


pprint(dia)
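The two steps can also be folded into a single pass, which is roughly what the unfinished line-by-line loop in the question was aiming for. A minimal sketch, assuming each date sits on a line of its own; `DATE_LINE`, `parse_diary` and `sample_lines` are hypothetical names, and the sample lines stand in for the real file.

```python
import re

DATE_LINE = re.compile(
    r'(SUNDAY|MONDAY|TUESDAY|WEDNESDAY|THURSDAY|FRIDAY|SATURDAY), '
    r'(JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|SEPTEMBER'
    r'|OCTOBER|NOVEMBER|DECEMBER) '
    r'\d{1,2}, 19\d{2}\s*$'
)


def parse_diary(lines):
    """Group lines under the most recent date line seen so far."""
    collected = {}
    current = None  # date of the entry currently being read
    for line in lines:
        if DATE_LINE.match(line):
            current = line.strip()
            # setdefault keeps earlier text if the same date appears again.
            collected.setdefault(current, [])
        elif current is not None:
            # Lines before the first date (title page, preface) are skipped.
            collected[current].append(line)
    return {date: ''.join(body).strip() for date, body in collected.items()}


# Stand-in for the lines of the diary file (assumption, not the real text).
sample_lines = [
    "Preface\n",
    "SUNDAY, JUNE 14, 1942\n",
    "\n",
    "I'll begin from the moment I got you.\n",
    "MONDAY, JUNE 15, 1942\n",
    "I had my birthday party on Sunday afternoon.\n",
]
result = parse_diary(sample_lines)
print(result)
```

Because a file object yields its lines when iterated, `parse_diary(f)` inside a `with open(...) as f:` block should work the same way without reading the whole file into memory first.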
