Splitting comments into a DataFrame with Python regular expressions

3

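I have a bunch of user-entered strings, each of which is several comments concatenated together. When there are comments from multiple days, users sometimes type in the date. I'm trying to find a way to separate each date from its corresponding comment. The text comments might look like this: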

raw_text = ['3/30: The dog is red. 4/01: The dog is blue', 'there is a green door', '3-25:Foobar baz'] 

I would like to transform that text to:

df = pd.DataFrame([[0,'3/30','The dog is red.'],[0,'4/01','The dog is blue'],[1,np.nan,'there is a green door'],[2,'3-25','Foobar baz']],columns = ['row_id','date','text'])

print(df)

   row_id  date                   text
0       0  3/30        The dog is red.
1       0  4/01        The dog is blue
2       1   NaN  there is a green door
3       2  3-25             Foobar baz

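I think what I need to do is find the colons, then go back to the first number before each colon to identify the date (sometimes written with / and sometimes with -).

A regex-based approach would be much appreciated; this is beyond my simple slicing/find knowledge.

Thanks!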

2 Answers

2

I'm not great with regex (so there may be a better solution), but this seems to work...

import pandas as pd

# sample list
raw_text = ['10-30: The dog is red until 5/15. 4/01: The dog is blue', 'there is a green door',
            '3-25:Foobar baz', '11-25:Foobar baz. 12/20: something else']

# create regex (in the comment below, 'n' stands for a digit)
# split before 'nn/nn:' OR 'nn-nn:' OR ' n-nn:' OR ' n/nn:' OR ' nn-nn:' OR ' nn/nn:' OR a digit at the start of the string
regex = r'(?=\d\d/\d\d:)|(?=\d\d-\d\d:)|(?= \d-\d\d:)|(?= \d/\d\d:)|(?= \d\d-\d\d:)|(?= \d\d/\d\d:)|(?=^\d)'
# split at the start of a string that begins with a non-digit, or at a ':'
regex2 = r'(?=^\D)|:'

# create a Series by splitting on regex and explode (the original index is kept)
s = pd.DataFrame(raw_text)[0].str.split(regex).explode()
# boolean indexing to remove blanks
s2 = s[(s != '') & (s != ' ')]

# strip leading or trailing white space then split on regex2
df = s2.str.strip().str.split(regex2, expand=True).reset_index()
# rename columns
df.columns = ['row_id', 'date', 'text']
# drop the space left behind after the ':'
df['text'] = df['text'].str.strip()


   row_id   date                         text
0       0  10-30   The dog is red until 5/15.
1       0   4/01              The dog is blue
2       1               there is a green door
3       2   3-25                   Foobar baz
4       3  11-25                  Foobar baz.
5       3  12/20               something else
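For comparison, here is a minimal sketch of the same idea with a more compact pattern, using only the standard `re` module. It is not the code from this answer: the names `BOUNDARY`, `DATE_AND_TEXT` and `split_dated_comments` are made up for illustration, and it assumes every date is one or two digits, a `/` or `-`, one or two digits, then a colon, and that a date in the middle of a string is preceded by whitespace.

```python
import re

# Assumed boundary: whitespace followed by a date such as 4/01 or 12-20 that is
# immediately followed by a colon. "5/15." has no colon, so it is not a boundary.
BOUNDARY = re.compile(r'\s+(?=\d{1,2}[/-]\d{1,2}:)')
# Optional leading "date:" prefix, then the comment text.
DATE_AND_TEXT = re.compile(r'(?:(\d{1,2}[/-]\d{1,2}):)?\s*(.*)', re.DOTALL)


def split_dated_comments(raw_text):
    """Return (row_id, date, text) tuples; date is None when no date was typed."""
    rows = []
    for row_id, entry in enumerate(raw_text):
        for piece in BOUNDARY.split(entry.strip()):
            date, text = DATE_AND_TEXT.fullmatch(piece).groups()
            rows.append((row_id, date, text))
    return rows


raw_text = ['10-30: The dog is red until 5/15. 4/01: The dog is blue',
            'there is a green door',
            '3-25:Foobar baz',
            '11-25:Foobar baz. 12/20: something else']

rows = split_dated_comments(raw_text)
for row in rows:
    print(row)
```

The tuples can then be passed to `pd.DataFrame(rows, columns=['row_id', 'date', 'text'])`; in that case a missing date shows up as `None` rather than an empty string.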

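Close! It has to tolerate dates inside the comment text that are not marked with a colon as the start of a comment. If we change the raw data to: raw_text = ['10-30: The dog is red until 5/15. 4/01: The dog is blue', 'there is a green door', '3-25:Foobar baz', '11-25:Foobar baz. 12/20: something else'] it breaks. - flyingmeatball
Fixed it by changing your regex to: regex = r'(?=\d\d/\d\d:)|(?=\d\d-\d\d:)|(?= \d-\d\d:)|(?= \d/\d\d:)|(?= \d\d-\d\d:)|(?= \d\d/\d\d:)|(?=^\d:)' - can you edit the answer? I'll give you credit. - flyingmeatball
I had just updated the answer, but you saw it before I made the correction. - It_is_Chris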

0

Data

import numpy as np
import pandas as pd

df=pd.DataFrame({'raw_text':['3/30: The dog is red.', '4/01: The dog is blue', 'there is a green door', '3-25:Foobar baz']})
df

Create the date column

df['date']=df.raw_text.str.extract(r"([\d+\/\-+\d+]+(?=\:))")
df
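One thing to be aware of: the date pattern above is a single character class, so it accepts any run of digits, `+`, `/` and `-` that sits in front of a colon. The sketch below compares it with a tighter pattern using plain `re`. `STRICT` and `first_date` are hypothetical names introduced here, and the last two sample strings are invented purely to show the difference.

```python
import re

# Pattern from the answer: any mix of digits, '+', '/', '-' before a colon qualifies.
LOOSE = re.compile(r"([\d+\/\-+\d+]+(?=\:))")
# Hypothetical tighter alternative: 1-2 digits, one separator, 1-2 digits, then a colon.
STRICT = re.compile(r"\b(\d{1,2}[/-]\d{1,2})(?=:)")


def first_date(pattern, text):
    """Return the first captured date in text, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


samples = ['3/30: The dog is red.', '3-25:Foobar baz', 'there is a green door',
           'ratio 10:1 today', 'see notes --: pending']

for text in samples:
    print(repr(text), first_date(LOOSE, text), first_date(STRICT, text))
```

Both patterns agree on the data in this answer; they only differ on strings like the last two, where the loose pattern picks up `10` and `--`.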

Create the text column
df['text']=df.raw_text.str.extract(r"((?:-)?[^\s:][A-Za-z\s]+[^s])", expand=True)
df

Create the row ID column. Match the text 'The dog' and build a temporary index column.

k = 'The dog'
dicto ={'The dog':0}
df['index']=df['raw_text'].str.extract('('+ k + ')', expand=False).map(dicto)
df

Use the index column to fill in row_id

df['row_id']=df['index'].isna().astype('int64')

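Mask out the rows that contain the text 'The dog' and number the remaining rows with an auto-incrementing counter

m=df['row_id']!=0
# number the remaining rows 1, 2, ... (m.sum() is the count of rows to fill, so no hard-coded stop is needed)
df.loc[m,'row_id']=np.arange(1, m.sum() + 1)
df.drop(columns=['index'], inplace=True)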

Output

[image of the resulting DataFrame]

