Slicing a Pandas column using values from another column

Question

3

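I have a dataframe in which one column contains some text. For each row of that column, I want to find and extract the substring that lies between two strings. This is what I tried: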

startinds = df[column].str.find("First Event = ")
endinds   = df[column].str.find("\nLast Event = ")

df["first_timestamp"] = df[column].str.slice(startinds,endinds)

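This does not work, because startinds and endinds are both Series, so I cannot use them as indices to slice the strings in column.

Does anyone know a way to access the per-row values so that I can take the substring for each row?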

Example input:

    Data
0   "Blahblah
     First Event = 09/20/2017 12:00:00
     Last Event = 09/20/2017 13:00:00
     Blahblahblah"
1   "Blahblahblahblah
     Blahablahblah
     First Event = 09/20/2017 12:30:00
     Last Event = 09/20/2017 12:45:00
     Blahblahblah"

Output:

    first_timestamp
0   "First Event = 09/20/2017 12:00:00"
1   "First Event = 09/20/2017 12:30:00"

- andraiamatrix

2

This is an open issue on GitHub (https://github.com/pandas-dev/pandas/issues/8748). You will most likely have to do it manually. - IanS

2

Have you tried "First Event = " + df.Data.str.extract('(?<=First Event = )(.*)(?=\\\\nLast Event)', expand=False)? - Zero

2 Answers

2

Similar to the answer in the comments, this can also be done with the Series.str.extract method:

df['first_timestamp'] = df['Data'].str.extract('(First Event = .+)')

#                                                 Data  \
# 0  Blahblah\nFirst Event = 09/20/2017 12:00:00\nL...   
# 1  Blahblahblahblah\nFirst Event = 09/20/2017 12:...   
# 
#                      first_timestamp  
# 0  First Event = 09/20/2017 12:00:00  
# 1  First Event = 09/20/2017 12:30:00

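The pattern '(First Event = .+)' captures one group (the parentheses) that starts with "First Event = ", is followed by one or more characters (the .+), and stops at the newline (the . character matches any character except a newline).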
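As a minimal sketch of the behaviour described above, the following rebuilds the question's sample input (assuming the rows contain real newline characters) and applies the same pattern. Passing expand=False is an addition here, so that the result is a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Sample data modelled on the question; real "\n" newlines are assumed.
df = pd.DataFrame({
    "Data": [
        "Blahblah\n"
        "First Event = 09/20/2017 12:00:00\n"
        "Last Event = 09/20/2017 13:00:00\n"
        "Blahblahblah",
        "Blahblahblahblah\n"
        "Blahablahblah\n"
        "First Event = 09/20/2017 12:30:00\n"
        "Last Event = 09/20/2017 12:45:00\n"
        "Blahblahblah",
    ]
})

# "." does not match "\n" by default, so ".+" stops at the end of the line.
# expand=False returns a Series instead of a one-column DataFrame.
df["first_timestamp"] = df["Data"].str.extract(r"(First Event = .+)", expand=False)

print(df["first_timestamp"].tolist())
# ['First Event = 09/20/2017 12:00:00', 'First Event = 09/20/2017 12:30:00']
```

Rows in which the pattern is not found come back as NaN, so a missing marker is visible in the result.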

- cmaher

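@andraiamatrix In a regular expression, the . character matches any character except a newline (so .+ matches one or more characters other than a newline). Based on your updated question, it looks like df['Data'].str.extract('(First Event = .+)') will capture your first timestamp group. I will update my answer. - cmaher

I noticed that .+ stops matching at a newline, but it does not stop at a carriage return \r (which my data happens to contain). Is there a way to handle both cases? I tried (First Event = .+)[\r\n], but the output still contains the carriage return. - andraiamatrix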

1

You could try df['Data'].str.extract('(First Event = [^\n\r]+)') instead of using the . character. - cmaher
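To illustrate the difference discussed in these comments, here is a small sketch that compares the two patterns on a string with Windows-style \r\n line endings. The sample string is made up for the illustration, and expand=False is again an addition.

```python
import pandas as pd

# Hypothetical sample with "\r\n" line endings, as described in the comment.
s = pd.Series([
    "Blahblah\r\n"
    "First Event = 09/20/2017 12:00:00\r\n"
    "Last Event = 09/20/2017 13:00:00\r\n"
])

# "." matches "\r", so the captured group keeps the trailing carriage return.
with_dot = s.str.extract(r"(First Event = .+)", expand=False)

# The negated character class stops at either "\n" or "\r".
with_class = s.str.extract(r"(First Event = [^\n\r]+)", expand=False)

print(repr(with_dot[0]))    # 'First Event = 09/20/2017 12:00:00\r'
print(repr(with_class[0]))  # 'First Event = 09/20/2017 12:00:00'
```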

- Bharath M Shetty · Accepted Answer

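To complete your slicing approach, you can use a lambda function: store startinds and endinds in df, then slice the string in each row by applying the lambda across the columns (note that you need an escape character to get \n).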

df['startinds'] = df['Data'].str.find("First Event = ")
df['endinds']  = df['Data'].str.find("\\nLast Event = ")

df.apply(lambda x : str(x['Data'])[x['startinds']:x['endinds']],1 )

Output:

0    First Event = 09/20/2017 12:00:00
1    First Event = 09/20/2017 12:30:00
dtype: object
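A possible variant of the same row-wise idea, sketched here without storing the helper columns: compute the two index Series and slice each string in a list comprehension with zip. This is an alternative to the apply call above, not part of the original answer, and it assumes the rows contain real newline characters, so the search string is "\nLast Event = " with a single backslash.

```python
import pandas as pd

# Sample data modelled on the question; real "\n" newlines are assumed.
df = pd.DataFrame({
    "Data": [
        "Blahblah\nFirst Event = 09/20/2017 12:00:00\n"
        "Last Event = 09/20/2017 13:00:00\nBlahblahblah",
        "Blahblahblahblah\nBlahablahblah\nFirst Event = 09/20/2017 12:30:00\n"
        "Last Event = 09/20/2017 12:45:00\nBlahblahblah",
    ]
})

# Per-row start and end positions; str.find returns -1 when nothing is found.
starts = df["Data"].str.find("First Event = ")
ends = df["Data"].str.find("\nLast Event = ")

# Slice each string with its own pair of indices.
df["first_timestamp"] = [
    text[start:end] for text, start, end in zip(df["Data"], starts, ends)
]

print(df["first_timestamp"].tolist())
# ['First Event = 09/20/2017 12:00:00', 'First Event = 09/20/2017 12:30:00']
```

Because str.find returns -1 for a missing marker, a row without either string would be sliced silently at the wrong position, so it may be worth checking (starts >= 0) & (ends >= 0) before trusting the result.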