How to get the start and end indices of subgroups in a DataFrame

Question

5

import pandas as pd

df = pd.DataFrame({
    "C1": ['USA','USA','USA','USA','USA','JAPAN','JAPAN','JAPAN','USA','USA'],
    "C2": ['A','B','A','A','A','A','A','A','B','A'],
})

    C1      C2
0   USA     A
1   USA     B
2   USA     A
3   USA     A
4   USA     A
5   JAPAN   A
6   JAPAN   A
7   JAPAN   A
8   USA     B
9   USA     A

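This is a simplified version of my problem. My goal is to iterate over the subgroups of the DataFrame where C2 contains a B. If there is a B in C2, I look at C1 and need the whole group of consecutive rows with that C1 value. In this example, I see USA, and its group starts at index 0 and ends at 4. Another group runs from 8 to 9.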

So my desired result is the following indices:

[[0,4],[8,9]]

I tried groupby, but it does not work here because it puts all the USA rows into a single group.

my_index = list(df[df['C2']=='B'].index)
my_index

gives 1 and 8, but how do I get the start and end points?

- ProcolHarum

4 Answers

3

Another approach, using more_itertools.

# Keep all the indexes needed 
temp = df['C1'].ne(df['C1'].shift()).cumsum()
stored_index = df.index[temp.isin(temp[df['C2'].eq("B")])]

# Group the list based on consecutive numbers
import more_itertools as mit
out = [list(i) for i in mit.consecutive_groups(stored_index)]

# Get first and last elements from the nested (grouped) lists
final = [a[:1] + a[-1:] for a in out]

>>> print(final)
[[0, 4], [8, 9]]
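
If installing more_itertools is not an option, the same grouping of consecutive numbers can be sketched with the standard library. This is only an illustration, not part of the original answer, and it assumes an integer index such as the default RangeIndex; the names runs and values are made up for the example.

```python
from itertools import groupby

import pandas as pd

df = pd.DataFrame({
    "C1": ['USA','USA','USA','USA','USA','JAPAN','JAPAN','JAPAN','USA','USA'],
    "C2": ['A','B','A','A','A','A','A','A','B','A'],
})

# Label each run of identical consecutive C1 values
temp = df["C1"].ne(df["C1"].shift()).cumsum()
# Keep the index values of every run that contains a "B"
stored_index = df.index[temp.isin(temp[df["C2"].eq("B")])]

# Consecutive integers share the same (value - position) difference,
# so that difference works as a grouping key (assumes an integer index)
runs = groupby(enumerate(stored_index), key=lambda pair: pair[1] - pair[0])

final = []
for _, run in runs:
    values = [int(v) for _, v in run]
    final.append([values[0], values[-1]])

print(final)
```

A run made of a single row yields the same value twice, which matches what a[:1] + a[-1:] does in the code above.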

- sophocles

3

Solution

b = df['C1'].ne(df['C1'].shift()).cumsum()
m = b.isin(b[df['C2'].eq('B')])
i = m.index[m].to_series().groupby(b).agg(['first', 'last']).values.squeeze()

Explanation

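Shift column C1 and compare the shifted column with the original one to create a boolean mask, then take the cumulative sum of that mask to identify the blocks of rows in which the value of C1 stays the same.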

>>> b

0    1
1    1
2    1
3    1
4    1
5    2
6    2
7    2
8    3
9    3
Name: C1, dtype: int64

Create a boolean mask m to identify the blocks of rows that contain at least one B.

>>> m

0     True
1     True
2     True
3     True
4     True
5    False
6    False
7    False
8     True
9     True
Name: C1, dtype: bool

Filter the index with the boolean mask, then group the filtered index by the identified blocks b and aggregate with first and last to get the indices.

>>> i

array([[0, 4],
       [8, 9]])
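
The three steps above could be wrapped in a small helper, sketched below. The function name block_bounds and its parameters are hypothetical, chosen only for this illustration. The sketch also skips the final squeeze(): when only one block matches, squeeze() collapses the result to a one-dimensional array of shape (2,), so returning a nested list keeps the shape predictable.

```python
import pandas as pd


def block_bounds(df, key_col, flag_col, flag_value):
    """Return [first, last] index labels for each run of consecutive
    equal key_col values that contains flag_value in flag_col.
    (Hypothetical helper, not from the original answer.)"""
    # Step 1: label runs of consecutive identical values in key_col
    b = df[key_col].ne(df[key_col].shift()).cumsum()
    # Step 2: mark every row whose run contains flag_value
    m = b.isin(b[df[flag_col].eq(flag_value)])
    # Step 3: first and last index label of each marked run
    idx = m.index[m].to_series()
    bounds = idx.groupby(b[m]).agg(["first", "last"])
    return bounds.to_numpy().tolist()


df = pd.DataFrame({
    "C1": ['USA','USA','USA','USA','USA','JAPAN','JAPAN','JAPAN','USA','USA'],
    "C2": ['A','B','A','A','A','A','A','A','B','A'],
})
result = block_bounds(df, "C1", "C2", "B")
print(result)

# A frame with a single matching block still gives a nested list
single = pd.DataFrame({"C1": ['USA', 'USA', 'JAPAN'], "C2": ['A', 'B', 'A']})
single_result = block_bounds(single, "C1", "C2", "B")
print(single_result)
```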

- Shubham Sharma

2

Another version:

import numpy as np

x = (
    df.groupby((df.C1 != df.C1.shift(1)).cumsum())["C2"]
    # keep [first, last] index for groups containing "B", NaN otherwise
    .apply(lambda g: [g.index[0], g.index[-1]] if ("B" in g.values) else np.nan)
    .dropna()
    .to_list()
)

print(x)

Output:

[[0, 4], [8, 9]]

- Andrej Kesely

- anky · Accepted Answer

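Here is one approach: first mask the DataFrame to the groups that contain at least one B, then take the index and split it wherever it jumps by more than 1, collecting the first and last index value of each piece: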

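import numpy as np

s = df['C1'].ne(df['C1'].shift()).cumsum()             # label consecutive runs of C1
i = df.index[s.isin(s[df['C2'].eq("B")])].to_numpy()   # index values of runs containing a B
p = np.where(np.diff(i) > 1)[0] + 1                    # positions where the index jumps
split_ = np.split(i, p)
out = [[int(g[0]), int(g[-1])] for g in split_]        # plain ints, no shadowing of i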

print(out)
[[0, 4], [8, 9]]
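
Since the question's goal is to iterate over the matching subgroups, the resulting pairs can be fed to label-based slicing. A minimal sketch, assuming the default integer index from the question and using the result printed above; the names group and sizes are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "C1": ['USA','USA','USA','USA','USA','JAPAN','JAPAN','JAPAN','USA','USA'],
    "C2": ['A','B','A','A','A','A','A','A','B','A'],
})

# Start/end pairs as produced by the answers in this thread
out = [[0, 4], [8, 9]]

sizes = []
for start, end in out:
    # .loc slices by label and includes both endpoints
    group = df.loc[start:end]
    sizes.append(len(group))
    print(start, end, group["C2"].tolist())
```

Note that .loc is used rather than .iloc because the end label has to be included in the slice.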