You can use mask, selecting the last six characters of each string by indexing with str:
mask = df.RegionName.str[-6:] != '[edit]'
print (mask)
0 False
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 False
Name: RegionName, dtype: bool
df['State'] = df.RegionName.mask(mask).ffill()
df = df[df.State != df.RegionName]
print (df)
RegionName State
1 Auburn [1] Alabama[edit]
2 Florence Alabama[edit]
3 Jacksonville [2] Alabama[edit]
4 Livingston [2] Alabama[edit]
5 Montevallo [2] Alabama[edit]
6 Troy [2] Alabama[edit]
7 Tuscaloosa [3][4] Alabama[edit]
8 Tuskegee [5] Alabama[edit]
df['State'] = df.State.mask(df.State.duplicated(), '')
df = df[['State','RegionName']].reset_index(drop=True)
print (df)
           State         RegionName
0  Alabama[edit]         Auburn [1]
1                          Florence
2                  Jacksonville [2]
3                    Livingston [2]
4                    Montevallo [2]
5                          Troy [2]
6                 Tuscaloosa [3][4]
7                      Tuskegee [5]
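To try the steps above end to end, here is a minimal, self-contained sketch. The input frame is not shown in this answer, so the sample below is reconstructed from the printed output; the final 'Alaska[edit]' row in particular is an assumption standing in for the second heading row at index 9.

```python
import pandas as pd

# Sample data reconstructed from the printed output (an assumption;
# 'Alaska[edit]' is a placeholder for the heading row at index 9).
df = pd.DataFrame({'RegionName': [
    'Alabama[edit]', 'Auburn [1]', 'Florence', 'Jacksonville [2]',
    'Livingston [2]', 'Montevallo [2]', 'Troy [2]', 'Tuscaloosa [3][4]',
    'Tuskegee [5]', 'Alaska[edit]',
]})

# True for ordinary rows, False for the state heading rows.
mask = df['RegionName'].str[-6:] != '[edit]'

# Replace ordinary rows with NaN, then carry each heading downwards.
df['State'] = df['RegionName'].mask(mask).ffill()

# Drop the heading rows themselves (where State equals RegionName).
result = df[df['State'] != df['RegionName']].reset_index(drop=True)
print(result)
```

Series.mask replaces values where the condition is True, which is why the mask is built with != rather than ==.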
But if you also need to remove the [] and the numbers, use this slightly modified answer:
df.insert(0, 'State', df['RegionName'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df = df[~df['RegionName'].str.contains(r'\[edit\]')].reset_index(drop=True)
# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df['RegionName'] = df['RegionName'].str.replace(r' \[.+$', '', regex=True)
print (df)
State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
df['State'] = df.State.mask(df.State.duplicated(), '')
print (df)
     State    RegionName
0  Alabama        Auburn
1               Florence
2           Jacksonville
3             Livingston
4             Montevallo
5                   Troy
6             Tuscaloosa
7               Tuskegee
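The regex-based variant can be checked in isolation with the sketch below. The sample frame is again reconstructed from the printed output, and 'Alaska[edit]' is an assumed second heading row. Raw strings and an explicit regex=True are used so that the patterns behave the same way across pandas versions.

```python
import pandas as pd

# Sample data reconstructed from the printed output (an assumption).
df = pd.DataFrame({'RegionName': [
    'Alabama[edit]', 'Auburn [1]', 'Florence', 'Jacksonville [2]',
    'Livingston [2]', 'Montevallo [2]', 'Troy [2]', 'Tuscaloosa [3][4]',
    'Tuskegee [5]', 'Alaska[edit]',
]})

# Capture the text before "[edit]" on heading rows (NaN elsewhere),
# then forward-fill so every row knows its state.
state = df['RegionName'].str.extract(r'(.*)\[edit\]', expand=False).ffill()
df.insert(0, 'State', state)

# Keep only the non-heading rows.
df = df[~df['RegionName'].str.contains(r'\[edit\]')].reset_index(drop=True)

# Strip trailing footnote markers such as " [1]" or " [3][4]".
df['RegionName'] = df['RegionName'].str.replace(r' \[.+$', '', regex=True)
print(df)
```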
EDIT:
If you want the very slow loop solution, it has several problems:
for i, row in df.iterrows():
    print (row)
    if row['RegionName'][-6:] == '[edit]':
        df.loc[i, 'state'] = row['RegionName'][:-6]
print (df)
RegionName state
0 Alabama[edit] Alabama
1 Auburn [1] NaN
2 Florence NaN
3 Jacksonville [2] NaN
4 Livingston [2] NaN
5 Montevallo [2] NaN
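As the output shows, the loop fills state only on the heading rows and leaves NaN everywhere else. One possible repair, offered as a sketch rather than as the recommended approach (the vectorized versions remain much faster), is to remember the last heading seen and attach it to each following row. The sample frame and the names current_state, rows and looped are hypothetical.

```python
import pandas as pd

# Small hypothetical sample: two heading rows with a few towns under them.
df = pd.DataFrame({'RegionName': [
    'Alabama[edit]', 'Auburn [1]', 'Florence',
    'Alaska[edit]', 'Fairbanks [2]',
]})

rows = []
current_state = None  # last heading seen so far
for name in df['RegionName']:
    if name.endswith('[edit]'):
        # Heading row: remember the state, do not emit a record.
        current_state = name[:-6]
    else:
        # Ordinary row: pair it with the most recent heading.
        rows.append({'State': current_state, 'RegionName': name})

looped = pd.DataFrame(rows, columns=['State', 'RegionName'])
print(looped)
```

Collecting plain dicts and building the frame once avoids writing to df cell by cell inside the loop, which is the slowest part of the original.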