在pandas数据框中重新分组列值

18

我有一个脚本,根据pandas df中的两个赋值。 下面的代码能够实现第一步,但我在第二步中遇到了问题。

所以脚本应该最初:

1) 为[区域]中的每个字符串和前3个唯一值[地方]中分配一个人员

2) 寻找重新分配人员,如果他们拥有少于3个唯一值 例如,下面的df[区域][地方]中具有6个唯一值。但是,只分配了3个人员。理想情况下,2个人将拥有每个2个唯一值

d = ({
    'Time' : ['8:03:00','8:17:00','8:20:00','10:15:00','10:15:00','11:48:00','12:00:00','12:10:00'],                 
   'Place' : ['House 1','House 2','House 1','House 3','House 4','House 5','House 1','House 1'],                 
    'Area' : ['X','X','Y','X','X','X','X','X'],    
     })

df = pd.DataFrame(data=d)

def g(gps):
        s = gps['Place'].unique()
        d = dict(zip(s, np.arange(len(s)) // 3 + 1))
        gps['Person'] = gps['Place'].map(d)
        return gps

df = df.groupby('Area', sort=False).apply(g)
s = df['Person'].astype(str) + df['Area']
df['Person'] = pd.Series(pd.factorize(s)[0] + 1).map(str).radd('Person ')
输出:
       Time    Place Area    Person
0   8:03:00  House 1    X  Person 1
1   8:17:00  House 2    X  Person 1
2   8:20:00  House 1    Y  Person 2
3  10:15:00  House 3    X  Person 1
4  10:15:00  House 4    X  Person 3
5  11:48:00  House 5    X  Person 3
6  12:00:00  House 1    X  Person 1
7  12:10:00  House 1    X  Person 1

如您所见,第一步顺利完成。对于[Area]中的每个单独的string,将[Place]中的前3个唯一值分配给Person。这使得Person 1拥有3个值Person 2拥有1个值Person 3拥有2个值

我在第二步遇到了困难。

如果一个Person被分配的唯一值少于3个,请更改此设置,以便每个Person最多具有3个唯一值

预期输出:

       Time    Place Area    Person
0   8:03:00  House 1    X  Person 1
1   8:17:00  House 2    X  Person 1
2   8:20:00  House 1    Y  Person 2
3  10:15:00  House 3    X  Person 1
4  10:15:00  House 4    X  Person 2
5  11:48:00  House 5    X  Person 2
6  12:00:00  House 1    X  Person 1
7  12:10:00  House 1    X  Person 1

9
我已经读了这个问题10遍,但仍然不理解你试图解释的逻辑。输入列是什么(地点和面积)?我特别困惑于1小时内独特值的规则。 - kevh
@kevh,我添加了更详细的描述。我不想用太多信息来困扰读者。问题现在更清晰了吗?还是仍然令人困惑? - user9394674
2
还是有点困惑。也许你想要将人员分组而不是合并他们? - 576i
1
仍然感到困惑。 - Alex
1
在第一步中,您确保人员在一个区域内获得任务。在下一步中,为了确保每个人都获得3个任务,您可以忽略此条件。 但是,看起来您添加了一个额外的条件,即从一个区域到另一个区域至少需要1小时?正确吗?否则,为什么不在第一步中忽略您的区域条件呢? - Sergey
显示剩余11条评论
4个回答

4

当前尝试

在下面,我已经在你的代码最后一行之前添加了几行:

d = ({'Time': ['8:03:00', '8:17:00', '8:20:00', '10:15:00', '10:15:00', '11:48:00', '12:00:00', '12:10:00'],
      'Place': ['House 1', 'House 2', 'House 1', 'House 3', 'House 4', 'House 5', 'House 1', 'House 1'],
      'Area': ['X', 'X', 'Y', 'X', 'X', 'X', 'X', 'X']})

df = pd.DataFrame(data=d)


def g(gps):
        s = gps['Place'].unique()
        d = dict(zip(s, np.arange(len(s)) // 3 + 1))
        gps['Person'] = gps['Place'].map(d)
        return gps


df = df.groupby('Area', sort=False).apply(g)
s = df['Person'].astype(str) + df['Area']

# added lines
t = s.value_counts()
df_sub = df.loc[s[s.isin(t[t < 3].index)].index].copy()
df_sub["tag"] = df_sub["Place"] + df_sub["Area"]
tags = list(df_sub.tag.unique())
f = lambda x: f'R{int(tags.index(x) / 3) + 1}'
df_sub['reassign'] = df_sub.tag.apply(f)
s[s.isin(t[t < 3].index)] = df_sub['reassign']

df['Person'] = pd.Series(pd.factorize(s)[0] + 1).map(str).radd('Person ')

老实说,我不确定它是否适用于所有情况,但在测试案例中它能给出你想要的输出结果。
之前的尝试:
让我们看看我是否能帮助你有限的理解你所要做的事情。
你有一些连续的数据(我称之为事件),你想为每个事件分配一个“人员”标识符。你将分配到每个连续事件上的标识符取决于先前的分配,我认为它需要按以下规则逐步应用:
1. “我认识你”:如果给定标识符的“地点”和“区域”具有相同的值(时间有关系吗?),则可以重用以前的标识符。
2. “我不认识你”:如果出现了新的区域值(那么地点和区域发挥不同的角色吗?),则会创建一个新的标识符。
3. “我是否认识你?”:如果一个标识符未被分配到至少三个事件(如果这对多个标识符都发生了怎么办?我会假设我使用最旧的...),则可能重新使用以前使用过的标识符。
4. “不,我不认识”:如果前面的规则都不适用,则会创建一个新的标识符。
在假设了以上内容后,以下是一个解决方案的实现:
# dict of list of past events assigned to each person. key is person identifier
people = dict()
# new column for df (as list) it will be appended at the end to dataframe
persons = list()


# first we define the rules
def i_know_you(people, now):
    def conditions(now, past):
        return [e for e in past if (now.Place == e.Place) and (now.Area == e.Area)]
    i_do = [person for person, past in people.items() if conditions(now, past)]
    if i_do:
        return i_do[0]
    return False


def i_do_not_know_you(people, now):
    conditions = not bool([e for past in people.values() for e in past if e.Area == now.Area])
    if conditions:
        return f'Person {len(people) + 1}'
    return False


def do_i_know_you(people, now):
    i_do = [person for person, past in people.items() if len(past) < 3]
    if i_do:
        return i_do[0]
    return False


# then we process the sequential data
for event in df.itertuples():
    print('event:', event)
    for rule in [i_know_you, i_do_not_know_you, do_i_know_you]:
        person = rule(people, event)
        print('\t', rule.__name__, person)
        if person:
            break
    if not person:
        person = f'Person {len(people) + 1}'
        print('\t', "nah, I don't", person)
    if person in people:
        people[person].append(event)
    else:
        people[person] = [event]
    persons.append(person)

df['Person'] = persons

输出:

event: Pandas(Index=0, Time='8:00:00', Place='House 1', Area='X', Person='Person 1')
     i_know_you False
     i_do_not_know_you Person 1
event: Pandas(Index=1, Time='8:30:00', Place='House 2', Area='X', Person='Person 1')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 1
event: Pandas(Index=2, Time='9:00:00', Place='House 1', Area='Y', Person='Person 2')
     i_know_you False
     i_do_not_know_you Person 2
event: Pandas(Index=3, Time='9:30:00', Place='House 3', Area='X', Person='Person 1')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 1
event: Pandas(Index=4, Time='10:00:00', Place='House 4', Area='X', Person='Person 2')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 2
event: Pandas(Index=5, Time='10:30:00', Place='House 5', Area='X', Person='Person 2')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 2
event: Pandas(Index=6, Time='11:00:00', Place='House 1', Area='X', Person='Person 1')
     i_know_you Person 1
event: Pandas(Index=7, Time='11:30:00', Place='House 6', Area='X', Person='Person 3')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you False
     nah, I don't Person 3
event: Pandas(Index=8, Time='12:00:00', Place='House 7', Area='X', Person='Person 3')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 3
event: Pandas(Index=9, Time='12:30:00', Place='House 8', Area='X', Person='Person 3')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 3

最终的数据框如您所需:

       Time    Place Area    Person
0   8:00:00  House 1    X  Person 1
1   8:30:00  House 2    X  Person 1
2   9:00:00  House 1    Y  Person 2
3   9:30:00  House 3    X  Person 1
4  10:00:00  House 4    X  Person 2
5  10:30:00  House 5    X  Person 2
6  11:00:00  House 1    X  Person 1
7  11:30:00  House 6    X  Person 3
8  12:00:00  House 7    X  Person 3
9  12:30:00  House 8    X  Person 3
备注: 注意,我故意避免使用分组操作并按顺序处理数据。我认为这种复杂性(而且不太理解你想做什么...)需要采用这种方法。此外,您可以使用上面相同的结构使规则更加复杂(时间是否真的起到作用?)。

新数据更新的答案

查看新数据,很明显我没有理解你想做什么(特别是,该任务似乎不遵循顺序规则)。 我有一个解决方案适用于第二个数据集,但对第一个数据集会产生不同的结果。

解决方案要简单得多,将添加一列(如果您愿意,可以稍后删除):

df["tag"] = df["Place"] + df["Area"]
tags = list(df.tag.unique())
f = lambda x: f'Person {int(tags.index(x) / 3) + 1}'
df['Person'] = df.tag.apply(f)

在第二个数据集上,它会得到:
       Time    Place Area       tag    Person
0   8:00:00  House 1    X  House 1X  Person 1
1   8:30:00  House 2    X  House 2X  Person 1
2   9:00:00  House 3    X  House 3X  Person 1
3   9:30:00  House 1    Y  House 1Y  Person 2
4  10:00:00  House 1    Z  House 1Z  Person 2
5  10:30:00  House 1    V  House 1V  Person 2

在第一个数据集上,它给出:

       Time    Place Area       tag    Person
0   8:00:00  House 1    X  House 1X  Person 1
1   8:30:00  House 2    X  House 2X  Person 1
2   9:00:00  House 1    Y  House 1Y  Person 1
3   9:30:00  House 3    X  House 3X  Person 2
4  10:00:00  House 4    X  House 4X  Person 2
5  10:30:00  House 5    X  House 5X  Person 2
6  11:00:00  House 1    X  House 1X  Person 1
7  11:30:00  House 6    X  House 6X  Person 3
8  12:00:00  House 7    X  House 7X  Person 3
9  12:30:00  House 8    X  House 8X  Person 3

这与您在索引2和3上的预期输出不同。这个输出是否符合您的要求?如果不是,为什么?


谢谢Pietro。这很接近了。我认为还需要第五条规则。有些东西可以决定每个标识符关联的值是什么。如果小于3,则尝试合并。我将在问题底部添加一个df,使用您的代码和应该的内容。 - user9394674
不用谢。我会查看编辑后的问题并尽力提供更新的答案。 - pietroppeter
谢谢Pietro。我需要一些重新分配“剩余物”的东西,以完成初始步骤后的操作。您的过程现在只是将其分成3组,而不考虑第一步。 - user9394674
代码已经适应重新分配第一步中的剩余部分。 - pietroppeter

3
据我所知,您对Person分配之前的所有内容都感到满意。因此,这里有一个即插即用的解决方案,可以将具有少于3个唯一值的Persons“合并”,使每个Person最终拥有3个唯一值,除了最后一个(基于您发布的倒数第二个df(“输出:”),而不触及已经有3个唯一值的那些,只合并其他人。

编辑:大大简化了代码。再次以您的df作为输入:

n = 3
df['complete'] = df.Person.apply(lambda x: 1 if df.Person.tolist().count(x) == n else 0)
df['num'] = df.Person.str.replace('Person ','')
df.sort_values(by=['num','complete'],ascending=True,inplace=True) #get all persons that are complete to the top

c = 0
person_numbers = []
for x in range(0,999): #Create the numbering [1,1,1,2,2,2,3,3,3,...] with n defining how often a person is 'repeated'
    if x % n == 0:
        c += 1        
    person_numbers.append(c) 

df['Person_new'] = person_numbers[0:len(df)] #Add the numbering to the df
df.Person = 'Person ' + df.Person_new.astype(str) #Fill the person column with the new numbering
df.drop(['complete','Person_new','num'],axis=1,inplace=True)

谢谢@David。但是我遇到了一个错误,df.Person = df.Person.str.replace(p_in_df[c],p) NameError: name 'c' is not defined - user9394674
感谢 @PeterJames123 的反馈。我重新思考了代码,见我的编辑。与之前一样,如果您之前的分配是正确的,这将起作用。变量 n 为您提供了更改一个人的最大值数量的灵活性。 - David

0
首先,这个答案不符合您只重新分配剩余物品的要求(因此我不指望您接受它)。尽管如此,我还是发布了它,因为在pandas世界中解决您的时间窗口限制是棘手的。也许我的解决方案现在对您没有用,但以后可能会有用;至少对我来说,这是一个学习经验 - 所以也许其他人可以从中获益。
import pandas as pd
from datetime import datetime, time, timedelta
import random

# --- helper functions for demo

random.seed( 0 )

def makeRandomTimes( nHours = None, mMinutes = None ):
    nHours = 10 if nHours is None else nHours
    mMinutes = 3 if mMinutes is None else mMinutes
    times = []
    for _ in range(nHours):
        hour = random.randint(8,18)
        for _ in range(mMinutes):
            minute = random.randint(0,59)
            times.append( datetime.combine( datetime.today(), time( hour, minute ) ) )
    return times

def makeDf():
    times   = makeRandomTimes()
    houses  = [ str(random.randint(1,10)) for _ in range(30) ]
    areas   = [ ['X','Y'][random.randint(0,1)] for _ in range(30) ]
    df      = pd.DataFrame( {'Time' : times, 'House' : houses, 'Area' : areas } )
    return df.set_index( 'Time' ).sort_index()

# --- real code begins

def evaluateLookback( df, idx, dfg ):

    mask = df.index >= dfg.Lookback.iat[-1]
    personTotals = df[ mask ].set_index('Loc')['Person'].value_counts()
    currentPeople = set(df.Person[ df.Person > -1 ]) 
    noAllocations = currentPeople - set(personTotals.index)
    available = personTotals < 3
    if noAllocations or available.sum():
        # allocate to first available person
        person = min( noAllocations.union(personTotals[ available ].index) )
    else:
        # allocate new person
        person = len( currentPeople )
    df.Person.at[ idx ] = person
    # debug
    df.Verbose.at[ idx ] = ( noAllocations, available.sum() )


def lambdaProxy( df, colName ):
    [ dff[1][colName].apply( lambda f: f(df,*dff) ) for dff in df.groupby(df.index) ]


lookback = timedelta( minutes = 120 )

df1 = makeDf()
df1[ 'Loc' ] = df1[ 'House' ] + df1[ 'Area' ]
df1[ 'Person' ] = None
df1[ 'Lambda' ] = evaluateLookback
df1[ 'Lookback' ] = df1.index - lookback
df1[ 'Verbose' ] = None
lambdaProxy( df1, 'Lambda' )

print( df1[ [ col for col in df1.columns if col != 'Lambda' ] ] )

在我的电脑上,示例输出如下:

                    House Area  Loc Person            Lookback         Verbose
Time
2018-09-30 08:16:00     6    Y   6Y      0 2018-09-30 06:16:00         ({}, 0)
2018-09-30 08:31:00     4    Y   4Y      0 2018-09-30 06:31:00         ({}, 1)
2018-09-30 08:32:00    10    X  10X      0 2018-09-30 06:32:00         ({}, 1)
2018-09-30 09:04:00     4    X   4X      1 2018-09-30 07:04:00         ({}, 0)
2018-09-30 09:46:00    10    X  10X      1 2018-09-30 07:46:00         ({}, 1)
2018-09-30 09:57:00     4    X   4X      1 2018-09-30 07:57:00         ({}, 1)
2018-09-30 10:06:00     1    Y   1Y      2 2018-09-30 08:06:00         ({}, 0)
2018-09-30 10:39:00    10    X  10X      0 2018-09-30 08:39:00        ({0}, 1)
2018-09-30 10:48:00     7    X   7X      0 2018-09-30 08:48:00         ({}, 2)
2018-09-30 11:08:00     1    Y   1Y      0 2018-09-30 09:08:00         ({}, 3)
2018-09-30 11:18:00     2    Y   2Y      1 2018-09-30 09:18:00         ({}, 2)
2018-09-30 11:32:00     9    X   9X      2 2018-09-30 09:32:00         ({}, 1)
2018-09-30 12:22:00     5    Y   5Y      1 2018-09-30 10:22:00         ({}, 2)
2018-09-30 12:30:00     9    X   9X      1 2018-09-30 10:30:00         ({}, 2)
2018-09-30 12:34:00     6    X   6X      2 2018-09-30 10:34:00         ({}, 1)
2018-09-30 12:37:00     1    Y   1Y      2 2018-09-30 10:37:00         ({}, 1)
2018-09-30 12:45:00     4    X   4X      0 2018-09-30 10:45:00         ({}, 1)
2018-09-30 12:58:00     8    X   8X      0 2018-09-30 10:58:00         ({}, 1)
2018-09-30 14:26:00     7    Y   7Y      0 2018-09-30 12:26:00         ({}, 3)
2018-09-30 14:48:00     2    X   2X      0 2018-09-30 12:48:00     ({1, 2}, 1)
2018-09-30 14:50:00     8    X   8X      1 2018-09-30 12:50:00     ({1, 2}, 0)
2018-09-30 14:53:00     8    Y   8Y      1 2018-09-30 12:53:00        ({2}, 1)
2018-09-30 14:56:00     6    X   6X      1 2018-09-30 12:56:00        ({2}, 1)
2018-09-30 14:58:00     9    Y   9Y      2 2018-09-30 12:58:00        ({2}, 0)
2018-09-30 17:09:00     2    Y   2Y      0 2018-09-30 15:09:00  ({0, 1, 2}, 0)
2018-09-30 17:19:00     4    X   4X      0 2018-09-30 15:19:00     ({1, 2}, 1)
2018-09-30 17:57:00     6    Y   6Y      0 2018-09-30 15:57:00     ({1, 2}, 1)
2018-09-30 18:21:00     3    X   3X      1 2018-09-30 16:21:00     ({1, 2}, 0)
2018-09-30 18:30:00     9    X   9X      1 2018-09-30 16:30:00        ({2}, 1)
2018-09-30 18:35:00     8    Y   8Y      1 2018-09-30 16:35:00        ({2}, 1)
>>>

注意事项:

  • lookback变量控制向后查看的时间长度,以考虑将位置分配给某个人
  • Lookback列显示截止时间
  • evaluateLookback针对表中的每一行重复调用,其中df是整个DataFrame,idx是当前索引/标签,dfg是当前行。
  • lambdaProxy控制调用evaluateLookback
  • 每个人的位置数设置为3,但可以根据需要进行调整
  • 通过拥有另一个func列来管理向后查看期间的任意复杂要求,该列首先由lambdaProxy评估,然后将结果存储并在evaluateLookback中使用

演示输出中有一些有趣的边缘情况:10:39:0014:48:0017:09:00


顺便说一下:在pandas中看到“函数列”会很有趣,也许具有类似于memory的功能?理想情况下,“人物”列应该采用一个函数并在请求时进行计算,可以使用自己的行或某些变量窗口视图。有人见过类似的东西吗?


0

第二步怎么样:

def reduce_df(df):
    values = df['Area'] + df['Place']
    df1 = df.loc[~values.duplicated(),:] # ignore duplicate values for this part..
    person_count = df1.groupby('Person')['Person'].agg('count')
    leftover_count = person_count[person_count < 3] # the 'leftovers'

    # try merging pairs together
    nleft = leftover_count.shape[0]
    to_try = np.arange(nleft - 1)
    to_merge = (leftover_count.values[to_try] + 
                leftover_count.values[to_try + 1]) <= 3
    to_merge[1:] = to_merge[1:] & ~to_merge[:-1]
    to_merge = to_try[to_merge]
    merge_dict = dict(zip(leftover_count.index.values[to_merge+1], 
                    leftover_count.index.values[to_merge]))
    def change_person(p):
        if p in merge_dict.keys():
            return merge_dict[p]
        return p
    reduced_df = df.copy()
    # update df with the merges you found
    reduced_df['Person'] = reduced_df['Person'].apply(change_person)
    return reduced_df

print(
    reduce_df(reduce_df(df)) # call twice in case 1,1,1 -> 2,1 -> 3
)

输出:

Area    Place      Time    Person
0    X  House 1   8:03:00  Person 1
1    X  House 2   8:17:00  Person 1
2    Y  House 1   8:20:00  Person 2
3    X  House 3  10:15:00  Person 1
4    X  House 4  10:15:00  Person 2
5    X  House 5  11:48:00  Person 2
6    X  House 1  12:00:00  Person 1
7    X  House 1  12:10:00  Person 1

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接