I am trying to clean up a data frame by removing rows that were added by mistake.
Here is some dummy data:
temp <- structure(list(Date = c("24/06/2002", "24/06/2002", "25/06/2002","25/06/2002", "26/06/2002",
"02/07/2002","03/07/2002","24/07/2002", "08/07/2002",
"08/07/2002", "15/07/2002", "17/07/2002",
"22/07/2002", "22/07/2002", "28/07/2002", "29/07/2002"),
payment = c(200, 1000,-1000, -1000, 1000,
-1000,-1000,-1000, 1200,
-1200, 1200, 1200,
200, 56700, -56700, -200),
Code = c("ABC", "M567", "M567","M567", "XYZ", "M567", "ABX" ,
"M567","M567", "M567",
"M567", "M567", "M300",
"M678", "M678", "ABC"),
ID = c(NA, "98","187","187","12ee","M11","M13",
NA,"K999",
"K999", "111", "111", "11",
"12345", NA, NA)), row.names = c(NA, -16L), class = "data.frame")
The data frame looks like this:
         Date payment Code    ID
1  24/06/2002     200  ABC  <NA>
2  24/06/2002    1000 M567    98
3  25/06/2002   -1000 M567   187
4  25/06/2002   -1000 M567   187
5  26/06/2002    1000  XYZ  12ee
6  02/07/2002   -1000 M567   M11
7  03/07/2002   -1000  ABX   M13
8  24/07/2002   -1000 M567  <NA>
9  08/07/2002    1200 M567  K999
10 08/07/2002   -1200 M567  K999
11 15/07/2002    1200 M567   111
12 17/07/2002    1200 M567   111
13 22/07/2002     200 M300    11
14 22/07/2002   56700 M678 12345
15 28/07/2002  -56700 M678  <NA>
16 29/07/2002    -200  ABC  <NA>
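As you can see, the data contain both positive and negative payments. The negative payments are either transactions that were added by mistake or refunds.
For example, +1200 and -1200 cancel each other out when they share the same Code and ID. Rows 14 and 15 are a similar pair, but the ID of row 15 is missing, so I first have to fill it with the ID of the matching positive-payment row (and vice versa) before I can delete both rows.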
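To make the cancellation rule concrete, here is a minimal base-R sketch. It assumes that "offsetting" means: same Code, same ID, same absolute amount, opposite sign, and that the k-th positive row pairs with the k-th negative row. The `toy` data frame and the helper names (`key`, `k`, `slot`, `paired`) are made up for illustration and are not part of the real data.

```r
# Hypothetical toy data, only to illustrate the pairing rule
toy <- data.frame(
  payment = c(1200, -1200, 1200, 1200, -1000),
  Code    = c("M567", "M567", "M567", "M567", "M567"),
  ID      = c("K999", "K999", "111", "111", "M11"),
  stringsAsFactors = FALSE
)

# Rows can only offset each other inside the same Code / ID / absolute amount
key <- paste(toy$Code, toy$ID, abs(toy$payment), sep = "|")

# Number the rows within each key and sign: 1st +1200, 2nd +1200, 1st -1200, ...
k <- ave(seq_len(nrow(toy)), paste(key, sign(toy$payment)), FUN = seq_along)

# A slot that occurs twice holds one positive and one negative row: a cancelling pair
slot   <- paste(key, k, sep = "|")
paired <- ave(seq_len(nrow(toy)), slot, FUN = length) == 2

kept <- toy[!paired, ]
kept
```

In this toy example only the K999 pair is dropped; the two +1200 rows with ID 111 stay because there is no negative row to offset them.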
This is the code I tried, which I got after asking for help on StackOverflow:
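library(dplyr)
library(data.table)  # for rowid()
library(tidyr)       # for fill()
Final_df <- temp %>%
  group_by(Code) %>%
  # number repeated payments within each Code
  mutate(ind = rowid(payment)) %>%
  group_by(ind, .add = TRUE) %>%
  # fill missing IDs from neighbouring rows of the same group
  fill(ID, .direction = 'downup') %>%
  ungroup() %>%
  mutate(absPayment = abs(payment)) %>%
  arrange(ID, Code, absPayment) %>%
  group_by(Code, ID, absPayment) %>%
  # pair the k-th positive with the k-th negative of the same amount
  mutate(grp = rowid(sign(payment))) %>%
  group_by(grp, .add = TRUE) %>%
  # keep only rows without an offsetting partner
  filter(n() == 1) %>%
  ungroup() %>%
  select(names(temp))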
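The problem here is row 8 (24/07/2002, -1000, M567): its ID should be filled from row 2, because the Code matches and row 2 holds the offsetting positive payment, so that the two rows can then be cancelled. Because row 2 is far away from row 8, .direction = 'downup' does not work.
I think there must be a better way to fill the NAs than relying on direction, since that cannot be applied when the matching rows are far apart.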
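One distance-independent way to fill the NAs might be a lookup instead of a directional fill. The base-R sketch below assumes that a missing ID should be taken from the first row with the same Code and the exact opposite payment, wherever that row sits. The `toy` data and the names `donors`, `wanted` and `have` are hypothetical.

```r
# Hypothetical toy data: rows 3 and 5 have a missing ID
toy <- data.frame(
  payment = c(1000, -1000, -1000, 56700, -56700),
  Code    = c("M567", "M567", "M567", "M678", "M678"),
  ID      = c("98", "187", NA, "12345", NA),
  stringsAsFactors = FALSE
)

missing <- is.na(toy$ID)
donors  <- toy[!missing, ]  # rows that can lend their ID

# For each NA row, look for a donor with the same Code and the opposite payment
wanted <- paste(toy$Code[missing], -toy$payment[missing], sep = "|")
have   <- paste(donors$Code, donors$payment, sep = "|")

# match() returns the first matching donor, or NA when there is none
toy$ID[missing] <- donors$ID[match(wanted, have)]
toy
```

Because `match()` ignores row position, row 3 picks up ID 98 from row 1 even though another M567 row sits in between. If several donors qualify, this sketch simply takes the first one, which may or may not be the desired rule.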
The expected output is:
        Date payment Code   ID
1 25/06/2002   -1000 M567  187
2 25/06/2002   -1000 M567  187
3 26/06/2002    1000  XYZ 12ee
4 02/07/2002   -1000 M567  M11
5 03/07/2002   -1000  ABX  M13
6 15/07/2002    1200 M567  111
7 17/07/2002    1200 M567  111
8 22/07/2002     200 M300   11
I have been stuck on this problem for 5 days. Any solution would be very helpful.
Thanks in advance.
Another possible dummy data set:
temp_2 <- structure(list(Date = c("22/06/2002", "23/06/2002","24/06/2002", "25/06/2002","25/06/2002", "26/06/2002",
"02/07/2002","03/07/2002","24/07/2002", "08/07/2002",
"08/07/2002", "15/07/2002", "17/07/2002",
"22/07/2002", "22/07/2002", "28/07/2002", "29/07/2002"),
payment = c(200,-1000, 1000,-1000, -1000, 1000,
-1000,-1000,-1000, 1200,
-1200, 1200, 1200,
200, 56700, -56700, -200),
Code = c("ABC", "M567","M567", "M567","M567", "XYZ", "M567", "ABX" ,
"M567","M567", "M567",
"M567", "M567", "M300",
"M678", "M678", "ABC"),
ID = c(NA,"187", "98","187","187","12ee",NA,NA,
NA,"K999",
"K999", "111", "111", "11",
"12345", NA, NA)), row.names = c(NA, -17L), class = "data.frame")
Expected output for temp_2:
        Date payment Code   ID
1 23/06/2002   -1000 M567  187
2 25/06/2002   -1000 M567  187
3 25/06/2002   -1000 M567  187
4 26/06/2002    1000  XYZ 12ee
5 03/07/2002   -1000  ABX <NA>
6 24/07/2002   -1000 M567   98
7 15/07/2002    1200 M567  111
8 17/07/2002    1200 M567  111
9 22/07/2002     200 M300   11
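As a rough check of the rule against temp_2, the following base-R sketch first fills missing IDs by lookup and then drops offsetting pairs. `fill_and_cancel()` is a hypothetical helper, not an existing function. It assumes that an NA ID is taken from the first row with the same Code and the opposite payment, and that the k-th positive row cancels the k-th negative row with the same Code, ID and absolute amount. The data frame is a copy of temp_2, written with data.frame() instead of structure().

```r
# Copy of temp_2 from the post
temp_2 <- data.frame(
  Date = c("22/06/2002", "23/06/2002", "24/06/2002", "25/06/2002", "25/06/2002",
           "26/06/2002", "02/07/2002", "03/07/2002", "24/07/2002", "08/07/2002",
           "08/07/2002", "15/07/2002", "17/07/2002", "22/07/2002", "22/07/2002",
           "28/07/2002", "29/07/2002"),
  payment = c(200, -1000, 1000, -1000, -1000, 1000, -1000, -1000, -1000,
              1200, -1200, 1200, 1200, 200, 56700, -56700, -200),
  Code = c("ABC", "M567", "M567", "M567", "M567", "XYZ", "M567", "ABX", "M567",
           "M567", "M567", "M567", "M567", "M300", "M678", "M678", "ABC"),
  ID = c(NA, "187", "98", "187", "187", "12ee", NA, NA, NA,
         "K999", "K999", "111", "111", "11", "12345", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill NA IDs by lookup, then remove cancelling pairs
fill_and_cancel <- function(df) {
  # Step 1: take a missing ID from the first row with the same Code
  # and the opposite payment, regardless of how far away that row is
  missing <- is.na(df$ID)
  donors  <- df[!missing, ]
  wanted  <- paste(df$Code[missing], -df$payment[missing], sep = "|")
  have    <- paste(donors$Code, donors$payment, sep = "|")
  df$ID[missing] <- donors$ID[match(wanted, have)]

  # Step 2: pair the k-th +x with the k-th -x inside each Code/ID/amount group
  # (rows whose ID is still NA are grouped together under the label "NA")
  key    <- paste(df$Code, df$ID, abs(df$payment), sep = "|")
  k      <- ave(seq_len(nrow(df)), paste(key, sign(df$payment)), FUN = seq_along)
  slot   <- paste(key, k, sep = "|")
  paired <- ave(seq_len(nrow(df)), slot, FUN = length) == 2

  df[!paired, ]  # original row order and row names are preserved
}

res <- fill_and_cancel(temp_2)
res
```

Traced by hand, this keeps original rows 2, 4, 5, 6, 8, 9, 12, 13 and 14, which lines up with the expected output above, including the 24/07/2002 row that receives ID 98 but has no positive payment left to cancel against. Note that the result keeps the original row names; `rownames(res) <- NULL` would renumber them.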
What does "assignment" refer to? - akrun