Count the number of unique values, excluding those with a specific symbol

Question

r dataframe datetime count data-manipulation

8

I have a data frame `df` with a categorical column and a POSIXct column. The data looks like this:

Category	DateTime
A	2022-08-29 00:00:00
A	2022-08-29 00:00:00
A 1	2022-08-29 00:00:00
A 1	2022-08-29 00:00:00
A 1	2022-08-29 00:00:00
B	2022-08-29 00:00:00
B	2022-08-29 00:00:00
B	2022-08-29 00:00:00
B 1	2022-08-29 00:00:00
B 1	2022-08-29 00:00:00
B 1	2022-08-29 00:00:00
B 1	2022-08-29 00:00:00
B 1	2022-08-29 00:00:00
A	2022-08-29 02:00:00
A 1	2022-08-29 02:00:00
B	2022-08-29 02:00:00
B	2022-08-29 02:00:00
B	2022-08-29 02:00:00
B 1	2022-08-29 02:00:00
B 1	2022-08-29 02:00:00
B 1	2022-08-29 02:00:00

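I want to create a new data frame `df2` that counts, within each `DateTime`, how many rows there are for each `Category` value that does not end with " 1". The result should look like this: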

Category	DateTime	CatCount
A	2022-08-29 00:00:00	2
B	2022-08-29 00:00:00	3
A	2022-08-29 02:00:00	1
B	2022-08-29 02:00:00	3

- Jacob

6 Answers

4

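Another base R solution, using transform and aggregate:

# Drop rows whose Category ends in "1", add a helper column to count on,
# then count rows per Category/DateTime pair.
# The `_` placeholder for the native pipe needs R >= 4.2.
transform(df1[!grepl("1$", df1$Category),], count = 1) |>
  aggregate(count ~ Category + DateTime, data = _, length)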

  Category            DateTime count
1        A 2022-08-29 00:00:00     2
2        B 2022-08-29 00:00:00     3
3        A 2022-08-29 02:00:00     1
4        B 2022-08-29 02:00:00     3

- Maël

2

Also, aggregate(count ~ Category + DateTime, data = _, length) can be replaced with aggregate(count ~ ., data = _, length). - GKi
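A minimal sketch of why the two formulas in the comment above give the same result: in a formula, `.` stands for every column of `data` that is not already on the left-hand side, so it only works as a shorthand when the data frame holds nothing but the grouping columns and `count`. The small `df1` below is a made-up subset for illustration, not the data from the question, and the pipe placeholder is replaced by an intermediate variable so the code also runs on R older than 4.2.

```r
# Hypothetical miniature of the question's data (illustration only)
df1 <- data.frame(
  Category = c("A", "A", "A 1", "B", "B 1", "A", "B", "B"),
  DateTime = c(rep("2022-08-29 00:00:00", 5), rep("2022-08-29 02:00:00", 3))
)

# Keep rows whose Category does not end in "1" and add a helper column
kept <- transform(df1[!grepl("1$", df1$Category), ], count = 1)

# Grouping columns spelled out
explicit <- aggregate(count ~ Category + DateTime, data = kept, FUN = length)

# `.` expands to all remaining columns of `kept` (Category and DateTime)
dotted <- aggregate(count ~ ., data = kept, FUN = length)

# Compare the two results
all.equal(explicit, dotted)
```

If `kept` carried any extra column, `count ~ .` would group by that column as well, so the explicit form is the safer choice on wider data.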

3

library(dplyr)
your_data %>%
  filter(!endsWith(Category, "1")) %>%
  count(Category, DateTime)

- Gregor Thomas

3

Or a base solution using table:

df[nchar(df$Category) == 1,] |>
  table() |>
  as.data.frame(responseName = "CatCount")

We could of course subset in several other ways, such as df[!endsWith(df$Category, "1"),] as suggested by @Gregor Thomas, or df[!grepl("\\s+1", df$Category),] as suggested by @akrun.

  Category            DateTime CatCount
1        A 2022-08-29 00:00:00        2
2        B 2022-08-29 00:00:00        3
3        A 2022-08-29 02:00:00        1
4        B 2022-08-29 02:00:00        3
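The subsetting conditions used across the answers all agree on the question's data, where Category is only ever "A", "A 1", "B" or "B 1", but they are not the same rule. Here is a rough sketch of where they diverge; the extra values "AB", "A 11", "A1" and "A 1 x" are hypothetical and do not appear in the question. TRUE means the row would be kept.

```r
# "A" and "A 1" come from the question; the remaining values are made up
x <- c("A", "A 1", "AB", "A 11", "A1", "A 1 x")

keep <- data.frame(
  Category  = x,
  nchar_1   = nchar(x) == 1,        # exactly one character long
  ends_1    = !endsWith(x, "1"),    # does not end in "1"
  ws_then_1 = !grepl("\\s+1", x),   # no whitespace followed by "1" anywhere
  any_digit = !grepl("\\d", x),     # contains no digit at all
  ends_sp_1 = !endsWith(x, " 1")    # does not end in " 1" (literal wording of the question)
)

print(keep)
```

Which condition is right depends on what the real categories look like: `nchar(x) == 1` would drop a longer label such as "AB", while `!endsWith(x, " 1")` is the closest to the wording "does not end with ' 1'".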

Data:

library(readr)

df <- read_delim("Category,DateTime
A,2022-08-29 00:00:00
A,2022-08-29 00:00:00
A 1,2022-08-29 00:00:00
A 1,2022-08-29 00:00:00
A 1,2022-08-29 00:00:00
B,2022-08-29 00:00:00
B,2022-08-29 00:00:00
B,2022-08-29 00:00:00
B 1,2022-08-29 00:00:00
B 1,2022-08-29 00:00:00
B 1,2022-08-29 00:00:00
B 1,2022-08-29 00:00:00
B 1,2022-08-29 00:00:00
A,2022-08-29 02:00:00
A 1,2022-08-29 02:00:00
B,2022-08-29 02:00:00
B,2022-08-29 02:00:00
B,2022-08-29 02:00:00
B 1,2022-08-29 02:00:00
B 1,2022-08-29 02:00:00
B 1,2022-08-29 02:00:00", delim = ",")

Update: added data.

- harre

2

Here is a data.table option: we can use grepl (stringr would also work) to drop any row whose Category contains a digit, then count with .N.

library(data.table)

setDT(dt)[!grepl("\\d", Category), .N, .(Category, DateTime)]

Output

   Category            DateTime N
1:        A 2022-08-29 00:00:00 2
2:        B 2022-08-29 00:00:00 3
3:        A 2022-08-29 02:00:00 1
4:        B 2022-08-29 02:00:00 3

Data

dt <- structure(list(Category = c("A", "A", "A 1", "A 1", "A 1", "B", 
"B", "B", "B 1", "B 1", "B 1", "B 1", "B 1", "A", "A 1", "B", 
"B", "B", "B 1", "B 1", "B 1"), DateTime = c("2022-08-29 00:00:00", 
"2022-08-29 00:00:00", "2022-08-29 00:00:00", "2022-08-29 00:00:00", 
"2022-08-29 00:00:00", "2022-08-29 00:00:00", "2022-08-29 00:00:00", 
"2022-08-29 00:00:00", "2022-08-29 00:00:00", "2022-08-29 00:00:00", 
"2022-08-29 00:00:00", "2022-08-29 00:00:00", "2022-08-29 00:00:00", 
"2022-08-29 02:00:00", "2022-08-29 02:00:00", "2022-08-29 02:00:00", 
"2022-08-29 02:00:00", "2022-08-29 02:00:00", "2022-08-29 02:00:00", 
"2022-08-29 02:00:00", "2022-08-29 02:00:00")), class = "data.frame", row.names = c(NA, 
-21L))

- AndrewGB

2

We can conveniently use the subset argument of aggregate.

aggregate(cbind(CatCount=rep(1, length(Category))) ~ Category + DateTime, df1, length, 
          subset=!grepl('1', Category))
#   Category            DateTime CatCount
# 1        A 2022-08-29 00:00:00        2
# 2        B 2022-08-29 00:00:00        3
# 3        A 2022-08-29 02:00:00        1
# 4        B 2022-08-29 02:00:00        3

Data from @akrun.

- jay.sf

- akrun · Accepted Answer

We can filter out the rows that have a 1 in Category, then count

library(dplyr)
library(stringr)
df1 %>%
   filter(str_detect(Category, "\\s+1", negate = TRUE)) %>%
   count(Category, DateTime, name = "CatCount")

-output

  Category            DateTime CatCount
1        A 2022-08-29 00:00:00        2
2        A 2022-08-29 02:00:00        1
3        B 2022-08-29 00:00:00        3
4        B 2022-08-29 02:00:00        3
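Note that the output above is sorted by Category and then DateTime, whereas the table in the question is ordered by DateTime first. If the row order matters, it can be restored afterwards. Below is a small base R sketch: `res` is typed in by hand to mirror the output shown above (it is not produced by running the pipeline here), and `df2` is the name the question uses for the result.

```r
# Hand-copied from the output above, for illustration
res <- data.frame(
  Category = c("A", "A", "B", "B"),
  DateTime = c("2022-08-29 00:00:00", "2022-08-29 02:00:00",
               "2022-08-29 00:00:00", "2022-08-29 02:00:00"),
  CatCount = c(2L, 1L, 3L, 3L)
)

# Sort by DateTime first, then Category, to match the layout in the question
df2 <- res[order(res$DateTime, res$Category), ]
rownames(df2) <- NULL  # renumber rows 1..n after reordering

df2
```

Inside the dplyr pipeline the same reordering should be achievable by appending `arrange(DateTime, Category)`. Sorting the strings works here because the timestamps are in ISO-like "YYYY-MM-DD HH:MM:SS" form; with a real POSIXct column, `order()` sorts chronologically.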

Data

df1 <- structure(list(Category = c("A", "A", "A 1", "A 1", "A 1", "B", 
"B", "B", "B 1", "B 1", "B 1", "B 1", "B 1", "A", "A 1", "B", 
"B", "B", "B 1", "B 1", "B 1"), DateTime = c("2022-08-29 00:00:00", 
"2022-08-29 00:00:00", "2022-08-29 00:00:00", "2022-08-29 00:00:00", 
"2022-08-29 00:00:00", "2022-08-29 00:00:00", "2022-08-29 00:00:00", 
"2022-08-29 00:00:00", "2022-08-29 00:00:00", "2022-08-29 00:00:00", 
"2022-08-29 00:00:00", "2022-08-29 00:00:00", "2022-08-29 00:00:00", 
"2022-08-29 02:00:00", "2022-08-29 02:00:00", "2022-08-29 02:00:00", 
"2022-08-29 02:00:00", "2022-08-29 02:00:00", "2022-08-29 02:00:00", 
"2022-08-29 02:00:00", "2022-08-29 02:00:00")),
class = "data.frame", row.names = c(NA, 
-21L))