Why does rbind produce warnings?

5
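This is related to the question "Is there a more elegant way to convert ragged data into a tidy data frame?".
Why does the following code not work: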
events = structure(list(date = structure(c(-714974, -714579, -717835), class = "Date"), 
    days = c(1, 6, 0.5), name = c("Intro to stats", "Stats Winter school", 
    "TidyR tools"), topics = c("probability|R", "R|regression|ggplot", 
    "tidyR|dplyr")), .Names = c("date", "days", "name", "topics"
), row.names = c(NA, -3L), class = "data.frame")

> newdf <- data.frame(topic=character(), days=character())
> for(i in 1:length(events$topics)){
+ xx = unlist(strsplit(events$topics[i],'\\|'))
+ for(j in 1:length(xx)){
+ yy = c(xx[j], events$days[i]/length(xx))
+ print(yy)
+ newdf=rbind(newdf, yy)
+ }
+ }
[1] "probability" "0.5"        
[1] "R"   "0.5"
[1] "R" "2"
[1] "regression" "2"         
[1] "ggplot" "2"     
[1] "tidyR" "0.25" 
[1] "dplyr" "0.25" 
There were 11 warnings (use warnings() to see them)
> newdf
  X.probability. X.0.5.
1    probability    0.5
2           <NA>    0.5
3           <NA>   <NA>
4           <NA>   <NA>
5           <NA>   <NA>
6           <NA>   <NA>
7           <NA>   <NA>
> 
> warnings()
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA ... :
  invalid factor level, NAs generated
2: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA,  ... :
  invalid factor level, NAs generated
3: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L,  ... :
  invalid factor level, NAs generated
4: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA,  ... :
  invalid factor level, NAs generated
5: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L,  ... :
  invalid factor level, NAs generated
6: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA,  ... :
  invalid factor level, NAs generated
7: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L,  ... :
  invalid factor level, NAs generated
8: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA,  ... :
  invalid factor level, NAs generated
9: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L,  ... :
  invalid factor level, NAs generated
10: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA,  ... :
  invalid factor level, NAs generated
11: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L,  ... :
  invalid factor level, NAs generated
> 

yy looks fine, but rbind does not work. Where is the error and how can it be corrected? Thanks for your help.

3 Answers

5

You can try the following:

newdf <- data.frame(topic=character(), daysPerTopic=numeric(), stringsAsFactors=FALSE)
for(i in seq_along(events$topics)){
  xx = unlist(strsplit(events$topics[i],'\\|'))
  for(j in seq_along(xx)){
    # build each row as a one-row data frame so every column keeps its own type
    yy = data.frame(topic=xx[j], daysPerTopic=events$days[i]/length(xx), stringsAsFactors=FALSE)
    newdf <- rbind(newdf, yy)
  }
}

 newdf
#        topic daysPerTopic
# 1 probability         0.50
# 2           R         0.50
# 3           R         2.00
# 4  regression         2.00
# 5      ggplot         2.00
# 6       tidyR         0.25
# 7       dplyr         0.25

Or

 op <- options(stringsAsFactors=F)  #set to F

 #Your code
 newdf <- data.frame(topic=character(), days=character())
 for(i in 1:length(events$topics)){
 xx = unlist(strsplit(events$topics[i],'\\|'))
 for(j in 1:length(xx)){
yy = c(xx[j], events$days[i]/length(xx))
print(yy)
newdf=rbind(newdf, yy)
 }
 }

 newdf
#  X.probability. X.0.5.
# 1    probability    0.5
# 2              R    0.5
# 3              R      2
# 4     regression      2
# 5         ggplot      2
# 6          tidyR   0.25
# 7          dplyr   0.25

 options(op) # set back to the previous value
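Both loops above grow newdf one row at a time. As a minimal sketch of a loop-free alternative in base R, the same table can be built from the lengths of the split topics. The small events data frame below is a stand-in for the one in the question (only the days and topics columns are reproduced), and the names parts and n are illustrative.

```r
# Stand-in for the question's data: only the two columns the loop uses.
events <- data.frame(
  days   = c(1, 6, 0.5),
  topics = c("probability|R", "R|regression|ggplot", "tidyR|dplyr"),
  stringsAsFactors = FALSE
)

# Split every topics string at once; fixed = TRUE treats "|" literally.
parts <- strsplit(events$topics, "|", fixed = TRUE)
n     <- lengths(parts)   # number of topics per event

# One row per topic; each event's days are shared equally among its topics.
newdf <- data.frame(
  topic        = unlist(parts),
  daysPerTopic = rep(events$days / n, times = n),
  stringsAsFactors = FALSE
)
print(newdf)
```

Because the columns are created with their final types in a single call, no row is ever assigned into an existing factor, so the "invalid factor level" warning cannot arise.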

I didn't realize that both arguments to rbind should be data frames. - rnso
1
@rnso, they don't have to be. You just need to be careful when dealing with factors. - David Arenburg
OK. stringsAsFactors=F is the key point. Thanks. - rnso

5

Have you tried debugging your for loop? For example, by adding print(class(yy)); print(str(newdf)), you will see that after the first iteration both newdf columns have become factors.

# [1] "probability" "0.5"        
# [1] "character"
# 'data.frame':  0 obs. of  2 variables:
#   $ topic: Factor w/ 0 levels: 
#   $ days : Factor w/ 0 levels: 
#   NULL
# [1] "R"   "0.5"
# [1] "character"
# 'data.frame': 1 obs. of  2 variables:
#   $ X.probability.: Factor w/ 1 level "probability": 1
# $ X.0.5.        : Factor w/ 1 level "0.5": 1
# NULL
# [1] "R" "2"
# [1] "character"
# 'data.frame': 2 obs. of  2 variables:
#   $ X.probability.: Factor w/ 1 level "probability": 1 NA
# $ X.0.5.        : Factor w/ 1 level "0.5": 1 1

...

You might say "but I defined them as character". Yes, but if you read the rbind documentation, you will see:

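For cbind (rbind), vectors of zero length (including NULL) are ignored unless the result would have zero rows (columns), for S compatibility. (Zero-extent matrices do not occur in S3 and are not ignored in R.)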
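A related point can be checked directly. The rbind documentation also states that the data frame method drops zero-row arguments, so the column types declared on an empty data frame do not constrain the result. The sketch below is an illustration under that documented behaviour; the names empty, row1 and res are made up for the example.

```r
# An empty frame that declares both columns as character.
empty <- data.frame(topic = character(), days = character(),
                    stringsAsFactors = FALSE)

# A one-row frame whose days column is numeric.
row1 <- data.frame(topic = "probability", days = 0.5,
                   stringsAsFactors = FALSE)

# The zero-row argument is dropped, so the result takes its types from row1.
res <- rbind(empty, row1)
str(res)
```

This is consistent with the question's output, where the names topic and days were replaced by names derived from the first vector that was bound.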

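Another property of rbind is that it takes its defaults from data.frame, one of which is stringsAsFactors == TRUE (the default before R 4.0.0).

What happens here can easily be illustrated with a dummy example. Consider: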

temp <- data.frame(A = letters[1:3])
str(temp)
## 'data.frame':    3 obs. of  1 variable:
## $ A: Factor w/ 3 levels "a","b","c": 1 2 3

temp$A[3] <- "d"
## Warning message:
## In `[<-.factor`(`*tmp*`, 3, value = c(1L, 2L, NA)) :
##   invalid factor level, NA generated

temp$A
## [1] a    b    <NA>
## Levels: a b c

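You can see two things here:

  • data.frame automatically converts character columns to factors
  • When you try to assign a new level into a factor vector, it is converted to NA and you get exactly the warning you received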

As @akrun mentioned, setting options(stringsAsFactors=F) will solve your problem.
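The dummy example can be extended into a small sketch of two ways to avoid the warning without changing a global option. Passing stringsAsFactors explicitly is a choice made here so that the code behaves the same on any R version; the names chr and fac are illustrative.

```r
# Option 1: keep the column as character from the start.
chr <- data.frame(A = letters[1:3], stringsAsFactors = FALSE)
chr$A[3] <- "d"   # ordinary character assignment, no warning

# Option 2: keep the factor, but register the new level before assigning it.
fac <- data.frame(A = letters[1:3], stringsAsFactors = TRUE)
levels(fac$A) <- c(levels(fac$A), "d")
fac$A[3] <- "d"   # "d" is now a valid level, so no NA is generated

print(chr$A)
print(fac$A)
```

Option 1 is usually the simpler choice when the column holds free text. Option 2 is only worth the extra step when the column really is categorical.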


Yes, yes, uh-huh, yes, yes, I agree. +1 - Rich Scriven
I did try debugging with print(..) lines in the code, but did not write everything out here. - rnso

3

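Set options(stringsAsFactors=FALSE) and your code should work as expected. The warnings and NA results are caused by the implicit conversion to factor and by the type mismatch between the columns of the new df and yy; see https://dev59.com/qnI-5IYBdhLWcg3w3crS#1640729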
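The type mismatch can be seen in isolation. In the sketch below, c() coerces the number to a string because an atomic vector holds a single type, whereas a list keeps each element's own type. The values mirror one iteration of the question's loop, and the names yy_vec and yy_list are illustrative.

```r
# As in the question: a topic name and a number combined with c().
yy_vec <- c("R", 6 / 3)
print(class(yy_vec))   # the numeric value has been coerced to character

# A list (or a one-row data frame) keeps the two types separate.
yy_list <- list(topic = "R", days = 6 / 3)
print(vapply(yy_list, class, character(1)))
```

This is why the days column in the question's result holds strings such as "0.5" rather than numbers, even once the factor problem is fixed.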

For a cleaner way to achieve the same result, here is a grouped solution using data.table:

library(data.table)
events <- as.data.table(events)
events2 <- events[, list(topic=unlist(strsplit(topics, '|', fixed=TRUE))), by=c("date", "days", "name")]
events2[, probability := days / .N, by=name]
