在R中，通过比较数据框中现有的变量来添加新变量。

Question

在R中，通过比较数据框中现有的变量来添加新变量。

3

我有一个包含2016年初选结果的数据集。该数据集包含8列：州/地区、州/地区简称、县/市、FIPS（即组合的州和县ID号码）、政党、候选人、得票数和分数票数。我想为每个候选人在每个县/市中创建一个名为“result”的新列，表示“获胜”或“失败”。我使用dplyr过滤数据到2个民主党候选人，然后使用以下代码添加了该列：

 Democrat$result <- ifelse(Democrat$fraction_votes > .5, "Win","Loss")

这显然不是一个准确的方法，因为获胜者并不总是得到50%的选票。我该如何让R比较每个县的选票比例或选票总数，并返回“胜利”或“失败”的结果？使用apply()函数族、for循环或编写函数哪种方式最好来创建新列？

  state state_abbreviation  county fips    party       candidate
   1    Alabama  AL         Autauga 1001 Democrat  Bernie Sanders
   2    Alabama  AL         Autauga 1001 Democrat Hillary Clinton
   3 Alabama    AL          Baldwin 1003 Democrat  Bernie Sanders
   4 Alabama   AL           Baldwin 1003 Democrat Hillary Clinton
   5 Alabama   AL           Barbour 1005 Democrat  Bernie Sanders
   6 Alabama   AL            Barbour 1005 Democrat Hillary Clinton
    votes fraction_votes
    1   544          0.182
    2  2387          0.800
     3  2694          0.329
     4  5290          0.647
    5   222          0.078
    6  2567          0.906

- Andrew Lastrapes

1

我们能否得到您数据集的一个示例？ - Nico Coallier

编辑你的帖子！ - Petter Friberg

好的，就是这样。 - Andrew Lastrapes

我不确定我理解你的问题...有几个例子是候选人只获得了47%的选票就赢得了选举。 - Andrew Lastrapes

2个回答

1

在基础R中，您可以使用ave计算二进制向量：

Democrat$winner <- ave(Democrat$fraction_votes, Democrat$fips, FUN=function(i) i == max(i))

which returns

Democrat
    state state_abbreviation  county fips    party candidate votes fraction_votes winner
1 Alabama                 AL Autauga 1001 Democrat    Bernie   544          0.182      0
2 Alabama                 AL Autauga 1001 Democrat   Hillary  2387          0.800      1
3 Alabama                 AL Baldwin 1003 Democrat    Bernie  2694          0.329      0
4 Alabama                 AL Baldwin 1003 Democrat   Hillary  5290          0.647      1
5 Alabama                 AL Barbour 1005 Democrat    Bernie   222          0.078      0
6 Alabama                 AL Barbour 1005 Democrat   Hillary  2567          0.906      1

如果需要，可以通过将ave包装在as.logical中将其转换为逻辑值。

在 data.table 中，这也非常简单。假设 fips 是唯一的州县 ID：

library(data.table)
# convert to data.table
setDT(Democrat)

# get logical vector that proclaims winner if vote fraction is maximum
Democrat[, winner := fraction_votes == max(fraction_votes), by=fips]

which返回

Democrat
     state state_abbreviation  county fips    party candidate votes fraction_votes winner
1: Alabama                 AL Autauga 1001 Democrat    Bernie   544          0.182  FALSE
2: Alabama                 AL Autauga 1001 Democrat   Hillary  2387          0.800   TRUE
3: Alabama                 AL Baldwin 1003 Democrat    Bernie  2694          0.329  FALSE
4: Alabama                 AL Baldwin 1003 Democrat   Hillary  5290          0.647   TRUE
5: Alabama                 AL Barbour 1005 Democrat    Bernie   222          0.078  FALSE
6: Alabama                 AL Barbour 1005 Democrat   Hillary  2567          0.906   TRUE

数据

Democrat <-
structure(list(state = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "Alabama", class = "factor"), 
    state_abbreviation = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "AL", class = "factor"), 
    county = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label = c("Autauga", 
    "Baldwin", "Barbour"), class = "factor"), fips = c(1001L, 
    1001L, 1003L, 1003L, 1005L, 1005L), party = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = "Democrat", class = "factor"), 
    candidate = structure(c(1L, 2L, 1L, 2L, 1L, 2L), .Label = c("Bernie", 
    "Hillary"), class = "factor"), votes = c(544L, 2387L, 2694L, 
    5290L, 222L, 2567L), fraction_votes = c(0.182, 0.8, 0.329, 
    0.647, 0.078, 0.906)), .Names = c("state", "state_abbreviation", 
"county", "fips", "party", "candidate", "votes", "fraction_votes"
), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")

- lmo

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- ira · Accepted Answer

我首先会使用dplyr包中的summarise函数来找到每个县收到的最大选票数，然后将带有县最大值的列添加到原始数据集中，最后计算结果。

# create a sample dataset akin to the question setup
df <- data.frame(abrev = rep("AL", 6), county = c("Autuga", "Autuga", "Baldwin", "Baldwin",
                                                  "Barbour", "Barbour"),
                 party = rep("Democrat", 6), 
                 candidate = rep(c("Bernie", "Hillary"), 3),
                 fraction_votes = c(0.18, 0.8, 0.32, 0.64, 0.07, 0.9))

# load a dplyr library
library(dplyr)

# calculate what was the maximum ammount of votes candidate received in a given county

# take a df dataset
winners <- df %>%
        # group it by a county
        group_by(county) %>%
        # for each county, calculate maximum of votes
        summarise(score = max(fraction_votes))

# join the original dataset and the dataset with county maximumus
# join them by county column
df <- left_join(df, winners, by = c("county"))

# calculate the result column
df$result <- ifelse(df$fraction_votes == df$score, "Win", "Loss")

如果有不同的县市拥有相同的名称，您需要调整分组和连接部分，但逻辑应该是相同的。