Converting a list of strings into a data frame in a tidy way

3

I'm wondering whether anyone has tips and tricks for converting data like the following:

library(tidyverse)
example.list = list(" 1 North Carolina State University at Raleigh         15       9    12    13     22    15    32    19   14    20    12     17    19    20     19     25     283", 
    " 2 Iowa State University                                9      8     5    11     14     4    11    13   14      9   15     28    14      9    18     27     209", 
    " 3 University of Wisconsin-Madison                      5      6    14     9     20    13    15    12   13      9   13     10    13    24     15     17     208", 
    " 4 Stanford University*                               10      12    14     6      9    10     5     9   13      7   13     10     4      9    23       6    160", 
    " 5 Texas A & M University-College Station               6     12    18    10      7     4     5    11   16    18    10      7    15      4     8       8    159", 
    " 9 University of Michigan-Ann Arbor                     8      5     3     3      8     9    12    11    7    11    13      9     8    11     13       9    140", 
    "10 University of California-Los Angeles                 2      2     2     6      9     7     9     8    7    11    11      8     6    12     13     10     123", 
    "19 Rice University                                      3      3     5    11      4     7     7    11    2      6     4     6     3      8     7       7     94")

into something like the output produced here:

example.list %>%
    substring(3) %>% 
    str_replace_all("[^[:alnum:]]", " ") %>%
    str_squish() %>% 
    strsplit(split = "(?<=[a-zA-Z])\\s*(?=[0-9])", perl = TRUE) %>% 
    unlist() %>% 
    matrix(ncol = 2, byrow = TRUE) %>% 
    data.frame() %>% 
    separate("X2",into = paste0("X",2:18),sep = " ") 

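The general pattern is to extract all the characters before the first number (after the leading index) into their own column, and to split everything after that on whitespace into the remaining columns.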
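As a rough sketch of that pattern in base R: the two sample lines and the column names `index` and `name` are assumptions made up for illustration, and the regex assumes the institution name contains no digits.

```r
# Two hypothetical lines in the same shape as example.list
lines <- c(" 2 Iowa State University        9      8     5   209",
           "52 Bowling Green State University 0 0 1 37")

# ^\s*(\d+)    leading index
# \s+(\D+?)    name: shortest run of non-digit characters ...
# \s+(\d.*)$   ... that is followed by whitespace and the first number
m <- regmatches(lines,
                regexec("^\\s*(\\d+)\\s+(\\D+?)\\s+(\\d.*)$", lines, perl = TRUE))
parts <- do.call(rbind, m)[, -1, drop = FALSE]  # drop the full-match column

# Split the numeric tail on whitespace and convert it to integers
nums <- do.call(rbind, strsplit(parts[, 3], "\\s+"))
storage.mode(nums) <- "integer"

out <- data.frame(index = as.integer(parts[, 1]),
                  name  = parts[, 2],
                  nums,
                  stringsAsFactors = FALSE)
```

A line that does not match the regex yields an empty match, so real input would need a check before the `rbind` step.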

It would be interesting to see whether most of the work can be done in a single regex, or without regex at all.

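I'm just trying to improve my string handling, since I don't have much experience with it! The use case here is similar to extracting tabular data from PDF/HTML into a data frame.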

Edit:

Thanks for all the suggestions and different perspectives!

I realized I had actually left out some test cases worth mentioning:

example2.list = list(" 2 Iowa State University                                9      8     5    11     14     4    11    13   14      9   15     28    14      9    18     27     209", 
    " 3 University of Wisconsin-Madison                      5      6    14     9     20    13    15    12   13      9   13     10    13    24     15     17     208", 
    " 4 Stanford University*                               10      12    14     6      9    10     5     9   13      7   13     10     4      9    23       6    160", 
    " 5 Texas A & M University-College Station               6     12    18    10      7     4     5    11   16    18    10      7    15      4     8       8    159", 
    " 9 University of Michigan-Ann Arbor                     8      5     3     3      8     9    12    11    7    11    13      9     8    11     13       9    140", 
    "10 University of California-Los Angeles                 2      2     2     6      9     7     9     8    7    11    11      8     6    12     13     10     123", 
    "19 Rice University                                      3      3     5    11      4     7     7    11    2      6     4     6     3      8     7       7     94", 
    "52 Bowling Green State University             0 0 0 0 0 1 5 2 2 2  4 7 3  4  4  3 37", 
    "55 University of New Mexico                   4 2 3 1 3 0 5 3 2 1  1 2 3  2  3  0 35")

The last two rows are not aligned as neatly as the others.

The full dataset, slightly cleaned:

library(pdftools)
library(tidyverse)
data.loc = "https://ww2.amstat.org/misc/StatsPhD2003-MostRecent.pdf"
data.full =
    pdf_text(data.loc) %>%
    read_lines() %>%
    head(-2) %>%
    tail(-3) %>%
    lapply(function(ele) if(ele == "") NULL else ele) %>% 
    compact()

Here is my second attempt:

library(tidyverse)
library(magrittr)
# Drop the header line (the column names are handled below)
data.full[-1] %>%
  # Remove excess whitespace
  str_squish() %>%
  # Remove the leading index
  str_remove("^\\s*\\d*\\s*") %>%
  # Split on whitespace that precedes a digit
  str_split("\\s+(?=\\d)") %>%
  # Turn the list into a matrix
  unlist() %>%
  matrix(ncol = 18, byrow = TRUE) %>%
  # Take the variable names from the header line
  set_colnames(value =
                 data.full[1] %>%
                 str_squish() %>%
                 str_split("\\s+(?=\\d)") %>%
                 unlist) %>%
  as_tibble() %>%
  # Convert variables to numeric where possible
  type_convert()

It looks like you are dealing with fixed-width data. Look into read.fwf, read_fwf, or related functions. - A5C1D2H2I1M1N2O1R2T1
5 Answers

4

Here is one possible approach:

library(magrittr)
library(data.table)


gsub("^...", "", example.list) %>% 
  tstrsplit(" {2,}", type.convert = TRUE, names = TRUE) %>% 
  as.data.frame()

#                                           V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
# 1 North Carolina State University at Raleigh 15  9 12 13 22 15 32 19  14  20  12  17  19  20  19  25 283
# 2                      Iowa State University  9  8  5 11 14  4 11 13  14   9  15  28  14   9  18  27 209
# 3            University of Wisconsin-Madison  5  6 14  9 20 13 15 12  13   9  13  10  13  24  15  17 208
# 4                       Stanford University* 10 12 14  6  9 10  5  9  13   7  13  10   4   9  23   6 160
# 5     Texas A & M University-College Station  6 12 18 10  7  4  5 11  16  18  10   7  15   4   8   8 159
# 6           University of Michigan-Ann Arbor  8  5  3  3  8  9 12 11   7  11  13   9   8  11  13   9 140
# 7       University of California-Los Angeles  2  2  2  6  9  7  9  8   7  11  11   8   6  12  13  10 123
# 8                            Rice University  3  3  5 11  4  7  7 11   2   6   4   6   3   8   7   7  94

3
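One approach is to split either on runs of two or more spaces, or on the start of the string followed only by spaces and digits.
library(magrittr)
library(stringr)
library(dplyr)  # for select()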
example.list %>% 
  str_split(.,"^( |[0-9])+| {2,}") %>%
  do.call(rbind,.) %>%
  as.data.frame %>%
  select(-V1)
#                                          V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19
#1 North Carolina State University at Raleigh 15  9 12 13 22 15 32  19  14  20  12  17  19  20  19  25 283
#2                      Iowa State University  9  8  5 11 14  4 11  13  14   9  15  28  14   9  18  27 209
#3            University of Wisconsin-Madison  5  6 14  9 20 13 15  12  13   9  13  10  13  24  15  17 208
#4                       Stanford University* 10 12 14  6  9 10  5   9  13   7  13  10   4   9  23   6 160
#5     Texas A & M University-College Station  6 12 18 10  7  4  5  11  16  18  10   7  15   4   8   8 159
#6           University of Michigan-Ann Arbor  8  5  3  3  8  9 12  11   7  11  13   9   8  11  13   9 140
#7       University of California-Los Angeles  2  2  2  6  9  7  9   8   7  11  11   8   6  12  13  10 123
#8                            Rice University  3  3  5 11  4  7  7  11   2   6   4   6   3   8   7   7  94

3

Using base R:

do.call(rbind, lapply(strsplit(substring(example.list, 3), "\\s{2,}"), 
                      function(x) as.data.frame(t(x))))


#                                           V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
#1  North Carolina State University at Raleigh 15  9 12 13 22 15 32 19  14  20  12  17  19  20  19  25 283
#2                       Iowa State University  9  8  5 11 14  4 11 13  14   9  15  28  14   9  18  27 209
#3             University of Wisconsin-Madison  5  6 14  9 20 13 15 12  13   9  13  10  13  24  15  17 208
#4                        Stanford University* 10 12 14  6  9 10  5  9  13   7  13  10   4   9  23   6 160
#5      Texas A & M University-College Station  6 12 18 10  7  4  5 11  16  18  10   7  15   4   8   8 159
#6            University of Michigan-Ann Arbor  8  5  3  3  8  9 12 11   7  11  13   9   8  11  13   9 140
#7        University of California-Los Angeles  2  2  2  6  9  7  9  8   7  11  11   8   6  12  13  10 123
#8                             Rice University  3  3  5 11  4  7  7 11   2   6   4   6   3   8   7   7  94

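We split each string wherever there are two or more whitespace characters, convert each resulting vector into a one-row data frame, and combine the rows with rbind.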
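Note that the result of the split is all character data. A minimal variation on the same idea, using two hypothetical sample lines and assuming every line splits into the same number of fields, binds the pieces into a matrix first and then lets `type.convert()` pick the column types:

```r
# Two hypothetical lines in the same shape as example.list
lines <- c(" 2 Iowa State University        9      8     5   209",
           "19 Rice University              3      3     5    94")

# Drop the two-character index, then split on runs of 2+ whitespace characters
mat <- do.call(rbind, strsplit(substring(lines, 3), "\\s{2,}"))

df <- as.data.frame(mat, stringsAsFactors = FALSE)
df[[1]] <- trimws(df[[1]])                       # remove the space left after the index
df[] <- lapply(df, type.convert, as.is = TRUE)   # digit-only columns become integer
```

Binding character vectors rather than one-row data frames avoids building many small data frames, but it silently recycles values if the rows have different lengths, so it is only safe when the field count is known to be constant.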

3

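Update

Because the character "width" of the lines differs from page to page in the PDF you shared, the file needs to be processed page by page. Here is one way to do that: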

library(pdftools)
library(tidyverse)
library(cgwtools)
data.loc = "https://ww2.amstat.org/misc/StatsPhD2003-MostRecent.pdf"
data.full =
  pdf_text(data.loc) %>%
  read_lines() %>%
  head(-2) %>%
  tail(-4)

data.full <- data.full[nzchar(data.full)]
# Split the file according to the number of characters per line
l <- split(data.full, nchar(data.full))

out = 
  bind_rows(lapply(l, function(x) {
    temp <- do.call(rbind, strsplit(x, ""))
    y <- which(colSums(temp == " ") == nrow(temp))
    temp[, y[cumsum(seqle(y)$lengths)]] <- ","
    read_csv(apply(temp, 1, paste, collapse = ""), col_names = FALSE)
  })) %>%
  arrange(X1)

out
## # A tibble: 78 x 19
##       X1 X2         X3    X4    X5    X6    X7    X8    X9   X10   X11   X12   X13   X14   X15
##    <dbl> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     1 North …    15     9    12    13    22    15    32    19    14    20    12    17    19
##  2     2 Iowa S…     9     8     5    11    14     4    11    13    14     9    15    28    14
##  3     3 Univer…     5     6    14     9    20    13    15    12    13     9    13    10    13
##  4     4 Stanfo…    10    12    14     6     9    10     5     9    13     7    13    10     4
##  5     5 Texas …     6    12    18    10     7     4     5    11    16    18    10     7    15
##  6     6 Pennsy…     6     9     8     1     8     9    10    10    14     9    12     9    17
##  7     7 Ohio S…     3     5     3    12    12     7    11    11     6    11     8    13    14
##  8     8 Purdue…     6     5    12     8     8     7    10    12     9     9     5    10    14
##  9     9 Univer…     8     5     3     3     8     9    12    11     7    11    13     9     8
## 10    10 Univer…     2     2     2     6     9     7     9     8     7    11    11     8     6
## # … with 68 more rows, and 4 more variables: X16 <dbl>, X17 <dbl>, X18 <dbl>, X19 <dbl>
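The core step in the code above is finding the character positions that are blank in every line and turning the last position of each blank run into a delimiter. A small base R sketch of just that step follows; the two lines are hypothetical and padded to equal width, and `diff()` stands in for `cgwtools::seqle()`.

```r
# Hypothetical lines, padded to the same width
lines <- c(" 2 Iowa State University   9  8 209",
           "10 Purdue University       6  5 123")
stopifnot(length(unique(nchar(lines))) == 1)

chars <- do.call(rbind, strsplit(lines, ""))           # one row per line, one column per character
blank <- which(colSums(chars == " ") == nrow(chars))   # positions that are blank in every line

# Keep only the last position of each run of consecutive blank columns
run_end <- blank[c(diff(blank) != 1, TRUE)]
chars[, run_end] <- ","

csv <- apply(chars, 1, paste, collapse = "")
df <- read.csv(text = csv, header = FALSE, strip.white = TRUE,
               stringsAsFactors = FALSE)
```

This only works when no blank column happens to line up inside the names across all rows, which is more likely to hold for a full page of lines than for a two-line sample.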

Here is another approach, starting from data.full:

library(splitstackshape)
data.full <- data.full[nzchar(data.full)]

out = tibble(data.full) %>%
  extract(data.full, into = c("ind", "uni", "nums"), regex = "([0-9]+)([^0-9]+)(.*)") %>%
  mutate(uni = trimws(uni))%>%
  cSplit("nums", "\\s+", fixed = FALSE)

Original answer:

Since you have already loaded the tidyverse, you can simply use:

read_fwf(unlist(example.list), fwf_empty(unlist(example.list)))
## # A tibble: 8 x 19
##      X1 X2          X3    X4    X5    X6    X7    X8    X9   X10   X11   X12   X13   X14
##   <dbl> <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1 North C…    15     9    12    13    22    15    32    19    14    20    12    17
## 2     2 Iowa St…     9     8     5    11    14     4    11    13    14     9    15    28
## 3     3 Univers…     5     6    14     9    20    13    15    12    13     9    13    10
## 4     4 Stanfor…    10    12    14     6     9    10     5     9    13     7    13    10
## 5     5 Texas A…     6    12    18    10     7     4     5    11    16    18    10     7
## 6     9 Univers…     8     5     3     3     8     9    12    11     7    11    13     9
## 7    10 Univers…     2     2     2     6     9     7     9     8     7    11    11     8
## 8    19 Rice Un…     3     3     5    11     4     7     7    11     2     6     4     6
## # … with 5 more variables: X15 <dbl>, X16 <dbl>, X17 <dbl>, X18 <dbl>, X19 <dbl>
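For comparison, base R's `read.fwf()` does the same job when the field widths are supplied by hand. This is only a sketch: the two lines and the `widths` vector are hypothetical and chosen to match each other, whereas `fwf_empty()` above guesses the positions from the data.

```r
# Hypothetical fixed-width lines: index (2), name (25), then three numeric fields
lines  <- c(" 2 Iowa State University   9  8 209",
            "10 Purdue University       6  5 123")
widths <- c(2, 25, 1, 3, 4)
stopifnot(all(nchar(lines) == sum(widths)))

con <- textConnection(lines)
df <- read.fwf(con, widths = widths, strip.white = TRUE, as.is = TRUE)
close(con)
```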

This is probably the most concise solution for the original example I posted. - BPipher

1
If you also want the column names (which I think makes sense), start from data.full and use it like this:
library(tidyverse)

data.full %>% 
    str_trim %>% 
    str_replace_all("\\s+(?=\\d)", ";") %>% 
    paste(collapse = "\n") %>% 
    read_csv2()

Output:

# A tibble: 78 x 18
   `Statistics PhD… `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012`
   <chr>             <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 1 North Carolin…     15      9     12     13     22     15     32     19     14     20
 2 2 Iowa State Un…      9      8      5     11     14      4     11     13     14      9
 3 3 University of…      5      6     14      9     20     13     15     12     13      9
 4 4 Stanford Univ…     10     12     14      6      9     10      5      9     13      7
 5 5 Texas A & M U…      6     12     18     10      7      4      5     11     16     18
 6 6 Pennsylvania …      6      9      8      1      8      9     10     10     14      9
 7 7 Ohio State Un…      3      5      3     12     12      7     11     11      6     11
 8 8 Purdue Univer…      6      5     12      8      8      7     10     12      9      9
 9 9 University of…      8      5      3      3      8      9     12     11      7     11
10 10 University o…      2      2      2      6      9      7      9      8      7     11
# … with 68 more rows, and 7 more variables: `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
#   `2016` <dbl>, `2017` <dbl>, `2018` <dbl>, `20032018` <dbl>

1
Exploiting the whitespace that occurs before digits is what I ended up doing too, but turning it into a delimiter is even smarter! - BPipher
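The same delimiter trick can be sketched in base R. The header and data lines below are hypothetical stand-ins for data.full, and the approach assumes that neither the names nor the header contain a semicolon or a digit preceded by whitespace.

```r
# Hypothetical header plus two data lines
lines <- c("Program                   2003  2004",
           " 2 Iowa State University     9     8",
           "19 Rice University           3     3")

# Replace each run of whitespace that directly precedes a digit with ";"
delimited <- gsub("\\s+(?=\\d)", ";", trimws(lines), perl = TRUE)

df <- read.table(text = delimited, sep = ";", header = TRUE,
                 stringsAsFactors = FALSE, check.names = FALSE)
```

As in the readr version, the leading index stays inside the first column, so it still has to be separated from the name afterwards.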
