I was wondering if anyone has tips and tricks for transforming data such as the following:
library(tidyverse)
example.list = list(" 1 North Carolina State University at Raleigh 15 9 12 13 22 15 32 19 14 20 12 17 19 20 19 25 283",
" 2 Iowa State University 9 8 5 11 14 4 11 13 14 9 15 28 14 9 18 27 209",
" 3 University of Wisconsin-Madison 5 6 14 9 20 13 15 12 13 9 13 10 13 24 15 17 208",
" 4 Stanford University* 10 12 14 6 9 10 5 9 13 7 13 10 4 9 23 6 160",
" 5 Texas A & M University-College Station 6 12 18 10 7 4 5 11 16 18 10 7 15 4 8 8 159",
" 9 University of Michigan-Ann Arbor 8 5 3 3 8 9 12 11 7 11 13 9 8 11 13 9 140",
"10 University of California-Los Angeles 2 2 2 6 9 7 9 8 7 11 11 8 6 12 13 10 123",
"19 Rice University 3 3 5 11 4 7 7 11 2 6 4 6 3 8 7 7 94")
into something like the output produced here:
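example.list %>%
  # Drop the first two characters of each line (the index)
  substring(3) %>%
  # Replace punctuation with spaces, then collapse repeated whitespace
  str_replace_all("[^[:alnum:]]", " ") %>%
  str_squish() %>%
  # Split where a letter is followed by a digit, i.e. between the name and the first count
  strsplit(split = "(?<=[a-zA-Z])\\s*(?=[0-9])", perl = TRUE) %>%
  unlist() %>%
  matrix(ncol = 2, byrow = TRUE) %>%
  data.frame() %>%
  # Spread the 17 counts into columns X2..X18
  separate("X2", into = paste0("X", 2:18), sep = " ")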
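The general pattern to extract is: all characters up to the first digit go into their own column, and everything after that is split on whitespace into the remaining columns.
It would be interesting to know whether most of the work can be done with a single regex, or without any regex at all.
I'm just looking to improve my string processing, as I don't have much experience! The use case here is similar to pulling tabular data from a pdf / html into a data frame.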
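As a rough sketch of the single-regex idea, the pattern described above can be written with three capture groups and applied with base R's `regexec()` and `regmatches()`. The names `lines`, `pattern` and `result` are illustrative only, and the sketch assumes that university names contain no digits and that every line matches. The numeric tail is handed to `read.table()` rather than split with a second regex.

```r
# Two sample lines copied from example.list in the question
lines <- c(
  " 2 Iowa State University 9 8 5 11 14 4 11 13 14 9 15 28 14 9 18 27 209",
  " 5 Texas A & M University-College Station 6 12 18 10 7 4 5 11 16 18 10 7 15 4 8 8 159"
)

# One pattern, three capture groups:
#   (\\d+)   the leading index
#   (\\D+?)  the name: the shortest run of non-digit characters ...
#   (\\d.*)  ... followed by everything from the first digit onwards
pattern <- "^\\s*(\\d+)\\s+(\\D+?)\\s+(\\d.*)$"

# Each list element holds the full match followed by the three groups
parts <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
m <- do.call(rbind, parts)

# read.table() splits the numeric tail on whitespace and converts to integer
counts <- read.table(text = m[, 4])
result <- data.frame(name = m[, 3], counts, stringsAsFactors = FALSE)
```

A line that does not match yields `character(0)` from `regmatches()`, so real data would need a check before the `rbind` step.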
Edit:
Thanks for all the suggestions and different perspectives!
I realised I had actually missed some test cases worth mentioning:
example2.list = list(" 2 Iowa State University 9 8 5 11 14 4 11 13 14 9 15 28 14 9 18 27 209",
" 3 University of Wisconsin-Madison 5 6 14 9 20 13 15 12 13 9 13 10 13 24 15 17 208",
" 4 Stanford University* 10 12 14 6 9 10 5 9 13 7 13 10 4 9 23 6 160",
" 5 Texas A & M University-College Station 6 12 18 10 7 4 5 11 16 18 10 7 15 4 8 8 159",
" 9 University of Michigan-Ann Arbor 8 5 3 3 8 9 12 11 7 11 13 9 8 11 13 9 140",
"10 University of California-Los Angeles 2 2 2 6 9 7 9 8 7 11 11 8 6 12 13 10 123",
"19 Rice University 3 3 5 11 4 7 7 11 2 6 4 6 3 8 7 7 94",
"52 Bowling Green State University 0 0 0 0 0 1 5 2 2 2 4 7 3 4 4 3 37",
"55 University of New Mexico 4 2 3 1 3 0 5 3 2 1 1 2 3 2 3 0 35")
These rows are not presented as neatly aligned.
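Since the rows are not column-aligned, one regex-free option is to split each row on whitespace and count fields from the right. This is only a sketch: `split_row()` and `n_counts` are hypothetical names, and it assumes every row ends with exactly 17 numeric fields, as the `paste0("X", 2:18)` call in the first attempt implies.

```r
# The two extra rows from example2.list in the question
lines <- c(
  "52 Bowling Green State University 0 0 0 0 0 1 5 2 2 2 4 7 3 4 4 3 37",
  "55 University of New Mexico 4 2 3 1 3 0 5 3 2 1 1 2 3 2 3 0 35"
)

n_counts <- 17  # assumption: number of trailing numeric fields per row

split_row <- function(line, n = n_counts) {
  # scan() tokenises on runs of whitespace, so no regex is involved;
  # quote = "" keeps apostrophes in names from being treated as quotes
  tokens <- scan(text = line, what = character(), quote = "", quiet = TRUE)
  list(
    # Drop the index (first token) and the last n tokens; the rest is the name
    name   = paste(head(tokens[-1], -n), collapse = " "),
    counts = as.integer(tail(tokens, n))
  )
}

rows <- lapply(lines, split_row)
result <- data.frame(
  name = vapply(rows, function(r) r$name, character(1)),
  do.call(rbind, lapply(rows, function(r) r$counts)),
  stringsAsFactors = FALSE
)
```

Counting from the right means digits inside a name would not break the split, at the cost of hard-coding the number of numeric columns.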
The full dataset, lightly cleaned:
library(pdftools)
library(tidyverse)
data.loc = "https://ww2.amstat.org/misc/StatsPhD2003-MostRecent.pdf"
data.full =
  pdf_text(data.loc) %>%
  read_lines() %>%
  # Drop the last two lines
  head(-2) %>%
  # Drop the first three lines
  tail(-3) %>%
  # Remove empty lines
  lapply(function(ele) if (ele == "") NULL else ele) %>%
  compact()
Here is my second attempt:
library(tidyverse)
library(magrittr)
# Ignore the row of column names
data.full[-1] %>%
  # Remove excess whitespace
  str_squish() %>%
  # Remove the index
  str_remove("^\\s*\\d*\\s*") %>%
  # Split on whitespace occurring before a digit
  str_split("\\s+(?=\\d)") %>%
  # Turn the list into a matrix
  unlist() %>%
  matrix(ncol = 18, byrow = TRUE) %>%
  # Handle variable names
  set_colnames(value =
    data.full[1] %>%
      str_squish() %>%
      str_split("\\s+(?=\\d)") %>%
      unlist()) %>%
  as_tibble() %>%
  # Convert variables to numeric
  type_convert()
See read.fwf, read_fwf, or related functions. - A5C1D2H2I1M1N2O1R2T1
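To illustrate the `read.fwf` suggestion, here is a minimal sketch on made-up fixed-width lines. The lines as pasted in the question are not aligned, so the layout below (a 2-character index, one skipped character, a 30-character name, then three 4-character counts) and the column names are assumptions for illustration only, not the layout of the actual PDF.

```r
# Hypothetical aligned rows built with sprintf(): a 2-char index, one space,
# a name padded to 30 chars, then three right-aligned 4-char counts
fmt <- "%2d %-30s%4d%4d%4d"
lines <- c(
  sprintf(fmt, 2L, "Iowa State University", 9L, 8L, 209L),
  sprintf(fmt, 19L, "Rice University", 3L, 3L, 94L)
)

fixed <- read.fwf(
  textConnection(lines),
  widths = c(2, -1, 30, 4, 4, 4),  # a negative width skips that many characters
  col.names = c("rank", "name", "y1", "y2", "total"),
  strip.white = TRUE,              # trim the padding around each field
  stringsAsFactors = FALSE
)
```

This only works when every field starts at the same character position on every line. `readr::read_fwf()` with `fwf_widths()` is the tidyverse counterpart.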