如何在R中读取文本文件并创建数据框架

5

1
首先,dat = readLines("addr.txt")将以每行文本字符串的形式将数据读入R中。然后,您需要使用字符串操作函数将其解析为数据框。正如@3209Cigs所说,最好在StackOverflow上提出这个问题。 - eipi10
1
实际上,我认为在readLines之后,你只需要执行dat = as.data.frame(do.call(rbind, strsplit(dat, split=" {2,10}"))),就可以将数据转换成数据框。然后只需将列名更改为所需的值即可。 - eipi10
4个回答

9

在我的评论基础上,这里提供另一种方法。如果您的完整数据集有更广泛的模式需要考虑,您可能需要调整一些代码。

library(stringr) # For str_trim 

# Read string data and split into data frame
dat = readLines("addr.txt")
dat = as.data.frame(do.call(rbind, strsplit(dat, split=" {2,10}")), stringsAsFactors=FALSE)
names(dat) = c("LastName", "FirstName", "address", "city", "state", "zip")

# Separate address into number and street (if streetno isn't always numeric,
# or if you don't want it to be numeric, then just remove the as.numeric wrapper).
dat$streetno = as.numeric(gsub("([0-9]{1,4}).*","\\1",  dat$address))
dat$streetname = gsub("[0-9]{1,4} (.*)","\\1",  dat$address)

# Clean up zip
dat$zip = gsub("O","0", dat$zip)
dat$zip = str_trim(dat$zip)

dat = dat[,c(1:2,7:8,4:6)]

dat
      LastName  FirstName streetno           streetname       city state        zip
1        Bania  Thomas M.      725    Commonwealth Ave.     Boston    MA      02215
2      Barnaby      David      373        W. Geneva St.   Wms. Bay    WI      53191
3       Bausch       Judy      373        W. Geneva St.   Wms. Bay    WI      53191
...
41      Wright       Greg      791  Holmdel-Keyport Rd.    Holmdel    NY 07733-1988
42     Zingale    Michael     5640        S. Ellis Ave.    Chicago    IL      60637

3

我发现最简单的方法是把文件修正为csv格式,加上逗号之后再读取它。

## get the page as text
txt <- RCurl::getURL(
    "https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt"
)
## fix the EOL (end-of-line) markers
g1 <- gsub(" \n", "\n", txt, fixed = TRUE)
## read it
df <- read.csv(
    ## add most comma-separators, then the last for the house number
    text = gsub("(\\d+) (\\D+)", "\\1,\\2", gsub("\\s{2,}", ",", g1)), 
    header = FALSE,
    ## set the column names
    col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip")
)
## result
head(df)
#     LastName  FirstName streetno        streetname     city state   zip
# 1      Bania  Thomas M.      725 Commonwealth Ave.   Boston    MA O2215
# 2    Barnaby      David      373     W. Geneva St. Wms. Bay    WI 53191
# 3     Bausch       Judy      373     W. Geneva St. Wms. Bay    WI 53191
# 4    Bolatto    Alberto      725 Commonwealth Ave.   Boston    MA O2215
# 5  Carlstrom       John      933       E. 56th St.  Chicago    IL 60637
# 6 Chamberlin Richard A.      111        Nowelo St.     Hilo    HI 96720

3

试试这个。

x<-scan("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt" , 
  what = list(LastName="", FirstName="", streetno="", streetname="", city="", state="",zip=""))

data<-as.data.frame(x)

3
数据帧的内容与预期不一致,请尝试使用 head(data) 查看。 - vaettchen

2

您的问题不是如何使用R来读取这个数据,而是您的数据在输入变量长度字段之间没有足够的结构化常规分隔符。此外,邮政编码字段包含一些字母“O”,应该为“0”。

以下是一种使用正则表达式替换添加分隔符的方法,然后使用read.csv()解析分隔文本。请注意,根据您完整的文本中的异常情况,您可能需要调整正则表达式。我已经逐步完成了它们,以明确正在做什么,以便您可以根据输入文本中的异常情况进行调整。(例如,一些城市名称像“Wms. Bay”是两个单词。)

addr.txt <- readLines("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt")
addr.txt <- gsub("\\s+O(\\d{4})", " 0\\1", addr.txt)       # replace O with 0 in zip
addr.txt <- gsub("(\\s+)([A-Z]{2})", ", \\2", addr.txt)    # state
addr.txt <- gsub("\\s+(\\d{5}(\\-\\d{4}){0,1})\\s*", ", \\1", addr.txt) # zip
addr.txt <- gsub("\\s+(\\d{1,4})\\s", ", \\1, ", addr.txt) # streetno
addr.txt <- gsub("(^\\w*)(\\s+)", "\\1, ", addr.txt)       # LastName (FirstName)
addr.txt <- gsub("\\s{2,}", ", ", addr.txt)                # city, by elimination

addr <- read.csv(textConnection(addr.txt), header = FALSE,
                 col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip"),
                 stringsAsFactors = FALSE)
head(addr)
##     LastName   FirstName streetno         streetname      city state    zip
## 1      Bania   Thomas M.      725  Commonwealth Ave.    Boston    MA  02215
## 2    Barnaby       David      373      W. Geneva St.  Wms. Bay    WI  53191
## 3     Bausch        Judy      373      W. Geneva St.  Wms. Bay    WI  53191
## 4    Bolatto     Alberto      725  Commonwealth Ave.    Boston    MA  02215
## 5  Carlstrom        John      933        E. 56th St.   Chicago    IL  60637
## 6 Chamberlin  Richard A.      111         Nowelo St.      Hilo    HI  96720

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接