Importing CSV data with decimal commas, thousands separators, and trailing minus signs

4

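I am using R 2.13.1 on Mac OS X. I am trying to import a data file that uses a period as the thousands separator and a comma as the decimal point, and in which negative values carry a trailing minus sign.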

Basically, I am trying to turn the first line below into the table that follows it:

"A|324,80|1.324,80|35,80-"

  V1    V2     V3    V4
1  A 324.80 1324.8 -35.80

Now, both of the following work interactively:

gsub("\\.","","1.324,80")
[1] "1324,80"

gsub("(.+)-$","-\\1", "35,80-")
[1] "-35,80"

And so does combining them:

gsub("\\.", "", gsub("(.+)-$","-\\1","1.324,80-"))
[1] "-1324,80"

However, I am not able to remove the thousands separator from within read.table:
setClass("num.with.commas")

setAs("character", "num.with.commas", function(from) as.numeric(gsub("\\.", "", sub("(.+)-$","-\\1",from))) )
mydata <- "A|324,80|1.324,80|35,80-"

mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="", sep="|", dec=",", skip=0, fill=FALSE,strip.white=TRUE, colClasses=c("character","num.with.commas", "num.with.commas", "num.with.commas"))

Warning messages:
1: In asMethod(object) : NAs introduced by coercion
2: In asMethod(object) : NAs introduced by coercion
3: In asMethod(object) : NAs introduced by coercion

mytable
  V1 V2 V3 V4
1  A NA NA NA

Note that the result changes if I replace "\\." with "," in the function:
setAs("character", "num.with.commas", function(from) as.numeric(gsub(",", "", sub("(.+)-$","-\\1",from))) )

mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="", sep="|", dec=",", skip=0, fill=FALSE,strip.white=TRUE, colClasses=c("character","num.with.commas", "num.with.commas", "num.with.commas"))

mytable
  V1    V2     V3    V4
1  A 32480 1.3248 -3580

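I think the problem is that read.table, given dec=",", converts the incoming "," to "." before calling as(from, "num.with.commas"), so the input string could be, for example, "1.324.80".
I want as("1.123,80-","num.with.commas") to return -1123.80, and as("1.100.123,80", "num.with.commas") to return 1100123.80.
How can I make my num.with.commas remove every period in the input string except the last one (the decimal point)?
Update: First, I added a negative lookahead to the regex and got as() working in the console: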
setAs("character", "num.with.commas", function(from) as.numeric(gsub("(?!\\.\\d\\d$)\\.", "", gsub("(.+)-$","-\\1",from), perl=TRUE)) )
as("1.210.123.80-","num.with.commas")
[1] -1210124
as("10.123.80-","num.with.commas")
[1] -10123.8
as("10.123.80","num.with.commas")
[1] 10123.8
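
As a quick check of what the lookahead does on its own, here is a small sketch. The name strip_thousands is a hypothetical helper introduced only for illustration; it is not part of the original post. It applies the same pattern as above and also shows that the -1210124 printed earlier is just R's default 7-significant-digit display of -1210123.8, not a wrong value.

```r
# Hypothetical helper (illustration only): drop every "." unless it is
# followed by exactly two digits and then the end of the string.
strip_thousands <- function(x) gsub("(?!\\.\\d\\d$)\\.", "", x, perl = TRUE)

cleaned <- strip_thousands(c("1.210.123.80", "10.123.80", "324.80"))
print(cleaned)
# "1210123.80" "10123.80" "324.80"

# The number is stored in full; print() merely rounds to 7 significant digits.
x <- as.numeric(paste0("-", cleaned[1]))
print(sprintf("%.2f", x))
# "-1210123.80"
```

Note that the pattern assumes exactly two decimal digits: a string such as "1.5" would lose its period and become "15".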

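However, read.table still had the same problem. After adding some print() calls to my function, I found that num.with.commas actually receives the comma, not the period. So my current solution is to also replace "," with "." inside num.with.commas: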
setAs("character", "num.with.commas", function(from) as.numeric(gsub(",","\\.",gsub("(?!\\.\\d\\d$)\\.", "", gsub("(.+)-$","-\\1",from), perl=TRUE))) )
mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="", sep="|", dec=",", skip=0, fill=FALSE,strip.white=TRUE, colClasses=c("character","num.with.commas", "num.with.commas", "num.with.commas"))
mytable
  V1    V2      V3    V4
1  A 324.8 1101325 -35.8
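
The print() debugging described above can be reproduced with a small probe. This is only a sketch: the class name raw.probe and the variable seen are hypothetical names made up for the illustration. It relies on read.table handing columns with a non-built-in colClasses entry to as() as character strings, which is the behaviour observed in this question.

```r
library(methods)

# Hypothetical probe class (illustration only): its coercion method records
# the strings it is handed and returns them unchanged.
setClass("raw.probe")
seen <- character(0)
setAs("character", "raw.probe", function(from) {
  seen <<- c(seen, from)
  from
})

mydata <- "A|324,80|1.324,80|35,80-"
con <- textConnection(mydata)
probe <- read.table(con, header = FALSE, sep = "|", dec = ",", quote = "",
                    comment.char = "", strip.white = TRUE,
                    colClasses = c("character", rep("raw.probe", 3)))
close(con)

# The strings arrive untouched: commas and periods are both still present,
# even though dec = "," was supplied.
print(seen)
```

If the strings do arrive untouched, the coercion method itself has to do all of the cleanup, including turning the decimal comma into a period.
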
2 Answers

4

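Before calling as.numeric(), you should first remove all the periods and then change the comma to a decimal point. You can control how the decimal point is printed later with options(OutDec=","). I don't think R ever uses a comma as the decimal separator internally, even in locales where the comma is the conventional decimal separator.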

> tst <- c("A","324,80","1.324,80","35,80-")
> 
> as.numeric( sub("\\,", ".", sub("(.+)-$","-\\1", gsub("\\.", "", tst)) ) )
[1]     NA  324.8 1324.8  -35.8
Warning message:
NAs introduced by coercion 
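
The same three substitutions can be plugged back into the colClasses approach from the question. The following is a minimal sketch under that assumption. The helper name parse_num_with_commas is hypothetical, and dec="," is left out because the coercion method does the conversion itself. The last few lines illustrate the OutDec remark: it changes only how a number is displayed, not how it is stored.

```r
library(methods)

# Hypothetical helper (illustration only) wrapping the three substitutions.
parse_num_with_commas <- function(x) {
  x <- gsub(".", "", x, fixed = TRUE)    # drop thousands separators
  x <- sub("^(.+)-$", "-\\1", x)         # trailing minus -> leading minus
  x <- sub(",", ".", x, fixed = TRUE)    # decimal comma -> decimal point
  as.numeric(x)
}

setClass("num.with.commas")
setAs("character", "num.with.commas",
      function(from) parse_num_with_commas(from))

mydata <- "A|324,80|1.324,80|35,80-"
con <- textConnection(mydata)
mytable <- read.table(con, header = FALSE, sep = "|", quote = "",
                      comment.char = "", strip.white = TRUE,
                      colClasses = c("character", rep("num.with.commas", 3)))
close(con)
print(mytable)

# OutDec affects display only; the stored value is still 1324.8.
old <- options(OutDec = ",")
shown <- format(mytable$V3)
options(old)
print(shown)
```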

1
Here is a solution using regular expressions and substitution.
mydata <- "A|324,80|1.324,80|35,80-"
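# Split the record on the pipe delimiter
mydata2 <- strsplit(mydata,"|",fixed=TRUE)[[1]]
# Remove the thousands separators (periods), then turn decimal commas into periods
mydata3 <- gsub(",",".",gsub(".","",mydata2,fixed=TRUE),fixed=TRUE)
# Move trailing minus signs to the front of the string
mydata4 <- gsub("^(.+)-$","-\\1",mydata3)
# Convert to numeric; a list keeps the label as text and the values as numbers
# (c() with a string would coerce the numbers back to character)
mydata.cleaned <- list(label=mydata4[1],values=as.numeric(mydata4[2:4]))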
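
For a file with more than one record, one possible variation is to read every column as character and convert the numeric columns afterwards. This is a sketch, not part of the original answer: the second record is made-up sample data and to_number is a hypothetical helper name.

```r
# The first record is from the question; the second is invented for illustration.
records <- c("A|324,80|1.324,80|35,80-",
             "B|1.100.123,80|12,00|0,50-")

# Read everything as text so no numeric parsing happens yet.
con <- textConnection(records)
raw <- read.table(con, header = FALSE, sep = "|", quote = "",
                  comment.char = "", strip.white = TRUE,
                  colClasses = "character")
close(con)

# Hypothetical helper: same cleanup steps as in the answers above.
to_number <- function(x) {
  x <- gsub(".", "", x, fixed = TRUE)   # drop thousands separators
  x <- sub(",", ".", x, fixed = TRUE)   # decimal comma -> decimal point
  x <- sub("^(.+)-$", "-\\1", x)        # trailing minus -> leading minus
  as.numeric(x)
}

# Convert every column except the first (the label) in place.
raw[-1] <- lapply(raw[-1], to_number)
str(raw)
```

Keeping the label column out of the conversion avoids the NA coercion warning that appears when "A" is passed to as.numeric().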

Thanks, gsk3. This is essentially the same as DWin's answer, which I read first. - Terje Sten Bjerkseth
No problem. Glad you got some answers that solved your problem. - Ari B. Friedman
