Plotting multiple CSV files in one window

Question

4

I have a list of 701 CSV files. Each file has the same number of columns (7) but a different number of rows (between 25,000 and 28,000).

Here is an excerpt from the first file:

Date,Week,Week Day,Hour,Price,Volume,Sale/Purchase
18/03/2011,11,5,1,-3000.00,17416,Sell
18/03/2011,11,5,1,-1001.10,17427,Sell
18/03/2011,11,5,1,-1000.00,18055,Sell
18/03/2011,11,5,1,-500.10,18057,Sell
18/03/2011,11,5,1,-500.00,18064,Sell
18/03/2011,11,5,1,-400.10,18066,Sell
18/03/2011,11,5,1,-400.00,18066,Sell
18/03/2011,11,5,1,-300.10,18068,Sell
18/03/2011,11,5,1,-300.00,18118,Sell

Now I am trying to plot Volume against Date, restricted to rows where Price is exactly 200.00. I would like a single window that shows how Volume develops over time.

allenamen <- dir(pattern="*.csv")
alledat <- lapply(allenamen, read.csv, header = TRUE, 
   sep = ",", stringsAsFactors = FALSE)
verlauf <- function(a) {plot(Volume ~ Date, a, 
  data=subset(a, (Price=="200.00")), 
  ylim = c(15000, 45000), 
  xlim = as.Date(c("2011-12-30", "2013-01-20")), type = "l")}
lapply(alledat, verlauf)

But I get this error:

error in strsplit(log, NULL): non-character argument

How can I avoid this error?

- fYpsE

3

Your code does not call strsplit anywhere. - David Arenburg

What class is your Date column? - talat

@beginneR class(alledat$Date) returns "NULL". - fYpsE

I still get the error: Error in strsplit(log, NULL): non-character argument. - fYpsE

3

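You need to take a step back. Create a small dummy alledat and verify that its contents have the types you expect (numeric, character, and so on). Then verify that your subset call extracts the right rows, and only then plot that data. This will help you see what your code should look like and help you locate the error. - Carl Witthoft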
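The step-by-step check suggested above can be sketched on a tiny hand-made data frame. The object `dummy` and its values are invented for illustration; only the column names come from the question. Note that the call in the question passes the data frame twice (`a` positionally and again through `data =`). That is a likely source of the strsplit(log, NULL) error, since an extra unnamed argument can end up matched to the `log` parameter of plot. Passing the data once, by name, avoids that ambiguity.

```r
# Tiny stand-in for one file; the values are invented for illustration.
dummy <- data.frame(
  Date   = c("18/03/2011", "19/03/2011", "20/03/2011", "20/03/2011"),
  Price  = c(200.00, 200.00, 200.00, -3000.00),
  Volume = c(17416, 18055, 18118, 17427),
  stringsAsFactors = FALSE
)

# 1. Check the column types before doing anything else.
str(dummy)
class(dummy$Date)   # "character": not yet a Date

# 2. Check that the subset picks the intended rows.
sel <- subset(dummy, Price == 200)
nrow(sel)           # 3

# 3. Convert the dates, then plot; the data frame is passed once, by name.
sel$Date <- as.Date(sel$Date, format = "%d/%m/%Y")
plot(Volume ~ Date, data = sel, type = "l")
```

Once each step behaves as expected on the small example, the same three steps can be applied to the real files.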


3 Answers

2

If you want to combine the Price == 200 rows from all files into a single plot, you can use the following function:

plotprice <- function(x) {
  files <- list.files(pattern="*.csv")
  df <- data.frame()
  for(i in 1:length(files)){
    xx <- read.csv(as.character(files[i]))
    xx <- subset(xx, Price==x)
    df <- rbind(df, xx)
  }
  df$Date <- as.Date(as.character(df$Date), format="%d/%m/%Y")
  plot(Volume ~ Date, df, ylim = c(15000, 45000), xlim = as.Date(c("2011-12-30", "2013-01-20")), type = "l")
}

Calling plotprice(200) gives you a single plot with the data from all files at a price of 200.

If you want a separate plot for each CSV file, you can use:

ploteach <- function(x) {
  files <- list.files(pattern="*.csv")
  for(i in 1:length(files)){
    df <- read.csv(as.character(files[i]))
    df <- subset(df, Price==x)
    df$Date <- as.Date(as.character(df$Date), format="%d/%m/%Y")
    plot(Volume ~ Date, df, ylim = c(15000, 45000), xlim = as.Date(c("2011-12-30", "2013-01-20")), type = "l")
  }
}

ploteach(200)

- Jaap

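The code works, thank you very much. Just one question: if I change the subset line to "df <- subset(df, Price == x, Hour == 9, Sale.Purchase == "Sell")", I get an error saying that R cannot find Sale.Purchase. Do you have any idea why? - fYpsE

@fYpsE In your question the variable is named "Sale/Purchase", not "Sale.Purchase". That may be causing the error. - Jaap

head() shows the column name as Sale.Purchase, so I have to use Sale.Purchase. In any case, the code works perfectly. Thank you very much! - fYpsE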
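Two details seem to lie behind this exchange. By default read.csv passes column names through make.names (check.names = TRUE), which is why the header Sale/Purchase shows up as Sale.Purchase. And subset takes one logical condition as its second argument, so several conditions are joined with & rather than separated by commas. A minimal sketch, using a small made-up file written to a temporary location:

```r
# Write a small illustrative CSV (values invented) to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Date,Week,Week Day,Hour,Price,Volume,Sale/Purchase",
  "18/03/2011,11,5,9,200.00,17416,Sell",
  "18/03/2011,11,5,9,200.00,18055,Purchase",
  "18/03/2011,11,5,1,200.00,18057,Sell"
), tmp)

df <- read.csv(tmp, stringsAsFactors = FALSE)
names(df)   # "Week Day" is read as Week.Day, "Sale/Purchase" as Sale.Purchase

# Join the conditions with & so that they form a single logical vector.
sel <- subset(df, Price == 200 & Hour == 9 & Sale.Purchase == "Sell")
sel
```

With comma-separated conditions, the second one is taken as the `select` argument of subset and the third as `drop`, so they are not applied as row filters.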

0

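OK, first you need to turn the result of lapply with read.csv from a list of 701 data frames (one per CSV file) into a single data frame.

Added a function that reads and subsets each file, to avoid running out of memory: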

#
# function to read and subset data to avoid running out of RAM
read.subset <- function(dateiname){
   a <- read.csv(file = dateiname, header = TRUE, sep = ",",
                 stringsAsFactors = FALSE)
   a <- a[a$Price == 200.00,]
   print(gc())    # monitor and clean RAM after each file is read
   return(a)
}

* Update 2: added read.subset.fast, a faster implementation of read.subset that uses scan

# function to read and subset data to avoid running out of RAM
read.subset.fast <- function(dateiname){
   # get data from csv into a data.frame
   a <- scan(file          = dateiname,
             what          = c(list(character()),
                               rep(list(numeric()),5),
                               list(character())),
             skip          = 1,  # skip header (equivalent to header = TRUE)
             sep           = ",")
   # transform efficiently list into data.frame
   attributes(a) <- list(class      = "data.frame",
                         row.names  = c(NA_integer_, length(a[[1]])),
                         names      = scan(file          = dateiname,
                                           what          = character(),
                                           skip          = 0,  
                                           nlines        = 1,  # just read first line to extract column names
                                           sep           = ","))
   # subset data
   a <- a[a$Price == 200.00,]
   print(gc())
   return(a)
}
#

Now let's read, subset, and combine the data into a single data frame:

#
allenamen <- list.files(pattern="*.csv") # updated (@Richard Scriven)
# get a single data frame, instead of a list of 701 data frames
alledat <- do.call(rbind, lapply(allenamen, read.subset.fast))
#

Convert the dates to the proper format:

# get dates in dates format
alledat$Date <- as.Date(as.character(alledat$Date), format="%d/%m/%Y")

Then you are good to go, and no function is needed. Just plot it:

plot(Volume ~ Date, 
     data = alledat,
     ylim = range(Volume),
     xlim = range(Date),
     type = "l")

- luis_js

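That is a nice idea, but I have a problem. When I try to combine the files into one data frame, my computer works for 10 minutes and then fails with: In rbind(deparse.level, ...) : Reached total allocation of 4043Mb: see help(memory.size). The files are too large for this. Is it possible to filter the rows with Price == 200.00 before the rbind call? - fYpsE

Yes, I added the read.subset function to do exactly that. Does the solution answer your question now? - luis_js

header = TRUE and sep = "," are both defaults in read.csv, so neither is needed. - Rich Scriven

After running alledat <- do.call(rbind, lapply(allenamen, read.subset.fast)) I got the following message: error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : scan() expected 'a real', got '3B'. If I then go on to run alledat$Date <- as.Date(as.character(alledat$Date), format="%d/%m/%Y"), R says that the object 'alledat' cannot be found. - fYpsE

I tested read.subset.fast with the dataset you provided and it works. One of your .csv files probably contains some non-standard data. So please try read.subset instead of read.subset.fast and let me know if it does not work. - luis_js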
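One way to track down such non-standard values is to read each file as text and look for entries in the numeric columns that do not parse as numbers. The sketch below is only an illustration: the helper name `find_bad_values` is hypothetical, and it assumes the column layout shown in the question (with names as read.csv converts them).

```r
# Hypothetical helper: report entries in the numeric columns that are not numbers.
find_bad_values <- function(dateiname) {
  # Read everything as character so that no value is rejected on input.
  a <- read.csv(dateiname, colClasses = "character")
  numeric_cols <- c("Week", "Week.Day", "Hour", "Price", "Volume")
  bad <- lapply(numeric_cols, function(col) {
    values <- a[[col]]
    # as.numeric() yields NA for text such as "3B"; blank or NA entries are not flagged.
    is_bad <- !is.na(values) & nzchar(values) &
      is.na(suppressWarnings(as.numeric(values)))
    if (!any(is_bad)) return(NULL)
    data.frame(file = dateiname, column = col, row = which(is_bad),
               value = values[is_bad], stringsAsFactors = FALSE)
  })
  do.call(rbind, bad)   # NULL when the file is clean
}

# Usage: check every CSV file in the working directory.
# problems <- do.call(rbind, lapply(list.files(pattern = "\\.csv$"), find_bad_values))
```

The reported row numbers count data rows, so the header line is not included.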


Source: Stack Overflow.

- Rich Scriven · Accepted Answer

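Here are a few suggestions:

Use list.files rather than dir to find the files. In base R, dir is an alias for list.files, so the behavior is the same, but list.files states the intent more clearly. Called without a path, both list the current directory.
header = TRUE and sep = "," are defaults of read.csv, so they are unnecessary in your code.
Subset each file as you read it.

Here is a suggested approach.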

fnames <- list.files(pattern = "*.csv")
read <- lapply(fnames, function(x) {
    rd <- read.csv(x, stringsAsFactors = FALSE)
    subset(rd, Price == 200)
})
dat <- do.call(rbind, read)

You should then be able to plot dat.
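Putting the suggestions together, a minimal sketch might look like the following. Several things here are assumptions rather than part of the answer: the function name `read_price` and its arguments are made up; the regular expression "\\.csv$" is swapped in for "*.csv" because list.files treats `pattern` as a regular expression, not a shell glob; and the final sort by date is added so that a line plot is drawn in chronological order.

```r
# Hypothetical helper: read every CSV in `path`, keep the rows at `price`,
# and return them as one data frame with Date converted to class Date.
read_price <- function(path = ".", price = 200) {
  # "\\.csv$" matches file names that end in ".csv".
  fnames <- list.files(path, pattern = "\\.csv$", full.names = TRUE)
  parts <- lapply(fnames, function(f) {
    rd <- read.csv(f, stringsAsFactors = FALSE)
    # Subset per file to keep memory use low; which() drops NA comparisons.
    rd[which(rd$Price == price), ]
  })
  dat <- do.call(rbind, parts)
  if (is.null(dat)) return(NULL)   # no CSV files were found
  dat$Date <- as.Date(dat$Date, format = "%d/%m/%Y")
  dat[order(dat$Date), ]           # chronological order for type = "l"
}

# Usage, assuming the CSV files are in the working directory:
# dat <- read_price()
# plot(Volume ~ Date, data = dat, type = "l")
```

Because each file is filtered before rbind is called, only the matching rows are held in memory at once, which is the point raised in the comments on the previous answer.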