Reading a text file with multiple spaces as the delimiter in R

Question

83

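I have a large data set with about 94 columns and 3 million rows. The file uses single or multiple spaces as the delimiter between columns. I need to read some of the columns from this file into R. To do this I tried read.table() with the options shown in the code below: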

    # Define the columns to read: the first 5 columns are read, the next 24 are skipped,
    # the following 5 are read, and the last 60 are skipped.
    col_classes = c(rep("character",2), rep("numeric", 3), rep("NULL",24), rep("numeric", 5), rep("NULL", 60))

    # Read the first 100 rows of the data
    data <- read.table(file, sep = " ",header = F, nrows = 100, na.strings ="", stringsAsFactors= F)

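Because the file has multiple spaces as the delimiter between some of the columns, the approach above does not work. Is there another way to read this file efficiently?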

- Pawan

6

Just remove the sep=" " argument. read.table handles multiple spaces by default. - Hong Ooi

1

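I have a very similar problem, but I need a more general solution because some of my fields contain single spaces. That means I need to be able to set a minimum number of consecutive spaces (2 in my case) as the delimiter, with no upper limit. - EdM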
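One possible way to handle the "two or more spaces" case in base R is to split each line on a regular expression instead of relying on `sep`. This is only a sketch under assumptions: the sample lines and the column names `city`, `count` and `value` are made up for illustration, and for a real file the `lines` vector would come from `readLines()`.

```r
# Sample lines: fields are separated by two or more spaces,
# while a single space may appear inside a field.
lines <- c("New York  10   2.5",
           "Los Angeles    20  3.5")

# Split each line on runs of at least two spaces.
parts <- strsplit(trimws(lines), " {2,}")

# Bind the pieces into a data frame and convert the column types.
df <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(df) <- c("city", "count", "value")  # hypothetical column names
df[] <- lapply(df, type.convert, as.is = TRUE)
print(df)
```

This holds every line in memory as text before splitting, so it is likely to be slower than read.table on a file with millions of rows.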

Related post: https://dev59.com/nYzda4cB1Zd3GeqPjC-J - zx8754

1

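@HongOoi: Yes, but only because the default separator for read.table/read.csv is sep="", which means "multiple whitespace"; we would have expected that to be written as the regex "\w*" or "\w+", not "". - smci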

3 Answers

12

If you want to use the tidyverse (or just the readr package), you can use read_table instead.

    read_table(file, col_names = TRUE, col_types = NULL,
      locale = default_locale(), na = "NA", skip = 0, n_max = Inf,
      guess_max = min(n_max, 1000), progress = show_progress(), comment = "")

From the function's description:

read_table() and read_table2() are designed to read the type of textual data where
each column is separated by one (or more) columns of space.
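As a rough illustration of the readr route, the sketch below reads a small made-up file and skips one column while doing so. It assumes the readr package is installed (the block does nothing otherwise). The sample data and the name `tmp` are illustrative. The compact `col_types` string uses `i` for integer, `d` for double and `-` to skip a column.

```r
# Assumes the readr package is installed; skipped otherwise.
if (requireNamespace("readr", quietly = TRUE)) {
  # Write a small whitespace-aligned sample file.
  tmp <- tempfile(fileext = ".txt")
  writeLines(c("id  name   score",
               "1   x      1.5",
               "2   y      2.5"), tmp)

  # "i-d": read column 1 as integer, skip column 2, read column 3 as double.
  df <- readr::read_table(tmp, col_names = TRUE, col_types = "i-d")
  print(df)
}
```

For the 94-column file in the question, the same idea would mean building a 94-character `col_types` string with `-` in every position that should be skipped.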

- Revan

4

If your fields have fixed widths, consider using read.fwf(), which may handle missing values better.
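To show why fixed widths can help with missing values, here is a small sketch with a made-up three-column layout of widths 5, 4 and 2. The sample data and the column names `id`, `value` and `n` are assumptions for illustration. The second row has a blank `value` field, which a whitespace-delimited reader would not see as a separate column.

```r
# Build fixed-width sample lines: widths 5, 4 and 2 characters.
# The second row has an empty middle field.
rows <- sprintf("%-5s%-4s%2s",
                c("ab", "cd", "ef"),
                c("1.5", "", "3.5"),
                c("10", "20", "30"))
tmp <- tempfile(fileext = ".txt")
writeLines(rows, tmp)

# read.fwf cuts each line by position, so the blank field becomes NA
# instead of shifting the later columns to the left.
df <- read.fwf(tmp, widths = c(5, 4, 2),
               col.names = c("id", "value", "n"),  # hypothetical names
               strip.white = TRUE, stringsAsFactors = FALSE)
print(df)
```

A negative entry in `widths` skips that many characters, which gives a way to leave out unwanted columns.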

- cmbarbu


- Simon O'Hanlon · Accepted Answer

You need to change your separator. " " means a single space character. "" means whitespace of any length is treated as the delimiter.

    data <- read.table(file, sep = "" , header = F , nrows = 100,
                       na.strings ="", stringsAsFactors= F)

From the manual:

If sep = "" (the default for read.table) the separator is "white space", that is one or more spaces, tabs, newlines or carriage returns.
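A minimal sketch of the difference, using a small made-up file written to a temporary path (the sample data and the names `tmp`, `ok` and `bad` are illustrative, not from the question). It also passes `colClasses`, where `"NULL"` drops a column, which is how a vector like the question's `col_classes` would be put to use:

```r
# Write a tiny sample file whose columns are separated by one or more spaces.
tmp <- tempfile(fileext = ".txt")
writeLines(c("a1  x   1.5 10",
             "b2 y 2.5 20"), tmp)

# sep = "" (the default) treats any run of whitespace as one delimiter.
# A colClasses entry of "NULL" drops that column while reading.
ok <- read.table(tmp, sep = "", header = FALSE,
                 colClasses = c("character", "NULL", "numeric", "numeric"),
                 stringsAsFactors = FALSE)
print(ok)

# sep = " " treats every single space as a delimiter, so runs of spaces
# create empty fields and the two rows end up with different field counts.
# The error (if any) is captured as a message instead of stopping the script.
bad <- tryCatch(
  read.table(tmp, sep = " ", header = FALSE, stringsAsFactors = FALSE),
  error = function(e) conditionMessage(e)
)
print(bad)
```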

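Also, for a large data file you may want to consider data.table::fread to quickly read the data straight into a data.table. I was using this function myself this morning. It was still experimental at the time of writing, but I find it works very well indeed.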
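A hedged sketch of the fread route on a small made-up file follows. It assumes the data.table package is installed (the block does nothing otherwise). It also assumes that fread, given `sep = " "`, treats a run of consecutive spaces as a single separator, which is worth checking against the documentation of your installed version. The `select` argument keeps only the listed columns, which fits the question's goal of reading a subset of columns.

```r
# Assumes the data.table package is installed; skipped otherwise.
if (requireNamespace("data.table", quietly = TRUE)) {
  tmp <- tempfile(fileext = ".txt")
  writeLines(c("a1  x   1.5 10",
               "b2 y 2.5    20"), tmp)

  # Keep columns 1, 3 and 4 by position; column 2 is dropped.
  dt <- data.table::fread(tmp, sep = " ", header = FALSE,
                          select = c(1, 3, 4))
  print(dt)
}
```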