将分块文件读取到数据框中

4

我对pandas/r还比较新手,不太确定如何将这些数据读入pandas或r进行分析。

目前,我想使用readr的read_chunkwise或pandas的chunksize,但这可能不是我需要的。这个问题是否可以使用for循环或使用purr来迭代所有元素来轻松解决?

数据:

wine/name: 1981 Château de Beaucastel Châteauneuf-du-Pape
wine/wineId: 18856 
wine/variant: Red Rhone Blend 
wine/year: 1981 
review/points: 96   
review/time: 1160179200   
review/userId: 1 
review/userName: Eric 
review/text: Olive, horse sweat, dirty saddle, and smoke. This actually got quite a bit more spicy and expressive with significant aeration. This was a little dry on the palate first but filled out considerably in time, lovely, loaded with tapenade, leather, dry and powerful, very black olive, meaty. This improved considerably the longer it was open. A terrific bottle of 1981, 96+ and improving. This may well be my favorite vintage of Beau except for perhaps the 1990.

wine/name: 1995 Château Pichon-Longueville Baron 
wine/wineId: 3495 wine/variant: Red Bordeaux Blend 
wine/year: 1995 
review/points: 93 
review/time: 1063929600 
review/userId: 1 
review/userName: Eric 
review/text: A remarkably floral nose with violet and chambord. On the palate this is super sweet and pure with a long, somewhat searing finish. My notes are very terse, but this was a lovely wine.

目前,这是我的一个函数,但我遇到了一个错误:

>

 convertchunkfile <- function(df){   for(i in 1:length(df)){
>     #While the length of any line is not 0, process it with the following loop
>     while(nchar(df[[i]]) != 0){
>       case_when(
>         
>         #When data at x index == wine/name, then extract the data after that clause
>         #Wine Name parsing
>         cleandf$WineName[[i]] <- df[i] == str_sub(df[1],0, 10) ~ str_trim(substr(df[1], 11, nchar(df[1]))),
>         #Wine ID parsing
>         cleandf$WineID[[i]] <- df[i] == str_sub(df[2],0,11) ~ str_trim(substr(df[2], 13, nchar(df[1])))
>         #same format for other attributes
>       )
>     }   
>    }
>  } 

Error in cleandf$BeerName[[i]] <- df[i] == str_sub(df[1], 0, 10) ~ str_trim(substr(df[1],  : 
  more elements supplied than there are to replace

编辑:

在解决了一些问题之后,我认为这可能是最好的解决方案,参考了@hereismyname的解决方案:

#Use Bash's iconv to force convert the file in OS X
iconv -c -t UTF-8 cellartracker-clean.txt > cellartracker-iconv.txt

#Check number of lines within the file
wc -l cellartracker-iconv.txt
20259950 cellartracker-iconv.txt

#Verify new encoding of the file
file -I cellartracker-clean.txt


ReadEmAndWeep <- function(file, chunk_size) {
  f <- function(chunk, pos) {
    data_frame(text = chunk) %>%
      filter(text != "") %>%
      separate(text, c("var", "value"), ":", extra = "merge") %>%
      mutate(
        chunk_id = rep(1:(nrow(.) / 9), each = 9),
        value = trimws(value)
      ) %>%
      spread(var, value)
  }

  read_lines_chunked(file, DataFrameCallback$new(f), chunk_size = chunk_size)
}

#Final Function call to read in the file
dataframe <- ReadEmAndWeep(file, chunk_size = 100000)

你似乎有一些带有多个变量的行(例如:wine/year: 1981 review/points: 96)。这是正确的吗? - A5C1D2H2I1M1N2O1R2T1
谢谢指出,实际上格式错了。我已经更新了问题中的数据。谢谢。 - petergensler
3个回答

2

以下是在R中相当惯用的方法:

library(readr)
library(tidyr)
library(dplyr)

out <- data_frame(text = read_lines(the_text)) %>%
  filter(text != "") %>% 
  separate(text, c("var", "value"), ":", extra = "merge") %>% 
  mutate(
    chunk_id = rep(1:(nrow(.) / 9), each = 9),
    value    = trimws(value)
  ) %>% 
  spread(var, value)

你能否评论一下这个代码块中的“chunk id”行在做什么?我不确定这行代码在干什么。 - petergensler
@petergensler chunk_id 变量只是为每个评论创建一个唯一的记录编号。每个评论都作为 9 行文本集合传入,但代码将每行转换为一列。 - Andrew
为什么要使用rep() / 9?那部分让我感到困惑。 - petergensler

1

以下是将这些记录读入 pandas.DataFrame 的代码。这些记录的结构类似于 yaml 记录,因此该代码利用了这一点。空行用作记录分隔符。

import pandas as pd
import collections
import yaml

def read_records(lines):
    # keep track of the columns in an ordered set
    columns = collections.OrderedDict()

    record = []
    records = []
    for line in lines:
        if line:
            # gather each line of text until a blank line
            record.append(line)

            # keep track of the columns seen in an ordered set
            columns[line.split(':')[0].strip()] = None

        # if the line is empty and we have a record, then convert it 
        elif record:

            # use yaml to convert the lines into a dict
            records.append(yaml.load('\n'.join(record)))
            record = []

    # record last record
    if record:
        records.append(yaml.load('\n'.join(record)))

    # return a pandas dataframe from the list of dicts
    return pd.DataFrame(records, columns=list(columns.keys()))

测试代码:
print(read_records(data))

Results:

                                           wine/name  wine/wineId  \
0  1981 Ch&#226;teau de Beaucastel Ch&#226;teaune...        18856   
1         1995 Ch&#226;teau Pichon-Longueville Baron         3495   

         wine/variant  wine/year  review/points  review/time  review/userId  \
0     Red Rhone Blend       1981             96   1160179200              1   
1  Red Bordeaux Blend       1995             93   1063929600              1   

  review/userName                                        review/text  
0            Eric  Olive, horse sweat, dirty saddle, and smoke. T...  
1            Eric  A remarkably floral nose with violet and chamb...  

测试数据:

data = [x.strip() for x in """
    wine/name: 1981 Ch&#226;teau de Beaucastel Ch&#226;teauneuf-du-Pape
    wine/wineId: 18856
    wine/variant: Red Rhone Blend
    wine/year: 1981
    review/points: 96
    review/time: 1160179200
    review/userId: 1
    review/userName: Eric
    review/text: Olive, horse sweat, dirty saddle, and smoke. This actually got quite a bit more spicy and expressive with significant aeration. This was a little dry on the palate first but filled out considerably in time, lovely, loaded with tapenade, leather, dry and powerful, very black olive, meaty. This improved considerably the longer it was open. A terrific bottle of 1981, 96+ and improving. This may well be my favorite vintage of Beau except for perhaps the 1990.

    wine/name: 1995 Ch&#226;teau Pichon-Longueville Baron
    wine/wineId: 3495
    wine/variant: Red Bordeaux Blend
    wine/year: 1995
    review/points: 93
    review/time: 1063929600
    review/userId: 1
    review/userName: Eric
    review/text: A remarkably floral nose with violet and chambord. On the palate this is super sweet and pure with a long, somewhat searing finish. My notes are very terse, but this was a lovely wine.
""".split('\n')[1:-1]]

谢谢您的尝试!我之前已经尝试过在r中使用我的上面列出的代码,但是我开始怀疑pandas是否真的是完成这个任务的更好工具……请让我知道您对我的代码有何看法。 - petergensler
很遗憾,我从未涉足过 R,所以在那方面无法提供任何帮助。干杯。 - Stephen Rauch
没问题,我想我可能需要研究一下pandas,因为这个R代码已经相当复杂了... pandas有没有读取像这样格式化的数据块的选项?我在其他数据源中看到过这种情况,认为应该有更简单的处理方法。 - petergensler

0

这是我建议的方法:

y <- readLines("your_file")
y <- unlist(strsplit(gsub("(wine\\/|review\\/)", "~~~\\1", y), "~~~", TRUE))

library(data.table)
dcast(fread(paste0(y[y != ""], collapse = "\n"), header = FALSE)[
  , rn := cumsum(V1 == "wine/name")], rn ~ V1, value.var = "V2")

唯一的假设是每个新葡萄酒的第一行都以wine/name开头。空行等不重要。

这里有两个数据集可供尝试。

将代码第一行中的"your_file"替换为url1url2来尝试它。

url1 <- "https://gist.githubusercontent.com/mrdwab/3db1f2d6bf75e9212d9e933ad18d2865/raw/7376ae59b201d57095f849cab079782efb8ac827/wines1.txt"

url2 <- "https://gist.githubusercontent.com/mrdwab/3db1f2d6bf75e9212d9e933ad18d2865/raw/7376ae59b201d57095f849cab079782efb8ac827/wines2.txt"

请注意第二个数据集中第一瓶酒的wine/variant:值缺失。
最好在awk或类似工具中使用gsub,然后直接使用fread

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接