读取大型XLS和XLSX文件

8

我知道周围有一些帖子,我已经尝试了几次来达到我的目标,如下所述:

我有一个包含多个xlsxlsx文件的.zip/.rar

每个Excel文件包含数十行至数千行,大约90列(每个Excel文件的列数可能多也可能少)。

我创建了一个java窗口生成器应用程序,在其中选择一个.zip/.rar文件并选择在何处解压这些文件并使用FileOutputStream创建它们。保存每个文件后,我正在读取其内容。

到目前为止都很好。 为了避免OOM(Out Of Memory)和加快速度,我尝试了几次,并达到了“最终版本”(虽然它很糟糕,但我需要弄清楚如何正确读取它) ,我将解释:

File file = new File('certainFile.xlsx'); //or xls, For example purposes
Workbook wb;
Sheet sheet;
/*
There is a ton of other things up to this point that I don't consider relevant, as it's related to unzipping and renaming, etc. 
This is within a cycle

/
In every zip file, there is at least 1 or 2 files that somehow, when it goes to
WorkbookFactory.create(), it still gives an OOM because it recognizes is has 
a bit over a million rows, meaning it's an 2007 format file (according to our friend Google.com), or so I believe so.
When I open the xlsx file, it indeed has like 10-20mb size and thousands of empty rows. When I save it again
it has 1mb and a couple thousand. After many attempts to read as InputStream, File or trying to save it in 
an automatic way, I've worked with converting it to a CSV and read it differently, 
ence, this 'solution'. if parseAsXLS is true, it applies my regular logic 
per row per cell, otherwise I parse the CSV.
*/
if (file.getName().contains("xlsx")) {
    this.parseAsXLS = false;
    OPCPackage pkg = OPCPackage.open(file);
    //This is just to output the content into a csv file, that I will read later on and it gets overwritten everytime it comes by
    FileOutputStream fo = new FileOutputStream(this.filePath + File.separator + "excel.csv");
    PrintStream ps = new PrintStream(fo);
    XLSX2CSV xlsxCsvConverter = new XLSX2CSV(pkg, ps, 90);
    try {
        xlsxCsvConverter.process();
    } catch (Exception e) {
        //I've added a count at the XLSX2CSV class in order to limit the ammount of rows I want to fetch and throw an Exception on purpose
        System.out.println("Limited the file at 60k rows");
    }
} else {
    this.parseAsXLS = true;
    this.wb = WorkbookFactory.create(file);
    this.sheet = wb.getSheetAt(0);
}

现在的情况是,从一个包含多个其他 .xls 和 .xlsx 文件的 .zip 文件中提取出一个 .xlsx 文件,在某一行中有一个特定字符,而 XLSX2CSV 将其视为 endRow,导致输出不正确。
这是一个例子:imagelink 注意:目标是仅从每个 Excel 文件中获取一组特定的列(可能有共同的或不强制要求),并将它们放在一个新的 Excel 中。电子邮件列(其中包含用逗号分隔的多个电子邮件)中有一个我认为是电子邮件前的“回车符”,因为如果我手动删除它,问题就会得到解决。但是,目标是不手动打开每个 Excel 并修复它,否则我只需打开每个 Excel 并复制粘贴所需的列即可。在这个例子中,我需要的列是:fieldAA、fieldAG、fieldAL 和 fieldAN。

XLSX2CSV.java(我不是这个文件的创建者,只是根据我的需求进行了应用)

import java.awt.List;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintStream;

import javax.xml.parsers.ParserConfigurationException;

import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.util.CellAddress;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.util.SAXHelper;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/**
 * A rudimentary XLSX -> CSV processor modeled on the
 * POI sample program XLS2CSVmra from the package
 * org.apache.poi.hssf.eventusermodel.examples.
 * As with the HSSF version, this tries to spot missing
 *  rows and cells, and output empty entries for them.
 * <p>
 * Data sheets are read using a SAX parser to keep the
 * memory footprint relatively small, so this should be
 * able to read enormous workbooks.  The styles table and
 * the shared-string table must be kept in memory.  The
 * standard POI styles table class is used, but a custom
 * (read-only) class is used for the shared string table
 * because the standard POI SharedStringsTable grows very
 * quickly with the number of unique strings.
 * <p>
 * For a more advanced implementation of SAX event parsing
 * of XLSX files, see {@link XSSFEventBasedExcelExtractor}
 * and {@link XSSFSheetXMLHandler}. Note that for many cases,
 * it may be possible to simply use those with a custom 
 * {@link SheetContentsHandler} and no SAX code needed of
 * your own!
 */
public class XLSX2CSV {
    /**
     * Uses the XSSF Event SAX helpers to do most of the work
     *  of parsing the Sheet XML, and outputs the contents
     *  as a (basic) CSV.
     */
    private class SheetToCSV implements SheetContentsHandler {
        private boolean firstCellOfRow;
        private int currentRow = -1;
        private int currentCol = -1;
        private int maxrows = 60000;



        private void outputMissingRows(int number) {

            for (int i=0; i<number; i++) {
                for (int j=0; j<minColumns; j++) {
                    output.append(',');
                }
                output.append('\n');
            }
        }

        @Override
        public void startRow(int rowNum) {
            // If there were gaps, output the missing rows
            outputMissingRows(rowNum-currentRow-1);
            // Prepare for this row
            firstCellOfRow = true;
            currentRow = rowNum;
            currentCol = -1;

            if (rowNum == maxrows) {
                    throw new RuntimeException("Force stop at maxrows");
            }
        }

        @Override
        public void endRow(int rowNum) {
            // Ensure the minimum number of columns
            for (int i=currentCol; i<minColumns; i++) {
                output.append(',');
            }
            output.append('\n');
        }

        @Override
        public void cell(String cellReference, String formattedValue,
                XSSFComment comment) {
            if (firstCellOfRow) {
                firstCellOfRow = false;
            } else {
                output.append(',');
            }            

            // gracefully handle missing CellRef here in a similar way as XSSFCell does
            if(cellReference == null) {
                cellReference = new CellAddress(currentRow, currentCol).formatAsString();
            }

            // Did we miss any cells?
            int thisCol = (new CellReference(cellReference)).getCol();
            int missedCols = thisCol - currentCol - 1;
            for (int i=0; i<missedCols; i++) {
                output.append(',');
            }
            currentCol = thisCol;

            // Number or string?
            try {
                //noinspection ResultOfMethodCallIgnored
                Double.parseDouble(formattedValue);
                output.append(formattedValue);
            } catch (NumberFormatException e) {
                output.append('"');
                output.append(formattedValue);
                output.append('"');
            }
        }

        @Override
        public void headerFooter(String arg0, boolean arg1, String arg2) {
            // TODO Auto-generated method stub

        }
    }


    ///////////////////////////////////////

    private final OPCPackage xlsxPackage;

    /**
     * Number of columns to read starting with leftmost
     */
    private final int minColumns;

    /**
     * Destination for data
     */
    private final PrintStream output;

    /**
     * Creates a new XLSX -> CSV converter
     *
     * @param pkg        The XLSX package to process
     * @param output     The PrintStream to output the CSV to
     * @param minColumns The minimum number of columns to output, or -1 for no minimum
     */
    public XLSX2CSV(OPCPackage pkg, PrintStream output, int minColumns) {
        this.xlsxPackage = pkg;
        this.output = output;
        this.minColumns = minColumns;
    }

    /**
     * Parses and shows the content of one sheet
     * using the specified styles and shared-strings tables.
     *
     * @param styles The table of styles that may be referenced by cells in the sheet
     * @param strings The table of strings that may be referenced by cells in the sheet
     * @param sheetInputStream The stream to read the sheet-data from.

     * @exception java.io.IOException An IO exception from the parser,
     *            possibly from a byte stream or character stream
     *            supplied by the application.
     * @throws SAXException if parsing the XML data fails.
     */
    public void processSheet(
            StylesTable styles,
            ReadOnlySharedStringsTable strings,
            SheetContentsHandler sheetHandler, 
            InputStream sheetInputStream) throws IOException, SAXException {
        DataFormatter formatter = new DataFormatter();
        InputSource sheetSource = new InputSource(sheetInputStream);
        try {
            XMLReader sheetParser = SAXHelper.newXMLReader();
            ContentHandler handler = new XSSFSheetXMLHandler(
                  styles, null, strings, sheetHandler, formatter, false);
            sheetParser.setContentHandler(handler);
            sheetParser.parse(sheetSource);
         } catch(ParserConfigurationException e) {
            throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
         }
    }

    /**
     * Initiates the processing of the XLS workbook file to CSV.
     *
     * @throws IOException If reading the data from the package fails.
     * @throws SAXException if parsing the XML data fails.
     */
    public void process() throws IOException, OpenXML4JException, SAXException {
        ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(this.xlsxPackage);
        XSSFReader xssfReader = new XSSFReader(this.xlsxPackage);
        StylesTable styles = xssfReader.getStylesTable();
        XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
        int index = 0;
        while (iter.hasNext()) {
            try (InputStream stream = iter.next()) {
                processSheet(styles, strings, new SheetToCSV(), stream);
            }
            ++index;
        }
    }
} 

我正在寻找不同(且有效的)方法来达成我的目标。

感谢您的时间。


让我来试着预测一下。你不会得到这个问题的任何正确答案。为什么?因为你声称 XLSX2CSV 无法正常工作。但是你提供的唯一东西是来自 apache poi 的众所周知的代码和一个你怀疑会导致问题的 Excel 文件的 图片。你认为谁会从那张图片中重新创建一个真实的 Excel 文件呢?即使有人这样做了,那么重新创建的文件是否也会导致问题呢? - Axel Richter
1
首先,我不要求任何人重新创建这个问题,但如果他们想要复制,那么有库和示例PDF。创建一个Java项目,复制粘贴需要5分钟,创建类似的Excel文件需要2-3分钟。我不是怀疑,我确定这是错误,有一个叫做调试的术语,我已经浪费时间去确定问题所在。其次,我要求意见和方法来实现我的目的,而不是要求构建一个应用程序。去喝杯咖啡,继续工作吧。 - abr
3个回答

2
这个怎么样:

//获取压缩文件流

ZipFile zipFile = new ZipFile(billWater, Charset.forName("gbk"));


ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream(billWater),  Charset.forName("gbk"));
//ZipEntry zipEntry;
//use openCsv 
 public static <T> List<T> processCSVFileByZip(ZipFile zipFile, ZipEntry zipEntry, Class<? extends T> clazz, Charset charset) throws IOException {
    Reader in = new InputStreamReader(zipFile.getInputStream(zipEntry), charset);
    return processCSVFile(in, clazz, charset, ',');
}

public static <T> List<T> processCSVFile(Reader in, Class<? extends T> clazz, Charset charset, char sep) {
    CsvToBean<T> csvToBean = new CsvToBeanBuilder(in)
            .withType(clazz).withSkipLines(1)
            .withIgnoreLeadingWhiteSpace(true).withSeparator(sep)
            .build();
    return csvToBean.parse();
}

//看起来这个问题与xlsx文件格式有关


1
我认为这里至少有两个未解决的问题:
  1. 在打开旧式XLS文件(它们是sparse)时,WorkbookFactory.create()会出现内存不足的情况

  2. XLSX2CSV正在破坏您的新式XLSX文件,可能是由于“某个字符[被错误地视为]endRow”

对于(1),我建议您找到一个Java XLS库,可以处理稀疏文件而不分配空间,或者一个Java XLS库,可以通过streaming方式处理文件,而不是WorkbookFactory采用的批处理方法。
对于(2),您需要找到一个不会破坏数据的Java XLSX库。
很抱歉,我不知道任何适用于(1)或(2)的好的Java库。
然而,我建议您在Excel中编写此脚本,而不是使用Java。 Excel内置了一个出色的脚本语言Excel VBA,可以处理打开多个文件,从中提取数据等操作。此外,您可以放心地运行在Excel VBA中的脚本不会遇到Java中遇到的稀疏表或XLSX解析等Excel功能问题。
(如果这是一项一次性工作,请您考虑手动完成所需时间与编写脚本所需时间之间的差异。)
祝您好运!

从来没有想过那个,我会去谷歌一下并尝试一下,感谢您的建议。 - abr

1
好的,所以我尝试复制您的Excel文件,并完全放弃了XLSX2CSV。我认为将xlsx转换为csv的方法不正确,因为根据您的XLSX格式,它可以读取所有空行(您可能知道这一点,因为您已经设置了60k的行计数器)。不仅如此,如果我们考虑到字段,它可能会导致特殊字符的不正确输出,就像您的问题一样。
我所做的是使用这个库https://github.com/davidpelfree/sjxlsx来读取和重新编写文件。它非常直观,新生成的xlsx文件已经纠正了字段。
我建议您尝试这种方法(也许不是使用这个库),以尝试重新编写文件以进行更正。

我一直在尝试应用那种方法,如果有任何更新,我会告诉你的。 - abr

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接