我知道周围有一些帖子,我已经尝试了几次来达到我的目标,如下所述:
我有一个包含多个xls
和xlsx
文件的.zip
/.rar
。
每个Excel文件包含数十行至数千行,大约90列(每个Excel文件的列数可能多也可能少)。
我创建了一个java窗口生成器应用程序,在其中选择一个.zip
/.rar
文件并选择在何处解压这些文件并使用FileOutputStream
创建它们。保存每个文件后,我正在读取其内容。
到目前为止都很好。 为了避免OOM(Out Of Memory)和加快速度,我尝试了几次,并达到了“最终版本”(虽然它很糟糕,但我需要弄清楚如何正确读取它) ,我将解释:
File file = new File('certainFile.xlsx'); //or xls, For example purposes
Workbook wb;
Sheet sheet;
/*
There is a ton of other things up to this point that I don't consider relevant, as it's related to unzipping and renaming, etc.
This is within a cycle
/
In every zip file, there is at least 1 or 2 files that somehow, when it goes to
WorkbookFactory.create(), it still gives an OOM because it recognizes is has
a bit over a million rows, meaning it's an 2007 format file (according to our friend Google.com), or so I believe so.
When I open the xlsx file, it indeed has like 10-20mb size and thousands of empty rows. When I save it again
it has 1mb and a couple thousand. After many attempts to read as InputStream, File or trying to save it in
an automatic way, I've worked with converting it to a CSV and read it differently,
ence, this 'solution'. if parseAsXLS is true, it applies my regular logic
per row per cell, otherwise I parse the CSV.
*/
if (file.getName().contains("xlsx")) {
this.parseAsXLS = false;
OPCPackage pkg = OPCPackage.open(file);
//This is just to output the content into a csv file, that I will read later on and it gets overwritten everytime it comes by
FileOutputStream fo = new FileOutputStream(this.filePath + File.separator + "excel.csv");
PrintStream ps = new PrintStream(fo);
XLSX2CSV xlsxCsvConverter = new XLSX2CSV(pkg, ps, 90);
try {
xlsxCsvConverter.process();
} catch (Exception e) {
//I've added a count at the XLSX2CSV class in order to limit the ammount of rows I want to fetch and throw an Exception on purpose
System.out.println("Limited the file at 60k rows");
}
} else {
this.parseAsXLS = true;
this.wb = WorkbookFactory.create(file);
this.sheet = wb.getSheetAt(0);
}
现在的情况是,从一个包含多个其他 .xls 和 .xlsx 文件的 .zip 文件中提取出一个 .xlsx 文件,在某一行中有一个特定字符,而 XLSX2CSV 将其视为 endRow,导致输出不正确。
这是一个例子:imagelink 注意:目标是仅从每个 Excel 文件中获取一组特定的列(可能有共同的或不强制要求),并将它们放在一个新的 Excel 中。电子邮件列(其中包含用逗号分隔的多个电子邮件)中有一个我认为是电子邮件前的“回车符”,因为如果我手动删除它,问题就会得到解决。但是,目标是不手动打开每个 Excel 并修复它,否则我只需打开每个 Excel 并复制粘贴所需的列即可。在这个例子中,我需要的列是:fieldAA、fieldAG、fieldAL 和 fieldAN。
XLSX2CSV.java(我不是这个文件的创建者,只是根据我的需求进行了应用)
import java.awt.List;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintStream;
import javax.xml.parsers.ParserConfigurationException;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.util.CellAddress;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.util.SAXHelper;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
/**
* A rudimentary XLSX -> CSV processor modeled on the
* POI sample program XLS2CSVmra from the package
* org.apache.poi.hssf.eventusermodel.examples.
* As with the HSSF version, this tries to spot missing
* rows and cells, and output empty entries for them.
* <p>
* Data sheets are read using a SAX parser to keep the
* memory footprint relatively small, so this should be
* able to read enormous workbooks. The styles table and
* the shared-string table must be kept in memory. The
* standard POI styles table class is used, but a custom
* (read-only) class is used for the shared string table
* because the standard POI SharedStringsTable grows very
* quickly with the number of unique strings.
* <p>
* For a more advanced implementation of SAX event parsing
* of XLSX files, see {@link XSSFEventBasedExcelExtractor}
* and {@link XSSFSheetXMLHandler}. Note that for many cases,
* it may be possible to simply use those with a custom
* {@link SheetContentsHandler} and no SAX code needed of
* your own!
*/
public class XLSX2CSV {
/**
* Uses the XSSF Event SAX helpers to do most of the work
* of parsing the Sheet XML, and outputs the contents
* as a (basic) CSV.
*/
private class SheetToCSV implements SheetContentsHandler {
private boolean firstCellOfRow;
private int currentRow = -1;
private int currentCol = -1;
private int maxrows = 60000;
private void outputMissingRows(int number) {
for (int i=0; i<number; i++) {
for (int j=0; j<minColumns; j++) {
output.append(',');
}
output.append('\n');
}
}
@Override
public void startRow(int rowNum) {
// If there were gaps, output the missing rows
outputMissingRows(rowNum-currentRow-1);
// Prepare for this row
firstCellOfRow = true;
currentRow = rowNum;
currentCol = -1;
if (rowNum == maxrows) {
throw new RuntimeException("Force stop at maxrows");
}
}
@Override
public void endRow(int rowNum) {
// Ensure the minimum number of columns
for (int i=currentCol; i<minColumns; i++) {
output.append(',');
}
output.append('\n');
}
@Override
public void cell(String cellReference, String formattedValue,
XSSFComment comment) {
if (firstCellOfRow) {
firstCellOfRow = false;
} else {
output.append(',');
}
// gracefully handle missing CellRef here in a similar way as XSSFCell does
if(cellReference == null) {
cellReference = new CellAddress(currentRow, currentCol).formatAsString();
}
// Did we miss any cells?
int thisCol = (new CellReference(cellReference)).getCol();
int missedCols = thisCol - currentCol - 1;
for (int i=0; i<missedCols; i++) {
output.append(',');
}
currentCol = thisCol;
// Number or string?
try {
//noinspection ResultOfMethodCallIgnored
Double.parseDouble(formattedValue);
output.append(formattedValue);
} catch (NumberFormatException e) {
output.append('"');
output.append(formattedValue);
output.append('"');
}
}
@Override
public void headerFooter(String arg0, boolean arg1, String arg2) {
// TODO Auto-generated method stub
}
}
///////////////////////////////////////
private final OPCPackage xlsxPackage;
/**
* Number of columns to read starting with leftmost
*/
private final int minColumns;
/**
* Destination for data
*/
private final PrintStream output;
/**
* Creates a new XLSX -> CSV converter
*
* @param pkg The XLSX package to process
* @param output The PrintStream to output the CSV to
* @param minColumns The minimum number of columns to output, or -1 for no minimum
*/
public XLSX2CSV(OPCPackage pkg, PrintStream output, int minColumns) {
this.xlsxPackage = pkg;
this.output = output;
this.minColumns = minColumns;
}
/**
* Parses and shows the content of one sheet
* using the specified styles and shared-strings tables.
*
* @param styles The table of styles that may be referenced by cells in the sheet
* @param strings The table of strings that may be referenced by cells in the sheet
* @param sheetInputStream The stream to read the sheet-data from.
* @exception java.io.IOException An IO exception from the parser,
* possibly from a byte stream or character stream
* supplied by the application.
* @throws SAXException if parsing the XML data fails.
*/
public void processSheet(
StylesTable styles,
ReadOnlySharedStringsTable strings,
SheetContentsHandler sheetHandler,
InputStream sheetInputStream) throws IOException, SAXException {
DataFormatter formatter = new DataFormatter();
InputSource sheetSource = new InputSource(sheetInputStream);
try {
XMLReader sheetParser = SAXHelper.newXMLReader();
ContentHandler handler = new XSSFSheetXMLHandler(
styles, null, strings, sheetHandler, formatter, false);
sheetParser.setContentHandler(handler);
sheetParser.parse(sheetSource);
} catch(ParserConfigurationException e) {
throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
}
}
/**
* Initiates the processing of the XLS workbook file to CSV.
*
* @throws IOException If reading the data from the package fails.
* @throws SAXException if parsing the XML data fails.
*/
public void process() throws IOException, OpenXML4JException, SAXException {
ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(this.xlsxPackage);
XSSFReader xssfReader = new XSSFReader(this.xlsxPackage);
StylesTable styles = xssfReader.getStylesTable();
XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
int index = 0;
while (iter.hasNext()) {
try (InputStream stream = iter.next()) {
processSheet(styles, strings, new SheetToCSV(), stream);
}
++index;
}
}
}
我正在寻找不同(且有效的)方法来达成我的目标。
感谢您的时间。
XLSX2CSV
无法正常工作。但是你提供的唯一东西是来自apache poi
的众所周知的代码和一个你怀疑会导致问题的Excel
文件的 图片。你认为谁会从那张图片中重新创建一个真实的Excel
文件呢?即使有人这样做了,那么重新创建的文件是否也会导致问题呢? - Axel Richter