How can I read large worksheets from large Excel files (27MB+) with PHPExcel?

31
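I have large Excel worksheets that I want to read into MySQL using PHPExcel. I am using the recent patch that lets you read worksheets without opening the whole file, so I can read one worksheet at a time.
However, one Excel file is 27MB. I can read the first worksheet successfully because it is small, but the second worksheet is so large that the cron job that started the process at 22:00 had still not finished by 8:00 the next morning. Is there a way to read a worksheet line by line, for example something like this: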
$inputFileType = 'Excel2007';
$inputFileName = 'big_file.xlsx';
$objReader = PHPExcel_IOFactory::createReader($inputFileType);
$worksheetNames = $objReader->listWorksheetNames($inputFileName);

foreach ($worksheetNames as $sheetName) {
    //BELOW IS "WISH CODE": getWorksheetWithRows() is not a real PHPExcel method, and $max_rows is a placeholder
    for ($row = 1; $row <= $max_rows; $row += 100) {
        $dataset = $objReader->getWorksheetWithRows($row, $row+100);
        save_dataset_to_database($dataset);
    }
}

Addendum

@mark, I used the code you posted to create the following example:

function readRowsFromWorksheet() {

    $file_name = htmlentities($_POST['file_name']);
    $file_type = htmlentities($_POST['file_type']);

    echo 'Read rows from worksheet:<br />';
    debug_log('----------start');
    $objReader = PHPExcel_IOFactory::createReader($file_type);
    $chunkSize = 20;
    $chunkFilter = new ChunkReadFilter();
    $objReader->setReadFilter($chunkFilter);

    for ($startRow = 2; $startRow <= 240; $startRow += $chunkSize) {
        $chunkFilter->setRows($startRow, $chunkSize);
        $objPHPExcel = $objReader->load('data/' . $file_name);
        debug_log('reading chunk starting at row '.$startRow);
        $sheetData = $objPHPExcel->getActiveSheet()->toArray(null, true, true, true);
        var_dump($sheetData);
        echo '<hr />';
    }
    debug_log('end');
}

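As the log below shows, it runs fine on a small 8K Excel file, but when I run it on a 3MB Excel file it never gets past the first chunk. Is there any way to optimize this code for performance? Otherwise it does not look fast enough to get chunks out of a large Excel file.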

2011-01-12 11:07:15: ----------start
2011-01-12 11:07:15: reading chunk starting at row 2
2011-01-12 11:07:15: reading chunk starting at row 22
2011-01-12 11:07:15: reading chunk starting at row 42
2011-01-12 11:07:15: reading chunk starting at row 62
2011-01-12 11:07:15: reading chunk starting at row 82
2011-01-12 11:07:15: reading chunk starting at row 102
2011-01-12 11:07:15: reading chunk starting at row 122
2011-01-12 11:07:15: reading chunk starting at row 142
2011-01-12 11:07:15: reading chunk starting at row 162
2011-01-12 11:07:15: reading chunk starting at row 182
2011-01-12 11:07:15: reading chunk starting at row 202
2011-01-12 11:07:15: reading chunk starting at row 222
2011-01-12 11:07:15: end
2011-01-12 11:07:52: ----------start
2011-01-12 11:08:01: reading chunk starting at row 2
(...at 11:18, CPU usage at 93% still running...)

Addendum 2

When I comment out the following:

//$sheetData = $objPHPExcel->getActiveSheet()->toArray(null, true, true, true);
//var_dump($sheetData);

then it parses at an acceptable speed (about 2 rows per second). Is there any way to improve the performance of toArray()?

2011-01-12 11:40:51: ----------start
2011-01-12 11:40:59: reading chunk starting at row 2
2011-01-12 11:41:07: reading chunk starting at row 22
2011-01-12 11:41:14: reading chunk starting at row 42
2011-01-12 11:41:22: reading chunk starting at row 62
2011-01-12 11:41:29: reading chunk starting at row 82
2011-01-12 11:41:37: reading chunk starting at row 102
2011-01-12 11:41:45: reading chunk starting at row 122
2011-01-12 11:41:52: reading chunk starting at row 142
2011-01-12 11:42:00: reading chunk starting at row 162
2011-01-12 11:42:07: reading chunk starting at row 182
2011-01-12 11:42:15: reading chunk starting at row 202
2011-01-12 11:42:22: reading chunk starting at row 222
2011-01-12 11:42:22: end

Addendum 3

The following, for example, seems to work adequately, at least on the 3 MB file:

for ($startRow = 2; $startRow <= 240; $startRow += $chunkSize) {
    echo 'Loading WorkSheet using configurable filter for headings row 1 and for rows ', $startRow, ' to ', ($startRow + $chunkSize - 1), '<br />';
    $chunkFilter->setRows($startRow, $chunkSize);
    $objPHPExcel = $objReader->load('data/' . $file_name);
    debug_log('reading chunk starting at row ' . $startRow);
    foreach ($objPHPExcel->getActiveSheet()->getRowIterator() as $row) {
        $cellIterator = $row->getCellIterator();
        $cellIterator->setIterateOnlyExistingCells(false);
        echo '<tr>';
        foreach ($cellIterator as $cell) {
            if (!is_null($cell)) {
                //$value = $cell->getCalculatedValue();
                $rawValue = $cell->getValue();
                debug_log($rawValue);
            }
        }
    }
}

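The var_dump of $sheetData was only in my code snippet to demonstrate how the chunking works, and is probably not something you would need in "real world" usage. If you do need to dump the worksheet data, the rangeToArray() method that I am currently adding to the Worksheet class is also more efficient than toArray(). - Mark Baker
@Edward Tanguay Hi, did you find any solution or alternative? I am running into the same problem. - Deepanshu Goyal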
2
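An open-source alternative to PHPExcel is Spout. It supports reading and writing large files and needs no more than 10MB of memory. And it is very fast! - Adrien
If the number of rows in the spreadsheet is unknown, how do you determine what the value "240" should be? - juuga
@Edward Tanguay I know it has been a while since you posted this, but could you post the whole code? - Yuri
@Adrien, your suggestion looks great, better than any of the answers here. - Decebal
3 Answers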

12

You can read a worksheet in "chunks" using read filters, though I can make no guarantees about efficiency.

$inputFileType = 'Excel5';
$inputFileName = './sampleData/example2.xls';


/**  Define a Read Filter class implementing PHPExcel_Reader_IReadFilter  */
class chunkReadFilter implements PHPExcel_Reader_IReadFilter
{
    private $_startRow = 0;

    private $_endRow = 0;

    /**  Set the list of rows that we want to read  */
    public function setRows($startRow, $chunkSize) {
        $this->_startRow    = $startRow;
        $this->_endRow        = $startRow + $chunkSize;
    }

    public function readCell($column, $row, $worksheetName = '') {
        //  Only read the heading row, and the rows that are configured in $this->_startRow and $this->_endRow
        if (($row == 1) || ($row >= $this->_startRow && $row < $this->_endRow)) {
            return true;
        }
        return false;
    }
}


echo 'Loading file ',pathinfo($inputFileName,PATHINFO_BASENAME),' using IOFactory with a defined reader type of ',$inputFileType,'<br />';
/**  Create a new Reader of the type defined in $inputFileType  **/

$objReader = PHPExcel_IOFactory::createReader($inputFileType);



echo '<hr />';


/**  Define how many rows we want to read for each "chunk"  **/
$chunkSize = 20;
/**  Create a new Instance of our Read Filter  **/
$chunkFilter = new chunkReadFilter();

/**  Tell the Reader that we want to use the Read Filter that we've Instantiated  **/
$objReader->setReadFilter($chunkFilter);

/**  Loop to read our worksheet in "chunk size" blocks  **/
/**  $startRow is set to 2 initially because we always read the headings in row #1  **/

for ($startRow = 2; $startRow <= 240; $startRow += $chunkSize) {
    echo 'Loading WorkSheet using configurable filter for headings row 1 and for rows ',$startRow,' to ',($startRow+$chunkSize-1),'<br />';
    /**  Tell the Read Filter, the limits on which rows we want to read this iteration  **/
    $chunkFilter->setRows($startRow,$chunkSize);
    /**  Load only the rows that match our filter from $inputFileName to a PHPExcel Object  **/
    $objPHPExcel = $objReader->load($inputFileName);

    //    Do some processing here

    $sheetData = $objPHPExcel->getActiveSheet()->toArray(null,true,true,true);
    var_dump($sheetData);
    echo '<br /><br />';
}
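Note that this read filter always reads the first row of the worksheet, as well as the rows defined by the chunk rule.
When using a read filter, PHPExcel still parses the entire file, but only loads the cells that match the defined read filter, so it only uses the memory required by that number of cells. However, it parses the file multiple times, once for each chunk, so it will be slower. This example reads 20 rows at a time: to read line by line, set $chunkSize to 1.
This can also cause problems if you have formulae that reference cells in different "chunks", because the data simply is not available for cells outside the current "chunk".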
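To make the cost described above concrete, here is a minimal standalone sketch (it does not load PHPExcel). `rowIsInChunk()` mirrors the test inside the filter's readCell() method shown above, and `chunkStartRows()` reproduces the loop bounds to show how many full parses of the file a given chunk size implies. Both function names are hypothetical and exist only for this illustration.

```php
<?php
// Standalone illustration; no PHPExcel classes are needed to run it.

/**
 * Same condition as readCell() above: keep the heading row (row 1)
 * plus every row in the half-open range [startRow, startRow + chunkSize).
 */
function rowIsInChunk(int $row, int $startRow, int $chunkSize): bool
{
    return $row === 1 || ($row >= $startRow && $row < $startRow + $chunkSize);
}

/**
 * Start row of every chunk between $firstDataRow and $lastRow.
 * Each entry corresponds to one call to $objReader->load(),
 * which means one full parse of the file.
 */
function chunkStartRows(int $firstDataRow, int $lastRow, int $chunkSize): array
{
    if ($chunkSize < 1) {
        throw new InvalidArgumentException('Chunk size must be at least 1');
    }
    $starts = [];
    for ($startRow = $firstDataRow; $startRow <= $lastRow; $startRow += $chunkSize) {
        $starts[] = $startRow;
    }
    return $starts;
}

// Same bounds as the loop above: data rows 2..240 in chunks of 20.
$starts = chunkStartRows(2, 240, 20);
echo count($starts), " full parses of the file\n";
```

With a chunk size of 1 the same 239 data rows would need 239 full parses, which is why very small chunks trade memory for a large amount of repeated parsing time.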

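I am trying your code in a test file, but it tells me "Class 'ChunkReadFilter' not found". If I take out "implements PHPExcel_Reader_IReadFilter", it finds the class and tells me it "must implement interface PHPExcel_Reader_IReadFilter". I put "require_once 'PHPExcelClasses/PHPExcel/Reader/IReadFilter.php'" and "require_once 'PHPExcelClasses/PHPExcel/Reader/IReader.php'" at the top of my file, but it still cannot find the class when I implement this interface. Is there another file I need to include? - Edward Tanguay
I tried the code posted above in my test. It works well on a small file (8K), but on a 3 MB file it does not seem to get past the first chunk. - Edward Tanguay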
1
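I reduced the chunk size to 1 row, but even so, on the largest worksheet in the 27MB Excel file I get an "out of memory (allocated 1538523136)" error after 50 seconds. I have the memory limit set almost as high as it goes: ini_set("memory_limit", "3700M");. I am using the last code example above (Addendum 3), so I know it is not calculating cells, just giving me the raw values. Is there some other way to keep it from using so much memory, so that it can at least read one row at a time? - Edward Tanguay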
1
Doing "$objPHPExcel->disconnectWorksheets(); unset($objPHPExcel);" after the "foreach" loop also helps free any memory leaked by the PHPExcel object in each "chunk" iteration. - Mark Baker
1
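Interesting: both of these large files (27MB and 84MB) are .xlsx files. I saved the 27MB file in Excel2004 (Mac) format and then read it without trouble using the Excel5 reader. I will try the same with the 84MB file. - Edward Tanguay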

5

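Currently, the best option for reading .xlsx, .csv, and .ods files is spreadsheet-reader (https://github.com/nuovo/spreadsheet-reader), because it can read a file without loading all of it into memory. For the .xls extension it has some limitations, because it uses PHPExcel to read those files.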
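The property that matters here is streaming: rows are pulled one at a time instead of building the whole sheet in memory. The following is a minimal sketch of that idea using only PHP's built-in fgetcsv() on a CSV file. It is not spreadsheet-reader's API; the `streamCsvRows()` name and the sample data are assumptions made for illustration.

```php
<?php
/**
 * Yields one row at a time from a CSV file, so only the current
 * row is held in memory regardless of the file size.
 */
function streamCsvRows(string $path): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
            if ($row === [null]) {
                continue; // fgetcsv() reports a blank line as [null]
            }
            yield $row;
        }
    } finally {
        fclose($handle);
    }
}

// A small temporary file stands in for a large export.
$path = tempnam(sys_get_temp_dir(), 'rows');
file_put_contents($path, "name,address\nAda,London\n\nLinus,Helsinki\n");

$count = 0;
foreach (streamCsvRows($path) as $row) {
    $count++; // a real import would write $row to the database here
}
unlink($path);
echo $count, " rows read\n";
```

Because the generator only advances when the foreach loop asks for the next row, the import can process a file of any length with roughly constant memory.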


If you run into this issue with nuovo, https://github.com/nuovo/spreadsheet-reader/issues/59, you may want to head over to https://github.com/box. At the time of writing, the box tool did not exist yet. - Tebe

1
This is the ChunkReadFilter.php file:
<?php
Class ChunkReadFilter implements PHPExcel_Reader_IReadFilter {

    private $_startRow = 0;
    private $_endRow = 0;

    /**  Set the list of rows that we want to read  */
    public function setRows($startRow, $chunkSize) {
        $this->_startRow = $startRow;
        $this->_endRow = $startRow + $chunkSize;
    }

    public function readCell($column, $row, $worksheetName = '') {

        //  Only read the heading row, and the rows that are configured in $this->_startRow and $this->_endRow 
        if (($row == 1) || ($row >= $this->_startRow && $row < $this->_endRow)) {

            return true;
        }
        return false;
    }

}
?>

This is the index.php file, with a rough but basic implementation at the end.

<?php

require_once './Classes/PHPExcel/IOFactory.php';
require_once 'ChunkReadFilter.php';

class Excelreader {

    /**
     * This function is used to read data from excel file in chunks and insert into database
     * @param string $filePath
     * @param integer $chunkSize
     */
    public function readFileAndDumpInDB($filePath, $chunkSize) {
        echo("Loading file " . $filePath . " ....." . PHP_EOL);
        /**  Create a new Reader of the type that has been identified  * */
        $objReader = PHPExcel_IOFactory::createReader(PHPExcel_IOFactory::identify($filePath));

        $spreadsheetInfo = $objReader->listWorksheetInfo($filePath);

        /**  Create a new Instance of our Read Filter  * */
        $chunkFilter = new ChunkReadFilter();

        /**  Tell the Reader that we want to use the Read Filter that we've Instantiated  * */
        $objReader->setReadFilter($chunkFilter);
        $objReader->setReadDataOnly(true);
        //$objReader->setLoadSheetsOnly("Sheet1");
        //get header column name
        $chunkFilter->setRows(0, 1);
        echo("Reading file " . $filePath . PHP_EOL . "<br>");
        $totalRows = $spreadsheetInfo[0]['totalRows'];
        echo("Total rows in file " . $totalRows . " " . PHP_EOL . "<br>");

        /**  Loop to read our worksheet in "chunk size" blocks  * */
        /**  $startRow is set to 1 initially because we always read the headings in row #1  * */
        for ($startRow = 1; $startRow <= $totalRows; $startRow += $chunkSize) {
            echo("Loading WorkSheet for rows " . $startRow . " to " . ($startRow + $chunkSize - 1) . PHP_EOL . "<br>");
            $i = 0;
            /**  Tell the Read Filter, the limits on which rows we want to read this iteration  * */
            $chunkFilter->setRows($startRow, $chunkSize);
            /**  Load only the rows that match our filter from $inputFileName to a PHPExcel Object  * */
            $objPHPExcel = $objReader->load($filePath);
            $sheetData = $objPHPExcel->getActiveSheet()->toArray(null, true, true, false);

            $startIndex = ($startRow == 1) ? $startRow : $startRow - 1;
            //dumping in database
            if (!empty($sheetData) && $startRow < $totalRows) {
                /**
                 * $this->dumpInDb(array_slice($sheetData, $startIndex, $chunkSize));
                 */

                echo "<table border='1'>";
                foreach ($sheetData as $key => $value) {
                    $i++;
                    if ($value[0] != null) {
                        echo "<tr><td>id:$i</td><td>{$value[0]} </td><td>{$value[1]} </td><td>{$value[2]} </td><td>{$value[3]} </td></tr>";
                    }
                }
                echo "</table><br/><br/>";
            }
            $objPHPExcel->disconnectWorksheets();
            unset($objPHPExcel, $sheetData);
        }
        echo("File " . $filePath . " has been uploaded successfully in database" . PHP_EOL . "<br>");
    }

    /**
     * Insert data into database table 
     * @param Array $sheetData
     * @return boolean
     * @throws Exception
     * THE METHOD FOR THE DATABASE IS NOT WORKING, JUST THE PUBLIC METHOD..
     */
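    protected function dumpInDb($sheetData) {

        // DbAdapter is not defined in this answer; it is assumed to return a mysqli connection
        $con = DbAdapter::getDBConnection();
        $values = array();

        // Start at index 1, skipping the first (heading) row
        for ($i = 1; $i < count($sheetData); $i++) {
            $values[] = "('" . mysqli_real_escape_string($con, (string) $sheetData[$i][0]) . "',"
                    . "'" . mysqli_real_escape_string($con, (string) $sheetData[$i][1]) . "')";
        }
        if (empty($values)) {
            return false;
        }

        // Join the row tuples with commas and update both columns on a duplicate key
        $query = "INSERT INTO employe(name,address) VALUES " . implode(",", $values)
                . " ON DUPLICATE KEY UPDATE name=VALUES(name), address=VALUES(address)";
        if (mysqli_query($con, $query)) {
            mysqli_close($con);
            return true;
        }
        // Read the error message before closing the connection
        $error = mysqli_error($con);
        mysqli_close($con);
        throw new Exception($error);
    }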

    /**
     * This function returns list of files corresponding to given directory path
     * @param String $dataFolderPath
     * @return Array list of file
     */
    protected function getFileList($dataFolderPath) {
        if (!is_dir($dataFolderPath)) {
            throw new Exception("Directory " . $dataFolderPath . " is not exist");
        }
        $root = scandir($dataFolderPath);
        $fileList = array();
        foreach ($root as $value) {
            if ($value === '.' || $value === '..') {
                continue;
            }
            if (is_file("$dataFolderPath/$value")) {
                $fileList[] = "$dataFolderPath/$value";
                continue;
            }
        }
        return $fileList;
    }

}

$inputFileName = './prueba_para_batch.xls';
$excelReader = new Excelreader();
$excelReader->readFileAndDumpInDB($inputFileName, 500);
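
The dumpInDb() method above builds its SQL by string concatenation. As an alternative, here is a minimal sketch that builds one multi-row statement with "?" placeholders for each chunk, so the values can be bound by a prepared statement instead of being escaped by hand. The `buildBatchUpsert()` helper is hypothetical; the table and column names are taken from the answer's query, and the sample rows are made up.

```php
<?php
/**
 * Builds a multi-row INSERT ... ON DUPLICATE KEY UPDATE statement.
 * Returns [sql, params], where params is the flat list of values to bind.
 * Table and column names must come from trusted code, never from user input.
 */
function buildBatchUpsert(string $table, array $columns, array $rows): array
{
    if ($rows === [] || $columns === []) {
        throw new InvalidArgumentException('Need at least one column and one row');
    }
    $width = count($columns);
    $rowPlaceholder = '(' . implode(', ', array_fill(0, $width, '?')) . ')';

    $params = [];
    foreach ($rows as $row) {
        // Trim or pad each row so it supplies exactly one value per column
        $cells = array_pad(array_slice(array_values($row), 0, $width), $width, null);
        foreach ($cells as $cell) {
            $params[] = $cell;
        }
    }

    $updates = array_map(fn ($column) => "$column = VALUES($column)", $columns);
    $sql = "INSERT INTO $table (" . implode(', ', $columns) . ') VALUES '
         . implode(', ', array_fill(0, count($rows), $rowPlaceholder))
         . ' ON DUPLICATE KEY UPDATE ' . implode(', ', $updates);

    return [$sql, $params];
}

[$sql, $params] = buildBatchUpsert('employe', ['name', 'address'], [
    ['Ada', 'London'],
    ['Linus', 'Helsinki'],
]);
echo $sql, "\n";
```

With a PDO connection in `$pdo` (an assumption, since the answer does not show its DbAdapter class), the pair could be executed as `$pdo->prepare($sql)->execute($params);`, once per chunk returned by the reader.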

Content provided by Stack Overflow.