How to convert an HTML table to CSV?

Question

html csv html-table

74

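How can I convert the contents of an HTML table (<table>) to CSV format? Is there a library or a Linux program that does this? It is similar to copying a table in Internet Explorer and pasting it into Excel.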

- asdfasdf

Possible duplicate of "Export to CSV using jQuery and HTML". - Dave Jarvis

23 Answers

- enharmonic · Answer 1

Here is how I did it, using only tr and sed:

< table.txt tr -d '\n' | 
sed -e 's/<tr[^>]*>/\n/g' -e 's/<[^>]*t[dh]>/,/g' -e 's/<[^>]*>//g'

Explanation

tr -d '\n' deletes the newlines
's/<tr[^>]*>/\n/g' turns each tr tag into a newline, splitting the data into table rows
's/<[^>]*t[dh]>/,/g' turns closing td/th tags into commas
's/<[^>]*>//g' removes all remaining HTML tags

Sample input
(from an Outlook email that tried to render an HTML table with MsoNormal):

<table class=3D"MsoNormalTable" border=3D"0" cellspacing=3D"0" cellpadding=3D"0" width=3D"420" style=3D"width:315.0pt;border-collapse:collapse">
<tbody>
<tr style=3D"height:15.0pt">
<td width=3D"107" nowrap=3D"" style=3D"width:80.0pt;padding:0in 0in 0in 0in;height:15.0pt">
</td>
<td width=3D"107" nowrap=3D"" valign=3D"bottom" style=3D"width:80.0pt;padding:0in 0in 0in 0in;height:15.0pt">
</td>
<td width=3D"64" nowrap=3D"" valign=3D"bottom" style=3D"width:48.0pt;padding:0in 0in 0in 0in;height:15.0pt">
</td>
<td width=3D"79" nowrap=3D"" valign=3D"bottom" style=3D"width:59.0pt;padding:0in 0in 0in 0in;height:15.0pt">
</td>
<td width=3D"64" nowrap=3D"" valign=3D"bottom" style=3D"width:48.0pt;padding:0in 0in 0in 0in;height:15.0pt">
</td>
</tr>
<tr style=3D"height:6.75pt">
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:6.75pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:6.75pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:6.75pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:6.75pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:6.75pt"></td>
</tr>
<tr style=3D"height:15.0pt">
<td nowrap=3D"" valign=3D"bottom" style=3D"border:solid windowtext 1.0pt;padding:0in 0in 0in 0in;height:15.0pt">
<p class=3D"MsoNormal" align=3D"center" style=3D"text-align:center"><b><span style=3D"color:black">ID</span></b><b><span style=3D"color:black"><o:p></o:p></span></b></p>
</td>
<td nowrap=3D"" valign=3D"bottom" style=3D"border:solid windowtext 1.0pt;border-left:none;padding:0in 0in 0in 0in;height:15.0pt">
<p class=3D"MsoNormal" align=3D"center" style=3D"text-align:center"><b><span style=3D"color:black">Price<o:p></o:p></span></b></p>
</td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
</tr>
<tr style=3D"height:15.0pt">
<td nowrap=3D"" valign=3D"bottom" style=3D"border:solid windowtext 1.0pt;border-top:none;padding:0in 0in 0in 0in;height:15.0pt">
<p class=3D"MsoNormal" align=3D"center" style=3D"text-align:center"><span style=3D"color:black">064159Q</span><span style=3D"color:black"><o:p></o:p></span></p>
</td>
<td nowrap=3D"" valign=3D"bottom" style=3D"border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in 0in 0in 0in;height:15.0pt">
<p class=3D"MsoNormal" align=3D"center" style=3D"text-align:center"><span style=3D"color:black">121.85<o:p></o:p></span></p>
</td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
</tr>
<tr style=3D"height:15.0pt">
<td nowrap=3D"" valign=3D"bottom" style=3D"border:solid windowtext 1.0pt;border-top:none;padding:0in 0in 0in 0in;height:15.0pt">
<p class=3D"MsoNormal" align=3D"center" style=3D"text-align:center"><span style=3D"color:black">2420128</span><span style=3D"color:black"><o:p></o:p></span></p>
</td>
<td nowrap=3D"" valign=3D"bottom" style=3D"border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in 0in 0in 0in;height:15.0pt">
<p class=3D"MsoNormal" align=3D"center" style=3D"text-align:center"><span style=3D"color:black">10.00<o:p></o:p></span></p>
</td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
</tr>
</tbody>
</table>

Sample output


,,,,,
,,,,,
ID,Price,,,,
064159Q,121.85,,,,
2420128,10.00,,,,

See the question on non-greedy regex matching in sed for a discussion of this approach.
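
For readers without GNU sed, the same four substitutions can be sketched in Python. This is only a rough equivalent of the pipeline above, not part of the original answer: the function name table_html_to_csv and the sample string are made up for illustration, and like the sed version it does no CSV quoting.

```python
import re


def table_html_to_csv(html: str) -> str:
    """Mirror the tr/sed pipeline: regex substitutions only, no CSV quoting."""
    text = html.replace("\n", "")               # tr -d '\n'
    text = re.sub(r"<tr[^>]*>", "\n", text)     # each <tr ...> starts a new line
    text = re.sub(r"<[^>]*t[dh]>", ",", text)   # closing td/th tags become commas
    text = re.sub(r"<[^>]*>", "", text)         # drop every remaining tag
    return text


# Hypothetical sample input; every cell tag carries an attribute,
# as in the Outlook example above.
sample = (
    '<table border="0">\n'
    '<tr class="head"><th scope="col">ID</th><th scope="col">Price</th></tr>\n'
    '<tr class="row"><td align="left">064159Q</td><td align="left">121.85</td></tr>\n'
    "</table>"
)
print(table_html_to_csv(sample))
```

One caveat that applies to both versions: the pattern <[^>]*t[dh]> also matches a bare <td> or <th> that has no attributes, so such tables get an extra leading comma per cell. The Outlook sample avoids this only because every opening cell tag carries attributes.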

- Benjamin W. · Answer 2

Here is an approach using pup and jq.

Assuming infile.html contains a <table> element, we can use pup to select its rows and convert them to JSON:

pup 'table tr json{}' --file infile.html

This returns an array of objects, one per row, each with an array named children. For example, with one header row, two data rows, and three columns:

[
 {
  "children": [
   { "tag": "th", "text": "ID" },
   { "tag": "th", "text": "First name" },
   { "tag": "th", "text": "Last name" }
  ],
  "tag": "tr"
 },
 {
  "children": [
   { "tag": "td", "text": "123" },
   { "tag": "td", "text": "Anna" },
   { "tag": "td", "text": "Alphabet" }
  ],
  "tag": "tr"
 },
 {
  "children": [
   { "tag": "td", "text": "456" },
   { "tag": "td", "text": "Brandon" },
   { "tag": "td", "text": "Betazoid" }
  ],
  "tag": "tr"
 }
]

To convert this to CSV, we can use jq (see snippet):

pup 'table tr json{}' --file infile.html \
    | jq --raw-output 'map(.children | map(.text))[] | @csv'

The result is

"ID","First name","Last name"
"123","Anna","Alphabet"
"456","Brandon","Betazoid"

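If jq is not available, the JSON-to-CSV step could be sketched in Python instead. This is an illustration under assumptions, not part of the original answer: rows_json_to_csv is a hypothetical helper, the input is assumed to have the same shape as the pup output shown above, and a cell without a "text" key is treated as empty.

```python
import csv
import io
import json


def rows_json_to_csv(rows_json: str) -> str:
    """Turn row JSON (a list of objects with "children") into CSV text."""
    rows = json.loads(rows_json)
    out = io.StringIO()
    # QUOTE_ALL quotes every field, similar to what @csv does for strings.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in rows:
        # Assumption: a cell with no "text" key is written as an empty field.
        writer.writerow([cell.get("text", "") for cell in row.get("children", [])])
    return out.getvalue()


# Hypothetical input in the same shape as the pup output above.
example = json.dumps([
    {"tag": "tr", "children": [{"tag": "th", "text": "ID"},
                               {"tag": "th", "text": "First name"}]},
    {"tag": "tr", "children": [{"tag": "td", "text": "123"},
                               {"tag": "td", "text": "Anna"}]},
])
print(rows_json_to_csv(example), end="")
```

In practice the JSON would come from the pup command rather than from json.dumps, for example by reading sys.stdin.
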
- Tata · Answer 3

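This is a very old thread, but others like me may still stumble upon it. I made some improvements to audiodude's script so that it reads the HTML from a file instead of having it embedded in the code, and I added another parameter that controls whether header rows are printed.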

The script should be run like this:

ruby <script_name> <file_name> [<print_headers>]

Here is the code:

require 'nokogiri'

# Any second command-line argument turns on printing of header (th) cells.
print_header_lines = ARGV[1]

File.open(ARGV[0]) do |f|
  # Nokogiri can parse directly from the open file handle.
  doc = Nokogiri::HTML(f)

  doc.xpath('//table//tr').each do |row|
    if print_header_lines
      row.xpath('th').each do |cell|
        # Flatten newlines, double embedded quotes (the usual CSV convention), collapse whitespace runs.
        print '"', cell.text.gsub("\n", ' ').gsub('"', '""').gsub(/(\s){2,}/m, '\1'), "\", "
      end
    end
    row.xpath('td').each do |cell|
      print '"', cell.text.gsub("\n", ' ').gsub('"', '""').gsub(/(\s){2,}/m, '\1'), "\", "
    end
    print "\n"
  end
end

- Gene T · Answer 4

Here are a few options:

http://groups.google.com/group/ruby-talk-google/browse_thread/thread/cfae0aa4b14e5560?hl=nn

How to scrape data from Wikipedia using Google Spreadsheets

How can I scrape an HTML table to CSV?

https://addons.mozilla.org/en-US/firefox/addon/1852

- Sergey Zakharov · Answer 5

You can use LibreOffice or sed to convert HTML to CSV.

Using LibreOffice:

mkdir in out
cp -v *.html in
rename 's/([^.]+).html/$1.xls/g' in/*.html
## 59 is ;
## 44 is ,
libreoffice --convert-to 'csv:Text - txt - csv (StarCalc):59,,0,3' in/*.xls --outdir out

See https://developer-core.blogspot.com/2022/03/preobrazovanie-html-v-csv-ili-obrabotka-html-tablic-v-bash.html

Or use sed:

mkdir out
cp -v *.html out
sed -i ':a;N;$!ba
     s/<html.\+<table[^>]\+>//Ig
     s#\s*</td>\s*</tr>\s*<tr>\s*<td>\s*#\n#Ig
     s#\s*</td>\s*<td>\s*#;#Ig
     s/<[^>]\+>//g;s/\s\{2,\}//g' out/*.html
rename 's/([^.]+).html/$1.csv/g' out/*.html

See also https://developer-core.blogspot.com/2022/03/preobrazovanie-html-v-csv-ili-obrabotka-html-tablic-v-bash-v2.html. An example is available in an online Bash sandbox: https://onlinegdb.com/1oivp0uGm

- Diego Rivera · Answer 6

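This is an updated version of Yuvai's answer that correctly handles fields requiring quoting (i.e. fields whose data contains commas or double quotes, or spans multiple lines).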

#!/usr/bin/env python3
from html.parser import HTMLParser
import sys
import re

class HTMLTableParser(HTMLParser):
    def __init__(self, row_delim="\n", cell_delim=","):
        HTMLParser.__init__(self)
        self.despace_re = re.compile(r"\s+")
        self.data_interrupt = False
        self.first_row = True
        self.first_cell = True
        self.in_cell = False
        self.row_delim = row_delim
        self.cell_delim = cell_delim
        self.quote_buffer = False
        self.buffer = None

    def handle_starttag(self, tag, attrs):
        self.data_interrupt = True
        if tag == "table":
            self.first_row = True
            self.first_cell = True
        elif tag == "tr":
            if not self.first_row:
                sys.stdout.write(self.row_delim)
            self.first_row = False
            self.first_cell = True
            self.data_interrupt = False
        elif tag == "td" or tag == "th":
            if not self.first_cell:
                sys.stdout.write(self.cell_delim)
            self.first_cell = False
            self.data_interrupt = False
            self.in_cell = True
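        elif tag == "br" and self.in_cell:
            # A line break inside a cell forces quoting; start the buffer if it is still empty
            self.quote_buffer = True
            if self.buffer is None:
                self.buffer = ""
            self.buffer += self.row_delim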

    def handle_endtag(self, tag):
        self.data_interrupt = True
        if tag == "td" or tag == "th":
            self.in_cell = False
        if self.buffer != None:
            # Quote if needed...
            if self.quote_buffer or self.cell_delim in self.buffer or "\"" in self.buffer:
                # Need to quote! First, replace all double-quotes with quad-quotes
                self.buffer = self.buffer.replace("\"", "\"\"")
                self.buffer = "\"{0}\"".format(self.buffer)
            sys.stdout.write(self.buffer)
            self.quote_buffer = False
            self.buffer = None

    def handle_data(self, data):
        if self.in_cell:
            #if self.data_interrupt:
            #   sys.stdout.write(" ")
            if self.buffer == None:
                self.buffer = ""
            self.buffer += self.despace_re.sub(" ", data).strip()
            self.data_interrupt = False

parser = HTMLTableParser() 
parser.feed(sys.stdin.read())

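One possible improvement to this script would be to add support for specifying a different row delimiter (or automatically working out the correct one for the platform) and a different column delimiter.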
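
One way that improvement could look is sketched below. This is a rewrite under assumptions, not the script above: the class name TableToCsv and its parameters are hypothetical, and quoting is delegated to Python's csv.writer, whose delimiter and lineterminator options supply the configurable column and row delimiters.

```python
import csv
import io
import re
from html.parser import HTMLParser


class TableToCsv(HTMLParser):
    """Collect cell text per row and let csv.writer decide what needs quoting."""

    def __init__(self, out, cell_delim=",", row_delim="\n"):
        super().__init__()
        self.writer = csv.writer(out, delimiter=cell_delim, lineterminator=row_delim)
        self.row = None    # finished cells of the current row, or None outside a row
        self.cell = None   # text fragments of the current cell, or None outside a cell

    def _end_cell(self):
        if self.cell is not None:
            # Only <br> contributes "\n"; tidy the spacing within each line.
            lines = "".join(self.cell).split("\n")
            self.row.append("\n".join(" ".join(line.split()) for line in lines))
            self.cell = None

    def _end_row(self):
        self._end_cell()
        if self.row is not None:
            self.writer.writerow(self.row)
            self.row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()            # tolerate a missing </tr>
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self._end_cell()           # tolerate a missing </td> or </th>
            self.cell = []
        elif tag == "br" and self.cell is not None:
            self.cell.append("\n")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        if self.cell is not None:
            # Whitespace in the source (including newlines) becomes a single space.
            self.cell.append(re.sub(r"\s+", " ", data))

    def close(self):
        super().close()
        self._end_row()                # flush a final row left open by the markup


# Hypothetical sample input.
html = ('<table><tr><th>Name</th><th>Note</th></tr>'
        '<tr><td>Anna, A.</td><td>said "hi"<br>twice</td></tr></table>')
out = io.StringIO()
parser = TableToCsv(out, cell_delim=",", row_delim="\n")
parser.feed(html)
parser.close()
print(out.getvalue(), end="")
```

Passing cell_delim=";" or row_delim="\r\n" changes the separators without touching the parsing logic. Nested tables are not handled in this sketch.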

- Josh · Answer 7

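This is based on atomicules' answer, but it is more concise and also handles header cells (th) as well as data cells (td). I also added the strip method to remove extra whitespace.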

CSV.open("output.csv", 'w') do |csv|
  doc.xpath('//table//tr').each do |row|
    csv << row.xpath('th|td').map {|cell| cell.text.strip}
  end
end

Putting the code inside the CSV block ensures that the file is closed properly.

If you only need the text and do not need to write it to a file, you can use the following:

doc.xpath('//table//tr').inject('') do |result, row|
  result << row.xpath('th|td').map {|cell| cell.text.strip}.to_csv
end

- Happy Gilmore · Answer 8

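OpenOffice.org can view HTML tables. Just use the Open command on the HTML file, or select and copy the table in your browser and then choose Paste Special in OpenOffice.org. It will ask you for the file type, one of which should be HTML. Choose that, and you're done!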

- draegtun · Answer 9

Here is an example using pQuery and Spreadsheet::WriteExcel:

use strict;
use warnings;

use Spreadsheet::WriteExcel;
use pQuery;

my $workbook = Spreadsheet::WriteExcel->new( 'data.xls' );
my $sheet    = $workbook->add_worksheet;
my $row = 0;

pQuery( 'http://www.blahblah.site' )->find( 'tr' )->each( sub{
    my $col = 0;
    pQuery( $_ )->find( 'td' )->each( sub{
        $sheet->write( $row, $col++, $_->innerHTML );
    });
    $row++;
});

$workbook->close;

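This example simply extracts all the tr tags it finds into an Excel file. You could easily adapt it to pick out a specific table, or even to start a new Excel file for each table tag.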

Further things to consider:

You may want to pick up the th tags to create the Excel header(s).
You may run into problems with rowspan and colspan.

To check whether rowspan or colspan is in use, you can:

pQuery( $data )->find( 'td' )->each( sub{ 
    my $number_of_cols_spanned = $_->getAttribute( 'colspan' );
});
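
What to do with a spanned cell is a design decision the answer leaves open. As one possible policy, the Python sketch below puts the text in the first column and pads the remaining spanned columns with empty fields so that every row keeps the same width. The class name ColspanRows and the padding policy are assumptions for illustration, and rowspan is not handled.

```python
from html.parser import HTMLParser


class ColspanRows(HTMLParser):
    """Collect table rows, padding each cell to the width given by its colspan."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.span = 0    # colspan of the cell being read; 0 means "not in a cell"
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            value = dict(attrs).get("colspan") or "1"
            # Fall back to 1 for missing, non-numeric, or zero values.
            self.span = int(value) if value.isdigit() and int(value) > 0 else 1
            self.text = []

    def handle_data(self, data):
        if self.span:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.span:
            # The text goes in the first column; the spanned columns stay empty.
            cell = "".join(self.text).strip()
            self.rows[-1].extend([cell] + [""] * (self.span - 1))
            self.span = 0


# Hypothetical sample input.
parser = ColspanRows()
parser.feed('<table><tr><th colspan="2">Name</th><th>Age</th></tr>'
            '<tr><td>Anna</td><td>Alphabet</td><td>30</td></tr></table>')
print(parser.rows)
```

The collected rows can then be passed to csv.writer or to a spreadsheet library.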

- Osogtustack · Answer 10

Depending on your needs, you can simply do the following:

var table ='';var selector='#customers';
document.querySelectorAll(`${selector} tr th`).forEach(h=>table+=`${h.innerText.trim()};`);table=table.trim();table+='\r\n';
document.querySelectorAll(`${selector} tr`).forEach(tr=>{tr.querySelectorAll('td').forEach(td=>table+=`${td.innerText.trim()};`);table+='\r\n';});

Change "selector" so that it targets your table; after running this, "table" will hold your CSV content.

In addition, you can run:

var a = document.createElement('a');a.href=`data:text/csv;base64,${btoa(table)}`;a.download="table.csv";a.click();

to download the contents of "table".