How to parse an HTML table with PowerShell Core 7?

Question


I have the following code:

    $html = New-Object -ComObject "HTMLFile"
    $source = Get-Content -Path $FilePath -Raw
    try
    {
        $html.IHTMLDocument2_write($source) 2> $null
    }
    catch
    {
        $encoded = [Text.Encoding]::Unicode.GetBytes($source)
        $html.write($encoded)
    }
    $t = $html.getElementsByTagName("table") | Where-Object {
        $cells = $_.tBodies[0].rows[0].cells
        $cells[0].innerText -eq "Name" -and
        $cells[1].innerText -eq "Description" -and
        $cells[2].innerText -eq "Default Value" -and
        $cells[3].innerText -eq "Release"
    }

This code works fine in Windows PowerShell 5.1, but in PowerShell Core 7 $_.tBodies[0].rows returns null.

So how do I access the rows of an HTML table in PS 7?

- mark

See also: Extracting an HTML table to CSV - iRon

2 Answers


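I used the accepted answer to solve my problem. I installed PowerHTML. I wanted to extract the data table from https://www.dicomlibrary.com/dicom/dicom-tags/ and convert it.

From this:

    <tr><td>(0002,0000)</td><td>UL</td><td>File Meta Information Group Length</td><td></td></tr>

To this:

    {"00020000", "ULFile Meta Information Group Length"}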

    $page = Invoke-WebRequest https://www.dicomlibrary.com/dicom/dicom-tags/
    $htmlDom = ConvertFrom-Html $page
    $table = $htmlDom.SelectNodes('//table') | Where-Object {
      $headerRow = $_.Element('tr') # or @($_.Elements('tr'))[0]
      # Filter by column names
      $headerRow.ChildNodes[0].InnerText -eq 'Tag'
    }

    foreach ($row in $table.SelectNodes('tr')) {
      # Tag column: "(0002,0000)" becomes {"00020000",
      $a = $row.SelectSingleNode('td[1]').InnerText.Trim() -replace "`n|`r|\s+", " " -replace "\(", '{"' -replace ",", "" -replace "\)", '",'
      # Description column, with whitespace runs collapsed to single spaces
      $c = $row.SelectSingleNode('td[3]').InnerText.Trim() -replace "`n|`r|\s+", " "
      # VR column, with all whitespace removed, prepended to the description
      $b = $row.SelectSingleNode('td[2]').InnerText.Trim() -replace "`n|`r|\s+", ""
      $c = '"' + $b + $c + '"},'

      # Use a separate variable so the loop variable $row is not overwritten
      $entry = New-Object -TypeName psobject
      $entry | Add-Member -MemberType NoteProperty -Name Tag -Value $a
      $entry | Add-Member -MemberType NoteProperty -Name Value -Value $c

      [array]$data += $entry
    }

    $data | Out-File c:\scripts\dd.txt

- Mike O


- mklement0 · Accepted Answer

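PowerShell (Core), as of version 7.4, has no built-in HTML parser, and that may never change.

You must rely on a third-party solution such as the PSParseHTML module, which wraps the HTML Agility Pack^[1] and the AngleSharp libraries. The former is used by default; the latter requires opt-in via -Engine AngleSharp. As for their respective DOMs (object models):

The HTML Agility Pack, used by default, provides a DOM that differs from the Internet Explorer-based one available in Windows PowerShell; it resembles the XML DOM provided by the standard System.Xml.XmlDocument type ([xml])^[2]. See the documentation and the sample code below.
AngleSharp, which requires opt-in via -Engine AngleSharp, is built on the official W3C specifications and therefore provides the same HTML DOM as web browsers do. Notably, this means that its .QuerySelector() and .QuerySelectorAll() methods can be used with the usual CSS selectors. See this answer for a usage example.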
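
To illustrate the AngleSharp option, here is a minimal sketch that parses an inline HTML string and picks out table cells with a CSS selector. It assumes the PSParseHTML module is already installed and that ConvertFrom-Html accepts pipeline input with -Engine AngleSharp in the same way it does with the default engine. The sample markup and the variable names are made up for illustration, and .TextContent is the standard DOM property that AngleSharp is expected to expose.

```powershell
# Assumption: PSParseHTML is installed (Install-Module -Scope CurrentUser PSParseHTML).
# Hypothetical sample markup, for illustration only.
$sampleHtml = @'
<table>
  <tr><th>Name</th><th>Mode</th></tr>
  <tr><td>alice</td><td>d----</td></tr>
  <tr><td>bob</td><td>-a---</td></tr>
</table>
'@

# Opt into the AngleSharp engine to get a browser-style DOM.
$dom = $sampleHtml | ConvertFrom-Html -Engine AngleSharp

# CSS selector: the first <td> cell of every table row (header <th> cells are skipped).
$names = $dom.QuerySelectorAll('table tr td:first-child') |
  ForEach-Object { $_.TextContent.Trim() }

$names
```

The descendant selector is used deliberately: a standards-based parser inserts an implicit tbody element, so a strict child selector such as table > tr would not match.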

Self-contained sample code, using the HTML Agility Pack engine:

    # Install the module on demand
    if (-not (Get-Module -ErrorAction Ignore -ListAvailable PSParseHTML)) {
      Write-Verbose "Installing PSParseHTML module for the current user..."
      Install-Module -Scope CurrentUser PSParseHTML -ErrorAction Stop
    }

    # Create a sample HTML file with a table with 2 columns.
    Get-Item $HOME | Select-Object Name, Mode | ConvertTo-Html > sample.html

    # Parse the HTML file into an HTML DOM.
    $htmlDom = Get-Content -Raw sample.html | ConvertFrom-Html

    # Find a specific table by its column names, using an XPath
    # query to iterate over all tables.
    $table = $htmlDom.SelectNodes('//table') | Where-Object {
      $headerRow = $_.Element('tr') # or @($_.Elements('tr'))[0]
      # Filter by column names
      $headerRow.ChildNodes[0].InnerText -eq 'Name' -and
        $headerRow.ChildNodes[1].InnerText -eq 'Mode'
    }

    # Print the table's HTML text.
    $table.InnerHtml

    # Extract the first data row's first column value.
    # Note: @(...) is required around .Elements() for indexing to work.
    @($table.Elements('tr'))[1].ChildNodes[0].InnerText
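
Building on the same DOM calls, a rough sketch of turning every data row into an object keyed by the header names might look like the following. It assumes PSParseHTML is installed, that the default HTML Agility Pack engine is in use, and that the table's first row holds th header cells. The sample markup and the variable names are illustrative only.

```powershell
# Assumption: PSParseHTML is installed; the default HTML Agility Pack engine is used.
# Hypothetical sample markup, for illustration only.
$sampleHtml = @'
<table>
<tr><th>Name</th><th>Mode</th></tr>
<tr><td>alice</td><td>d----</td></tr>
<tr><td>bob</td><td>-a---</td></tr>
</table>
'@

$htmlDom = $sampleHtml | ConvertFrom-Html
$table = $htmlDom.SelectSingleNode('//table')

# @(...) makes the row collection indexable.
$rows = @($table.Elements('tr'))

# Header names come from the first row's <th> cells.
$headers = @($rows[0].Elements('th') | ForEach-Object { $_.InnerText.Trim() })

# Build one object per data row, preserving the column order.
$objects = foreach ($row in $rows | Select-Object -Skip 1) {
  $cells = @($row.Elements('td'))
  $props = [ordered]@{}
  for ($i = 0; $i -lt $headers.Count; $i++) {
    $props[$headers[$i]] = $cells[$i].InnerText.Trim()
  }
  [pscustomobject]$props
}

$objects | Format-Table
```

Once the rows are objects, the usual cmdlets apply; for example, $objects | Export-Csv table.csv would write them out as CSV.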

A Windows-only alternative is to use the HTMLFile COM object, as shown in this answer and as you attempted yourself; it is unclear to me why it did not work in your specific case.

^[1] Note that this answer was originally based on PowerHTML, a different PowerShell wrapper module for the HTML Agility Pack; however, PSParseHTML is more actively maintained.

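^[2] Notably, it supports XPath queries via the .SelectSingleNode() and .SelectNodes() methods, exposes child nodes via a .ChildNodes collection, and provides .InnerHtml / .OuterHtml / .InnerText properties. Instead of an indexer that accepts child element names, it provides the .Element(<name>) and .Elements(<name>) methods.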