DOMDocument encoding problems / character conversion

Question

7

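I am using DOMDocument to manipulate/modify HTML before it is output to the page. It is only an HTML fragment, not a complete page. My initial problem was that all French characters were garbled, which I was able to correct after some trial and error. Now only one problem seems to remain: the ’ character gets converted to ?.

Code: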

<?php
    $dom = new DOMDocument('1.0','utf-8');
         $dom->loadHTML(utf8_decode($row->text));

         //Some pretty basic modification here, not even related to text

         //reinsert HTML, and make sure to remove DOCTYPE, html and body that get added auto.
         $row->text = utf8_encode(preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $dom->saveHTML())));
?>

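I know the utf8_decode/utf8_encode calls may be a bit messy, but so far it is the only way I could get this to work. Here is a sample string:

Input: Sans doute parce qu’il vient d’atteindre une date déterminante dans son spectaculaire cheminement

Output: Sans doute parce qu?il vient d?atteindre une date déterminante dans son spectaculaire cheminement

If I find more details, I will add them. Thank you for your time and support!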

- Kyrotomia

1

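What charset is $row->text in? If it is already UTF-8 (assuming it comes from MySQL, you need to set the connection charset to UTF8), then you do not need the utf8_(en|de)code functions. Force the charset to UTF8 everywhere and all your problems should go away (assuming that is where $row comes from)... - ircmaxell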

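The input comes from a CMS where everything is set to utf8 (strings, database, etc.). But it seems my problem is not what I thought. Strings entered from my machine come through correctly, and the same goes for my colleague's PC. The problem only occurs when the string is entered from the client's PC. I bet she is copy-pasting text from Word or something similar, and something odd happens along the way. I need to dig deeper into this. - Kyrotomia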

1

Ah... then maybe check for UCS-2LE (UTF-16LE) characters (since that is Word's default, if I remember correctly)... - ircmaxell

4 Answers

8

loadHtml() does not always recognize the correct encoding specified in the Content-type HTTP-EQUIV meta tag.

If DomDocument('1.0', 'UTF-8') and loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . $html) do not work (as in PHP 5.3.13), try the following:

Add another <head> tag immediately after the opening <html> tag, containing the correct Content-type HTTP-EQUIV meta tag. Then call loadHtml(), and finally remove the extra <head> tag.

// Ensure entire page is encoded in UTF-8
$encoding = mb_detect_encoding($body);
$body = $encoding ? @iconv($encoding, 'UTF-8', $body) : $body;

// Insert a head and meta tag immediately after the opening <html> to force UTF-8 encoding
$insertPoint = false;
if (preg_match("/<html.*?>/is", $body, $matches, PREG_OFFSET_CAPTURE)) {
    // PREG_OFFSET_CAPTURE reports byte offsets, so use byte-based strlen()/substr()
    $insertPoint = strlen($matches[0][0]) + $matches[0][1];
}
if ($insertPoint) {
    $body = substr($body, 0, $insertPoint)
        . "<head><meta http-equiv='Content-type' content='text/html; charset=UTF-8' /></head>"
        . substr($body, $insertPoint);
}
$dom = new DOMDocument();

// Suppress warnings for loading non-standard html pages
libxml_use_internal_errors(true);
$dom->loadHTML($body);
libxml_use_internal_errors(false);

// Now remove extra <head>
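
The snippet above stops at the removal step. A minimal sketch of one way to finish it, assuming the inserted meta tag is the first Content-type meta in the document. The sample $body here is hypothetical and already contains the inserted block:

```php
<?php
// Hypothetical input: a page after the <head><meta ...></head> block was inserted.
$body = "<html><head><meta http-equiv='Content-type' content='text/html; charset=UTF-8' /></head>"
      . "<body><p>Déjà vu</p></body></html>";

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($body);
libxml_use_internal_errors(false);

// Assumption: the inserted meta is the first Content-type meta in document order.
$xpath = new DOMXPath($dom);
$meta = $xpath->query("//meta[@http-equiv='Content-type']")->item(0);
if ($meta !== null) {
    $head = $meta->parentNode;
    $head->removeChild($meta);
    // Drop the <head> too if the inserted meta was its only content.
    if (!$head->hasChildNodes()) {
        $head->parentNode->removeChild($head);
    }
}

// The text was decoded as UTF-8 at load time, so it stays intact after the removal.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

With the meta tag gone, saveHTML() may emit non-ASCII characters as entities, depending on the libxml version, so check the serialized output if you need raw UTF-8.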

Please read this article: http://devzone.zend.com/1538/php-dom-xml-extension-encoding-processing/

- Luke

4

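For me this was enough; the other answers are overly complicated. Suppose I have an HTML document that already has a HEAD tag. The HEAD tag has no attributes, and in my use case I have no problem leaving the extra META tag in the HTML.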

$data = str_ireplace('<head>', '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" />', $data);
$document = new DOMDocument();
$document->loadHTML($data);

- David Meister

1

As others have pointed out, DOMDocument and loadHTML default to LATIN1 encoding when handling HTML fragments. They also wrap your HTML in something like the following:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>YOUR HTML</body></html>

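As others have also pointed out, you can fix the encoding by inserting a META element with the correct encoding into the HTML.

However, if you are working with an HTML fragment, you probably want neither the wrapping nor the inserted HEAD element in the result.

The following code inserts the HEAD element and then, after processing, removes all the wrapper elements with a regular expression: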

<?php
$html = '<article class="grid-item"><p>Hello World</p></article><article class="grid-item"><p>Goodbye World</p></article>';
$head = '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>';

libxml_use_internal_errors(true);
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($head . $html);
$xpath = new DOMXPath($dom);

// Loop through all article.grid-item elements and add the "invisible" class to them
$nodes = $xpath->query("//article[contains(concat(' ', normalize-space(@class), ' '), ' grid-item ')]");
foreach($nodes as $node) {
  $class = $node->getAttribute('class');
  $class .= ' invisible';
  $node->setAttribute('class', $class);
}

$content = preg_replace('/<\/?(!doctype|html|head|meta|body)[^>]*>/im', '', $dom->saveHTML());
libxml_use_internal_errors(false);

echo $content;
?>
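
As an alternative to stripping the wrappers with a regular expression, DOMDocument::saveHTML() accepts a node argument (PHP 5.3.6+), so the children of <body> can be serialized one by one. A minimal sketch under the same setup as above; the sample markup is made up for illustration:

```php
<?php
// Hypothetical fragment containing non-ASCII characters.
$html = '<article class="grid-item"><p>Déjà vu</p></article>';
$head = '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>';

libxml_use_internal_errors(true);
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($head . $html);
libxml_use_internal_errors(false);

// Serialize only the children of <body>; the doctype, <html>, <head>
// and <meta> wrappers are never written to the output.
$content = '';
$body = $dom->getElementsByTagName('body')->item(0);
if ($body !== null) {
    foreach ($body->childNodes as $child) {
        $content .= $dom->saveHTML($child);
    }
}

echo $content, "\n";
```

This avoids accidentally removing a legitimate meta or head string inside the fragment's own text, which the regex approach could do.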

- Kodie Grantham

- Artefacto · Accepted Answer

Do not use utf8_decode. If your text is UTF-8, pass it in as is.
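
To see why the decode step loses the apostrophe: the typographic ’ (U+2019) has no ISO-8859-1 equivalent, so converting to Latin-1 replaces it with ?, while é survives because Latin-1 contains it. A minimal sketch, using mb_convert_encoding as a stand-in for utf8_decode/utf8_encode (both deprecated since PHP 8.2) and assuming the mbstring extension with its default substitute character:

```php
<?php
// Sample text with U+2019 (right single quotation mark), as typically pasted from Word.
$text = "qu\u{2019}il d\u{00E9}terminante";

// The same conversion utf8_decode() performs: UTF-8 to ISO-8859-1.
// U+2019 does not exist in ISO-8859-1, so it is replaced by the substitute character.
$latin1 = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

// Converting back (what utf8_encode() does) cannot restore the lost character.
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo $roundTrip, "\n";
```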

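Unfortunately, DOMDocument defaults to LATIN1 encoding when handling HTML. Its behavior appears to be as follows:

If a remote document is fetched, the encoding should be deduced from the headers.
If no header was sent or the file is local, it looks for the corresponding meta http-equiv tag.
Otherwise, it defaults to LATIN1.

Sample code: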

<?php
$s = <<<HTML
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
Sans doute parce qu’il vient d’atteindre une date déterminante
dans son spectaculaire cheminement
</body>
</html>
HTML;

libxml_use_internal_errors(true);
$d = new domdocument;
$d->loadHTML($s);

echo $d->textContent;

Using XML (which defaults to UTF-8):

<?php
$s = '<x>Sans doute parce qu’il vient d’atteindre une date déterminante '.
    'dans son spectaculaire cheminement</x>';
libxml_use_internal_errors(true);
$d = new domdocument;
$d->loadXML($s);

echo $d->textContent;