Split innerHTML into text for a translation JSON in JavaScript

Question

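I'm currently working on an application where I need to take the innerHTML of the body and extract the text from it as JSON. That JSON will be sent for translation, and the translated JSON will then be used as input to recreate the same HTML markup with the translated text. Please see the snippets below.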

HTML input

<section>Hello, <div>This is some text which I need to extract.<a class="link">It can be <strong> complicated.</strong></a></div><span>The extracted text should contain the html tag if it has any html tag in the span,p or a tag</span><p>Please see the <span>desired output below.</span></p>Thanks!</section>

Translation JSON output

{
"text1":"Hello, ",
"text2":"This is some text which I need to extract.",
"text3":"It can be <strong> complicated.</strong>",
"text4":"The extracted text should contain the html tag if it has any html tag in the span,p or a tag",
"text5":"Please see the <span>desired output below.</span>",
"text6":"Thanks!"
}

Translated JSON input

{
"text1":"Hello,-in spanish ",
"text2":"This is some text which I need to extract.-in spanish",
"text3":"It can be <strong> complicated.-in spanish</strong>",
"text4":"The extracted text should contain the html tag if it has any html tag in the span,p or a tag-in spanish",
"text5":"Please see the <span>desired output below.-in spanish</span>",
"text6":"Thanks!-in spanish"
}

Translated HTML output

<section>Hello,-in spanish <div>This is some text which I need to extract.-in spanish<a class="link">It can be <strong> complicated.-in spanish</strong></a></div><span>The extracted text should contain the html tag if it has any html tag in the span,p or a tag-in spanish</span><p>Please see the <span>desired output below.-in spanish</span></p>Thanks!-in spanish</section>

I tried various regular expressions. Below is the one I ended up with, but it doesn't produce the desired output.

//encode
const bodyHTML = '<a class="test">hello world<strong> this is gonna be hard</strong></a>';
//replace the quotes with escape quotes
const htmlContent = bodyHTML.replace(/"/g, '\\"');
let count = 0;
let translationObj = {};
let newHtml = htmlContent.replace(/\>(.*?)\</g, function(match) {
  //remove the special character 
  match = match.replace(/\>|\</g, '');
  count = count + 1;
  translationObj[count] = match;

  return '>~' + count + '~<';
});

const translationJSON = '{"1":"hello world in spanish","2":" this is gonna be hard in spanish","3":""}';

//decode
let translatedHtml = '';
const translatedObj = JSON.parse(translationJSON);
translatedHtml = newHtml.replace(/\~(.*?)\~/g, function(match) {
  //strip the ~ delimiters to get the numeric key
  match = match.replace(/~/g, '');

  return translatedObj[match];
});
//replace the escape quotes with quotes
translatedHtml = translatedHtml.replace(/\\"/g, '"');
//log each stage
console.log("bodyHTML", bodyHTML);
console.log('translationObj', translationObj);
console.log("translationJSON", translationJSON);
console.log('newHtml', newHtml);
console.log("translatedHtml", translatedHtml);

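I'm looking for a regular expression, or any other approach that works in the JS world, that gets all the text out of the HTML as JSON. One more condition: text should not be split where it contains inner HTML tags, so that we don't lose the context of the sentence. For example, <p>Click <a>here</a></p> should be treated as the single text "Click <a>here</a>". I hope that clears up any doubts.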

Thanks in advance!

- dk111989

Oh dear, someone is parsing HTML with regular expressions. But maybe consider something like JSoup to parse the JavaScript. Unless I've misunderstood the question. - user1531971

@T.J.Crowder My starting point is a string containing HTML. I'll remove the jQuery tag. Thanks! - dk111989

So what environment are you doing this in? Node.js? A JVM? A Windows Universal App? - T.J. Crowder

I'll be using Node.js to create something like a translation microservice @T.J.Crowder - dk111989

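Thanks a lot @T.J.Crowder for walking me through this. I definitely learned something new today, but this does the same thing as my snippet. My goal is to keep "It can be <strong> complicated.</strong>" together. - dk111989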

2 Answers

If anyone wants to do something similar, I created this translation service here:

https://github.com/gurusewak/translation

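My goal wasn't a 100% success rate at breaking the text into sentences, but to capture as many sentences as possible. I'm just trying to help anyone who needs a translation when given some HTML as input. Hope this helps someone in the future.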

Cheers!

Output

Output of the process here

- dk111989

- T.J. Crowder · Accepted Answer

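By far the best approach is to use an HTML parser and then loop through the text nodes in the tree. You can't correctly handle a non-regular markup language like HTML with just a simple JavaScript regular expression¹ (many people have wasted a lot of time trying), and that's before you even get to all of HTML's particular quirks.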
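As a small illustration of the failure mode, here is a sketch that uses the `/>(.*?)</g` pattern from the question. The helper name `textBetweenTags` is made up for this example. It shows the pattern splitting a sentence at every inline tag and misreading a `>` inside an attribute value:

```javascript
// Hypothetical helper: collect whatever sits between a ">" and the next "<",
// which is effectively what the regex in the question does.
function textBetweenTags(html) {
  return Array.from(html.matchAll(/>(.*?)</g), (m) => m[1]);
}

// The sentence is split at the inline <strong> tag, and an empty piece shows up
// between "</strong>" and "</a>".
const pieces = textBetweenTags(
  '<a class="link">It can be <strong> complicated.</strong></a>'
);
console.log(pieces); // three pieces: 'It can be ', ' complicated.', ''

// A ">" inside an attribute value is mistaken for the end of a tag.
const broken = textBetweenTags('<a title="a > b">x</a>');
console.log(broken); // one piece: ' b">x'
```

The three pieces in the first call match the three entries in the question's sample `translationJSON`, including the empty third one.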

There are at least a couple of well-tested, actively supported DOM parser modules available on npm.

So the basic structure is:

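1. Parse the HTML into a DOM.
2. Walk the DOM in a defined order (typically a depth-first traversal), building an object or array of the text strings to translate from the text nodes you encounter.
3. If necessary, convert that object/array to JSON, send it off for translation, get the result back, and if necessary parse it from JSON back into an object/array.
4. Walk the DOM in the same order, applying the results from the object/array.
5. Serialize the DOM to HTML.
6. Send the result.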
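The JSON round trip in the third step might look like the sketch below. The `text1`, `text2`, ... key scheme comes from the question; the function names `toTranslationJSON` and `fromTranslationJSON` are illustrative and not from any library.

```javascript
// Turn the ordered list of strings into the {"text1": ..., "text2": ...} shape
// used in the question. The key encodes the position so order can be restored.
function toTranslationJSON(strings) {
  const obj = {};
  strings.forEach((str, i) => {
    obj["text" + (i + 1)] = str;
  });
  return JSON.stringify(obj);
}

// Rebuild the ordered list from the translated JSON, looking keys up by index
// rather than relying on property order in the parsed object.
function fromTranslationJSON(json, count) {
  const obj = JSON.parse(json);
  const result = [];
  for (let i = 1; i <= count; i++) {
    const key = "text" + i;
    if (!(key in obj)) {
      throw new Error("Missing translation for " + key);
    }
    result.push(obj[key]);
  }
  return result;
}

const sampleStrings = ["Hello, ", "Thanks!"];
const sampleJSON = toTranslationJSON(sampleStrings);
console.log(sampleJSON); // {"text1":"Hello, ","text2":"Thanks!"}
const roundTrip = fromTranslationJSON(sampleJSON, sampleStrings.length);
```

Failing loudly on a missing key is a design choice for this sketch: if the translation service drops an entry, every later string would otherwise be applied to the wrong node.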

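Here's an example of the full process. Here I'm using the browser's built-in HTML parser rather than an npm module, and the API of whatever module you use may be slightly different, but the concept is the same.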

var html = '<section>Hello, <div>This is some text which I need to extract.<a class="link">It can be <strong> complicated.</strong></a></div><span>The extracted text should contain the html tag if it has any html tag in the span,p or a tag</span><p>Please see the <span>desired output below.</span></p>Thanks!</section>';
var dom = parseHTML(html);
var strings = [];
walk(dom, function(node) {
  if (node.nodeType === 3) { // text node
    strings.push(node.nodeValue);
  }
});
console.log("strings = ", strings);
var translation = translate(strings);
console.log("translation = ", translation);
var n = 0;
walk(dom, function(node) {
  if (node.nodeType === 3) { // text node
    node.nodeValue = translation[n++];
  }
});
var newHTML = serialize(dom);
document.getElementById("before").innerHTML = html;
document.getElementById("after").innerHTML = newHTML;


function translate(strings) {
  return strings.map(str => str.toUpperCase());
}

function walk(node, callback) {
  var child;
  callback(node);
  switch (node.nodeType) {
    case 1: // Element
      for (child = node.firstChild; child; child = child.nextSibling) {
        walk(child, callback);
      }
  }
}

// Placeholder for module function
function parseHTML(html) {
  var div = document.createElement("div");
  div.innerHTML = html;
  return div;
}

// Placeholder for module function
function serialize(dom) {
  return dom.innerHTML;
}

<strong>Before:</strong>
<div id="before"></div>
<strong>After:</strong>
<div id="after"></div>
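The example above treats every text node as its own string, so "It can be <strong> complicated.</strong>" still ends up in two pieces. One possible variation, which is not part of the original answer, is to treat the content of certain elements as a single unit, inline markup included. The sketch below assumes standard DOM node properties (`nodeType`, `childNodes`, `tagName`, `nodeValue`, `innerHTML`); the names `KEEP_TOGETHER`, `collectUnits` and `applyTranslations` are hypothetical, and the tag list simply follows the "span, p or a" rule stated in the question.

```javascript
// Tags whose content is kept together as one translation unit, inline markup
// included. This list is an assumption taken from the question's wording.
const KEEP_TOGETHER = new Set(["p", "span", "a"]);

// Walk the tree and collect units in document order. Each unit remembers its
// node, so translations can be written back without a second walk.
function collectUnits(node, units = []) {
  for (const child of Array.from(node.childNodes)) {
    if (child.nodeType === 3) {
      // Text node: skip whitespace-only text between tags
      if (child.nodeValue.trim() !== "") {
        units.push({ node: child, kind: "text", source: child.nodeValue });
      }
    } else if (child.nodeType === 1) {
      if (KEEP_TOGETHER.has(child.tagName.toLowerCase())) {
        // Keep the whole inner markup as one string
        units.push({ node: child, kind: "html", source: child.innerHTML });
      } else {
        // Any other element: descend and look at its children
        collectUnits(child, units);
      }
    }
  }
  return units;
}

// Write translated strings back in the same order they were collected.
function applyTranslations(units, translated) {
  units.forEach((unit, i) => {
    if (unit.kind === "text") {
      unit.node.nodeValue = translated[i];
    } else {
      unit.node.innerHTML = translated[i];
    }
  });
}
```

With the `parseHTML` placeholder above, `collectUnits(parseHTML(html)).map(u => u.source)` should give the six strings listed in the question for its sample input. Note that writing back through `innerHTML` trusts the markup returned by the translation step, so the result is only as safe as that service.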

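¹ Some "regular expression" libraries (or the regular expression features of other languages) are really regular expressions plus extra features that can help you do something like this, but they aren't just regular expressions, and JavaScript's built-in ones don't have those features.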