Escaping HTML special characters in awk

3

I want to generate an HTML file from an awk script. My strings may contain characters such as "<" and "&". Is there a short, proven awk function that does the escaping?

2 Answers

2
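If you want to convert every line ($0), just call makeEntities(), or modify it to accept a parameter. I created this function for use with the British National Corpus, whose entity set overlaps heavily with the HTML entities, but not 100%, so if you need some of the more exotic characters, check that their entity names are correct. Take care to preserve HTML tags: the function also escapes "<" and ">".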
function makeEntities()  {    
    gsub(/á/,  "\\&aacute;");
    gsub(/Á/,  "\\&Aacute;");
    gsub(/ă/,  "\\&abreve;");
    gsub(/â/,  "\\&acirc;");
    gsub(/´/,  "\\&acute;");
    gsub(/æ/,  "\\&aelig;");
    gsub(/Æ/,  "\\&AElig;");
    gsub(/α/,  "\\&agr;");
    gsub(/à/,  "\\&agrave;");
    gsub(/ā/,  "\\&amacr;");
    gsub(/Ā/,  "\\&Amacr;");
    gsub(/&/,  "\\&amp;");    # caution: this also rewrites the "&" of the entities emitted by the rules above
    gsub(/ą/,  "\\&aogon;");
    gsub(/å/,  "\\&aring;");
    gsub(/Å/,  "\\&Aring;");
    gsub(/ã/,  "\\&atilde;");
    gsub(/ä/,  "\\&auml;");
    gsub(/Ä/,  "\\&Auml;");
    gsub(/β/,  "\\&bgr;");
    gsub(/\\/, "\\&bsol;");
    gsub(/•/,  "\\&bull;");
    gsub(/ć/,  "\\&cacute;");
    gsub(/č/,  "\\&ccaron;");
    gsub(/Č/,  "\\&Ccaron;");
    gsub(/ç/,  "\\&ccedil;");
    gsub(/Ç/,  "\\&Ccedil;");
    gsub(/ĉ/,  "\\&ccirc;");
    gsub(/✓/,  "\\&check;");
    gsub(/ˆ/,  "\\&circ;");
    gsub(/@/,  "\\&commat;");
    gsub(/©/,  "\\&copy;");
    gsub(/‐/,  "\\&dash;");
    gsub(/ď/,  "\\&dcaron;");
    gsub(/°/,  "\\&deg;");
    gsub(/δ/,  "\\&dgr;");
    gsub(/Δ/,  "\\&Dgr;");
    gsub(/¨/,  "\\&die;");
    gsub(/\$/, "\\&dollar;");
    gsub(/đ/,  "\\&dstrok;");
    gsub(/é/,  "\\&eacute;");
    gsub(/É/,  "\\&Eacute;");
    gsub(/ě/,  "\\&ecaron;");
    gsub(/ê/,  "\\&ecirc;");
    gsub(/è/,  "\\&egrave;");
    gsub(/È/,  "\\&Egrave;");
    gsub(/ε/,  "\\&egr;");
    gsub(/ē/,  "\\&emacr;");
    gsub(/Ē/,  "\\&Emacr;");
    gsub(/ę/,  "\\&eogon;");
    gsub(/ð/,  "\\&eth;");
    gsub(/ë/,  "\\&euml;");
    gsub(/Ë/,  "\\&Euml;");
    gsub(/♭/,  "\\&flat;");
    gsub(/½/,  "\\&frac12;");
    gsub(/⅓/,  "\\&frac13;");
    gsub(/¼/,  "\\&frac14;");
    gsub(/⅕/,  "\\&frac15;");
    gsub(/⅙/,  "\\&frac16;");
    gsub(/⅛/,  "\\&frac18;");
    gsub(/⅔/,  "\\&frac23;");
    gsub(/⅖/,  "\\&frac25;");
    gsub(/¾/,  "\\&frac34;");
    gsub(/⅗/,  "\\&frac35;");
    gsub(/⅜/,  "\\&frac38;");
    gsub(/⅘/,  "\\&frac45;");
    gsub(/⅝/,  "\\&frac58;");
    gsub(/⅞/,  "\\&frac78;");
    gsub(/′/,  "\\&ft;");
    gsub(/γ/,  "\\&ggr;");
    gsub(/>/,  "\\&gt;");
    gsub(/½/,  "\\&half;");
    gsub(/ħ/,  "\\&hstrok;");
    gsub(/í/,  "\\&iacute;");
    gsub(/Í/,  "\\&Iacute;");
    gsub(/î/,  "\\&icirc;");
    gsub(/Î/,  "\\&Icirc;");
    gsub(/ì/,  "\\&igrave;");
    gsub(/ī/,  "\\&imacr;");
    gsub(/″/,  "\\&ins;");
    gsub(/¿/,  "\\&iquest;");
    gsub(/ï/,  "\\&iuml;");
    gsub(/Ï/,  "\\&Iuml;");
    gsub(/ĺ/,  "\\&lacute;");
    gsub(/Ĺ/,  "\\&Lacute;");
    gsub(/\{/, "\\&lcub;");
    gsub(/≤/,  "\\&le;");
    gsub(/λ/,  "\\&lgr;");
    gsub(/_/,  "\\&lowbar;");
    gsub(/\[/, "\\&lsqb;");
    gsub(/ł/,  "\\&lstrok;");
    gsub(/Ł/,  "\\&Lstrok;");
    gsub(/</,  "\\&lt;");
    gsub(/—/,  "\\&mdash;");
    gsub(/μ/,  "\\&mgr;");
    gsub(/µ/,  "\\&micro;");
    gsub(/·/,  "\\&middot;");
    gsub(/ń/,  "\\&nacute;");
    gsub(/ň/,  "\\&ncaron;");
    gsub(/ņ/,  "\\&ncedil;");
    gsub(/–/,  "\\&ndash;");
    gsub(/ñ/,  "\\&ntilde;");
    gsub(/Ñ/,  "\\&Ntilde;");
    gsub(/#/,  "\\&num;");
    gsub(/ó/,  "\\&oacute;");
    gsub(/Ó/,  "\\&Oacute;");
    gsub(/ô/,  "\\&ocirc;");
    gsub(/œ/,  "\\&oelig;");
    gsub(/ò/,  "\\&ograve;");
    gsub(/Ω/,  "\\&ohm;");
    gsub(/ō/,  "\\&omacr;");
    gsub(/ø/,  "\\&oslash;");
    gsub(/Ø/,  "\\&Oslash;");
    gsub(/õ/,  "\\&otilde;");
    gsub(/ö/,  "\\&ouml;");
    gsub(/Ö/,  "\\&Ouml;");
    gsub(/φ/,  "\\&phgr;");
    gsub(/\+/, "\\&plus;");
    gsub(/±/,  "\\&plusmn;");
    gsub(/£/,  "\\&pound;");
    gsub(/ŕ/,  "\\&racute;");
    gsub(/√/,  "\\&radic;");
    gsub(/ř/,  "\\&rcaron;");
    gsub(/Ř/,  "\\&Rcaron;");
    gsub(/\}/, "\\&rcub;");
    gsub(/®/,  "\\&reg;");
    gsub(/-/,  "\\&rehy;");
    gsub(/\]/, "\\&rsqb;");
    gsub(/ś/,  "\\&sacute;");
    gsub(/Ś/,  "\\&Sacute;");
    gsub(/š/,  "\\&scaron;");
    gsub(/Š/,  "\\&Scaron;");
    gsub(/ş/,  "\\&scedil;");
    gsub(/Ş/,  "\\&Scedil;");
    gsub(/ŝ/,  "\\&scirc;");
    gsub(/σ/,  "\\&sgr;");
    gsub(/♯/,  "\\&sharp;");
    gsub(/\//, "\\&shilling;");
    gsub(/∼/,  "\\&sim;");
    gsub(/\//, "\\&sol;");
    gsub(/²/,  "\\&sup2;");
    gsub(/ß/,  "\\&szlig;");
    gsub(/ť/,  "\\&tcaron;");
    gsub(/ţ/,  "\\&tcedil;");
    gsub(/τ/,  "\\&tgr;");
    gsub(/þ/,  "\\&thorn;");
    gsub(/Þ/,  "\\&THORN;");
    gsub(/×/,  "\\&times;");
    gsub(/™/,  "\\&trade;");
    gsub(/ú/,  "\\&uacute;");
    gsub(/Ú/,  "\\&Uacute;");
    gsub(/û/,  "\\&ucirc;");
    gsub(/ù/,  "\\&ugrave;");
    gsub(/ū/,  "\\&umacr;");
    gsub(/¨/,  "\\&uml;");
    gsub(/ů/,  "\\&uring;");
    gsub(/ü/,  "\\&uuml;");
    gsub(/Ü/,  "\\&Uuml;");
    gsub(/\|/, "\\&verbar;");
    gsub(/ŵ/,  "\\&wcirc;");
    gsub(/ý/,  "\\&yacute;");
    gsub(/ŷ/,  "\\&ycirc;");
    gsub(/¥/,  "\\&yen;");
    gsub(/ÿ/,  "\\&yuml;");
    gsub(/Ÿ/,  "\\&Yuml;");
    gsub(/ź/,  "\\&zacute;");
    gsub(/Ž/,  "\\&Zcaron;");
    gsub(/ž/,  "\\&zcaron;");
    gsub(/ż/,  "\\&zdot;");
}

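Nice stuff. However, non-ASCII characters are fine as they are; in general everything in UTF-8 is okay. What I need is proper escaping of the special HTML characters. - cruftex
This isn't correct, because it re-escapes content that has already been escaped with &. - Matthew Buckett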
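
To illustrate the re-escaping problem raised in the comment above, here is a minimal sketch. The helper names `escape_late` and `escape_first` are hypothetical, and the example uses only "<" and "&" to stay ASCII-only. In makeEntities() the same thing happens to the rules listed before the "&" rule: "á" first becomes `&aacute;` and then `&amp;aacute;`.

```sh
#!/bin/sh
# Hypothetical helpers for illustration: the same two substitutions in both orders.

# "&" is escaped AFTER another entity has been produced,
# so the "&" of that entity is escaped a second time.
escape_late() {
  awk '{
    gsub(/</, "\\&lt;")
    gsub(/&/, "\\&amp;")
    print
  }'
}

# "&" is escaped FIRST, so entities produced later are left alone.
escape_first() {
  awk '{
    gsub(/&/, "\\&amp;")
    gsub(/</, "\\&lt;")
    print
  }'
}

printf '%s\n' 'a < b' | escape_late    # prints: a &amp;lt; b
printf '%s\n' 'a < b' | escape_first   # prints: a &lt; b
```

Moving the "&" rule to the top of the list should avoid this particular double escaping. It does not help with input that already contains entities, which would need separate handling.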

1

For the bare minimum, you can do this:

function escapeHtml(t)
{
  # Must do this one first
  gsub(/&/,  "\\&amp;", t);
  gsub(/</,  "\\&lt;", t);
  gsub(/>/,  "\\&gt;", t);
  return t;
}
