Decoding HTML entities in Emacs/Elisp

5

Some websites like to encode all of their text as HTML entities, so instead of seeing this text:

So I'm looking

you get something like this:
&#83;&#111;&#32;&#73;&#39;&#109;&#32;&#108;&#111;&#111;&#107;&#105;&#110;&#103;

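I would like to know whether there is a built-in way to convert the encoded text back to regular text using anything that ships with Emacs, or whether I should declare my own string map ("&#83;" => "S" ...) and decode by hand with it.

Thanks a lot for any guidance.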


3
By the way: those are not HTML entities but Unicode entities, which is a different thing. See http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_reference_overview. - ty812
3 Answers

2
I wrote this function to handle the non-numeric (named) entities, in case anyone needs it.
(defun html-entities-to-unicode (string)
  "Replace named HTML entities in STRING with the matching characters.
Entities that are not in the table are left unchanged."
  (let* ((plist
          ;; Entity name (symbol) followed by its replacement string.
          ;; Invisible and special-width characters are written as
          ;; \u escapes so they survive copy and paste.
          '(Aacute "Á" aacute "á" Acirc "Â" acirc "â" acute "´" AElig "Æ" aelig "æ" Agrave "À" agrave "à"
            alefsym "ℵ" Alpha "Α" alpha "α" amp "&" and "∧" ang "∠" apos "'" aring "å" Aring "Å" asymp "≈"
            atilde "ã" Atilde "Ã" auml "ä" Auml "Ä" bdquo "„" Beta "Β" beta "β" brvbar "¦" bull "•" cap "∩"
            ccedil "ç" Ccedil "Ç" cedil "¸" cent "¢" Chi "Χ" chi "χ" circ "ˆ" clubs "♣" cong "≅" copy "©"
            crarr "↵" cup "∪" curren "¤" Dagger "‡" dagger "†" darr "↓" dArr "⇓" deg "°" Delta "Δ" delta "δ"
            diams "♦" divide "÷" eacute "é" Eacute "É" ecirc "ê" Ecirc "Ê" egrave "è" Egrave "È" empty "∅"
            emsp "\u2003" ensp "\u2002" Epsilon "Ε" epsilon "ε" equiv "≡" Eta "Η" eta "η" eth "ð" ETH "Ð"
            euml "ë" Euml "Ë" euro "€" exist "∃" fnof "ƒ" forall "∀" frac12 "½" frac14 "¼" frac34 "¾" frasl "⁄"
            Gamma "Γ" gamma "γ" ge "≥" gt ">" harr "↔" hArr "⇔" hearts "♥" hellip "…" iacute "í" Iacute "Í"
            icirc "î" Icirc "Î" iexcl "¡" igrave "ì" Igrave "Ì" image "ℑ" infin "∞" int "∫" Iota "Ι" iota "ι"
            iquest "¿" isin "∈" iuml "ï" Iuml "Ï" Kappa "Κ" kappa "κ" Lambda "Λ" lambda "λ" lang "〈" laquo "«"
            larr "←" lArr "⇐" lceil "⌈" ldquo "“" le "≤" lfloor "⌊" lowast "∗" loz "◊" lrm "\u200e" lsaquo "‹"
            lsquo "‘" lt "<" macr "¯" mdash "—" micro "µ" middot "·" minus "−" Mu "Μ" mu "μ" nabla "∇"
            nbsp "\u00a0" ndash "–" ne "≠" ni "∋" not "¬" notin "∉" nsub "⊄" ntilde "ñ" Ntilde "Ñ" Nu "Ν" nu "ν"
            oacute "ó" Oacute "Ó" ocirc "ô" Ocirc "Ô" OElig "Œ" oelig "œ" ograve "ò" Ograve "Ò" oline "‾"
            omega "ω" Omega "Ω" Omicron "Ο" omicron "ο" oplus "⊕" or "∨" ordf "ª" ordm "º" oslash "ø" Oslash "Ø"
            otilde "õ" Otilde "Õ" otimes "⊗" ouml "ö" Ouml "Ö" para "¶" part "∂" permil "‰" perp "⊥" Phi "Φ"
            phi "φ" Pi "Π" pi "π" piv "ϖ" plusmn "±" pound "£" Prime "″" prime "′" prod "∏" prop "∝" Psi "Ψ"
            psi "ψ" quot "\"" radic "√" rang "〉" raquo "»" rarr "→" rArr "⇒" rceil "⌉" rdquo "”" real "ℜ"
            reg "®" rfloor "⌋" Rho "Ρ" rho "ρ" rlm "\u200f" rsaquo "›" rsquo "’" sbquo "‚" scaron "š" Scaron "Š"
            sdot "⋅" sect "§" shy "\u00ad" Sigma "Σ" sigma "σ" sigmaf "ς" sim "∼" spades "♠"
            sub "⊂" sube "⊆" sum "∑" sup "⊃" sup1 "¹" sup2 "²" sup3 "³" supe "⊇" szlig "ß" Tau "Τ" tau "τ"
            there4 "∴" Theta "Θ" theta "θ" thetasym "ϑ" thinsp "\u2009" thorn "þ" THORN "Þ" tilde "˜" times "×"
            trade "™" uacute "ú" Uacute "Ú" uarr "↑" uArr "⇑" ucirc "û" Ucirc "Û" ugrave "ù" Ugrave "Ù" uml "¨"
            upsih "ϒ" Upsilon "Υ" upsilon "υ" uuml "ü" Uuml "Ü" weierp "℘" Xi "Ξ" xi "ξ" yacute "ý" Yacute "Ý"
            yen "¥" yuml "ÿ" Yuml "Ÿ" Zeta "Ζ" zeta "ζ" zwj "\u200d" zwnj "\u200c"))
         ;; Look up the name between "&" and ";"; fall back to the match itself.
         (get-function (lambda (s) (or (plist-get plist (intern (substring s 1 -1))) s))))
    ;; FIXEDCASE and LITERAL are both t: insert the replacement exactly as is.
    (replace-regexp-in-string "&[^; ]*;" get-function string t t)))
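The lookup technique used by that function is easier to try out on a much smaller table. The following is only a sketch: `my-demo-entity-table` and `my-demo-decode-named` are hypothetical names invented for illustration, and the five-entry table stands in for the full list above.

```elisp
;; Hypothetical, reduced table: just enough entries to show the lookup.
(defvar my-demo-entity-table '(amp "&" lt "<" gt ">" eacute "é" lambda "λ")
  "Small property list mapping entity names (symbols) to strings.")

(defun my-demo-decode-named (string)
  "Replace named entities in STRING using `my-demo-entity-table'.
Entities missing from the table are left untouched."
  (replace-regexp-in-string
   "&[^; ]*;"
   (lambda (match)
     ;; Strip the leading & and the trailing ; to get the entity name.
     (or (plist-get my-demo-entity-table
                    (intern (substring match 1 -1)))
         match))
   string
   t     ; FIXEDCASE: keep the replacement's case as written
   t))   ; LITERAL: do not interpret backslashes in the replacement

(my-demo-decode-named "caf&eacute; &amp; &lambda; &unknown;")
;; => "café & λ &unknown;"
```

Because `intern` is case-sensitive, `&Lambda;` and `&lambda;` are looked up as different symbols, which is what allows the full table to hold both forms.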

1
I wrote the following, which should do what you need, @federico-builes. (I needed the same thing.)
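(defun ajs-decimal-escapes-to-unicode (start end)
  "Convert decimal escapes like \"&#955;\" to characters like \"λ\".
Interactively, operate on the active region, or on the whole
buffer if no region is active."
  (interactive (if (use-region-p)
                   (list (region-beginning) (region-end))
                 (list (point-min) (point-max))))
  ;; Insert the decoded text where the original text started.
  (goto-char start)
  (insert (replace-regexp-in-string
           "&#[0-9]+;"                  ; require at least one digit
           (lambda (match)
             ;; Drop the leading "&#" and the trailing ";".
             (string (string-to-number (substring match 2 -1))))
           ;; The non-nil third argument deletes the text it returns.
           (filter-buffer-substring start end t)
           t t)))                       ; literal, so "&#92;" (backslash) is safe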

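@konr's answer was very helpful, thanks! I have also been enjoying getting started with programming in Emacs Lisp. This is the first potentially useful piece of Lisp I have written. I would appreciate feedback, even on details such as whitespace. Thanks!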
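The command above works on a buffer region and on decimal escapes only. As a hedged sketch of the same idea for strings, the function below also accepts hexadecimal references such as `&#x3BB;`. The name `my-decode-numeric-refs` is hypothetical, and the code assumes every number is a valid character code (`string` signals an error otherwise).

```elisp
(defun my-decode-numeric-refs (string)
  "Return STRING with numeric character references decoded.
Handles decimal (\"&#955;\") and hexadecimal (\"&#x3BB;\") forms."
  (replace-regexp-in-string
   ;; Group 1 captures hex digits, group 2 captures decimal digits.
   "&#\\(?:[xX]\\([0-9a-fA-F]+\\)\\|\\([0-9]+\\)\\);"
   (lambda (match)
     ;; Inside this function the match data refers to MATCH itself,
     ;; so the groups can be read with `match-string'.
     (let ((hex (match-string 1 match))
           (dec (match-string 2 match)))
       (string (if hex
                   (string-to-number hex 16)
                 (string-to-number dec)))))
   string
   t t))   ; FIXEDCASE and LITERAL: insert the character untouched

(my-decode-numeric-refs "&#83;&#111;&#32;&#73;&#39;&#109;")
;; => "So I'm"
(my-decode-numeric-refs "&#955; = &#x3BB;")
;; => "λ = λ"
```

Passing `t` for LITERAL matters here: a decoded backslash (`&#92;`) would otherwise be read as the start of a replacement escape sequence.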


0

I don't know whether there is a built-in function, but this small function does the job:

(defun my-insert-encode-entities-string (str)
  (mapconcat
   (lambda (char) (format "&#%d;" char))
   (string-to-list str)
   ""))

If you only want to encode the characters that are special in HTML, use url-insert-entities-in-string instead.
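For comparison, here is a short sketch of what `url-insert-entities-in-string` does. It assumes the `url-util` library bundled with Emacs is available and still provides this function in your version. Only `&`, `<`, `>` and `"` are rewritten, so it encodes text rather than decoding it.

```elisp
(require 'url-util)  ; part of the URL library bundled with Emacs

;; Only the four markup-significant characters are replaced;
;; everything else, including the apostrophe, passes through.
(url-insert-entities-in-string "So I'm <b>looking</b> & \"waiting\"")
;; => "So I'm &lt;b&gt;looking&lt;/b&gt; &amp; &quot;waiting&quot;"
```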


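The function is wrong: you don't want to format a character as %d, you want to take a %d and format it as a char. - Federico Builes
@Federico: I'm not sure I understand your point. Calling (my-insert-encode-entities-string "So I'm looking") returns exactly the same result as the one you gave. The variable char holds the current character represented as an integer, so in this particular case I don't think it matters whether you use %s or %d. - viam0Zah
@Török: There is a small misunderstanding. As you can see in the question, I am looking for "...a built-in way to convert the encoded text to regular text". Your solution converts regular text to encoded text :) I wrote this http://gist.github.com/222709 to fix it, but it is clearly not as clean as your original suggestion. - Federico Builes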
