How do I fix files with mixed encodings?

4

How can I configure Emacs so that, when saving a corrupted file that mixes encodings (e.g. utf-8 and latin-1), every symbol is projected to a single encoding (e.g. utf-8)?

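I wrote the following function to automate some of the cleanup, but I would think I could find, somewhere, the information needed to map the symbol "é" in one encoding to "é" in utf-8 and so improve this function (or that somebody has already written such a function).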

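  (defun jyby/cleanToUTF ()
    "Delete known stray characters from the whole buffer."
    (interactive)
    (save-excursion
      ;; `replace-regexp' is meant for interactive use, so search and
      ;; replace explicitly, starting from the top for each character.
      (dolist (bad '("अ" "आ" "ॆ"))
        (goto-char (point-min))
        (while (search-forward bad nil t)
          (replace-match "" t t)))))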

  (global-unset-key [f11])
  (global-set-key [f11] 'jyby/cleanToUTF)

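I have many files "corrupted" by mixed encodings (from copy-pasting out of a browser with a bad font configuration), which produce the error below. Sometimes I clean them by hand, searching for each problematic symbol and replacing it with "" or the appropriate character; more quickly, I specify "utf-8-unix" as the encoding (which brings up the same message the next time I edit and save the file). In any such corrupted file, every accented character is replaced by a sequence that doubles in size on each save, which ends up doubling the size of the file. I am using GNU Emacs 24.2.1.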

These default coding systems were tried to encode text
in the buffer `test_accents.org':
(utf-8-unix (30 . 4194182) (33 . 4194182) (34 . 4194182) (37
. 4194182) (40 . 4194181) (41 . 4194182) (42 . 4194182) (45
. 4194182) (48 . 4194182) (49 . 4194182) (52 . 4194182))
However, each of them encountered characters it couldn't encode:
utf-8-unix cannot encode these:           ...

Click on a character (or switch to this window by `C-x o'
and select the characters by RET) to jump to the place it appears,
where `C-u C-x =' will give information about it.

Select one of the safe coding systems listed below,
or cancel the writing with C-g and edit the buffer
to remove or modify the problematic characters,
or specify any other coding system (and risk losing
the problematic characters).

raw-text emacs-mule no-conversion

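But is there a way to do the conversion automatically? For now I select each problematic character by hand and then run a search and replace to delete it throughout the document. I plan to write a Lisp function to automate this, but I don't know how to obtain the list of problematic characters automatically (and I would like to do something smarter, such as mapping é -> e, or projecting to the accented character in utf-8, or something like that...) - J..y B..y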
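As a rough illustration outside Emacs, the following Python sketch shows one common way that accented characters can double in size on every save (UTF-8 bytes misread as Latin-1 and then written out as UTF-8 again), how a single such round trip can be undone, and how "é" can be mapped to "e". Whether this is exactly what happened to the files described above is an assumption, and the function names `misdecode_once`, `repair_once` and `strip_accents` are made up for this example.

```python
import unicodedata


def misdecode_once(text: str) -> str:
    """Simulate one bad round trip: UTF-8 bytes misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_once(text: str) -> str:
    """Undo one bad round trip.

    Raises a UnicodeError if the text was not damaged in this way,
    so clean text is not silently altered.
    """
    return text.encode("latin-1").decode("utf-8")


def strip_accents(text: str) -> str:
    """Map accented letters to their base letters, e.g. 'é' -> 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    damaged = misdecode_once("é")
    # Size in bytes before and after one bad round trip.
    print(len("é".encode("utf-8")), len(damaged.encode("utf-8")))
    print(repair_once(damaged) == "é")
    print(strip_accents("déjà vu"))
```

Repairing only works when the damage really is a clean UTF-8/Latin-1 round trip; in a file where only some passages are affected, each passage would have to be repaired separately.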
2 Answers

2

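I have struggled with this in Emacs many times. When I have a file that got messed up, e.g. one opened in raw-text-unix mode, and I save it as utf-8, Emacs complains even about text that is already clean utf-8. I haven't found a way to make it complain only about the non-utf-8 parts.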

I just found a reasonable semi-automated approach using recode:

f=mixed-file
# Read from stdin so that recode acts as a filter and leaves $f untouched.
recode -f ..utf-8 < "$f" > /tmp/recode.out
diff "$f" /tmp/recode.out | cat -vt

# manually fix lines of text that can't be converted to utf-8 in $f,
# and re-run recode and diff until the output diff is empty.

A very helpful tool along the way is http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=342+200+224&mode=obytes

After that, I just reopen the file in Emacs and it is recognized as clean Unicode.
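If recode is not at hand, the "find the lines that cannot be converted" step can be approximated with a few lines of Python. This is only a sketch of the same idea, not a replacement for recode: it reports which lines fail to decode as UTF-8 so they can be fixed by hand. The function name `find_invalid_utf8_lines` and the sample bytes are made up for this example.

```python
def find_invalid_utf8_lines(data: bytes):
    """Return (line_number, raw_line) pairs for lines that are not valid UTF-8.

    Splitting on b"\\n" is safe because the byte 0x0A never occurs
    inside a multi-byte UTF-8 sequence.
    """
    bad = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad


if __name__ == "__main__":
    # Line 1 is UTF-8, line 2 is Latin-1, line 3 is plain ASCII.
    sample = "café\n".encode("utf-8") + "naïve\n".encode("latin-1") + b"plain\n"
    for number, line in find_invalid_utf8_lines(sample):
        print(number, line)
```

For a real file, the bytes would come from something like `open(path, "rb").read()`.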


1
Here is something that might help you get started:
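;; `defclass' and `defmethod' come from EIEIO; `incf', `decf' and `loop'
;; come from cl (newer Emacs versions prefer the cl-lib equivalents).
(require 'eieio)
(require 'cl)

(put 'eof-error 'error-conditions '(error eof-error))
(put 'eof-error 'error-message "End of stream")
(put 'bad-byte 'error-conditions '(error bad-byte))
(put 'bad-byte 'error-message "Not a UTF-8 byte")

;; A minimal byte stream over a unibyte string.
(defclass stream ()
  ((bytes :initarg :bytes :accessor bytes-of)
   (position :initform 0 :accessor position-of)))

;; Note the argument order: BYTE first, then the BIT index.
(defun logbitp (byte bit) (not (zerop (logand byte (ash 1 bit)))))

(defmethod read-byte ((this stream) &optional eof-error eof)
  "Return the next byte; at end of stream signal EOF-ERROR or return EOF."
  (with-slots (bytes position) this
    (if (< position (length bytes))
        (prog1 (aref bytes position) (incf position))
      (if eof-error (signal eof-error (list position)) eof))))

(defmethod unread-byte ((this stream))
  "Step back one byte, if possible."
  (when (> (position-of this) 0) (decf (position-of this))))

(defun read-utf8-char (stream)
  "Read one UTF-8 encoded character from STREAM and return its code point."
  (let ((byte (read-byte stream 'eof-error)))
    (if (not (logbitp byte 7)) byte     ; 0xxxxxxx: plain ASCII
      (let ((numbytes
             (cond
              ;; 10xxxxxx is a continuation byte, not a valid lead byte.
              ((not (logbitp byte 6)) (signal 'bad-byte (list byte)))
              ((not (logbitp byte 5))   ; 110xxxxx: 1 continuation byte
               (setf byte (logand #2r11111 byte)) 1)
              ((not (logbitp byte 4))   ; 1110xxxx: 2 continuation bytes
               (setf byte (logand #2r1111 byte)) 2)
              ((not (logbitp byte 3))   ; 11110xxx: 3 continuation bytes
               (setf byte (logand #2r111 byte)) 3)
              (t (signal 'bad-byte (list byte))))))
        (dotimes (b numbytes byte)
          (let ((next-byte (read-byte stream 'eof-error)))
            (if (and (logbitp next-byte 7) (not (logbitp next-byte 6)))
                (setf byte (logior (ash byte 6) (logand next-byte #2r111111)))
              (signal 'bad-byte (list next-byte)))))))))

(defun load-corrupt-file (file)
  "Return the contents of FILE decoded as UTF-8, reporting invalid bytes."
  (interactive "fFile to load: ")
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file)
    ;; Take the raw bytes while the buffer is still unibyte.
    (let ((stream (make-instance 'stream :bytes (buffer-string))))
      (with-output-to-string
        (loop for next-char =
              (condition-case err
                  (read-utf8-char stream)
                ;; Placeholder: report the byte and emit U+FFFD.
                ;; Replace this with your own repair logic.
                (bad-byte (message "Fix this byte %d" (cadr err)) #xFFFD)
                (eof-error nil))
              while next-char
              do (write-char next-char))))))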

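What this code does: it loads the file with no conversion and tries to read it as if it were UTF-8 encoded. Whenever it hits a byte that doesn't seem to belong to UTF-8, it signals an error that you have to deal with somehow; that is where the "Fix this byte" message appears. But you will need to be inventive about how to fix it...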
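One possible way to "fix this byte", sketched here in Python rather than Emacs Lisp, is to assume that every byte that is not part of valid UTF-8 is Latin-1 and to decode it as such. That assumption comes from the question (utf-8 mixed with latin-1) and may not hold for other files; the handler name `latin1_fallback` and the function `decode_mixed` are made up for this example.

```python
import codecs


def _latin1_fallback(err):
    """Error handler: decode the offending bytes as Latin-1 and carry on."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position at which to resume.
    return bad.decode("latin-1"), err.end


# Register the handler under a made-up name so that bytes.decode can use it.
codecs.register_error("latin1_fallback", _latin1_fallback)


def decode_mixed(data: bytes) -> str:
    """Decode DATA as UTF-8, treating invalid bytes as Latin-1 (an assumption)."""
    return data.decode("utf-8", errors="latin1_fallback")


if __name__ == "__main__":
    mixed = "café ".encode("utf-8") + "naïve".encode("latin-1")
    print(decode_mixed(mixed))
```

The result can then be written back out as UTF-8. The approach cannot detect Latin-1 text whose bytes happen to form valid UTF-8 (for example the Latin-1 string "Ã©"), so the output is still worth a quick look.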

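Looks interesting, thanks! But when I put it in a buffer and evaluate it, I get "eval-region: Symbol's function definition is void: defclass", which Google doesn't understand, and I don't understand EIEIO, which seems to be related. I'm using GNU Emacs 23.3.1 on Ubuntu Precise. - nealmcb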
