Is there an easy way to convert between ASCII characters and their Asian fullwidth Unicode counterparts? For example:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~
０１２３４５６７８９ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ！゛＃＄％＆（）＊＋、ー。／：；〈＝＝〉？＠［\\］＾＿‘｛｜｝～
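These "wide" characters are called FULLWIDTH LATIN LETTER: http://www.unicodemap.org/range/87/Halfwidth%20and%20Fullwidth%20Forms/
They are in the range 0xFF00 to 0xFFEF. You can build a lookup table, or simply add 0xFEE0 to the ASCII code.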
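As a minimal Python 3 sketch of the "add 0xFEE0" approach: the function name `to_fullwidth` is hypothetical, and mapping the space to U+3000 is an assumption based on the fact that U+FF00 is unassigned.

```python
def to_fullwidth(text):
    """Map printable ASCII to its fullwidth form; leave other characters alone."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x21 <= cp <= 0x7E:
            # U+0021..U+007E line up with U+FF01..U+FF5E at a fixed offset.
            out.append(chr(cp + 0xFEE0))
        elif cp == 0x20:
            # Assumption: use U+3000 IDEOGRAPHIC SPACE, since U+FF00 is unassigned.
            out.append("\u3000")
        else:
            out.append(ch)
    return "".join(out)


print(to_fullwidth("Hello, world!"))
```

Characters outside printable ASCII pass through unchanged, so the function can be applied to mixed text.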
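chr(0xFF20 + ord(asciichar)) :) - werewindle
ValueError: chr() arg not in range(256) - user975135
unichr(0xFF20 + ord(asciichar)). - werewindle
unichr(0xFEE0 + ord(asciichar)). Now it works correctly. I have fixed the answer. - werewindle
The range of fullwidth replacements for ASCII characters starts at U+FF01, not U+FF00. Oddly, U+FF00 is not defined. To get a fullwidth space, you need U+3000 IDEOGRAPHIC SPACE. Don't rely on typing what looks like the right thing and checking the mapping by visual inspection of the characters: unicodedata.name is your friend. Here is some sample code: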
# coding: utf-8
from unicodedata import name as ucname
# OP
normal = u"""0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~"""
wide = u"""０１２３４５６７８９ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ！゛＃＄％＆（）＊＋、ー。／：；〈＝〉？＠［\\］＾＿‘｛｜｝～"""
# above after editing (had = twice)
widemapOP = dict((ord(x[0]), x[1]) for x in zip(normal, wide))
# Ignacio V
normal = u' 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
wide = u'　０１２３４５６７８９ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ！゛＃＄％＆（）＊＋、ー。／：；〈＝〉？＠［\\］＾＿‘｛｜｝～'
widemapIV = dict((ord(x[0]), x[1]) for x in zip(normal, wide))
# JM
widemapJM = dict((i, i + 0xFF00 - 0x20) for i in xrange(0x21, 0x7F))
widemapJM[0x20] = 0x3000 # IDEOGRAPHIC SPACE
maps = {'OP': widemapOP, 'IV': widemapIV, 'JM': widemapJM}.items()
for i in xrange(0x20, 0x7F):
    a = unichr(i)
    na = ucname(a, '?')
    for tag, widemap in maps:
        w = a.translate(widemap)
        nw = ucname(w, '?')
        if nw != "FULLWIDTH " + na:
            print "%s: %04X %s => %04X %s" % (tag, i, na, ord(w), nw)
Running this shows what you actually have: some missing mappings and some idiosyncratic mappings:
JM: 0020 SPACE => 3000 IDEOGRAPHIC SPACE
IV: 0020 SPACE => 3000 IDEOGRAPHIC SPACE
OP: 0020 SPACE => 0020 SPACE
IV: 0022 QUOTATION MARK => 309B KATAKANA-HIRAGANA VOICED SOUND MARK
OP: 0022 QUOTATION MARK => 309B KATAKANA-HIRAGANA VOICED SOUND MARK
IV: 0027 APOSTROPHE => 0027 APOSTROPHE
OP: 0027 APOSTROPHE => 0027 APOSTROPHE
IV: 002C COMMA => 3001 IDEOGRAPHIC COMMA
OP: 002C COMMA => 3001 IDEOGRAPHIC COMMA
IV: 002D HYPHEN-MINUS => 30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
OP: 002D HYPHEN-MINUS => 30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
IV: 002E FULL STOP => 3002 IDEOGRAPHIC FULL STOP
OP: 002E FULL STOP => 3002 IDEOGRAPHIC FULL STOP
IV: 003C LESS-THAN SIGN => 3008 LEFT ANGLE BRACKET
OP: 003C LESS-THAN SIGN => 3008 LEFT ANGLE BRACKET
IV: 003E GREATER-THAN SIGN => 3009 RIGHT ANGLE BRACKET
OP: 003E GREATER-THAN SIGN => 3009 RIGHT ANGLE BRACKET
IV: 005C REVERSE SOLIDUS => 005C REVERSE SOLIDUS
OP: 005C REVERSE SOLIDUS => 005C REVERSE SOLIDUS
IV: 0060 GRAVE ACCENT => 2018 LEFT SINGLE QUOTATION MARK
OP: 0060 GRAVE ACCENT => 2018 LEFT SINGLE QUOTATION MARK
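The same name-based check can be sketched in Python 3. This is only an illustration: the helper names `build_fullwidth_map` and `mismatches` are hypothetical, and the table is the offset-based one (with U+3000 for the space) rather than any of the hand-typed strings.

```python
import unicodedata


def build_fullwidth_map():
    """Offset-based table: U+0021..U+007E -> U+FF01..U+FF5E, space -> U+3000."""
    table = {cp: cp + 0xFEE0 for cp in range(0x21, 0x7F)}
    table[0x20] = 0x3000  # IDEOGRAPHIC SPACE
    return table


def mismatches(table):
    """Return (source, target) code points whose target name is not 'FULLWIDTH ' + source name."""
    bad = []
    for cp in range(0x20, 0x7F):
        src = chr(cp)
        dst = src.translate(table)
        expected = "FULLWIDTH " + unicodedata.name(src, "?")
        if unicodedata.name(dst, "?") != expected:
            bad.append((cp, ord(dst)))
    return bad


for src_cp, dst_cp in mismatches(build_fullwidth_map()):
    print("%04X => %04X %s" % (src_cp, dst_cp, unicodedata.name(chr(dst_cp), "?")))
```

With this table the only entry reported should be the space, which matches the single JM line in the output above.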
>>> normal = u' 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
>>> wide = u'　０１２３４５６７８９ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ！゛＃＄％＆（）＊＋、ー。／：；〈＝〉？＠［\\］＾＿‘｛｜｝～'
>>> widemap = dict((ord(x[0]), x[1]) for x in zip(normal, wide))
>>> print u'Hello, world!'.translate(widemap)
Ｈｅｌｌｏ、　ｗｏｒｌｄ！
I get Gello+ orldZ as the translation of "Hello world!". - juliomalegria
Yes, in Python 3 the cleanest way is to use str.translate and str.maketrans:
HALFWIDTH_TO_FULLWIDTH = str.maketrans(
    '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&()*+,-./:;<=>?@[]^_`{|}~',
    '０１２３４５６７８９ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ！゛＃＄％＆（）＊＋、ー。／：；〈＝〉？＠［］＾＿‘｛｜｝～')

def halfwidth_to_fullwidth(s):
    return s.translate(HALFWIDTH_TO_FULLWIDTH)
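A variant of the str.maketrans approach, sketched under the assumption that you want the strict FULLWIDTH counterparts (U+FF01..U+FF5E plus U+3000 for the space) rather than the hand-typed punctuation above. Generating both strings from code point ranges avoids typing mistakes, and inverting the table gives the reverse direction. The names `HALF_TO_FULL`, `FULL_TO_HALF`, `widen` and `narrow` are hypothetical.

```python
# Printable ASCII and the fullwidth forms line up one to one at offset 0xFEE0.
ASCII_CHARS = "".join(chr(cp) for cp in range(0x21, 0x7F))
FULLWIDTH_CHARS = "".join(chr(cp + 0xFEE0) for cp in range(0x21, 0x7F))

# Assumption: the space maps to U+3000 IDEOGRAPHIC SPACE.
HALF_TO_FULL = str.maketrans(ASCII_CHARS + " ", FULLWIDTH_CHARS + "\u3000")

# str.maketrans returns a dict of code points, so the inverse is a plain dict flip.
FULL_TO_HALF = {full: half for half, full in HALF_TO_FULL.items()}


def widen(s):
    return s.translate(HALF_TO_FULL)


def narrow(s):
    return s.translate(FULL_TO_HALF)


print(widen("Hello, world!"))
print(narrow(widen("Hello, world!")))
```

Because every ASCII character maps to a distinct target, the flipped dict loses nothing and the round trip returns the original string.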
import unicodedata

unicode_range = (0, 0x10ffff)

# Create a dict whose keys are character names and whose values are
# the corresponding Unicode characters, for every character that has a name.
chars = {}
for uc_point in range(unicode_range[0], unicode_range[1]+1):
    char = chr(uc_point)
    try:
        name = unicodedata.name(char)
        chars[name] = char
    except ValueError:  # chars with no name, such as control characters
        pass

def normal(name):
    # 'IDEOGRAPHIC COMMA' -> 'COMMA'
    # 'HALFWIDTH IDEOGRAPHIC COMMA' -> 'COMMA'
    # 'LATIN SMALL LETTER A' -> None
    # so we want to look for these at the start of character names:
    starts = ['HALFWIDTH IDEOGRAPHIC','IDEOGRAPHIC','FULLWIDTH','HALFWIDTH']
    l = [name[len(start)+1:] for start in starts if name.startswith(start)]
    if l:
        return l[0]
    else:
        return None

# who doesn't love a bit of dict comprehension for the finish:
mapping = {chars[name]: chars[normal(name)] for name in chars if normal(name) in chars}
>>> ''.join(mapping.keys())
'\u3000、。!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~⦅⦆。「」、・ヲァィゥェォャュョッーアイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワンᅠᄀᄁᆪᄂᆬᆭᄃᄄᄅᆰᆱᆲᆳᆴᆵᄚᄆᄇᄈᄡᄉᄊᄋᄌᄍᄎᄏᄐᄑ하ᅢᅣᅤᅥᅦᅧᅨᅩᅪᅫᅬᅭᅮᅯᅰᅱᅲᅳᅴᅵ¢£¬ ̄¦¥₩←↑→↓■○'
>>> ''.join(mapping.values())
' ,.!"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~⦅⦆.「」,・ヲァィゥェォャュョッーアイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワンㅤㄱㄲㄳㄴㄵㄶㄷㄸㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅃㅄㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ¢£¬¯¦¥₩←↑→↓■○'
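A name-based table like this can be handed to str.translate. The sketch below rebuilds a smaller version of it so that it stays self-contained. It is an illustration under two assumptions: it scans only U+3000..U+3002 and the Halfwidth and Fullwidth Forms block instead of all of Unicode, and it resolves the stripped name with unicodedata.lookup instead of a precomputed dict. The names `base_name`, `build_table` and `to_normal` are hypothetical.

```python
import unicodedata

# Longer prefixes first, so 'HALFWIDTH IDEOGRAPHIC COMMA' becomes 'COMMA'.
PREFIXES = ("HALFWIDTH IDEOGRAPHIC ", "IDEOGRAPHIC ", "FULLWIDTH ", "HALFWIDTH ")


def base_name(name):
    """Strip a width prefix from a character name, or return None if there is none."""
    for prefix in PREFIXES:
        if name.startswith(prefix):
            return name[len(prefix):]
    return None


def build_table(code_points):
    """Map each code point to the character named by its prefix-stripped name."""
    table = {}
    for cp in code_points:
        name = unicodedata.name(chr(cp), None)
        base = base_name(name) if name else None
        if base is None:
            continue
        try:
            table[cp] = unicodedata.lookup(base)
        except KeyError:
            pass  # the stripped name is not a character name of its own
    return table


# Assumption: restrict the scan to these two ranges to keep it fast.
CODE_POINTS = list(range(0x3000, 0x3003)) + list(range(0xFF00, 0xFFF0))
TO_NORMAL = build_table(CODE_POINTS)


def to_normal(s):
    return s.translate(TO_NORMAL)


print(to_normal("\uff28\uff45\uff4c\uff4c\uff4f\uff0c\u3000\uff57\uff4f\uff52\uff4c\uff44\uff01"))
```

As in the output above, this direction is lossy: both U+3001 IDEOGRAPHIC COMMA and U+FF0C FULLWIDTH COMMA end up as an ASCII comma.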
This one goes in one direction only:
#!/usr/bin/env perl
# uniwide
use utf8;
use strict;
use warnings;
use open qw(:std :utf8);
while (<>) {
    s/\s/\x{A0}\x{A0}/g if tr
        <!"#$%&´()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~¢£>
        <！＂＃＄％＆＇（）＊＋，－．／０１２３４５６７８９：；＜＝＞？＠ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ［＼］＾＿｀ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ｛｜｝～￠￡>;
} continue {
    print;
}
close(STDOUT) || die "can't close stdout: $!";
The same goes for this one, in the other direction:
#!/usr/bin/env perl
# uninarrow
use utf8;
use strict;
use warnings;
use open qw(:std :utf8);
while (<>) {
    s/ / /g if tr
        <！＂＃＄％＆＇（）＊＋，－．／０１２３４５６７８９：；＜＝＞？＠ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ［＼］＾＿｀ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ｛｜｝～￠￡>
        <!"#$%&´()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~¢£>;
} continue {
    print;
}
close(STDOUT) || die "can't close stdout: $!";
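The python tag really is too small... :o) - deceze
The UTF-8 encoding of ASCII characters is byte-for-byte identical to ASCII. For UTF-16, just add a zero byte after (LE) or before (BE) each character.
Or, in Python, use mystr.encode("utf-8").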
0123456789a != ０１２３４５６７８９ａ; != means "not equal". - glglgl
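A short Python 3 sketch of both points: encoding an ASCII character only changes its byte representation, while the fullwidth character is a different code point with different bytes. The variable names are illustrative only.

```python
ascii_a = "a"           # U+0061 LATIN SMALL LETTER A
fullwidth_a = "\uff41"  # U+FF41 FULLWIDTH LATIN SMALL LETTER A

# ASCII text encodes to the same single byte in UTF-8.
utf8_ascii = ascii_a.encode("utf-8")

# UTF-16 pads ASCII with a zero byte: after the character for LE, before it for BE.
utf16_le = ascii_a.encode("utf-16-le")
utf16_be = ascii_a.encode("utf-16-be")

# The fullwidth form is a different character, so it encodes to different bytes.
utf8_fullwidth = fullwidth_a.encode("utf-8")

print(utf8_ascii, utf16_le, utf16_be, utf8_fullwidth)
print(ascii_a == fullwidth_a)
```

Changing the encoding therefore never turns "a" into its fullwidth form; that needs a character mapping such as the ones in the answers above.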