在Swift中处理Unicode代码点

73

如果您对蒙古语的细节不感兴趣,只想快速了解在Swift中使用和转换Unicode值的方法,请跳至已接受答案的第一部分。


背景

我想要在iOS应用程序中呈现传统蒙古文的Unicode文本。更好且长期的解决方案是使用AAT智能字体来呈现这种复杂的文字。(这样的字体确实存在, 但其许可证不允许修改和非个人使用。) 然而,由于我从未制作过字体,更不用说AAT字体的所有渲染逻辑了,所以我计划暂时自己在Swift中进行渲染。也许在以后的某个日期,我可以学习制作智能字体。

在外部,我将使用Unicode文本,但在内部(用于在UITextView中显示),我将把Unicode转换为存储在哑字体中的单个字形(使用Unicode PUA值编码)。因此,我的渲染引擎需要将蒙古语Unicode值(范围:U+1820到U+1842)转换为存储在PUA中的字形值(范围:U+E360到U+E5CF)。无论如何,这是我的计划,因为这是我过去在Java中所做的, 但也许我需要改变整个思考方式。

示例

以下图像显示了用两种不同的形式书写的蒙古语su,其中字母u采用不同的形式(红色)。 (蒙古文是竖排书写的,字母连接在一起,类似于英文草书字母。)

enter image description here

在Unicode中,这两个字符串将被表示为:
var suForm1: String = "\u{1830}\u{1826}"
var suForm2: String = "\u{1830}\u{1826}\u{180B}"

suForm2 中,自由变量选择器(U+180B)被 Swift 的 String 正确地识别为与其前面的 u (U+1826) 一起构成的单元。Swift 认为它是一个单一字符,即扩展字形簇。但是,为了自己进行渲染,我需要将 u (U+1826) 和 FVS1 (U+180B) 区分开来,作为两个不同的 UTF-16 代码点。
为了内部显示目的,我会将上述 Unicode 字符串转换为以下呈现的字形字符串:
suForm1 = "\u{E46F}\u{E3BA}" 
suForm2 = "\u{E46F}\u{E3BB}"

问题

我一直在使用 Swift 的 StringCharacter 进行实验。它们有很多方便之处,但由于在我特定的情况下,我只处理 UTF-16 代码单元,所以我想知道是否应该使用旧的 NSString 而不是 Swift 的 String。我意识到我可以使用 String.utf16 获取 UTF-16 代码点,但转换回 String 不太好

您认为是坚持使用StringCharacter更好,还是应该使用NSStringunichar

我所阅读的内容


1
你的问题对我来说不太清楚。count(string) 返回“扩展Unicode图形簇”的数量,count(string.utf16) 返回相同字符串所需的UTF-16代码点数量(即相应的NSStringCFString的长度)。 (而count(string.utf8)将返回UTF-8代码点的数量)。关于问题“我是否应该每次引用字符串时都使用suForm1.utf16这样的写法?”无法一概而论,这取决于你需要计算数量的目的。 - Martin R
我需要专门使用UTF-16代码点。计算字符串长度(以UTF-16为单位)是我需要做的一件事,但这只是一个例子。我还需要做一些比较字符的操作(即UTF-16代码点,而不仅仅是Swift认为相等的字形簇)。我应该使用String还是NSString或其他东西? - Suragch
“length”是什么意思?字节?代码单元?代码点?字形簇?像素?同样,“比较”也是如此。代码点相等吗?规范等价性?兼容等价性? - 一二三
UTF-16代码单元。我重新修改了问题以使其更加清晰。 - Suragch
3
我已经向https://dev59.com/jIHba4cB1Zd3GeqPNy0f#24757284添加了另一个可能的解决方案。你应该使用UInt16 / unichar数组或NSString作为中间表示来处理。将NSString转换为String实际上很容易。 - Martin R
2个回答

99

更新至Swift 3

字符串和字符

对于未来的几乎所有访问此问题的人,StringCharacter将是您的答案。

直接在代码中设置Unicode值:

var str: String = "I want to visit 北京, Москва, मुंबई, القاهرة, and 서울시. "
var character: Character = ""

使用十六进制设置值

var str: String = "\u{61}\u{5927}\u{1F34E}\u{3C0}" // a大π
var character: Character = "\u{65}\u{301}" // é = "e" + accent mark

请注意,Swift字符可以由多个Unicode代码点组成,但看起来是一个单独的字符。这称为扩展字形集群。

另请参见此问题

转换为Unicode值:

str.utf8
str.utf16
str.unicodeScalars // UTF-32

String(character).utf8
String(character).utf16
String(character).unicodeScalars

从Unicode十六进制值转换:

let hexValue: UInt32 = 0x1F34E

// convert hex value to UnicodeScalar
guard let scalarValue = UnicodeScalar(hexValue) else {
    // early exit if hex does not form a valid unicode value
    return
}

// convert UnicodeScalar to String
let myString = String(scalarValue) // 
或者也可以:
let hexValue: UInt32 = 0x1F34E
if let scalarValue = UnicodeScalar(hexValue) {
    let myString = String(scalarValue)
}

更多例子

let value0: UInt8 = 0x61
let value1: UInt16 = 0x5927
let value2: UInt32 = 0x1F34E

let string0 = String(UnicodeScalar(value0)) // a
let string1 = String(UnicodeScalar(value1)) // 大
let string2 = String(UnicodeScalar(value2)) // 

// convert hex array to String
let myHexArray = [0x43, 0x61, 0x74, 0x203C, 0x1F431] // an Int array
var myString = ""
for hexValue in myHexArray {
    myString.append(UnicodeScalar(hexValue))
}
print(myString) // Cat‼

请注意,对于UTF-8和UTF-16,转换并不总是那么容易。(请参阅UTF-8UTF-16UTF-32问题。)

NSString和unichar

在Swift中还可以使用NSStringunichar,但你应该意识到,除非你熟悉Objective C并且擅长将语法转换为Swift,否则很难找到好的文档。

另外,unichar是一个UInt16数组,如上所述,从UInt16到Unicode标量值的转换并不总是容易的(也就是说,转换代理对用于表情符号和其他位于高代码平面的字符)。

自定义字符串结构

出于上述原因,我最终没有使用以上任何一种方法。相反,我编写了自己的字符串结构,这基本上是一个UInt32数组,用于保存Unicode标量值。

同样,对大多数人来说,这不是解决方案。如果你只需要稍微扩展StringCharacter的功能,请首先考虑使用扩展(extension)

但是,如果您确实需要专门处理Unicode标量值,可以编写自定义结构。

优点是:

  • 在进行字符串操作时不需要经常切换类型(如StringCharacterUnicodeScalarUInt32等)。
  • 在Unicode操作完成后,最终转换为String很容易。
  • 需要添加更多方法时很容易
  • 简化从Java或其他语言转换代码的过程

缺点是:

  • 使代码对其他Swift开发者来说不太可移植和易读
  • 没有本地Swift类型那样经过完善测试和优化
  • 每次需要时都要包含另一个文件

您可以自己编写,但以下是我的参考代码。最困难的部分是使其可哈希

// This struct is an array of UInt32 to hold Unicode scalar values
// Version 3.4.0 (Swift 3 update)


struct ScalarString: Sequence, Hashable, CustomStringConvertible {
    
    fileprivate var scalarArray: [UInt32] = []
    
    
    init() {
        // does anything need to go here?
    }
    
    init(_ character: UInt32) {
        self.scalarArray.append(character)
    }
    
    init(_ charArray: [UInt32]) {
        for c in charArray {
            self.scalarArray.append(c)
        }
    }
    
    init(_ string: String) {
        
        for s in string.unicodeScalars {
            self.scalarArray.append(s.value)
        }
    }
    
    // Generator in order to conform to SequenceType protocol
    // (to allow users to iterate as in `for myScalarValue in myScalarString` { ... })
    func makeIterator() -> AnyIterator<UInt32> {
        return AnyIterator(scalarArray.makeIterator())
    }
    
    // append
    mutating func append(_ scalar: UInt32) {
        self.scalarArray.append(scalar)
    }
    
    mutating func append(_ scalarString: ScalarString) {
        for scalar in scalarString {
            self.scalarArray.append(scalar)
        }
    }
    
    mutating func append(_ string: String) {
        for s in string.unicodeScalars {
            self.scalarArray.append(s.value)
        }
    }
    
    // charAt
    func charAt(_ index: Int) -> UInt32 {
        return self.scalarArray[index]
    }
    
    // clear
    mutating func clear() {
        self.scalarArray.removeAll(keepingCapacity: true)
    }
    
    // contains
    func contains(_ character: UInt32) -> Bool {
        for scalar in self.scalarArray {
            if scalar == character {
                return true
            }
        }
        return false
    }
    
    // description (to implement Printable protocol)
    var description: String {
        return self.toString()
    }
    
    // endsWith
    func endsWith() -> UInt32? {
        return self.scalarArray.last
    }
    
    // indexOf
    // returns first index of scalar string match
    func indexOf(_ string: ScalarString) -> Int? {
        
        if scalarArray.count < string.length {
            return nil
        }
        
        for i in 0...(scalarArray.count - string.length) {
            
            for j in 0..<string.length {
                
                if string.charAt(j) != scalarArray[i + j] {
                    break // substring mismatch
                }
                if j == string.length - 1 {
                    return i
                }
            }
        }
        
        return nil
    }
    
    // insert
    mutating func insert(_ scalar: UInt32, atIndex index: Int) {
        self.scalarArray.insert(scalar, at: index)
    }
    mutating func insert(_ string: ScalarString, atIndex index: Int) {
        var newIndex = index
        for scalar in string {
            self.scalarArray.insert(scalar, at: newIndex)
            newIndex += 1
        }
    }
    mutating func insert(_ string: String, atIndex index: Int) {
        var newIndex = index
        for scalar in string.unicodeScalars {
            self.scalarArray.insert(scalar.value, at: newIndex)
            newIndex += 1
        }
    }
    
    // isEmpty
    var isEmpty: Bool {
        return self.scalarArray.count == 0
    }
    
    // hashValue (to implement Hashable protocol)
    var hashValue: Int {
        
        // DJB Hash Function
        return self.scalarArray.reduce(5381) {
            ($0 << 5) &+ $0 &+ Int($1)
        }
    }
    
    // length
    var length: Int {
        return self.scalarArray.count
    }
    
    // remove character
    mutating func removeCharAt(_ index: Int) {
        self.scalarArray.remove(at: index)
    }
    func removingAllInstancesOfChar(_ character: UInt32) -> ScalarString {
        
        var returnString = ScalarString()
        
        for scalar in self.scalarArray {
            if scalar != character {
                returnString.append(scalar)
            }
        }
        
        return returnString
    }
    func removeRange(_ range: CountableRange<Int>) -> ScalarString? {
        
        if range.lowerBound < 0 || range.upperBound > scalarArray.count {
            return nil
        }
        
        var returnString = ScalarString()
        
        for i in 0..<scalarArray.count {
            if i < range.lowerBound || i >= range.upperBound {
                returnString.append(scalarArray[i])
            }
        }
        
        return returnString
    }
    
    
    // replace
    func replace(_ character: UInt32, withChar replacementChar: UInt32) -> ScalarString {
        
        var returnString = ScalarString()
        
        for scalar in self.scalarArray {
            if scalar == character {
                returnString.append(replacementChar)
            } else {
                returnString.append(scalar)
            }
        }
        return returnString
    }
    func replace(_ character: UInt32, withString replacementString: String) -> ScalarString {
        
        var returnString = ScalarString()
        
        for scalar in self.scalarArray {
            if scalar == character {
                returnString.append(replacementString)
            } else {
                returnString.append(scalar)
            }
        }
        return returnString
    }
    func replaceRange(_ range: CountableRange<Int>, withString replacementString: ScalarString) -> ScalarString {
        
        var returnString = ScalarString()
        
        for i in 0..<scalarArray.count {
            if i < range.lowerBound || i >= range.upperBound {
                returnString.append(scalarArray[i])
            } else if i == range.lowerBound {
                returnString.append(replacementString)
            }
        }
        return returnString
    }
    
    // set (an alternative to myScalarString = "some string")
    mutating func set(_ string: String) {
        self.scalarArray.removeAll(keepingCapacity: false)
        for s in string.unicodeScalars {
            self.scalarArray.append(s.value)
        }
    }
    
    // split
    func split(atChar splitChar: UInt32) -> [ScalarString] {
        var partsArray: [ScalarString] = []
        if self.scalarArray.count == 0 {
            return partsArray
        }
        var part: ScalarString = ScalarString()
        for scalar in self.scalarArray {
            if scalar == splitChar {
                partsArray.append(part)
                part = ScalarString()
            } else {
                part.append(scalar)
            }
        }
        partsArray.append(part)
        return partsArray
    }
    
    // startsWith
    func startsWith() -> UInt32? {
        return self.scalarArray.first
    }
    
    // substring
    func substring(_ startIndex: Int) -> ScalarString {
        // from startIndex to end of string
        var subArray: ScalarString = ScalarString()
        for i in startIndex..<self.length {
            subArray.append(self.scalarArray[i])
        }
        return subArray
    }
    func substring(_ startIndex: Int, _ endIndex: Int) -> ScalarString {
        // (startIndex is inclusive, endIndex is exclusive)
        var subArray: ScalarString = ScalarString()
        for i in startIndex..<endIndex {
            subArray.append(self.scalarArray[i])
        }
        return subArray
    }
    
    // toString
    func toString() -> String {
        var string: String = ""
        
        for scalar in self.scalarArray {
            if let validScalor = UnicodeScalar(scalar) {
                string.append(Character(validScalor))
            }
        }
        return string
    }
    
    // trim
    // removes leading and trailing whitespace (space, tab, newline)
    func trim() -> ScalarString {
        
        //var returnString = ScalarString()
        let space: UInt32 = 0x00000020
        let tab: UInt32 = 0x00000009
        let newline: UInt32 = 0x0000000A
        
        var startIndex = self.scalarArray.count
        var endIndex = 0
        
        // leading whitespace
        for i in 0..<self.scalarArray.count {
            if self.scalarArray[i] != space &&
                self.scalarArray[i] != tab &&
                self.scalarArray[i] != newline {
                
                startIndex = i
                break
            }
        }
        
        // trailing whitespace
        for i in stride(from: (self.scalarArray.count - 1), through: 0, by: -1) {
            if self.scalarArray[i] != space &&
                self.scalarArray[i] != tab &&
                self.scalarArray[i] != newline {
                
                endIndex = i + 1
                break
            }
        }
        
        if endIndex <= startIndex {
            return ScalarString()
        }
        
        return self.substring(startIndex, endIndex)
    }
    
    // values
    func values() -> [UInt32] {
        return self.scalarArray
    }
    
}

func ==(left: ScalarString, right: ScalarString) -> Bool {
    return left.scalarArray == right.scalarArray
}

func +(left: ScalarString, right: ScalarString) -> ScalarString {
    var returnString = ScalarString()
    for scalar in left.values() {
        returnString.append(scalar)
    }
    for scalar in right.values() {
        returnString.append(scalar)
    }
    return returnString
}

0
//Swift 3.0  
// This struct is an array of UInt32 to hold Unicode scalar values
struct ScalarString: Sequence, Hashable, CustomStringConvertible {

    private var scalarArray: [UInt32] = []

    init() {
        // does anything need to go here?
    }

    init(_ character: UInt32) {
        self.scalarArray.append(character)
    }

    init(_ charArray: [UInt32]) {
        for c in charArray {
            self.scalarArray.append(c)
        }
    }

    init(_ string: String) {

        for s in string.unicodeScalars {
            self.scalarArray.append(s.value)
        }
    }

    // Generator in order to conform to SequenceType protocol
    // (to allow users to iterate as in `for myScalarValue in myScalarString` { ... })

    //func generate() -> AnyIterator<UInt32> {
    func makeIterator() -> AnyIterator<UInt32> {

        let nextIndex = 0

        return AnyIterator {
            if (nextIndex > self.scalarArray.count-1) {
                return nil
            }
            return self.scalarArray[nextIndex + 1]
        }
    }

    // append
    mutating func append(scalar: UInt32) {
        self.scalarArray.append(scalar)
    }

    mutating func append(scalarString: ScalarString) {
        for scalar in scalarString {
            self.scalarArray.append(scalar)
        }
    }

    mutating func append(string: String) {
        for s in string.unicodeScalars {
            self.scalarArray.append(s.value)
        }
    }

    // charAt
    func charAt(index: Int) -> UInt32 {
        return self.scalarArray[index]
    }

    // clear
    mutating func clear() {
        self.scalarArray.removeAll(keepingCapacity: true)
    }

    // contains
    func contains(character: UInt32) -> Bool {
        for scalar in self.scalarArray {
            if scalar == character {
                return true
            }
        }
        return false
    }

    // description (to implement Printable protocol)
    var description: String {

        var string: String = ""

        for scalar in scalarArray {
            string.append(String(describing: UnicodeScalar(scalar))) //.append(UnicodeScalar(scalar)!)
        }
        return string
    }

    // endsWith
    func endsWith() -> UInt32? {
        return self.scalarArray.last
    }

    // insert
    mutating func insert(scalar: UInt32, atIndex index: Int) {
        self.scalarArray.insert(scalar, at: index)
    }

    // isEmpty
    var isEmpty: Bool {
        get {
            return self.scalarArray.count == 0
        }
    }

    // hashValue (to implement Hashable protocol)
    var hashValue: Int {
        get {

            // DJB Hash Function
            var hash = 5381

            for i in 0 ..< scalarArray.count {
                hash = ((hash << 5) &+ hash) &+ Int(self.scalarArray[i])
            }
            /*
             for i in 0..< self.scalarArray.count {
             hash = ((hash << 5) &+ hash) &+ Int(self.scalarArray[i])
             }
             */
            return hash
        }
    }

    // length
    var length: Int {
        get {
            return self.scalarArray.count
        }
    }

    // remove character
    mutating func removeCharAt(index: Int) {
        self.scalarArray.remove(at: index)
    }
    func removingAllInstancesOfChar(character: UInt32) -> ScalarString {

        var returnString = ScalarString()

        for scalar in self.scalarArray {
            if scalar != character {
                returnString.append(scalar: scalar) //.append(scalar)
            }
        }

        return returnString
    }

    // replace
    func replace(character: UInt32, withChar replacementChar: UInt32) -> ScalarString {

        var returnString = ScalarString()

        for scalar in self.scalarArray {
            if scalar == character {
                returnString.append(scalar: replacementChar) //.append(replacementChar)
            } else {
                returnString.append(scalar: scalar) //.append(scalar)
            }
        }
        return returnString
    }

    // func replace(character: UInt32, withString replacementString: String) -> ScalarString {
    func replace(character: UInt32, withString replacementString: ScalarString) -> ScalarString {

        var returnString = ScalarString()

        for scalar in self.scalarArray {
            if scalar == character {
                returnString.append(scalarString: replacementString) //.append(replacementString)
            } else {
                returnString.append(scalar: scalar) //.append(scalar)
            }
        }
        return returnString
    }

    // set (an alternative to myScalarString = "some string")
    mutating func set(string: String) {
        self.scalarArray.removeAll(keepingCapacity: false)
        for s in string.unicodeScalars {
            self.scalarArray.append(s.value)
        }
    }

    // split
    func split(atChar splitChar: UInt32) -> [ScalarString] {
        var partsArray: [ScalarString] = []
        var part: ScalarString = ScalarString()
        for scalar in self.scalarArray {
            if scalar == splitChar {
                partsArray.append(part)
                part = ScalarString()
            } else {
                part.append(scalar: scalar) //.append(scalar)
            }
        }
        partsArray.append(part)
        return partsArray
    }

    // startsWith
    func startsWith() -> UInt32? {
        return self.scalarArray.first
    }

    // substring
    func substring(startIndex: Int) -> ScalarString {
        // from startIndex to end of string
        var subArray: ScalarString = ScalarString()
        for i in startIndex ..< self.length {
            subArray.append(scalar: self.scalarArray[i]) //.append(self.scalarArray[i])
        }
        return subArray
    }
    func substring(startIndex: Int, _ endIndex: Int) -> ScalarString {
        // (startIndex is inclusive, endIndex is exclusive)
        var subArray: ScalarString = ScalarString()
        for i in startIndex ..< endIndex {
            subArray.append(scalar: self.scalarArray[i]) //.append(self.scalarArray[i])
        }
        return subArray
    }

    // toString
    func toString() -> String {
        let string: String = ""

        for scalar in self.scalarArray {
            string.appending(String(describing:UnicodeScalar(scalar))) //.append(UnicodeScalar(scalar)!)
        }
        return string
    }

    // values
    func values() -> [UInt32] {
        return self.scalarArray
    }

}

func ==(left: ScalarString, right: ScalarString) -> Bool {

    if left.length != right.length {
        return false
    }

    for i in 0 ..< left.length {
        if left.charAt(index: i) != right.charAt(index: i) {
            return false
        }
    }

    return true
}

func +(left: ScalarString, right: ScalarString) -> ScalarString {
    var returnString = ScalarString()
    for scalar in left.values() {
        returnString.append(scalar: scalar) //.append(scalar)
    }
    for scalar in right.values() {
        returnString.append(scalar: scalar) //.append(scalar)
    }
    return returnString
}

我认为使用String(describing:UnicodeScalar(scalar))会产生带有“Optional”字样的字符串。请查看我的答案更新,以获取更好的方法。 - Suragch

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接