HTML character decoding in Objective-C / Cocoa Touch

103
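First off, I found this: Objective C HTML escape/unescape, but it doesn't work for me.
My encoded characters (which come from an RSS feed, by the way) look like this: &#038; I searched all over the net and found related discussions, but no fix for my particular encoding. I think they are called hexadecimal characters.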

3
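This comment comes six months after the original question, so it is more for those who stumble across this question while looking for an answer and a solution. A very similar question came up just recently, and I answered it. https://dev59.com/z0vSa4cB1Zd3GeqPifKd#2260140 It uses RegexKitLite and Blocks to search a string and replace each "&#...;" with its equivalent character. - johne
What, specifically, "doesn't work"? I don't see anything in this question that isn't a duplicate of that earlier question. - Peter Hosey
That's decimal. Hexadecimal is &#x38; - kennytm
The difference between decimal and hexadecimal is that decimal is base 10 and hexadecimal is base 16. "38" is a different number in each base: in base 10 it is 3×10 + 8×1 = thirty-eight, and in base 16 it is 3×16 + 8×1 = fifty-six. Higher digits are (multiples of) higher powers of the base; the lowest whole digit is base^0 (= 1), the next higher digit is base^1 (= base), the next is base^2 (= base × base), and so on. This is how exponentiation works. - Peter Hosey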
https://dev59.com/SV8e5IYBdhLWcg3wlbFR - Sanju
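
The decimal versus hexadecimal point made in the comments above can be checked directly. This is a minimal sketch, assuming a hypothetical helper named `ParseDigits` (it does not come from the thread), that reads the same digits "38" in both bases with the C standard library's `strtoul`:

```objective-c
#import <Foundation/Foundation.h>
#include <stdlib.h>

// Hypothetical helper for illustration: interprets `digits` in the given base.
static unsigned long ParseDigits(NSString *digits, int base) {
    return strtoul(digits.UTF8String, NULL, base);
}

// &#38;  is a decimal reference:     ParseDigits(@"38", 10) == 38  (U+0026, '&')
// &#x38; is a hexadecimal reference: ParseDigits(@"38", 16) == 56  (U+0038, '8')
```

So the `&#038;` in the question is a decimal reference to the ampersand, while the same digits after an `x` would name the digit character 8.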
13 Answers

164

Check out my NSString category for HTML. Here are the methods available:

- (NSString *)stringByConvertingHTMLToPlainText;
- (NSString *)stringByDecodingHTMLEntities;
- (NSString *)stringByEncodingHTMLEntities;
- (NSString *)stringWithNewLinesAsBRs;
- (NSString *)stringByRemovingNewLinesAndWhitespace;

3
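Dude, excellent functions. Your stringByDecodingXMLEntities method made my day! Thanks! - Brian Moeskau
3
No problem ;) Glad you found it useful! - Michael Waterfall
4
After a few hours of searching, I know this is the only way to do this that really works. NSString is overdue for a string method that can do this. Well done. - Adam Eberbach
10
It would be handy to update the code for ARC. Xcode reports a large number of ARC errors and warnings when building. - Matej
1
@MichaelWaterfall: Very nice. However, stringByConvertingHTMLToPlainText also removes line breaks. I tried commenting out some code in that function, but it crashes. Any suggestions? - rohan-patel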

53

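Daniel's code is basically very good; I just fixed a few issues:

  1. Removed NSScanner's skipped-characters setting (otherwise whitespace between two consecutive entities would be ignored)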

    [scanner setCharactersToBeSkipped:nil];

  2. Fixed the parsing when there are isolated '&' symbols (I am not sure what the "correct" output is in this case; I just compared it against Firefox):

For example:

    &#ABC DF & B'  & C' Items (288)
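
The effect of the first fix can be seen in isolation. This is a minimal sketch under stated assumptions: `DecodeAmpOnly` is a hypothetical demo function (not part of Daniel's or this answer's code) that decodes only `&amp;` and can be run with or without NSScanner's default skip set, which contains whitespace and newlines.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical demo function: decodes only "&amp;", optionally disabling
// NSScanner's default behavior of skipping whitespace before each scan.
static NSString *DecodeAmpOnly(NSString *input, BOOL disableSkipping) {
    NSScanner *scanner = [NSScanner scannerWithString:input];
    if (disableSkipping) {
        [scanner setCharactersToBeSkipped:nil];
    }
    NSMutableString *result = [NSMutableString string];
    while (![scanner isAtEnd]) {
        NSString *text = nil;
        // Copy everything up to the next '&' unchanged.
        if ([scanner scanUpToString:@"&" intoString:&text]) {
            [result appendString:text];
        }
        // Consume either the entity or a lone '&'; both yield "&".
        if ([scanner scanString:@"&amp;" intoString:NULL] ||
            [scanner scanString:@"&" intoString:NULL]) {
            [result appendString:@"&"];
        }
    }
    return result;
}
```

For the input `&amp; &amp;`, the default skip set drops the space between the two entities, while passing `YES` for `disableSkipping` preserves it.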

Here is the modified code:
- (NSString *)stringByDecodingXMLEntities {
    NSUInteger myLength = [self length];
    NSUInteger ampIndex = [self rangeOfString:@"&" options:NSLiteralSearch].location;

    // Short-circuit if there are no ampersands.
    if (ampIndex == NSNotFound) {
        return self;
    }
    // Make result string with some extra capacity.
    NSMutableString *result = [NSMutableString stringWithCapacity:(myLength * 1.25)];

    // First iteration doesn't need to scan to & since we did that already, but for code simplicity's sake we'll do it again with the scanner.
    NSScanner *scanner = [NSScanner scannerWithString:self];

    [scanner setCharactersToBeSkipped:nil];

    NSCharacterSet *boundaryCharacterSet = [NSCharacterSet characterSetWithCharactersInString:@" \t\n\r;"];

    do {
        // Scan up to the next entity or the end of the string.
        NSString *nonEntityString;
        if ([scanner scanUpToString:@"&" intoString:&nonEntityString]) {
            [result appendString:nonEntityString];
        }
        if ([scanner isAtEnd]) {
            goto finish;
        }
        // Scan either a HTML or numeric character entity reference.
        if ([scanner scanString:@"&amp;" intoString:NULL])
            [result appendString:@"&"];
        else if ([scanner scanString:@"&apos;" intoString:NULL])
            [result appendString:@"'"];
        else if ([scanner scanString:@"&quot;" intoString:NULL])
            [result appendString:@"\""];
        else if ([scanner scanString:@"&lt;" intoString:NULL])
            [result appendString:@"<"];
        else if ([scanner scanString:@"&gt;" intoString:NULL])
            [result appendString:@">"];
        else if ([scanner scanString:@"&#" intoString:NULL]) {
            BOOL gotNumber;
            unsigned charCode;
            NSString *xForHex = @"";

            // Is it hex or decimal?
            if ([scanner scanString:@"x" intoString:&xForHex]) {
                gotNumber = [scanner scanHexInt:&charCode];
            }
            else {
                gotNumber = [scanner scanInt:(int*)&charCode];
            }

            if (gotNumber) {
                [result appendFormat:@"%C", (unichar)charCode];

                [scanner scanString:@";" intoString:NULL];
            }
            else {
                NSString *unknownEntity = @"";

                [scanner scanUpToCharactersFromSet:boundaryCharacterSet intoString:&unknownEntity];


                [result appendFormat:@"&#%@%@", xForHex, unknownEntity];

                //[scanner scanUpToString:@";" intoString:&unknownEntity];
                //[result appendFormat:@"&#%@%@;", xForHex, unknownEntity];
                NSLog(@"Expected numeric character entity but got &#%@%@;", xForHex, unknownEntity);

            }

        }
        else {
            NSString *amp;

            [scanner scanString:@"&" intoString:&amp];  //an isolated & symbol
            [result appendString:amp];

            /*
            NSString *unknownEntity = @"";
            [scanner scanUpToString:@";" intoString:&unknownEntity];
            NSString *semicolon = @"";
            [scanner scanString:@";" intoString:&semicolon];
            [result appendFormat:@"%@%@", unknownEntity, semicolon];
            NSLog(@"Unsupported XML character entity %@%@", unknownEntity, semicolon);
             */
        }

    }
    while (![scanner isAtEnd]);

finish:
    return result;
}

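This should be the definitive answer to the question!! Thanks! - boliva
This works great. Unfortunately, the code in the highest-rated answer no longer works because of ARC issues, but this one does. - Ted Kulp
@TedKulp It works fine; you just have to disable ARC on a per-file basis. https://dev59.com/T2w15IYBdhLWcg3wYawx - Kyle
I would upvote you twice if I could. - Kibitz503
A Swift translation for anyone still visiting this question in 2016 and beyond: https://dev59.com/YHNA5IYBdhLWcg3wEZeT#35303635 - Max Chuquimia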

49

As of iOS 7, you can decode HTML characters natively by using an NSAttributedString with the NSHTMLTextDocumentType attribute:

NSString *htmlString = @"&#63743; &amp; &#38; &lt; &gt; &trade; &copy; &hearts; &clubs; &spades; &diams;";
NSData *stringData = [htmlString dataUsingEncoding:NSUTF8StringEncoding];

NSDictionary *options = @{NSDocumentTypeDocumentAttribute:NSHTMLTextDocumentType};
NSAttributedString *decodedString;
decodedString = [[NSAttributedString alloc] initWithData:stringData
                                                 options:options
                                      documentAttributes:NULL
                                                   error:NULL];

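The decoded attributed string is displayed as: & & < > ™ © ♥ ♣ ♠ ♦ (preceded by the character for &#63743;, which is the private-use code point U+F8FF and may not render).

Note: this only works when called on the main thread.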
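
To get a plain NSString back and to surface failures, the call above can be wrapped. This is a minimal sketch, assuming a hypothetical function name `PlainStringFromHTML`; it additionally passes `NSCharacterEncodingDocumentAttribute` so the importer is told the data is UTF-8, which is an addition to the answer's code rather than part of it. The conditional import is only there so the same sketch builds against UIKit or AppKit.

```objective-c
#import <TargetConditionals.h>
#if TARGET_OS_IPHONE
#import <UIKit/UIKit.h>
#else
#import <AppKit/AppKit.h>
#endif

// Hypothetical wrapper (the name is illustrative). Call it on the main thread.
static NSString *PlainStringFromHTML(NSString *html) {
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    if (data == nil) {
        return nil;
    }
    // State the encoding explicitly so non-ASCII input is not misread.
    NSDictionary *options = @{
        NSDocumentTypeDocumentAttribute : NSHTMLTextDocumentType,
        NSCharacterEncodingDocumentAttribute : @(NSUTF8StringEncoding)
    };
    NSError *error = nil;
    NSAttributedString *decoded =
        [[NSAttributedString alloc] initWithData:data
                                         options:options
                              documentAttributes:NULL
                                           error:&error];
    if (decoded == nil) {
        NSLog(@"HTML decoding failed: %@", error);
        return nil;
    }
    return decoded.string;  // the text without its attributes
}
```

Because this runs a full HTML import, it does more than decode entities: as the comments below point out, it collapses runs of whitespace and can be slow enough to stall scrolling in a table view.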


7
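If you don't need to support iOS 6 and older, this is the best answer... - jcesarmobile
1
Not the best option if someone wants to encode it on a background thread ;O - badeleux
4
This worked for decoding an entity, but it also messed up an unencoded dash. - Andrew
When it comes to a UITableView, this makes the GUI stall, so it doesn't work properly. - Asif Bilal
Nice solution, but note that it is rather "intrusive", i.e. it reduces multiple consecutive spaces to one. If you only want to decode HTML entities, that may not be the expected result. - DrMickeyLauer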

46
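Those are called character entity references. When they take the form &#<number>; they are called numeric entity references. Basically, it is a string representation of the byte that should be substituted. In the case of &#038;, it represents the character with the value 38 in the ISO-8859-1 character encoding scheme, which is &.
The reason the ampersand has to be encoded in RSS is that it is a reserved special character.
What you need to do is parse the string and replace each entity with a byte matching the value between &# and ;. I don't know of any great way to do this in Objective-C, but this Stack Overflow question might be of some help.
Edit: since answering this some two years ago, some great solutions have appeared; see @Michael Waterfall's answer below.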
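
The parse-and-replace step described above can be sketched with Foundation alone. This is a minimal sketch, not the method of any answer in this thread: `DecodeNumericReferences` is a hypothetical helper that handles only numeric references (decimal `&#38;` and hexadecimal `&#x26;`) and leaves named entities such as `&amp;` untouched. It treats the number as a Unicode code point rather than a byte, so values above U+FFFF become a UTF-16 surrogate pair.

```objective-c
#import <Foundation/Foundation.h>
#include <stdlib.h>

// Hypothetical helper: decodes decimal and hexadecimal numeric references.
static NSString *DecodeNumericReferences(NSString *input) {
    NSRegularExpression *regex = [NSRegularExpression
        regularExpressionWithPattern:@"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));"
                             options:0
                               error:NULL];
    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:input options:0 range:NSMakeRange(0, input.length)];
    NSMutableString *result = [NSMutableString string];
    NSUInteger last = 0;
    for (NSTextCheckingResult *match in matches) {
        // Copy the literal text between the previous match and this one.
        [result appendString:[input substringWithRange:
            NSMakeRange(last, match.range.location - last)]];
        last = NSMaxRange(match.range);

        NSRange hex = [match rangeAtIndex:1];
        BOOL isHex = (hex.location != NSNotFound);
        NSString *digits =
            [input substringWithRange:(isHex ? hex : [match rangeAtIndex:2])];
        unsigned long long cp = strtoull(digits.UTF8String, NULL, isHex ? 16 : 10);

        if (cp == 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            // Not a valid Unicode scalar value: keep the reference as written.
            [result appendString:[input substringWithRange:match.range]];
        } else if (cp <= 0xFFFF) {
            unichar c = (unichar)cp;
            [result appendString:[NSString stringWithCharacters:&c length:1]];
        } else {
            // Outside the BMP: encode as a UTF-16 surrogate pair.
            unsigned long long v = cp - 0x10000;
            unichar pair[2] = { (unichar)(0xD800 + (v >> 10)),
                                (unichar)(0xDC00 + (v & 0x3FF)) };
            [result appendString:[NSString stringWithCharacters:pair length:2]];
        }
    }
    [result appendString:[input substringFromIndex:last]];
    return result;
}
```

A full solution also needs a table of named entities, which is what the category and library answers in this thread provide.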

2
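+1 I was just about to submit exactly the same answer (including the same links, no less!) - e.James
"Basically, it's a string representation of the byte that should be substituted." More like the character. This is text, not data; once the text is converted to data, a character may occupy more than one byte, depending on the character and the encoding. - Peter Hosey
Thanks for the reply. You said "it represents the character with the value 38 in the ISO-8859-1 character encoding scheme, which is &". Are you sure about that? Do you have a link to a character table of this type? I remember that being a single quote. - treznik
See http://en.wikipedia.org/wiki/ISO/IEC_8859-1#ISO-8859-1 or type & into the Google search bar. - Matt Bridges
What about the & or © symbols? - vokilam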

35
Nobody has mentioned one of the simplest options: Google Toolbox for Mac. (Despite the name, this works on iOS too.)
https://github.com/google/google-toolbox-for-mac/blob/master/Foundation/GTMNSString%2BHTML.h
/// Get a string where internal characters that are escaped for HTML are unescaped 
//
///  For example, '&amp;' becomes '&'
///  Handles &#32; and &#x32; cases as well
///
//  Returns:
//    Autoreleased NSString
//
- (NSString *)gtm_stringByUnescapingFromHTML;

I only had to include three files in my project: the header, the implementation, and GTMDefines.h.


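I have included those three scripts, but how do I use them now? - Borut Tomazin
2
I chose to include only those three files, so I needed to do this to make it ARC compatible: http://code.google.com/p/google-toolbox-for-mac/wiki/ARC_Compatibility - jaime
I have to say, this is the simplest and most lightweight solution so far. - lensovet
I wish I could get this to work completely. It seems to skip over a lot of things in my string. - Joseph Toronto
@JosephToronto Maybe post a sample string here and say how you would like it unescaped? - Nikita Rybak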

18
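I should post this on GitHub or something. This goes in a category on NSString, uses NSScanner for the implementation, and handles both hexadecimal and decimal numeric character entities as well as the usual symbolic ones.
It also handles malformed strings (when you have an & followed by an invalid sequence of characters) relatively gracefully, which turned out to be crucial in my released app that uses this code.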
- (NSString *)stringByDecodingXMLEntities {
    NSUInteger myLength = [self length];
    NSUInteger ampIndex = [self rangeOfString:@"&" options:NSLiteralSearch].location;

    // Short-circuit if there are no ampersands.
    if (ampIndex == NSNotFound) {
        return self;
    }
    // Make result string with some extra capacity.
    NSMutableString *result = [NSMutableString stringWithCapacity:(myLength * 1.25)];

    // First iteration doesn't need to scan to & since we did that already, but for code simplicity's sake we'll do it again with the scanner.
    NSScanner *scanner = [NSScanner scannerWithString:self];
    do {
        // Scan up to the next entity or the end of the string.
        NSString *nonEntityString;
        if ([scanner scanUpToString:@"&" intoString:&nonEntityString]) {
            [result appendString:nonEntityString];
        }
        if ([scanner isAtEnd]) {
            goto finish;
        }
        // Scan either a HTML or numeric character entity reference.
        if ([scanner scanString:@"&amp;" intoString:NULL])
            [result appendString:@"&"];
        else if ([scanner scanString:@"&apos;" intoString:NULL])
            [result appendString:@"'"];
        else if ([scanner scanString:@"&quot;" intoString:NULL])
            [result appendString:@"\""];
        else if ([scanner scanString:@"&lt;" intoString:NULL])
            [result appendString:@"<"];
        else if ([scanner scanString:@"&gt;" intoString:NULL])
            [result appendString:@">"];
        else if ([scanner scanString:@"&#" intoString:NULL]) {
            BOOL gotNumber;
            unsigned charCode;
            NSString *xForHex = @"";

            // Is it hex or decimal?
            if ([scanner scanString:@"x" intoString:&xForHex]) {
                gotNumber = [scanner scanHexInt:&charCode];
            }
            else {
                gotNumber = [scanner scanInt:(int*)&charCode];
            }
            if (gotNumber) {
                [result appendFormat:@"%C", charCode];
            }
            else {
                NSString *unknownEntity = @"";
                [scanner scanUpToString:@";" intoString:&unknownEntity];
                [result appendFormat:@"&#%@%@;", xForHex, unknownEntity];
                NSLog(@"Expected numeric character entity but got &#%@%@;", xForHex, unknownEntity);
            }
            [scanner scanString:@";" intoString:NULL];
        }
        else {
            NSString *unknownEntity = @"";
            [scanner scanUpToString:@";" intoString:&unknownEntity];
            NSString *semicolon = @"";
            [scanner scanString:@";" intoString:&semicolon];
            [result appendFormat:@"%@%@", unknownEntity, semicolon];
            NSLog(@"Unsupported XML character entity %@%@", unknownEntity, semicolon);
        }
    }
    while (![scanner isAtEnd]);

finish:
    return result;
}

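Very useful snippet, but it does have a few issues, which Walty has addressed. Thanks for sharing! - Michael Waterfall
Do you know a way to display the lambda, mu, nu, and pi symbols by decoding their XML entities (such as µ ... etc.)? - chinthakad
You should avoid using goto, as it is terrible code style. You should replace the line goto finish; with break; - Stunner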

4
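You can solve this problem with just this function.
+ (NSString *)decodeHtmlUnicodeCharactersToString:(NSString *)str
{
    // Replaces decimal references such as &#39; with the character they name.
    NSMutableString *string = [NSMutableString stringWithString:str];
    NSCharacterSet *nonDigits =
        [[NSCharacterSet characterSetWithCharactersInString:@"0123456789"] invertedSet];
    NSUInteger searchStart = 0;

    while (searchStart < [string length])
    {
        NSRange start = [string rangeOfString:@"&#" options:NSLiteralSearch
                                        range:NSMakeRange(searchStart, [string length] - searchStart)];
        if (start.location == NSNotFound)
            break;

        NSUInteger digitsStart = NSMaxRange(start);
        NSRange end = [string rangeOfString:@";" options:NSLiteralSearch
                                      range:NSMakeRange(digitsStart, [string length] - digitsStart)];
        if (end.location == NSNotFound)
            break;

        // Read the integer value, e.g. 39.
        NSString *unicodeStr = [string substringWithRange:NSMakeRange(digitsStart, end.location - digitsStart)];
        int code = [unicodeStr intValue];
        BOOL digitsOnly = [unicodeStr length] > 0 &&
                          [unicodeStr rangeOfCharacterFromSet:nonDigits].location == NSNotFound;

        if (digitsOnly && code > 0 && code <= 0xFFFF)
        {
            // Replace the whole reference, e.g. &#39; becomes '
            NSString *replaceStr = [NSString stringWithFormat:@"%C", (unichar)code];
            [string replaceCharactersInRange:NSMakeRange(start.location, NSMaxRange(end) - start.location)
                                  withString:replaceStr];
            searchStart = start.location + 1;
        }
        else
        {
            searchStart = digitsStart;  // not a decimal reference; keep it and move on
        }
    }
    return string;
}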

4

Here is how I do it using the RegexKitLite framework:

- (NSString *)decodeHtmlUnicodeCharacters:(NSString *)html {
    NSString *result = html;
    // Each match is an array: index 0 is the whole reference, index 1 the digits.
    NSArray *matches = [result arrayOfCaptureComponentsMatchedByRegex:@"\\&#([\\d]+);"];

    if (![matches count])
        return result;

    for (int i = 0; i < [matches count]; i++) {
        NSArray *array = [matches objectAtIndex:i];
        NSString *charCode = [array objectAtIndex:1];
        int code = [charCode intValue];
        NSString *character = [NSString stringWithFormat:@"%C", (unichar)code];
        result = [result stringByReplacingOccurrencesOfString:[array objectAtIndex:0]
                                                   withString:character];
    }
    return result;
}

Hope this helps someone.


3

Here is a Swift version of Walty Yeung's answer:

extension String {
    // Every key includes the leading "&" and the trailing ";" of the entity.
    static private let mappings = [
        "&quot;" : "\"", "&amp;" : "&", "&lt;" : "<", "&gt;" : ">",
        "&nbsp;" : "\u{00A0}", "&iexcl;" : "¡", "&cent;" : "¢", "&pound;" : "£",
        "&curren;" : "¤", "&yen;" : "¥", "&brvbar;" : "¦", "&sect;" : "§",
        "&uml;" : "¨", "&copy;" : "©", "&ordf;" : "ª", "&laquo;" : "«",
        "&not;" : "¬", "&reg;" : "®", "&macr;" : "¯", "&deg;" : "°",
        "&plusmn;" : "±", "&sup2;" : "²", "&sup3;" : "³", "&acute;" : "´",
        "&micro;" : "µ", "&para;" : "¶", "&middot;" : "·", "&cedil;" : "¸",
        "&sup1;" : "¹", "&ordm;" : "º", "&raquo;" : "»", "&frac14;" : "¼",
        "&frac12;" : "½", "&frac34;" : "¾", "&iquest;" : "¿", "&times;" : "×",
        "&divide;" : "÷", "&ETH;" : "Ð", "&eth;" : "ð", "&THORN;" : "Þ",
        "&thorn;" : "þ", "&AElig;" : "Æ", "&aelig;" : "æ", "&OElig;" : "Œ",
        "&oelig;" : "œ", "&Aring;" : "Å", "&Oslash;" : "Ø", "&Ccedil;" : "Ç",
        "&ccedil;" : "ç", "&szlig;" : "ß", "&Ntilde;" : "Ñ", "&ntilde;" : "ñ",
    ]

    func stringByDecodingXMLEntities() -> String {

        guard let _ = self.rangeOfString("&", options: [.LiteralSearch]) else {
            return self
        }

        var result = ""

        let scanner = NSScanner(string: self)
        scanner.charactersToBeSkipped = nil

        let boundaryCharacterSet = NSCharacterSet(charactersInString: " \t\n\r;")

        repeat {
            var nonEntityString: NSString? = nil

            if scanner.scanUpToString("&", intoString: &nonEntityString) {
                if let s = nonEntityString as? String {
                    result.appendContentsOf(s)
                }
            }

            if scanner.atEnd {
                break
            }

            var didBreak = false
            for (k,v) in String.mappings {
                if scanner.scanString(k, intoString: nil) {
                    result.appendContentsOf(v)
                    didBreak = true
                    break
                }
            }

            if !didBreak {

                if scanner.scanString("&#", intoString: nil) {

                    var gotNumber = false
                    var charCodeUInt: UInt32 = 0
                    var charCodeInt: Int32 = -1
                    var xForHex: NSString? = nil

                    if scanner.scanString("x", intoString: &xForHex) {
                        gotNumber = scanner.scanHexInt(&charCodeUInt)
                    }
                    else {
                        gotNumber = scanner.scanInt(&charCodeInt)
                    }

                    if gotNumber {
                        let newChar = String(format: "%C", (charCodeInt > -1) ? charCodeInt : charCodeUInt)
                        result.appendContentsOf(newChar)
                        scanner.scanString(";", intoString: nil)
                    }
                    else {
                        var unknownEntity: NSString? = nil
                        scanner.scanUpToCharactersFromSet(boundaryCharacterSet, intoString: &unknownEntity)
                        let h = xForHex ?? ""
                        let u = unknownEntity ?? ""
                        result.appendContentsOf("&#\(h)\(u)")
                    }
                }
                else {
                    scanner.scanString("&", intoString: nil)
                    result.appendContentsOf("&")
                }
            }

        } while (!scanner.atEnd)

        return result
    }
}

1
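Actually, Michael Waterfall's great MWFeedParser framework (see his answer) has been forked by rmchaara, who updated it with ARC support! You can find it on GitHub here. It works really well; I used the stringByDecodingHTMLEntities method and it ran flawlessly.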

That solves the ARC issues but introduces some warnings. I assume it is safe to ignore them? - Robert J. Clegg
