Objective-C中的NSString分词

147

在Objective-C中,对于分词/拆分NSString的最佳方法是什么?

9个回答

278

在这里找到答案

NSString *string = @"oop:ack:bork:greeble:ponies";
NSArray *chunks = [string componentsSeparatedByString: @":"];

40
作为未来读者的参考,我想指出相反的写法是 [anArray componentsJoinedByString:@":"]; - Ivan Vučica
2
谢谢,但是如何分割一个由多个标记分隔的NSString?(如果你知道我的意思,我的英语不是很好)@Adam - 11684
2
@Adam,我认为你想要的是componentsSeparatedByCharactersInSet。请看下面的答案。 - Wienke

32

大家都提到了componentsSeparatedByString:,但是你也可以使用 CFStringTokenizer(记住NSStringCFString是可互换的),它也可以对自然语言进行分词(比如中文/日文不会在空格上拆分单词)。


7
在Mac OS X 10.6及更高版本中,NSString类有两个方法:enumerateLinesUsingBlock:enumerateSubstringsInRange:options:usingBlock:,后者是基于块的CFStringTokenizer版本。具体信息请参考这里:http://developer.apple.com/mac/library/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/Reference/NSString.html#//apple_ref/occ/instm/NSString/enumerateLinesUsingBlock: 和 http://developer.apple.com/mac/library/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/Reference/NSString.html#//apple_ref/occ/instm/NSString/enumerateSubstringsInRange:options:usingBlock:。 - Peter Hosey
1
enumerate 方法也适用于 iOS 4 及以后的版本。 - bugloaf

21

如果你只想拆分一个字符串,请使用 -[NSString componentsSeparatedByString:] 方法。对于更复杂的分词,可以使用 NSScanner 类。


7
如果您的分词需求更加复杂,请查看我的开源Cocoa字符串分词/解析工具包:ParseKit。

http://parsekit.com

如果只需要使用分隔符字符(如“:”)对字符串进行简单拆分,则使用ParseKit肯定过于复杂。但是,对于复杂的标记化需求,ParseKit非常强大/灵活。

还可以参见ParseKit标记化文档


这还能用吗?我试了一下,出现了几个错误,我不敢自己尝试修复。 - griotspeak
嗯?还活着?ParseKit项目正在积极维护。然而,在这里发表评论并不是报告该项目错误的正确方式。如果您需要报告错误,可以在Google Code和Github上进行。 - Todd Ditchendorf
听起来不错,但现在我不能取消我的反对票,除非你以某种方式编辑答案(网站规则)。也许你可以注明它适用于哪些版本,或者是否使用ARC等?或者你可以在某个地方添加一个空格,这取决于你 :) - Dan Rosenstark

6
如果您想在多个字符上进行标记化,可以使用NSString的componentsSeparatedByCharactersInSet。NSCharacterSet有一些方便的预制集合,例如whitespaceCharacterSetillegalCharacterSet。它还有用于Unicode范围的初始化程序。
您还可以组合字符集并将它们用于标记化,像这样:
// Tokenize sSourceEntityName on both whitespace and punctuation.
NSMutableCharacterSet *mcharsetWhitePunc = [[NSCharacterSet whitespaceAndNewlineCharacterSet] mutableCopy];
[mcharsetWhitePunc formUnionWithCharacterSet:[NSCharacterSet punctuationCharacterSet]];
NSArray *sarrTokenizedName = [self.sSourceEntityName componentsSeparatedByCharactersInSet:mcharsetWhitePunc];
[mcharsetWhitePunc release];

请注意,componentsSeparatedByCharactersInSet 如果连续遇到 charSet 中的多个成员,将产生空字符串,因此您可能需要测试长度小于1。


不涉及那些在所有逻辑标记之间没有分隔符的语言。这是一个糟糕的解决方案。 - uchuugaka
在这种情况下,您将使用不同的字符集或一组字符集进行标记化。我只是使用具体的例子来说明一个普遍的概念。 - Wienke

5
如果您想将字符串令牌化为搜索项,同时保留“引号短语”,这里有一个NSString类别,它尊重各种类型的引号配对:""''‘’“” 用法:
NSArray *terms = [@"This is my \"search phrase\" I want to split" searchTerms];
// results in: ["This", "is", "my", "search phrase", "I", "want", "to", "split"]

代码:

@interface NSString (Search)
- (NSArray *)searchTerms;
@end

@implementation NSString (Search)

- (NSArray *)searchTerms {

    // Strip whitespace and setup scanner
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSString *searchString = [self stringByTrimmingCharactersInSet:whitespace];
    NSScanner *scanner = [NSScanner scannerWithString:searchString];
    [scanner setCharactersToBeSkipped:nil]; // we'll handle whitespace ourselves

    // A few types of quote pairs to check
    NSDictionary *quotePairs = @{@"\"": @"\"",
                                 @"'": @"'",
                                 @"\u2018": @"\u2019",
                                 @"\u201C": @"\u201D"};

    // Scan
    NSMutableArray *results = [[NSMutableArray alloc] init];
    NSString *substring = nil;
    while (scanner.scanLocation < searchString.length) {
        // Check for quote at beginning of string
        unichar unicharacter = [self characterAtIndex:scanner.scanLocation];
        NSString *startQuote = [NSString stringWithFormat:@"%C", unicharacter];
        NSString *endQuote = [quotePairs objectForKey:startQuote];
        if (endQuote != nil) { // if it's a valid start quote we'll have an end quote
            // Scan quoted phrase into substring (skipping start & end quotes)
            [scanner scanString:startQuote intoString:nil];
            [scanner scanUpToString:endQuote intoString:&substring];
            [scanner scanString:endQuote intoString:nil];
        } else {
            // Single word that is non-quoted
            [scanner scanUpToCharactersFromSet:whitespace intoString:&substring];
        }
        // Process and add the substring to results
        if (substring) {
            substring = [substring stringByTrimmingCharactersInSet:whitespace];
            if (substring.length) [results addObject:substring];
        }
        // Skip to next word
        [scanner scanCharactersFromSet:whitespace intoString:nil];
    }

    // Return non-mutable array
    return results.copy;

}

@end

2

如果您正在寻找将字符串的语言特征(单词、段落、字符、句子和行)分割开来的方法,请使用字符串枚举:

NSString * string = @" \n word1!    word2,%$?'/word3.word4   ";

[string enumerateSubstringsInRange:NSMakeRange(0, string.length)
                           options:NSStringEnumerationByWords
                        usingBlock:
 ^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
     NSLog(@"Substring: '%@'", substring);
 }];

 // Logs:
 // Substring: 'word1'
 // Substring: 'word2'
 // Substring: 'word3'
 // Substring: 'word4' 

这个API可以处理其他语言中不一定以空格为分隔符的情况(例如日语)。而使用NSStringEnumerationByComposedCharacterSequences是正确的枚举字符的方式,因为许多非西方字符超过一个字节长。


从iOS12开始,我们还可以考虑使用NLTokenizer(https://developer.apple.com/documentation/naturallanguage/tokenizing_natural_language_text)进行自然语言处理。 - qix

0
我曾经遇到这样的情况,仅仅按组件分离字符串是不够的,还需要完成许多任务,例如
1)对令牌进行分类
2)添加新的令牌
3)在自定义闭包之间分离字符串,例如在“{”和“}”之间的所有单词。
对于此类要求,我发现Parse Kit非常有用。

我成功地使用它来解析.PGN(便携式游戏注释)文件,速度非常快且轻巧。


0

我曾经遇到这样一个情况,需要在LDAP查询(ldapsearch)的控制台输出中进行拆分。首先设置并执行NSTask(我在这里找到了一个很好的代码示例:从Cocoa应用程序执行终端命令)。但是,我必须拆分和解析输出,以便仅提取Ldap查询输出中的打印服务器名称。不幸的是,这是相当繁琐的字符串操作,如果我们使用简单的C数组操作来操作C-strings/arrays,则不会有任何问题。因此,这是我的使用Cocoa对象的代码。如果您有更好的建议,请告诉我。

//as the ldap query has to be done when the user selects one of our Active Directory Domains
//(an according comboBox should be populated with print-server names we discover from AD)
//my code is placed in the onSelectDomain event code

//the following variables are declared in the interface .h file as globals
@protected NSArray* aDomains;//domain combo list array
@protected NSMutableArray* aPrinters;//printer combo list array
@protected NSMutableArray* aPrintServers;//print server combo list array

@protected NSString* sLdapQueryCommand;//for LDAP Queries
@protected NSArray* aLdapQueryArgs;
@protected NSTask* tskLdapTask;
@protected NSPipe* pipeLdapTask;
@protected NSFileHandle* fhLdapTask;
@protected NSMutableData* mdLdapTask;

IBOutlet NSComboBox* comboDomain;
IBOutlet NSComboBox* comboPrinter;
IBOutlet NSComboBox* comboPrintServer;
//end of interface globals

//after collecting the print-server names they are displayed in an according drop-down comboBox
//as soon as the user selects one of the print-servers, we should start a new query to find all the
//print-queues on that server and display them in the comboPrinter drop-down list
//to find the shares/print queues of a windows print-server you need samba and the net -S command like this:
// net -S yourPrintServerName.yourBaseDomain.com -U yourLdapUser%yourLdapUserPassWord -W adm rpc share -l
//which dispalays a long list of the shares

- (IBAction)onSelectDomain:(id)sender
{
    static int indexOfLastItem = 0; //unfortunately we need to compare this because we are called also if the selection did not change!

    if ([comboDomain indexOfSelectedItem] != indexOfLastItem && ([comboDomain indexOfSelectedItem] != 0))
    {

        indexOfLastItem = [comboDomain indexOfSelectedItem]; //retain this index for next call

    //the print-servers-list has to be loaded on a per univeristy or domain basis from a file dynamically or from AN LDAP-QUERY

    //initialize an LDAP-Query-Task or console-command like this one with console output
    /*

     ldapsearch -LLL -s sub -D "cn=yourLdapUser,ou=yourOuWithLdapUserAccount,dc=yourDomain,dc=com" -h "yourLdapServer.com" -p 3268 -w "yourLdapUserPassWord" -b "dc=yourBaseDomainToSearchIn,dc=com" "(&(objectcategory=computer)(cn=ps*))" "dn"

//our print-server names start with ps* and we want the dn as result, wich comes like this:

     dn: CN=PSyourPrintServerName,CN=Computers,DC=yourBaseDomainToSearchIn,DC=com

     */

    sLdapQueryCommand = [[NSString alloc] initWithString: @"/usr/bin/ldapsearch"];


    if ([[comboDomain stringValue] compare: @"firstDomain"] == NSOrderedSame) {

      aLdapQueryArgs = [NSArray arrayWithObjects: @"-LLL",@"-s", @"sub",@"-D", @"cn=yourLdapUser,ou=yourOuWithLdapUserAccount,dc=yourDomain,dc=com",@"-h", @"yourLdapServer.com",@"-p",@"3268",@"-w",@"yourLdapUserPassWord",@"-b",@"dc=yourFirstDomainToSearchIn,dc=com",@"(&(objectcategory=computer)(cn=ps*))",@"dn",nil];
    }
    else {
      aLdapQueryArgs = [NSArray arrayWithObjects: @"-LLL",@"-s", @"sub",@"-D", @"cn=yourLdapUser,ou=yourOuWithLdapUserAccount,dc=yourDomain,dc=com",@"-h", @"yourLdapServer.com",@"-p",@"3268",@"-w",@"yourLdapUserPassWord",@"-b",@"dc=yourSecondDomainToSearchIn,dc=com",@"(&(objectcategory=computer)(cn=ps*))",@"dn",nil];

    }


    //prepare and execute ldap-query task

    tskLdapTask = [[NSTask alloc] init];
    pipeLdapTask = [[NSPipe alloc] init];//instead of [NSPipe pipe]
    [tskLdapTask setStandardOutput: pipeLdapTask];//hope to get the tasks output in this file/pipe

    //The magic line that keeps your log where it belongs, has to do with NSLog (see https://dev59.com/kHRC5IYBdhLWcg3wFdFx and here http://www.cocoadev.com/index.pl?NSTask )
    [tskLdapTask setStandardInput:[NSPipe pipe]];

    //fhLdapTask  = [[NSFileHandle alloc] init];//would be redundand here, next line seems to do the trick also
    fhLdapTask = [pipeLdapTask fileHandleForReading];
    mdLdapTask  = [NSMutableData dataWithCapacity:512];//prepare capturing the pipe buffer which is flushed on read and can overflow, start with 512 Bytes but it is mutable, so grows dynamically later
    [tskLdapTask setLaunchPath: sLdapQueryCommand];
    [tskLdapTask setArguments: aLdapQueryArgs];

#ifdef bDoDebug
    NSLog (@"sLdapQueryCommand: %@\n", sLdapQueryCommand);
    NSLog (@"aLdapQueryArgs: %@\n", aLdapQueryArgs );
    NSLog (@"tskLdapTask: %@\n", [tskLdapTask arguments]);
#endif

    [tskLdapTask launch];

    while ([tskLdapTask isRunning]) {
      [mdLdapTask appendData: [fhLdapTask readDataToEndOfFile]];
    }
    [tskLdapTask waitUntilExit];//might be redundant here.

    [mdLdapTask appendData: [fhLdapTask readDataToEndOfFile]];//add another read for safety after process/command stops

    NSString* sLdapOutput = [[NSString alloc] initWithData: mdLdapTask encoding: NSUTF8StringEncoding];//convert output to something readable, as NSData and NSMutableData are mere byte buffers

#ifdef bDoDebug
    NSLog(@"LdapQueryOutput: %@\n", sLdapOutput);
#endif

    //Ok now we have the printservers from Active Directory, lets parse the output and show the list to the user in its combo box
    //output is formatted as this, one printserver per line
    //dn: CN=PSyourPrintServer,OU=Computers,DC=yourBaseDomainToSearchIn,DC=com

    //so we have to search for "dn: CN=" to retrieve each printserver's name
    //unfortunately splitting this up will give us a first line containing only "" empty string, which we can replace with the word "choose"
    //appearing as first entry in the comboBox

    aPrintServers = (NSMutableArray*)[sLdapOutput componentsSeparatedByString:@"dn: CN="];//split output into single lines and store it in the NSMutableArray aPrintServers

#ifdef bDoDebug
    NSLog(@"aPrintServers: %@\n", aPrintServers);
#endif

    if ([[aPrintServers objectAtIndex: 0 ] compare: @"" options: NSLiteralSearch] == NSOrderedSame){
      [aPrintServers replaceObjectAtIndex: 0 withObject: slChoose];//replace with localized string "choose"

#ifdef bDoDebug
      NSLog(@"aPrintServers: %@\n", aPrintServers);
#endif

    }

//Now comes the tedious part to extract only the print-server-names from the single lines
    NSRange r;
    NSString* sTemp;

    for (int i = 1; i < [aPrintServers count]; i++) {//skip first line with "choose". To get rid of the rest of the line, we must isolate/preserve the print server's name to the delimiting comma and remove all the remaining characters
      sTemp = [aPrintServers objectAtIndex: i];
      sTemp = [sTemp stringByTrimmingCharactersInSet: [NSCharacterSet whitespaceAndNewlineCharacterSet]];//remove newlines and line feeds

#ifdef bDoDebug
      NSLog(@"sTemp: %@\n", sTemp);
#endif
      r = [sTemp rangeOfString: @","];//now find first comma to remove the whole rest of the line
      //r.length = [sTemp lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
      r.length = [sTemp length] - r.location;//calculate number of chars between first comma found and lenght of string
#ifdef bDoDebug
      NSLog(@"range: %i, %i\n", r.location, r.length);
#endif

      sTemp = [sTemp stringByReplacingCharactersInRange:r withString: @"" ];//remove rest of line
#ifdef bDoDebug
      NSLog(@"sTemp after replace: %@\n", sTemp);
#endif

      [aPrintServers replaceObjectAtIndex: i withObject: sTemp];//put back string into array for display in comboBox

#ifdef bDoDebug
      NSLog(@"aPrintServer: %@\n", [aPrintServers objectAtIndex: i]);
#endif

    }

    [comboPrintServer removeAllItems];//reset combo box
    [comboPrintServer addItemsWithObjectValues:aPrintServers];
    [comboPrintServer setNumberOfVisibleItems:aPrintServers.count];
    [comboPrintServer selectItemAtIndex:0];

#ifdef bDoDebug
    NSLog(@"comboPrintServer reloaded with new values.");
#endif


//release memory we used for LdapTask
    [sLdapQueryCommand release];
    [aLdapQueryArgs release];
    [sLdapOutput release];

    [fhLdapTask release];

    [pipeLdapTask release];
//    [tskLdapTask release];//strangely can not be explicitely released, might be autorelease anyway
//    [mdLdapTask release];//strangely can not be explicitely released, might be autorelease anyway

    [sTemp release];

    }
}

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接