Iteratively parsing inner HTML using the Hpple parser and NSXMLParser

4
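I have been working on a school newspaper app for the iPad. I am using NSXMLParser to get each article's title, short description, and link. To get the HTML content behind each parsed link, I decided to use the Hpple parser. I believe I am parsing and storing the RSS items correctly, but when I try to parse the HTML from each parsed link in a for loop, it reports that the array is empty. However, I can print the contents of the RSS item holder to the console, so it is not empty. Below are parts of my code and the console output. Please help; the deadline for this project is coming up soon. Thanks in advance.
Here is how I start loading my RSS parser (articleParser):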
- (void)loadData {
    [self loadInitData];

    //[self loadDataWithLink];

}

- (void)loadInitData {
    if (sections == nil) {
        [activityIndicator startAnimating];

        NSLog(@"STARTING ARTICLE PARSER FROM MAIN URL!!!");

        Parser *articleParser = [[Parser alloc] init];
        [articleParser parseRssFeed:@"http://theaggie.org/rss/headlines.xml" withDelegate:self];
        [articleParser release];
    } else {

    }

}

Here is how I store the received article items in an NSMutableArray called "sections". I then use a for loop to go through each link of the parsed articles.

- (void)receivedArticleItems:(Article *)theArticle {
    if (sections == nil) {
        sections = [[NSMutableArray alloc] init];
    }
    [sections addObject:theArticle];

    NSLog(@"We recieved the article!");
    NSLog(@"Article: %@", theArticle);
    NSLog(@"What is in sections: %@", sections);

    for (int i = 1; i < 5; i++) {
        NSLog(@"articleItems: %@",[sections objectAtIndex:0]);
        NSLog(@"articleItems at index 0: %@",[[[sections objectAtIndex:0] articleItems] objectAtIndex:0]);

        [self loadDataWithLink:[[[[sections objectAtIndex:0] articleItems] objectAtIndex:0] objectForKey:@"link"]];
    }
    [activityIndicator stopAnimating];
}

Here is how I use the TFHpple parser to get the HTML items from each parsed link:
- (void)loadDataWithLink:(NSString *)urlString{

 NSData *htmlData = [NSData dataWithContentsOfURL:[NSURL URLWithString:urlString]];

 // Create parser
 TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:htmlData];

 //Get all the cells main body
 htmlElements  = [xpathParser search:@"//div[@id='main']/div[@id='mainCol1']/div[@id='main-body']"];

 // Access the first cell
 TFHppleElement *htmlElement = [htmlElements objectAtIndex:0];

 // NSString *title = [htmlElement content];

 NSLog(@"What is in element: %@", htmlElement);

 [xpathParser release];
 //[htmlData release];
}

Here is what I get in the console:

2011-05-02 22:58:35.355 TheCalAggie[2443:207] Parsing started for article!
2011-05-02 22:58:35.356 TheCalAggie[2443:207] Adding story title: Students say, 'No time for books'
2011-05-02 22:58:35.356 TheCalAggie[2443:207] From the link: http://theaggie.org/article/2011/05/03/students-say-no-time-for-books
2011-05-02 22:58:35.357 TheCalAggie[2443:207] Summary: The last book managerial economics major Kiyan Parsa read for fun was The Lord of the Rings. That was in high school.
2011-05-02 22:58:35.358 TheCalAggie[2443:207] Published on: Tue, 03 May 2011 00:00:00 -0700
2011-05-02 22:58:35.359 TheCalAggie[2443:207] Parsing started for article!
2011-05-02 22:58:35.360 TheCalAggie[2443:207] Adding story title: UC Davis craft center one of largest college crafting centers
2011-05-02 22:58:35.360 TheCalAggie[2443:207] From the link: http://theaggie.org/article/2011/05/02/uc-davis-craft-center-one-of-largest-college-crafting-centers
2011-05-02 22:58:35.361 TheCalAggie[2443:207] Summary: Hidden away in the South Silo, the UC Davis Craft Center offers 10 craft studios and more than a hundred classes for students looking to learn or perfect their crafting skills.
2011-05-02 22:58:35.362 TheCalAggie[2443:207] Published on: Mon, 02 May 2011 00:00:00 -0700
2011-05-02 22:58:35.362 TheCalAggie[2443:207] We recieved the article!
2011-05-02 22:58:35.363 TheCalAggie[2443:207] Article: *nil description*
2011-05-02 22:58:35.364 TheCalAggie[2443:207] What is in sections: (
    (null)
)
2011-05-02 22:58:35.374 TheCalAggie[2443:207] articleItems: *nil description*
2011-05-02 22:58:35.375 TheCalAggie[2443:207] articleItems at index 0: {
    link = "http://theaggie.org/article/2011/05/03/peaceful-rally-held-on-campus-after-killing-of-bin-laden\n";
    pubDate = "Tue, 03 May 2011 00:00:00 -0700";
    summary = "The announcement of Osama bin Laden's death sent a wave of patriotism across the nation and UC Davis. Bin Laden was the leader of al-Qaeda - the organization allegedly behind the Sept. 11, 2001 attacks that killed over 3,000 Americans.\n";
    title = "Peaceful rally held on campus after killing of bin Laden \n";
}
2011-05-02 22:59:35.376 TheCalAggie[2443:207] Unable to parse.
2011-05-02 22:59:35.379 TheCalAggie[2443:207] *** Terminating app due to uncaught exception 'NSRangeException', reason: '*** -[NSMutableArray objectAtIndex:]: index 0 beyond bounds for empty array'
*** Call stack at first throw:

Any help would be greatly appreciated. Thanks again.
2 Answers

3

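2011-05-02 22:59:35.376 TheCalAggie[2443:207] Unable to parse.

The parser is having trouble parsing the HTML. This parser is not perfect at handling HTML, and running XPath over a document that may be malformed or invalid is a hard job for any parser.

Running the link you are trying to parse through the W3C validator produces several errors, so it is not entirely valid HTML. If the page is too broken for this parser to handle, you will need to debug to find out why. To really get to the bottom of it, set breakpoints inside the TFHpple parser you are using to gather more information.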
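Whatever the root cause turns out to be, the crash in the log is an NSRangeException raised by calling objectAtIndex:0 on an empty array, and the link value printed in the console ends with a "\n". The following is a minimal sketch, not the asker's or the library's code, of two defensive helpers that use only Foundation. The names AGGURLFromFeedLink and AGGFirstObjectOrNil are hypothetical. Trimming the link is an assumption about one possible contributing factor: on some SDK versions a string containing a newline can make URLWithString: return nil, but the thread does not confirm that this is what happened here.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: strips surrounding whitespace and newlines from a
// feed link before building the URL. Returns nil for a nil or blank string.
static NSURL *AGGURLFromFeedLink(NSString *link) {
    NSString *trimmed = [link stringByTrimmingCharactersInSet:
                         [NSCharacterSet whitespaceAndNewlineCharacterSet]];
    if ([trimmed length] == 0) {
        return nil;
    }
    return [NSURL URLWithString:trimmed];
}

// Hypothetical helper: returns the first object of an array, or nil when the
// array is nil or empty, instead of raising NSRangeException.
static id AGGFirstObjectOrNil(NSArray *array) {
    if ([array count] == 0) {
        return nil;
    }
    return [array objectAtIndex:0];
}
```

In loadDataWithLink:, the URL could be built with AGGURLFromFeedLink(urlString), and the result of the search: call could be passed through AGGFirstObjectOrNil so that a nil result is logged and skipped rather than crashing the app. This does not repair the HTML; it only makes the failure visible without terminating.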


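Thank you very much, Damien! I dug into the HTML source and finally managed to parse what I needed. Now I have run into a different problem with MVC, and I have posted another question. Could you help me with it? Here is the link to that question: http://stackoverflow.com/questions/6132894/passing-uitextview-values-to-modalviewcontroller-from-parent-view-controller - SerPiero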

0

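Damien is right. You first need to fix the HTML for your code to work. The data it parses is different every time, which shows that the HTML is flawed. So this code may only work in some cases. Try running it several times, and you will see that it occasionally works.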


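Yes, I was finally able to parse what I needed, although I had to dig into the HTML source to do it. Thanks for your reply. I would appreciate it if you could help me with another question. Here is the link: http://stackoverflow.com/questions/6132894/passing-uitextview-values-to-modalviewcontroller-from-parent-view-controller - SerPiero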
