Android RSS feed parsing

Question

5

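I'm a beginner learning Android development. My app needs to parse data and display it on screen, but for one particular tag I can't parse the data because the tag contains some special characters. Here is my code:

My parsing function: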

  protected ArrayList<String> doInBackground(Context... params) 
    {
//      context = params[0];
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();     
        test = new ArrayList<String>();
        try {
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document document = builder.parse(new java.net.URL("input URL_confidential").openConnection().getInputStream());
            //Document document = builder.parse(new URL("http://www.gamestar.de/rss/gamestar.rss").openConnection().getInputStream());
            Element root = document.getDocumentElement();
            NodeList docItems = root.getElementsByTagName("item");
            Node nodeItem;
            for(int i = 0;i<docItems.getLength();i++)
            {
                nodeItem = docItems.item(i);
                if(nodeItem.getNodeType() == Node.ELEMENT_NODE)
                {
                    NodeList element = nodeItem.getChildNodes();                    
                    Element entry = (Element) docItems.item(i);
                    name=(element.item(0).getFirstChild().getNodeValue());

//                 System.out.println("description = "+element.item(2).getFirstChild().getNodeValue().replaceAll("&lt;div&gt;&lt;p&gt;"," "));
                    System.out.println("Description"+Jsoup.clean(org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(element.item(2).getFirstChild().getNodeValue()), new Whitelist()));

                    items.add(name);
                }
            }
        } 
        catch (ParserConfigurationException e) 
        {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        catch (MalformedURLException e)
        {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        catch (SAXException e)
        {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        catch (IOException e)
        {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        return items;
    }

Input:

<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
<channel>
<title>my application</title>
<link>http:// some link</link>
<atom:link href="http:// XXXXXXXX" rel="self"></atom:link>
<language>en-us</language>
<lastBuildDate>Thu, 20 Dec 2012</lastBuildDate>
<item>
<title>lllegal settlements</title>
<link>http://XXXXXXXXXXXXXXXX</link>
<description> &lt;div&gt;&lt;p&gt;
India was joined by all members of the 15-nation UN Security Council except the US to condemn Israelâ€™s announcement of new construction activity in Palestinian territories and demand immediate dismantling of the â€œillegalâ€ settlements.
&lt;/p&gt;
&lt;p&gt;
UN Secretary General Ban Ki-moon also expressed his deep concern by the heightened settlement activity in West Bank, saying the move by Israel â€œgravely threatens efforts to establish a viable Palestinian state.â€
&lt;/p&gt;
&lt;p&gt;
</description>
</item>
</channel>

Output:

 lllegal settlements  ----> title tag text

     India was joined by all members of the 15-nation UN Security Council except the US to condemn Israel announcement of new construction activity in Palestinian territories and demand immediate dismantling of the illegal settlements. -----> description tag text

     UN Secretary General Ban Ki-moon also expressed his deep concern by the heightened settlement activity in West Bank, saying the move by Israel gravely threatens efforts to establish a viable Palestinian state.    ----> description tag text.

- neha88

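An independent inquiry into the September 11 attack on the US Consulate in Benghazi that killed the US ambassador to Libya and three other Americans has found that systematic failures at the State Department led to "grossly" inadequate security at the mission. - neha88

3 Answers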

0

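Run the node value through Html.fromHtml() two or three times and the problem goes away.

Explanation: the built-in Html.fromHtml() method turns messy, broken HTML into usable content. It returns a Spanned, hence the toString() calls. Roughly:

String sHTML = node.getNodeValue();
sHTML = Html.fromHtml(sHTML).toString();
sHTML = Html.fromHtml(sHTML).toString();
sHTML = Html.fromHtml(sHTML).toString();

By the third or fourth pass, the unreadable content becomes readable again. You can show it in a TextView, or load it into a WebView.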
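The reason repeated passes can help is that the text may have been escaped more than once, so each pass peels off one level. The following is only a plain-Java sketch of that idea, not Android's Html class: unescapeOnce and unescapeFully are hypothetical helpers that handle just four entities, and the cap on passes is an arbitrary choice.

```java
class RepeatedUnescape {

    // Hypothetical helper: decodes one level of four common entities.
    // "&amp;" is replaced last so one call never decodes two levels at once.
    static String unescapeOnce(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
    }

    // Repeats unescapeOnce until the text stops changing or maxPasses is reached.
    static String unescapeFully(String s, int maxPasses) {
        String current = s;
        for (int i = 0; i < maxPasses; i++) {
            String next = unescapeOnce(current);
            if (next.equals(current)) {
                break; // nothing left to decode
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        String doubleEscaped = "&amp;lt;p&amp;gt;Hello&amp;lt;/p&amp;gt;";
        System.out.println(unescapeOnce(doubleEscaped));     // &lt;p&gt;Hello&lt;/p&gt;
        System.out.println(unescapeFully(doubleEscaped, 5)); // <p>Hello</p>
    }
}
```

Decoding until nothing changes can also decode text that was escaped on purpose, so a fixed number of passes may be the safer choice when the number of escaping levels is known.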

- The Somberi

You have a nice name. - The Somberi

And a pretty face too, hehe ;) - Phantômaxx

0

Would simply replacing the offending characters work?

string = string.replaceAll("&lt;", "");
string = string.replaceAll("div&gt;", "");
string = string.replaceAll("p&gt;", "");
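Note that the three calls above leave a stray "/" behind for a closing tag such as &lt;/p&gt;. A single pattern that matches a whole escaped tag avoids that. This is a minimal sketch, assuming the string still holds the escaped form (&lt;...&gt;) and that the tags carry no attributes; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class EscapedTagStripper {

    // Matches an escaped opening or closing tag, e.g. "&lt;div&gt;" or "&lt;/p&gt;".
    // Assumption: tags have no attributes.
    private static final Pattern ESCAPED_TAG =
            Pattern.compile("&lt;/?[A-Za-z][A-Za-z0-9]*&gt;");

    // Replaces each escaped tag with a space, then collapses runs of whitespace.
    static String stripEscapedTags(String text) {
        String withoutTags = ESCAPED_TAG.matcher(text).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String description = " &lt;div&gt;&lt;p&gt;\n"
                + "India was joined by all members.\n"
                + "&lt;/p&gt;\n&lt;p&gt;";
        System.out.println(stripEscapedTags(description));
        // prints: India was joined by all members.
    }
}
```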

- Aelexe

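Thanks, Aelexe. I can't even get the data... I tried the code above, but it doesn't display anything. My problem is extracting the data. Once I've extracted it, I can use the replaceAll() method. - neha88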


- Raffaele · Accepted Answer

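Your text node contains escaped HTML entities (&gt; stands for >, the greater-than sign) and garbage characters (the â€œ and â€ around "grossly"). You should first fix the encoding according to your input source; then you can unescape the HTML with Apache Commons Lang StringEscapeUtils.unescapeHtml4(String).

That method will (hopefully) return XML, from which you can extract the text nodes you need with XPath or similar. Alternatively, pass the whole string to JSOUP or to Android's Html class.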

// JSOUP, "html" is the unescaped string. Returns a string
Jsoup.parse(html).text();

// Android, same "html" string; fromHtml() returns a Spanned
android.text.Html.fromHtml(html).toString();
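The XPath route mentioned above can also be sketched with the JDK's own XML APIs, without JSOUP or Commons Lang. This is only an illustration under two assumptions: the embedded markup is well-formed once wrapped in a root element (the sample feed in the question is not, since it ends with an open tag), and it contains no HTML-only entities such as &nbsp;. The class name, the method names, and the "root" wrapper are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DescriptionText {

    static Document parseXml(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the trimmed text of each <p> inside the first <item>'s <description>.
    static List<String> paragraphs(String feedXml) {
        try {
            Document feed = parseXml(feedXml);
            Element item = (Element) feed.getElementsByTagName("item").item(0);
            // Look the element up by name: child index 0 may be a whitespace text node.
            // getTextContent() returns the entity-decoded text, i.e. the embedded markup.
            String markup = item.getElementsByTagName("description").item(0).getTextContent();

            // Assumption: the markup is well-formed once wrapped in a root element.
            Document fragment = parseXml("<root>" + markup + "</root>");
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//p", fragment, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Feed or embedded markup is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String feed = "<rss version=\"2.0\"><channel><item>"
                + "<title>Sample</title>"
                + "<description> &lt;div&gt;&lt;p&gt; First paragraph. &lt;/p&gt;"
                + "&lt;p&gt; Second paragraph. &lt;/p&gt;&lt;/div&gt;</description>"
                + "</item></channel></rss>";
        for (String p : paragraphs(feed)) {
            System.out.println(p);
        }
    }
}
```

The XML parser already decodes one level of entities when it reads the description, which is why no separate unescape step appears here. For markup that is not well-formed, a lenient HTML parser such as JSOUP is likely the better fit.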

Test program (requires JSOUP and Commons-Lang)

package stackoverflow;

import org.apache.commons.lang3.StringEscapeUtils;
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class EmbeddedHTML {

    public static void main(String[] args) {
        String src = "<description> &lt;div&gt;&lt;p&gt; An independent" +
                " inquiry into the September 11 attack on the US Consulate" +
                " in Benghazi that killed the US ambassador to Libya and" +
                " three other Americans has found that systematic failures" +
                " at the State Department led to â€œgrosslyâ€ inadequate" +
                " security at the mission. &lt;/p&gt;</description>";
        String unescaped = StringEscapeUtils.unescapeHtml4(src);
        System.out.println(Jsoup.clean(unescaped, new Whitelist()));
    }

}
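The test program still prints the â€œ debris, which is why the encoding should be fixed first. Sequences like â€™ are what typically appears when UTF-8 bytes are decoded as windows-1252. The sketch below reproduces and reverses that effect; it assumes this is what happened to the feed, which the thread does not confirm, and the class and method names are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {

    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Reproduces the garbling: UTF-8 bytes decoded with the wrong charset.
    static String garble(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    // Reverses it, assuming no bytes were lost along the way.
    static String repair(String garbled) {
        return new String(garbled.getBytes(CP1252), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Israel\u2019s"; // U+2019 is the right single quotation mark
        String garbled = garble(original);
        System.out.println(garbled);       // how it looks depends on the console encoding
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```

The repair step is lossy for some characters, because a few byte values (0x9D, for example) have no windows-1252 mapping. Decoding the input with the correct charset in the first place is the more reliable fix.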