Scala and HTML parsing

22

How can I load an HTML DOM document into Scala? The XML singleton raised errors when it tried to load the xmlns tags.

import java.net.URL
import scala.xml.{Elem, XML}

object NetParse {

   // Fetches the document at sUrl and parses it with the standard XML loader.
   // XML.load expects well-formed XML, so ordinary HTML pages tend to fail here.
   def netParse(sUrl: String): Elem = {
       val url = new URL(sUrl)
       val connect = url.openConnection()

       XML.load(connect.getInputStream)
   }
}

I finally found a solution! It requires Scala 2.7.7 or later to run (2.7.0 has a fatal bug): How to use TagSoup with Scala XML

5 Answers

16

2
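Most of the code examples are missing when you view that page. Here is a link to a version that still has all of the original content: http://web.archive.org/web/20111121010724/http://www.hars.de/2009/01/html-as-xml-in-scala.html - Nick Knowlson
For anyone who wants an easy way to pull this library into a project, see http://mvnrepository.com/artifact/org.ccil.cowan.tagsoup/tagsoup/1.2.1. - icl7126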

6
Try using scala.xml.parsing.XhtmlParser instead.
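
A minimal sketch of what that could look like, assuming the scala-xml module is on the classpath and the input is already well-formed XHTML. The object name XhtmlExample and the sample markup are made up for illustration.

```scala
import scala.io.Source
import scala.xml.NodeSeq
import scala.xml.parsing.XhtmlParser

// Hypothetical wrapper object, not part of any library.
object XhtmlExample {

  // Sample input: well-formed XHTML, since XhtmlParser is not a tag-soup parser.
  val sample: String =
    "<html><head><title>Hello</title></head><body><p>Hi</p></body></html>"

  // XhtmlParser.apply takes a scala.io.Source and returns the parsed nodes.
  def parse(markup: String): NodeSeq =
    XhtmlParser(Source.fromString(markup))

  def main(args: Array[String]): Unit = {
    val doc = parse(sample)
    // Select the <title> element anywhere in the tree and print its text.
    println((doc \\ "title").text)
  }
}
```

To parse a page fetched from a URL, Source.fromURL could be passed in place of Source.fromString, with the same requirement that the page is well-formed.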

3
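It's worth noting that this solution does not handle "tag soup": only well-formed XHTML parses successfully. Compared with scala.xml.XML.load*, it basically just adds the standard HTML entities and apparently preserves CDATA blocks. (In my case that was all I needed, so no problem!) - Chris W.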

5

I just tried to use this answer with Scala 2.8.1 and ended up using the work from:

http://www.hars.de/2009/01/html-as-xml-in-scala.html

The interesting part that I needed was:

val parserFactory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
val parser = parserFactory.newSAXParser()
val source = new org.xml.sax.InputSource("http://www.scala-lang.org")
val adapter = new scala.xml.parsing.NoBindingFactoryAdapter
adapter.loadXML(source, parser)

1
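This worked for me too. However, I'd like to be able to turn raw HTML into the input source, either on the val source = line or in the adapter.loadXML part. I tried adapter.loadString("<html>..."), but it can't handle malformed content. Any ideas? - jbnunn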
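
One possible way to feed a raw HTML string through the same TagSoup parser is to wrap it in a java.io.StringReader and hand that to InputSource. This is only a sketch: it assumes the TagSoup jar and scala-xml are on the classpath and that the scala-xml version in use still offers loadXML(source, parser), as in the snippet above. The names TagSoupString and parseHtml are hypothetical.

```scala
import java.io.StringReader

import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
import org.xml.sax.InputSource
import scala.xml.Node
import scala.xml.parsing.NoBindingFactoryAdapter

// Hypothetical helper object, not part of TagSoup or scala-xml.
object TagSoupString {

  // Parses an HTML string, malformed or not, into a scala.xml.Node.
  def parseHtml(html: String): Node = {
    // TagSoup's JAXP factory yields a SAX parser that tolerates broken markup.
    val parser = new SAXFactoryImpl().newSAXParser()
    // InputSource accepts a character stream, so no URL or file is needed.
    val source = new InputSource(new StringReader(html))
    // A fresh adapter per call avoids sharing parser state between documents.
    new NoBindingFactoryAdapter().loadXML(source, parser)
  }
}
```

The difference from adapter.loadString is that loadString uses the adapter's default XML parser, while passing the parser explicitly keeps TagSoup in charge of repairing the markup. For example, TagSoupString.parseHtml("<p>unclosed <b>bold") should come back as a complete tree with an html root.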

5

Scala Scraper

I recommend Scala Scraper, which lets you parse HTML elegantly, like this:

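// Imports from the Scala Scraper README that the snippet relies on
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

// Parse elements from files, URLs or plain strings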
val browser = JsoupBrowser()
val doc = browser.parseFile("core/src/test/resources/example.html")
val doc2 = browser.get("http://example.com")
val doc3 = browser.parseString("<html><h1>parse me</h1></html>")

// Extract the text inside the element with id "header"
doc >> text("#header")

// Extract the <span> elements inside #menu
val items = doc >> elementList("#menu span")

// From each item, extract all the text inside their <a> elements
items.map(_ >> allText("a"))

The example above is taken from the Scala Scraper README.


2
/* 
Copyright (c) 2008 Florian Hars, BIK Aschpurwis+Behrens GmbH, Hamburg 
Copyright (c) 2002-2008 EPFL, Lausanne, unless otherwise specified. 
All rights reserved. 

This software was developed by the Programming Methods Laboratory of the 
Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland. 

Permission to use, copy, modify, and distribute this software in source 
or binary form for any purpose with or without fee is hereby granted, 
provided that the following conditions are met: 

1. Redistributions of source code must retain the above copyright 
  notice, this list of conditions and the following disclaimer. 

2. Redistributions in binary form must reproduce the above copyright 
  notice, this list of conditions and the following disclaimer in the 
  documentation and/or other materials provided with the distribution. 

3. Neither the name of the EPFL nor the names of its contributors 
  may be used to endorse or promote products derived from this 
  software without specific prior written permission. 


 THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 
 ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 
 IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 
 ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 
 FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 
 DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 
 SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 
 CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 
 LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 
 OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 
 SUCH DAMAGE. 
*/ 

package tagsoup 

import org.xml.sax.InputSource 
import javax.xml.parsers.SAXParser 
import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl 
import scala.xml.parsing.FactoryAdapter 
import scala.xml._ 

class TagSoupFactoryAdapter extends FactoryAdapter { 

  val parserFactory = new SAXFactoryImpl 
  parserFactory.setNamespaceAware(false) 

  val emptyElements = Set("area", "base", "br", "col", "hr", "img", 
                      "input", "link", "meta", "param") 

  /** Tests if an XML element contains text. 
   * @return true if element named <code>localName</code> contains text. 
   */ 
  def nodeContainsText(localName: String) = !(emptyElements contains localName) 

  /** creates a node. 
  */ 
  def createNode(pre:String, label: String, attrs: MetaData, 
             scpe: NamespaceBinding, children: List[Node] ): Elem = { 
    Elem( pre, label, attrs, scpe, children:_* ); 
  } 

  /** creates a text node 
  */ 
  def createText( text:String ) = 
    Text( text ); 

  /** Ignore Processing Instructions 
  */ 
  def createProcInstr(target: String, data: String) = Nil 

  /** load XML document 
   * @param source 
   * @return a new XML document object 
   */ 
  override def loadXML(source: InputSource) = { 
    val parser: SAXParser = parserFactory.newSAXParser() 

    scopeStack.push(TopScope) 
    parser.parse(source, this) 
    scopeStack.pop 
    rootElem 
  } 

}

Source: How to use TagSoup with Scala XML

