Nutch: calling it from Java instead of using the command line?

8

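Am I being dense, or is there really no way to invoke Apache Nutch from Java code? Where can I find documentation (or a guide, or a tutorial) on this? Google has failed me, so I tried Bing. (Yes, I know, pathetic.) Any suggestions? Thanks in advance.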

(Also, if Nutch is a mess, is there another crawler written in Java that has proven reliable at internet scale and has actual documentation?)


Please tell me this isn't the answer. https://dev59.com/t1LTa4cB1Zd3GeqPZlvk - ChrisJF
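You can see how it works in my GitHub repository: https://github.com/yegor256/nutch-in-java. I ran into the same problem and, after a few hours of investigation, managed to put together a fully working Java snippet. - yegor256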
1 Answer

8

If you look inside the bin/nutch script, you'll see that it invokes the Java class corresponding to your command:

# figure out which class to run
if [ "$COMMAND" = "crawl" ] ; then
  CLASS=org.apache.nutch.crawl.Crawl
elif [ "$COMMAND" = "inject" ] ; then
  CLASS=org.apache.nutch.crawl.Injector
elif [ "$COMMAND" = "generate" ] ; then
  CLASS=org.apache.nutch.crawl.Generator
elif [ "$COMMAND" = "freegen" ] ; then
  CLASS=org.apache.nutch.tools.FreeGenerator
elif [ "$COMMAND" = "fetch" ] ; then
  CLASS=org.apache.nutch.fetcher.Fetcher
elif [ "$COMMAND" = "fetch2" ] ; then
  CLASS=org.apache.nutch.fetcher.Fetcher2
elif [ "$COMMAND" = "parse" ] ; then
  CLASS=org.apache.nutch.parse.ParseSegment
elif [ "$COMMAND" = "readdb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDbReader
elif [ "$COMMAND" = "convdb" ] ; then
  CLASS=org.apache.nutch.tools.compat.CrawlDbConverter
elif [ "$COMMAND" = "mergedb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDbMerger
elif [ "$COMMAND" = "readlinkdb" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDbReader
elif [ "$COMMAND" = "readseg" ] ; then
  CLASS=org.apache.nutch.segment.SegmentReader
elif [ "$COMMAND" = "segread" ] ; then
  echo "[DEPRECATED] Command 'segread' is deprecated, use 'readseg' instead."
  CLASS=org.apache.nutch.segment.SegmentReader
elif [ "$COMMAND" = "mergesegs" ] ; then
  CLASS=org.apache.nutch.segment.SegmentMerger
elif [ "$COMMAND" = "updatedb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDb
elif [ "$COMMAND" = "invertlinks" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDb
elif [ "$COMMAND" = "mergelinkdb" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDbMerger
elif [ "$COMMAND" = "index" ] ; then
  CLASS=org.apache.nutch.indexer.Indexer
elif [ "$COMMAND" = "solrindex" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrIndexer
elif [ "$COMMAND" = "dedup" ] ; then
  CLASS=org.apache.nutch.indexer.DeleteDuplicates
elif [ "$COMMAND" = "solrdedup" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrDeleteDuplicates
elif [ "$COMMAND" = "merge" ] ; then
  CLASS=org.apache.nutch.indexer.IndexMerger
elif [ "$COMMAND" = "plugin" ] ; then
  CLASS=org.apache.nutch.plugin.PluginRepository
elif [ "$COMMAND" = "server" ] ; then
  CLASS='org.apache.nutch.searcher.DistributedSearch$Server'
else
  CLASS=$COMMAND
fi

# run it
exec "$JAVA" $JAVA_HEAP_MAX $NUTCH_OPTS -classpath "$CLASSPATH" $CLASS "$@"

From there, it's just a matter of looking at the API docs and, if necessary, the source code of those classes.
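As a rough illustration of what the script does, the same dispatch can be sketched in plain Java: map the command name to a class name, then call that class's `main(String[])` through reflection. This is only a minimal sketch, not an official Nutch API. `NutchLauncher`, `resolveClass`, and `run` are hypothetical names introduced here, the table copies just a few entries from the script excerpt above, and it assumes the Nutch jars, configuration, and plugins for your version are on the classpath.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.Map;

// Hypothetical helper (not part of Nutch): mirrors the dispatch in bin/nutch.
class NutchLauncher {

    // A few command-to-class pairs copied from the bin/nutch excerpt above.
    // Class names differ between Nutch versions, so check your own script.
    static final Map<String, String> COMMANDS = Map.of(
        "inject", "org.apache.nutch.crawl.Injector",
        "generate", "org.apache.nutch.crawl.Generator",
        "fetch", "org.apache.nutch.fetcher.Fetcher",
        "parse", "org.apache.nutch.parse.ParseSegment",
        "updatedb", "org.apache.nutch.crawl.CrawlDb",
        "invertlinks", "org.apache.nutch.crawl.LinkDb");

    // Like the script's final else branch, an unknown command is treated
    // as a fully qualified class name.
    static String resolveClass(String command) {
        return COMMANDS.getOrDefault(command, command);
    }

    // Loads the class and invokes its static main(String[]) in this JVM.
    // Throws ClassNotFoundException if the class is not on the classpath.
    static void run(String command, String... args) throws Exception {
        Class<?> cls = Class.forName(resolveClass(command));
        Method main = cls.getMethod("main", String[].class);
        // Cast to Object so the array is passed as one argument, not spread.
        main.invoke(null, (Object) args);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: NutchLauncher COMMAND [ARGS...]");
            return;
        }
        run(args[0], Arrays.copyOfRange(args, 1, args.length));
    }
}
```

For example, `NutchLauncher.run("inject", "crawl/crawldb", "urls")` would correspond to `bin/nutch inject crawl/crawldb urls` (the paths here are placeholders). One caveat: a tool's `main` method may call `System.exit`, which would terminate the calling JVM, so for real embedding it is probably better to read the source of the class in question and call its non-`main` entry points directly.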

