How to extract the main text from HTML using Tika

Question

How to extract the main text from HTML using Tika

html-parsing apache-tika boilerpipe

5

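I want to know how I can extract the main text and plain text from HTML using Tika.

One possible solution might be to use BoilerpipeContentHandler, but do you have any sample/demo code that shows how?

Thanks very much in advance.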

- user2651995

2 Answers

2

Here is an example:

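import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public String[] tika_autoParser() {
    // result[0] holds the title, result[1] the body text; result[2] stays unused (null)
    String[] result = new String[3];
    // try-with-resources closes the stream even if parsing fails
    try (InputStream input = new FileInputStream("/Users/nazanin/Books/Web crawler.pdf")) {
        // Collects the plain text of the document body
        ContentHandler textHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        // Detects the document type and delegates to the matching parser
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        parser.parse(input, textHandler, metadata, context);
        // Use the static property key rather than reaching it through the instance
        result[0] = "Title: " + metadata.get(TikaCoreProperties.TITLE);
        result[1] = "Body: " + textHandler.toString();
    } catch (IOException | SAXException | TikaException e) {
        // IOException also covers FileNotFoundException
        e.printStackTrace();
    }

    return result;
}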

- UserNeD

- peatb · Accepted Answer

The BodyContentHandler class does not use the Boilerpipe code, so you have to use BoilerpipeContentHandler explicitly. The following code worked for me:

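import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
// Package as in Tika 1.x; other releases may place this class elsewhere
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public String[] tika_autoParser() {
    // result[0] holds the title, result[1] the body text; result[2] stays unused (null)
    String[] result = new String[3];
    // try-with-resources closes the stream even if parsing fails
    try (InputStream input = new FileInputStream("test.html")) {
        // Receives only the text that Boilerpipe keeps as main content
        ContentHandler textHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // Wrap the text handler so boilerplate is filtered out before it arrives
        parser.parse(input, new BoilerpipeContentHandler(textHandler), metadata, context);
        result[0] = "Title: " + metadata.get(TikaCoreProperties.TITLE);
        result[1] = "Body: " + textHandler.toString();
    } catch (IOException | SAXException | TikaException e) {
        // IOException also covers FileNotFoundException
        e.printStackTrace();
    }

    return result;
}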
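
To make the wrapping idea in this answer more concrete, here is a rough JDK-only sketch of one SAX handler sitting in front of another and deciding which text reaches it. It is not Boilerpipe's algorithm and does not use Tika at all: the class names (HandlerWrappingSketch, TextCollector, SkippingHandler) and the fixed list of "boilerplate" element names are assumptions made up for illustration, and the input has to be well-formed XHTML because it goes through the JDK's XML parser.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical names throughout; a JDK-only stand-in, not Boilerpipe's algorithm. */
class HandlerWrappingSketch {

    /** Inner handler: collects whatever character data reaches it. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            // Collapse whitespace runs so the result is easy to compare
            return text.toString().trim().replaceAll("\\s+", " ");
        }
    }

    /** Outer handler: forwards text to the delegate unless inside a skipped element. */
    static class SkippingHandler extends DefaultHandler {
        // Assumption: these element names stand in for "boilerplate"
        private static final Set<String> SKIPPED =
                new HashSet<>(Arrays.asList("nav", "header", "footer", "script", "style"));
        private final DefaultHandler delegate;
        private int skipDepth = 0; // > 0 while inside a skipped element

        SkippingHandler(DefaultHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (skipDepth > 0 || SKIPPED.contains(qName)) {
                skipDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (skipDepth > 0) {
                skipDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (skipDepth == 0) {
                delegate.characters(ch, start, length);
            }
        }
    }

    /** Parses well-formed XHTML and returns the text that survived the filter. */
    static String extract(String xhtml) throws Exception {
        TextCollector collector = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), new SkippingHandler(collector));
        return collector.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><nav>Home | About</nav>"
                + "<p>Main article text.</p><footer>Copyright</footer></body></html>";
        System.out.println(extract(page)); // prints "Main article text."
    }
}
```

The shape matches the answer's code: the parser only ever talks to the outer handler, and the inner handler is read afterwards via toString(). Boilerpipe itself decides what counts as main content with text-based heuristics rather than a fixed tag list.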