Java library for keywords extraction from input text

I'm looking for a Java library to extract keywords from a block of text.

The process should go as follows:

stop-word cleanup -> stemming -> searching for keywords based on English linguistics statistical information - meaning that if a word appears in the text with a higher probability than in the English language in general, it's a keyword candidate.

Is there a library that performs this task?

3 Answers

Here is a possible solution using Apache Lucene. I did not use the latest version but 3.6.2, since it is the one I know best. Besides /lucene-core-x.x.x.jar, don't forget to add /contrib/analyzers/common/lucene-analyzers-x.x.x.jar from the downloaded archive to your project: it contains the language-specific analyzers (in particular the English one in your case).

Note that this will only find the frequencies of the input text words, based on their respective stems. These frequencies should then be compared with English language statistics (this answer may help for that, by the way).


The data model

One keyword per stem. Different words may have the same stem, hence the terms set. The keyword frequency is incremented every time a new term is found (even if it has already been found - the set automatically removes duplicates).

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Keyword implements Comparable<Keyword> {

  private final String stem;
  private final Set<String> terms = new HashSet<String>();
  private int frequency = 0;

  public Keyword(String stem) {
    this.stem = stem;
  }

  public void add(String term) {
    terms.add(term);
    frequency++;
  }

  @Override
  public int compareTo(Keyword o) {
    // descending order
    return Integer.valueOf(o.frequency).compareTo(frequency);
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    } else if (!(obj instanceof Keyword)) {
      return false;
    } else {
      return stem.equals(((Keyword) obj).stem);
    }
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(new Object[] { stem });
  }

  public String getStem() {
    return stem;
  }

  public Set<String> getTerms() {
    return terms;
  }

  public int getFrequency() {
    return frequency;
  }

}

Utilities

Stem a word:

public static String stem(String term) throws IOException {

  TokenStream tokenStream = null;
  try {

    // tokenize
    tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(term));
    // stem
    tokenStream = new PorterStemFilter(tokenStream);

    // add each token in a set, so that duplicates are removed
    Set<String> stems = new HashSet<String>();
    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
      stems.add(token.toString());
    }

    // if no stem or 2+ stems have been found, return null
    if (stems.size() != 1) {
      return null;
    }
    String stem = stems.iterator().next();
    // if the stem has non-alphanumerical chars, return null
    if (!stem.matches("[a-zA-Z0-9-]+")) {
      return null;
    }

    return stem;

  } finally {
    if (tokenStream != null) {
      tokenStream.close();
    }
  }

}
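
For instance (a quick check of my own, consistent with the example output shown further down), words sharing a stem collapse onto the same value:

// assuming the stem(String) utility above; the caller must handle IOException
System.out.println(stem("compilers")); // prints "compil"
System.out.println(stem("compiled"));  // prints "compil"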

Find in a collection (will be used by the list of potential keywords):

public static <T> T find(Collection<T> collection, T example) {
  for (T element : collection) {
    if (element.equals(example)) {
      return element;
    }
  }
  collection.add(example);
  return example;
}

The core

Here is the main input method:

public static List<Keyword> guessFromString(String input) throws IOException {

  TokenStream tokenStream = null;
  try {

    // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific")
    input = input.replaceAll("-+", "-0");
    // replace any punctuation char but apostrophes and dashes by a space
    input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
    // replace most common english contractions
    input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

    // tokenize input
    tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(input));
    // to lowercase
    tokenStream = new LowerCaseFilter(Version.LUCENE_36, tokenStream);
    // remove dots from acronyms (and "'s" but already done manually above)
    tokenStream = new ClassicFilter(tokenStream);
    // convert any char to ASCII
    tokenStream = new ASCIIFoldingFilter(tokenStream);
    // remove english stop words
    tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, EnglishAnalyzer.getDefaultStopSet());

    List<Keyword> keywords = new LinkedList<Keyword>();
    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
      String term = token.toString();
      // stem each term
      String stem = stem(term);
      if (stem != null) {
        // create the keyword or get the existing one if any
        Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
        // add its corresponding initial token
        keyword.add(term.replaceAll("-0", "-"));
      }
    }

    // reverse sort by frequency
    Collections.sort(keywords);

    return keywords;

  } finally {
    if (tokenStream != null) {
      tokenStream.close();
    }
  }

}

Example

Using the guessFromString method on the introduction part of the Java Wikipedia article, here are the first 10 most frequent keywords (i.e. stems) that were found:

java         x12    [java]
compil       x5     [compiled, compiler, compilers]
sun          x5     [sun]
develop      x4     [developed, developers]
languag      x3     [languages, language]
implement    x3     [implementation, implementations]
applic       x3     [application, applications]
run          x3     [run]
origin       x3     [originally, original]
gnu          x3     [gnu]

Iterate over the output list to know which were the original found words for each stem, by getting the terms sets (shown between brackets [...] in the example above).
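
For instance, a minimal loop of my own (not part of the original answer) that prints the result in the format shown above, using the Keyword getters:

for (Keyword keyword : guessFromString(input)) {
  System.out.println(keyword.getStem() + " x" + keyword.getFrequency() + " " + keyword.getTerms());
}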


What's next

Compare the ratio of each stem frequency to the total of the frequencies with the English language statistics, and let me know if you manage it: I could be quite interested too :)
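
As a rough sketch of that comparison (my own illustration: the englishFrequency map, its source, and the handling of missing stems are assumptions, not part of this answer), the criterion from the question could translate to:

// requires java.util.Map
// "englishFrequency" is a hypothetical Map of stem -> probability in general
// English, to be built from an external word-frequency list
public static boolean isCandidate(Keyword keyword, int totalFrequencies,
    Map<String, Double> englishFrequency) {
  // relative frequency of the stem within the analyzed text
  double inText = (double) keyword.getFrequency() / totalFrequencies;
  Double inEnglish = englishFrequency.get(keyword.getStem());
  // stems absent from the reference list are treated as rare, hence candidates
  return inEnglish == null || inText > inEnglish;
}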


I'm not familiar with Lucene, but I see the code relies heavily on it, so I assume that's the case. Any idea where to find such an English stems dictionary? - Shay
@Shay You can find a word list here (http://www.wordfrequency.info/), but you have to pay to get the full list. I also found this website (http://ucrel.lancs.ac.uk/bncfreq/flists.html), which seems quite interesting, but I can't guarantee its data is relevant enough. - sp00m
It seems version 3.x.x is no longer available. I had to download and try version 4.4.0 instead. I don't know where the problem lies, but I get a NullPointerException when I try to execute the code. - Shay
@Shay Did you follow the link I gave? It still seems to work. - sp00m
The ClassicTokenizer class can't be resolved with Lucene 6.1.0, help needed. - Utsav Gupta

Here is an updated and working version of the code, compatible with Apache Lucene 5.x…6.x.

The CardKeyword class:

import java.util.HashSet;
import java.util.Set;

/**
 * Keyword card with stem form, terms dictionary and frequency rank
 */
class CardKeyword implements Comparable<CardKeyword> {

    /**
     * Stem form of the keyword
     */
    private final String stem;

    /**
     * Terms dictionary
     */
    private final Set<String> terms = new HashSet<>();

    /**
     * Frequency rank
     */
    private int frequency;

    /**
     * Build keyword card with stem form
     *
     * @param stem
     */
    public CardKeyword(String stem) {
        this.stem = stem;
    }

    /**
     * Add term to the dictionary and update its frequency rank
     *
     * @param term
     */
    public void add(String term) {
        this.terms.add(term);
        this.frequency++;
    }

    /**
     * Compare two keywords by frequency rank
     *
     * @param keyword
     * @return int, which contains comparison results
     */
    @Override
    public int compareTo(CardKeyword keyword) {
        return Integer.valueOf(keyword.frequency).compareTo(this.frequency);
    }

    /**
     * Get stem's hashcode
     *
     * @return int, which contains stem's hashcode
     */
    @Override
    public int hashCode() {
        return this.getStem().hashCode();
    }

    /**
     * Check if two stems are equal
     *
     * @param o
     * @return boolean, true if two stems are equal
     */
    @Override
    public boolean equals(Object o) {

        if (this == o) return true;

        if (!(o instanceof CardKeyword)) return false;

        CardKeyword that = (CardKeyword) o;

        return this.getStem().equals(that.getStem());
    }

    /**
     * Get stem form of keyword
     *
     * @return String, which contains the stem form of the keyword
     */
    public String getStem() {
        return this.stem;
    }

    /**
     * Get terms dictionary of the stem
     *
     * @return Set<String>, which contains the set of terms of the stem
     */
    public Set<String> getTerms() {
        return this.terms;
    }

    /**
     * Get stem frequency rank
     *
     * @return int, which contains the frequency of the stem
     */
    public int getFrequency() {
        return this.frequency;
    }
}

The KeywordsExtractor class:

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.ClassicFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.IOException;
import java.io.StringReader;
import java.util.*;

/**
 * Keywords extractor functionality handler
 */
class KeywordsExtractor {

    /**
     * Get list of keywords with stem form, frequency rank, and terms dictionary
     *
     * @param fullText
     * @return List<CardKeyword>, which contains keywords cards
     * @throws IOException
     */
    static List<CardKeyword> getKeywordsList(String fullText) throws IOException {

        TokenStream tokenStream = null;

        try {
            // protect the dashed words, so they are not split apart during processing
            fullText = fullText.replaceAll("-+", "-0");

            // replace any punctuation char but apostrophes and dashes with a space
            fullText = fullText.replaceAll("[\\p{Punct}&&[^'-]]+", " ");

            // replace most common English contractions
            fullText = fullText.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

            StandardTokenizer stdToken = new StandardTokenizer();
            stdToken.setReader(new StringReader(fullText));

            tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet());
            tokenStream.reset();

            List<CardKeyword> cardKeywords = new LinkedList<>();

            CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);

            while (tokenStream.incrementToken()) {

                String term = token.toString();
                String stem = getStemForm(term);

                if (stem != null) {
                    CardKeyword cardKeyword = find(cardKeywords, new CardKeyword(stem.replaceAll("-0", "-")));
                    // restore the dashed words to their original, readable form
                    cardKeyword.add(term.replaceAll("-0", "-"));
                }
            }

            // reverse sort by frequency
            Collections.sort(cardKeywords);

            return cardKeywords;
        } finally {
            if (tokenStream != null) {
                try {
                    tokenStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    /**
     * Get stem form of the term
     *
     * @param term
     * @return String, which contains the stemmed form of the term
     * @throws IOException
     */
    private static String getStemForm(String term) throws IOException {

        TokenStream tokenStream = null;

        try {
            StandardTokenizer stdToken = new StandardTokenizer();
            stdToken.setReader(new StringReader(term));

            tokenStream = new PorterStemFilter(stdToken);
            tokenStream.reset();

            // eliminate duplicate tokens by adding them to a set
            Set<String> stems = new HashSet<>();

            CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);

            while (tokenStream.incrementToken()) {
                stems.add(token.toString());
            }

            // if no stem was found, or more than one stem was found, return null
            if (stems.size() != 1) {
                return null;
            }

            String stem = stems.iterator().next();

            // if the stem form has non-alphanumerical chars, return null
            if (!stem.matches("[a-zA-Z0-9-]+")) {
                return null;
            }

            return stem;
        } finally {
            if (tokenStream != null) {
                try {
                    tokenStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    /**
     * Find sample in collection
     *
     * @param collection
     * @param sample
     * @param <T>
     * @return <T> T, which contains the found object within collection if exists, otherwise the initially searched object
     */
    private static <T> T find(Collection<T> collection, T sample) {

        for (T element : collection) {
            if (element.equals(sample)) {
                return element;
            }
        }

        collection.add(sample);

        return sample;
    }
}

The function call:

String text = "…";
List<CardKeyword> keywordsList = KeywordsExtractor.getKeywordsList(text);

I tried this code, but it didn't run correctly under Lucene 6.x. I had to add some reset() calls on the token streams. Also, it doesn't seem to handle dashed words correctly... I noticed that terms like "industry-recognized" get replaced with "industry-0recognized" to keep the tokenizer from splitting the word, but I still ended up with a "0recognized" token, so the hack doesn't seem to work. - J.D. Corbin
I managed to fix the StandardTokenizer issue by using a WhitespaceTokenizer instead. It seems to handle dashed words well without the hack. - J.D. Corbin
I'm curious which jars and versions this program uses. I tried lucene-core 4.9.0, 5.5.5 and 6.5.5, but got many compilation errors in every case. - Michael Easter
@MichaelEaster "I need another jar besides lucene-core" - that's strange; as far as I remember, I used the standard Lucene jars. - Mike
@MikeB I've built this example with Gradle, linked here - https://github.com/codetojoy/sandbox_lucene/tree/master/StackOverflow_17447045_original_post ... In the coming weeks, I may experiment with the latest version of Lucene and some other ideas (via other examples in the repo). - Michael Easter

A relatively simple approach, based on the RAKE algorithm and the opennlp models, wrapped up in the rapidrake-java library:

import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;

import io.github.crew102.rapidrake.RakeAlgorithm;
import io.github.crew102.rapidrake.model.RakeParams;
import io.github.crew102.rapidrake.model.Result;

public class KeywordExtractor {

    private static String delims = "[-,.?():;\"!/]";
    private static String posUrl = "model-bin/en-pos-maxent.bin";
    private static String sentUrl = "model-bin/en-sent.bin";

    public static void main(String[] args) throws IOException {
        InputStream stopWordsStream = KeywordExtractor.class.getResourceAsStream("/stopword-list.txt");
        String[] stopWords = IOUtils.readLines(stopWordsStream, "UTF-8").toArray(new String[0]);
        String[] stopPOS = {"VBD"};
        // RakeParams: stop words, stop POS tags, min word length (0),
        // stemming enabled (true), phrase delimiters
        RakeParams params = new RakeParams(stopWords, stopPOS, 0, true, delims);
        RakeAlgorithm rakeAlg = new RakeAlgorithm(params, posUrl, sentUrl);
        Result aRes = rakeAlg.rake("I'm looking for a Java library to extract keywords from a block of text.");
        System.out.println(aRes);
        // OUTPUT:
        // [looking (1), java library (4), extract keywords (4), block (1), text (1)]
    }
}

As you can see from the sample output, you get a map of keywords along with their relative weights.

https://github.com/crew102/rapidrake-java所述,您需要从opennlp下载页面下载en-pos-maxent.binen-sent.bin文件,并将它们放入项目根目录下的model-bin文件夹中(如果使用maven项目结构,则必须是src文件夹的同级目录)。停用词文件应该放在src/main/resources/stopword-list.txt下(假设使用maven结构),例如可以从https://github.com/terrier-org/terrier-desktop/blob/master/share/stopword-list.txt下载。

