Sorting words by frequency? (least to greatest)

4

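Does anyone know how to sort a list of words by frequency (least to greatest) using the built-in Collections.sort and the Comparator<String> interface?

I already have a method that gets the count of a given word in the text file. Now I need to create a method that compares the counts of the words and then puts them in a list ordered from the least frequent to the most frequent.

Any ideas and tips would be much appreciated. I'm having trouble getting started on this particular method.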

public class Parser implements Comparator<String> {

    public Map<String, Integer> wordCount;

    void parse(String filename) throws IOException {
        File file = new File(filename);
        Scanner scanner = new Scanner(file);

        //mapping of string -> integer (word -> frequency)
        Map<String, Integer> wordCount = new HashMap<String, Integer>();

        //iterates through each word in the text file
        while(scanner.hasNext()) {
            String word = scanner.next();
            if (scanner.next()==null) {
                wordCount.put(word, 1);
            }
            else {
                wordCount.put(word, wordCount.get(word) + 1);;
                }
            }
            scanner.next().replaceAll("[^A-Za-z0-9]"," ");
            scanner.next().toLowerCase();
        }

    public int getCount(String word) {
        return wordCount.get(word);
    }

    public int compare(String w1, String w2) {
        return getCount(w1) - getCount(w2);
    } 

        //this method should return a list of words in order of frequency from least to   greatest
    public List<String> getWordsInOrderOfFrequency() {
        List<Integer> wordsByCount = new ArrayList<Integer>(wordCount.values());
        //this part is unfinished.. the part i'm having trouble sorting the word frequencies
        List<String> result = new ArrayList<String>();


    }
}

4
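Create a class that holds a String (for the word) and an int (for the count), have it implement the Comparable<YourClass> interface, and compare by the int value in its compareTo(...) method. - Hovercraft Full Of Eels
Hey, I've added my code. I'm not looking for code to copy, just for suggestions on how to improve and finish the frequency method. Thanks! - user1333781
Use Collections.sort(justTheWords, this) - Torious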
At your comment //this part is unfinished, just do: List<String> justWords = new ArrayList<String>(wordCount.keySet()); Collections.sort(justWords, this); return justWords; (Collections.sort sorts the list in place and returns void.) - Torious
1
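The idea is that during the sort, sort calls this.compare() to compare two Strings, and compare does so by first looking up their counts in this (the Parser instance). That assumes parse has been called first. As far as I can see, that is what the compare method is meant for. Am I missing something? - Torious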
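
For comparison, the first comment's suggestion (a small class pairing a word with its count and implementing Comparable) could look roughly like the sketch below. The class name WordFrequency and the helper sortedWords are hypothetical names chosen for illustration, not anything from the question.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pair class: one word together with its count.
class WordFrequency implements Comparable<WordFrequency> {
    final String word;
    final int count;

    WordFrequency(String word, int count) {
        this.word = word;
        this.count = count;
    }

    // Natural ordering: smaller counts first (Integer.compare needs Java 7+).
    public int compareTo(WordFrequency other) {
        return Integer.compare(count, other.count);
    }

    // Wraps each map entry in a WordFrequency, sorts, and returns just the words.
    static List<String> sortedWords(Map<String, Integer> wordCount) {
        List<WordFrequency> pairs = new ArrayList<WordFrequency>();
        for (Map.Entry<String, Integer> e : wordCount.entrySet()) {
            pairs.add(new WordFrequency(e.getKey(), e.getValue()));
        }
        Collections.sort(pairs);
        List<String> result = new ArrayList<String>();
        for (WordFrequency wf : pairs) {
            result.add(wf.word);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        counts.put("the", 3);
        counts.put("cat", 1);
        counts.put("dog", 2);
        System.out.println(sortedWords(counts)); // prints [cat, dog, the]
    }
}
```

Compared with passing the Parser itself as the Comparator, this keeps the count next to the word, so the sort does not need a map lookup for every comparison.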
4 Answers

7

First, your usage of scanner.next() looks incorrect. Every call to next() returns the next word and advances the scanner past it, so the following code:

if(scanner.next() == null){ ... }

and also

scanner.next().replaceAll("[^A-Za-z0-9]"," ");
scanner.next().toLowerCase();

will consume words and then simply throw them away. What you probably want to do is:
String word = scanner.next().replaceAll("[^A-Za-z0-9]"," ").toLowerCase();

at the start of your while loop, so that the changes made to the word are stored in the word variable instead of being discarded.
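
A minimal sketch of the resulting loop, with exactly one next() call per iteration. The class and method names (TokenDemo, normalizedWords) are made up for this example, and it reads from a String instead of a file. It also assumes that punctuation should be removed outright, so it replaces with "" rather than " ", which would otherwise leave "cat," and "cat" as different words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class TokenDemo {
    // Reads every whitespace-separated token exactly once and normalizes it.
    static List<String> normalizedWords(String text) {
        List<String> words = new ArrayList<String>();
        Scanner scanner = new Scanner(text);
        while (scanner.hasNext()) {
            // One next() call per iteration: each call consumes one token.
            // Assumption: punctuation is deleted ("") instead of replaced by a space.
            String word = scanner.next().replaceAll("[^A-Za-z0-9]", "").toLowerCase();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        scanner.close();
        return words;
    }

    public static void main(String[] args) {
        System.out.println(normalizedWords("The cat, the DOG!")); // prints [the, cat, the, dog]
    }
}
```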

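Second, the usage of the wordCount map is slightly off. What you want is to check whether word is already in the map in order to decide which count to store. To do that, you have to look the word up in the map, for example: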

if(!wordCount.containsKey(word)){
  //no count registered for the word yet
  wordCount.put(word, 1);
}else{
  wordCount.put(word, wordCount.get(word) + 1);
}

Alternatively, you can do this:

Integer count = wordCount.get(word);
if(count == null){
  //no count registered for the word yet
  wordCount.put(word, 1);
}else{
  wordCount.put(word, count+1);
}

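I prefer this approach because it is a bit cleaner and needs only one read lookup (get) per word, whereas the first approach does two (containsKey followed by get) whenever the word is already in the map.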
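
If Java 8 or later is available (an assumption; the answer above does not depend on it), Map.merge folds the check and the update into a single call. A small sketch, with MergeCountDemo and countWords as illustrative names:

```java
import java.util.HashMap;
import java.util.Map;

class MergeCountDemo {
    static Map<String, Integer> countWords(String[] words) {
        Map<String, Integer> wordCount = new HashMap<String, Integer>();
        for (String word : words) {
            // Stores 1 if the word is absent, otherwise stores oldCount + 1.
            wordCount.merge(word, 1, Integer::sum);
        }
        return wordCount;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(new String[] {"a", "b", "a"});
        System.out.println("a -> " + counts.get("a") + ", b -> " + counts.get("b"));
    }
}
```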
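Now, to get a list of words in descending order of frequency, you can first convert your map to a list and then apply Collections.sort(), as suggested in this post. Below is a simplified version suited to your needs (for least-to-greatest order, as asked in the question, compare o1 to o2 instead of o2 to o1):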
static List<String> getWordInDescendingFreqOrder(Map<String, Integer> wordCount) {

    // Convert map to list of <String,Integer> entries
    List<Map.Entry<String, Integer>> list = 
        new ArrayList<Map.Entry<String, Integer>>(wordCount.entrySet());

    // Sort list by integer values
    Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
        public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
            // compare o2 to o1, instead of o1 to o2, to get descending freq. order
            return (o2.getValue()).compareTo(o1.getValue());
        }
    });

    // Populate the result into a list
    List<String> result = new ArrayList<String>();
    for (Map.Entry<String, Integer> entry : list) {
        result.add(entry.getKey());
    }
    return result;
}
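
Since the question asks for least-to-greatest order, here is a hedged sketch of the ascending variant. It assumes Java 8 or later for List.sort and Map.Entry.comparingByValue; the names AscendingFreqDemo and wordsInAscendingFreqOrder are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AscendingFreqDemo {
    static List<String> wordsInAscendingFreqOrder(Map<String, Integer> wordCount) {
        // Copy the entries into a list, because a Set cannot be sorted directly.
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<Map.Entry<String, Integer>>(wordCount.entrySet());

        // comparingByValue() orders entries by their Integer value, smallest first.
        entries.sort(Map.Entry.<String, Integer>comparingByValue());

        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, Integer> entry : entries) {
            result.add(entry.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        counts.put("the", 3);
        counts.put("cat", 1);
        counts.put("dog", 2);
        System.out.println(wordsInAscendingFreqOrder(counts)); // prints [cat, dog, the]
    }
}
```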

Hope this helps.

Edit: changed the comparison function as suggested by @dragon66. Thanks.


1
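Why not return (o2.getValue()).compareTo(o1.getValue()) instead of multiplying by -1? - dragon66
Thanks, that was helpful. But something came up: "Parser" in "public class Parser implements Comparator<String>" is now underlined in red, telling me it needs to inherit an abstract method? I don't understand why. - user1333781
Is the int compare(String w1, String w2) method still there? If you are using JDK 6, you may also need to put the @Override annotation on the method, otherwise it will complain. Also, you could put the getWordInDescendingFreqOrder() method into another class and reference it from your class. - rodion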

1

You can compare with the following and take ideas from it:

public class FrequencyCount {

    public static void main(String[] args) {

        // read in the words as an array
        String s = StdIn.readAll();
        // s = s.toLowerCase();
        // s = s.replaceAll("[\",!.:;?()']", "");
        String[] words = s.split("\\s+");

        // sort the words
        Merge.sort(words);

        // tabulate frequencies of each word
        Counter[] zipf = new Counter[words.length];
        int M = 0;                                        // number of distinct words
        for (int i = 0; i < words.length; i++) {
            if (i == 0 || !words[i].equals(words[i-1]))   // short-circuiting OR
                zipf[M++] = new Counter(words[i], words.length);
            zipf[M-1].increment();
        }

        // sort by frequency and print
        Merge.sort(zipf, 0, M);                           // sorting a subarray
        for (int j = M-1; j >= 0; j--) {
            StdOut.println(zipf[j]);
        }
    }
}

1
A solution that stays close to your original posting, with corrections, and with the sorting done as Torious suggested in the comments.
import java.util.*;

public class Parser implements Comparator <String> {

    public Map<String, Integer> wordCount;

    void parse ()
    {
        Scanner scanner = new Scanner (System.in);

        // don't redeclare it here - your attribute wordCount will else be shadowed
        wordCount = new HashMap<String, Integer> ();

        //iterates through each word in the text file
        while (scanner.hasNext ()) {
            String word = scanner.next ();
            // operate on the word, not on next and next of next word from Scanner
            // strip punctuation; replacing with " " would make "cat," and "cat" different keys
            word = word.replaceAll ("[^A-Za-z0-9]", "");
            word = word.toLowerCase ();
            // look into your map:
            if (! wordCount.containsKey (word))
                wordCount.put (word, 1);
            else
                wordCount.put (word, wordCount.get (word) + 1);
        }
    }

    public int getCount (String word) {
        return wordCount.get (word);
    }

    public int compare (String w1, String w2) {
        return getCount (w1) - getCount (w2);
    }

    public List<String> getWordsInOrderOfFrequency () {
        List<String> justWords = new ArrayList<String> (wordCount.keySet());
        Collections.sort (justWords, this);
        return justWords; 
    }

    public static void main (String args []) {
        Parser p = new Parser ();
        p.parse ();
        List<String> ls = p.getWordsInOrderOfFrequency ();
        for (String s: ls) 
            System.out.println (s);
    }
}
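
One detail this solution leaves open: words with equal counts compare as 0, so their relative order follows the HashMap's iteration order, which is unspecified. If a deterministic result is wanted, a tie-breaker can be added. The sketch below is only an illustration of that idea (TieBreakDemo and sortByCountThenWord are hypothetical names) and breaks ties alphabetically.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TieBreakDemo {
    static List<String> sortByCountThenWord(final Map<String, Integer> wordCount) {
        List<String> words = new ArrayList<String>(wordCount.keySet());
        Collections.sort(words, new Comparator<String>() {
            public int compare(String w1, String w2) {
                // Primary key: the count, smallest first.
                int byCount = wordCount.get(w1).compareTo(wordCount.get(w2));
                // Secondary key: the word itself, so ties are ordered alphabetically.
                return byCount != 0 ? byCount : w1.compareTo(w2);
            }
        });
        return words;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        counts.put("b", 1);
        counts.put("a", 1);
        counts.put("c", 2);
        System.out.println(sortByCountThenWord(counts)); // prints [a, b, c]
    }
}
```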

0

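rodion's solution is a kind of generics hell, but I don't have anything simpler, just something different.

In the end, his solution is shorter and better.

At first glance a TreeMap might seem appropriate, but it is sorted by key, which does not help with sorting by value, and we can't swap key and value because we look entries up by key.

So the next idea is to build a HashMap and use Collections.sort, but that method does not take a Map, only a List. From a Map you can get the entrySet, which produces another collection, but it is a Set, not a List. That was the point where I took a different direction:

I implemented an Iterator: I iterate over the entrySet and only return the entries whose value is 1. Entries with a higher value are buffered for later use. When the iterator is exhausted, I look into the buffer, and if it is not empty, I use the buffer's iterator from then on, increment the minimum value I'm looking for, and create a new buffer.

The advantage of an Iterator/Iterable pair is that the values can be obtained with the simplified for loop.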

import java.util.*;

// a short little declaration :) 
public class WordFreq implements Iterator <Map.Entry <String, Integer>>, Iterable <Map.Entry <String, Integer>>
{
    private Map <String, Integer> counter;
    private Iterator <Map.Entry <String, Integer>> it;
    private Set <Map.Entry <String, Integer>> buf;
    private int maxCount = 1; 

    public Iterator <Map.Entry <String, Integer>> iterator () {
        return this;
    }

    // The iterator interface expects a "remove ()" - nobody knows why
    public void remove ()
    {
        if (hasNext ())
            next ();
    } 

    public boolean hasNext ()
    {
        return it.hasNext () || ! buf.isEmpty ();
    }

    public Map.Entry <String, Integer> next ()
    {
        while (it.hasNext ()) {
            Map.Entry <String, Integer> mesi = it.next ();
            if (mesi.getValue () == maxCount)
                return mesi;
            else
                buf.add (mesi);
        }
        if (buf.isEmpty ())
            return null;
        ++maxCount;
        it = buf.iterator (); 
        buf = new HashSet <Map.Entry <String, Integer>> ();     
        return next ();
    } 

    public WordFreq ()
    {
        it = fill ();
        buf = new HashSet <Map.Entry <String, Integer>> ();
        // The "this" here has to be an Iterable to make the foreach work
        for (Map.Entry <String, Integer> mesi : this)
        {
            System.out.println (mesi.getValue () + ":\t" + mesi.getKey ());
        }
    }

    public Iterator <Map.Entry <String, Integer>> fill ()
    {
        counter = new HashMap <String, Integer> ();
        Scanner sc = new Scanner (System.in);
        while (sc.hasNext ())
        {
            push (sc.next ());
        }
        Set <Map.Entry <String, Integer>> set = counter.entrySet ();
        return set.iterator ();
    }

    public void push (String word)
    {
        Integer i = counter.get (word);
        int n = 1 + ((i != null) ? i : 0); 
        counter.put (word, n);
    }

    public static void main (String args[])
    {
        new WordFreq ();
    }
}

Since my solution reads from standard input, you can invoke it like this:

cat WordFreq.java | java WordFreq
