How do I find words in a text file and print the most frequent word using arrays?

4
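I'm having trouble finding the most frequent word, and the most frequent case-insensitive word, in my program. I have a Scanner that reads the text file and a while loop, but I still don't know how to implement what I'm looking for. Should I be using a different String function to read and print the words?
Here is my code: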
public class letters {
public static void main(String[] args) throws FileNotFoundException {
    FileInputStream fis = new FileInputStream("input.txt");
    Scanner scanner = new Scanner(fis);
    String word[] = new String[500];
    while (scanner.hasNextLine()) {
        String s = scanner.nextLine();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
             }

          }
      String []roll = s.split("\\s");
       for(int i=0;i<roll.length;i++){
           String lin = roll[i];
           //System.out.println(lin);
      }
 }

This is what I have so far. I need the output to be:
   Word:
   6 roll

  Case-insensitive word:
  18 roll

This is my input file:

@
roll tide roll!
Roll Tide Roll!
ROLL TIDE ROLL!
ROll tIDE ROll!
 roll  tide  roll! 
 Roll  Tide  Roll! 
 ROLL  TIDE  ROLL! 
   roll    tide    roll!   
    Roll Tide Roll  !   
@
65-43+21= 43
65.0-43.0+21.0= 43.0
 65 -43 +21 = 43 
 65.0 -43.0 +21.0 = 43.0 
 65 - 43 + 21 = 43 
 65.00 - 43.0 + 21.000 = +0043.0000 
    65   -  43  +   21  =   43  

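I just need it to find the word that occurs most often (a word being a maximal run of consecutive letters, i.e. "roll") and print how many times it occurs (i.e. 6). If anyone could help, that would be great! Thanks.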

4 Answers

5
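Consider using a Map<String,Integer> to count the words, so that any number of distinct words can be counted. See the Map documentation. Something like this (it would need to be modified to be case-insensitive):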
// requires: import java.util.HashMap; import java.util.Map;
public Map<String,Integer> words_count = new HashMap<String,Integer>();

//read your line (you will have to determine whether this line should be split or is an equation)
//also just noticed that the trailing '!' would need to be removed

String[] words = line.split("\\s+");
for(int i=0;i<words.length;i++)
{
     String s = words[i];
     if(words_count.containsKey(s))
     {
          // word already seen: bump its count
          Integer count = words_count.get(s) + 1;
          words_count.put(s, count);
     }
     else
          words_count.put(s, 1);

}

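Once you have counted how many times each word occurs, pick the word with the highest count. You could sort by count in descending order and take the first entry, but a single pass over the map is enough: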

Integer frequency = null;
String mostFrequent = null;
for(String s : words_count.keySet())
{
    Integer i = words_count.get(s);
    // the first word seen must set both fields, otherwise mostFrequent stays null when that word wins
    if(frequency == null || i > frequency)
    {
         frequency = i;
         mostFrequent = s;
    }
}

Then print it:
System.out.println("The word "+ mostFrequent +" occurred "+ frequency +" times");
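For reference, the two steps above (count, then pick the maximum) can be put together into one runnable sketch that also covers the case-insensitive variant. This is only one way to do it, not the answerer's exact code: the class name MapWordCount and its methods are made up for illustration, a word is assumed to be a maximal run of ASCII letters, and a LinkedHashMap is swapped in so that a tie goes to the word seen first.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class MapWordCount {
    // Assumption: a "word" is a maximal run of ASCII letters.
    private static final Pattern LETTERS = Pattern.compile("[A-Za-z]+");

    // Counts every run of letters in the text; lower-cases each word first when ignoreCase is true.
    static Map<String, Integer> count(String text, boolean ignoreCase) {
        Map<String, Integer> counts = new LinkedHashMap<>(); // keeps first-seen order
        Matcher m = LETTERS.matcher(text);
        while (m.find()) {
            String word = ignoreCase ? m.group().toLowerCase(Locale.ROOT) : m.group();
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Returns the entry with the highest count, or null if the map is empty.
    // The comparison is strict, so on a tie the word seen first wins.
    static Map.Entry<String, Integer> mostFrequent(Map<String, Integer> counts) {
        Map.Entry<String, Integer> best = null;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (best == null || e.getValue() > best.getValue()) {
                best = e;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // A three-line excerpt of the question's input plus one equation line.
        String text = "roll tide roll!\nRoll Tide Roll!\nROLL TIDE ROLL!\n65-43+21= 43";
        Map.Entry<String, Integer> exact = mostFrequent(count(text, false));
        Map.Entry<String, Integer> folded = mostFrequent(count(text, true));
        System.out.println("Word:\n" + exact.getValue() + " " + exact.getKey());
        System.out.println("\nCase-insensitive word:\n" + folded.getValue() + " " + folded.getKey());
    }
}
```

Matching letter runs with a regex sidesteps the trailing '!' and the equation lines, because digits and punctuation never become part of a word. With a plain HashMap the iteration order is unspecified, so which of several equally frequent words gets reported could vary.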

I thought whitespace was "\s+", not "\s". - Dave
\\s+ just means one or more whitespace characters - Java Devil
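To make the difference between the two patterns concrete, here is a small sketch. SplitDemo and its method names are invented for this illustration; only String.split and Arrays.toString are real API.

```java
import java.util.Arrays;

// Hypothetical demo class for comparing the two split patterns.
class SplitDemo {
    // Splits at every single whitespace character; consecutive spaces yield empty tokens.
    static String[] splitSingle(String line) {
        return line.split("\\s");
    }

    // Splits at runs of whitespace; only a leading run can still yield one empty token.
    static String[] splitRuns(String line) {
        return line.split("\\s+");
    }

    public static void main(String[] args) {
        String line = "roll  tide  roll!"; // two spaces between words, as in the input file
        System.out.println(Arrays.toString(splitSingle(line)));
        System.out.println(Arrays.toString(splitRuns(line)));
    }
}
```

With "\\s" the double spaces produce empty strings between the words, which would then be counted as words unless they are filtered out. Neither pattern removes the trailing '!'.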
1
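I've never used a HashMap before. I'll need to look it up. Thanks anyway! - charond Richardson
How would you write the code for the System.out.println() output? - charond Richardson
I've been trying to put your code in, but words_count keeps giving me an error. I also don't think I'm allowed to use HashMaps in this program. I can only use arrays and strings. - charond Richardson
What error? Why aren't you allowed to use a HashMap? Make sure you include the appropriate import statements. - Java Devil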

1

First, accumulate all the words into a Map, like this:

...
String[] roll = s.split("\\s+");
for (final String word : roll) {
    Integer qty = words.get(word);
    if (qty == null) {
        qty = 1;
    } else {
        qty = qty + 1;
    }
    words.put(word, qty);
}
...

Then you need to find out which one has the highest count:

String bestWord = null; // must be initialised, otherwise the println below does not compile
int maxQty = 0;
for(final String word : words.keySet()) {
    if(words.get(word) > maxQty) {
        maxQty = words.get(word);
        bestWord = word;
    }
}
System.out.println("Word:");
System.out.println(Integer.toString(maxQty) + " " + bestWord);        

Finally, you need to merge all the case variants of the same word:

Map<String, Integer> wordsNoCase = new HashMap<String, Integer>();
for(final String word : words.keySet()) {
    Integer qty = wordsNoCase.get(word.toLowerCase());
    if(qty == null) {
        qty = words.get(word);
    } else {
        qty += words.get(word);
    }
    wordsNoCase.put(word.toLowerCase(), qty);
}
words = wordsNoCase;

Then rerun the previous snippet to find the word with the highest count.


Is there a way to do this without using a HashMap? - charond Richardson
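For the arrays-and-strings restriction, one possible approach is a pair of parallel arrays searched linearly. The sketch below is not taken from any of the answers: the class ArrayWordCount and all of its members are hypothetical, the capacity of 500 is borrowed from the question's own array size, and Character.isLetter is assumed to be an acceptable definition of "letter".

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Locale;
import java.util.Scanner;

// Hypothetical arrays-only word counter; counts[i] is how often words[i] has been seen.
class ArrayWordCount {
    private final String[] words;
    private final int[] counts;
    private int size = 0;

    ArrayWordCount(int capacity) {
        words = new String[capacity];
        counts = new int[capacity];
    }

    // Walks the line character by character and records each maximal run of letters.
    void addLine(String line, boolean ignoreCase) {
        int start = -1; // index where the current word began, or -1 when not inside a word
        for (int i = 0; i <= line.length(); i++) {
            boolean letter = i < line.length() && Character.isLetter(line.charAt(i));
            if (letter && start < 0) {
                start = i;
            } else if (!letter && start >= 0) {
                String word = line.substring(start, i);
                add(ignoreCase ? word.toLowerCase(Locale.ROOT) : word);
                start = -1;
            }
        }
    }

    // Linear search for the word; increments its count or appends it as a new entry.
    private void add(String word) {
        for (int i = 0; i < size; i++) {
            if (words[i].equals(word)) {
                counts[i]++;
                return;
            }
        }
        if (size == words.length) {
            throw new IllegalStateException("more than " + words.length + " distinct words");
        }
        words[size] = word;
        counts[size] = 1;
        size++;
    }

    // Index of the most frequent word (the first one wins a tie), or -1 if nothing was added.
    int bestIndex() {
        int best = -1;
        for (int i = 0; i < size; i++) {
            if (best < 0 || counts[i] > counts[best]) {
                best = i;
            }
        }
        return best;
    }

    String wordAt(int i) { return words[i]; }
    int countAt(int i) { return counts[i]; }

    // Prints the count and word in the "6 roll" layout the question asks for.
    void report(String heading) {
        int best = bestIndex();
        System.out.println(heading);
        System.out.println(best < 0 ? "(no words found)" : counts[best] + " " + words[best]);
    }

    public static void main(String[] args) throws FileNotFoundException {
        ArrayWordCount exact = new ArrayWordCount(500);
        ArrayWordCount folded = new ArrayWordCount(500);
        try (Scanner scanner = new Scanner(new File("input.txt"))) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine();
                exact.addLine(line, false);
                folded.addLine(line, true);
            }
        }
        exact.report("Word:");
        System.out.println();
        folded.report("Case-insensitive word:");
    }
}
```

In the input file from the question, "roll" and "Roll" each occur 6 times, so the tie-break matters: because the comparison in bestIndex is strict, the word encountered first ("roll") is the one reported. The linear search makes this quadratic in the number of distinct words, which should be fine for a file of this size.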

1

Try to use a HashMap for better results. You need to use BufferedReader and FileReader for the input file, like this:

FileReader text = new FileReader("file.txt");
BufferedReader textFile = new BufferedReader(text);

The BufferedReader object textFile needs to be passed as an argument to the following method:

public HashMap<String, Integer> countWordFrequency(BufferedReader textFile) throws IOException
{
/*This method finds the frequency of words in a text file
 * and saves the word and its corresponding frequency in 
 * a HashMap.
 */
    HashMap<String, Integer> mapper = new HashMap<String, Integer>();
    StringBuffer multiLine = new StringBuffer("");
    String line = null;
    if(textFile.ready())
    {
        while((line = textFile.readLine()) != null)
        {
            multiLine.append(line);
            String[] words = line.replaceAll("[^a-zA-Z]", " ").toLowerCase().split(" ");
            for(String word : words)
            {
                if(!word.isEmpty())
                {
                    Integer freq = mapper.get(word);
                    if(freq == null)
                    {
                        mapper.put(word, 1);
                    }
                    else
                    {
                        mapper.put(word, freq+1);
                    }
                }
            }
        }
        textFile.close();
    }
    return mapper;
}

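The line line.replaceAll("[^a-zA-Z]", " ").toLowerCase().split(" "); replaces every non-letter character with a space, then converts everything to lower case (which takes care of case-insensitivity), and finally splits the result into words on spaces. Runs of consecutive spaces produce empty strings in the array, which is why the loop skips any word for which isEmpty() is true.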
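As a quick illustration of what that expression does to lines from the question's input, here is a small sketch. TokenizeDemo and tokenize are names invented here; the expression inside is copied from the answer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class wrapping the expression from the answer.
class TokenizeDemo {
    // Applies the answer's expression, then drops the empty tokens it produces.
    static List<String> tokenize(String line) {
        String[] raw = line.replaceAll("[^a-zA-Z]", " ").toLowerCase().split(" ");
        List<String> words = new ArrayList<>();
        for (String token : raw) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("    Roll Tide Roll  !   "));   // prints [roll, tide, roll]
        System.out.println(tokenize(" 65.0 -43.0 +21.0 = 43.0 ")); // prints []
    }
}
```

Lines made up only of digits and symbols collapse to spaces and contribute no words at all, so the equation block of the input file does not affect the counts.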

/*This method finds the highest value in HashMap
 * and returns the same.
 */
public int maxFrequency(HashMap<String, Integer> mapper)
{
    int maxValue = Integer.MIN_VALUE;
    for(int value : mapper.values())
    {
        if(value > maxValue)
        {
            maxValue = value;
        }
    }
    return maxValue;
}

The code above returns the highest value in the hash map.
/*This method prints the HashMap Key with a particular Value.
 */
public void printWithValue(HashMap<String, Integer> mapper, Integer value)
{
    for (Entry<String, Integer> entry : mapper.entrySet()) 
    {
        if (entry.getValue().equals(value)) 
        {
            System.out.println("Word : " + entry.getKey() + " \nFrequency : " + entry.getValue());
        }
    }
}

Now you can print the most frequent word and its frequency as shown above.

-2
    /*  I have declared a LinkedHashMap with the word (String) as key and its number of occurrences as value.
     * Create a BufferedReader object.
     * Read the first line into currentLine.
     * In a while loop, split currentLine into words
     * and iterate over them with a for loop. Inside the for loop there is an if/else:
     * if the word is already in the map, increment its count by 1, otherwise store it with 1 as the value.
     * Read the next line into currentLine.
     */
    public static void main(String[] args) {

        Map<String, Integer> map = new LinkedHashMap<String, Integer>();

        BufferedReader reader = null;

        try {
            reader = new BufferedReader(new FileReader("F:\\chidanand\\javaIO\\Student.txt"));
              String currentLine = reader.readLine();
            while (currentLine!= null) {
                String[] input = currentLine.replaceAll("[^a-zA-Z]", " ").toLowerCase().split(" ");
                  for (int i = 0; i < input.length; i++) {
                    if (input[i].isEmpty()) continue; // skip empty tokens produced by consecutive spaces
                    if (map.containsKey(input[i])) {
                        int count = map.get(input[i]);
                        map.put(input[i], count + 1);

                    } else {
                        map.put(input[i], 1);
                    }

                }
                   currentLine = reader.readLine();
            }

            String mostRepeatedWord = null;
             int count = 0;
                 for (Entry<String, Integer> m:map.entrySet())
                    {
                        if(m.getValue() > count)
                        {
                           mostRepeatedWord = m.getKey();

                            count = m.getValue();
                        }
                    }

                 System.out.println("The most repeated word in input file is : "+mostRepeatedWord);

                    System.out.println("Number Of Occurrences : "+count);

        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (reader != null) reader.close(); // reader is still null if the file could not be opened
            } catch (IOException e) {
                e.printStackTrace();
            }

        }

    }
}
