Most common words in a string

Question

4

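I'm a novice Ruby programmer trying to write a method that returns an array of the most common word(s) in a string. If one word has the highest count, only that word should be returned. If two words are tied for the highest count, both should be returned in an array.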

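The problem is that when I pass in the second string, the code counts "words" only twice instead of three times. When I pass in the third string, it returns "it" with a count of 2, which makes no sense, because "it" should have a count of 1.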

def most_common(string)
  counts = {}
  words = string.downcase.tr(",.?!",'').split(' ')

  words.uniq.each do |word|
    counts[word] = 0
  end

  words.each do |word|
    counts[word] = string.scan(word).count
  end

  max_quantity = counts.values.max
  max_words = counts.select { |k, v| v == max_quantity }.keys
  puts max_words
end

most_common('a short list of words with some words') #['words']
most_common('Words in a short, short words, lists of words!') #['words']
most_common('a short list of words with some short words in it') #['words', 'short']

- Daniel Bonnell

Related question: https://dev59.com/sGTVa4cB1Zd3GeqP_xl1 - Todd A. Jacobs

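Thanks, everyone, for the help. On closer inspection I see that inside words.each I was looking at "string", which had not been converted to lowercase. That seems to have solved both of my problems. - Daniel Bonnell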

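@NickVeys gave a good answer (which earned my +1), and it was the only answer to your question, so it's understandable that you awarded him the green check mark. In future, though, I suggest waiting a while (perhaps an hour or more) before selecting an answer, because a quick selection tends to discourage other, possibly better, answers and short-changes readers who are still preparing theirs. - Cary Swoveland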

Will do. I'm still very new to this and learning. - Daniel Bonnell

This FAQ is worth a read. - Cary Swoveland

8 Answers

3

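Since Nick has already answered your question, I'd like to suggest another implementation. Because the phrase "high count" is vague, I suggest returning a hash of the downcased words and their respective counts. Since Ruby 1.9, hashes preserve the order in which key-value pairs were inserted, so we can use that to return a hash whose pairs are ordered by decreasing value.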

Code

def words_by_count(str)
  str.gsub(/./) do |c|
    case c
    when /\w/ then c.downcase
    when /\s/ then c
    else ''
    end
  end.split
     .group_by {|w| w}
     .map {|k,v| [k,v.size]}
     .sort_by(&:last)
     .reverse
     .to_h
end
words_by_count('Words in a short, short words, lists of words!')

The method Array#to_h was introduced in Ruby 2.1. For earlier Ruby versions, you must use the following instead:

Hash[str.gsub(/./)... .reverse]
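
To illustrate the two conversions just mentioned, here is a minimal sketch that turns an array of [word, count] pairs into a hash both ways. The helper names and the sample pairs are made up for illustration; they are not part of the answer's method.

```ruby
# Hypothetical helpers: both turn [word, count] pairs into a hash.
def pairs_to_hash_modern(pairs)
  pairs.to_h    # Array#to_h, available from Ruby 2.1
end

def pairs_to_hash_legacy(pairs)
  Hash[pairs]   # Hash.[] also works on older Ruby versions
end

# Made-up sample data, already sorted by descending count.
sample = [['words', 3], ['short', 2], ['lists', 1]]

p pairs_to_hash_modern(sample)
p pairs_to_hash_legacy(sample).keys   # keys come back in insertion order
```

Either form should give the same hash, so the choice only matters if the code has to run on a Ruby older than 2.1.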

Examples

words_by_count('a short list of words with some words')
  #=> {"words"=>2, "of"=>1, "some"=>1, "with"=>1,
  #    "list"=>1, "short"=>1, "a"=>1}
words_by_count('Words in a short, short words, lists of words!')
  #=> {"words"=>3, "short"=>2, "lists"=>1, "a"=>1, "in"=>1, "of"=>1}
words_by_count('a short list of words with some short words in it')
  #=> {"words"=>2, "short"=>2, "it"=>1, "with"=>1,
  #    "some"=>1, "of"=>1, "list"=>1, "in"=>1, "a"=>1}

Explanation

Here is what happens in the second example:

str = 'Words in a short, short words, lists of words!'

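str.gsub(/./) do |c|... matches each character of the string and passes it to the block, which decides what to do with it. As you can see, word characters are downcased, whitespace is left unchanged, and everything else is replaced with an empty string (that is, removed).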

s = str.gsub(/./) do |c|
      case c
      when /\w/ then c.downcase
      when /\s/ then c
      else ''
      end
    end
  #=> "words in a short short words lists of words"

This is followed by:

a = s.split
 #=> ["words", "in", "a", "short", "short", "words", "lists", "of", "words"]
h = a.group_by {|w| w}
 #=> {"words"=>["words", "words", "words"], "in"=>["in"], "a"=>["a"],
 #    "short"=>["short", "short"], "lists"=>["lists"], "of"=>["of"]}
b = h.map {|k,v| [k,v.size]}
 #=> [["words", 3], ["in", 1], ["a", 1], ["short", 2], ["lists", 1], ["of", 1]]
c = b.sort_by(&:last)
 #=> [["of", 1], ["in", 1], ["a", 1], ["lists", 1], ["short", 2], ["words", 3]]
d = c.reverse
 #=> [["words", 3], ["short", 2], ["lists", 1], ["a", 1], ["in", 1], ["of", 1]]
d.to_h # or Hash[d]
 #=> {"words"=>3, "short"=>2, "lists"=>1, "a"=>1, "in"=>1, "of"=>1}
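
As a side note on the first step of this walk-through, the character-by-character block can probably be condensed into a single substitution. The sketch below compares the two; the method names are hypothetical, and the compact form is my assumption of an equivalent, not something taken from the answer.

```ruby
# The step as written in the answer: decide per character what to keep.
def normalize_with_block(str)
  str.gsub(/./) do |c|
    case c
    when /\w/ then c.downcase
    when /\s/ then c
    else ''
    end
  end
end

# Assumed-equivalent shortcut: downcase once, then delete every character
# that is neither a word character nor whitespace.
def normalize_compact(str)
  str.downcase.gsub(/[^\w\s]/, '')
end

str = 'Words in a short, short words, lists of words!'
puts normalize_with_block(str)
puts normalize_compact(str)
```

Both lines should print the same cleaned-up string for this input.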

Note that c = b.sort_by(&:last) and d = c.reverse could be replaced with:

d = b.sort_by { |_,k| -k }
 #=> [["words", 3], ["short", 2], ["a", 1], ["in", 1], ["lists", 1], ["of", 1]]

but sort_by followed by reverse is generally faster.
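
If you want to check the speed claim on your own machine, a rough timing sketch might look like the following. The helper names, the data size and the fixed seed are all assumptions made for illustration, and the timings will vary between runs and Ruby versions.

```ruby
# Hypothetical helpers for the two orderings being compared.
def sort_then_reverse(pairs)
  pairs.sort_by(&:last).reverse
end

def sort_by_negated(pairs)
  pairs.sort_by { |_word, count| -count }
end

# Made-up data: [word, count] pairs with pseudo-random counts.
def sample_pairs(size)
  rng = Random.new(42)   # fixed seed so repeated runs use the same data
  Array.new(size) { |i| ["w#{i}", rng.rand(1000)] }
end

# Wall-clock seconds taken by the given block.
def seconds_for
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

pairs = sample_pairs(50_000)
reverse_time = seconds_for { sort_then_reverse(pairs) }
negated_time = seconds_for { sort_by_negated(pairs) }
puts "sort_by + reverse: #{reverse_time}"
puts "sort_by negated:   #{negated_time}"
```

Both approaches order the counts from highest to lowest; only the order among words with equal counts may differ.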

- Cary Swoveland

1

def count_words string
  word_list = Hash.new(0)   # unseen words start at a count of 0
  words     = string.downcase.delete(',.?!').split
  words.each { |word| word_list[word] += 1 }
  word_list
end

def most_common_words string
  hash      = count_words string
  max_value = hash.values.max
  hash.select { |k, v| v == max_value }.keys
end

most_common_words 'a short list of words with some words'
#=> ["words"]

most_common_words 'Words in a short, short words, lists of words!'
#=> ["words"]

most_common_words 'a short list of words with some short words in it'
#=> ["short", "words"]

- Todd A. Jacobs

1

Assume string is a string containing multiple words.

words = string.split(/[.!?,\s]+/)
words.sort_by{|x|words.count(x)}

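Here we split the string into words and put them in an array. Then we sort the array by how many times each word occurs. The most common words will be at the end.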
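
As a minimal sketch of that idea, the helper below (a hypothetical name, not from the answer) returns the unique words ordered from least to most frequent, so the most common word is the last element. It also downcases the input and drops empty strings, which are my additions.

```ruby
# Hypothetical helper: unique words ordered from least to most frequent.
def words_by_frequency(string)
  words = string.downcase.split(/[.!?,\s]+/).reject(&:empty?)
  # Array#count rescans the whole array for every unique word.
  words.uniq.sort_by { |word| words.count(word) }
end

ranked = words_by_frequency('Words in a short, short words, lists of words!')
p ranked.last   # the most common word
```

Calling count for each word is fine for short strings; for long texts, building a hash of counts once is likely to scale better.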

- Darkmouse

0

The same thing can also be done like this:

def most_common(string)
  counts = Hash.new 0
  string.downcase.tr(",.?!",'').split(' ').each{|word| counts[word] += 1}
  # For "Words in a short, short words, lists of words!"
  # counts ---> {"words"=>3, "in"=>1, "a"=>1, "short"=>2, "lists"=>1, "of"=>1}
  max_value = counts.values.max
  # max_value ---> 3
  # Compare against max_value rather than recomputing the maximum per entry.
  return counts.select{|key , value| value == max_value}
  # returns ---> {"words"=>3}
end

This is a shorter solution that you might want to use. Hope it helps :)

- Harsh Trivedi

0

This is the kind of question programmers love, isn't it :) How about a functional approach?

# returns array of words after removing certain English punctuations
def english_words(str)
  str.downcase.delete(',.?!').split
end

# returns hash mapping element to count
def element_counts(ary)
  ary.group_by { |e| e }.inject({}) { |a, e| a.merge(e[0] => e[1].size) }
end

def most_common(ary)
  ary.empty? ? nil : 
    element_counts(ary)
      .group_by { |k, v| v }
      .sort
      .last[1]
      .map(&:first)
end

most_common(english_words('a short list of words with some short words in it'))
#=> ["short", "words"]

- Jared Beck

0

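def common(string)
  counts=Hash.new(0)
  words=string.downcase.delete('.,!?').split(" ")
  words.each {|k| counts[k]+=1}
  # Sort by count; a plain counts.sort would order the pairs by word instead.
  p counts.sort_by {|_word, count| count}.reverse[0]
end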

- user8910400

0

def firstRepeatedWord(string)
  h_data = Hash.new(0)
  string.split(" ").each{|x| h_data[x] +=1}
  h_data.key(h_data.values.max)
end

- Jagdish


- Nick Veys · Accepted Answer

The way you count instances of each word is the problem. "it" also occurs inside "with", so it gets counted twice.

[1] pry(main)> 'with some words in it'.scan('it')
=> ["it", "it"]
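
To make both symptoms from the question concrete, here is a small sketch. The whole_word_count helper is hypothetical (it is not part of the original code); it assumes that matching on word boundaries and ignoring case is what the asker wants.

```ruby
# Hypothetical helper: count whole-word, case-insensitive occurrences.
def whole_word_count(text, word)
  text.scan(/\b#{Regexp.escape(word)}\b/i).count
end

text = 'with some words in it'
p text.scan('it').count          # substring matches, including the one inside "with"
p whole_word_count(text, 'it')   # the standalone "it" only

mixed = 'Words in a short, short words, lists of words!'
p mixed.scan('words').count          # case-sensitive, so "Words" is missed
p whole_word_count(mixed, 'words')   # all three occurrences
```

A plain string argument makes String#scan look for that substring anywhere, with exact case, which would explain both the extra "it" and the missing "Words".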

You could make scan work, but a simpler approach is to use each_with_object to tally how many times each value occurs in the array, like so:

counts = words.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }

This goes through each element of the array and adds 1 to the hash value for that word.
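
As a quick illustration of what that hash ends up holding, here is a minimal sketch. The count_occurrences wrapper is a hypothetical name, and the comparison with Enumerable#tally assumes Ruby 2.7 or later.

```ruby
# Hypothetical wrapper around the each_with_object idiom shown above.
def count_occurrences(words)
  # Hash.new(0) makes unseen words start at 0, so += 1 works on first sight.
  words.each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
end

words = %w[a short list of words with some words]
p count_occurrences(words)
p words.tally   # Enumerable#tally (Ruby 2.7+) builds the same counts
```

On newer Rubies tally is probably the shortest way to get these counts; each_with_object also works on older versions.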

So the following should do what you want:

def most_common(string)
  words = string.downcase.tr(",.?!",'').split(' ')
  counts = words.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
  max_quantity = counts.values.max
  counts.select { |k, v| v == max_quantity }.keys
end

p most_common('a short list of words with some words') #['words']
p most_common('Words in a short, short words, lists of words!') #['words']
p most_common('a short list of words with some short words in it') #['short', 'words']