Counting the words in a cell array in MATLAB

4
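I have a 500x1 cell array in which every row contains one particular word. How can I count how many times each word occurs and display the result, along with the percentage for each word?
For example:
The number of occurrences of these words is: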
Ans =

     200 Green
     200 Red
     100 Blue

The percentages of these words:
Ans = 

     40% Green
     40% Red
     20% Blue

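Do you already have a list of the unique words contained in the original 500x1 cell array? - Gunther Struyf
Actually, I just found a great solution that also covers your question: Peter's answer - Barney Szabolcs
4 Answers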

5
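The idea is that strcmpi compares cell arrays element by element. This can be used to compare the input names against the unique names in the input. Try the code below.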
% generate some input
input={'green','red','green','red','blue'}';

% find the unique elements in the input
uniqueNames=unique(input)';

% use string comparison ignoring the case
occurrences=strcmpi(input(:,ones(1,length(uniqueNames))),uniqueNames(ones(length(input),1),:));

% count the occurrences
counts=sum(occurrences,1);

%pretty printing
for i=1:length(counts)
    disp([uniqueNames{i} ': ' num2str(counts(i))])
end

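I leave computing the percentages to you.

I would change the occurrences line to work on a single case, either lower or upper. I would first do: input = lower(input); which converts all strings to lowercase. This is easier, because mismatched case can cause problems. Just my personal opinion. - user2867655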
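
Both points raised above (the percentages left as an exercise, and normalizing the case first) can be sketched roughly as follows. This is only an illustration, not the answerer's code: the variable name `words` is an assumption, chosen instead of `input` so that the built-in `input` function is not shadowed, and the sample data is made up.

```matlab
% Hypothetical sample data with mixed case (not from the original post)
words = {'green','red','Green','red','blue'}';

% Normalize the case first, so a case-sensitive comparison is enough
words = lower(words);

% Unique words, sorted alphabetically
uniqueNames = unique(words);

% For each unique word, count the elements of words that match it
counts = cellfun(@(w) sum(strcmp(words, w)), uniqueNames);

% Percentage of the total number of words
percentages = 100 * counts / numel(words);

% Print one line per word: name, count, percentage
for i = 1:numel(uniqueNames)
    fprintf('%s: %d (%.0f%%)\n', uniqueNames{i}, counts(i), percentages(i));
end
```

With 500 words the same code should apply unchanged, since the percentages are always computed relative to `numel(words)`.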

1

First, find the unique words in the data:

% set up sample data:
data = [{'red'}; {'green'}; {'blue'}; {'blue'}; {'blue'}; {'red'}; {'red'}; {'green'}; {'red'}; {'blue'}; {'red'}; {'green'}; {'green'}; ]
uniqwords = unique(data);

Then find where these unique words occur in the data:

[~,uniq_id]=ismember(data,uniqwords);

Then simply count how many times each unique word occurs:

uniq_word_num = arrayfun(@(x) sum(uniq_id==x),1:numel(uniqwords));

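To get the percentage, divide the counts by the total number of data samples (this yields fractions between 0 and 1; multiply by 100 for percent):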
uniq_word_perc = uniq_word_num/numel(data)
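
As a possible shortcut (a sketch, not part of the answer above): the third output of `unique` already contains, for each element of `data`, the index of its unique word, so the separate `ismember` call may not be needed, and `accumarray` can do the counting. The variable names follow the answer; the `fprintf` formatting is just one way to display the result.

```matlab
% Same sample data as in the answer above
data = [{'red'}; {'green'}; {'blue'}; {'blue'}; {'blue'}; {'red'}; {'red'}; ...
        {'green'}; {'red'}; {'blue'}; {'red'}; {'green'}; {'green'}];

% uniq_id(k) is the position of data{k} within uniqwords
[uniqwords, ~, uniq_id] = unique(data);

% Count how many times each index occurs
uniq_word_num = accumarray(uniq_id(:), 1);

% Convert the counts to percentages of all samples
uniq_word_perc = 100 * uniq_word_num / numel(data);

% Display count and percentage for each unique word
for k = 1:numel(uniqwords)
    fprintf('%5d %-6s %5.1f%%\n', uniq_word_num(k), uniqwords{k}, uniq_word_perc(k));
end
```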

Gunther, how would you compute the percentages for Denahiro's answer? - Kirsty White
The same way as here: divide the resulting counts by the total number of samples. - Gunther Struyf

0

A neat approach that avoids an explicit for loop...

clc
close all
clear all

Paragraph=lower(fileread('Temp1.txt'));

AlphabetFlag=Paragraph>=97 & Paragraph<=122;  % flag the letters a-z (ASCII 97-122)

DelimFlag=find(AlphabetFlag==0); % treat every non-letter as a delimiter
WordLength=[DelimFlag(1), diff(DelimFlag)]; % distance between consecutive delimiters
Paragraph(DelimFlag)=[]; % remove the delimiters
Words=mat2cell(Paragraph, 1, WordLength-1); % cut the paragraph into words (assumes the text ends with a delimiter)

[SortWords, Ia, Ic]=unique(Words);  % unique words and the index of each word into them

Bincounts = histc(Ic,1:size(Ia, 1)); % number of occurrences of each unique word
[SortBincounts, IndBincounts]=sort(Bincounts, 'descend'); % sort the counts, most frequent first

FreqWords=SortWords(IndBincounts); % sort words according to their frequency
FreqWords(1)=[];SortBincounts(1)=[]; % drop the first entry, assumed to be the empty token left by consecutive delimiters

Freq=SortBincounts/sum(SortBincounts)*100; % frequency percentage

%% plot
NMostCommon=20;
disp(Freq(1:NMostCommon))
pie([Freq(1:NMostCommon); 100-sum(Freq(1:NMostCommon))], [FreqWords(1:NMostCommon), {'other words'}]);
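
The tokenizing step above depends on the text ending with a delimiter and on the empty token being the most frequent entry. A minimal alternative sketch, assuming it is acceptable to extract the words with `regexp` instead, could look like this. The sample sentence is made up, and this is a different technique from the index arithmetic in the answer.

```matlab
% Hypothetical sample text (stands in for the contents of the file)
text = lower('The cat and the hat. The end!');

% Extract every run of letters a-z; no empty tokens are produced
words = regexp(text, '[a-z]+', 'match');

% Unique words and, for each word, its index into the unique list
[uniqWords, ~, ic] = unique(words);

% Count the occurrences of each unique word
counts = accumarray(ic(:), 1);

% Sort by frequency, most common first
[sortedCounts, order] = sort(counts, 'descend');
sortedWords = uniqWords(order);

% Frequency of each word in percent
freq = 100 * sortedCounts / sum(sortedCounts);

% Show the most common word and its share
fprintf('%s: %d (%.1f%%)\n', sortedWords{1}, sortedCounts(1), freq(1));
```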

0

Here is my solution, which should be quite fast.

% example input
example = 'This is an example corpus. Is is a verb?';
words = regexp(example, ' ', 'split');

%your program, result in vocabulary and counts. (input is a cell array called words)
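vocabulary = unique(lower(words)); % lowercase first: strcmpi below ignores case, so 'Is' and 'is' must not be separate entries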
n = length(vocabulary);
counts = zeros(n, 1);
for i=1:n
    counts(i) = sum(strcmpi(words, vocabulary{i}));
end

%process results
[val, idx]=max(counts);
most_frequent_word = vocabulary{idx};

%percentages:
percentages=counts/sum(counts);
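
For comparison, here is a sketch of a single-pass count using `containers.Map`, which avoids comparing every vocabulary entry against the whole word list. This is an alternative technique, not the answerer's method; the names `wordCounts`, `vocab`, and `pct` are assumptions introduced for illustration. As in the answer, splitting on spaces leaves punctuation attached to words such as 'corpus.'.

```matlab
% Same example input as in the answer above
example = 'This is an example corpus. Is is a verb?';
words = regexp(example, ' ', 'split');

% Map from lowercase word to its running count
wordCounts = containers.Map('KeyType', 'char', 'ValueType', 'double');
for k = 1:numel(words)
    w = lower(words{k});
    if isKey(wordCounts, w)
        wordCounts(w) = wordCounts(w) + 1;  % seen before: increment
    else
        wordCounts(w) = 1;                  % first occurrence
    end
end

% Extract the vocabulary and the counts in matching order
vocab = keys(wordCounts);
counts = cell2mat(values(wordCounts));

% Percentages of the total number of words
pct = 100 * counts / sum(counts);
```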
