Perl paragraph n-grams

Question

5

Suppose I have a sentence of text:

$body = 'the quick brown fox jumps over the lazy dog';

I want to turn that sentence into a hash of "keywords", but I want to allow multi-word keywords. I have the following to get the single-word keywords:

$words{$_}++ for $body =~ m/(\w+)/g;

Once that completes, I have a hash that looks like this:

'the' => 2,
'quick' => 1,
'brown' => 1,
'fox' => 1,
'jumps' => 1,
'over' => 1,
'lazy' => 1,
'dog' => 1

The next step, getting the two-word keywords, is the following:

$words{$_}++ for $body =~ m/(\w+ \w+)/g;

But that only gets every other pair, because each match consumes both of its words:

'the quick' => 1,
'brown fox' => 1,
'jumps over' => 1,
'the lazy' => 1

I also need the pairs that start one word later:

'quick brown' => 1,
'fox jumps' => 1,
'over the' => 1

Is there an easier way to do this than the following?

my $orig_body = $body;
# single word keywords
$words{$_}++ for $body =~ m/(\w+)/g;
# double word keywords
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body = $orig_body;
# triple word keywords
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body = $orig_body;
$body =~ s/^(\w+ \w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;

- Glen Solsberry

5 Answers

3

You can do some interesting things with lookaheads; see the relevant section of the Perl regular expression documentation.

If I do this:

$words{$_}++ for $body =~ m/(?=(\w+ \w+))\w+/g;

This expression says to look ahead for two words (and capture them), but to consume only one word.

I get:

%words: {
          'brown fox' => 1,
          'fox jumps' => 1,
          'jumps over' => 1,
          'lazy dog' => 1,
          'over the' => 1,
          'quick brown' => 1,
          'the lazy' => 1,
          'the quick' => 1
        }

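It looks like I can generalize this by putting a variable in for the count. Note that {$n} repeats the " \w+" group, so $n counts the words after the first: with $n = 4, each captured phrase has five words.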

my $n    = 4;
$words{$_}++ for $body =~ m/(?=(\w+(?: \w+){$n}))\w+/g;
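To collect every phrase length in one pass over the sizes, the lookahead pattern can be wrapped in a loop. The following is only a sketch built on the pattern above: the helper name count_ngrams and the $extra variable are hypothetical, and it assumes words are separated by exactly one space.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): count every word
# n-gram of 1 .. $max_n words, assuming single-space separators.
sub count_ngrams {
    my ($body, $max_n) = @_;
    my %words;
    for my $n (1 .. $max_n) {
        my $extra = $n - 1;    # words that follow the first word of a phrase
        # Look ahead to capture $n words, but consume only the first one.
        $words{$_}++ for $body =~ m/(?=(\w+(?: \w+){$extra}))\w+/g;
    }
    return \%words;
}

my $counts = count_ngrams('the quick brown fox jumps over the lazy dog', 3);
printf "%-18s => %d\n", $_, $counts->{$_} for sort keys %$counts;
```

Passing the number of words per phrase and deriving the repeat count from it avoids the off-by-one between $n and the phrase length.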

- Axeman

2

I would use a lookahead to collect everything but the first word. That way, the match position advances correctly on its own:

my $body = 'the quick brown fox jumps over the lazy dog';

my %words;

++$words{$1}         while $body =~ m/(\w+)/g;
++$words{"$1 $2"}    while $body =~ m/(\w+) \s+ (?= (\w+) )/gx;
++$words{"$1 $2 $3"} while $body =~ m/(\w+) \s+ (?= (\w+) \s+ (\w+) )/gx;

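If you want to insist on a single space instead of \s+ (don't forget to remove the /x modifier if you do), you can simplify this, because you can collect any number of words in $2 instead of using one group per word.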
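A minimal sketch of that single-space simplification for three-word phrases, assuming the same sample sentence: $1 holds the consumed first word, and $2 collects the remaining two words inside the lookahead.

```perl
use strict;
use warnings;

my $body = 'the quick brown fox jumps over the lazy dog';
my %words;

# No /x modifier here, so the literal spaces in the pattern are significant.
# Only the first word and one space are consumed; the lookahead captures
# the next two words as a single group.
++$words{"$1 $2"} while $body =~ m/(\w+) (?=(\w+ \w+))/g;

print "$_ => $words{$_}\n" for sort keys %words;
```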

- cjm

2

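Why do you want to do this with regexes only? The obvious way, to me, is to split the text into an array and then use a pair of nested loops to extract the phrases you want to count. Something along these lines: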


#!/usr/bin/env perl

use strict;
use warnings;

my $text = 'the quick brown fox jumps over the lazy dog';
my $max_words = 3;

my @words = split / /, $text;
my %counts;

for my $pos (0 .. $#words) {
  for my $phrase_len (0 .. ($pos >= $max_words ? $max_words - 1 : $pos)) {
    my $phrase = join ' ', @words[($pos - $phrase_len) .. $pos];
    $counts{$phrase}++;
  }
} 

use Data::Dumper;
print Dumper(\%counts);

Output:

$VAR1 = {
          'over the lazy' => 1,
          'the' => 2,
          'over' => 1,
          'brown fox jumps' => 1,
          'brown fox' => 1,
          'the lazy dog' => 1,
          'jumps over' => 1,
          'the lazy' => 1,
          'the quick brown' => 1,
          'fox jumps' => 1,
          'over the' => 1,
          'brown' => 1,
          'fox jumps over' => 1,
          'quick brown' => 1,
          'jumps' => 1,
          'lazy' => 1,
          'jumps over the' => 1,
          'lazy dog' => 1,
          'dog' => 1,
          'quick brown fox' => 1,
          'fox' => 1,
          'the quick' => 1,
          'quick' => 1
        };

Edit: Fixed the $phrase_len loop per cjm's comment to prevent the use of negative indexes, which was causing incorrect results.

- Dave Sherohman

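This program doesn't handle the edges of the array correctly. Note that your output includes phrases like "dog the" and "lazy dog the", which don't actually appear in the text. - cjm

@cjm: Ah! I obviously didn't check the output closely enough. Still, not bad for a two-minute proof of concept. I've corrected the $phrase_len loop to fix this. - Dave Sherohman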

1

Use the pos operator

pos SCALAR

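Returns the offset at which the last m//g search on the variable in question left off ($_ is used when the variable is not specified).

and the @- special array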

@LAST_MATCH_START

@-

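$-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by the n-th subpattern, or undef if that subpattern did not match.

For example, the program below captures the second word of each pair in its own group and rewinds the match position to it, so that the second word becomes the first word of the next pair: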

#! /usr/bin/perl

use warnings;
use strict;

my $body = 'the quick brown fox jumps over the lazy dog';

my %words;
while ($body =~ /(\w+ (\w+))/g) {
  ++$words{$1};
  pos($body) = $-[2];
}

for (sort { index($body,$a) <=> index($body,$b) } keys %words) {
  print "'$_' => $words{$_}\n";
}

Output:

'the quick' => 1
'quick brown' => 1
'brown fox' => 1
'fox jumps' => 1
'jumps over' => 1
'over the' => 1
'the lazy' => 1
'lazy dog' => 1
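The same rewind should carry over to longer phrases as long as the second word keeps its own capture group. A sketch under that assumption; the variables $n, $tail and $re are illustrative names, and $n must be at least 2:

```perl
use strict;
use warnings;

my $body = 'the quick brown fox jumps over the lazy dog';
my $n    = 3;                      # words per phrase (assumed >= 2)
my $tail = ' \w+' x ($n - 2);      # pattern text for the words after the second
my $re   = qr/(\w+ (\w+)$tail)/;   # group 2 is still the second word

my %words;
while ($body =~ /$re/g) {
    ++$words{$1};
    pos($body) = $-[2];            # restart the search at the second word
}

print "'$_' => $words{$_}\n" for sort keys %words;
```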

- Greg Bacon

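+0.4999... The other 0.5 would be for relevant documentation references to explain how it works. :) - Ether

@Ether, he did link to the docs. Stack Overflow just doesn't display links inside code text very prominently. - cjm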


- Grrrr · Accepted Answer

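While hand-coding the task described may be interesting, wouldn't it be better to use an existing CPAN module that handles n-grams? It looks like Text::Ngrams (as opposed to Text::Ngram) can handle word-based n-gram analysis.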