Extracting words from text with PHP

4

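Hi everyone, I have a small problem. I need to extract the words from a text.

I tried strtok(), strstr(), and a few regular expressions to retrieve the words, but I could only extract some of them.

The problem is complicated because the words can be accompanied by many characters and symbols.

Here is the sample text the words need to be extracted from: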

Main article: our 46,000 required, !but (1947-2011) mail@server.com March 8, 2014 Gutenberg's 34-DE 'a' 3,1415 Us: @unknown n go http://google.com or www.google.com and http://www.google.com (r) The 509th "composite" and; C-54 #dog v4.0 ¿as is done? ¿article... agriculture? x ¿cat? now! Hi!! (87 meters).

Sample text, for testing.

The extracted text should be:
Main article our required but March Gutenberg's a go or and The composite and dog as is done article agriculture cat now Hi meters

Sample text for testing

The first function I wrote preprocesses the text to make the job easier.
function PreText($text){
  $text = str_replace("\n", ".", $text);
  $text = str_replace("\r", ".", $text);

  $text = str_replace("'", "", $text);
  $text = str_replace("?", "", $text);
  $text = str_replace("¿", "", $text);
  $text = str_replace("(", "", $text);
  $text = str_replace(")", "", $text);
  $text = str_replace('"', "", $text);
  $text = str_replace(';', "", $text);
  $text = str_replace('!', "", $text);
  $text = str_replace('<', "", $text);
  $text = str_replace('>', "", $text);
  $text = str_replace('#', "", $text);

  $text = str_replace(",", "", $text);

  $text = str_replace(".c", "", $text);
  $text = str_replace(".C", "", $text);
  return $text;
}

The splitting function:

function SplitWords($text){
  $words = explode(" ", $text);
  $ContWords = count($words);
  $NewText = ""; // initialise so the .= below never appends to an undefined variable

  for ($i = 0; $i < $ContWords; $i++){
    if (ctype_alpha($words[$i])) {
      $NewText .= $words[$i].", ";
    }
  }
  return $NewText;
}

The program:

<?php
  include_once ('functions.php');

  $text = "Main article: our 46,000 ...";
  $text = PreText($text);
  $text = SplitWords($text);
  echo $text;
?>

The code still has a long way to go. Any help is appreciated.

2 Answers

5
If I understand correctly, you want to remove all non-letter characters from the string. I would use preg_replace:
$text = "Main article: our 46,000...";
$text = preg_replace("/[^a-zA-Z' ]/","",$text);

This removes everything except letters, apostrophes, and spaces.
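
To see what this filter does in practice, here is a minimal sketch run on a fragment of the sample text. The helper name stripNonLetters() is made up for illustration and is not part of the answer; the second preg_replace, which collapses the runs of spaces left behind, is also an addition.

```php
<?php
// Hypothetical helper (not from the answer): applies the answer's character
// filter, then collapses the runs of spaces left where tokens were deleted.
function stripNonLetters(string $text): string
{
    // Delete every character that is not an ASCII letter, an apostrophe, or a space.
    $text = preg_replace("/[^a-zA-Z' ]/", "", $text);
    // Squeeze runs of spaces into one and trim the ends.
    return trim(preg_replace('/ {2,}/', ' ', $text));
}

echo stripNonLetters("our 46,000 required, !but (1947-2011) Gutenberg's");
// prints: our required but Gutenberg's
```

Because the filter works character by character, a token such as v4.0 is reduced to v rather than dropped.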

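It looks like apostrophes are valid too, e.g. Gutenberg's - Sonic
Thanks, I missed that one. - Robbert
You're welcome. Looking at the test data, 'a' is converted to a, so it's a bit tricky. - Sonic
This doesn't remove mail@server.com as expected - Toto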
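
One way to address the last comment is to decide per token instead of per character: split on whitespace, strip punctuation from both ends of each token, and keep the token only if what remains is purely letters (with optional inner apostrophes). The following is a minimal sketch assuming UTF-8 input; the function name extractWords() and the list of punctuation to strip are assumptions, not something from the thread.

```php
<?php
// Hypothetical helper (not from the thread): returns the tokens that are plain words.
function extractWords(string $text): array
{
    $words = [];
    // Split on any run of whitespace (spaces, tabs, newlines).
    foreach (preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) as $token) {
        // Strip an assumed set of punctuation from both ends: "(r)" -> "r", "¿cat?" -> "cat".
        // "@" is deliberately not in the set, so "@unknown" is left intact and rejected below.
        $token = preg_replace('/^["\'()¿?¡!.,;:#]+|["\'()¿?¡!.,;:#]+$/u', '', $token);
        // Keep the token only if it is letters, optionally joined by inner apostrophes.
        if (preg_match('/^\p{L}+(?:\'\p{L}+)*$/u', $token)) {
            $words[] = $token;
        }
    }
    return $words;
}

$sample = "our 46,000 required, mail@server.com Gutenberg's 'a' #dog ¿cat? http://google.com";
echo implode(' ', extractWords($sample));
// prints: our required Gutenberg's a dog cat
```

On the full sample text this sketch still keeps Us, n, r and x, which the expected output omits. The question does not state the rule that excludes them, so none is guessed here.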

0

Try this; it does almost what you want:

<?php
$text = <<<HEREDOC
Main article: our 46,000 required, !but (1947-2011) mail@server.com March 8, 2014 Gutenberg's 34-DE 'a' 3,1415 Us: @unknown n go http://google.com or www.google.com and
        http://www.google.com (r) The 509th composite" and; C-54 #dog v4.0 ¿as is done? ¿article... agriculture? x ¿cat? now! Hi!! (87 meters). Sample text, for testing.
HEREDOC;
//replace all kind of URLs and emails from text
$url_email = "((https?|ftp)\:\/\/)?"; // SCHEME
$url_email .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass
$url_email .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP
$url_email .= "(\:[0-9]{2,5})?"; // Port
$url_email .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path
$url_email .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query
$url_email .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor

$text = preg_replace("/$url_email/","",$text);
//replace anything like Us: @unknown
$text = preg_replace("/Us:.?@\\w+/","",$text);
//replace all Non-Alpha characters
$text = preg_replace("/[^a-zA-Z' ]/","",$text);
echo $text;
?>
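
The cleaned string still contains runs of spaces where URLs and numbers were removed. If an array of words is wanted, a sketch like the following could be appended. Here $cleaned stands in for the $text produced above, and its value is a made-up fragment rather than real output.

```php
<?php
// Stand-in for the cleaned $text from the answer (hypothetical fragment).
$cleaned = "Main article our  required but  March  ";
// Split on runs of whitespace and drop the empty pieces.
$words = preg_split('/\s+/', $cleaned, -1, PREG_SPLIT_NO_EMPTY);
echo implode(', ', $words);
// prints: Main, article, our, required, but, March
```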
