Remove similar elements from an array

7
Array
(
    [0] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.
    [1] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.
    [2] => The N2225 and N2226 SAS/SATA HBAs support SAS data transfer rates of 3, 6, and 12 Gbps per lane and SATA transfer rates of 3 and 6 Gbps per lane, and they enable maximum connectivity and performance in a low-profile (N2225) or full-height (N2226) form factor.
    [3] => Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.
    [4] => The following tables list the compatibility information for the N2225 and N2226 SAS/SATA HBAs and System x®, iDataPlex®, and NeXtScale™ servers.
    [5] => For more information about the System x servers, including older servers that support the N2225 and N2226 adapters, see the following ServerProven® website:
    [6] => The following table lists the external storage systems that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in storage solutions.
    [7] => The following table lists the external tape backup units that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in tape backup solutions.
    [8] => For more information about the specific versions and service levels that are supported and any other prerequisites, see the ServerProven website:
    [9] => The N2225 and N2226 SAS/SATA HBAs carry a one-year limited warranty. When installed in a supported System x server, the adapters assume your system’s base warranty and any Lenovo warranty upgrade.
)

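The elements are not exactly identical, but some are superseded by another element that contains exactly the same data plus more information, or sometimes just a few words are different. How can I filter these out?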


Something like this? http://code.stephenmorley.org/php/diff-implementation/ - user2506641
In your example, what exactly do you want to remove? - simon
1
@simon - The first element. - sinisake
8 Answers

11
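First of all, this problem is not trivial, and it is not well specified: you don't want to remove identical elements, you want to remove similar ones, so your first problem is deciding which elements count as similar.
Since the similarity can occur anywhere in the string, requiring the strings to start with the same sequence of characters is not enough. For example, look at these two sentences (adapted from your question):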
Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.
The rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.

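They are very similar, but they do not start with the same string. One way to determine similarity is the Smith-Waterman algorithm, for which a PHP implementation is linked here.
--- LATER EDIT ---
Here is an implementation using PHP's built-in similar_text().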
/**
 * @param mixed $array          input array
 * @param int $minSimilarity    minimum similarity for an item to be removed (percentage)
 * @return array
 */
function applyFilter ($array, $minSimilarity = 90) {
    $result = [];

    foreach ($array as $outerValue) {
        $append = true;
        foreach ($result as $key => $innerValue) {
            $similarity = null;
            similar_text($innerValue, $outerValue, $similarity);
            if ($similarity >= $minSimilarity) {
                if (strlen($outerValue) > strlen($innerValue)) {
                    // always keep the longer one
                    $result[$key] = $outerValue;
                }
                $append = false;
                break;
            }
        }

        if ($append) {
            $result[] = $outerValue;
        }
    }

    return $result;
}

$test = [
    'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
    'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
    'The N2225 and N2226 SAS/SATA HBAs support SAS data transfer rates of 3, 6, and 12 Gbps per lane and SATA transfer rates of 3 and 6 Gbps per lane, and they enable maximum connectivity and performance in a low-profile (N2225) or full-height (N2226) form factor.',
    'Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.',
    'The following tables list the compatibility information for the N2225 and N2226 SAS/SATA HBAs and System x®, iDataPlex®, and NeXtScale™ servers.',
    'For more information about the System x servers, including older servers that support the N2225 and N2226 adapters, see the following ServerProven® website:',
    'The following table lists the external storage systems that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in storage solutions.',
    'The following table lists the external tape backup units that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in tape backup solutions.',
    'For more information about the specific versions and service levels that are supported and any other prerequisites, see the ServerProven website:',
    'The N2225 and N2226 SAS/SATA HBAs carry a one-year limited warranty. When installed in a supported System x server, the adapters assume your system’s base warranty and any Lenovo warranty upgrade.',
];

var_dump(applyFilter($test));
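
To make the percentage concrete, here is a minimal sketch of what similar_text() reports for the one-word difference between the first two elements. The shortened strings are illustrative assumptions, not the full sentences from the question. The third argument is passed by reference and receives the similarity percentage.

```php
<?php
// Shortened, illustrative fragments (not the full sentences from the question).
$first  = 'tape drives';
$second = 'tapes drives';

// similar_text() returns the number of matching characters and writes
// the similarity percentage into the third argument (passed by reference).
$matching = similar_text($first, $second, $percent);

printf("%d matching characters, %.2f%% similar\n", $matching, $percent);
```

Because the percentage is relative to the combined length of both strings, a one-character difference weighs far less in the long sentences from the question than in these short fragments.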

Here is complete working code using the Smith–Waterman algorithm:

--- END OF LATER EDIT ---

class SmithWatermanGotoh
{
    private $gapValue;
    private $substitution;

    /**
     * Constructs a new Smith Waterman metric.
     *
     * @param gapValue
     *            a non-positive gap penalty
     * @param substitution
     *            a substitution function
     */
    public function __construct($gapValue=-0.5,
                $substitution=null)
    {
        if($gapValue > 0.0) throw new Exception("gapValue must be <= 0");
        //if(empty($substitution)) throw new Exception("substitution is required");
        if (empty($substitution)) $this->substitution = new SmithWatermanMatchMismatch(1.0, -2.0);
        else $this->substitution = $substitution;
        $this->gapValue = $gapValue;
    }

    public function compare($a, $b)
    {
        if (empty($a) && empty($b)) {
            return 1.0;
        }

        if (empty($a) || empty($b)) {
            return 0.0;
        }

        $maxDistance = min(mb_strlen($a), mb_strlen($b))
                * max($this->substitution->max(), $this->gapValue);
        return $this->smithWatermanGotoh($a, $b) / $maxDistance;
    }

    private function smithWatermanGotoh($s, $t)
    {
        $v0 = [];
        $v1 = [];
        $t_len = mb_strlen($t);
        $max = $v0[0] = max(0, $this->gapValue, $this->substitution->compare($s, 0, $t, 0));

        for ($j = 1; $j < $t_len; $j++) {
            $v0[$j] = max(0, $v0[$j - 1] + $this->gapValue,
                    $this->substitution->compare($s, 0, $t, $j));

            $max = max($max, $v0[$j]);
        }

        // Find max
        for ($i = 1; $i < mb_strlen($s); $i++) {
            $v1[0] = max(0, $v0[0] + $this->gapValue, $this->substitution->compare($s, $i, $t, 0));

            $max = max($max, $v1[0]);

            for ($j = 1; $j < $t_len; $j++) {
                $v1[$j] = max(0, $v0[$j] + $this->gapValue, $v1[$j - 1] + $this->gapValue,
                        $v0[$j - 1] + $this->substitution->compare($s, $i, $t, $j));

                $max = max($max, $v1[$j]);
            }

            for ($j = 0; $j < $t_len; $j++) {
                $v0[$j] = $v1[$j];
            }
        }

        return $max;
    }
}

class SmithWatermanMatchMismatch
{
    private $matchValue;
    private $mismatchValue;

    /**
     * Constructs a new match-mismatch substitution function. When two
     * characters are equal a score of <code>matchValue</code> is assigned. In
     * case of a mismatch a score of <code>mismatchValue</code>. The
     * <code>matchValue</code> must be strictly greater than
     * <code>mismatchValue</code>
     *
     * @param matchValue
     *            value when characters are equal
     * @param mismatchValue
     *            value when characters are not equal
     */
    public function __construct($matchValue, $mismatchValue) {
        if($matchValue <= $mismatchValue) throw new Exception("matchValue must be > mismatchValue");

        $this->matchValue = $matchValue;
        $this->mismatchValue = $mismatchValue;
    }

    public function compare($a, $aIndex, $b, $bIndex) {
        return ($a[$aIndex] === $b[$bIndex] ? $this->matchValue
                : $this->mismatchValue);
    }

    public function max() {
        return $this->matchValue;
    }

    public function min() {
        return $this->mismatchValue;
    }
}

/**
 * @param mixed $array          input array
 * @param int $minSimilarity    minimum similarity for an item to be removed (percentage)
 * @return array
 */
function applyFilter ($array, $minSimilarity = 90) {
    $swg = new SmithWatermanGotoh();

    $result = [];

    foreach ($array as $outerValue) {
        $append = true;
        foreach ($result as $key => $innerValue) {
            $similarity = $swg->compare($innerValue, $outerValue) * 100;
            if ($similarity >= $minSimilarity) {
                if (strlen($outerValue) > strlen($innerValue)) {
                    // always keep the longer one
                    $result[$key] = $outerValue;
                }
                $append = false;
                break;
            }
        }

        if ($append) {
            $result[] = $outerValue;
        }
    }

    return $result;
}


$test = [
    'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
    'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
    'The N2225 and N2226 SAS/SATA HBAs support SAS data transfer rates of 3, 6, and 12 Gbps per lane and SATA transfer rates of 3 and 6 Gbps per lane, and they enable maximum connectivity and performance in a low-profile (N2225) or full-height (N2226) form factor.',
    'Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.',
    'The following tables list the compatibility information for the N2225 and N2226 SAS/SATA HBAs and System x®, iDataPlex®, and NeXtScale™ servers.',
    'For more information about the System x servers, including older servers that support the N2225 and N2226 adapters, see the following ServerProven® website:',
    'The following table lists the external storage systems that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in storage solutions.',
    'The following table lists the external tape backup units that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in tape backup solutions.',
    'For more information about the specific versions and service levels that are supported and any other prerequisites, see the ServerProven website:',
    'The N2225 and N2226 SAS/SATA HBAs carry a one-year limited warranty. When installed in a supported System x server, the adapters assume your system’s base warranty and any Lenovo warranty upgrade.',
];

var_dump(applyFilter($test));

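Now you only need to tune the $minSimilarity variable to your needs. For example, in your case, keeping the default value of 90% removes the first element (which is 99.86% similar to the second). Setting a lower value (80%) also removes the eighth element.
Hope this helps!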
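
One possible way to pick a threshold is to look at the pairwise scores first. The sketch below is an assumption layered on top of the answer, not part of it: pairwiseSimilarity() is a hypothetical helper name, it uses similar_text() rather than the Smith–Waterman class, and the sample strings are shortened for illustration.

```php
<?php
/**
 * Hypothetical helper: score every pair of items with similar_text()
 * and return the pairs sorted by percentage, highest first.
 */
function pairwiseSimilarity(array $items): array
{
    $items = array_values($items); // work with 0-based positions
    $pairs = [];
    $n = count($items);

    for ($i = 0; $i < $n - 1; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            similar_text($items[$i], $items[$j], $percent);
            $pairs[] = ['a' => $i, 'b' => $j, 'percent' => $percent];
        }
    }

    // Highest similarity first, so likely duplicates appear at the top.
    usort($pairs, function ($x, $y) {
        return $y['percent'] <=> $x['percent'];
    });

    return $pairs;
}

// Illustrative data only.
$sample = ['tape drives', 'tapes drives', 'warranty'];
$pairs = pairwiseSimilarity($sample);

foreach ($pairs as $p) {
    printf("%d vs %d: %.2f%%\n", $p['a'], $p['b'], $p['percent']);
}
```

A visible gap between the scores of near-duplicates and the scores of unrelated pairs suggests where $minSimilarity could sit. The number of comparisons grows quadratically with the size of the array.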

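Wow, I was thinking along the same lines and posted my own answer. A percentage is the only solution I can think of. - Niklesh Raut
Just found out that PHP has a built-in similar_text function that suits my needs perfectly; I'm just not sure how to run it over all the array elements using array_map - eozzy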

1
Assuming the value always appears at the very beginning, you can do this:
$arr = ["Some Text.", "Some Text. And more details."];

foreach($arr as $key => $value) {

    // Look for the value in every element
    foreach($arr as $key2 => $value2) {

        // Remove element if its value appears at the beginning of another element
        if ($key !== $key2 && strpos($value2, $value) === 0) {
            unset($arr[$key]);
            continue 2;
        }
    }
}

// Re-index array 
$arr = array_values($arr);

This also works if the elements are in the reverse order.

1

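sometimes just a few words are different.

As you said, a few different words can turn a text into a different one. But in programming you need an exact condition to filter on.

You can use the match percentage for filtering.

Here is a basic example to give you the idea.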

<?php
    $data = ["this is test","this is another test","one test","two test","this is two test"];
    $percentageMatched = 100;//Here you can put your percentage matched to delete
    for($i=0;$i<count($data)-1;$i++){
      $value = explode(" ",$data[$i]);
      /* check each word in another text */
      for($k=$i+1;$k<count($data);$k++){
        $nextArray = explode(" ",$data[$k]);
        $foundCount = 0;
        for($j=0;$j<count($value);$j++){  
          if(in_array($value[$j],$nextArray)){
            $foundCount++;    
          }
        }
        $fromLine = $i;
        $toLine = $k;
        $percentage = $foundCount/count($value)*100;
        echo "EN $fromLine matched $percentage % with EN $toLine  \n";  
        if($percentage >= $percentageMatched){  
          $data[$i] = "";
          break;
          //array_values($data);
        }  
      }

      echo ".............\n";
    }
    print_r(array_filter($data));
?>

Demo: https://eval.in/706478

If the input data is:

Array
(
    [0] => this is test
    [1] => this is another test
    [2] => one test
    [3] => two test
    [4] => this is two test
)

It outputs the following. With the match percentage set to 100%, indexes 0 and 3 matched 100% and were filtered out:
EN 0 matched 100 % with EN 1  
.............
EN 1 matched 25 % with EN 2  
EN 1 matched 25 % with EN 3  
EN 1 matched 75 % with EN 4  
.............
EN 2 matched 50 % with EN 3  
EN 2 matched 50 % with EN 4  
.............
EN 3 matched 100 % with EN 4  
.............
Array
(
    [1] => this is another test
    [2] => one test
    [4] => this is two test
)
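
The word-matching step above can also be pulled into a small function, which makes its asymmetry easier to see. This is a sketch under assumptions: wordOverlapPercent() is a hypothetical name, and it splits on whitespace rather than on single spaces as the answer does.

```php
<?php
/**
 * Hypothetical helper: percentage of the words in $from that also occur in $to.
 */
function wordOverlapPercent(string $from, string $to): float
{
    $fromWords = preg_split('/\s+/', trim($from), -1, PREG_SPLIT_NO_EMPTY);
    $toWords   = preg_split('/\s+/', trim($to), -1, PREG_SPLIT_NO_EMPTY);

    if (count($fromWords) === 0) {
        return 0.0; // nothing to compare
    }

    $found = 0;
    foreach ($fromWords as $word) {
        if (in_array($word, $toWords, true)) {
            $found++;
        }
    }

    return $found / count($fromWords) * 100;
}

var_dump(wordOverlapPercent('two test', 'this is two test'));
var_dump(wordOverlapPercent('this is two test', 'two test'));
```

The score depends on which string is taken as the reference: every word of the shorter text is found in the longer one, but not the other way around. That is why the loop in the answer removes the earlier, shorter entry.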

1
You can still use array_filter with a custom callback that uses substr_count to check whether the value appears more than once in the array.
$input = array("a","b","c","d","ax","cz");

$str = implode("|",array_unique($input));

$output = array_filter($input, function($var) use ($str){
                        return substr_count($str, $var) == 1;
                    });

print_r($output);

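There are some cases where this approach doesn't work, and a workaround needs to be considered. - phobia82
@3zzy If there are identical elements, it removes both of them; for example, ["a","a"] returns []. - phobia82
In that case I should run it through array_unique first to get rid of the duplicates, right? - eozzy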
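
A variant that avoids the joined string might look like the sketch below. It is an assumption, not the answerer's code: dropContained() is a hypothetical name, it de-duplicates with array_unique first as suggested in the comments, and it assumes every element is a non-empty string.

```php
<?php
/**
 * Hypothetical helper: remove exact duplicates, then drop every value
 * that occurs as a substring of a different value.
 * Assumes all elements are non-empty strings.
 */
function dropContained(array $input): array
{
    $unique = array_values(array_unique($input));

    $kept = array_filter($unique, function ($value) use ($unique) {
        foreach ($unique as $other) {
            // A value is redundant if some other, different value contains it.
            if ($other !== $value && strpos($other, $value) !== false) {
                return false;
            }
        }
        return true;
    });

    return array_values($kept); // re-index
}

print_r(dropContained(["a", "b", "c", "d", "ax", "cz", "a"]));
```

Comparing each value against the other values directly means the result does not depend on a separator character that might itself occur in the data.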

0

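I think stemming and lemmatization could help in this case. If we take the first two elements of the array as an example, the only difference is the singular "tape" versus the plural "tapes".
Array ( [0] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS. [1] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.

If you tokenize the strings and pass them through a stemmer such as Php Stemmer, the words "tape" and "tapes" will both be reduced to their stem, "tape". After stemming, you can compare the array elements. I believe this will remove many of the redundant elements.
You can go a step further and perform lemmatisation. For example, in English, the verb "to walk" may appear as "walk", "walked", "walks", "walking". The base form, "walk", that one might look up in a dictionary, is called the lemma for the word (from Wikipedia).
I have personally used Stanford NLP in Java. There is also a PHP implementation, PHP-Stanford-NLP.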
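
To illustrate the idea without depending on a particular library, the sketch below uses a deliberately crude stand-in for a stemmer: it lowercases the text and strips a trailing "s" from longer words. crudeNormalize() is a hypothetical name and this rule is an assumption for demonstration only; it is not the API of Php Stemmer or PHP-Stanford-NLP, and a real stemmer handles far more cases.

```php
<?php
/**
 * Hypothetical, deliberately naive normalizer (NOT a real stemmer):
 * lowercase, split on non-alphanumerics, strip a single trailing "s"
 * from words longer than three characters.
 */
function crudeNormalize(string $text): string
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $stems = array_map(function ($word) {
        $isPluralLike = strlen($word) > 3
            && substr($word, -1) === 's'
            && substr($word, -2) !== 'ss';
        return $isPluralLike ? substr($word, 0, -1) : $word;
    }, $words);

    return implode(' ', $stems);
}

// Shortened, illustrative fragments of the first two elements.
$a = 'connectivity between servers and tape drives';
$b = 'connectivity between servers and tapes drives';

var_dump(crudeNormalize($a) === crudeNormalize($b));
```

Once the texts are normalized this way, elements that differ only in such word forms compare as equal, so a plain array_unique over the normalized strings would be enough to spot them.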

0

0

Using array_filter is a good option

$temp = "";

function prefixmatch($x){
  global $temp;
  $temp = $x;
  // do an optimist linear search to determine if there's a prefix match
  $bool = true;
  for($i=0; $i < min([strlen($x), strlen($temp)]); $i++){
    $bool = $bool & ($x[i] === $temp[i]);
  }
  // negate the result just because of array_filter
  return(!$bool);
}

print_r(array_filter($array1, "prefixmatch"));

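Using global is not good practice. You put $x in $temp and then compare $x with $temp; of course they are equal. count() is useful for arrays; for strings it always returns 1. $bool is overwritten on each iteration of the for loop, so only the value set in the last iteration counts (there is only one iteration anyway, see the note about count() above). There is no constant i defined; PHP converts it to a string and then triggers a warning, because a string is an illegal offset into a string. All in all, the function is equivalent to function prefixmatch($x) { return false; } - axiac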

-4
In PHP, you can use the array_unique function to remove duplicates from an array.
Example from php.net:
<?php
   $input = array("a" => "green", "red", "b" => "green", "blue", "red");
   $result = array_unique($input);
   print_r($result);
?>

The output is:

Array
( 
   [a] => green
   [0] => red
   [1] => blue
)

Hope this is what you are looking for.

Only the prefixes are the same; this does not answer the question. - Rápli András
