Compare strings in PHP and remove one from the array if they are similar

7
Suppose I have an array like this:
  • Band of Horses - Is There a Ghost
  • Band Of Horses - No One's Gonna Love You
  • Band of Horses - The Funeral
  • Band of Horses - The Funeral (lyrics in description)
  • Band of Horses - Laredo
  • Band Of Horses - Laredo on Letterman 5.20.10
  • Band of Horses - "The Great Salt Lake" Sub Pop Records
  • Band Of Horses - "No One's Gonna Love You"
  • Band of Horses perform Marry Song at Tromso Wedding
  • Band Of Horses - No One's Gonna Love You
  • 'Laredo' by Band of Horses on Q TV
  • Band of Horses, On My Way Back Home
  • Band of Horses - cigarettes wedding bands
  • Band Of Horses - "Cigarettes Wedding Bands"
  • Band Of Horses - I Go To The Barn Because I Like The
  • Our Swords - Band of Horses
  • Band Of Horses - "Marry song"
  • Band of Horses - Monsters
The new array would be:
  • Band of Horses - Is There a Ghost
  • Band Of Horses - No One's Gonna Love You
  • Band of Horses - The Funeral
  • Band of Horses - Laredo
  • Band of Horses - "The Great Salt Lake" Sub Pop Records
  • Band of Horses, On My Way Back Home
  • Band of Horses - cigarettes wedding bands
  • Band Of Horses - I Go To The Barn Because I Like The
  • Our Swords - Band of Horses
  • Band Of Horses - "Marry song"
  • Band of Horses - Monsters
How can I compare each string in the list in PHP and remove it if it is similar to another one?
I consider these similar:
  • Band of Horses - The Funeral
  • Band of Horses - The Funeral (lyrics in description)
Another example:
  • Band of Horses - Laredo
  • Band Of Horses - Laredo on Letterman 5.20.10

5
What exactly do you mean by "similar"? For the sample input list, what output do you want? - Peter Ajtai
OK, I've updated my question and included the result I want. - Cody.Stewart
1
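Just because I don't want to compare every item of your two arrays one by one to figure out what you mean: can you give us some example strings that you consider similar? - deceze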
4 Answers

13
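You have several options.
For each option, you should process the album names before comparing them. You can do this by stripping punctuation, sorting the words of the album name alphabetically (in some cases), and so on.
In each case, if the comparison removes an album name from the array, the result depends on comparison order unless you set a rule for which name gets removed. So when two album names are judged "similar", it is generally best to remove the longer one.
The main comparison options are:
1. A simple substring comparison. Strip punctuation first, then compare case-insensitively (see the second code snippet below).
2. Use the levenshtein() function to check how similar the album names are. This string comparison is more efficient than similar_text(). You should strip punctuation and sort the words alphabetically.
3. Use the similar_text() function to check how similar the album names are. I had the best results with this method. In fact, I managed to select the exact album names you wanted (see the first code snippet below).
4. There are various other string comparison functions you could try, including soundex() and metaphone().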
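As a rough illustration of the preprocessing step and of options 2 and 3, here is a minimal sketch. The helper name normalizeTitle() is made up for this example and is not part of the solutions below; it only shows that two differently punctuated titles collapse to the same string before similar_text() or levenshtein() ever sees them.

```php
<?php
// Hypothetical helper (not from the answer's code): lower-case the title,
// strip punctuation, and sort the words so that word order does not matter.
function normalizeTitle(string $title): string
{
    $clean = strtolower(preg_replace('/[^a-zA-Z0-9 ]/', '', $title));
    $words = preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
    sort($words);
    return implode(' ', $words);
}

$a = normalizeTitle('Band of Horses - Laredo');
$b = normalizeTitle('Band Of Horses - "Laredo"');

// similar_text() writes a percentage (0..100) into its third argument.
similar_text($a, $b, $percent);

// levenshtein() returns the number of single-character edits between the strings.
$distance = levenshtein($a, $b);

var_dump($a === $b); // bool(true): both normalize to "band horses laredo of"
```

With real data the two normalized strings will usually differ, and the threshold on $percent or $distance has to be found by experiment.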

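Anyway... here are two solutions.

The first uses similar_text(), but computes the similarity only after stripping all punctuation, sorting the words alphabetically, and lower-casing. The downside is that you have to tune the similarity threshold. The second uses a simple case-insensitive substring test after stripping all punctuation and whitespace.

Both snippets work by using array_walk() to run the compare() function on every album in the array. Inside compare(), I use foreach() to compare the current album against all the others. There is plenty of room to make this more efficient.

Note that I should be passing the array by reference as the third argument of array_walk; can someone help me with that? The current workaround is a global variable: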


Live example (similarity threshold 69%)


function compare($value, $key)
{
    global $array; // Should use 3rd argument of compare instead
    
    $value = strtolower(preg_replace("/[^a-zA-Z0-9 ]/", "", $value));
    $value = explode(" ", $value);
    sort($value);
    $value = implode($value);
    $value = preg_replace("/[\s]/", "", $value); // Remove any leftover \s
    
    foreach($array as $key2 => $value2)
    {
        if ($key != $key2)
        {
            // collapse, and lower case the string            
            $value2 = strtolower(preg_replace("/[^a-zA-Z0-9 ]/", "", $value2));
            $value2 = explode(" ", $value2);
            sort($value2);
            $value2 = implode($value2);            
            $value2 = preg_replace("/[\s]/", "", $value2);
            
              // Set up the similarity
            similar_text($value, $value2, $sim);
            if ($sim > 69)
            {     // Remove the longer album name
                unset($array[ ((strlen($value) > strlen($value2))?$key:$key2) ]);
            }
        }
    }
}
array_walk($array, 'compare');
$array = array_values($array);
print_r($array);

The output of the above is:
Array
(
    [0] => Band of Horses - Is There a Ghost
    [1] => Band Of Horses - No One's Gonna Love You
    [2] => Band of Horses - The Funeral
    [3] => Band of Horses - Laredo
    [4] => Band of Horses - "The Great Salt Lake" Sub Pop Records
    [5] => Band of Horses perform Marry Song at Tromso Wedding
    [6] => Band of Horses, On My Way Back Home
    [7] => Band of Horses - cigarettes wedding bands
    [8] => Band Of Horses - I Go To The Barn Because I Like The
    [9] => Our Swords - Band of Horses
    [10] => Band of Horses - Monsters
)

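Note that the short version of the Marry Song is missing, so it was probably a false positive against something else, since the long version is still in the list. Otherwise these are exactly the album names you wanted.
Substring method: Live example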
function compare($value, $key)
{
      // I should be using &$array as a 3rd variable.
      // For some reason couldn't get that to work, so I do this instead.
    global $array;   
      // Take the current album name and remove all punctuation and white space
    $value = preg_replace("/[^a-zA-Z0-9]/", "", $value);        
      // Compare current album to all others
    foreach($array as $key2 => $value2)
    {
        if ($key != $key2)
        {

              // collapse the album being compared to
            $value2 = preg_replace("/[^a-zA-Z0-9]/", "", $value2);

            $subject = $value2;
            $pattern = '/' . $value . '/i';

              // If there's a match, remove the album being compared to
            if (preg_match($pattern, $subject))
            {
                unset($array[$key2]);
            }
        }
    }
}
array_walk($array, 'compare');
$array = array_values($array);
echo "<pre>";
print_r($array);
echo "</pre>";

For your sample strings, the output of the above is as follows (it shows 2 that you didn't want):
Array  
(  
    [0] => Band of Horses - Is There a Ghost  
    [1] => Band Of Horses - No One's Gonna Love You  
    [2] => Band of Horses - The Funeral  
    [3] => Band of Horses - Laredo  
    [4] => Band of Horses - "The Great Salt Lake" Sub Pop Records  
    [5] => Band of Horses perform Marry Song at Tromso Wedding      // <== Oops
    [6] => 'Laredo' by Band of Horses on Q TV                       // <== Oops  
    [7] => Band of Horses, On My Way Back Home  
    [8] => Band of Horses - cigarettes wedding bands  
    [9] => Band Of Horses - I Go To The Barn Because I Like The  
    [10] => Our Swords - Band of Horses  
    [11] => Band Of Horses - "Marry song"  
    [12] => Band of Horses - Monsters  
)
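On the global variable used in both snippets: one way to sidestep the array_walk() question is a plain function that takes the array and returns the filtered copy. The PHP manual also warns that changing the structure of the array being walked (for example with unset()) leads to unpredictable behavior, which is another reason to loop by hand. What follows is only a sketch: removeSimilar() and its $threshold parameter are names invented here, and because it visits each pair once, its results may differ from the array_walk() version on real data.

```php
<?php
// Sketch only: the function name and the $threshold default are illustrative.
function removeSimilar(array $titles, float $threshold = 69.0): array
{
    // Normalize every title once: lower-case, strip punctuation,
    // sort the words, then join them without spaces.
    $normalized = [];
    foreach ($titles as $key => $title) {
        $words = explode(' ', strtolower(preg_replace('/[^a-zA-Z0-9 ]/', '', $title)));
        sort($words);
        $normalized[$key] = implode('', $words);
    }

    $keys = array_keys($titles);
    $removed = [];

    // Compare each pair exactly once.
    for ($i = 0; $i < count($keys); $i++) {
        for ($j = $i + 1; $j < count($keys); $j++) {
            [$k1, $k2] = [$keys[$i], $keys[$j]];
            if (isset($removed[$k1]) || isset($removed[$k2])) {
                continue; // one side of the pair is already gone
            }
            similar_text($normalized[$k1], $normalized[$k2], $sim);
            if ($sim > $threshold) {
                // Drop the longer title; on a tie, drop the later one.
                $drop = strlen($normalized[$k1]) > strlen($normalized[$k2]) ? $k1 : $k2;
                $removed[$drop] = true;
            }
        }
    }

    // Remove the flagged keys and reindex.
    return array_values(array_diff_key($titles, $removed));
}

$result = removeSimilar([
    'Band of Horses - Laredo',
    'Band Of Horses - "Laredo"',
    'zzz qqq',
]);
print_r($result);
```

The first two titles normalize to the same string, so the later one is dropped, while the unrelated third entry is kept.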

3
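You could try similar_text, perhaps combined with levenshtein, and experiment to find the score threshold you consider similar enough. Also see the user notes on those manual pages for more tips. You can then loop over the array, compare each element with every other element, and remove the ones you consider too similar.
Hope this gives you a starting point. The problem is quite complex, because many things could be considered the same content while having completely different syntax ("Our Swords - Band of Horses" vs. "Band of Horses - Our Swords"). Whether this relatively simple solution is good enough depends on your needs.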
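To make the idea of combining the two functions concrete, here is one possible sketch. The function name looksSimilar() and both default thresholds are assumptions chosen for illustration, not values from this answer; they would need tuning against real titles.

```php
<?php
// Illustrative only: the function name and both thresholds are assumptions.
function looksSimilar(
    string $a,
    string $b,
    float $minPercent = 70.0,
    float $maxRelDistance = 0.3
): bool {
    $a = strtolower($a);
    $b = strtolower($b);

    // Percentage of matching characters according to similar_text().
    similar_text($a, $b, $percent);

    // Levenshtein distance relative to the longer string, so that
    // long and short titles are judged on the same scale.
    $longest = max(strlen($a), strlen($b));
    $relDistance = $longest === 0 ? 0.0 : levenshtein($a, $b) / $longest;

    // Require both measures to agree before calling two titles similar.
    return $percent >= $minPercent && $relDistance <= $maxRelDistance;
}

var_dump(looksSimilar('Band of Horses - Laredo', 'band of horses - laredo')); // bool(true)
var_dump(looksSimilar('abc', 'xyz'));                                         // bool(false)
```

Requiring both measures to pass is just one design choice; accepting either one would catch more duplicates at the cost of more false positives.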

3
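@Peter, I only knew about levenshtein beforehand, but typing "text similarity" into the PHP docs search box turns up plenty of useful stuff. Also, remember to check the "See also" section under each function. :) - deceze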

0

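Here is my (slightly complicated?) solution.

It splits the input strings into arrays of words (getWords). It then compares them against each other and groups them by "equality" (titlesMatch), ignoring word order. It stores an array for each group of matches so you can see which titles are similar.

Here is the script (assuming $array is the input):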

function getWords($str) {
    // Remove non-alpha characters and split by spaces
    $normalized = preg_replace('/[^a-z0-9\s]/', '', strtolower($str));
    $words = preg_split('/\s+/', $normalized, -1, PREG_SPLIT_NO_EMPTY);

    return $words;
}

function titlesMatch($words1, $words2) {
    $intersection = array_intersect($words1, $words2);

    sort($words1);
    sort($words2);
    sort($intersection);

    return $intersection === $words1 || $intersection === $words2;
}

$wordedArray = array_map('getWords', $array);

$uniqueItems = array();

foreach ($wordedArray as $words1Index => $words1) {
    $isUnique = true;

    foreach ($uniqueItems as &$words2Indices) {
        foreach ($words2Indices as $words2Index) {
            if (titlesMatch($words1, $wordedArray[$words2Index])) {
                $words2Indices[] = $words1Index;
                $isUnique = false;

                break;
            }
        }
    }

    if ($isUnique) {
        $uniqueItems[] = array($words1Index);
    }
}

// Show the first matches as an example
foreach ($uniqueItems as $indices) {
    echo $array[$indices[0]] . "\n";
}

Output using your input data:

Band of Horses - Is There a Ghost
Band Of Horses - No One's Gonna Love You
Band of Horses - The Funeral
Band of Horses - Laredo
Band of Horses - "The Great Salt Lake" Sub Pop Records
Band of Horses perform Marry Song at Tromso Wedding
Band of Horses, On My Way Back Home
Band of Horses - cigarettes wedding bands
Band Of Horses - I Go To The Barn Because I Like The
Our Swords - Band of Horses
Band of Horses - Monsters

(Note: this looks like O(n³), but it is actually O(n²).)

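This is order-dependent. Sorting the phrases first might help mitigate that. It will also fail when words are misspelled or slightly different, but of course that also makes it less prone to false positives... - Matthew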

0

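The best implementation will depend heavily on your data. The more you know about your data, the better the results you can get with the least work. Anyway, here is a sample script I wrote: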

<?php
    $list = array(); # source data

    $groups = array();

    foreach ($list as $item)
    {
        $words = array_unique(explode(' ', trim(preg_replace('/[^a-z]+/', ' ', strtolower($item)))));

        $matches = array();

        foreach ($groups as $i => $group)
        {
            foreach ($group as $g)
            {
                if (count($words) < count($g['words']))
                {
                    $a = $words;
                    $b = $g['words'];
                }
                else
                {
                    $a = $g['words'];
                    $b = $words;
                }

                $c = 0;
                foreach ($a as $word1)
                {
                    foreach ($b as $word2)
                    {
                        if (levenshtein($word1, $word2) < 2)
                        {
                            ++$c;
                            break;
                        }
                    }
                }

                if ($c / count($a) > 0.85)
                {
                    $matches[] = $i;
                    continue 2;
                }
            }           
        }

        $me = array('item' => $item, 'words' => $words);

        if (!$matches)
            $groups[] = array($me);
        else
        {
            for ($i = 1; $i < count($matches); ++$i)
            {
                $groups[$matches[0]] = array_merge($groups[$matches[0]], $groups[$matches[$i]]);
                unset($groups[$matches[$i]]);
            }

            $groups[$matches[0]][] = $me;
        }
    }

    foreach ($groups as $group)
    {
        echo $group[0]['item']."\n";
        for ($i = 1; $i < count($group); ++$i)
            echo "\t".$group[$i]['item']."\n";
    }
?>

Output using your list:

Band of Horses - Is There a Ghost
Band Of Horses - No One's Gonna Love You
    Band Of Horses - "No One's Gonna Love You"
    Band Of Horses - No One's Gonna Love You
    Band Of Horses - No One's Gonna Love You
Band of Horses - The Funeral
    Band of Horses - The Funeral (lyrics in description)
Band of Horses - Laredo
    Band Of Horses - Laredo on Letterman 5.20.10
    'Laredo' by Band of Horses on Q TV
Band of Horses - "The Great Salt Lake" Sub Pop Records
Band of Horses perform Marry Song at Tromso Wedding
    Band Of Horses - "Marry song"
Band of Horses, On My Way Back Home
Band of Horses - cigarettes wedding bands
    Band Of Horses - "Cigarettes Wedding Bands"
Band Of Horses - I Go To The Barn Because I Like The
Our Swords - Band of Horses
Band of Horses - Monsters

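The basic principle here is to group similar list items together. Each new item that comes in is compared against the existing groups, with the shorter item's words checked against the longer item's. If enough of those words (more than 85%) are close to a word in the other item (Levenshtein distance below 2), it is considered a match and the item is added to that group.

If you tweak the parameters, this may already be good enough. Other things to consider: ignoring small words entirely, similar phrases, etc.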
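One of the refinements mentioned above, ignoring small words, could be sketched as follows. The function name significantWords() and the stop-word list are assumptions for illustration; which words count as "small" depends on your data.

```php
<?php
// Hypothetical stop-word filter; the word list is an assumption, adjust to taste.
function significantWords(
    string $title,
    array $stopWords = ['a', 'an', 'the', 'of', 'on', 'in', 'by', 'at']
): array {
    // Same word splitting as the script above: letters only, lower-cased.
    $words = explode(' ', trim(preg_replace('/[^a-z]+/', ' ', strtolower($title))));

    // Drop stop words and any empty string left over from an empty title.
    $kept = array_diff($words, $stopWords, ['']);

    return array_values(array_unique($kept));
}

$words = significantWords("'Laredo' by Band of Horses on Q TV");
print_r($words); // laredo, band, horses, q, tv
```

Feeding these filtered word lists into the grouping loop would keep short filler words from inflating or deflating the match ratio.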

