Using preg_match to find and replace a string pattern

5
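I have a WordPress database that contains some embedded iframes from SoundCloud. I want to replace these iframes with a shortcode. I have even created the shortcode, and it works very well.
The problem is that I have an old database with about 2,000 posts that already contain the embed code. I want to write some code that replaces the iframes with the shortcode.
Here is the code I am using to find the URL in the content, but it always returns a blank result.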
$string = 'Think Kavinsky meets Futurecop! meets your favorite 80s TV show theme song and you might be pretty close to Swedish producer Johan Bengtsson\'s retro project, <a href="https://soundcloud.com/daataa"><strong>Mitch Murder</strong></a>. Title track, "The Touch," is genuinely lighthearted and fun, crossing over from 80s synth work into a bit of French Touch influence; also including a big time guitar solo straight out of your dad\'s record collection. B-side "Race Day" could very easily be the soundtrack to a video montage of all of your favorite beach scenes from every 80s movie you\'ve ever watched, or as the PR put it, "quite possibly a contender to be the title screen music to a Wave Race 64 sequel." Sounds awesome to me. Also included in this package out today on <a href="https://soundcloud.com/maddecent/">Mad Decent</a>\'s Jeffree\'s sub-label are two remixes of the A-side from Lifelike and Nite Sprite. Download below.
<iframe src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Fplaylists%2F8087281&amp;color=000000&amp;auto_play=false&amp;show_artwork=true" frameborder="no" scrolling="no" width="100%" height="350"></iframe>';

preg_match("/url=(.*?)/", $string, $matches);

print_r($matches);

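The code above does not work. I am not very familiar with regular expressions, so it would be great if someone could figure out what is wrong. It would be even better if someone could point me to the right way to do this.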


1
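Don't use regular expressions to parse HTML. Use a proper HTML parsing module. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML differs from your expectations, your code will break. See http://htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested, and debugged. - Andy Lester
5 Answers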

4

Since you are working with HTML, I would suggest using the DOM functions:

$doc = new DOMDocument;
$doc->loadHTML($string);

foreach ($doc->getElementsByTagName('iframe') as $iframe) {
    $url = $iframe->getAttribute('src');
    // parse the query string
    parse_str(parse_url($url, PHP_URL_QUERY), $args);
    // save the modified attribute
    $iframe->setAttribute('src', $args['url']);
}

echo $doc->saveHTML();

This outputs the full document, so you need to trim it down to the contents of the body:

$body = $doc->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $node) {
    echo $doc->saveHTML($node);
}

Output:

<p>Think Kavinsky meets Futurecop! meets your favorite 80s TV show theme song and you might be pretty close to Swedish producer Johan Bengtsson's retro project, <a href="https://soundcloud.com/daataa"><strong>Mitch Murder</strong></a>. Title track, "The Touch," is genuinely lighthearted and fun, crossing over from 80s synth work into a bit of French Touch influence; also including a big time guitar solo straight out of your dad's record collection. B-side "Race Day" could very easily be the soundtrack to a video montage of all of your favorite beach scenes from every 80s movie you've ever watched, or as the PR put it, "quite possibly a contender to be the title screen music to a Wave Race 64 sequel." Sounds awesome to me. Also included in this package out today on <a href="https://soundcloud.com/maddecent/">Mad Decent</a>'s Jeffree's sub-label are two remixes of the A-side from Lifelike and Nite Sprite. Download below.
<iframe src="http://api.soundcloud.com/playlists/8087281" frameborder="no" scrolling="no" width="100%" height="350"></iframe></p>
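
To go one step further and swap each iframe for the shortcode itself, the same DOM approach can replace the node with a text node. This is a minimal sketch rather than a tested migration script: the function name soundcloudIframesToShortcodes is hypothetical, the [soundcloud url="..."] format is the one discussed elsewhere in this thread, and the content is assumed to be plain ASCII (non-ASCII text would need an explicit encoding hint for loadHTML).

```php
<?php
// Hypothetical helper: replaces SoundCloud player iframes with shortcodes.
function soundcloudIframesToShortcodes(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect warnings instead of printing them
    // Wrap the fragment in a <div> so its children are easy to collect afterwards.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list to an array before modifying the tree.
    foreach (iterator_to_array($doc->getElementsByTagName('iframe')) as $iframe) {
        $src = $iframe->getAttribute('src');
        if (parse_url($src, PHP_URL_HOST) !== 'w.soundcloud.com') {
            continue; // leave unrelated iframes untouched
        }
        // parse_str() also URL-decodes the values.
        parse_str((string) parse_url($src, PHP_URL_QUERY), $args);
        if (empty($args['url'])) {
            continue;
        }
        $shortcode = $doc->createTextNode('[soundcloud url="' . $args['url'] . '"]');
        $iframe->parentNode->replaceChild($shortcode, $iframe);
    }

    // Serialize only the children of the wrapper <div>, not the whole document.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $node) {
        $out .= $doc->saveHTML($node);
    }
    return $out;
}

$sample = 'Download below. <iframe src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Fplaylists%2F8087281&amp;color=000000" width="100%" height="350"></iframe>';
echo soundcloudIframesToShortcodes($sample), "\n";
```

Wrapping the fragment in a div avoids the extra <p> that the parser otherwise adds around bare text, as seen in the output above.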

2

This should do what you specified:

$new_string = preg_replace('/(?:<iframe[^\>]+src="[^\"]*url=([^\"]*soundcloud\.com[^\"]*))"[^\/]*\/[^\>]*>/i', '[soundcloud url="$1"]', $string);

It only matches iframes whose src attribute contains a url=...soundcloud... part, and it replaces the entire iframe markup with [soundcloud url="{the part after url=}"].
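
Note that the captured group is still URL-encoded and includes any trailing parameters such as &amp;color=000000. As a hedged variation (not the answerer's code), preg_replace_callback can capture only the part before the first "&" and decode it. The function name soundcloudRegexToShortcodes and the stricter pattern are assumptions for illustration, and the usual caveats about matching HTML with regular expressions still apply.

```php
<?php
// Hypothetical helper: regex-based replacement that also decodes the URL.
function soundcloudRegexToShortcodes(string $html): string
{
    // Capture the url= value up to the first '&' or closing quote,
    // then consume the rest of the tag and its closing </iframe>.
    $pattern = '~<iframe\b[^>]*\bsrc="https?://w\.soundcloud\.com/player/?\?url=([^"&]+)[^"]*"[^>]*>\s*</iframe>~i';

    $result = preg_replace_callback(
        $pattern,
        function (array $m): string {
            return '[soundcloud url="' . urldecode($m[1]) . '"]';
        },
        $html
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $html;
}

$sample = 'Download below. <iframe src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Fplaylists%2F8087281&amp;color=000000&amp;auto_play=false" frameborder="no" width="100%" height="350"></iframe>';
echo soundcloudRegexToShortcodes($sample), "\n";
// Download below. [soundcloud url="http://api.soundcloud.com/playlists/8087281"]
```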


2

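If this is a one-time fix, you might consider an SQL solution. The SQL below makes a few assumptions:

  • Each post contains only one iframe that needs replacing (if a post has several, you can run the SQL multiple times).
  • The iframes to be replaced all have the following format: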

<iframe src="https://w.soundcloud.com/player/?url="..." other-stuff</iframe>

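  • You only care about what is between the quotes in the url parameter.
  • The end result is [soundcloud url="..."]

If all of the above hold, the following SQL should do the trick. It can be tweaked if you need a different shortcode, etc.

Before running any bulk update, be sure to back up the wp_posts table.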

CREATE TABLE wp_posts_backup SELECT * FROM wp_posts
;

Once the backup is done, the following SQL statement fixes all posts in one go:

UPDATE wp_posts p

   SET p.post_content = CONCAT( SUBSTRING_INDEX( p.post_content, '<iframe src="https://w.soundcloud.com/player/?url=', 1 )
                               ,'[soundcloud url="'
                               , REPLACE( REPLACE(
                                 SUBSTRING_INDEX( SUBSTR( p.post_content
                                                        , LOCATE( '<iframe src="https://w.soundcloud.com/player/?url=', p.post_content ) + 50
                                                        )
                                                , '&amp;', 1
                                                )
                               , '%3A', ':' ), '%2F', '/' )
                               ,'?'
                               ,SUBSTRING_INDEX( SUBSTR( p.post_content
                                                       , LOCATE( '<iframe src="https://w.soundcloud.com/player/?url=', p.post_content ) + 50
                                                       + LOCATE( '&amp;', SUBSTR( p.post_content
                                                                                , LOCATE( '<iframe src="https://w.soundcloud.com/player/?url=', p.post_content ) + 50
                                                                                )
                                                               ) + 4
                                                       )
                                               , ' ', 1
                                               )
                               ,']'
                               ,SUBSTR( p.post_content, LOCATE( '</iframe>', p.post_content ) + 9 )
                              )

 WHERE p.post_content LIKE '%<iframe src="https://w.soundcloud.com/player/?url=%</iframe>%'
;

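I suggest you test on a few posts before applying this to all of them. A simple way to test is to add the following to the WHERE clause above (immediately before the ";"), changing each "?" to the ID of a post you want to test.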
AND p.ID IN (?,?,?)

If for any reason you need to restore the posts, you can run:

UPDATE wp_posts p
  JOIN wp_posts_backup b
    ON b.ID = p.ID
   SET p.post_content = b.post_content
;

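One more thing to consider. I was not sure whether you wanted to carry over the parameters that currently follow the URL, so I included them. You can easily remove them by changing: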

                               ,'?'
                               ,SUBSTRING_INDEX( SUBSTR( p.post_content
                                                       , LOCATE( '<iframe src="https://w.soundcloud.com/player/?url=', p.post_content ) + 50
                                                       + LOCATE( '&amp;', SUBSTR( p.post_content
                                                                                , LOCATE( '<iframe src="https://w.soundcloud.com/player/?url=', p.post_content ) + 50
                                                                                )
                                                               ) + 4
                                                       )
                                               , ' ', 1
                                               )
                               ,']'

to:

                           ,'"]'

resulting in:
UPDATE wp_posts p

   SET p.post_content = CONCAT( SUBSTRING_INDEX( p.post_content, '<iframe src="https://w.soundcloud.com/player/?url=', 1 )
                               ,'[soundcloud url="'
                               , REPLACE( REPLACE(
                                 SUBSTRING_INDEX( SUBSTR( p.post_content
                                                        , LOCATE( '<iframe src="https://w.soundcloud.com/player/?url=', p.post_content ) + 50
                                                        )
                                                , '&amp;', 1
                                                )
                               , '%3A', ':' ), '%2F', '/' )
                               ,'"]'
                               ,SUBSTR( p.post_content, LOCATE( '</iframe>', p.post_content ) + 9 )
                              )

 WHERE p.post_content LIKE '%<iframe src="https://w.soundcloud.com/player/?url=%</iframe>%'
;

Updated to allow URLs with no parameters:

UPDATE wp_posts p

   SET p.post_content = CONCAT( SUBSTRING_INDEX( p.post_content, '<iframe src="https://w.soundcloud.com/player/?url=', 1 )
                               ,'[soundcloud url="'
                               , REPLACE( REPLACE(
                                 SUBSTRING_INDEX(
                                     SUBSTRING_INDEX( SUBSTR( p.post_content
                                                            , LOCATE( '<iframe src="https://w.soundcloud.com/player/?url=', p.post_content ) + 50
                                                            )
                                                    , '&amp;', 1
                                                    )
                                                , '"', 1
                                                )
                               , '%3A', ':' ), '%2F', '/' )
                               ,'"]'
                               ,SUBSTR( p.post_content, LOCATE( '</iframe>', p.post_content ) + 9 )
                              )

 WHERE p.post_content LIKE '%<iframe src="https://w.soundcloud.com/player/?url=%</iframe>%'
;
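
For a dry run outside the database, the same string logic can be mirrored in PHP and tried on a few sample post bodies first. This is only a sketch under the answer's own assumptions (one iframe per post, the exact prefix shown above); the names SC_PREFIX and mirrorSqlReplacement are hypothetical. It also shows where the constant 50 in the SQL comes from: it is the length of the prefix string.

```php
<?php
// The literal prefix used throughout the SQL; its length is the "+ 50" offset.
const SC_PREFIX = '<iframe src="https://w.soundcloud.com/player/?url=';

// Hypothetical helper mirroring the final UPDATE statement on one post body.
function mirrorSqlReplacement(string $content): string
{
    $start = strpos($content, SC_PREFIX);
    $end   = strpos($content, '</iframe>');
    // Equivalent of the WHERE ... LIKE filter: skip posts without the pattern.
    if ($start === false || $end === false || $end < $start) {
        return $content;
    }

    $before = substr($content, 0, $start);                      // SUBSTRING_INDEX(content, prefix, 1)
    $rest   = substr($content, $start + strlen(SC_PREFIX));     // SUBSTR(content, LOCATE(prefix) + 50)
    $url    = explode('"', explode('&amp;', $rest, 2)[0], 2)[0]; // cut at '&amp;', then at '"'
    $url    = str_replace(['%3A', '%2F'], [':', '/'], $url);    // the two nested REPLACE calls
    $after  = substr($content, $end + strlen('</iframe>'));     // SUBSTR(content, LOCATE('</iframe>') + 9)

    return $before . '[soundcloud url="' . $url . '"]' . $after;
}

echo strlen(SC_PREFIX), "\n"; // 50
echo mirrorSqlReplacement(
    'A <iframe src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Fplaylists%2F3608420" frameborder="no"></iframe> B'
), "\n";
// A [soundcloud url="http://api.soundcloud.com/playlists/3608420"] B
```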

Good luck.


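This works great. Can you tell me whether we can be more specific? After running the script I currently get [soundcloud url="http://api.soundcloud.com/tracks/107374286?color=000000&auto_play=false&show_artwork=true"]. Can't we change it to [soundcloud url="http://api.soundcloud.com/tracks/107374286"]? - Nirmal Ram
Yes. As I noted at the end of the answer: "One more thing to consider. I was not sure whether you wanted to carry over the parameters that currently follow the URL, so I included them. You can easily remove them by changing: ..." So replace ,'?',SUBSTRING_INDEX ... ,']' with ,'"]' to get the URL without parameters. - gwc
Yes, that works, but there is another problem. When I ran it on some entries, it turned into this: [soundcloud url="http://api.soundcloud.com/playlists/3608420" frameborder="no" scrolling="no" width="100%" height="350"></iframe> Is there anything else I can do to correct it? - Nirmal Ram
So it seems that not all of the URLs contain parameters. I have updated the SQL to allow URLs without parameters. - gwc
Every problem is unique: constants, lengths, etc. will vary from problem to problem. The basic concept still applies, though. It just needs to be adjusted to your specific data requirements. - gwc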

1
<?php
    preg_match("/url\=([^\"]+)/i", $string, $matches);

So basically you want to match one or more characters after url= that are not a double quote (").
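
To make the difference concrete, here is a small sketch comparing the pattern from the question with the one above. The reason the original returned a blank capture is that a lazy quantifier with nothing after it, (.*?), is satisfied by zero characters. The shortened $src string is a sample made up for illustration, and the final splitting and decoding step is an addition, not part of this answer.

```php
<?php
$src = 'src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Fplaylists%2F8087281&amp;color=000000"';

// Lazy quantifier with nothing following it: it matches as few characters
// as possible, which is none, so the capture group is empty.
preg_match('/url=(.*?)/', $src, $lazy);
var_dump($lazy[1]); // string(0) ""

// Negated character class: one or more characters that are not a double quote.
preg_match('/url=([^"]+)/i', $src, $m);
// $m[1] is still URL-encoded and includes the trailing "&amp;color=..." part.

// Keep only the part before the first '&', then decode %3A, %2F, etc.
$encoded = explode('&', $m[1], 2)[0];
$url = urldecode($encoded);
echo $url, "\n"; // http://api.soundcloud.com/playlists/8087281
```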

Hi, thanks for the answer, but can you guide me on how to make it more specific, e.g. w.soundcloud.com/player/?url= - Nirmal Ram
What is the best way to replace the iframe with something like [soundcloud url="$matches[0]"]? - Nirmal Ram
preg_replace("/w.soundcloud.com/?url=([^"]+)/i", "[soundcloud url="$1" ]", $string); - lePunk

1
I suggest you take a look at simplehtmldom. It is a DOM parser that uses jQuery- and CSS-like selectors.

http://simplehtmldom.sourceforge.net/

// Assumes simple_html_dom.php has been included; str_get_html() parses an HTML string.
$html = str_get_html($html_from_database);
// Find all iframes
foreach ($html->find('iframe') as $element) {
   $source = $element->src; // extract the source from the iframe
   // This is where you do your magic, like changing links.
   $element->src = $source; // This is where you replace the old source
}


// UPDATE $html back into the table.

Before updating any table after parsing, make sure you take a full backup of all tables :)

http://simplehtmldom.sourceforge.net/manual.htm

