How do I use wget's --accept-regex option to download a website?

Question


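I'm trying to use wget to download an archive of my website, 3dsforums.com, but there are millions of pages I don't want to download. So I want to tell wget to download only the pages that match a certain URL pattern, and I'm running into some problems.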

For example, this is a URL I would like to download: http://3dsforums.com/forumdisplay.php?f=46 so I tried the --accept-regex option:

wget -mkEpnp --accept-regex "(forumdisplay\.php\?f=(\d+)$)" http://3dsforums.com

But it only downloads the site's home page.

So far, the only command that has come even remotely close to working is this one:

wget -mkEpnp --accept-regex "(\w+\.php$)" http://3dsforums.com

which gives the following response:

Downloaded 9 files, 215K in 0.1s (1.72 MB/s)
Converting links in 3dsforums.com/faq.php.html... 16-19
Converting links in 3dsforums.com/index.html... 8-88
Converting links in 3dsforums.com/sendmessage.php.html... 14-15
Converting links in 3dsforums.com/register.php.html... 13-14
Converting links in 3dsforums.com/showgroups.php.html... 14-29
Converting links in 3dsforums.com/index.php.html... 16-80
Converting links in 3dsforums.com/calendar.php.html... 17-145
Converting links in 3dsforums.com/memberlist.php.html... 14-99
Converting links in 3dsforums.com/search.php.html... 15-16
Converted links in 9 files in 0.009 seconds.

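Is my regular expression wrong? Or am I misunderstanding how the --accept-regex option is meant to be used? I have tried all sorts of variations today, but I still haven't worked out what the actual problem is.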

- David Turnbull

1 Answer


- zwer · Accepted Answer

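By default, wget uses POSIX regular expressions, which have no \d or \w shorthands. The digit class is written [[:digit:]], that is, [:digit:] inside a bracket expression. POSIX defines no [:word:] class, so the portable spelling of \w is [[:alnum:]_]. Also, why all the grouping? If your wget was built with PCRE support, you can make your life easier and do it as: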

wget -mkEpnp --regex-type pcre --accept-regex "forumdisplay\.php\?f=\d+(&s=.*)?$" http://3dsforums.com

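But that won't actually work, because the forum software automatically generates session IDs (s=<session_id>) and injects them into every link, so you need to account for those as well: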

wget -mkEpnp --regex-type pcre --accept-regex "forumdisplay\.php\?(s=.*)?f=\d+(s=.*)?$" http://3dsforums.com

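The only problem is that your files will now be saved with the session ID in their names, so you will need one more step after wget finishes: bulk renaming every file that has a session ID in its name. You can probably do it by piping wget through sed, but I'll leave that to you :)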
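The rename step could be sketched roughly as below. This is not the piping approach mentioned above but a post-processing pass over the downloaded tree, and it rests on assumptions: the function names strip_session_id and rename_session_files are made up for illustration, the session ID is assumed to contain neither "&" nor ".", and the saved file names are assumed to look like forumdisplay.php?s=<session_id>&f=46.html.

```shell
# Remove an "s=<value>" query parameter from a single file name.
# Assumption: the session ID contains neither "&" nor ".".
strip_session_id() {
    printf '%s\n' "$1" | sed -E 's/([?&])s=[^&.]*&/\1/; s/[?&]s=[^&.]*//'
}

# Rename every file under a directory whose name carries a session ID.
rename_session_files() {
    dir=$1
    find "$dir" -type f -name '*s=*' | while IFS= read -r path; do
        base=$(basename "$path")
        new=$(strip_session_id "$base")
        target="$(dirname "$path")/$new"
        # Skip names that do not change, and never overwrite an existing file.
        if [ "$new" != "$base" ] && [ ! -e "$target" ]; then
            mv -- "$path" "$target"
        fi
    done
}

# Example call once wget has finished (directory name assumed):
# rename_session_files 3dsforums.com
```

Because -k rewrites links to point at the saved file names, links inside the pages would probably still refer to the old names after this pass, so a matching rewrite of the HTML may also be needed.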

And if your wget doesn't support PCRE, the pattern will end up quite a bit longer, but let's hope it does...
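For the case without PCRE, one possible POSIX spelling is sketched below, together with a quick offline check using grep -E, which uses the same extended syntax. It is a variation rather than the exact pattern above: it spells out the "&" separators around the session ID. The helper name url_accepted and the sample session ID are made up for illustration.

```shell
# POSIX ERE spelling of the pattern: [[:digit:]] replaces \d.
# The optional groups allow a session ID before or after f=<number>.
pattern='forumdisplay\.php\?(s=[^&]*&)?f=[[:digit:]]+(&s=[^&]*)?$'

# Succeed if the URL matches the pattern.
url_accepted() {
    printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Hypothetical sample URLs (the session ID value is invented).
for url in \
    'http://3dsforums.com/forumdisplay.php?f=46' \
    'http://3dsforums.com/forumdisplay.php?s=0123abcd&f=46' \
    'http://3dsforums.com/faq.php'
do
    if url_accepted "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done

# The same pattern handed to wget (not run here):
# wget -mkEpnp --regex-type posix --accept-regex "$pattern" http://3dsforums.com
```

Checking the pattern against a handful of URLs first is cheaper than running a full mirror only to find that the home page was the only thing fetched.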