Download a working local copy of a webpage

Question

download wget offline-browsing

233

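I would like to download a local copy of a web page, including all of its CSS, images, JavaScript, and so on.

In previous discussions (e.g. here and here, both of which are more than two years old), two suggestions are generally put forward: wget -p and httrack. However, neither suggestion meets my needs. I would very much like to get this done with either of these tools, or with an alternative.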

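Option 1: wget -p

wget -p successfully downloads all of the page's prerequisites (CSS, images, JS, etc.). However, when I load the local copy, the page cannot load those prerequisites, because their paths have not been modified from the version on the web.

For example:

In the page's HTML, <link rel="stylesheet" href="/stylesheets/foo.css" /> would need to be corrected to point to the new relative path of foo.css.
In the CSS file, background-image: url(/images/bar.png) would need to be adjusted in the same way.

Is there a way to modify wget -p so that the paths are correct?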

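Option 2: httrack

httrack seems like a great tool for mirroring entire websites, but it is unclear to me how to use it to create a local copy of a single page. There is a great deal of discussion in the httrack forums about this topic (e.g. here), but no one seems to have a fully reliable solution.

Option 3: another tool?

Some people have suggested paid tools, but I cannot believe there is no free solution.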

- brahn

23

If the answers do not work, try: wget -E -H -k -K -p http://example.com. This is the only thing that worked for me. Source: http://superuser.com/a/136335/94039 - its_me

There is also software that does this: Teleport Pro. - pbies

5

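The command wget --random-wait -r -p -e robots=off -U mozilla http://www.example.com recursively downloads all pages of the given site and saves them as local files while identifying itself as a Mozilla browser. --random-wait waits a random amount of time between requests to avoid overloading the server, -r downloads recursively, -p downloads everything needed to display each page, and -e robots=off tells wget not to follow the rules in robots.txt. - davidcondrey

Possible duplicate of "Download a web page and its dependencies, including CSS and images" (https://dev59.com/E3I-5IYBdhLWcg3w-92b). - jww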

1

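Regarding the way this question was closed: it has 203K views to date, and it has explicit additional requirements beyond the other proposed and linked solutions. - Merlin

1 Answer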

- serk · Accepted Answer

wget is capable of doing what you are asking. Just try the following:

wget -p -k http://www.example.com/

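The -p option gets all the elements required to view the site correctly (CSS, images, etc.). The -k option changes all links (including those for CSS and images) so that you can view the page offline as it appeared online.

From the Wget docs: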
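As a rough sketch of how this might be wrapped for repeated use, the helper below combines -p and -k with the -E, -H, and -K options mentioned in the comments on the question. The function name save_page and its -n (print only) switch are made up for this illustration and are not part of wget; whether the extra options are needed depends on the page.

```shell
#!/bin/sh
# save_page is a hypothetical helper, not something wget provides.
# Usage: save_page [-n] URL    (-n only prints the command)
save_page() {
    dry_run=0
    if [ "$1" = "-n" ]; then
        dry_run=1
        shift
    fi
    url=$1
    # -p  fetch page requisites (CSS, images, scripts)
    # -k  convert links for local viewing after the download
    # -E  adjust extensions, e.g. save HTML pages with a .html suffix
    # -H  span hosts, so requisites served from other hosts are fetched
    # -K  keep the original of each converted file with a .orig suffix
    set -- wget -p -k -E -H -K "$url"
    if [ "$dry_run" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Print the command that would run, without touching the network.
save_page -n http://www.example.com/
```

Dropping -n runs the download itself; wget then writes the files under a directory named after the host.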

‘-k’
‘--convert-links’
After the download is complete, convert the links in the document to make them
suitable for local viewing. This affects not only the visible hyperlinks, but
any part of the document that links to external content, such as embedded images,
links to style sheets, hyperlinks to non-html content, etc.

Each link will be changed in one of the two ways:

    The links to files that have been downloaded by Wget will be changed to refer
    to the file they point to as a relative link.

    Example: if the downloaded file /foo/doc.html links to /bar/img.gif, also
    downloaded, then the link in doc.html will be modified to point to
    ‘../bar/img.gif’. This kind of transformation works reliably for arbitrary
    combinations of directories.

    The links to files that have not been downloaded by Wget will be changed to
    include host name and absolute path of the location they point to.

    Example: if the downloaded file /foo/doc.html links to /bar/img.gif (or to
    ../bar/img.gif), then the link in doc.html will be modified to point to
    http://hostname/bar/img.gif. 

Because of this, local browsing works reliably: if a linked file was downloaded,
the link will refer to its local name; if it was not downloaded, the link will
refer to its full Internet address rather than presenting a broken link. The fact
that the former links are converted to relative links ensures that you can move
the downloaded hierarchy to another directory.

Note that only at the end of the download can Wget know which links have been
downloaded. Because of that, the work done by ‘-k’ will be performed at the end
of all the downloads.
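Given the two conversion rules quoted above, a converted HTML file should contain either relative links or full http://hostname/... addresses, so any remaining root-relative link such as href="/stylesheets/foo.css" suggests that the conversion did not reach it. The following is a minimal sketch of such a check; find_unconverted is a hypothetical helper name, and the pattern only looks at href and src attributes, not at url(...) references inside CSS files.

```shell
#!/bin/sh
# find_unconverted is a hypothetical helper for this illustration.
# It prints, with line numbers, every line of the given file that still
# has a root-relative href or src attribute (href="/..." or src="/...").
# Protocol-relative links such as src="//host/a.js" are not reported.
find_unconverted() {
    grep -nE "(href|src)=[\"']/[^/]" "$1"
}

# Example call (the path is an assumption about where wget saved the page):
#   find_unconverted www.example.com/index.html
```

grep exits with status 1 when nothing matches, so an empty result with exit status 1 means no root-relative links were found in that file.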