Check whether a remote file exists in bash

Question

12

I am using this script to download files:

parallel --progress -j16 -a ./temp/img-url.txt 'wget -nc -q -P ./images/ {}; wget -nc -q -P ./images/ {.}_{001..005}.jpg'

Is it possible to skip the download, only check the files on the remote side, and create a dummy file for each one that exists?

Something like:

if wget --spider "$url" 2>/dev/null; then
  touch img.file  # placeholder name for the dummy file
fi

should work, but I do not know how to combine this code with GNU Parallel.

Edit:

Based on Ole's answer, I wrote this code:

#!/bin/bash
do_url() {
  url="$1"
  wget -q -nc  --method HEAD "$url" && touch ./images/${url##*/}   
  #get filename from $url
  url2=${url##*/}
  wget -q -nc  --method HEAD ${url%.jpg}_{001..005}.jpg && touch ./images/${url2%.jpg}_{001..005}.jpg
}
export -f do_url

parallel --progress -a urls.txt do_url {}

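It works, but it fails for some files. I cannot find a consistent reason why it works for some files and fails for others. Maybe it has something to do with the last file name. The second wget tries to access the correct URLs, but the touch command after it does not create the desired files. The first wget always (correctly) handles the main image, without _001.jpg, _002.jpg.

Example of urls.txt:

http://host.com/092401.jpg (works, _001.jpg.._005.jpg are downloaded)
http://host.com/HT11019.jpg (does not work, only the main image is downloaded)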

- Adrian

1

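Use --method HEAD to send a HEAD request instead of a GET request. - chepner

Possibly a duplicate of https://dev59.com/qWct5IYBdhLWcg3wSrl3. - iamauser

@iamauser Are you serious? Where does that question say anything about checking a sequence of files on the remote side? - Adrian

Yes, I am. I think your question should be how to loop over a series of files, since that is the input to each wget/curl call. - iamauser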

2

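Completely changing your question after answers have been posted is bad form. It makes most of the answers here look wrong, when the real issue is that the question changed after they were written. - darnir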

5 Answers

5

It is hard to understand what you really want to achieve. Let me try to rephrase your question.

I have urls.txt containing:
http://example.com/dira/foo.jpg
http://example.com/dira/bar.jpg
http://example.com/dirb/foo.jpg
http://example.com/dirb/baz.jpg
http://example.org/dira/foo.jpg
On example.com these URLs exist:
http://example.com/dira/foo.jpg
http://example.com/dira/foo_001.jpg
http://example.com/dira/foo_003.jpg
http://example.com/dira/foo_005.jpg
http://example.com/dira/bar_000.jpg
http://example.com/dira/bar_002.jpg
http://example.com/dira/bar_004.jpg
http://example.com/dira/fubar.jpg
http://example.com/dirb/foo.jpg
http://example.com/dirb/baz.jpg
http://example.com/dirb/baz_001.jpg
http://example.com/dirb/baz_005.jpg
On example.org these URLs exist:
http://example.org/dira/foo_001.jpg
Given urls.txt I want to generate the combinations with _001.jpg .. _005.jpg in addition to the original URL. E.g.:
http://example.com/dira/foo.jpg
becomes:
http://example.com/dira/foo.jpg
http://example.com/dira/foo_001.jpg
http://example.com/dira/foo_002.jpg
http://example.com/dira/foo_003.jpg
http://example.com/dira/foo_004.jpg
http://example.com/dira/foo_005.jpg
Then I want to test if these URLs exist without downloading the file. As there are many URLs I want to do this in parallel.

If the URL exists I want an empty file created.

(Version 1): I want the empty file created in a similar directory structure in the dir images. This is needed because some of the images have the same name, but in different dirs.

So the files created should be:
images/http:/example.com/dira/foo.jpg
images/http:/example.com/dira/foo_001.jpg
images/http:/example.com/dira/foo_003.jpg
images/http:/example.com/dira/foo_005.jpg
images/http:/example.com/dira/bar_000.jpg
images/http:/example.com/dira/bar_002.jpg
images/http:/example.com/dira/bar_004.jpg
images/http:/example.com/dirb/foo.jpg
images/http:/example.com/dirb/baz.jpg
images/http:/example.com/dirb/baz_001.jpg
images/http:/example.com/dirb/baz_005.jpg
images/http:/example.org/dira/foo_001.jpg
(Version 2): I want the empty file created in the dir images. This can be done because all the images have unique names.

So the files created should be:
images/foo.jpg
images/foo_001.jpg
images/foo_003.jpg
images/foo_005.jpg
images/bar_000.jpg
images/bar_002.jpg
images/bar_004.jpg
images/baz.jpg
images/baz_001.jpg
images/baz_005.jpg
(Version 3): I want the empty file created in the dir images called the name from urls.txt. This can be done because only one of _001.jpg .. _005.jpg exists.
images/foo.jpg
images/bar.jpg
images/baz.jpg

#!/bin/bash

do_url() {
  url="$1"

  # Version 1:
  # If you want to keep the folder structure from the server (similar to wget -m):
  wget -q --method HEAD "$url" && mkdir -p images/"$2" && touch images/"$url"

  # Version 2:
  # If all the images have unique names and you want all images in a single dir
  wget -q --method HEAD "$url" && touch images/"$3"

  # Version 3:
  # If all the images have unique names when _###.jpg is removed and you want all images in a single dir
  wget -q --method HEAD "$url" && touch images/"$4"

}
export -f do_url

parallel do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg
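
The replacement strings in the parallel call above pick different parts of each input line. As a rough illustration, the bash parameter expansions below produce the same pieces for a single URL. split_url is a hypothetical helper that is not part of the answer, and the sketch assumes the last path component has an extension.

```shell
#!/bin/bash
# split_url: hypothetical helper that prints, for one URL, the pieces that
# GNU Parallel's replacement strings would yield for the same input line.
# Assumption: the last path component contains an extension (e.g. foo.jpg).
split_url() {
  local url="$1"
  local base="${url##*/}"        # strip everything up to the last slash
  echo "{.}  -> ${url%.*}"       # input line without its extension
  echo "{//} -> ${url%/*}"       # directory part (dirname)
  echo "{/.} -> ${base%.*}"      # basename without its extension
  echo "{/}  -> $base"           # basename
}

split_url http://example.com/dira/foo.jpg
# {.}  -> http://example.com/dira/foo
# {//} -> http://example.com/dira
# {/.} -> foo
# {/}  -> foo.jpg
```

In the call above, {1.}{2} therefore rebuilds the full candidate URL, while {1/.}{2} gives the file name used by Version 2.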

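GNU Parallel adds a few milliseconds of overhead per job. When your jobs are this short, the overhead affects the timing. If none of your CPU cores is running at 100%, you can run more jobs in parallel: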

parallel -j0 do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg

You can also "unroll" the loop. This saves 5 overheads per URL:

do_url() {
  url="$1"            # URL without the .jpg extension (from {.})
  name="${url##*/}"   # basename only, so the file lands directly in images/
  # Version 2:
  # If all the images have unique names and you want all images in a single dir
  wget -q --method HEAD "$url".jpg && touch images/"$name".jpg
  wget -q --method HEAD "$url"_001.jpg && touch images/"$name"_001.jpg
  wget -q --method HEAD "$url"_002.jpg && touch images/"$name"_002.jpg
  wget -q --method HEAD "$url"_003.jpg && touch images/"$name"_003.jpg
  wget -q --method HEAD "$url"_004.jpg && touch images/"$name"_004.jpg
  wget -q --method HEAD "$url"_005.jpg && touch images/"$name"_005.jpg
}
export -f do_url

parallel -j0 do_url {.} :::: urls.txt

Finally, you can run more than 250 jobs: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Running-more-than-250-jobs-workaround

- Ole Tange

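Is it not possible to save all images into the images/ directory? My URLs are long, and this script creates a strange folder structure. - Adrian

Added images. - Ole Tange

I need "Version 2". It works well, thank you. I ran some benchmarks, but the speed disappointed me. If you are interested, here are the results: https://pastebin.ca/3971248. Where do you think the bottleneck is? - Adrian

With 250 jobs (-j0) the run time is now halved, but unfortunately it is still slower than wget --no-clobber (which skips the download if the file already exists). Still, this is a great answer and I will definitely use it in the future. Something is odd in the latest example: $ ls images/ _001.jpg _002.jpg _003.jpg _004.jpg _005.jpg. - Adrian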

3

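From what I can see, your question is not really about how to test with wget whether a file exists, but about how to write a correct loop in a shell script. Here is a simple solution: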

urls=( "${url%.jpg}"_{001..005}.jpg )
for url in "${urls[@]}"; do
    if wget -q --method=HEAD "$url"; then
        # Create the empty marker file under ./images, named after the URL's basename
        touch "./images/${url##*/}"
    fi
done

This code calls Wget with the --method=HEAD option. With a HEAD request, the server simply reports whether the file exists, without returning any data.
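
A likely reason the combined call in the question fails for some files: when wget is given several URLs in one invocation, its exit status reports an error if any of them failed, so the `&& touch` part only runs when all five variants exist. The sketch below checks each candidate on its own instead. mark_existing is a hypothetical helper name, and ./images as the target directory is taken from the question.

```shell
#!/bin/bash
# mark_existing: hypothetical helper. For every URL passed as an argument,
# send a HEAD request and create an empty file in ./images if it succeeds.
mark_existing() {
  local url
  mkdir -p ./images
  for url in "$@"; do
    # One wget call per URL, so each exit status belongs to exactly one file.
    if wget -q --method=HEAD "$url"; then
      touch "./images/${url##*/}"
    fi
  done
}

# Usage (not run here, it needs network access):
#   url=http://host.com/HT11019.jpg
#   mark_existing "$url" "${url%.jpg}"_{001..005}.jpg
```

With this shape, a missing _002.jpg no longer prevents the marker for _001.jpg from being created.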

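Of course, this is very inefficient for a large data set: you need to open a new connection to the server for every file you try. Instead, you can use GNU Wget2, as suggested in another answer. With wget2 you can test all of these files in parallel and use the new --stats-site option to get a list of all files together with the specific return code the server gave for each. For example: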

$ wget2 --spider --progress=none -q --stats-site example.com/{,1,2,3}                                                             
Site Statistics:

  http://example.com:
    Status    No. of docs
       404              3
         http://example.com/3  0 bytes (identity) : 0 bytes (decompressed), 238ms (transfer) : 238ms (response)
         http://example.com/1  0 bytes (gzip) : 0 bytes (decompressed), 241ms (transfer) : 241ms (response)
         http://example.com/2  0 bytes (identity) : 0 bytes (decompressed), 238ms (transfer) : 238ms (response)
       200              1
         http://example.com/  0 bytes (identity) : 0 bytes (decompressed), 231ms (transfer) : 231ms (response)

You can even print this data in CSV or JSON format to make it easier to parse.
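
For the plain-text report, a small filter can pull out the URLs listed under one status code. This is only a sketch that assumes the layout of the sample output shown above. urls_with_status is a hypothetical helper, and whether wget2 writes the report to stdout or stderr is also an assumption, so the usage line merges both.

```shell
#!/bin/bash
# urls_with_status: hypothetical helper. Reads a wget2 --stats-site text
# report on stdin and prints the URLs listed under the given HTTP status.
# Assumption: the report looks like the sample output in this answer.
urls_with_status() {
  awk -v want="$1" '
    /:[ \t]*$/ { status = ""; next }                           # header lines ending in ":"
    NF == 2 && $1 ~ /^[0-9][0-9][0-9]$/ { status = $1; next }  # status line, e.g. "404  3"
    status == want && $1 ~ /^https?:\/\// { print $1 }         # URL detail line
  '
}

# Usage (not run here, it needs wget2 and network access):
#   wget2 --spider --progress=none -q --stats-site \
#     "${url%.jpg}"_{001..005}.jpg 2>&1 | urls_with_status 200
```

Each printed URL could then be turned into a marker file with touch "./images/${u##*/}".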

- darnir

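I finally managed to compile Wget2. As a quick test I ran: wget2 --spider --progress=none --stats-site=csv:stat.csv ${url%.jpg}_{001..005}.jpg. It queries the URLs correctly (e.g. example.com/hello_001.jpg and so on), but stat.csv contains only the last query result plus what I think is the main image (example.com/hello.jpg). I need to run Wget2 one more time. - Adrian

I was expecting Wget2 to be faster than Wget & Parallel. Currently, Wget & Parallel & TouchDummyFile is slower than Wget & Parallel & DownloadFiles. The benchmark results are under @OleTange's answer. - Adrian

If the images are very small (~5 kB), parallel plus touch can be slower than simply downloading the files. That is because you still have to open a new connection to the server for every file you test and then start a new process. Sometimes that is slower than just downloading the file. In that case Wget2 should indeed be faster, since it only needs to establish the connection once. - darnir

1

The statistics problem you are seeing is a bug. I will file a report, and it should be fixed within a day or two. In the meantime, you can still see the full statistics if you do not use json or csv. - darnir

Thanks, I will report back once the bug is fixed. - Adrian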

2

Why not simply loop over the names?

for uname in "${url%.jpg}"_{001..005}.jpg
do
  # --spider checks the URL without saving the file
  if wget --spider "$uname" 2>/dev/null; then
    touch "./images/${uname##*/}"
  fi
done

- Burghard Hoffmann

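I asked this question because I do not want to download any files. I only want to check on the remote side and create a local dummy file (with the same name as the remote file). - Adrian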

-2

You can send a command over ssh to see whether the remote file exists and, if it does, cat its contents:

ssh your_host 'test -e "somefile" && cat "somefile"' > somefile

You could also try scp, which supports glob expressions and recursion.

- Cole Tierney

No, the remote host only supports HTTP. - Adrian

curl -I can tell you whether the file exists. - Cole Tierney


- AnythingIsFine · Accepted Answer

You can use curl to check whether the URLs you are parsing exist without downloading any files, for example:

if curl --head --fail --silent "$url" >/dev/null; then
    # Create the empty marker file under ./images, named after the URL's basename
    touch "./images/${url##*/}"
fi

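Explanation:

--fail makes the exit status non-zero when the request fails.
--head avoids downloading the file contents.
--silent keeps the check itself from printing status or error output.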
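
To illustrate what --fail contributes: without it, curl exits with status 0 even when the server answers 404, so the status code has to be inspected explicitly. A minimal sketch of that alternative follows. http_status and url_exists are hypothetical helper names, and treating only 200 as "exists" is an assumption (redirects would count as missing unless --location is added).

```shell
#!/bin/bash
# http_status: hypothetical helper. Prints the HTTP status code returned
# for a HEAD request to the given URL (response headers are discarded).
http_status() {
  curl --head --silent --output /dev/null --write-out '%{http_code}' "$1"
}

# url_exists: hypothetical helper. Succeeds only when the status is 200.
# Assumption: any other code, including redirects, is treated as "missing".
url_exists() {
  [ "$(http_status "$1")" = 200 ]
}

# Usage (not run here, it needs network access):
#   url_exists "$url" && touch "./images/${url##*/}"
```

The --fail form in the answer is shorter. The explicit form is mainly useful when different status codes need different handling.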

To solve the "loop" problem, you can do the following:

urls=( "${url%.jpg}"_{001..005}.jpg )
for url in "${urls[@]}"; do
    if curl --head --silent --fail "$url" > /dev/null; then
        # Create the empty marker file under ./images, named after the URL's basename
        touch "./images/${url##*/}"
    fi
done
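
To connect this back to GNU Parallel from the question, one possible arrangement is to generate the candidate URLs first and let parallel run one check per URL. This is a sketch under assumptions: candidates and check_one are hypothetical helper names, urls.txt and ./images come from the question, and the zero-padded {001..005} expansion needs bash 4 or newer.

```shell
#!/bin/bash
# candidates: hypothetical helper. Reads URLs on stdin and prints each one
# followed by its _001.jpg .. _005.jpg variants, one URL per line.
candidates() {
  local url
  while IFS= read -r url; do
    printf '%s\n' "$url" "${url%.jpg}"_{001..005}.jpg
  done
}

# check_one: hypothetical helper. Creates an empty file in ./images when a
# HEAD request for the URL succeeds. The ./images directory must exist.
check_one() {
  local url="$1"
  if curl --head --silent --fail "$url" >/dev/null; then
    touch "./images/${url##*/}"
  fi
}
export -f check_one

# Usage (not run here, it needs GNU Parallel and network access):
#   mkdir -p images
#   candidates < urls.txt | parallel -j0 check_one {}
```

Keeping URL generation separate from the check means every job tests exactly one URL, so one missing variant cannot suppress the marker file for another.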