pingr::ping()
只使用被合理组织网络封锁的ICMP,因为攻击者曾经利用ICMP来窃取数据并与命令和控制服务器通信。
pingr::ping_port()
不使用HTTP Host:
头,因此IP地址可能会响应,但目标虚拟Web主机可能不在其上运行,并且它绝对不会验证路径是否存在于目标URL中。
当只有非200:299范围的HTTP状态代码时,您应该澄清希望发生什么。以下是一种假设。
注意:您以亚马逊作为例子,我希望那只是您脑海中首先想到的网站,因为爬取亚马逊是不道德的,也是犯罪的。如果您实际上只是一个厚颜无耻的内容盗窃者,请不要将我的代码引入您的世界。如果您正在窃取内容,则不太可能在这里诚实地表达自己,但外部机会是您同时在窃取和有良心,请告诉我,以便我可以删除此答案,以免其他内容窃贼使用它。
这里是一个自包含的检查URL函数:
url_exists <- function(x, non_2xx_return_value = FALSE, quiet = FALSE,...) {
suppressPackageStartupMessages({
require("httr", quietly = FALSE, warn.conflicts = FALSE)
})
capture_error <- function(code, otherwise = NULL, quiet = TRUE) {
tryCatch(
list(result = code, error = NULL),
error = function(e) {
if (!quiet)
message("Error: ", e$message)
list(result = otherwise, error = e)
},
interrupt = function(e) {
stop("Terminated by user", call. = FALSE)
}
)
}
safely <- function(.f, otherwise = NULL, quiet = TRUE) {
function(...) capture_error(.f(...), otherwise, quiet)
}
sHEAD <- safely(httr::HEAD)
sGET <- safely(httr::GET)
res <- sHEAD(x, ...)
if (is.null(res$result) ||
((httr::status_code(res$result) %/% 200) != 1)) {
res <- sGET(x, ...)
if (is.null(res$result)) return(NA)
if (((httr::status_code(res$result) %/% 200) != 1)) {
if (!quiet) warning(sprintf("Requests for [%s] responded but without an HTTP status code in the 200-299 range", x))
return(non_2xx_return_value)
}
return(TRUE)
} else {
return(TRUE)
}
}
试一试:
c(
"http://content.thief/",
"http://rud.is/this/path/does/not_exist",
"https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=content+theft",
"https://www.google.com/search?num=100&source=hp&ei=xGzMW5TZK6G8ggegv5_QAw&q=don%27t+be+a+content+thief&btnK=Google+Search&oq=don%27t+be+a+content+thief&gs_l=psy-ab.3...934.6243..7114...2.0..0.134.2747.26j6....2..0....1..gws-wiz.....0..0j35i39j0i131j0i20i264j0i131i20i264j0i22i30j0i22i10i30j33i22i29i30j33i160.mY7wCTYy-v0",
"https://rud.is/b/2018/10/10/geojson-version-of-cbc-quebec-ridings-hex-cartograms-with-example-usage-in-r/"
) -> some_urls
data.frame(
exists = sapply(some_urls, url_exists, USE.NAMES = FALSE),
some_urls,
stringsAsFactors = FALSE
) %>% dplyr::tbl_df() %>% print()
## A tibble: 5 x 2
## exists some_urls
## <lgl> <chr>
## 1 NA http://content.thief/
## 2 FALSE http://rud.is/this/path/does/not_exist
## 3 TRUE https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=con…
## 4 TRUE https://www.google.com/search?num=100&source=hp&ei=xGzMW5TZK6G8ggegv5_QAw&q=don%27t…
## 5 TRUE https://rud.is/b/2018/10/10/geojson-version-of-cbc-quebec-ridings-hex-cartograms-wi…
## Warning message:
## In FUN(X[[i]], ...) :
## Requests for [http://rud.is/this/path/does/not_exist] responded but without an HTTP status code in the 200-299 range
Error in UseMethod("status_code") : no applicable method for 'status_code' applied to an object of class "NULL"
由于某种原因,在if (is.null(res))语句之前,代码仍然会出现错误。 - J. Doe