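Escaping the ^ character in a URL
I need to scrape some link data from web pages, and I'm using a simple hand-written PHP crawler. The crawler usually works fine, but when I hit a URL like this one: http://www.example.com/example.asp?x7=3^^^^^select%20col1,col2%20from%20table%20where%20recordid%3E=20^^^^^ , my crawler cannot retrieve the page and fails with an "HTTP request failed" error.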
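The ^ character should be encoded; see RFC 1738, Uniform Resource Locators (URL):
Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".
All unsafe characters must always be encoded within a URL.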
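As a rough illustration of the rule quoted above, here is a minimal sketch in Python (the asker's crawler is PHP, but the idea carries over). The helper name encode_unsafe is hypothetical, and the sketch assumes the rest of the URL is already correctly encoded, so it touches only the nine characters listed in the RFC 1738 passage:

```python
# The nine characters named in the RFC 1738 passage quoted above.
UNSAFE = '{}|\\^~[]`'

def encode_unsafe(url):
    """Percent-encode only the listed characters; leave everything else untouched.

    Hypothetical helper for illustration. All listed characters are ASCII,
    so the code point equals the byte value used in the %XX escape.
    """
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in UNSAFE else ch
        for ch in url
    )

url = ("http://www.example.com/example.asp?x7=3^^^^^select%20col1,col2%20from"
       "%20table%20where%20recordid%3E=20^^^^^")
print(encode_unsafe(url))
```

Existing escapes such as %20 pass through unchanged because "%" is not in the list, so the URL is not double-encoded.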
You could try URL-encoding the ^ character.
Judging from the context, I'd guess the carets are a homegrown attempt at URL-encoding quotation marks.
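The caret (^) is not a reserved character in URLs, so using it as-is *should* be acceptable. If you do run into problems, though, just replace it with its hexadecimal encoding, %5E.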
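To see where %5E comes from, a small Python check may help (Python is used here only for illustration; the asker's crawler is PHP). It relies on the standard library's urllib.parse.quote and unquote, and the variable names are arbitrary:

```python
from urllib.parse import quote, unquote

caret_hex = format(ord("^"), "X")  # code point of "^" written in hexadecimal
encoded = quote("^")               # percent-encoding via the standard library
decoded = unquote("%5E")           # decoding the escape gives the caret back

print(caret_hex, encoded, decoded)  # prints: 5E %5E ^
```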
Also, putting raw SQL in a URL is like putting up a big flashing neon sign that says "Please exploit me!"
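The caret is neither a reserved character nor an "unreserved" character, which puts it among the "unsafe" characters in URLs. It should therefore be encoded wherever it appears in a URL. From RFC 2396: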
2.2. Reserved Characters
Many URI include components consisting of or delimited by, certain
special characters. These characters are called "reserved", since
their usage within the URI component is limited to their reserved
purpose. If the data for a URI component would conflict with the
reserved purpose, then the conflicting data must be escaped before
forming the URI.
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
"$" | ","
The "reserved" syntax class above refers to those characters that are
allowed within a URI, but which may not be allowed within a
particular component of the generic URI syntax; they are used as
delimiters of the components described in Section 3.
Characters in the "reserved" set are not reserved in all contexts.
The set of characters actually reserved within any given URI
component is defined by that component. In general, a character is
reserved if the semantics of the URI changes if the character is
replaced with its escaped US-ASCII encoding.
2.3. Unreserved Characters
Data characters that are allowed in a URI but do not have a reserved
purpose are called unreserved. These include upper and lower case
letters, decimal digits, and a limited set of punctuation marks and
symbols.
unreserved = alphanum | mark
mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
Unreserved characters can be escaped without changing the semantics
of the URI, but this should not be done unless the URI is being used
in a context that does not allow the unescaped character to appear.
2.4. Escape Sequences
Data must be escaped if it does not have a representation using an
unreserved character; this includes data that does not correspond to
a printable character of the US-ASCII coded character set, or that
corresponds to any US-ASCII character that is disallowed, as
explained below.
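The character classes in the excerpt above can be checked mechanically. The following Python sketch (function and constant names are hypothetical) transcribes the "reserved" and "unreserved" sets from sections 2.2 and 2.3 and reports which class a character falls into; anything in neither set, such as the caret, is the kind of data that section 2.4 says must be escaped:

```python
import string

# Transcribed from the RFC 2396 grammar quoted above.
RESERVED = set(";/?:@&=+$,")
MARK = set("-_.!~*'()")
UNRESERVED = set(string.ascii_letters + string.digits) | MARK

def classify(ch):
    """Return which RFC 2396 class a single character belongs to.

    Simplification: "%" also lands in "neither" here, although in a real
    URI it is the escape indicator rather than ordinary data.
    """
    if ch in RESERVED:
        return "reserved"
    if ch in UNRESERVED:
        return "unreserved"
    return "neither"

for ch in "^/a":
    print(repr(ch), classify(ch))
```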