What does the ^ character mean in a URL?

6
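What does the ^ character stand for in a URL? Is it some kind of escape character? I need to crawl some link data from a web page, and I use a simple handwritten PHP crawler. The crawler usually works fine, but when it hits a URL like this one: http://www.example.com/example.asp?x7=3^^^^^select%20col1,col2%20from%20table%20where%20recordid%3E=20^^^^^ it cannot retrieve the page and fails with an "HTTP request failed" error.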

22
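Interesting SQL in that query string, and something about Little Bobby Tables. - John Boker
1
Do you think the SQL statement between the ^ characters gets executed directly? - KJ Saxena
2
I don't know, do you? If I were passing data through a query string, I would pass only parameters and then build a parameterized query or use a stored procedure. - John Boker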
1
Someone hasn't learned about SQL injection yet. - JL.
5 Answers

9

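The ^ character should be encoded; see RFC 1738, Uniform Resource Locators (URL):

Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".

All unsafe characters must always be encoded within a URL.

You could try URL-encoding the ^ character.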
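
As a rough illustration of that suggestion, here is a minimal Python sketch (the asker's crawler is PHP, so this only shows the idea). The helper name `encode_unsafe` and the `KEEP` set are assumptions made for this example. It percent-encodes characters such as ^ while leaving URL delimiters and already-present %XX escapes alone.

```python
from urllib.parse import quote

# Characters left untouched: URL delimiters plus "%", so that escapes
# already present in the URL (e.g. %20) are not encoded a second time.
KEEP = ":/?#[]@!$&'()*+,;=%"

def encode_unsafe(url: str) -> str:
    """Percent-encode everything outside KEEP and the always-safe set
    (letters, digits, "-", "_", ".", "~"). Hypothetical helper."""
    return quote(url, safe=KEEP)

url = ("http://www.example.com/example.asp?x7=3^^^^^select%20col1,col2"
       "%20from%20table%20where%20recordid%3E=20^^^^^")
encoded = encode_unsafe(url)
print(encoded)
```

Each ^ becomes %5E, and the rest of the URL is unchanged. Whether the server then answers as expected depends on how it decodes the query string.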


2
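Never offer advice that shouldn't be followed. The asker might take you seriously. He should contact the site owner to point out their folly, but not drop the tables. - SamGoody
Just kidding, +1 because the RFC really does tell us to encode our ^. - Bruno Brant
1
@samgoody: I removed the part of my answer that was meant to be funny. - Brian R. Bondy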

7

Judging by the context, I'd guess they are a homebrew attempt at URL-encoding quotation marks.


6

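The caret (^) is not a reserved character in URLs, so using it as-is should be acceptable. If you run into problems, though, just replace it with its hex encoding, %5E.

Also, putting raw SQL in a URL is like putting up a big flashing neon sign that says "Please exploit me!".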
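
For contrast with raw SQL in the URL, here is a minimal Python sketch of the parameter-only approach mentioned in the comments on the question. Everything specific in it is an assumption for illustration: the `min_id` query parameter, the `records` table, and the in-memory SQLite database. The site in question appears to be ASP with an unknown backend.

```python
import sqlite3
from urllib.parse import urlsplit, parse_qs

def fetch_records(conn, url):
    # Read only a plain value from the query string.
    # "min_id" is a hypothetical parameter name.
    params = parse_qs(urlsplit(url).query)
    min_id = int(params["min_id"][0])  # raises ValueError on non-numeric input
    # The SQL text is fixed; the driver binds the value, so nothing from
    # the URL is ever spliced into the statement.
    cur = conn.execute(
        "SELECT col1, col2 FROM records WHERE recordid >= ? ORDER BY recordid",
        (min_id,),
    )
    return cur.fetchall()

# Throwaway in-memory database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (recordid INTEGER, col1 TEXT, col2 TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [(10, "a", "b"), (20, "c", "d"), (30, "e", "f")],
)
rows = fetch_records(conn, "http://www.example.com/example.asp?min_id=20")
print(rows)
```

With this shape the URL carries a single number, and the column and table names never leave the server.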


2
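It isn't reserved, but it isn't "unreserved" either, which means it "must be escaped" according to section 2.4 of RFC 2396. - Laurence Gonsalves
1
RFC 2396 has been superseded by 3986, though the point that it isn't "unreserved" still applies. - Anon.
1
@Anon is right, but 3986 scatters the relevant information so much that I couldn't find a good single place to quote. RFC 2396 is still basically correct in practice. - Laurence Gonsalves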

4

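The caret is neither a reserved character nor an "unreserved" one, which makes it an "unsafe" character in URLs. It should therefore be encoded when it appears in a URL. From RFC 2396: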

2.2. Reserved Characters

   Many URI include components consisting of or delimited by, certain
   special characters.  These characters are called "reserved", since
   their usage within the URI component is limited to their reserved
   purpose.  If the data for a URI component would conflict with the
   reserved purpose, then the conflicting data must be escaped before
   forming the URI.

      reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                    "$" | ","

   The "reserved" syntax class above refers to those characters that are
   allowed within a URI, but which may not be allowed within a
   particular component of the generic URI syntax; they are used as
   delimiters of the components described in Section 3.

   Characters in the "reserved" set are not reserved in all contexts.
   The set of characters actually reserved within any given URI
   component is defined by that component. In general, a character is
   reserved if the semantics of the URI changes if the character is
   replaced with its escaped US-ASCII encoding.

2.3. Unreserved Characters

   Data characters that are allowed in a URI but do not have a reserved
   purpose are called unreserved.  These include upper and lower case
   letters, decimal digits, and a limited set of punctuation marks and
   symbols.

      unreserved  = alphanum | mark

      mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

   Unreserved characters can be escaped without changing the semantics
   of the URI, but this should not be done unless the URI is being used
   in a context that does not allow the unescaped character to appear.

2.4. Escape Sequences

   Data must be escaped if it does not have a representation using an
   unreserved character; this includes data that does not correspond to
   a printable character of the US-ASCII coded character set, or that
   corresponds to any US-ASCII character that is disallowed, as
   explained below.

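I should add that RFC 3986 supersedes 2396, but 3986 spreads the relevant information so widely that I couldn't find a good single place to quote. For all practical purposes, RFC 2396 is still mostly correct, and it is generally easier to understand. - Laurence Gonsalves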
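
The classification quoted above can be sketched in a few lines of Python. The function name `classify` is made up for this example, and the sets are copied from the RFC 2396 grammar in sections 2.2 and 2.3. It is a simplification: "%", for instance, is reported as needing an escape even though it is itself the escape indicator.

```python
import string

# Sets taken from the RFC 2396 excerpts quoted above.
RESERVED = set(";/?:@&=+$,")
MARK = set("-_.!~*'()")
UNRESERVED = set(string.ascii_letters + string.digits) | MARK

def classify(ch: str) -> str:
    """Classify a single character per RFC 2396 (hypothetical helper)."""
    if ch in RESERVED:
        return "reserved"
    if ch in UNRESERVED:
        return "unreserved"
    # Neither reserved nor unreserved: section 2.4 says it must be escaped.
    return "must be escaped"

for ch in "?~^":
    print(repr(ch), classify(ch), "%{:02X}".format(ord(ch)))
```

For the caret this reports "must be escaped", with %5E as the escape sequence.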

0
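The crawler may be using a regular expression to parse URLs, and it fails because the caret (^) denotes the start of a line. I think these URLs are really bad practice, since they expose the underlying database structure; whoever wrote this code might want to consider some serious refactoring! Hope this helps!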
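
If the crawler really does build a regular expression out of URL text (a guess, since its code is not shown), the effect would look roughly like this Python sketch. The variable names are illustrative only.

```python
import re

url = "http://www.example.com/example.asp?x7=3^^^^^select%20col1"
needle = "x7=3^^^^^select"

# Used as a pattern verbatim, each "^" is a start-of-string anchor,
# which cannot match in the middle of the text.
naive = re.search(needle, url)

# re.escape() turns "^" into a literal character.
escaped = re.search(re.escape(needle), url)

print(naive, escaped.group(0))
```

The unescaped pattern finds nothing, while the escaped one matches the literal text.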
