RegEx to parse or validate Base64 data

136

Is it possible to use a RegEx to validate, or sanitize Base64 data? That's the simple question, but the factors that drive this question are what make it difficult.

I have a Base64 decoder that cannot fully rely on the input data following the RFC specs. So the issues I face are things like Base64 data that may not be broken up into 78-character lines (I think it's 78; I'd have to double check the RFC, so don't ding me if the exact number is wrong), or lines that may not end in CRLF: they may end in only a CR, or only an LF, or maybe neither.

So, I've had a hell of a time parsing Base64 data formatted as such. Due to this, examples like the following become impossible to decode reliably. I will only display partial MIME headers for brevity.

Content-Transfer-Encoding: base64

VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu

OK, so parsing that is no problem, and is exactly the result we would expect. And in 99% of the cases, using any code to at least verify that each char in the buffer is a valid base64 char, works perfectly. But, the next example throws a wrench into the mix.

Content-Transfer-Encoding: base64

http://www.stackoverflow.com
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu

This is a version of Base64 encoding that I have seen in some viruses and other things that attempt to take advantage of some mail readers' desire to parse MIME at all costs, versus ones that go strictly by the book, or rather by the RFC, if you will.

My Base64 decoder decodes the second example to the following data stream. And keep in mind here, the original stream is all ASCII data!

[0x]86DB69FFFC30C2CB5A724A2F7AB7E5A307289951A1A5CC81A5CC81CDA5B5C1B19481054D0D
2524810985CD94D8D08199BDC8814DD1858DAD3DD995C999B1BDDC8195E1B585C1B194B8

Anyone have a good way to solve both problems at once? I'm not sure it's even possible, outside of doing two transforms on the data with different rules applied, and comparing the results. However if you took that approach, which output do you trust? It seems that ASCII heuristics is about the best solution, but how much more code, execution time, and complexity would that add to something as complicated as a virus scanner, which this code is actually involved in? How would you train the heuristics engine to learn what is acceptable Base64, and what isn't?


UPDATE:

Due to the number of views this question continues to get, I've decided to post the simple RegEx that I've been using in a C# application for 3 years now, with hundreds of thousands of transactions. Honestly, I like the answer given by Gumbo the best, which is why I picked it as the selected answer. But for anyone using C# and looking for a very quick way to at least detect whether a string or byte[] contains valid Base64 data, the following RegEx patterns work very well for me.

^[-A-Za-z0-9+/=]|=[^=]|={3,}$

Or a more simplified pattern as suggested by kael:

^[-A-Za-z0-9+/]*={0,3}$

And yes, this is just for a STRING of Base64 data, NOT a properly formatted RFC1341 message. So, if you are dealing with data of this type, please take that into account before attempting to use the above RegEx. If you are dealing with Base16, Base32, Radix or even Base64 for other purposes (URLs, file names, XML Encoding, etc.), then it is highly recommended that you read RFC4648, which Gumbo mentioned in his answer, as you need to be well aware of the charset and terminators used by the implementation before attempting to use the suggestions in this question/answer set.
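
As a rough illustration of how the two patterns above behave, here is a minimal Python sketch (Python rather than C#, purely for illustration; the helper name `matches` and the sample strings other than the URL line are my own). As written, the first pattern succeeds on any string that merely starts with a Base64 character, so it also matches the URL line from the question, while kael's pattern rejects that line but still accepts strings whose length is not a multiple of four.

```python
import re

# The two patterns from the update, copied verbatim.
ORIGINAL = re.compile(r"^[-A-Za-z0-9+/=]|=[^=]|={3,}$")
SIMPLIFIED = re.compile(r"^[-A-Za-z0-9+/]*={0,3}$")


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if the pattern matches anywhere in text."""
    return pattern.search(text) is not None


samples = [
    "VGhpcw==",                      # well-formed Base64 for "This"
    "http://www.stackoverflow.com",  # the stray line from the question
    "VGhpc",                         # right alphabet, wrong length
]

for s in samples:
    print(f"{s!r}: original={matches(ORIGINAL, s)}, "
          f"simplified={matches(SIMPLIFIED, s)}")
```

Neither pattern checks the length of the string, so a real decoder can still reject input that passes both.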


1
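Why not use the standard solution in your language? Why do you need a hand-written parser based on regexes? - jfs
How can I replace non-Base64 characters with an empty string? - Sapphire
1
Great question. I tried running the UPDATE regex against a base64-encoded SHA returned by NPM and it failed, whereas the regex in the selected answer works just fine - vhs
6
Not sure why the "UPDATE" regex still hasn't been corrected, but it looks like the author meant to put the ^ outside the brackets as a start anchor. However, a better regex, without needing to be as complex as the accepted answer, would be ^[-A-Za-z0-9+/]*={0,3}$ - kael
1
Thanks @IanPringle. I just updated the whole answer to include both expressions, so readers can compare and contrast them. I could have just deleted my old pattern and replaced it with kael's, but I felt there might be some value in showing readers both patterns, and in giving kael credit for his. - undefined
10 Answers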

186

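From the RFC 4648:

Base encoding of data is used in many situations to store or transfer data in environments that, perhaps for legacy reasons, are restricted to US-ASCII data.

So whether the data should be considered dangerous depends on the purpose for which the encoded data is used.

But if you're just looking for a regular expression to match Base64 encoded words, you can use the following: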

^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$
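
A minimal Python sketch of how this pattern might be used in practice. The helper name `decode_strict` is hypothetical, and stripping whitespace before validating is an assumption about the input, not something the pattern itself requires.

```python
import base64
import re

# The strict pattern from this answer: whole groups of four characters,
# optionally followed by one padded group.
BASE64_RE = re.compile(
    r"^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$"
)


def decode_strict(text: str) -> bytes:
    """Remove all whitespace, validate against BASE64_RE, then decode.

    Raises ValueError if the remaining text is not well-formed Base64.
    """
    compact = "".join(text.split())  # drops CR, LF, spaces and tabs
    if BASE64_RE.fullmatch(compact) is None:
        raise ValueError("input is not well-formed Base64")
    return base64.b64decode(compact)


# Line breaks inside the data are tolerated because they are removed first.
print(decode_strict("VGhpcyBpcyBzaW1w\r\nbGUu"))
```

Rejecting before decoding means a stray line such as the URL in the question raises an error instead of being silently turned into bytes.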

11
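The simplest solution would be to strip out all whitespace (which is ignored as per the RFC) before validation. - Ben Blank
6
At first I was skeptical of the complexity, but it validates quite well. If you just want to match something base64-ish, I'd suggest the regex ^[a-zA-Z0-9+/]={0,3}$, but this one is better! - Lodewijk
3
That's because name is a valid Base64 encoding of the (hex) byte sequence 9d a9 9e. - Marten
4
Can I ask a question that's been bothering me? How is 'Paul' valid Base64? - The Bearded Llama
6
^(?:[A-Za-z0-9+\/]{4})*(?:[A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{4})$ The slashes have to be escaped with a backslash. - Syed Khizaruddin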

48
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$

This one is good, but it will match an empty string.

This one does not match the empty string:

^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{4})$

2
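Why is the empty string invalid? - Josh Lee
13
It is not. But if you are using a regex to find out whether a given string is base64 or not, chances are you are not interested in empty strings. At least I know I am not. - njzk2
4
If you do that, you force the Base64 string to contain at least one block of 4 characters, so valid values such as MQ== would not match your expression. - njzk2
5
@ruslan It shouldn't. That is not a valid base64 string (its length is 23, which is not a multiple of 4). AQENVg688MSGlEgdOJpjIUC= is the valid form. - njzk2
1
@JinKwon base64 ends with 0, 1 or 2 =. The final ? allows for no =. Replacing it with {1} requires 1 or 2 trailing = - njzk2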

10
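The answers presented so far fail to check that the Base64 string has all pad bits set to 0, as required for it to be the canonical representation of Base64 (which is important in some environments, see https://www.rfc-editor.org/rfc/rfc4648#section-3.5). They therefore allow aliases, that is, different encodings of the same binary string. This could be a security problem in some applications.
Here is the regexp that verifies that the given string is not just valid base64, but also the canonical base64 string for the binary data: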
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)?$

The cited RFC considers the empty string valid (see https://www.rfc-editor.org/rfc/rfc4648#section-10), so the above regex does too.
The equivalent regular expression for base64url (again, refer to the above RFC) is:
^(?:[A-Za-z0-9_-]{4})*(?:[A-Za-z0-9_-][AQgw]==|[A-Za-z0-9_-]{2}[AEIMQUYcgkosw048]=)?$
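
To illustrate the aliasing problem, here is a small Python sketch. The sample strings `QQ==` and `QR==` are my own: both use only valid characters and valid padding, but only the first has its pad bits set to zero. The round-trip helper is an assumed cross-check that I added for comparison, not part of the regex approach, and both function names are hypothetical.

```python
import base64
import binascii
import re

# Canonical pattern from this answer (standard alphabet).
CANONICAL_RE = re.compile(
    r"^(?:[A-Za-z0-9+/]{4})*"
    r"(?:[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)?$"
)


def is_canonical_regex(text: str) -> bool:
    """True if text is well-formed Base64 with all pad bits set to zero."""
    return CANONICAL_RE.fullmatch(text) is not None


def is_canonical_roundtrip(text: str) -> bool:
    """Cross-check: decoding and re-encoding must reproduce the input."""
    try:
        raw = base64.b64decode(text)
    except (binascii.Error, ValueError):
        return False
    return base64.b64encode(raw).decode("ascii") == text


for s in ("QQ==", "QR=="):
    print(s, is_canonical_regex(s), is_canonical_roundtrip(s))
```

A lenient decoder may map both strings to the same byte, which is exactly the alias the canonical pattern is meant to rule out.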

6

Here's an alternative regular expression:

^(?=(.{4})*$)[A-Za-z0-9+/]*={0,2}$

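It satisfies the following conditions:

  • The string length must be a multiple of four - (?=^(.{4})*$)
  • The content must be alphanumeric characters or + or / - [A-Za-z0-9+/]*
  • It can have up to two padding (=) characters on the end - ={0,2}
  • It accepts empty strings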
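
The lookahead is doing the job of a length check. A minimal Python sketch, assuming you would rather keep the regex simple and test the length in code; the function names and sample strings are hypothetical:

```python
import re

# Pattern from this answer: the lookahead enforces "length is a multiple of 4".
WITH_LOOKAHEAD = re.compile(r"^(?=(.{4})*$)[A-Za-z0-9+/]*={0,2}$")

# Same character rules without the lookahead.
CHARS_ONLY = re.compile(r"^[A-Za-z0-9+/]*={0,2}$")


def valid_lookahead(text: str) -> bool:
    """Validate with the single regex from this answer."""
    return WITH_LOOKAHEAD.fullmatch(text) is not None


def valid_explicit(text: str) -> bool:
    """Check the length in code instead of in the regex."""
    return len(text) % 4 == 0 and CHARS_ONLY.fullmatch(text) is not None


for s in ("", "MQ==", "TWFu", "TWF", "TWFu="):
    print(repr(s), valid_lookahead(s), valid_explicit(s))
```

For these samples the two checks agree; which form reads better is largely a matter of taste.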

5
The best regexp I could find so far is here: https://www.npmjs.com/package/base64-regex, which in the current version looks like this:
module.exports = function (opts) {
  opts = opts || {};
  var regex = '(?:[A-Za-z0-9+\/]{4}\\n?)*(?:[A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=)';

  return opts.exact ? new RegExp('(?:^' + regex + '$)') :
                    new RegExp('(?:^|\\s)' + regex, 'g');
};

Maybe better without \\n? - Jin Kwon
This will fail on JSON strings. - idleberg

5

The shortest regex to check RFC-4648 with canonical encoding enforced (i.e. all pad bits set to 0):

^(?=(.{4})*$)[A-Za-z0-9+/]*([AQgw]==|[AEIMQUYcgkosw048]=)?$

Actually, it is a mix of this and that answer.


5

Neither a ":" nor a "." will show up in valid Base64, so I think you can unambiguously throw away the http://www.stackoverflow.com line. In Perl, say, something like

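use feature 'say';
use MIME::Base64;  # provides decode_base64

# Keep only lines made up entirely of Base64 characters, then join them.
my $sanitized_str = join q{}, grep {!/[^A-Za-z0-9+\/=]/} split /\n/, $str;
say decode_base64($sanitized_str);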

might be what you want. It produces

This is simple ASCII Base64 for StackOverflow example.
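
For readers who don't use Perl, roughly the same idea can be sketched in Python. This is an illustration rather than a port: the name `drop_bad_lines` is made up, and `splitlines()` is used instead of splitting on "\n" only, so CR and CRLF line endings are handled as well.

```python
import base64
import re

# Any character outside the standard Base64 alphabet (and '=').
NON_BASE64 = re.compile(r"[^A-Za-z0-9+/=]")


def drop_bad_lines(body: str) -> str:
    """Keep only lines made up entirely of Base64 characters, then join them."""
    return "".join(
        line for line in body.splitlines() if not NON_BASE64.search(line)
    )


body = (
    "http://www.stackoverflow.com\n"
    "VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu\n"
)
print(base64.b64decode(drop_bad_lines(body)).decode("ascii"))
```

The URL line is discarded as a whole because it contains ":" and ".", so none of its letters reach the decoder.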


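I can agree there, but all the OTHER letters in the URL happen to be valid base64... So, where do you draw the line? Just at line breaks? (I've seen ones where there are just a couple of random chars in the middle of the line. Tossing the whole line because of that isn't desirable, IMHO...) - LarryF
@LarryF: unless there's integrity checking on the base-64 encoded data, you can't tell what to do with any base-64 block of data containing incorrect characters. Which is the best heuristic: ignore the incorrect characters (allowing any and all correct ones), or reject the lines, or reject the lot? - Jonathan Leffler
(continued): the short answer is "it depends" - on where the data comes from and the sorts of mess you find in it. - Jonathan Leffler
I see from the comments on the question that you want to accept anything that might be base-64. So simply map out every character that is not in your base-64 alphabet (note that there are URL-safe and other variant encodings), including newlines and colons, and take what is left. - Jonathan Leffler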
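
The "map out every character that is not in the alphabet" approach from the last comment might look like this in Python. It is a sketch only: `strip_non_base64` is a made-up name and the standard alphabet is assumed. Note what it does to the question's second example: the letters and slashes of the URL survive and are prepended to the payload, which pushes every following character out of alignment.

```python
import re

# Everything outside the standard alphabet plus '=' is removed.
NON_BASE64 = re.compile(r"[^A-Za-z0-9+/=]")


def strip_non_base64(text: str) -> str:
    """Delete every character that is not in the Base64 alphabet."""
    return NON_BASE64.sub("", text)


payload = "VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu"
cleaned = strip_non_base64("http://www.stackoverflow.com\r\n" + payload + "\r\n")

print(cleaned[:30])       # what is left of the URL, followed by the payload
print(len(cleaned) % 4)   # non-zero: the length is no longer a multiple of four
```

This appears consistent with the byte stream shown in the question, which starts with the bytes that "http//ww" decodes to rather than with the ASCII text.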

5

The regex to validate a Base64 image is as follows:

/^data:image\/(?:gif|png|jpeg|bmp|webp)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}$/

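  // Returns true only for a complete data URI of one of the listed image types.
  private validBase64Image(base64Image: string): boolean {
    const regex = /^data:image\/(?:gif|png|jpeg|bmp|webp|svg\+xml)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}$/;
    // !! keeps the return type boolean when base64Image is an empty string
    return !!base64Image && regex.test(base64Image);
  }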

1
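Thanks! The part about the meta attributes at the beginning of the base64 image string was very helpful. One suggestion: at least one MIME type, svg+xml, is missing, so the first group should probably be extended to (?:gif|png|jpeg|bmp|webp|svg\+xml) - HynekS
@HynekS. Yes. I updated my answer. Thanks :-) - Jayani Sumudini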
2
(?:[A-Za-z0-9]|[+/]) can be simplified to [A-Za-z0-9+/] - Tofandel

1

I found a solution that works very well:

^(?:([a-z0-9A-Z+\/]){4})*(?1)(?:(?1)==|(?1){2}=|(?1){3})$

It will match the following strings:

VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
YW55IGNhcm5hbCBwbGVhcw==
YW55IGNhcm5hbCBwbGVhc3U=
YW55IGNhcm5hbCBwbGVhc3Vy

While it won't match any of these invalid ones:

YW5@IGNhcm5hbCBwbGVhcw==
YW55IGNhc=5hbCBwbGVhcw==
YW55%%%%IGNhcm5hbCBwbGVhc3V
YW55IGNhcm5hbCBwbGVhc3
YW55IGNhcm5hbCBwbGVhc
YW***55IGNhcm5hbCBwbGVh=
YW55IGNhcm5hbCBwbGVhc==
YW55IGNhcm5hbCBwbGVhc===
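
The `(?1)` syntax re-runs the sub-pattern of the first capture group, which PCRE-style engines support but Python's built-in `re` module does not. A minimal sketch of an equivalent pattern for Python, built by writing the character class out by hand (the variable names are my own, and only some of the sample strings above are repeated here):

```python
import re

C = r"[a-z0-9A-Z+/]"  # what (?1) stands for in the pattern above

# Same structure with each (?1) written out explicitly.
EXPANDED = re.compile(rf"^(?:{C}{{4}})*{C}(?:{C}==|{C}{{2}}=|{C}{{3}})$")

should_match = [
    "YW55IGNhcm5hbCBwbGVhcw==",
    "YW55IGNhcm5hbCBwbGVhc3U=",
    "YW55IGNhcm5hbCBwbGVhc3Vy",
]
should_not_match = [
    "YW5@IGNhcm5hbCBwbGVhcw==",
    "YW55IGNhc=5hbCBwbGVhcw==",
    "YW55IGNhcm5hbCBwbGVhc3",
    "YW55IGNhcm5hbCBwbGVhc===",
]

for s in should_match + should_not_match:
    print(s, bool(EXPANDED.fullmatch(s)))
```

Because the final group is mandatory, this pattern also rejects the empty string.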

1

My simplified version of the Base64 regex:

^[A-Za-z0-9+/]*={0,2}$

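The simplification is that it does not check whether the length is a multiple of 4. If you need that, please use one of the other answers. Mine focuses on simplicity.

To test: https://regex101.com/r/zdtGSH/1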


Content provided by Stack Overflow.