Lately I have been thinking about how compression might be implemented. My current hypothesis is that it uses a hash table with "byte signature" keys and memory-location values, and that this "byte signature" gets substituted back in when the compressed item is expanded.
How far off from the truth is that?
How is compression usually implemented? No need for a page-long answer, just a short overview.
Compression algorithms try to find repeated subsequences and replace them with a shorter representation.
Let's take the 25-byte (200-bit) string Blah blah blah blah blah! from An Explanation of the Deflate Algorithm as an example.
A naive approach would be to encode every character with a codeword of equal length. Since there are 7 distinct characters, we need codes of length ceil(ld(7)) = 3. Our codewords could look like this:
000 → "B"
001 → "l"
010 → "a"
011 → "h"
100 → " "
101 → "b"
110 → "!"
111 → not used
Now we can encode the string as follows:
000 001 010 011 100 101 001 010 011 100 101 001 010 011 100 101 001 010 011 100 101 001 010 011 110
 B   l   a   h   _   b   l   a   h   _   b   l   a   h   _   b   l   a   h   _   b   l   a   h   !
This needs only 25 × 3 bit = 75 bit to encode, plus 7 × 8 bit = 56 bit for the dictionary, so 131 bit in total (65.5%).
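The fixed-length scheme above can be sketched in a few lines of Python (which character gets which codeword is arbitrary, so the concrete codes may differ from the table above, but the sizes come out the same):

```python
import math

text = "Blah blah blah blah blah!"
alphabet = sorted(set(text))                  # 7 distinct characters
width = math.ceil(math.log2(len(alphabet)))   # ceil(ld(7)) = 3

# Assign every character a fixed-width codeword; only the width matters.
codes = {ch: format(i, f"0{width}b") for i, ch in enumerate(alphabet)}

encoded = "".join(codes[ch] for ch in text)
print(len(encoded))                      # 25 symbols x 3 bit = 75
print(len(encoded) + 8 * len(alphabet))  # + 7-byte dictionary = 131
```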
Or, for sequences:
00 → "lah b"
01 → "B"
10 → "lah!"
11 → not used
The encoded word: 01 00 00 00 00 10
B lah b lah b lah b lah b lah!
Now we only need 6 × 2 bit = 12 bit for the encoded word, plus 10 × 8 bit = 80 bit for the dictionary and 3 × 8 bit = 24 bit for the length of each word, so 116 bit in total (58.0%).
Huffman coding assigns short codes to characters/substrings that occur often and longer codes to those that occur rarely:
5 × "l", "a", "h"
4 × " ", "b"
1 × "B", "!"
// or for sequences
4 × "lah b"
1 × "B", "lah!"
A possible Huffman coding looks like this:
0 → "l"
10 → "a"
110 → "h"
1110 → " "
11110 → "b"
111110 → "B"
111111 → "!"
Or for the sequences:
0 → "lah b"
10 → "B"
11 → "lah!"
Now our Blah blah blah blah blah! can be encoded as:
111110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 111111
B l a h _ b l a h _ b l a h _ b l a h _ b l a h !
Or for the sequences:
10 0 0 0 0 11
B lah b lah b lah b lah b lah!
Now our first code needs only 78 bit, or 8 bit for the sequence version, instead of the 25 × 8 = 200 bit of our original string. But we still have to add the dictionary containing the characters/sequences. For the single-character example we need an additional 7 byte (7 × 8 bit = 56 bit), and for the sequence example the same 7 byte plus 3 bit for the sequence lengths (hence 59 bit). That results in:
56 + 78 = 134 bit (67.0%)
59 + 8 = 67 bit (33.5%)
These exact numbers may be slightly off. Feel free to edit/correct.
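For completeness, here is a minimal Huffman construction using Python's heapq (a hypothetical sketch, not code from the answer above). Note that the table above is just one possible prefix code; an actual Huffman tree built from these character frequencies encodes the text in 67 bit:

```python
import heapq
from collections import Counter

text = "Blah blah blah blah blah!"
freq = Counter(text)   # l/a/h: 5 each, ' '/b: 4 each, B/!: 1 each

# Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
heap = [(f, i, {ch: ""}) for i, (ch, f) in enumerate(freq.items())]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    f1, _, c1 = heapq.heappop(heap)
    f2, _, c2 = heapq.heappop(heap)
    # Prefix the two subtrees' codes with 0 and 1, then merge them.
    merged = {ch: "0" + code for ch, code in c1.items()}
    merged.update({ch: "1" + code for ch, code in c2.items()})
    heapq.heappush(heap, (f1 + f2, counter, merged))
    counter += 1

codes = heap[0][2]
encoded = "".join(codes[ch] for ch in text)
print(len(encoded))   # → 67 (optimal prefix code for this string)
```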
Have a look at this wiki page...
Lossless compression algorithms usually exploit statistical redundancy in such a way as to represent the sender's data more concisely without error. Lossless compression is possible because most real-world data has statistical redundancy. For example, in English text, the letter 'e' is much more common than the letter 'z', and the probability that the letter 'q' will be followed by the letter 'z' is very small.
Another kind of compression, called lossy data compression or perceptual coding, is possible if some loss of fidelity is acceptable. Generally, a lossy data compression will be guided by research on how people perceive the data in question. For example, the human eye is more sensitive to subtle variations in luminance than it is to variations in color. JPEG image compression works in part by "rounding off" some of this less-important information. Lossy data compression provides a way to obtain the best fidelity for a given amount of compression. In some cases, transparent (unnoticeable) compression is desired; in other cases, fidelity is sacrificed to reduce the amount of data as much as possible.
Lossless compression schemes are reversible so that the original data can be reconstructed, while lossy schemes accept some loss of data in order to achieve higher compression.
However, lossless data compression algorithms will always fail to compress some files; indeed, any compression algorithm will necessarily fail to compress any data containing no discernible patterns. Attempts to compress data that has been compressed already will therefore usually result in an expansion (although text files can sometimes be compressed further after one pass, due to fewer distinct symbols), as will attempts to compress all but the most trivially encrypted data.
In practice, lossy data compression will also come to a point where compressing again does not work, although an extremely lossy algorithm, like for example always removing the last byte of a file, will always compress a file up to the point where it is empty.
An example of lossless vs. lossy compression is the following string:
25.888888888
This string can be compressed as:
25.[9]8
Interpreted as, "twenty five point 9 eights", the original string is perfectly recreated, just written in a smaller form. In a lossy system, using
26
instead, the original data is lost, at the benefit of a smaller file size.
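The lossless half of this example is essentially run-length encoding. A tiny sketch (the bracket notation is made up here to mirror the "25.[9]8" form, and assumes the input contains no brackets):

```python
import re

def rle_compress(s: str) -> str:
    # Replace any run of 3+ identical characters with [count]character.
    return re.sub(r"(.)\1{2,}", lambda m: f"[{len(m.group(0))}]{m.group(1)}", s)

def rle_decompress(s: str) -> str:
    # Expand [count]character back into the original run.
    return re.sub(r"\[(\d+)\](.)", lambda m: m.group(2) * int(m.group(1)), s)

print(rle_compress("25.888888888"))                  # → 25.[9]8
print(rle_decompress(rle_compress("25.888888888")))  # → 25.888888888
```

The round trip reproduces the input exactly, which is what makes the scheme lossless; rounding to 26 instead cannot be undone.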
"Monday Night","Baseball","7:00pm"
"Tuesday Night","Baseball","7:00pm"
"Monday Night","Softball","8:00pm"
"Monday Night","Softball","8:00pm"
"Monday Night","Baseball","5:00pm"
That is only roughly 150 characters, but if you make a simple substitution along these lines:
A="Monday Night", B="Tuesday Night", C="Baseball", D="Softball", E="7:00pm", F="8:00pm", G="5:00pm"
then the same content can be encoded as:
A,C,E
B,C,E
A,D,F
A,D,F
A,C,G
using 25 characters! A clever observer could also see how to reduce this further to 15 characters, if we assume some additional information about the file format. Obviously there is the overhead of the substitution key, but very large files tend to contain a lot of such substitutions. This can be a very efficient way to compress large files or data structures while still keeping them "somewhat" readable.
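As a rough sketch of that substitution scheme (the table and rows are taken from the example above; the quoting rules are simplified and assume no commas inside fields):

```python
# Substitution table from the example above.
table = {
    "Monday Night": "A", "Tuesday Night": "B",
    "Baseball": "C", "Softball": "D",
    "7:00pm": "E", "8:00pm": "F", "5:00pm": "G",
}

rows = [
    '"Monday Night","Baseball","7:00pm"',
    '"Tuesday Night","Baseball","7:00pm"',
    '"Monday Night","Softball","8:00pm"',
    '"Monday Night","Softball","8:00pm"',
    '"Monday Night","Baseball","5:00pm"',
]

def compress_row(row: str) -> str:
    # Strip the quotes and replace each field with its one-letter key.
    fields = [f.strip('"') for f in row.split(",")]
    return ",".join(table[f] for f in fields)

compressed = [compress_row(r) for r in rows]
print("\n".join(compressed))
# A,C,E
# B,C,E
# A,D,F
# A,D,F
# A,C,G
```

Inverting the table gives the decompressor, which is what makes the file still "somewhat" readable: the structure survives, only the repeated strings are shortened.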