Compression idea thingy

23rd January 2021
Uncategorised
0

Super basics of the Huffman compression algorithm (do more research if this sparks your interest). Lets say that we have a string of characters:

012102

Now lets do a simple mapping of characters to bits,

0 = 0
1 = 10
2 = 11

So the above string would end up looking like this in binary: 0101110011

Once we got that lets have a quick think about how many bits it takes to represent our string:

10 (bits) / 6 (characters) = 1.666 bits per character.

To get a bit tighter squeeze I will use a different angle on this. How many 3s can do pack into 8 bits? 5! Yes, 5. And I mean always 5 regardless if they are all 2s and 1s or just 0s. The reason for this is that 3^5 = 243 and last time I have checked 243 < 256. Let me demonstrate:

21221 = 1110111110
00000 = 00000

Both of the above will take up 8 bits using the method described above. I know it might look counter intuitive but keep in your mind that I am "compressing" the combination of characters and not touching the bits. The 21221 (5 chars) turns into 10 bits: 1110111110, however I would only need 8 bits to represent it due to the fact that it is just 3 * 3 * 3 * 3 * 3. Kinda like when you work our your binary numbers starting from 0000 till 1111 but this time you are working the numbers out instead and assigning them to a binary code (mapping)

0 * 0 * 0 * 0 * 0 = 00000000
0 * 0 * 0 * 0 * 1 = 00000001
0 * 0 * 0 * 0 * 2 = 00000010
0 * 0 * 0 * 1 * 0 = 00000011
0 * 0 * 0 * 1 * 1 = 00000100
0 * 0 * 0 * 1 * 2 = 00000101
0 * 0 * 0 * 2 * 0 = 00000110
0 * 0 * 0 * 2 * 1 = 00000111
0 * 0 * 0 * 2 * 2 = 00001000
0 * 0 * 1 * 0 * 0 = 00001001
0 * 0 * 1 * 0 * 1 = 00001010
0 * 0 * 1 * 0 * 2 = 00001011
.
.
.
2 * 2 * 2 * 2 * 2 = 11110010

So you can always pack 5 characters into 8 bits... 8 / 5 = 1.6 bits per character
This method allows for a rather simple and efficient way of processing input data as it really does not have to create a mapping table (I am working with totally random numbers here). The only question that I had was: Can you push it further? And ohh hell yeah you can. I wrote myself a simple PHP script that basically has two loops. One it does ^ 2 of the 256 and the second one starts from 3 and does ^ 3. The loop is broken when I find at which point you can squeeze the first column into the other. Same thing that I did with the first example. 256 (2 ^ 8) can accommodate 243 (3 ^ 5). I run it (5 loops max) and here are the results:

256 (8), 243 (5), 1.6000 (our first example right here)
512 (9), 243 (5), 1.8000
1024 (10), 729 (6), 1.6667
2048 (11), 729 (6), 1.8333
4096 (12), 2187 (7), 1.7143

As you can tell the 3^5 has the best bits per character ratio. However this was only set to run 5 loops... Now lets do... 100k with a filter that will only show lines that are below the 1.6 ratio and see what we get:

134217728 (27), 129140163 (17), 1.5882
34359738368 (35), 31381059609 (22), 1.5909
8796093022208 (43), 7625597484987 (27), 1.5926
70368744177664 (46), 68630377364883 (29), 1.5862
2251799813685248 (51), 1853020188851841 (32), 1.5938
18014398509481984 (54), 16677181699666569 (34), 1.5882
576460752303423488 (59), 450283905890997363 (37), 1.5946
4611686018427387904 (62), 4.052555153019E+18 (39), 1.5897

And... these are not all of the results. I think I need to narrow it down with a "display criteria"...

1.3803492693581E+70 (233), 1.3703277223523E+70 (147), 1.5850
2.6699837949011E+95 (317), 2.6561398887587E+95 (200), 1.5850
9.8505015490986E+114 (382), 9.6877380539957E+114 (241), 1.5851
5.1644997561738E+120 (401), 5.1484611991535E+120 (253), 1.5850
1.9053641054175E+140 (466), 1.8777980666473E+140 (294), 1.5850
9.9895953610112E+145 (485), 9.9793888233711E+145 (306), 1.5850
7.0295528039737E+159 (531), 6.8488922081885E+159 (335), 1.5851
3.6855101804898E+165 (550), 3.6397821240119E+165 (347), 1.5850
1.3597132616109E+185 (615), 1.3275376022278E+185 (388), 1.5851
7.1288134650347E+190 (634), 7.0550791086553E+190 (400), 1.5850
5.0164565101131E+204 (680), 4.8419274156612E+204 (429), 1.5851
2.6300679507742E+210 (699), 2.5731987477064E+210 (441), 1.5850
1.3789130657755E+216 (718), 1.3675033156798E+216 (453), 1.5850
9.7032380768794E+229 (764), 9.3852268602836E+229 (482), 1.5851
5.087291284851E+235 (783), 4.987694347856E+235 (494), 1.5850
2.6672057731519E+241 (802), 2.6506652719189E+241 (506), 1.5850
3.5798629898094E+249 (829), 3.4230734527405E+249 (523), 1.5851
1.8768792072012E+255 (848), 1.8191615787979E+255 (535), 1.5850
9.8402524578509E+260 (867), 9.6677704859791E+260 (547), 1.5850
5.1591262806217E+266 (886), 5.1378496148392E+266 (559), 1.5850
6.9244620785014E+274 (913), 6.6350273672983E+274 (576), 1.5851
3.6304123742133E+280 (932), 3.5261255791044E+280 (588), 1.5850
1.9033816428516E+286 (951), 1.8739277038848E+286 (600), 1.5850
9.9792015476736E+291 (970), 9.9588201288024E+291 (612), 1.5850
2.5546755962044E+294 (978), 2.419993291299E+294 (617), 1.5851
1.3393857589828E+300 (997), 1.2860836547212E+300 (629), 1.5851
7.0222388080559E+305 (1016), 6.834775835487E+305 (641), 1.5850

It does seem like I have found the sweet spot at 1.5850 bits per character. The first entry that manages to do that is:

1.3803492693581E+70 (233), 1.3703277223523E+70 (147), 1.5850

which means that 233 (2s) can contain 147 (3s). I have not tried this "method" for 5s, 6s or 7s... Will see if that will give me any interesting results.

FYI: I am not a mathematician nor a scientist. I just found this to spark my curiosity.

Hey, like this? Why not share it with a buddy?