decomp bomb notes


Header
 GZIP magic numbers
  |     Compression method (deflate)
  |      |   Flags (FNAME,FCOMMENT)
  |      |    |       Timestamp (unix epoch)
  |      |    |        |        XFL.  Unused for us.
  |      |    |        |         |   Creating OS (FF=unknown)
  |      |    |        |         |    |
  v      v    v        v         v    v
1F 8B | 08 | 18 | 00 00 00 00 | 00 | FF

More header
 Original file name (zero-terminated, see FNAME)
  |         File comment, zero-terminated (see FCOMMENT)
  |          |
  v          v
"bomb\0" | "comment\0"

Deflated stream begins here
 Uncompressed, final block (woop, this one's binary)
  |    Length of literal data block [LEN] (1 byte)
  |     |      One's complement of LEN [NLEN]
  |     |       |     LEN bytes of garbage
  |     |       |      |
  v     v       v      v
100 | 01 00 | FE FF | 00 |
----+-------+-------+----+
010100FEFF00


First compressed block
 Final block made with fixed Huffman codes (binary again)
  |      Fixed huffman for 285 or 'length code 258'
  |       |     Distance code, 1 byte back
  |       |      |   More like the previous two
  |       |      |    |      End of block
  |       |      |    |       |
  v       v      v    v       v
101 | 11000101 | 0 | ... | 0000000

File tail:
 CRC32 of uncompressed data [python -c"print hex(int(\"`cksum $(FILE)`\".split()[0]))"]
  |          |
  v          v
 CRC32 | FILESIZE

So, a somewhat minimal archive:
1F8B        Magic numbers for a block
08          Compressed with DEFLATE
08          Contains original file name only
00000000    Timestamp
00          No extended flags
03          Created on NTFS
"LittleBoy" Original file name
==Now is the deflated stream==
01          Terminal block, it's uncompressed (elements are packed into bytes
            from LSB to MSB and type 0 aligns the data to a byte boundary)
0100        LEN: Size of block (little-endian)
FEFF        ~LEN, integrity check apparently
00          Single byte of data
==Now the tail on the stream==
48E23EFB    CRC32 of the block (little-endian)
01000000    Uncompressed size of block (little-endian)

Now for something a little more interesting:
1F8B08080000000000034100    Member header as above, name is "A"
==Time for a Huffman block, given as the bitfields to stream==
1                           Final block in the stream
01                          Static Huffman
00110000                    -> literal 0
11000101                    Literal length 285 -> 258
00000                       Literal distance 0 -> 1
0000000                     Literal 256- end of block
==Checksum/tail==
9bbdd589                    0x89d5bd9b
03010000                    0x0103

..so we can pack that into a deflate stream, as RFC 1951 says:
    "print out the compressed data as a sequence of bytes, starting with the
     first byte at the right margin and proceeding to the left with the
     most-significant bit of each byte on the left as usual, (such that) one
     would be able to parse the result from right to left, with fixed-width
     elements in the correct MSB-to-LSB order and Huffman codes in bit-reversed
     order (i.e., with the first bit of the code in relative LSB position)."

So, here's the stream packed RTL, then split into bytes and made into the byte
stream.
+----------------------------------------------------------------------+
|                       /--------+- Huffman codes are packed MSB-first |
|                       |        |                                     |
|                       v        v                                     |
|  0000000 | 00000 | 10100011 | 00001100 | 01 | 1                      |
|  \     /   \         /\         /\            /                      |
|   \___/     \_______/  \_______/  \__________/                       |
|     |           |          |            |                            |
|     |           \__________|________    |                            |
|     |               _______|________\___/                            |
|     \______________/_______|________|_____________                   |
|                    |       |        |             \                  |
|                    |       |        |             |                  |
|                    v       v        v             v                  |
|              01100011 | 00011000 | 00000101 | 0000000                |
+----------------------------------------------------------------------+
That works out to be 63 18 05 00 with the MSB of the last byte unused.  Throw
in the CRC and final size, then..

==Block tail==
  9    b    b    d    d    5    8    9    0    3    0    1    0    0    0    0
1001 1011 1011 1101 1101 0101 1000 1001 0000 0011 0000 0001 0000 0000 0000 0000
The GZIP block tail is always byte-aligned, so this is pretty easy.

So the final byte stream works out to be...
63180500 9bbdd58903010000

=====A proper bomb=====
Now that we've handmade a working gzip file, let's make it explode.  The simple
way to do that would be simply repeating the 258-repeat sequence used above,
but that's not very good, since it yields only a 2064:13 compression ratio.  To
reach our target of 16 TB, that would require a gzip file weighing in > 100 GB.
While maybe not exactly suspicious, it's far from useful.  Looking back at RFC
1951, we can encode our own Huffman tree in the stream to hopefully improve the
situation.

==Constructing a tree==
We use a total of four tokens in the tree (parenthesized values are the actual
value contained- recall that the byte value and backreference length alphabets
are joined into the range 0-285):
    literal 0 (0), end-of-block (256), length code 258 (285), distance code 1
We'll call those 0, S (stop), L (length), and B (backreference), respectively
for some clarity.  Our stream will look something like this:

0LBLB....LBLBLBS

So knowing that, it's pretty easy to generate Huffman codes for each of these.
We know that symbols 0 and S will only appear once in the stream each (at the
beginning and end, respectively), while L and B will have equal frequency
with the exact value approaching infinity as the archive becomes larger.  With
that in mind, we can generate Huffman tree to give optimal coding.  Simply by
reasoning through it, L and B must be very short codes, while 0 and S could be
nearly anything, since they act only as constant additions to the stream.

 Symbol | Appearances |      /\
 ------ | ----------- |     0  1
     0  |          1  |    /    \
     L  |          n  |   L     /\
     B  |          n  |        0  1
     S  |          1  |       /    \
                      |      B     /\
                      |           0  1
                      |          /    \
                      |         0      S

In this way, each LB pair requires only 3 bits (compared with 13 with static
Huffman encoding), more than quadrupling the compression factor, which now
asymptotically approaches 2064:3 as we make the final stream longer.

blahblahblah


=====Rzip bomb=====
Rzip is a nice easy compression scheme, well-suited to encoding heavy
redundancy across long intervals.  We don't care so much here about long
intervals, but the rzip utility then runs its output through bzip2, which will
hopefully yield similar results to a pure bzip2 bomb without the manual
construction complexity of bzip2.

==Constructing the rzip==
The rzip format is pretty easy.  There are two types of blocks: literal and
match.  A literal block just encodes some data to be streamed out with no
change.  A match block refers back to data appearing earlier in the plaintext,
up to 2^32 bytes back.  The end-of-stream marker is a literal block of length
0.  A CRC32 of the plaintext is tacked on the end, then rzip feeds its output
into bzip2 for further compression.  The skeleton of such a bomb would be as
follows:

+----------------------------------------------------+
| Type of block.  0=literal, 1=match.                |
|  |    Size of literal (1 byte)                     |
|  |     |     Literal data                          |
|  |     |      |                                    |
|  v     v      v                                    |
| 00 | 01 00 | 00 |                                  |
+----+-------+----+----------------------------------+
| Match block                                        |
|  |    Size of block (65535)                        |
|  |     |         Offset to copy from (back 1 byte) |
|  |     |          |                                |
|  v     v          v                                |
| 01 | FF FF | 01 00 00 00 | *match block repeats*   |
+----+-------+-------------+-------------------------+
|   Zero-length literal is end-of-stream             |
|    |          CRC32 of uncompressed data           |
|    |           |                                   |
|    v           v                                   |
| 00 | 00 00 | CRC32 |                               |
+----+-------+-------+-------------------------------+