Advertisement
cr88192

Simple LZ, TK uLZ 0

Feb 11th, 2019
283
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
text 3.97 KB | None | 0 0
  1. TK uLZ
  2. * Small Huffman-coded LZ77 variant.
  3. * Aim is to be smaller/simpler to decode than Deflate.
  4. * Will still be Huffman coded, unlike LZ4.
  5. * Will not (necessarily) aim for speed, but should not be overly slow.
  6.  
  7.  
  8. Bitstream
  9. * LSB first.
  10. * Huffman symbols are 1-12 bits.
  11. ** Rather than 1-15 as in Deflate.
  12. ** This allows simplifying the table encoding some.
  13. ** Also allows for a single-stage lookup table to be smaller.
  14.  
  15.  
  16. === Top Level Bitstream ===
  17.  
  18. Top level bitstream will consist of a series of 4 bit tags.
  19. * 0: End of Stream.
  20. * 1: Raw Uncompressed Data
  21. * 2: LZ Compressed Data
  22. * 3: Reserved
  23. * 4: Packed Huffman Table, Tag
  24. * 5: Packed Huffman Table, Literal
  25. * 6: Packed Huffman Table, Distance
  26. * 7: Reserved
  27. * 8: Reserved
  28. * 9: Raw Uncompressed Data, Stream Ends.
  29. * A: LZ Packed Data, Stream Ends.
  30. * B..F: Reserved
  31.  
  32. Raw Uncompressed Data:
  33. * 2 bit prefixed length:
  34. ** 0: 12 bit length (0 to 4K)
  35. ** 1: 16 bit length (0 to 64K)
  36. ** 2: 20 bit length (0 to 1M)
  37. ** 3: 24 bit length (0 to 16M)
  38. * Bitstream then realigns to a byte boundary.
  39. ** This many raw bytes follows.
  40.  
  41.  
  42. === Packed Huffman Table ===
  43.  
  44. Huffman tables will encode a series of lengths, but the tables will not be entropy coded. Each table will represent up to 256 lengths.
  45.  
  46. Table contents will be encoded via 4 bit codes:
  47. * 0: Non-encoded symbol
  48. * 1..D: Symbol Length, 1..13 bits.
  49. ** Length 13: Reserved for now.
  50. * E: 3..18 zeroes (4 bits follow).
  51. * F: Escaped Run, 2 bit code follows:
  52. ** 0: 19..82 zeroes (6 bits).
  53. ** 1: 4..66 repeats of prior length (6 bits).
  54. *** Bias is 3, with 0 encoding an "End of Table".
  55. ** 2: Reserved
  56. ** 3: Reserved
  57. ** Rest of table is filled with zeroes.
  58.  
  59. A table does not necessarily end with an End of Table marker, but may also end upon reaching its total number of lengths. A table may not encode more than 256 lengths.
  60.  
  61. If an End of Table is seen, any remaining lengths are set to zero.
  62.  
  63.  
  64. Major Tables:
  65. * Tag(T): Run Prefix Tags
  66. * Literal(L): Literal Bytes
  67. * Distance(D): Distance Prefix
  68.  
  69. Symbols will be assigned code patterns such that:
  70. * Shorter symbols precede longer ones;
  71. * Symbols with the same length will be be ordered by symbol value.
  72.  
  73. Symbols will be effectively transposed in the bitstream, such that the high-order bit of the symbol's codeword appears in LSB position in the bitstream.
  74.  
  75.  
  76. === LZ Compressed Data ===
  77.  
  78. Runs will be encoded as:
  79. * Tag Prefix
  80. * If Needed, Overflowed Literal Length
  81. * If Needed, Overflowed Match Length
  82. * Match Distance
  83. * If Special Case, Extra Distance
  84. * Zero or more literal symbols
  85. * (Match copy happens here)
  86.  
  87. The Tag prefix will be encoded as:
  88. * High 4 bits: Literal Length, or 15 for overflow.
  89. * Low 4 bits: Match Length (3..17), or 15 for overflow.
  90.  
  91. In the overflow cases, the length will be encoded using a distance.
  92. * The overflow length will add a bias, 15 for literal, and 18 for match.
  93.  
  94.  
  95. Lengths and distances will be encoded as a prefix followed by zero or more "extra bits".
  96.  
  97. Distance Prefix:
  98. * High 5 bits: Exponent (ExtraBits-1)
  99. * Low 3 bits: Fraction
  100.  
  101. Thus (Prefix, Range, Extra Bits):
  102. * 00..07: 0.. 7 (0)
  103. * 08..0F: 8.. 15 (0)
  104. * 10..17: 16.. 31 (1)
  105. * 18..1F: 32.. 63 (2)
  106. * 20..27: 64.. 127 (3)
  107. * 28..2F: 128.. 255 (4)
  108. * 30..37: 256.. 511 (5)
  109. * 38..3F: 512.. 1023 (6)
  110. * 40..47: 1024.. 2047 (7)
  111. * 48..4F: 2048.. 4095 (8)
  112. * 50..57: 4096.. 8191 (9)
  113. * 58..5F: 8192..16383 (10)
  114. * 60..67: 16384..32767 (11)
  115. * 68..6F: 32768..65535 (12)
  116. * ...
  117.  
  118.  
  119. If the LZ match distance is zero, this will encode an escape case.
  120. What happens next will depend on the match length.
  121. * 3(0): End of Block
  122. * 4(1): Predicted match length and distance.
  123. * 5(2): Predicted match length (match distance given as ExtraDistance).
  124. * 6(3): Predicted distance (match length given as ExtraDistance+3).
  125. * Other values are reserved.
  126.  
  127. If a predicted value is used, the previously encoded value will be used as a predictor. At the start of a block these values are undefined and may not be used.
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement