amigojapan derpson file compression specification

Apr 8th, 2015
cat enwik8 | awk '{ print tolower($0) }' | LC_ALL=C tr " " "\n" | LC_ALL=C sort | LC_ALL=C uniq | LC_ALL=C wc -l

1358029

That gives 1358029 unique lowercased words. In binary, 1358029 is 1 0100 1011 1000 1100 1101, which means every word can be addressed in 21 bits. Adding 3 bits for the state flags (values 0 to 7) should give a total of 24 bits for each word. Off the top of my head, I would say the flags should mean this:
states:
*00 space after word, non-capital first letter
*01 space after word, capital first letter
*10 comma after word, non-capital first letter
*11 comma after word, capital first letter
0** what follows is not compressed (the following 2 bytes hold the offset to the next compressed data, then the uncompressed data follows)
1** the next 24 bits represent compressed data
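
A minimal sketch of how one 24-bit word token could be packed and unpacked, following the state table above. The notes do not say where the flag bits sit inside the 24 bits, so putting them in the top 3 bits (with the 21-bit dictionary index below them) is an assumption, and so is the most-significant-byte-first order. All names here (packToken, unpackToken, bitsNeeded, WordToken) are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumed layout (not fixed by the notes): 3 flag bits on top,
// 21-bit dictionary index in the low bits of the 24-bit token.
const std::uint32_t kIndexBits = 21;
const std::uint32_t kIndexMask = (1u << kIndexBits) - 1;  // 0x1FFFFF

// Flag bits, read off the state table:
const std::uint8_t kCompressed = 0x4;  // 1** : token is compressed data
const std::uint8_t kComma      = 0x2;  // *1* : comma after word (else space)
const std::uint8_t kCapital    = 0x1;  // **1 : capital first letter

// Number of bits needed to write n in binary (0 for n == 0).
inline unsigned bitsNeeded(std::uint32_t n) {
    unsigned bits = 0;
    while (n != 0) {
        ++bits;
        n >>= 1;
    }
    return bits;
}

// Pack flags and index into three bytes, most significant byte first.
inline std::array<std::uint8_t, 3> packToken(std::uint8_t flags,
                                             std::uint32_t index) {
    assert(flags <= 0x7 && index <= kIndexMask);
    const std::uint32_t token =
        (static_cast<std::uint32_t>(flags) << kIndexBits) | index;
    return {{static_cast<std::uint8_t>(token >> 16),
             static_cast<std::uint8_t>(token >> 8),
             static_cast<std::uint8_t>(token)}};
}

struct WordToken {
    std::uint8_t flags;
    std::uint32_t index;
};

// Inverse of packToken.
inline WordToken unpackToken(const std::array<std::uint8_t, 3>& bytes) {
    const std::uint32_t token = (static_cast<std::uint32_t>(bytes[0]) << 16) |
                                (static_cast<std::uint32_t>(bytes[1]) << 8) |
                                static_cast<std::uint32_t>(bytes[2]);
    WordToken t;
    t.flags = static_cast<std::uint8_t>(token >> kIndexBits);
    t.index = token & kIndexMask;
    return t;
}
```

For example, packToken(kCompressed | kCapital, i) would describe dictionary word i with a capital first letter and a space after it.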


Also, here is my idea for a header:
header: encoded-dictionary-filename, checksum, position of first compressed data (the first data could be non-compressible)
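
A rough sketch of that header as a struct with a comma-separated text form. The notes only name the three fields; the field widths (32-bit), the decimal text encoding, and the names Header, serializeHeader and parseHeader are all assumptions. The checksum is treated as an opaque number, since no algorithm is chosen here. A filename containing a comma would break this simple form.

```cpp
#include <cstdint>
#include <exception>
#include <sstream>
#include <string>

// Hypothetical header record: the three fields named in the notes.
struct Header {
    std::string dictionaryFilename;       // name of the encoded dictionary
    std::uint32_t checksum;               // opaque value, algorithm undecided
    std::uint32_t firstCompressedOffset;  // position of first compressed data
};

// Write the header as "filename,checksum,offset".
inline std::string serializeHeader(const Header& h) {
    std::ostringstream out;
    out << h.dictionaryFilename << ',' << h.checksum << ','
        << h.firstCompressedOffset;
    return out.str();
}

// Parse the text form back; returns false on a missing or non-numeric field.
inline bool parseHeader(const std::string& line, Header& h) {
    std::istringstream in(line);
    std::string checksumText, offsetText;
    if (!std::getline(in, h.dictionaryFilename, ',')) return false;
    if (!std::getline(in, checksumText, ',')) return false;
    if (!std::getline(in, offsetText)) return false;
    try {
        h.checksum = static_cast<std::uint32_t>(std::stoul(checksumText));
        h.firstCompressedOffset =
            static_cast<std::uint32_t>(std::stoul(offsetText));
    } catch (const std::exception&) {
        return false;  // std::stoul threw: field was not a usable number
    }
    return true;
}
```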

maybe store the offset from the current bit to the next compressed data in the same data…. this can’t be too long though.

….compressed data,uncompressible flag,offset:3bytesABCcompressed data,uncompressible flag,offset:5ABCDEF.compressed data
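
The layout above could be walked like this. It is only a sketch under several assumptions the notes do not settle: compressed tokens are 3 bytes with the 1** flag in the top bit of the first byte, an uncompressed run starts with one marker byte whose top bit is clear, and the 2-byte offset is the big-endian length of the raw bytes that follow. appendRawRun, scanStream and ScanResult are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Append an uncompressed run: marker byte (top bit clear), 2-byte big-endian
// length, then the raw bytes. Runs longer than 65535 bytes must be split.
inline void appendRawRun(std::vector<std::uint8_t>& out,
                         const std::string& raw) {
    assert(raw.size() <= 0xFFFF);
    out.push_back(0x00);
    out.push_back(static_cast<std::uint8_t>(raw.size() >> 8));
    out.push_back(static_cast<std::uint8_t>(raw.size() & 0xFF));
    out.insert(out.end(), raw.begin(), raw.end());
}

struct ScanResult {
    std::size_t tokenCount = 0;  // number of 3-byte compressed tokens seen
    std::string rawBytes;        // all uncompressed runs, concatenated
    bool ok = true;              // false if the stream was truncated
};

// Walk the stream, counting tokens and collecting uncompressed runs.
inline ScanResult scanStream(const std::vector<std::uint8_t>& in) {
    ScanResult result;
    std::size_t pos = 0;
    while (pos < in.size()) {
        if (in.size() - pos < 3) {  // both kinds of item need 3 bytes
            result.ok = false;
            break;
        }
        if (in[pos] & 0x80) {  // 1** : compressed token
            ++result.tokenCount;
            pos += 3;
        } else {  // 0** : uncompressed run
            const std::size_t len =
                (static_cast<std::size_t>(in[pos + 1]) << 8) | in[pos + 2];
            pos += 3;
            if (in.size() - pos < len) {
                result.ok = false;
                break;
            }
            for (std::size_t i = 0; i < len; ++i) {
                result.rawBytes.push_back(static_cast<char>(in[pos + i]));
            }
            pos += len;
        }
    }
    return result;
}
```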

If the offset is more than 2 bytes long, we can more or less say this file is not very compressible, and is probably not mostly plain text.
Hmmm, if there is only one word, it may be better not to compress it: it may end up larger with all the data needed to jump to the next word. I will need to think about this…

But this should actually be calculable: if the word takes fewer bytes raw than the compressed result would (including the data needed to reach the next word), then don’t compress it...
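
That calculation could look something like this. The 3-byte token size comes from the 24 bits above; the 3-byte cost of opening an uncompressed run (one flag byte plus the 2 offset bytes) and the extra byte for the space or comma are assumptions, and rawCost and shouldCompress are made-up names.

```cpp
#include <cstddef>

// One compressed word is a 24-bit token, i.e. 3 bytes.
const std::size_t kTokenBytes = 3;
// Assumed cost of opening an uncompressed run: 1 flag byte + 2 offset bytes.
const std::size_t kEscapeBytes = 3;

// Bytes needed to store the word raw. The +1 is the space or comma that a
// token would carry in its flag bits at no extra cost.
inline std::size_t rawCost(std::size_t wordBytes, bool alreadyInRawRun) {
    return wordBytes + 1 + (alreadyInRawRun ? 0 : kEscapeBytes);
}

// Rule from the notes: if the raw form is smaller than the compressed form,
// don't compress; otherwise do.
inline bool shouldCompress(std::size_t wordBytes, bool alreadyInRawRun) {
    return rawCost(wordBytes, alreadyInRawRun) >= kTokenBytes;
}
```

Under these assumptions a one-letter word is only worth leaving raw when an uncompressed run is already open; between two tokens the escape overhead makes the token cheaper.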


boost dynamic bitset bytes converter (2-way):
http://pastebin.com/5SYsjtfZ
compile flags for boost:
clang++ ajderpcompress.cpp -I /opt/local/include/