- amigojapan derpson file compression specification:
- cat enwik8 | awk '{ print tolower($0) }' | LC_ALL=C tr " " "\n" | LC_ALL=C sort | LC_ALL=C uniq | LC_ALL=C wc -l
- 1358029
- That gives 1358029 unique words. In binary, 1358029 is 1 0100 1011 1000 1100 1101, which means we can address every word in 21 bits. Adding the 3 bits for the state flags (7 states), this should be a total of 24 bits for each word… off the top of my head I would say the flags should mean this:
- states:
- *00 space after word, non capital first letter
- *01 space after word, capital first letter
- *10 comma after word, non capital first letter
- *11 comma after word, capital first letter
- 0** next byte is not compressed (the following 2 bytes are the offset to the next compressed data), then the uncompressed data follows
- 1** next 24 bits represent compressed data
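A minimal sketch of how the 21-bit index and the 3 flag bits could be packed into one 24-bit code. The flag meanings follow the list above; everything else is an assumption for illustration: the names (`pack_word`, `unpack_word`, `bits_needed`) are hypothetical, and placing the flags in the top 3 bits with the index in the low 21 bits is only one possible layout.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; the flag meanings follow the state list above.
const unsigned kIndexBits = 21;  // enough for 1358029 dictionary entries
const std::uint32_t kIndexMask = (1u << kIndexBits) - 1;

enum Flag {
    kCapital    = 0x1,  // *01 / *11: capital first letter
    kComma      = 0x2,  // *10 / *11: comma after word (otherwise a space)
    kCompressed = 0x4   // 1**: the 24 bits represent compressed data
};

// Smallest number of bits that gives at least `count` distinct values.
unsigned bits_needed(std::uint32_t count) {
    unsigned bits = 0;
    while ((1ull << bits) < count) ++bits;
    return bits;
}

// Assumed layout: flags in the top 3 bits, dictionary index in the low 21.
std::uint32_t pack_word(std::uint32_t index, bool capital, bool comma) {
    assert(index <= kIndexMask);  // index must fit in 21 bits
    std::uint32_t flags = kCompressed;
    if (capital) flags |= kCapital;
    if (comma) flags |= kComma;
    return (flags << kIndexBits) | index;
}

struct Word {
    std::uint32_t index;
    bool capital;
    bool comma;
    bool compressed;
};

// Inverse of pack_word under the same assumed layout.
Word unpack_word(std::uint32_t code) {
    std::uint32_t flags = code >> kIndexBits;
    Word w;
    w.index = code & kIndexMask;
    w.capital = (flags & kCapital) != 0;
    w.comma = (flags & kComma) != 0;
    w.compressed = (flags & kCompressed) != 0;
    return w;
}
```

For example, `pack_word(5, true, false)` would describe dictionary entry 5 with a capital first letter followed by a space, and the result always fits in 3 bytes.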
- also, here is my idea for a header:
- header: encoded-dictionary-filename, checksum, position of first compressed data (the first data could be uncompressible)
- maybe store the offset from the current bit to the next compressed data inline, in the data stream itself… this can’t be too long though.
- ….compressed data,uncompressible flag,offset:3bytesABCcompressed data,uncompressible flag,offset:5ABCDEF.compressed data
- if the offset is more than 2 bytes long, we can more or less say this file is not very compressible, and is probably not mostly plain text
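A rough sketch of how one uncompressible run might be written and read back, following the "flag, 2-byte offset, raw data" order described above. Several details are assumptions, not part of the spec: the flag is stored as a whole zero byte, the offset is a big-endian count of the raw bytes that follow, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Appends one uncompressed run: flag byte, 2-byte offset, then the raw bytes.
void append_raw_run(std::vector<std::uint8_t>& out, const std::string& raw) {
    assert(raw.size() <= 0xFFFF);  // the offset has to fit in 2 bytes
    out.push_back(0x00);           // assumed: whole byte, "0**" = not compressed
    out.push_back(static_cast<std::uint8_t>(raw.size() >> 8));    // high byte
    out.push_back(static_cast<std::uint8_t>(raw.size() & 0xFF));  // low byte
    out.insert(out.end(), raw.begin(), raw.end());
}

// Reads a run written by append_raw_run starting at `pos` and advances
// `pos` to where the next compressed data would begin.
std::string read_raw_run(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    assert(pos + 3 <= in.size() && in[pos] == 0x00);
    std::size_t len = (static_cast<std::size_t>(in[pos + 1]) << 8) | in[pos + 2];
    assert(pos + 3 + len <= in.size());
    std::string raw(in.begin() + pos + 3, in.begin() + pos + 3 + len);
    pos += 3 + len;
    return raw;
}
```

With this layout every raw run costs 3 extra bytes on top of its data, which is the overhead the notes about short words are worried about.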
- hmmm, if there is only one word, it may be better not to compress it: it may end up larger once you count all the data needed to jump to the next word. I will need to think about this…
- but this should actually be calculable: if the word's raw bytes are fewer than the bytes its compressed form ends up taking (including the jump to the next word), then don’t compress...
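One way the size check could be sketched for a single dictionary word sitting in the middle of uncompressible data. The numbers are assumptions: a compressed word costs 3 bytes (24 bits), resuming the raw data afterwards costs another 3 bytes (flag byte plus 2-byte offset), and leaving the word raw costs its length plus one separator byte. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kCodeBytes = 3;    // 24 bits per compressed word
const std::size_t kEscapeBytes = 3;  // assumed: flag byte + 2-byte offset

// Returns true when compressing an isolated word is strictly smaller than
// leaving it in the surrounding uncompressed run.
bool should_compress_isolated_word(std::size_t word_bytes) {
    assert(word_bytes > 0);  // an empty word is never a candidate
    std::size_t raw_cost = word_bytes + 1;  // word plus its space or comma
    std::size_t compressed_cost = kCodeBytes + kEscapeBytes;
    return compressed_cost < raw_cost;
}
```

Under these assumptions a short word like "the" stays raw (4 bytes against 6), while an 11-letter word is worth compressing (12 bytes against 6).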
- two-way converter between boost dynamic bitset and bytes:
- http://pastebin.com/5SYsjtfZ
- compile flags for boost:
- clang++ ajderpcompress.cpp -I /opt/local/include/