amigojapan derpson file compression specification

First, count the unique lowercased, space-separated words in enwik8:
cat enwik8 | awk '{ print tolower($0) }'  | LC_ALL=C tr " " "\n"  | LC_ALL=C sort | LC_ALL=C uniq | LC_ALL=C wc -l

1358029
That gives 1358029 unique results. In binary, 1358029 is 1 0100 1011 1000 1100 1101, which means we can address every word in 21 bits. Adding 3 bits for the state flags (3 bits allow up to 8 combinations), this should come to a total of 24 bits for each word. Off the top of my head, I would say the flags should mean this:
states:
*00 space after word, non-capital first letter
*01 space after word, capital first letter
*10 comma after word, non-capital first letter
*11 comma after word, capital first letter
0** next data is not compressed (the following 2 bytes are the offset to the next compressed data, then the uncompressed data follows)
1** next 24 bits represent compressed data
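
The bit counting and flag layout above can be sketched in C++ like this. This is only a sketch under assumptions that the notes do not fix: the 3 flag bits sit in the top of the 24 bits, the 21-bit dictionary index sits in the bottom, and the 24 bits are stored as 3 bytes, most significant byte first. The names `bitWidth`, `Token`, `packToken` and `unpackToken` are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Number of bits in the binary representation of `value`.
unsigned bitWidth(std::uint32_t value) {
    unsigned bits = 0;
    while (value != 0) {
        ++bits;
        value >>= 1;
    }
    return bits;
}

const unsigned kIndexBits = 21;                           // enough for 1358029 words
const std::uint32_t kIndexMask = (1u << kIndexBits) - 1;  // low 21 bits

// Flag bits, written in the same order as the "states" list (assumed layout).
enum Flag : std::uint32_t {
    kCapital    = 1u << 0,  // **1: capital first letter
    kComma      = 1u << 1,  // *1*: comma after word (otherwise a space)
    kCompressed = 1u << 2   // 1**: this is compressed data
};

struct Token {
    std::uint32_t index;  // dictionary index, must fit in 21 bits
    std::uint32_t flags;  // combination of Flag values, must fit in 3 bits
};

// Pack flags (top 3 bits) and index (low 21 bits) into 3 bytes, MSB first.
std::array<std::uint8_t, 3> packToken(Token t) {
    assert(t.index <= kIndexMask);
    assert(t.flags <= 7u);
    const std::uint32_t word = (t.flags << kIndexBits) | t.index;
    std::array<std::uint8_t, 3> out = {{
        static_cast<std::uint8_t>((word >> 16) & 0xFF),
        static_cast<std::uint8_t>((word >> 8) & 0xFF),
        static_cast<std::uint8_t>(word & 0xFF)
    }};
    return out;
}

// Inverse of packToken.
Token unpackToken(const std::array<std::uint8_t, 3>& bytes) {
    const std::uint32_t word = (static_cast<std::uint32_t>(bytes[0]) << 16) |
                               (static_cast<std::uint32_t>(bytes[1]) << 8) |
                               static_cast<std::uint32_t>(bytes[2]);
    Token t;
    t.index = word & kIndexMask;
    t.flags = word >> kIndexBits;
    return t;
}
```

With this layout, a compressed word followed by a comma and starting with a capital letter would carry the flags `kCompressed | kComma | kCapital`, which is `111` in the notation of the list.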


also, here is my idea for a header:
header: encoded-dictionary-filename, checksum, position of first compressed data (the first data could be non-compressible)
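
As a rough sketch, the three header fields could be held in a small struct and written as one comma-separated line. Everything beyond the three field names is an assumption made for illustration: the 32-bit field widths, the text encoding, the trailing newline, and the names `Header`, `writeHeader` and `readHeader`. The notes do not say which checksum is meant, so it is treated here as an opaque number.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// The three header fields from the notes (field widths are assumptions).
struct Header {
    std::string dictionaryFilename;       // encoded dictionary filename
    std::uint32_t checksum;               // checksum, algorithm not specified
    std::uint32_t firstCompressedOffset;  // position of first compressed data
};

// Write the header as "filename,checksum,position\n".
std::string writeHeader(const Header& h) {
    // A comma inside the filename would break this simple format.
    assert(h.dictionaryFilename.find(',') == std::string::npos);
    std::ostringstream out;
    out << h.dictionaryFilename << ',' << h.checksum << ','
        << h.firstCompressedOffset << '\n';
    return out.str();
}

// Parse a line produced by writeHeader.
Header readHeader(const std::string& line) {
    std::istringstream in(line);
    Header h = Header();
    char comma = 0;
    std::getline(in, h.dictionaryFilename, ',');  // consumes the first comma
    in >> h.checksum >> comma >> h.firstCompressedOffset;
    assert(!in.fail());
    assert(comma == ',');
    return h;
}
```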

Maybe store the offset from the current bit to the next compressed data inline, in the same data stream. It can’t be too long though.

…compressed data, uncompressible flag, offset: 3 bytes, ABC, compressed data, uncompressible flag, offset: 5 bytes, ABCDE, compressed data…
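
Here is a minimal sketch of how such an uncompressed run could be written and skipped over. It assumes things the notes leave open: the "uncompressible" flag takes a whole byte whose top bit is 0, the offset is the raw data length in bytes stored in 2 bytes (most significant first), and the raw bytes follow immediately. `appendRawRun` and `readRawRun` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Append one uncompressed run: [flag byte][2-byte offset][raw bytes].
void appendRawRun(std::vector<std::uint8_t>& out, const std::string& raw) {
    assert(raw.size() <= 0xFFFF);  // the offset must fit in 2 bytes
    out.push_back(0x00);           // top bit 0: not compressed
    out.push_back(static_cast<std::uint8_t>(raw.size() >> 8));
    out.push_back(static_cast<std::uint8_t>(raw.size() & 0xFF));
    out.insert(out.end(), raw.begin(), raw.end());
}

// Read the run starting at `pos` and move `pos` to the next compressed data.
std::string readRawRun(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    assert(pos + 3 <= in.size());
    assert((in[pos] & 0x80) == 0);  // must be flagged as not compressed
    const std::size_t length =
        (static_cast<std::size_t>(in[pos + 1]) << 8) | in[pos + 2];
    assert(pos + 3 + length <= in.size());
    const std::size_t start = pos + 3;
    std::string raw(in.begin() + start, in.begin() + start + length);
    pos = start + length;  // jump over the raw bytes
    return raw;
}
```

Writing the runs "ABC" and "ABCDE" this way costs 3 extra bytes each (flag plus offset), which is the overhead to weigh against simply leaving short words uncompressed.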

If the offset is more than 2 bytes long, we can more or less say this file is not very compressible and is probably not mostly plain text.
Hmm, if there is only one word, it may be better not to compress it. It may end up larger once you add all the data needed to jump to the next word. I will need to think about this…

But this should actually be calculable: if the word takes fewer bytes uncompressed than its compressed form plus the data needed to jump to the next word, then don’t compress it.
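
The check could be as small as the sketch below. It assumes a compressed word always costs 3 bytes (the 24 bits discussed earlier) and that the caller knows how many extra bytes are needed to jump to the next word. `shouldCompress` and `jumpOverheadBytes` are names invented for this example, and whether the trailing space or comma counts toward the word length is left to the caller.

```cpp
#include <cassert>
#include <cstddef>

// Size of one compressed word: 21 index bits + 3 flag bits = 24 bits.
const std::size_t kTokenBytes = 3;

// Returns true when compressing the word does not make the output larger.
// wordBytes:         size of the word when stored uncompressed
// jumpOverheadBytes: extra bytes needed to reach the next word (assumption)
bool shouldCompress(std::size_t wordBytes, std::size_t jumpOverheadBytes) {
    assert(wordBytes > 0);  // an empty word is never a candidate
    return wordBytes >= kTokenBytes + jumpOverheadBytes;
}
```

For example, a 1-byte word is left alone even with no overhead, while a 13-byte word is still worth compressing with 3 bytes of overhead.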


Two-way converter between a Boost dynamic_bitset and bytes:
http://pastebin.com/5SYsjtfZ
compile flags for boost:
clang++ ajderpcompress.cpp -I /opt/local/include/