Advertisement
Not a member of Pastebin yet?
Sign Up,
it unlocks many cool features!
- # KISS archive format (Keep It Simple and Stupid)
- ## General properties
- - blocksize: 512 bytes
- - only store filename (and directory if any) and content
- - last file contains the filenames
- - header: start block, end block, position of last block
- ## Overall file structure
- [header][1. file][2. file][3. file][filenames]
- ## [header]
- [SB][EB][POS] [SB][EB][POS] [SB][EB][POS] .. .. [SB][EB][POS]
- [ 4][ 4][ 2] [ 4][ 4][ 2] [ 4][ 4][ 2] .. .. [ 4][ 4][ 2]
- [ header ] [ 1. file ] [ 2. file ] .. .. [ filenames ]
- SB (start block): 4 byte
- EB (end block): 4 byte
- POS (position of last block): 2 byte
- All numbers are stored big-endian. That means most significant bit first.
- Example:
- 613 dec = 265 hex = \00 \00 \02 \65 (4 bytes)
- 130411 dec = 1FD6B hex = \00 \01 \FD \6B (4 bytes)
- Note:
- The remaining part of the header block MUST be filled with zero bytes.
- You will always have remaining part in the block, simply each file
- takes 10 bytes. (512/10 = 51 and 2 bytes left)
- ## [filenames]
- UTF-8 text for each filename, delimited with '\n' byte.
- The directory structure is preserved too.
- [name of 1. file]['\n'][name of 2. file]['\n'][name of 3. file] etc..
- Some examples:
- this is a file.txt
- this2.tar.gz
- this3.html
- images/loller.html
- weird_dir/this\/files contains\/several\\ slashes.txt
- Special characters:
- '\n': You cant have '\n' character in the filename. It is preserved.
- (it is not supported in most filesystems anyway)
- '/': directory delimiter. To save directory structure.
- '\/': if the filename itself contains an / character
- '\\': if the filename itself contains a \ character
- ## [X. file]
- The file content as is.
- ## FAQ:
- Q: Why another archive format?
- A: Because it is the most dumb format ever;)
- Q: Why not tar, ar, zip, [name archive type here]?
- A: Short answer: widely used archive format are not suited for random access
- with no compression.
- Long answer: tar: there is no index, reading the last file of the archive
- requires reading the whole file before it.
- zip: individual files are compressed, which means: processortime
- xar: it would fit the requirements, but it is not widely
- supported, and not in every language.
- Q: I use X language does KISS supported there?
- A: The fileformat is so simple, it is intented, every programmer
- could implement it in "no time".
- Q: Does compression supported?
- A: No. But you can compress the whole file,
- just like in tar case: filename.kiss.bz2. Use it for file sharing.
- Q: Do advanced features (rights, symlinks, hardlinks, user/group/other) are
- preserved?
- A: No. It was not the goal of this archive. Although you can implement it, just
- write those informations in a file named .metadata-kiss. It is
- not recommended.
- Q: If the original file is not multiple of 512 bytes, how it will look in the
- archive, how many bytes will it take?
- A: Lets have an example. We have three files:
- 768bytes file, 1024 bytes, 2047 bytes
- First file (768 bytes) will take two blocks: 2*512 = 1024 bytes
- Second file (1024 bytes) will take two blocks too: 2*512 = 1024 bytes
- Third file (2047 bytes) will take four blocks: 4*512 = 2048 bytes
- Lets name the files:
- - "first filename.extension", (24 bytes)
- - "second try", (10 bytes)
- - "I want a sexy name.txt", (22 bytes)
- The [filenames] section:
- [24 bytes][1 byte][10 bytes][1 byte][22 bytes] = 58bytes = 1 block
- The header ([SB][EB][POS]):
- [00 00 00 00][00 00 00 00][00 31] (this is the header itself, start at the
- 0. block, ends at 0. block, and
- the header is 50 bytes long. That means
- start at the 0. byte and
- ends at the at the 49. byte.
- (0..49 = 50 bytes, which is 0x31))
- [00 00 00 01][00 00 00 01][00 39] (58 byte long, that means 0..57 bytes
- and 57 dec = 0x39 hexa)
- [00 00 00 02][00 00 00 03][00 FF] (768-512 = 256, so 0..255 bytes.
- 255dec = FF hexa)
- [00 00 00 04][00 00 00 05][01 FF] (1024 bytes = 2 blocks, the second is full)
- [00 00 00 06][00 00 00 09][01 FE] ( 2047-(3*512) = 511. 510dec = 1FE hexa)
- The overall filesize:
- [header][1. file][2. file][3. file][filenames]
- [ 1 ][ 2 ][ 2 ][ 4 ][ 1 ] = 10 blocks = 5120B = 5kB
- Q: How is it filled the unused part of the block (if the file is
- not multiple of 512 bytes) ?
- A: It can be random bytes. But should be zero bytes. Or checksum if there is
- enough space left (see section "An insane idea for checksums").
- Q: What is the (theoretical) maximum archive filesize?
- A: 256**4 blocks, that means 2TB
- Q: What is the maximum filename, directory length?
- A: No limit.
- ## Implementation advices
- 1. Count the files what you want to archive -> you know how much
- space is required by the header. 1-49 files requires one block for the header
- (10 bytes for the header, 10 bytes for the filenames section, x*10 bytes
- for the files itself. Maximum x is 49 for one block)
- 2. dump 0xFF for the header (look at the "tape archiving" to understand why FF)
- 3. Generate the filenames section (in memory)
- 4. Attache each file to the archive, and generate the header
- on-the-fly (in memory)
- 5. Overwrite the header with valid data.
- 6. Append the [filenames] section at the end of archive
- ## An insane idea for checksums (integrity checking)
- Here is the idea, write the checksum at the remaining space, if the
- file is multiple of 512 byte, write two checksums at the next end of file,
- if there is no enough space, write it at the next file, and so on.
- If each file was multiple of 512 bytes, or there are not enough space
- at the end of each files. There will be no checksum for some of the last files
- (but it is always better then having no checksums at all).
- Which should be rare, but if you are worried about it, you can always add
- a new file with all the necessary informations.
- This section is not mandatory for the fileformat. So if you are brave enough,
- implement it! If you dont care, no worries.
- CRC32 is 4 bytes (32 bits) long.
- I think 4bytes should be safe enough;)
- A little example code in python how to calculate it:
- import binascii
- def crc2hex(crc):
- res=''
- for i in range(4):
- t=crc & 0xFF
- crc >>= 8
- res='%02X%s' % (t, res)
- return res
- if __name__=='__main__':
- test='hello world! and Python too ;)'
- crc=binascii.crc32(test)
- print 'CRC:', crc
- hex_str = crc2hex(crc)
- print 'CRC in hex:', hex_str
- print 'in byte representation: ', hex_str.decode("hex")
- MD5sum is 16bytes long.
- The CRC32 (4bytes) is recommended. It is enough to detect inconsistencies.
- ## An another insane idea for tape archiving
- Who the hell uses tapes these days?;)
- So if the first block is filled with FF hexa, it means the header is at the
- end of the archive file, the tailer is more right term;)
- So when you archive the tape, you cant reverse and
- write the header at the beginning of file.
- In that case, the header (at the end of file) is in REVERSE order.
- So the last 4 bytes tells where the header begins. So no need to search for it.
- Simply read the last 4 bytes, determine where the header begins(reverse order!),
- and read those blocks. Reverse the byte orders, and thats way you can process
- it normally.
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement