# KISS archive format (Keep It Simple and Stupid)

## General properties
- Block size: 512 bytes
- Only the filename (with its directory, if any) and the content are stored
- The last file contains the filenames
- Header: start block, end block, position of the last byte in the end block

## Overall file structure
[header][1. file][2. file][3. file][filenames]

## [header]
[SB][EB][POS] [SB][EB][POS] [SB][EB][POS] .. .. [SB][EB][POS]
[ 4][ 4][  2] [ 4][ 4][  2] [ 4][ 4][  2] .. .. [ 4][ 4][  2]
[   header  ] [  1. file  ] [  2. file  ] .. .. [ filenames ]

SB (start block): 4 bytes
EB (end block): 4 bytes
POS (position of the last used byte within the end block): 2 bytes

All numbers are stored big-endian, that is, most significant byte first.
Example:
613 dec = 265 hex = \00 \00 \02 \65 (4 bytes)
130411 dec = 1FD6B hex = \00 \01 \FD \6B (4 bytes)
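
The two examples above can be checked with a few lines of Python. This is only a small sketch; the helper names `to_be32` and `from_be32` are made up for illustration and are not part of the format.

```python
def to_be32(value):
    """Encode an unsigned integer as 4 bytes, most significant byte first."""
    return value.to_bytes(4, "big")


def from_be32(data):
    """Decode 4 big-endian bytes back into an integer."""
    return int.from_bytes(data, "big")


print(to_be32(613).hex())      # 00000265
print(to_be32(130411).hex())   # 0001fd6b
print(from_be32(b"\x00\x01\xfd\x6b"))  # 130411
```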

Note:
The remaining part of the header block MUST be filled with zero bytes.
There is always a remainder in the block, simply because each entry
takes 10 bytes (512/10 = 51 entries, with 2 bytes left over).
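
A minimal sketch of how a header block could be packed and read back with Python's `struct` module. The names `pack_header` and `unpack_header` are hypothetical. The sketch assumes a header that fits into a single block (at most 51 entries), and it assumes the reader derives the number of entries from the POS field of the header's own entry; the text above does not spell out either point.

```python
import struct

BLOCK_SIZE = 512
# SB (4 bytes), EB (4 bytes), POS (2 bytes), big-endian, no alignment padding
ENTRY = struct.Struct(">IIH")


def pack_header(entries):
    """Pack (sb, eb, pos) tuples into one zero-padded 512-byte block."""
    if len(entries) > BLOCK_SIZE // ENTRY.size:
        raise ValueError("this sketch only handles a one-block header")
    raw = b"".join(ENTRY.pack(sb, eb, pos) for sb, eb, pos in entries)
    return raw + b"\x00" * (BLOCK_SIZE - len(raw))


def unpack_header(block):
    """Read all entries back; entry 0 describes the header itself."""
    _, _, pos = ENTRY.unpack_from(block, 0)
    count = (pos + 1) // ENTRY.size  # assumption: one-block header
    return [ENTRY.unpack_from(block, i * ENTRY.size) for i in range(count)]


entries = [(0, 0, 29), (1, 2, 255), (3, 3, 8)]
block = pack_header(entries)
print(len(block))            # 512
print(unpack_header(block))  # [(0, 0, 29), (1, 2, 255), (3, 3, 8)]
```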

## [filenames]
UTF-8 text for each filename, delimited with a '\n' byte.
The directory structure is preserved too.
[name of 1. file]['\n'][name of 2. file]['\n'][name of 3. file] etc.

Some examples:
this is a file.txt
this2.tar.gz
this3.html
images/loller.html
weird_dir/this\/files contains\/several\\ slashes.txt

Special characters:
'\n': A filename cannot contain a '\n' character. It is reserved as the delimiter.
      (It is rare in filenames and poorly supported by many tools anyway.)
'/': directory delimiter, used to store the directory structure.
'\/': used if the filename itself contains a / character
'\\': used if the filename itself contains a \ character
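
The escaping rules above could be implemented roughly like this. It is a sketch, not a reference implementation: the function names are invented, and a path is represented as a list of components (directories plus the final filename), which is an assumption of this example.

```python
def escape_component(name):
    """Escape one path component: backslashes first, then slashes."""
    if "\n" in name:
        raise ValueError("'\\n' is reserved and cannot appear in a filename")
    return name.replace("\\", "\\\\").replace("/", "\\/")


def join_path(components):
    """Build the stored name from a list of path components."""
    return "/".join(escape_component(c) for c in components)


def split_path(stored):
    """Split a stored name on unescaped '/' and undo the escapes."""
    parts, current, i = [], [], 0
    while i < len(stored):
        ch = stored[i]
        if ch == "\\" and i + 1 < len(stored):
            current.append(stored[i + 1])  # escaped character, taken literally
            i += 2
        elif ch == "/":
            parts.append("".join(current))  # unescaped slash ends a component
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def encode_filenames(paths):
    """Build the [filenames] section from a list of component lists."""
    return "\n".join(join_path(p) for p in paths).encode("utf-8")


def decode_filenames(data):
    """Parse the [filenames] section back into component lists."""
    return [split_path(line) for line in data.decode("utf-8").split("\n")]


paths = [["images", "loller.html"],
         ["weird_dir", "this/files contains/several\\ slashes.txt"]]
section = encode_filenames(paths)
print(section.decode("utf-8"))
```

The second line printed is the `weird_dir/...` example from the list above.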

## [X. file]
The file content as is.

## FAQ:
Q: Why another archive format?
A: Because it is the dumbest format ever ;)

Q: Why not tar, ar, zip, [name archive type here]?
A: Short answer: widely used archive formats are not suited for random access
                 without compression.
   Long answer: tar: there is no index, so reading the last file of the archive
                     requires going through everything before it.
                zip: individual files are usually compressed, which costs
                     processor time.
                xar: it would fit the requirements, but it is not widely
                     supported, and not available in every language.

Q: I use language X. Is KISS supported there?
A: The file format is intentionally so simple that every programmer
   can implement it in "no time".

Q: Is compression supported?
A: No. But you can compress the whole file,
   just like with tar: filename.kiss.bz2. Use it for file sharing.

Q: Are advanced features (permissions, symlinks, hardlinks, user/group/other)
   preserved?
A: No. That was not the goal of this archive. You can implement it, though: just
   write that information to a file named .metadata-kiss. It is
   not recommended.

Q: If the original file is not a multiple of 512 bytes, how will it look in the
   archive, and how many bytes will it take?
A: Let's look at an example. We have three files:
   768 bytes, 1024 bytes, and 2047 bytes.
   The first file (768 bytes) takes two blocks: 2*512 = 1024 bytes
   The second file (1024 bytes) takes two blocks too: 2*512 = 1024 bytes
   The third file (2047 bytes) takes four blocks: 4*512 = 2048 bytes
   Let's name the files:
            - "first filename.extension" (24 bytes)
            - "second try" (10 bytes)
            - "I want a sexy name.txt" (22 bytes)
   The [filenames] section:
   [24 bytes][1 byte][10 bytes][1 byte][22 bytes] = 58 bytes = 1 block

   The header ([SB][EB][POS]), in the same order as the archive layout:
   [00 00 00 00][00 00 00 00][00 31]  (the header itself: it starts at block 0
                                       and ends at block 0. It holds 5 entries
                                       of 10 bytes = 50 bytes, so it occupies
                                       bytes 0..49 of the block.
                                       49 dec = 31 hex)
   [00 00 00 01][00 00 00 02][00 FF]  (1. file: 768-512 = 256 bytes in the last
                                       block, so bytes 0..255.
                                       255 dec = FF hex)
   [00 00 00 03][00 00 00 04][01 FF]  (2. file: 1024 bytes = 2 blocks, the second
                                       is full. 511 dec = 1FF hex)
   [00 00 00 05][00 00 00 08][01 FE]  (3. file: 2047-(3*512) = 511 bytes in the
                                       last block, so bytes 0..510.
                                       510 dec = 1FE hex)
   [00 00 00 09][00 00 00 09][00 39]  (filenames: 58 bytes long, so bytes 0..57.
                                       57 dec = 39 hex)

   The overall file size:
   [header][1. file][2. file][3. file][filenames]
   [ 1    ][  2    ][   2   ][   4   ][  1      ] = 10 blocks = 5120 B = 5 kB
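
The block and POS arithmetic from this answer, as a small Python sketch. `blocks_and_pos` and `entry_for` are hypothetical helper names. Empty files are rejected here because POS cannot express "zero bytes used", and the format description does not say how to store them.

```python
BLOCK_SIZE = 512


def blocks_and_pos(size):
    """Return (number of blocks, position of the last byte in the last block)."""
    if size < 1:
        raise ValueError("empty payloads are not covered by this sketch")
    blocks = -(-size // BLOCK_SIZE)   # ceiling division
    pos = (size - 1) % BLOCK_SIZE     # positions are counted from 0
    return blocks, pos


def entry_for(start_block, size):
    """Return the (SB, EB, POS) triple for a payload starting at start_block."""
    blocks, pos = blocks_and_pos(size)
    return start_block, start_block + blocks - 1, pos


for size in (768, 1024, 2047, 58):
    blocks, pos = blocks_and_pos(size)
    print(size, blocks, "%04X" % pos)
# 768 2 00FF
# 1024 2 01FF
# 2047 4 01FE
# 58 1 0039
```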

Q: How is the unused part of a block filled (if the file size is
   not a multiple of 512 bytes)?
A: It can be random bytes, but it should be zero bytes. Or a checksum, if there is
   enough space left (see section "An insane idea for checksums").

Q: What is the (theoretical) maximum archive size?
A: 256**4 blocks of 512 bytes, that is 2 TiB.

Q: What is the maximum filename or directory length?
A: No limit.

## Implementation advice

1. Count the files you want to archive, so you know how much
   space the header requires. 1-49 files require one block for the header
   (10 bytes for the header entry, 10 bytes for the filenames entry, and x*10
   bytes for the files themselves; the maximum x is 49 for one block).
2. Fill the header area with 0xFF bytes (see "tape archiving" to understand why FF).
3. Generate the filenames section (in memory).
4. Append each file to the archive, and generate the header
   on the fly (in memory).
5. Overwrite the header area with the valid data.
6. Append the [filenames] section at the end of the archive.
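
The six steps can be sketched in Python as follows. This is a minimal illustration under several assumptions, not a reference implementation: `write_archive` and `read_member` are invented names, the header must fit into one block (1-49 files), files must be non-empty, and the stored names are expected to be escaped already.

```python
import io
import struct

BLOCK_SIZE = 512
ENTRY = struct.Struct(">IIH")  # SB, EB, POS, big-endian


def _pad(data):
    """Zero-pad data to a whole number of blocks."""
    return data + b"\x00" * (-len(data) % BLOCK_SIZE)


def _entry(start_block, size):
    blocks = -(-size // BLOCK_SIZE)
    return start_block, start_block + blocks - 1, (size - 1) % BLOCK_SIZE


def write_archive(out, files):
    """files: list of (stored_name, content_bytes); out: seekable binary file."""
    # Step 1: count the files to learn the header size (one block here).
    if not 1 <= len(files) <= 49:
        raise ValueError("this sketch handles 1-49 files")
    if any(len(content) == 0 for _, content in files):
        raise ValueError("empty files are not covered by this sketch")
    header_size = (len(files) + 2) * ENTRY.size
    # Step 2: reserve the header block with 0xFF bytes.
    out.write(b"\xFF" * BLOCK_SIZE)
    # Step 3: generate the filenames section in memory.
    names = "\n".join(name for name, _ in files).encode("utf-8")
    # Step 4: append each file and build the header entries on the fly.
    entries = [_entry(0, header_size)]
    next_block = 1
    for _, content in files:
        entry = _entry(next_block, len(content))
        entries.append(entry)
        out.write(_pad(content))
        next_block = entry[1] + 1
    entries.append(_entry(next_block, len(names)))
    # Step 5: overwrite the reserved block with the real header.
    end = out.tell()
    out.seek(0)
    out.write(_pad(b"".join(ENTRY.pack(*e) for e in entries)))
    out.seek(end)
    # Step 6: append the filenames section.
    out.write(_pad(names))


def read_member(src, index):
    """Random access: index 0 is the header, 1..n the files, n+1 the filenames."""
    src.seek(index * ENTRY.size)
    sb, eb, pos = ENTRY.unpack(src.read(ENTRY.size))
    src.seek(sb * BLOCK_SIZE)
    return src.read((eb - sb) * BLOCK_SIZE + pos + 1)


archive = io.BytesIO()
write_archive(archive, [("a.txt", b"x" * 768),
                        ("b", b"y" * 1024),
                        ("c", b"z" * 2047)])
print(len(archive.getvalue()))   # 5120
print(read_member(archive, 4))   # b'a.txt\nb\nc'
```

Reading a member only needs its 10-byte entry and one seek, which is the point of the format.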

## An insane idea for checksums (integrity checking)

Here is the idea: write the checksum of a file into the unused space at the end
of its last block. If the file size is a multiple of 512 bytes, there is no such
space, so write two checksums at the end of the next file. If there is not enough
space there either, carry them over to the file after that, and so on.
If every file is a multiple of 512 bytes, or there is not enough space
at the end of the files, some of the last files will have no checksum
(but that is still better than having no checksums at all).
This should be rare, but if you are worried about it, you can always add
a new file with all the necessary information.

This section is not a mandatory part of the file format. So if you are brave enough,
implement it! If you don't care, no worries.

CRC32 is 4 bytes (32 bits) long.

I think 4 bytes should be safe enough ;)
A little example in Python showing how to calculate it:

    import binascii


    def crc2hex(crc):
        """Return the CRC as 8 hex digits, most significant byte first."""
        res = ''
        for i in range(4):
            t = crc & 0xFF
            crc >>= 8
            res = '%02X%s' % (t, res)
        return res


    if __name__ == '__main__':
        # binascii.crc32 works on bytes, so encode the text first
        test = 'hello world! and Python too ;)'.encode('utf-8')
        crc = binascii.crc32(test) & 0xFFFFFFFF  # keep it unsigned 32-bit
        print('CRC:', crc)
        hex_str = crc2hex(crc)
        print('CRC in hex:', hex_str)
        print('in byte representation:', bytes.fromhex(hex_str))

MD5sum is 16 bytes long.

CRC32 (4 bytes) is recommended. It is enough to detect inconsistencies.
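
To get a feeling for how the carry-over works, here is a hedged sketch that only plans where 4-byte CRC32 checksums would land. The name `plan_checksums` is invented. How several checksums are ordered inside the unused space is not specified above, so the sketch only counts them.

```python
BLOCK_SIZE = 512
CRC_SIZE = 4  # CRC32


def plan_checksums(file_sizes):
    """Return (checksums stored after each file, checksums left without a home)."""
    placed = []
    pending = 0
    for size in file_sizes:
        pending += 1                    # this file's own checksum
        slack = -size % BLOCK_SIZE      # unused bytes in the file's last block
        fits = slack // CRC_SIZE        # how many checksums fit there
        here = min(pending, fits)
        placed.append(here)
        pending -= here                 # the rest is carried to the next file
    return placed, pending


print(plan_checksums([768, 1024, 2047]))  # ([1, 0, 0], 2)
print(plan_checksums([1024, 700]))        # ([0, 2], 0)
```

In the first call the last two files end up without a stored checksum, which is the case described above where an extra file with the missing information can help.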

## Another insane idea: tape archiving

Who the hell uses tapes these days? ;)
If the first block is filled with FF hex bytes, it means the header is at the
end of the archive file ("trailer" is the more accurate term ;)).
When you archive to tape, you cannot rewind and
write the header at the beginning of the file.
In that case, the header (at the end of the file) is stored in REVERSE byte order,
so the last 4 bytes tell where the header begins. There is no need to search for it.
Simply read the last 4 bytes, determine where the header begins (reverse order!),
and read those blocks. Reverse the byte order, and that way you can process
it normally.
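
One possible reading of this, sketched in Python. It assumes that the whole zero-padded header is byte-reversed and written as the last block(s), and that block numbers still count from the start of the archive; both are interpretations, not something the description fixes. `read_tape_header` is an invented name. The sketch reads the last 10 bytes instead of only the last 4, so that it also gets EB and POS of the header's own entry.

```python
import io
import struct

BLOCK_SIZE = 512
ENTRY = struct.Struct(">IIH")  # SB, EB, POS, big-endian


def read_tape_header(src):
    """Return the header entries of an archive whose header is at the end."""
    src.seek(0)
    if src.read(BLOCK_SIZE) != b"\xFF" * BLOCK_SIZE:
        raise ValueError("first block is not all FF: the header is at the front")
    # The reversed first entry sits at the very end; its last 4 bytes are SB.
    src.seek(-ENTRY.size, io.SEEK_END)
    sb, eb, pos = ENTRY.unpack(src.read(ENTRY.size)[::-1])
    # Read the header blocks and reverse them to get the normal layout.
    src.seek(sb * BLOCK_SIZE)
    raw = src.read((eb - sb + 1) * BLOCK_SIZE)[::-1]
    count = ((eb - sb) * BLOCK_SIZE + pos + 1) // ENTRY.size
    return [ENTRY.unpack_from(raw, i * ENTRY.size) for i in range(count)]


# Build a tiny tape-style archive: FF block, one file block, reversed header.
entries = [(2, 2, 19), (1, 1, 4)]
header = b"".join(ENTRY.pack(*e) for e in entries).ljust(BLOCK_SIZE, b"\x00")
tape = io.BytesIO(b"\xFF" * BLOCK_SIZE
                  + b"hello".ljust(BLOCK_SIZE, b"\x00")
                  + header[::-1])
print(read_tape_header(tape))  # [(2, 2, 19), (1, 1, 4)]
```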
195