# KISS archive format (Keep It Simple and Stupid)

## General properties
- block size: 512 bytes
- stores only the filename (with its directory, if any) and the content of each file
- the last section of the archive contains the filenames
- header: one entry per section, giving its start block, its end block, and the position of the last used byte in the end block

## Overall file structure
    [header][1. file][2. file][3. file][filenames]

## [header]
    [SB][EB][POS] [SB][EB][POS] [SB][EB][POS] .. .. [SB][EB][POS]
    [ 4][ 4][  2] [ 4][ 4][  2] [ 4][ 4][  2] .. .. [ 4][ 4][  2]
    [   header  ] [ filenames ] [  1. file  ] .. .. [ last file ]

- SB (start block): 4 bytes
- EB (end block): 4 bytes
- POS (position of the last used byte within the end block): 2 bytes

All numbers are stored big-endian, that is, most significant byte first.
Example:

    613 dec    =   265 hex = \00 \00 \02 \65 (4 bytes)
    130411 dec = 1FD6B hex = \00 \01 \FD \6B (4 bytes)

Note:
The remaining part of the header block MUST be filled with zero bytes.
There is always a remaining part, simply because each entry
takes 10 bytes (512/10 = 51 entries, with 2 bytes left over).

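The entry layout above maps directly onto a fixed-size binary record. Here is a minimal sketch in Python, assuming the standard `struct` module; the names `ENTRY`, `pack_entry`, `unpack_entry` and `pad_block` are made up for this illustration and are not part of the format.

```python
import struct

BLOCK_SIZE = 512

# ">IIH" = big-endian, 4-byte SB, 4-byte EB, 2-byte POS (10 bytes in total)
ENTRY = struct.Struct(">IIH")


def pack_entry(start_block, end_block, last_pos):
    """Encode one [SB][EB][POS] header entry."""
    return ENTRY.pack(start_block, end_block, last_pos)


def unpack_entry(raw):
    """Decode 10 bytes into a (SB, EB, POS) tuple."""
    return ENTRY.unpack(raw)


def pad_block(data):
    """Fill data with zero bytes up to the next multiple of BLOCK_SIZE."""
    return data + b"\x00" * (-len(data) % BLOCK_SIZE)


if __name__ == "__main__":
    entry = pack_entry(2, 3, 255)
    print(entry.hex())
    print(unpack_entry(entry))
    print(len(pad_block(entry)))
```

Using one `struct.Struct` for both directions keeps the byte order and the field widths in a single place.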
## [filenames]
UTF-8 text for each filename, delimited with a '\n' byte.
The directory structure is preserved too.

    [name of 1. file]['\n'][name of 2. file]['\n'][name of 3. file] etc.

Some examples:

    this is a file.txt
    this2.tar.gz
    this3.html
    images/loller.html
    weird_dir/this\/files contains\/several\\ slashes.txt

Special characters:
- '\n': reserved as the delimiter, so a filename cannot contain a '\n' character
  (such filenames are rare in practice anyway)
- '/': directory delimiter, used to preserve the directory structure
- '\/': a / character in the filename itself
- '\\': a \ character in the filename itself

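The escaping rules can be sketched as follows. This is only one possible reading of the rules above: it assumes that each path is handled as a list of components, and the helper names (`escape_name`, `split_path`, `build_filenames`, `parse_filenames`) are invented for the example.

```python
def escape_name(component):
    """Escape one path component: backslashes first, then slashes."""
    if "\n" in component:
        raise ValueError("'\\n' is reserved as the filename delimiter")
    return component.replace("\\", "\\\\").replace("/", "\\/")


def split_path(stored):
    """Split a stored path on unescaped '/' and undo the escaping."""
    parts, current, i = [], [], 0
    while i < len(stored):
        ch = stored[i]
        if ch == "\\" and i + 1 < len(stored):
            # Escaped character: take the next character literally.
            current.append(stored[i + 1])
            i += 2
        elif ch == "/":
            # Unescaped slash: directory delimiter.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def build_filenames(paths):
    """Turn a list of component lists into the [filenames] bytes."""
    lines = ["/".join(escape_name(c) for c in path) for path in paths]
    return "\n".join(lines).encode("utf-8")


def parse_filenames(section):
    """Inverse of build_filenames (expects the unpadded section bytes)."""
    return [split_path(line) for line in section.decode("utf-8").split("\n")]


if __name__ == "__main__":
    section = build_filenames([["images", "loller.html"], ["weird_dir", "a/b\\c.txt"]])
    print(section)
    print(parse_filenames(section))
```

Escaping the backslash before the slash matters: doing it the other way round would double the backslashes that the slash escaping has just added.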
## [X. file]
The file content, stored as is.

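Because every section starts on a block boundary, reading one back is a seek plus a single read. A minimal sketch, assuming POS is the zero-based index of the last used byte in the end block (so the length is `(EB - SB) * 512 + POS + 1`) and assuming the header fits into one block; `read_section` and `read_entries` are hypothetical helper names.

```python
import io
import struct

BLOCK_SIZE = 512
ENTRY_SIZE = 10  # [SB][EB][POS] = 4 + 4 + 2 bytes


def read_section(archive, start_block, end_block, last_pos):
    """Return the bytes of one section, without the block padding."""
    length = (end_block - start_block) * BLOCK_SIZE + last_pos + 1
    archive.seek(start_block * BLOCK_SIZE)
    return archive.read(length)


def read_entries(archive):
    """Return all (SB, EB, POS) entries of a one-block header."""
    archive.seek(0)
    sb, eb, pos = struct.unpack(">IIH", archive.read(ENTRY_SIZE))
    if sb != 0 or eb != 0:
        # Assumption of this sketch: the header is exactly block 0.
        raise NotImplementedError("only a one-block header is handled here")
    raw = read_section(archive, sb, eb, pos)
    return [struct.unpack_from(">IIH", raw, off)
            for off in range(0, len(raw), ENTRY_SIZE)]


if __name__ == "__main__":
    # Tiny in-memory archive: header, one 768-byte file, [filenames].
    header = b"".join(struct.pack(">IIH", *e)
                      for e in [(0, 0, 29), (3, 3, 4), (1, 2, 255)])
    blob = (header.ljust(512, b"\x00")
            + (b"A" * 768).ljust(1024, b"\x00")
            + b"a.txt".ljust(512, b"\x00"))
    archive = io.BytesIO(blob)
    entries = read_entries(archive)
    print(entries)
    print(read_section(archive, *entries[1]))
```

The first entry describes the header itself, the second one the [filenames] section, and the remaining ones the files, so any file can be fetched without touching the others.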
## FAQ:
Q: Why another archive format?
A: Because it is the dumbest format ever ;)

Q: Why not tar, ar, zip, [name archive type here]?
A: Short answer: the widely used archive formats are not suited for random access
without compression.
Long answer:
- tar: there is no index, so reading the last file of the archive
  requires reading through everything before it.
- zip: individual files are usually compressed, which costs processor time.
- xar: it would fit the requirements, but it is not widely
  supported, and not available in every language.

Q: I use language X. Is KISS supported there?
A: The file format is intentionally so simple that every programmer
can implement it in "no time".

Q: Is compression supported?
A: No. But you can compress the whole archive,
just as with tar: filename.kiss.bz2. Use that for file sharing.

Q: Are advanced features (permissions, symlinks, hardlinks, user/group/other)
preserved?
A: No. That was not the goal of this archive. You can implement it yourself by
writing that information into a file named .metadata-kiss, but it is
not recommended.

Q: If the original file is not a multiple of 512 bytes, how will it look in the
archive, and how many bytes will it take?
A: Let's take an example. We have three files of 768, 1024 and 2047 bytes:
- the first file (768 bytes) takes two blocks: 2*512 = 1024 bytes
- the second file (1024 bytes) takes two blocks too: 2*512 = 1024 bytes
- the third file (2047 bytes) takes four blocks: 4*512 = 2048 bytes

Let's name the files:
- "first filename.extension" (24 bytes)
- "second try" (10 bytes)
- "I want a sexy name.txt" (22 bytes)

The [filenames] section:

    [24 bytes][1 byte][10 bytes][1 byte][22 bytes] = 58 bytes = 1 block

The header ([SB][EB][POS]), with the sections laid out as in
"Overall file structure", so that [filenames] is the last block:

    [00 00 00 00][00 00 00 00][00 31]  the header itself: it starts at block 0,
                                       ends at block 0, and is 50 bytes long
                                       (5 entries * 10 bytes). That means bytes
                                       0..49, and 49 dec = 0x31
    [00 00 00 09][00 00 00 09][00 39]  [filenames]: block 9, 58 bytes long,
                                       that means bytes 0..57, and
                                       57 dec = 0x39
    [00 00 00 01][00 00 00 02][00 FF]  1. file: 768-512 = 256 bytes in the last
                                       block, so bytes 0..255. 255 dec = 0xFF
    [00 00 00 03][00 00 00 04][01 FF]  2. file: 1024 bytes = 2 blocks, the second
                                       one is full. 511 dec = 0x1FF
    [00 00 00 05][00 00 00 08][01 FE]  3. file: 2047-(3*512) = 511 bytes in the
                                       last block, so bytes 0..510.
                                       510 dec = 0x1FE

The overall file size:

    [header][1. file][2. file][3. file][filenames]
    [   1  ][   2   ][   2   ][   4   ][    1    ] = 10 blocks = 5120 B = 5 kB

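The arithmetic of this example can be reproduced with a few lines of Python. This is a sketch under two assumptions: the sections are laid out in the order given in "Overall file structure" ([header], the files, then [filenames]), and no section is empty. The function names `span` and `layout` are invented for the illustration.

```python
BLOCK_SIZE = 512
ENTRY_SIZE = 10  # [SB][EB][POS]


def span(start_block, size):
    """Return (SB, EB, POS) for a section of `size` bytes starting at start_block."""
    if size < 1:
        raise ValueError("this sketch assumes non-empty sections")
    n_blocks = -(-size // BLOCK_SIZE)  # ceiling division
    return start_block, start_block + n_blocks - 1, (size - 1) % BLOCK_SIZE


def layout(file_sizes, filenames_size):
    """Return (header entries, total block count) for the given sizes."""
    header_size = (len(file_sizes) + 2) * ENTRY_SIZE
    spans, next_block = [], 0
    # Physical order: header, files, filenames.
    for size in [header_size, *file_sizes, filenames_size]:
        entry = span(next_block, size)
        spans.append(entry)
        next_block = entry[1] + 1
    header, *files, names = spans
    # Entry order inside the header: header, filenames, files.
    return [header, names, *files], next_block


if __name__ == "__main__":
    entries, total_blocks = layout([768, 1024, 2047], 58)
    for sb, eb, pos in entries:
        print("[%08X][%08X][%04X]" % (sb, eb, pos))
    print(total_blocks, "blocks =", total_blocks * BLOCK_SIZE, "bytes")
```

POS is computed as `(size - 1) % 512`, which gives 511 for a file that fills its last block completely.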
Q: How is the unused part of a block filled (if the file is
not a multiple of 512 bytes)?
A: It can be random bytes, but it should be zero bytes. Or a checksum, if there is
enough space left (see section "An insane idea for checksums").

Q: What is the (theoretical) maximum archive file size?
A: 256**4 blocks of 512 bytes each, that means 2 TB (2**41 bytes).

Q: What is the maximum filename or directory length?
A: No limit.


## Implementation advice

1. Count the files you want to archive, so that you know how much
   space the header requires. 1-49 files require one block for the header
   (10 bytes for the header entry, 10 bytes for the [filenames] entry, and x*10 bytes
   for the files themselves; the maximum x for one block is 49).
2. Write 0xFF bytes as a placeholder for the header (see "tape archiving" to understand why FF).
3. Generate the [filenames] section (in memory).
4. Append each file to the archive, and generate the header
   on the fly (in memory).
5. Overwrite the placeholder with the valid header data.
6. Append the [filenames] section at the end of the archive.

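The six steps above can be sketched roughly like this. It is not a reference implementation: it assumes a seekable output, a header that fits into one block (1-49 files), non-empty files, and names that are already escaped. `write_archive`, `pad` and `span` are hypothetical names.

```python
import io
import struct

BLOCK_SIZE = 512
ENTRY_SIZE = 10  # [SB][EB][POS]


def pad(data):
    """Zero-fill data up to the next multiple of BLOCK_SIZE."""
    return data + b"\x00" * (-len(data) % BLOCK_SIZE)


def span(start_block, size):
    """Return (SB, EB, POS) for `size` bytes starting at start_block."""
    n_blocks = -(-size // BLOCK_SIZE)  # ceiling division
    return start_block, start_block + n_blocks - 1, (size - 1) % BLOCK_SIZE


def write_archive(out, files):
    """files: list of (stored_name, content_bytes) pairs."""
    # Step 1: the number of files gives the header size.
    header_size = (len(files) + 2) * ENTRY_SIZE
    if not files or header_size > BLOCK_SIZE:
        raise ValueError("this sketch handles 1-49 files only")
    # Step 2: placeholder header filled with 0xFF.
    out.write(b"\xff" * BLOCK_SIZE)
    # Step 3: the [filenames] section, in memory.
    names = "\n".join(name for name, _ in files).encode("utf-8")
    # Step 4: append the files and collect their entries.
    next_block, file_entries = 1, []
    for _, content in files:
        if not content:
            raise ValueError("this sketch assumes non-empty files")
        entry = span(next_block, len(content))
        file_entries.append(entry)
        out.write(pad(content))
        next_block = entry[1] + 1
    # Step 5: overwrite the placeholder with the real header.
    entries = [span(0, header_size), span(next_block, len(names))] + file_entries
    out.seek(0)
    out.write(pad(b"".join(struct.pack(">IIH", *e) for e in entries)))
    # Step 6: append [filenames] at the end of the archive.
    out.seek(0, io.SEEK_END)
    out.write(pad(names))


if __name__ == "__main__":
    buf = io.BytesIO()
    write_archive(buf, [("a.txt", b"hello"), ("dir/b.bin", b"\x01" * 600)])
    print(len(buf.getvalue()), "bytes written")
```

The header can be finished before the [filenames] section is written, because the position of that section is already known once the last file has been appended.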
## An insane idea for checksums (integrity checking)

Here is the idea: write the checksum of a file into the unused space at the end
of its last block. If the file is a multiple of 512 bytes, there is no unused space,
so write two checksums at the end of the next file; if there is not enough space
there either, carry them over to the file after that, and so on.
If every file is a multiple of 512 bytes, or there is not enough space
at the end of the files, some of the last files will have no checksum
(but that is still better than having no checksums at all).
This should be rare, but if you are worried about it, you can always add
a new file with all the necessary information.

This section is not a mandatory part of the file format. So if you are brave enough,
implement it! If you don't care, no worries.

CRC32 is 4 bytes (32 bits) long.

I think 4 bytes should be safe enough ;)
A little example in Python showing how to calculate it:

    import binascii

    def crc2hex(crc):
        """Format a CRC32 value as 8 hex digits, most significant byte first."""
        res = ''
        for i in range(4):
            t = crc & 0xFF
            crc >>= 8
            res = '%02X%s' % (t, res)
        return res

    if __name__ == '__main__':
        # crc32() needs bytes in Python 3, so encode the text first
        test = 'hello world! and Python too ;)'.encode('utf-8')
        crc = binascii.crc32(test)
        print('CRC:', crc)
        hex_str = crc2hex(crc)
        print('CRC in hex:', hex_str)
        print('in byte representation:', bytes.fromhex(hex_str))

An MD5 sum is 16 bytes long.

CRC32 (4 bytes) is recommended. It is enough to detect inconsistencies.


## Another insane idea: tape archiving

Who the hell uses tapes these days? ;)
If the first block is filled with FF hex bytes, it means the header is at the
end of the archive file; "trailer" is the more accurate term ;)
This is for the case when you archive to tape and cannot rewind to
write the header at the beginning of the file.
In that case, the header (at the end of the file) is stored in REVERSE order,
so the last 4 bytes tell where the header begins, and there is no need to search for it.
Simply read the last 4 bytes, determine where the header begins (reverse order!),
and read those blocks. Reverse the byte order, and that way you can process
it normally.