Commit | Line | Data |
---|---|---|
414c0518 TJ |
1 | *** First steps *** |
2 | We need to do some initial research to determine whether tar | |
3 | can satisfy our design requirements. | |
4 | ||
5 | 1. Find out what needs to be done for multi volume | |
6 | support in python tarlib. GNU tar already supports this. | |
7 | ||
8 | [How did duplicity solve this?] | |
9 | ||
6ae78488 TJ |
10 | Wadobo: |
11 | Indeed, GNU tar supports multivolume archives. Multivolume tars have some | |
12 | limitations, for example they cannot be compressed. Implementing support for | |
13 | multivolume tars in python tarlib shouldn't be very difficult, but it would | |
14 | require studying the multivolume format. | |
15 | ||
16 | Duplicity works by creating fixed-size volumes in which the files being | |
17 | archived are stored. If a file doesn't fit in the current volume, it's split | |
18 | between the current volume and the next. Duplicity generates an external | |
19 | manifest file that records which file was split at the end of one volume and | |
20 | continues at the beginning of the next. This is how they seem to keep track | |
21 | of the split files and the volumes themselves. | |
22 | ||
23 | Intra2net: | |
24 | We could implement multi volume by splitting | |
25 | the compressed tar archive once it has reached | |
26 | the volume size limit and treat it later on like one | |
27 | big virtual volume. The volumes could be encrypted, too. | |
28 | ||
29 | -> No need for a (fragile?) manifest file. I think archiving to tape | |
30 | pretty much works like this with gnu tar. | |
31 | ||
32 | In case of an emergency you could just "cat" | |
33 | those files together and start unpacking it. | |
34 | ||
35 | In case the first volume of a split file is gone, | |
36 | we still have the "emergency recover" tool which | |
37 | can extract the files from a given position / search | |
38 | for the first marker. | |
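Intra2net's idea above can be sketched in a few lines of Python. This is only an illustration with hypothetical file names: it splits an existing compressed archive into fixed-size chunks, and "cat"-style concatenation of the chunks restores the original byte stream.

```python
VOLUME_SIZE = 1024 * 1024  # 1 MiB per volume (illustrative value)

def split_archive(path, volume_size=VOLUME_SIZE):
    """Split a compressed tar archive into fixed-size volume files."""
    volumes = []
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(volume_size)
            if not chunk:
                break
            name = "%s.vol%03d" % (path, index)
            with open(name, "wb") as dst:
                dst.write(chunk)
            volumes.append(name)
            index += 1
    return volumes

def join_volumes(volumes, out_path):
    """The 'cat' emergency recovery: concatenate volumes back together."""
    with open(out_path, "wb") as dst:
        for name in volumes:
            with open(name, "rb") as src:
                dst.write(src.read())
```

Since the split is done on the raw byte stream after compression, the set of volumes behaves like one big virtual archive, exactly as described above.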
884e7f6a | 39 | |
6916b581 ERE |
40 | Wadobo: |
41 | I had not thought about it this way, but you're right: having a manifest | |
42 | file makes you dependent on it. Looking more closely at the tar format, it's | |
43 | now quite clear to me that a manifest is not really needed. | |
44 | ||
45 | The tar format works by dividing a tar file into blocks. Each series of | |
46 | blocks has a block header. Among other things, this header indicates | |
47 | whether it's a continuation of a file that started in another volume, and | |
48 | at which offset, so one can easily recognize when a file has been split | |
49 | across two volumes. In fact, one can treat each individual volume of a | |
50 | multivolume tar archive as if it were a complete tar archive; the GNU tar | |
51 | command supports this. This works well unless you have to extract a | |
52 | multivolume file that started in a previous volume. We just need to | |
53 | implement multivolume support in python tarlib, which doesn't seem | |
54 | complicated to do, as it already supports opening a stream for read/write. | |
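As a rough illustration of the block structure described above, here is a minimal sketch (not a full parser) that walks the raw 512-byte header blocks of a tar file and reports each entry's type flag. GNU tar marks the continuation of a file split from a previous volume with type flag 'M' (`tarfile.GNUTYPE_MULTIVOL`), which is how a continuation entry can be recognized without any manifest.

```python
BLOCK_SIZE = 512

def entry_types(path):
    """Return (name, typeflag) for each header block in a tar file.

    Payload blocks are skipped by reading the octal size field, so only
    header blocks are inspected. GNU tar uses typeflag 'M'
    (tarfile.GNUTYPE_MULTIVOL) for a file continued from a previous volume.
    """
    entries = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if len(block) < BLOCK_SIZE or block == b"\0" * BLOCK_SIZE:
                break  # end-of-archive marker or truncated file
            name = block[0:100].split(b"\0", 1)[0].decode("utf-8")
            # size field: 12 bytes of octal ASCII at offset 124
            size_field = block[124:136].replace(b"\0", b" ").strip() or b"0"
            size = int(size_field, 8)
            typeflag = chr(block[156])
            entries.append((name, typeflag))
            # skip the payload, rounded up to a whole number of blocks
            f.seek((size + BLOCK_SIZE - 1) // BLOCK_SIZE * BLOCK_SIZE, 1)
    return entries
```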
55 | ||
414c0518 TJ |
56 | 2. Design how we are going to split huge files over multiple volumes |
57 | and recognize them later on. | |
58 | ||
59 | [How did duplicity solve this? There are unit tests for this] | |
60 | ||
6ae78488 | 61 | Wadobo: |
884e7f6a ERE |
62 | |
63 | The method used by duplicity, described in the previous question, seems to do | |
64 | the trick without having to resort to magic markers and escaping, so I would | |
65 | suggest doing that. | |
66 | ||
67 | Here is an excerpt of a manifest: | |
68 | ||
69 | Volume 1: | |
70 | StartingPath . | |
71 | EndingPath "Ultimate\x20Lite\x20x86\x20Faster\x20June\x202012.iso" 463 | |
72 | Hash SHA1 02d12203ce728f70a846f87aeff08b1ed92f6148 | |
73 | Volume 2: | |
74 | StartingPath "Ultimate\x20Lite\x20x86\x20Faster\x20June\x202012.iso" 464 | |
75 | EndingPath "Ultimate\x20Lite\x20x86\x20Faster\x20June\x202012.iso" 861 | |
76 | Hash SHA1 2299f5afc7f41f66f5bd12061ef68962d3231e34 | |
77 | ||
6ae78488 TJ |
78 | Intra2net: |
79 | Answered above: Manifest is probably not needed. | |
80 | ||
414c0518 TJ |
81 | 3. How can we restart the gzip compression for every file in the tar archive? |
82 | ||
6ae78488 | 83 | Wadobo: |
884e7f6a ERE |
84 | |
85 | duplicity has already proposed a new better-than-tar file format [1], which | |
86 | might be an interesting way to start collaborating with them: giving them some | |
87 | feedback/input on their proposal based on our needs, and perhaps implementing a | |
88 | generalized solution that could allow for all the use cases. In | |
89 | that document they describe the problems of tar, one of which is that a tar | |
90 | archive can only be wrapped on the outside, hence the tar.bz2 or tar.gpg formats. | |
91 | ||
92 | We could just force all files to be compressed and encrypted on the inside; | |
93 | this is doable. So the process would be: each file would be compressed, | |
94 | encrypted and put in the tar. | |
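The per-file process described here can be sketched with the standard tarfile and gzip modules. Encryption is left out of the sketch (a real implementation would run e.g. GnuPG over each compressed blob before adding it); all names are illustrative.

```python
import gzip
import io
import tarfile

def add_compressed(tar, name, data):
    """Compress one file's data individually and store it in the tar.

    In the real tool the blob would also be encrypted here, after
    compression; this sketch only gzips the payload.
    """
    blob = gzip.compress(data)
    info = tarfile.TarInfo(name + ".gz")
    info.size = len(blob)
    tar.addfile(info, io.BytesIO(blob))

def build_archive(path, files):
    """files: iterable of (name, bytes) pairs."""
    # The outer tar is NOT compressed: compression happens per file inside.
    with tarfile.open(path, "w") as tar:
        for name, data in files:
            add_compressed(tar, name, data)
```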
95 | ||
96 | One problem that could arise is that if you have many small | |
97 | files, this process of compressing & encrypting each file might be quite slow | |
98 | and take more space than needed. On the other hand, one would probably still | |
99 | have to split large files. | |
100 | ||
101 | A compromise solution could be to simply use a good volume size and avoid | |
102 | splitting any file smaller than the volume size across two volumes, | |
103 | which can easily be done with duplicity. | |
104 | ||
6ae78488 TJ |
105 | Intra2net: |
106 | We want the compression / encryption on top of tar since | |
107 | we don't want to leak the file names / directory | |
108 | structure in the encryption case. | |
109 | ||
6916b581 ERE |
110 | Wadobo: |
111 | Fine by me: then it's clear that duplicity's new format is not good for us, | |
112 | and the current file format isn't either, unless we remove the need for the | |
113 | manifest. | |
114 | ||
884e7f6a ERE |
115 | -- |
116 | [1] http://duplicity.nongnu.org/new_format.html | |
117 | ||
414c0518 TJ |
118 | 4. Have a look at duplicity and evaluate if it can be adapted |
119 | in an upstream-compatible way to our needs | |
120 | ||
121 | May be it can be tweaked to always produce "full snapshots" | |
122 | for the differential backup scenario? | |
123 | ||
884e7f6a ERE |
124 | The code of duplicity seems quite well documented and sufficiently structured |
125 | and flexible, so that even if there's no direct support for producing full | |
126 | snapshots in incremental mode, it seems doable without having to resort | |
127 | to breaking things. | |
128 | ||
129 | From my inspection of the code, it seems that the function that determines | |
130 | whether a file is diffed with rsync or snapshotted when doing a diff dir | |
131 | backup is get_delta_path in duplicity/diffdir.py. | |
132 | ||
133 | Given that, I would propose sending a message to the mailing list asking what | |
134 | they think about adding that, whether they would be willing to add this feature | |
135 | upstream, and in any case, whether someone has already tried to do this | |
136 | before and whether they have any suggestions before we start trying to code it. | |
137 | ||
6ae78488 TJ |
138 | Intra2net: |
139 | Please go ahead and ask on the duplicity mailing list two things: | |
140 | - (like you proposed) What do they think about the diff backup mode? | |
141 | ||
6916b581 | 142 | |
6ae78488 TJ |
143 | - What do they think about restarting the gzip compression / encryption |
144 | on the stream level at each file "boundary". | |
145 | If they don't like this, then our future is already sealed :o) | |
146 | ||
6916b581 ERE |
147 | Wadobo: |
148 | ||
149 | To encrypt/compress each "file boundary" (a "file entry" in GNU Tar | |
150 | terminology [1]), including both the header blocks with the file path, to | |
151 | conceal that information, and the file data, which is contained in "payload" | |
152 | data blocks - this is probably what you are proposing, right? | |
153 | ||
154 | But the thing is, the tar format is so easy/dumb that what you're describing | |
155 | is in fact technically the same as compressing a tar file - a tar file that | |
156 | contains only one file entry. That's because a tar file doesn't have any | |
157 | special initial header, and while it does have an end-of-file marker, which | |
158 | consists of two 512-byte blocks of zero bytes, that marker is optional. | |
159 | ||
160 | So technically, if I understand correctly, what you're proposing is the same | |
161 | as creating a tar file and then compressing/encrypting it, which is fine | |
162 | by me. When the tarball is big, we could use multivolume tars, | |
163 | so each file entry is divided across different volumes (which are in fact | |
164 | also tar files, containing sections of a file). | |
165 | ||
166 | Of course, this would generate lots of files in the backup dir, one per | |
167 | backed-up file (or more, if it was split via multivolume), and this is not | |
168 | good. But we know a solution for that: on top of those tar.gz.gpg files, we | |
169 | can put *another* tar container. This container would have headers that are | |
170 | not concealed, because they contain no more than the name of the volume, and | |
171 | not the name (or any other sensitive information) of the compressed / | |
172 | encrypted file. This could be just one very big tar file, split via | |
173 | multivolume, with lots of different-sized files, which are themselves tar | |
174 | files containing encrypted/compressed files. This would meet the | |
175 | requirements, I think. | |
176 | ||
177 | For that idea, the conclusion is that this can be done easily with available | |
178 | tools: the tarfile python library allows creating tar.gz files, and it's also | |
179 | easy to encrypt tarballs with it, as duplicity already does. The only part | |
180 | missing would be multivolume encryption. For optimization, we would also | |
181 | have to add an option to remove the end-of-file marker. | |
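The nested-container idea above can be sketched as follows: each file is packed into its own single-entry tar.gz (which would also be encrypted in the real tool), and those blobs are then stored in an outer tar whose headers reveal only opaque volume names. Everything here is illustrative.

```python
import io
import tarfile

def pack_inner(name, data):
    """Create an in-memory single-entry tar.gz for one file.

    In the real tool this blob would additionally be encrypted, so the
    outer container never exposes the original file name.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as inner:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        inner.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def build_container(path, files):
    """Outer tar: headers carry only opaque names like 'vol-0000.tar.gz'."""
    with tarfile.open(path, "w") as outer:
        for i, (name, data) in enumerate(files):
            blob = pack_inner(name, data)
            info = tarfile.TarInfo("vol-%04d.tar.gz" % i)
            info.size = len(blob)
            outer.addfile(info, io.BytesIO(blob))
```

Listing the outer container shows only the volume names, while the real file names stay inside the (would-be encrypted) inner archives.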
182 | ||
183 | Maybe I have not understood well what you were trying to say; please tell | |
184 | me if that's the case. | |
185 | ||
186 | [1] http://www.gnu.org/software/tar/manual/html_node/Standard.html | |
187 | ||
6ae78488 TJ |
188 | *** New items to check *** |
189 | - please investigate how the compression can be restarted in tarlib | |
190 | on the global stream level for each file boundary. | |
191 | We need this for our "own" solution and/or for duplicity. | |
192 | Only this gives good data integrity. | |
193 | ||
6916b581 ERE |
194 | Wadobo: |
195 | ||
196 | You mean tarfile, right? [1] Anyway, assuming I'm correct in the analysis | |
197 | above, this can be done easily using the "w:" and "w:gz" write modes | |
198 | (same for read) and doing TarFile.addfile(TarInfo.frombuf(f)), where "f" is | |
199 | an encrypted data file/stream. | |
200 | ||
201 | Stream support in tarfile means that you just provide a fileobj with | |
202 | read/write functions that tarfile will call to create/read tar files. It's | |
203 | up to those read/write functions to do the rest. | |
204 | ||
205 | [1] http://docs.python.org/2/library/tarfile.html | |
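The fileobj hook described above can be sketched like this: any object with a write() method can sit under tarfile's stream mode ("w|gz"), so a wrapper could transparently encrypt, or start a new volume when a size limit is reached. The XOR "cipher" below is a placeholder for real encryption, nothing more.

```python
import io
import tarfile

class XorWriter:
    """Toy write-through wrapper standing in for an encrypting fileobj.

    tarfile calls write() with chunks of the raw (gzipped) tar stream;
    a real implementation would encrypt here, or open a new volume file
    once a size limit is reached.
    """
    def __init__(self, raw, key=0x5A):
        self.raw = raw
        self.key = key

    def write(self, data):
        self.raw.write(bytes(b ^ self.key for b in data))
        return len(data)

def write_stream(fileobj, files):
    """Write files into a gzip-compressed tar stream through the wrapper."""
    with tarfile.open(fileobj=XorWriter(fileobj), mode="w|gz") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
```

Reading works the same way in reverse: a wrapper that decrypts in read() can be handed to tarfile.open(..., mode="r|gz").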
206 | ||
6ae78488 TJ |
207 | - we could use a tar format variant like GNU tar or "pax". |
208 | Please evaluate the pros / cons of other tar format variants. | |
209 | ||
210 | pax, for example, seems to store the file owner names etc. | |
211 | IIRC Fedora added xattr support to tar. | |
212 | ||
6916b581 ERE |
213 | Wadobo: |
214 | ||
215 | pax indeed seems to be the most powerful tar format, and it's been | |
216 | standardized by IEEE. It's supported by the tarfile python lib, and I would | |
217 | recommend using it. As advantages, it allows storing hard links, encoding | |
218 | and charset information, long paths and bigger file sizes. This is all | |
219 | thanks to its extended headers. | |
220 | ||
221 | The GNU Tar format is also supported by tarfile and supports long names. | |
222 | ||
223 | As you note, support for extended attributes has been added to tar by some | |
224 | distros, though this is not the case in openSUSE, for example. The situation | |
225 | is similar with pax: by default it doesn't support them, but some vendors | |
226 | have patched it (for example, Solaris). | |
227 | ||
228 | The tarfile python lib does not support extended attributes, but it | |
229 | shouldn't be very difficult to add. | |
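tarfile lets you pick the format explicitly. Here is a minimal sketch of writing a PAX archive, using per-file pax_headers to carry extra metadata; the "user.comment" record below only mimics how xattr-like data could be stored, it is not real xattr support.

```python
import io
import tarfile

def write_pax(path, name, data, extra):
    """Write one file as a PAX archive with custom extended-header records."""
    with tarfile.open(path, "w", format=tarfile.PAX_FORMAT) as tar:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        # Arbitrary keyword/value records go into the extended header;
        # values must be strings. Long paths, large sizes etc. are
        # stored by PAX itself in the same kind of records.
        info.pax_headers = dict(extra)
        tar.addfile(info, io.BytesIO(data))
```

On reading, tarfile exposes the records again via TarInfo.pax_headers, and names longer than the classic 100-character limit round-trip without truncation.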
230 | ||
6ae78488 TJ |
- duplicity is GPL. We intend to add the "archive layer" |
232 | later on to our backup subsystem, which means the | |
233 | whole subsystem would have to become GPL, too. That's something | |
234 | I have not made up my mind about, but it would block that road. | |
235 | OTOH we'd save a bit of development time by using duplicity as a basis. | |
236 | ||
6916b581 ERE |
237 | Wadobo: |
238 | That's true. Our use case seems different from the one covered by | |
239 | duplicity, so while we can perhaps learn some tricks from their code, | |
240 | starting from duplicity might not be such a big advantage. | |
241 | ||
6ae78488 TJ |
242 | *** Later on *** |
243 | X. Collect feedback from Intra2net. If all is fine, | |
414c0518 TJ |
244 | design the class layout for all the design requirements |
245 | and have another round of feedback. |