[python-delta-tar] / docs / TODO.txt
*** First steps ***
We need to do some initial research to determine whether tar
can satisfy our design requirements.

1. Find out what needs to be done for multi-volume
   support in python tarlib. GNU tar already supports this.

   [How did duplicity solve this?]

Wadobo:
   GNU tar does indeed support multi-volume archives. Tar multi-volume has
   some limitations; for example, the volumes cannot be compressed.
   Implementing multi-volume support in python tarlib shouldn't be very
   difficult, though, but it would require studying the multi-volume format
   first.

   Duplicity works by creating fixed-size volumes in which the archived
   files are stored. If a file doesn't fit into the current volume, it is
   split between the current volume and the next. Duplicity generates an
   external manifest file that records which file was split at the end of
   one volume and the beginning of the next. This is how it keeps track of
   the split files and of the volumes themselves.

Intra2net:
   We could implement multi-volume support by splitting
   the compressed tar archive once it has reached
   the volume size limit and treating the parts later on like one
   big virtual volume. The volumes could be encrypted, too.

   -> No need for a (fragile?) manifest file. I think archiving to tape
      pretty much works like this with GNU tar.

   In case of an emergency you could just "cat"
   those files together and start unpacking.

   In case the first volume of a split file is gone,
   we still have the "emergency recover" tool, which
   can extract the files from a given position / search
   for the first marker.
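
   The splitting idea can be sketched with plain file operations. This is a
   minimal illustration, not part of any existing tool; the volume size and
   the ".volNNN" naming are assumptions:

```python
def split_archive(path, volume_size):
    """Split a (compressed) tar archive into fixed-size volume files.

    The volumes carry no format of their own; they are just consecutive
    byte ranges of the original stream.
    """
    volumes = []
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(volume_size)
            if not chunk:
                break
            volume = "%s.vol%03d" % (path, index)
            with open(volume, "wb") as dst:
                dst.write(chunk)
            volumes.append(volume)
            index += 1
    return volumes

def join_volumes(volumes, out_path):
    """The 'emergency' path: equivalent of `cat archive.vol* > archive`."""
    with open(out_path, "wb") as dst:
        for volume in volumes:
            with open(volume, "rb") as src:
                dst.write(src.read())
```

   Because the volumes are raw byte slices, concatenating them in order
   reproduces the original archive exactly, which is what makes the
   "cat and unpack" recovery work.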
Wadobo:
   I had not thought about it this way, but you're right: having a manifest
   file makes you dependent on it. Looking more closely at the tar format,
   it is now quite clear to me that a manifest is not really needed.

   The tar format divides a tar file into blocks, and each series of blocks
   has a block header. Among other things, this header indicates whether
   the entry is the continuation of a file that started in another volume,
   and at which offset, so this is how one can easily recognize that a file
   has been split across two volumes. In fact, each individual volume of a
   multi-volume tar archive can be treated as if it were a complete tar
   archive, and the GNU tar command supports this. This works well unless
   you have to extract a multi-volume file that started in a previous
   volume. We just need to implement multi-volume support in python tarlib,
   which doesn't seem complicated, as it already has support for opening a
   stream for read/write.

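
   The per-entry header structure described above can be poked at directly
   with Python's tarfile module. A minimal sketch (the entry name and size
   are made up) that builds one 512-byte header block and parses it back,
   the way a volume scanner would:

```python
import tarfile

# Build a 512-byte tar header block for a hypothetical entry.
info = tarfile.TarInfo(name="example.txt")
info.size = 1234
block = info.tobuf(format=tarfile.GNU_FORMAT)

# Parse it back, as one would when scanning a volume for entry headers.
parsed = tarfile.TarInfo.frombuf(block, encoding="utf-8",
                                 errors="surrogateescape")
```

   The multi-volume continuation information discussed above lives in
   additional header fields of the GNU format; the point here is only that
   every entry starts with a self-describing 512-byte header block.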
2. Design how we are going to split huge files over multiple volumes
   and recognize them later on.

   [How did duplicity solve this? There are unit tests for this]

Wadobo:

The method used by duplicity, described in the previous question, seems to
do the trick without having to resort to magic markers and escaping, so I
would suggest doing that.

Here is an excerpt of a manifest:

Volume 1:
    StartingPath .
    EndingPath "Ultimate\x20Lite\x20x86\x20Faster\x20June\x202012.iso" 463
    Hash SHA1 02d12203ce728f70a846f87aeff08b1ed92f6148
Volume 2:
    StartingPath "Ultimate\x20Lite\x20x86\x20Faster\x20June\x202012.iso" 464
    EndingPath "Ultimate\x20Lite\x20x86\x20Faster\x20June\x202012.iso" 861
    Hash SHA1 2299f5afc7f41f66f5bd12061ef68962d3231e34

Intra2net:
   Answered above: a manifest is probably not needed.

3. How can we restart the gzip compression for every file in the tar archive?

Wadobo:

Duplicity has already proposed a new better-than-tar file format [1], which
might be an interesting way to start collaborating with them: giving them
some feedback/input on their proposal based on our needs, and perhaps
implementing a generalized solution that covers all the use cases. In that
document they describe the problems of tar, one of which is that tar only
allows wrapping on the outside, hence the tar.bz2 or tar.gpg formats.

We could simply force all files to be compressed and encrypted on the
inside; this is doable. The process would be: each file is compressed,
encrypted and then put into the tar.

One problem that could arise is that with too many small files, compressing
and encrypting each file individually might be quite slow and take more
space than needed. On the other hand, one would probably still have to
split large files.

A compromise could be to simply use a good volume size and try to avoid
splitting any file smaller than the volume size across two volumes, which
can easily be done with duplicity.

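A sketch of the per-file scheme described above, using only the standard
library; gzip stands in for the compress step and the encrypt step is
omitted (the helper name and file names are illustrative):

```python
import gzip
import io
import tarfile

def add_compressed(tar, name, data):
    """Compress one file's bytes individually, then store the result as
    its own entry in the enclosing tar. (Encryption would be applied to
    `payload` after compression; omitted here.)"""
    payload = gzip.compress(data)
    info = tarfile.TarInfo(name=name + ".gz")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

buf = io.BytesIO()
# Plain (uncompressed) outer tar: each entry is already compressed.
with tarfile.open(fileobj=buf, mode="w") as tar:
    add_compressed(tar, "a.txt", b"hello " * 100)
    add_compressed(tar, "b.txt", b"world " * 100)
```

Each entry pays its own gzip header plus tar block padding, which is
exactly the small-files overhead mentioned above.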
Intra2net:
   We want the compression / encryption on top of tar since
   we don't want to leak the file names / directory
   structure in the encryption case.

Wadobo:
   Fine by me: then it's clear that duplicity's new format is not good for
   us, and the current file format isn't either, unless we remove the need
   for the manifest.

--
[1] http://duplicity.nongnu.org/new_format.html

4. Have a look at duplicity and evaluate whether it can be adapted
   in an upstream-compatible way to our needs.

   Maybe it can be tweaked to always produce "full snapshots"
   for the differential backup scenario?

The code of duplicity seems quite well documented and sufficiently
structured and flexible, so even if there is no direct support for
producing full snapshots in incremental mode, it seems doable without
having to resort to breaking things.

From my inspection of the code, the function that decides whether a file
will be diffed with rsync or snapshotted when doing a diff-dir backup is
get_delta_path in duplicity/diffdir.py.

Given that, I would propose sending a message to the mailing list asking
what they think about adding this, whether they would be willing to add the
feature upstream, and in any case whether someone has already tried this
before and has any suggestions, before we start trying to code it.

Intra2net:
   Please go ahead and ask two things on the duplicity mailing list:
   - (like you proposed) What do they think about the diff backup mode?

   - What do they think about restarting the gzip compression / encryption
     on the stream level at each file "boundary"?
     If they don't like this, then our future is already sealed :o)

Wadobo:

   You propose to encrypt/compress each "file boundary" (what GNU tar
   terminology [1] calls a "file entry"), and that includes both the header
   blocks with the file path, to conceal that information, and the file
   data, which is contained in "payload" data blocks - is that right?

   But the thing is, the tar format is so easy/dumb that what you're
   describing is in fact technically the same as compressing a tar file - a
   tar file that contains only one file entry. That's because a tar file
   doesn't have any initial special header; it does have an end-of-file
   marker, which consists of two 512-byte blocks of zero bytes, but that
   marker is optional.

   So technically, if I understand correctly, what you're proposing is the
   same as creating a tar file and then compressing/encrypting it, which is
   fine by me. When the tarball is big, we could use the help of
   multi-volume tars, so each file entry is divided over different volumes
   (which are in fact also tar files, containing sections of a file).

   Of course, this would generate lots of files in the backup dir, one per
   backed-up file (or more, if it was split via multi-volume), and this is
   not good. But we know a solution for that: on top of those tar.gz.gpg,
   we can put *another* tar container. This container would have headers
   that are not concealed, because they contain no more than the name of
   the volume, and not the name (or any other sensitive information) of the
   compressed / encrypted file. This could be just a very big tar file,
   split via multi-volume, with lots of different-sized files, which are
   also tar files containing encrypted/compressed files. This would meet
   the requirements, I think.

   For that idea, the conclusion is that this can be done easily with the
   available tools: the tarfile python library allows creating tar.gz
   files, and it's also easy to encrypt tarballs with it, as duplicity
   already does. The only part missing would be multi-volume encryption.
   For optimization, we would also have to add an option to remove the
   end-of-file marker.

   Maybe I have not understood well what you were trying to say; please
   tell me if that's the case.

   [1] http://www.gnu.org/software/tar/manual/html_node/Standard.html

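   The two-layer layout sketched in this answer (inner concealed per-file
   tar.gz volumes, outer tar whose headers expose only volume names) can be
   illustrated with tarfile alone. The encryption step between the layers
   is omitted and all names are invented:

```python
import io
import tarfile

def make_inner_volume(name, data):
    """Create a single-entry tar.gz in memory: the 'concealed' container
    holding one file's name and data. (Encryption, e.g. gpg, would be
    applied to these bytes afterwards; omitted in this sketch.)"""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

# Outer container: entry names are just opaque volume names, so no
# sensitive file names leak from the outer headers.
outer = io.BytesIO()
with tarfile.open(fileobj=outer, mode="w") as tar:
    for i, (name, data) in enumerate([("secret-report.txt", b"top secret"),
                                      ("notes.txt", b"some notes")]):
        blob = make_inner_volume(name, data)
        info = tarfile.TarInfo(name="volume-%03d.tar.gz" % i)
        info.size = len(blob)
        tar.addfile(info, io.BytesIO(blob))
```

   Only the outer archive would then be split multi-volume style; its
   headers reveal nothing but the volume numbering.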
*** New items to check ***
- please investigate how the compression can be restarted in tarlib
  on the global stream level for each file boundary.
  We need this for our "own" solution and/or for duplicity.
  Only this gives good data integrity.

Wadobo:

   You mean tarfile, right? [1] Anyway, assuming my analysis above is
   correct, this can be done easily using the "w:" and "w:gz" write modes
   (same for reading) and calling TarFile.addfile(TarInfo.frombuf(f)), with
   "f" being an encrypted data file/stream.

   The stream support in tarfile means that you just provide a fileobj
   with the read/write functions that tarfile will call to create/read tar
   files. It's up to you what those read/write functions do beyond that.

[1] http://docs.python.org/2/library/tarfile.html

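A sketch of that stream support: tarfile's "w|gz" pipe mode only ever calls
write() on the supplied fileobj, so a custom object can, for example, count
bytes to decide when to cut a volume. The CountingWriter class below is
illustrative, not an existing API:

```python
import io
import tarfile

class CountingWriter:
    """Minimal file-like object: tarfile's stream mode only needs write().
    Here we also count bytes, e.g. to decide when to start a new volume."""
    def __init__(self):
        self.buf = io.BytesIO()
        self.written = 0
    def write(self, data):
        self.written += len(data)
        return self.buf.write(data)
    def close(self):
        pass

out = CountingWriter()
# "w|gz": gzip'd stream mode, writing through our fileobj.
with tarfile.open(fileobj=out, mode="w|gz") as tar:
    data = b"hello stream"
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
```

The resulting bytes in `out.buf` are an ordinary tar.gz that any reader can
open.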
- we could use a tar format variant like GNU tar or "pax".
  Please evaluate the pros / cons of other tar format variants.

  pax, for example, seems to store the file owner names etc.
  IIRC Fedora added xattr support to tar.

Wadobo:

   pax indeed seems to be the most powerful tar format, and it has been
   standardized by IEEE. It is supported by the tarfile python lib, and I
   would recommend using it. As advantages, it allows storing hard links,
   encoding, charset, long paths and bigger file sizes. This is all thanks
   to its extended header.

   The GNU tar format is also supported by tarfile and has long-name
   support.

   As you note, some distros have added support for extended attributes to
   tar; in opensuse, for example, this is not the case. The situation is
   similar with pax: by default it doesn't support them, but some have
   patched it (for example, solaris).

   The tarfile python lib does not have support for extended attributes,
   but it shouldn't be very difficult to add.

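   Requesting the pax format from tarfile is a one-argument change. A small
   sketch showing a path longer than the 100-character ustar name field
   surviving a round trip (file names and contents invented):

```python
import io
import tarfile

buf = io.BytesIO()
# Ask tarfile for POSIX.1-2001 pax archives; pax stores metadata such as
# long paths and large sizes in extended header records.
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
    data = b"x" * 10
    # Longer than the 100-byte ustar name field: pax handles this via an
    # extended 'path' record instead of truncating.
    long_name = "dir/" * 40 + "file.txt"
    info = tarfile.TarInfo(name=long_name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
```

   Reading the archive back needs no special flag; tarfile detects the pax
   extended headers automatically.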
- duplicity is GPL. We intend to add the "archive layer"
  later on to our backup subsystem, which means the
  whole subsystem would have to become GPL, too. That's something
  I have not made up my mind about, but it could block that road.
  OTOH we would save a bit of development time by building on duplicity.

Wadobo:
   That's true. We seem to have a different use case from the one covered
   by duplicity, so while we can perhaps learn some tricks from their code,
   it might not be such a big advantage to start from duplicity.

*** Later on ***
X. Collect feedback from Intra2net. If all is fine,
   design the class layout for all the design requirements
   and have another round of feedback.