From: Eduardo Robles Elvira Date: Tue, 11 Jun 2013 13:21:56 +0000 (+0200) Subject: answering some questions X-Git-Tag: v2.2~193 X-Git-Url: http://developer.intra2net.com/git/?a=commitdiff_plain;h=6916b581b0b43a6267da528f9650b0f961562757;p=python-delta-tar answering some questions --- diff --git a/docs/TODO.txt b/docs/TODO.txt index 45ac1fc..49c4d17 100644 --- a/docs/TODO.txt +++ b/docs/TODO.txt @@ -37,6 +37,22 @@ Intra2net: can extract the files from a given position / search for the first marker. +Wadobo: + I had not thought about it this way, but you're right: having a manifest + file makes you dependent on it. Looking more closely at the tar format, it's + now quite clear to me that a manifest is not really needed. + + The tar format works by dividing a tar file into blocks. Each series of blocks + has a block header. This header indicates, among other things, + whether it's a continuation of a file that started in another block, and the + offset, so this is how one can easily recognize when a file has been split + across two volumes. In fact, one can treat each individual volume of a + multivolume tar archive as if it were a complete tar archive. The GNU tar command + supports this. This works well unless you have to extract a multivolume file + that started in a previous volume. We just need to implement multivolume + support in Python's tarfile, which doesn't seem to be a complicated thing to do + as it already has support for opening a stream for read/write. + 2. Design how we are going to split huge files over multiple volumes and recognize them later on. @@ -91,6 +107,11 @@ Intra2net: we don't want to leak the file names / directory structure in the encryption case. +Wadobo: + Fine by me: then it's clear that the new duplicity format is not good for us + and the current file format isn't either, unless we remove the need for the + manifest. 
+ -- [1] http://duplicity.nongnu.org/new_format.html @@ -118,28 +139,106 @@ Intra2net: Please go ahead and ask on the duplicity mailinglist two things: - (like you proposed) What do they think about the diff backup mode? + - What do they think about restarting the gzip compression / encryption on the stream level at each file "boundary". If they don't like this the our future is already sealed :o) +Wadobo: + + To encrypt/compress each "file boundary" (what is called a "file entry" in GNU Tar + terminology [1]), including both the header blocks with the file path, to + conceal that information, and the file data, which is contained in "payload" + data blocks: this is probably what you are proposing, right? + + But the thing is, the tar format is so easy/dumb that what you're describing is + in fact technically the same as compressing a tar file, a tar file that + contains only one file entry. That's because a tar file doesn't have any + special initial header; it does have an end-of-file marker, which + consists of two 512-byte blocks of zero bytes, but it's optional. + + So technically, if I understand correctly, what you're proposing is the same + as creating a tar file and then compressing/encrypting it, which is fine + by me. When the tarball is big, we could use the help of multivolume tars, + so each file entry is divided across different volumes (which are in fact also + tar files, containing sections of a file). + + Of course, this would generate lots of files in the backup dir, one per + backed-up file (or more, if it was split via multivolume), and this is not + good. But we know a solution for that: on top of those tar.gz.gpg files, we can + put *another* tar container. This container would have headers that are not + concealed, because they contain nothing more than the name of the volume, and + not the name (or any other sensitive information) of the compressed / + encrypted file. 
This could just be a very big tar file, split via + multivolume, with lots of differently sized files, which are also tar files + containing encrypted/compressed files. This would meet the requirements, I + think. + + For that idea, the conclusion is that this can be done easily with available + tools: the tarfile Python library can create tar.gz files, and it's also + easy to encrypt the tarballs it produces, as duplicity already does. The only part + missing would be multivolume encryption. For optimization, we would also + have to add an option to remove the end-of-file marker. + + Maybe I have not understood well what you were trying to say; please tell + me if that's the case. + + [1] http://www.gnu.org/software/tar/manual/html_node/Standard.html + *** New items to check *** - please investigate how the compression can be restarted in tarlib on the global stream level for each file boundary. We need this for our "own" solution and/or for duplicity. Only this gives good data integrity. +Wadobo: + + You mean tarfile, right? [1] Anyway, assuming I'm correct in the analysis + above, this can be done easily using the "w:" and "w:gz" write modes + (same for read) and calling TarFile.addfile(tarinfo, f), where "f" is + an encrypted data file/stream and tarinfo is the TarInfo describing it. + + The stream support in tarfile means that you just give a fileobj with the + read/write functions that tarfile will call to create/read tar files. It's up + to you how those read/write functions do the rest. + +[1] http://docs.python.org/2/library/tarfile.html + - we could use a tar format variant like GNU tar or "pax". Please evaluate the pros / cons of other tar format variants. pax f.e. seems to store the file owner names etc. IIRC Fedora added xattr support to tar. +Wadobo: + + pax indeed seems to be the more powerful tar format and it's been + standardized by IEEE. It's supported by the tarfile Python lib, and I would + recommend using it. 
Its advantages: it allows storing + hard links, encoding, charset, long paths and bigger sizes. This is all + thanks to the extended header. + + The GNU tar format is also supported by tarfile and has long names support. + + As you note, support for extended attributes has been + added to tar by some distros, but not by all: in openSUSE, for example, this is not the case. The + case is similar with pax: by default it doesn't support them, but some have + patched it (for example Solaris). + + The tarfile Python lib does not have support for extended attributes, but + it shouldn't be very difficult to add. + - duplicity is GPL. We intend to add the "archive layer" later on to our backup subsystem which means the whole subsystem must become GPL, too. That's something I have not made my mind up, but it would block that road. OTOH we save a bit of development time by using the duplicity basis. +Wadobo: + That's true. We seem to have a different use case than the one covered by + duplicity, so in any case we can perhaps learn some tricks from their code, + but it might not be such a big advantage to start from duplicity. + *** Later on *** X. Collect feedback from Intra2net. If all is fine, design the class layout for all the design requirements
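
The "tar of tar.gz entries" idea from the Wadobo answers above could be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the project's implementation: the function names (pack_entry, build_container, read_container) and the counter-based volume names are hypothetical, everything is kept in memory, multivolume splitting is ignored, and the encryption step is left out entirely (in practice each inner blob would also be encrypted before it goes into the outer container).

```python
import io
import tarfile


def pack_entry(name, data):
    """Return one file entry as a standalone gzip-compressed tar archive (bytes).

    Hypothetical helper: the real design would also encrypt this blob.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz", format=tarfile.PAX_FORMAT) as inner:
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        inner.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def build_container(files):
    """Wrap each packed entry in an outer, uncompressed tar archive.

    Outer member names are plain counters (an assumption of this sketch),
    so the real paths only appear inside the inner archives.
    """
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w:", format=tarfile.PAX_FORMAT) as outer:
        for index, (name, data) in enumerate(sorted(files.items())):
            blob = pack_entry(name, data)
            info = tarfile.TarInfo(name="%06d.tar.gz" % index)
            info.size = len(blob)
            outer.addfile(info, io.BytesIO(blob))
    return out.getvalue()


def read_container(container):
    """Inverse of build_container: return {original_name: data}."""
    result = {}
    with tarfile.open(fileobj=io.BytesIO(container), mode="r:") as outer:
        for member in outer.getmembers():
            blob = outer.extractfile(member).read()
            with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as inner:
                for entry in inner.getmembers():
                    result[entry.name] = inner.extractfile(entry).read()
    return result
```

Because every inner blob is a complete tar.gz of its own, a damaged member of the outer container should only affect the file entry it holds, which is the data-integrity property asked for in the "New items to check" list.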