*** First steps ***

We need to do some initial research to determine whether tar can satisfy our design requirements.

1. Find out what needs to be done for multi-volume support in the Python tarfile library. GNU tar already supports this. [How did duplicity solve this?]

Wadobo: Indeed, GNU tar supports multi-volume archives. Multi-volume tars have some limitations; for example, they cannot be compressed. Implementing multi-volume support in the Python tarfile library shouldn't be very difficult, but we would first need to study the multi-volume format.

Duplicity works by creating fixed-size volumes in which the archived files are stored. If a file doesn't fit in the current volume, it is split between the current volume and the next. Duplicity generates an external manifest file that records which file was split across the end of one volume and the beginning of the next. This is how they seem to keep track of the split files and of the volumes themselves.

Intra2net: We could implement multi-volume support by splitting the compressed tar archive once it has reached the volume size limit and treating it later on like one big virtual volume. The volumes could be encrypted, too. -> No need for a (fragile?) manifest file. I think archiving to tape pretty much works like this with GNU tar. In case of an emergency you could just "cat" those files together and start unpacking. If the first volume of a split file is gone, we still have the "emergency recover" tool, which can extract the files from a given position / search for the first marker.

Wadobo: I had not thought about it this way, but you're right: having a manifest file makes you dependent on it. Looking more closely at the tar format, it is now quite clear to me that a manifest is not really needed. The tar format divides an archive into blocks, and each series of blocks starts with a header block. This header indicates, among other things, whether the entry is the continuation of a file that started in another volume, and at which offset, so a file split across two volumes is easy to recognize. In fact, each individual volume of a multi-volume tar archive can be treated as if it were a complete tar archive; the GNU tar command supports this. This works well unless you have to extract a multi-volume file that started in a previous volume.

We just need to implement multi-volume support in the Python tarfile library, which doesn't seem complicated, as it already supports opening a stream for reading/writing.

2. Design how we are going to split huge files over multiple volumes and recognize them later on. [How did duplicity solve this? There are unit tests for this.]

Wadobo: The duplicity method described in the previous question seems to do the trick without resorting to magic markers and escaping, so I would suggest doing that. Here is an excerpt of a manifest:

    Volume 1:
        StartingPath .
        EndingPath "Ultimate\x20Lite\x20x86\x20Faster\x20June\x202012.iso" 463
        Hash SHA1 02d12203ce728f70a846f87aeff08b1ed92f6148
    Volume 2:
        StartingPath "Ultimate\x20Lite\x20x86\x20Faster\x20June\x202012.iso" 464
        EndingPath "Ultimate\x20Lite\x20x86\x20Faster\x20June\x202012.iso" 861
        Hash SHA1 2299f5afc7f41f66f5bd12061ef68962d3231e34

Intra2net: Answered above: a manifest is probably not needed.

3. How can we restart the gzip compression for every file in the tar archive?
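One possible mechanism, sketched here purely as an assumption to make the question concrete: the gzip format allows several compressed members in a single file, and decompression joins them transparently, so the compressor can simply be closed and reopened at every file boundary.

    import gzip

    def write_with_restarts(out_path, chunks):
        # Each chunk (e.g. the header + payload blocks of one tar
        # entry) becomes an independent gzip member in the same
        # output file, so a damaged member only affects the data
        # up to the next member boundary.
        with open(out_path, "wb") as out:
            for chunk in chunks:
                gz = gzip.GzipFile(fileobj=out, mode="wb")
                gz.write(chunk)
                gz.close()  # ends this member; "out" stays open

    # Reading back: gzip.open(out_path, "rb").read() returns the
    # concatenation of all chunks.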
Wadobo: duplicity has already proposed a new better-than-tar file format [1], which might be an interesting way to start collaborating with them: giving them some feedback/input on their proposal based on our needs, and perhaps implementing a generalized solution that covers all the use cases. In that document they describe the problems of tar, one of which is that tar can only be wrapped on the outside, hence the tar.bz2 or tar.gpg formats. We could instead force all files to be compressed and encrypted on the inside; this is doable. The process would be: each file is compressed, encrypted, and then put into the tar. One problem that could arise is that with too many small files, compressing and encrypting each one might be quite slow and take more space than needed. On the other hand, one would probably still have to split large files. A compromise solution could be to simply use a good volume size and try to avoid splitting any file smaller than the volume size across two volumes, which can easily be done with duplicity.

Intra2net: We want the compression / encryption on top of tar, since we don't want to leak the file names / directory structure in the encryption case.

Wadobo: Fine by me: then it's clear that neither duplicity's new format nor its current file format is good for us, unless we remove the need for the manifest.

--
[1] http://duplicity.nongnu.org/new_format.html

4. Have a look at duplicity and evaluate if it can be adapted in an upstream-compatible way to our needs. Maybe it can be tweaked to always produce "full snapshots" for the differential backup scenario?

Wadobo: The code of duplicity seems quite well documented and sufficiently structured and flexible, so even if there's no direct support for producing full snapshots in the incremental mode, it seems doable without having to break things. From my inspection of the code, the function that decides whether a file is diffed with rsync or snapshotted during a diff-dir backup is get_delta_path in duplicity/diffdir.py. Given that, I would propose sending a message to the mailing list asking what they think about adding this, whether they would be willing to accept the feature upstream, and in any case whether someone has already tried this before and has any suggestions, before we start coding it.

Intra2net: Please go ahead and ask two things on the duplicity mailing list:
- (as you proposed) What do they think about the diff backup mode?
- What do they think about restarting the gzip compression / encryption on the stream level at each file "boundary"?
If they don't like this, then our future is already sealed :o)

Wadobo: To encrypt/compress each "file boundary" (a "file entry" in GNU tar terminology [1]), including both the header blocks with the file path, to conceal that information, and the file data contained in the "payload" data blocks: this is probably what you are proposing, right? But the tar format is so simple that what you're describing is technically the same as compressing a tar file that contains only one file entry. That's because a tar file has no special initial header, and its end-of-file marker, which consists of two 512-byte blocks of zero bytes, is optional.
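A small sketch of the equivalence just described, as we understand it (encryption, e.g. a GnuPG pipe, is left out for brevity):

    import gzip
    import io
    import tarfile

    def entry_as_tiny_tar_gz(path):
        # One "file entry" (header block + payload blocks) written
        # through tarfile is itself a valid one-member tar archive,
        # including the optional end-of-file marker added on close.
        raw = io.BytesIO()
        with tarfile.open(fileobj=raw, mode="w") as tar:
            tar.add(path)
        # Compressing "per file entry" is therefore just gzipping a
        # one-entry tar; the result would then go on to encryption.
        packed = io.BytesIO()
        gz = gzip.GzipFile(fileobj=packed, mode="wb")
        gz.write(raw.getvalue())
        gz.close()
        return packed.getvalue()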
So technically, if I understand correctly, what you're proposing is the same as creating a tar file and then compressing/encrypting it, which is fine by me. When the tarball is big, we could use multi-volume tars, so each file entry is divided across different volumes (which are in fact also tar files, containing sections of a file).

Of course, this would generate lots of files in the backup dir, one per backed-up file (or more, if it was split via multi-volume), and this is not good. But we know a solution for that: on top of those tar.gz.gpg files, we can put *another* tar container. This outer container would have headers that are not concealed, because they contain nothing more than the name of each volume, and not the name (or any other sensitive information) of the compressed/encrypted file inside. This could be just one very big tar file, split via multi-volume, with lots of differently sized files, which are themselves tar files containing encrypted/compressed files. This would meet the requirements, I think.

For that idea, the conclusion is that it can be done easily with available tools: the tarfile Python library can create tar.gz files, and it's also easy to encrypt tarballs with it, as duplicity already does. The only missing part would be multi-volume encryption. For optimization, we would also have to add an option to omit the end-of-file marker.

Maybe I have not understood well what you were trying to say; please tell me if that's the case.

[1] http://www.gnu.org/software/tar/manual/html_node/Standard.html

*** New items to check ***

- Please investigate how the compression can be restarted in tarlib on the global stream level for each file boundary. We need this for our "own" solution and/or for duplicity. Only this gives good data integrity.

Wadobo: You mean tarfile, right? [1] Anyway, assuming my analysis above is correct, this can be done easily using the "w:" and "w:gz" write modes (same for reading) and calling TarFile.addfile(TarInfo.frombuf(buf), f), with "f" being an encrypted data file/stream. Stream support in tarfile means that you just provide a fileobj whose read/write functions tarfile will call to create/read tar archives; it's up to those read/write functions to do the rest.

[1] http://docs.python.org/2/library/tarfile.html

- We could use a tar format variant like GNU tar or "pax". Please evaluate the pros / cons of other tar format variants. pax e.g. seems to store the file owner names etc. IIRC Fedora added xattr support to tar.

Wadobo: pax indeed seems to be the most powerful tar format, and it has been standardized by IEEE. It's supported by the tarfile Python lib; I would recommend using it (see the short selection sketch after this list). Its advantages: it allows storing hard links, encoding/charset information, long paths and bigger file sizes, all thanks to its extended header. The GNU tar format is also supported by tarfile and supports long names. As you note, some distros have added extended-attribute support to tar; in openSUSE, for example, this is not the case. The situation is similar with pax: by default it doesn't support extended attributes, but some vendors have patched it in (for example Solaris). The tarfile Python lib does not support extended attributes either, but it shouldn't be very difficult to add.

- duplicity is GPL. We intend to add the "archive layer" later on to our backup subsystem, which means the whole subsystem would have to become GPL, too. That's something I have not made up my mind about, but it would block that road. OTOH we save a bit of development time by using the duplicity basis.

Wadobo: That's true.
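As referenced above, a minimal sketch of selecting the pax variant with tarfile (file names are only examples):

    import tarfile

    # tarfile can emit the format variants discussed above:
    # PAX_FORMAT enables the extended headers (long paths, large
    # sizes, charset info); GNU_FORMAT is the GNU variant with
    # long-name support.
    with tarfile.open("backup.tar", "w", format=tarfile.PAX_FORMAT) as tar:
        tar.add("some-file")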
Wadobo (cont.): We seem to have a different use case than the one covered by duplicity, so while we can perhaps learn some tricks from their code, it might not be such a big advantage to start from duplicity.

*** Later on ***

X. Collect feedback from Intra2net. If all is fine, design the class layout for all the design requirements and have another round of feedback.
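As one possible starting point for that class layout (all names below are hypothetical and only meant to frame the discussion):

    import tarfile

    class MultiVolumeTarFile(tarfile.TarFile):
        # Hypothetical subclass covering item 1: close the current
        # volume when max_volume_size is reached and ask the
        # new_volume_handler callback to open the next one.
        def __init__(self, *args, **kwargs):
            self.max_volume_size = kwargs.pop("max_volume_size", None)
            self.new_volume_handler = kwargs.pop("new_volume_handler", None)
            tarfile.TarFile.__init__(self, *args, **kwargs)

    class RestartingStream(object):
        # Hypothetical fileobj for tarfile's stream mode that would
        # restart gzip compression / encryption at each file
        # boundary, as discussed under "New items to check".
        def write(self, data):
            raise NotImplementedError("design sketch only")

        def close(self):
            pass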