can extract the files from a given position / search
for the first marker.
+Wadobo:
+ I had not thought about it this way, but you're right: having a manifest
+ file makes you dependent on it. Looking more closely at the tar format, it
+ is now quite clear to me that a manifest is not really needed.
+
+ The tar format works by dividing a tar file into blocks. Each series of
+ blocks has a block header. Among other things, this header indicates
+ whether it is a continuation of a file that started in another block, and
+ the offset, so this is how one can easily recognize when a file has been
+ split across two volumes. In fact, one can treat each individual volume of
+ a multivolume tar archive as if it were a complete tar archive. The GNU
+ tar command supports this. This works well unless you have to extract a
+ multivolume file that started in a previous volume. We just need to
+ implement multivolume support in python tarlib, which doesn't seem to be a
+ complicated thing to do, as it already has support for opening a stream
+ for read/write.
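+
+ A quick way to see that each volume can stand alone is to concatenate two
+ complete archives and read them back as one. This is only a sketch with
+ the python tarfile library (the file names are made up), not the real
+ multivolume mechanism, but it shows that tar has no global state that
+ would break at a volume boundary:

```python
import io
import tarfile

def make_tar(name, data):
    """Build a complete single-member tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()

# Two independent archives, simply concatenated ...
blob = make_tar("a.txt", b"hello") + make_tar("b.txt", b"world")

# ... still read back as one archive, provided the end-of-archive
# zero blocks of the first one are skipped (ignore_zeros=True).
with tarfile.open(fileobj=io.BytesIO(blob), ignore_zeros=True) as tf:
    names = [m.name for m in tf.getmembers()]
print(names)  # ['a.txt', 'b.txt']
```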
+
2. Design how we are going to split huge files over multiple volumes
and recognize them later on.
we don't want to leak the file names / directory
structure in the encryption case.
+Wadobo:
+ Fine by me: then it's clear that duplicity's new format is not good for
+ us, and the current file format isn't either, unless we remove the need
+ for the manifest.
+
--
[1] http://duplicity.nongnu.org/new_format.html
Please go ahead and ask two things on the duplicity mailing list:
- (like you proposed) What do they think about the diff backup mode?
+
- What do they think about restarting the gzip compression / encryption
on the stream level at each file "boundary".
   If they don't like this, then our future is already sealed :o)
+Wadobo:
+
+ To encrypt/compress each "file boundary" (which is a "file entry" in GNU
+ Tar terminology [1]), including both the header blocks with the file path,
+ to conceal that information, and the file data, which is contained in
+ "payload" data blocks - this is probably what you are proposing, right?
+
+ But the thing is, the tar format is so easy/dumb that what you're
+ describing is in fact technically the same as compressing a tar file - a
+ tar file that contains only one file entry. That's because a tar file
+ doesn't have any initial special header, and while it does have an
+ end-of-file marker, consisting of two 512-byte blocks of zero bytes, that
+ marker is optional.
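+
+ Both properties are easy to verify from python (a small sketch;
+ "secret.txt" is just an example name):

```python
import io
import tarfile

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    info = tarfile.TarInfo("secret.txt")
    info.size = 4
    tf.addfile(info, io.BytesIO(b"data"))
raw = buf.getvalue()

# No initial special header: the very first block is already the
# member header, so the file name sits in cleartext at offset 0.
print(raw[:10])                       # b'secret.txt'

# The archive ends with two 512-byte blocks of zero bytes
# (plus record padding up to the blocking factor).
print(raw[-1024:] == b"\x00" * 1024)  # True
```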
+
+ So technically, if I understand correctly, what you're proposing is the
+ same as creating a tar file and then compressing/encrypting it, which is
+ fine by me. When the tarball is big, we could use the help of multivolume
+ tars, so each file entry is divided across different volumes (which are in
+ fact also tar files, containing sections of a file).
+
+ Of course, this would generate lots of files in the backup dir, one per
+ backed-up file (or more, if it was split via multivolume), and this is not
+ good. But we know a solution for that: on top of those tar.gz.gpg files,
+ we can put *another* tar container. This container would have headers that
+ are not concealed, because they contain no more than the name of the
+ volume, and not the name (or any other sensitive information) of the
+ compressed/encrypted file. This could be just one very big tar file, split
+ via multivolume, with lots of different-sized files, which are themselves
+ tar files containing encrypted/compressed files. This would meet the
+ requirements, I think.
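+
+ The two-layer idea can be sketched with tarfile and gzip (gpg encryption
+ would replace or follow the gzip step; the "vol-NNN" naming and the
+ pack() helper are just illustrations, not a proposed API):

```python
import gzip
import io
import tarfile

def pack(files):
    """Outer tar of compressed inner tars: the outer headers only
    reveal volume names, never the real file names."""
    outer_buf = io.BytesIO()
    with tarfile.open(fileobj=outer_buf, mode="w") as outer:
        for i, (name, data) in enumerate(files):
            # Inner tar: one file entry with the real name and data.
            inner_buf = io.BytesIO()
            with tarfile.open(fileobj=inner_buf, mode="w") as inner:
                info = tarfile.TarInfo(name)
                info.size = len(data)
                inner.addfile(info, io.BytesIO(data))
            # Compress the whole inner tar (gpg would go here too),
            # so header blocks and payload are concealed together.
            payload = gzip.compress(inner_buf.getvalue())
            vol = tarfile.TarInfo("vol-%03d.tar.gz" % i)
            vol.size = len(payload)
            outer.addfile(vol, io.BytesIO(payload))
    return outer_buf.getvalue()

archive = pack([("passwords.txt", b"hunter2"), ("notes.txt", b"hi")])
with tarfile.open(fileobj=io.BytesIO(archive)) as outer:
    print([m.name for m in outer.getmembers()])
    # ['vol-000.tar.gz', 'vol-001.tar.gz'] - real names are concealed
```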
+
+ For that idea, the conclusion is that this can be done easily with
+ available tools: the tarfile python library allows creating tar.gz files,
+ and it's also easy to encrypt tarballs with it, as duplicity already
+ does. The only missing part would be multivolume encryption. For
+ optimization, we would also have to add an option to remove the
+ end-of-file marker.
+
+ Maybe I have not understood what you were trying to say; please tell me
+ if that's the case.
+
+ [1] http://www.gnu.org/software/tar/manual/html_node/Standard.html
+
*** New items to check ***
- please investigate how the compression can be restarted in tarlib
on the global stream level for each file boundary.
We need this for our "own" solution and/or for duplicity.
Only this gives good data integrity.
+Wadobo:
+
+ You mean tarfile, right? [1] Anyway, assuming my analysis above is
+ correct, this can be done easily using the "w:" and "w:gz" write modes
+ (same for read) and calling TarFile.addfile(TarInfo.frombuf(hdr), f),
+ with "hdr" being the 512-byte header block and "f" an encrypted data
+ file/stream.
+
+ The stream support in tarfile means that you just provide a fileobj with
+ read/write functions, which tarfile will call to create/read tar files.
+ It's up to your read/write functions to do the rest.
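+
+ As a sketch: in tarfile's "w|" stream mode, the supplied object only
+ needs a write() method, and that method is exactly where per-chunk
+ compression or encryption could be hooked in. CountingWriter below is a
+ made-up name for illustration; it only counts bytes instead of
+ transforming them:

```python
import io
import tarfile

class CountingWriter:
    """Minimal write-side fileobj for tarfile's 'w|' stream mode.
    Only write() is required; this is the hook point where each
    chunk could be compressed/encrypted before hitting storage."""
    def __init__(self, target):
        self.target = target
        self.written = 0

    def write(self, data):
        self.written += len(data)   # replace with gzip/gpg processing
        self.target.write(data)

sink = io.BytesIO()
writer = CountingWriter(sink)
with tarfile.open(fileobj=writer, mode="w|") as tf:
    info = tarfile.TarInfo("x.txt")
    info.size = 3
    tf.addfile(info, io.BytesIO(b"abc"))

# tarfile always pushes whole 512-byte blocks through our fileobj.
print(writer.written % 512)  # 0
```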
+
+[1] http://docs.python.org/2/library/tarfile.html
+
- we could use a tar format variant like GNU tar or "pax".
Please evaluate the pros / cons of other tar format variants.
pax f.e. seems to store the file owner names etc.
IIRC Fedora added xattr support to tar.
+Wadobo:
+
+ pax indeed seems to be the most powerful tar format, and it has been
+ standardized by IEEE. It's supported by the tarfile python lib, and I
+ would recommend using it. As advantages, it allows storing hard links,
+ encoding, charset, long paths and bigger file sizes. This is all thanks
+ to the extended header.
+
+ The GNU Tar format is also supported by tarfile and supports long names.
+
+ As you note, support for extended attributes has been added to tar by
+ some distros, but not by others - in openSUSE, for example, it is not the
+ case. The situation is similar with pax: by default it doesn't support
+ them, but some vendors have patched it (for example Solaris).
+
+ The tarfile python lib does not have support for extended attributes,
+ but it shouldn't be very difficult to add.
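+
+ A small example of what the pax extended header buys us with tarfile
+ (the "comment" key is just an illustrative pax record; any key/value
+ pair can be stored per member):

```python
import io
import tarfile

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tf:
    # Names longer than the classic 100-byte ustar field and arbitrary
    # key/value records both go into the pax extended header.
    info = tarfile.TarInfo("some/" + "very_long_" * 15 + "name.txt")
    info.size = 0
    info.pax_headers = {"comment": "stored in the extended header"}
    tf.addfile(info)

with tarfile.open(fileobj=io.BytesIO(buf.getvalue())) as tf:
    member = tf.getmembers()[0]
    print(len(member.name) > 100)          # True
    print(member.pax_headers["comment"])   # stored in the extended header
```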
+
- duplicity is GPL. We intend to add the "archive layer"
later on to our backup subsystem which means the
  whole subsystem must become GPL, too. That's something I have not
  made my mind up about yet, but it would block that road.
OTOH we save a bit of development time by using the duplicity basis.
+Wadobo:
+ That's true. We seem to have a different use case than the one covered by
+ duplicity, so in any case we can perhaps learn some tricks from their
+ code, but it might not be such a big advantage to start from duplicity.
+
*** Later on ***
X. Collect feedback from Intra2net. If all is fine,
design the class layout for all the design requirements