*** First steps ***

We need to do some initial research to determine whether tar can satisfy our design requirements.

1. Find out what needs to be done for multi-volume support in the Python tarfile library. GNU tar already supports this. [How did duplicity solve this?]

Wadobo: Indeed, GNU tar supports multi-volume archives. Multi-volume tars have some limitations; for example, they cannot be compressed. Implementing multi-volume support in the Python tarfile library shouldn't be very difficult, but we would first need to study the multi-volume format.

Duplicity works by creating fixed-size volumes in which the archived files are stored. If a file doesn't fit in the current volume, it is split between the current volume and the next. Duplicity generates an external manifest file that records which file was split across the end of one volume and the beginning of the next. This is how they seem to keep track of the split files and of the volumes themselves.

Intra2net: We could implement multi-volume support by splitting the compressed tar archive once it has reached the volume size limit and treating it later on like one big virtual volume. The volumes could be encrypted, too. -> No need for a (fragile?) manifest file. I think archiving to tape pretty much works like this with GNU tar. In case of an emergency you could just "cat" those files together and start unpacking. If the first volume of a split file is gone, we still have the "emergency recover" tool, which can extract the files from a given position / search for the first marker.

Wadobo: I had not thought about it this way, but you're right: having a manifest file makes you dependent on it. Looking more closely at the tar format, it is now quite clear to me that a manifest is not really needed. The tar format divides an archive into blocks, and each series of blocks starts with a header block. This header indicates, among other things, whether the entry is the continuation of a file that started in another volume, and at which offset, so a file split across two volumes is easy to recognize. In fact, each individual volume of a multi-volume tar archive can be treated as if it were a complete tar archive; the GNU tar command supports this. This works well unless you have to extract a multi-volume file that started in a previous volume.

We just need to implement multi-volume support in the Python tarfile library, which doesn't seem complicated, as it already supports opening a stream for reading/writing.

2. Design how we are going to split huge files over multiple volumes and recognize them later on. [How did duplicity solve this? There are unit tests for this.]

Wadobo: The duplicity method described in the previous question seems to do the trick without resorting to magic markers and escaping, so I would suggest doing that. Here is an excerpt of a manifest:

    Volume 1:
        StartingPath .
        EndingPath "Ultimate\x20Lite\x20x86\x20Faster\x20June\x202012.iso" 463
        Hash SHA1 02d12203ce728f70a846f87aeff08b1ed92f6148
    Volume 2:
        StartingPath "Ultimate\x20Lite\x20x86\x20Faster\x20June\x202012.iso" 464
        EndingPath "Ultimate\x20Lite\x20x86\x20Faster\x20June\x202012.iso" 861
        Hash SHA1 2299f5afc7f41f66f5bd12061ef68962d3231e34

Intra2net: Answered above: a manifest is probably not needed.

3. How can we restart the gzip compression for every file in the tar archive?
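One possible mechanism, sketched here purely as an assumption to make the question concrete: the gzip format allows several compressed members in a single file, and decompression joins them transparently, so the compressor can simply be closed and reopened at every file boundary.

    import gzip

    def write_with_restarts(out_path, chunks):
        # Each chunk (e.g. the header + payload blocks of one tar
        # entry) becomes an independent gzip member in the same
        # output file, so a damaged member only affects the data
        # up to the next member boundary.
        with open(out_path, "wb") as out:
            for chunk in chunks:
                gz = gzip.GzipFile(fileobj=out, mode="wb")
                gz.write(chunk)
                gz.close()  # ends this member; "out" stays open

    # Reading back: gzip.open(out_path, "rb").read() returns the
    # concatenation of all chunks.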
Wadobo: duplicity has already proposed a new better-than-tar file format [1], which might be an interesting way to start collaborating with them: giving them some feedback/input on their proposal based on our needs, and perhaps implementing a generalized solution that covers all the use cases. In that document they describe the problems of tar, one of which is that tar can only be wrapped on the outside, hence the tar.bz2 or tar.gpg formats. We could instead force all files to be compressed and encrypted on the inside; this is doable. The process would be: each file is compressed, encrypted, and then put into the tar. One problem that could arise is that with too many small files, compressing and encrypting each one might be quite slow and take more space than needed. On the other hand, one would probably still have to split large files. A compromise solution could be to simply use a good volume size and try to avoid splitting any file smaller than the volume size across two volumes, which can easily be done with duplicity.

Intra2net: We want the compression / encryption on top of tar, since we don't want to leak the file names / directory structure in the encryption case.

Wadobo: Fine by me: then it's clear that neither duplicity's new format nor its current file format is good for us, unless we remove the need for the manifest.

--
[1] http://duplicity.nongnu.org/new_format.html

4. Have a look at duplicity and evaluate if it can be adapted in an upstream-compatible way to our needs. Maybe it can be tweaked to always produce "full snapshots" for the differential backup scenario?

Wadobo: The code of duplicity seems quite well documented and sufficiently structured and flexible, so even if there's no direct support for producing full snapshots in the incremental mode, it seems doable without having to break things. From my inspection of the code, the function that decides whether a file is diffed with rsync or snapshotted during a diff-dir backup is get_delta_path in duplicity/diffdir.py. Given that, I would propose sending a message to the mailing list asking what they think about adding this, whether they would be willing to accept the feature upstream, and in any case whether someone has already tried this before and has any suggestions, before we start coding it.

Intra2net: Please go ahead and ask two things on the duplicity mailing list:
- (as you proposed) What do they think about the diff backup mode?
- What do they think about restarting the gzip compression / encryption on the stream level at each file "boundary"?
If they don't like this, then our future is already sealed :o)

Wadobo: To encrypt/compress each "file boundary" (a "file entry" in GNU tar terminology [1]), including both the header blocks with the file path, to conceal that information, and the file data contained in the "payload" data blocks: this is probably what you are proposing, right? But the tar format is so simple that what you're describing is technically the same as compressing a tar file that contains only one file entry. That's because a tar file has no special initial header, and its end-of-file marker, which consists of two 512-byte blocks of zero bytes, is optional.
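A small sketch of the equivalence just described, as we understand it (encryption, e.g. a GnuPG pipe, is left out for brevity):

    import gzip
    import io
    import tarfile

    def entry_as_tiny_tar_gz(path):
        # One "file entry" (header block + payload blocks) written
        # through tarfile is itself a valid one-member tar archive,
        # including the optional end-of-file marker added on close.
        raw = io.BytesIO()
        with tarfile.open(fileobj=raw, mode="w") as tar:
            tar.add(path)
        # Compressing "per file entry" is therefore just gzipping a
        # one-entry tar; the result would then go on to encryption.
        packed = io.BytesIO()
        gz = gzip.GzipFile(fileobj=packed, mode="wb")
        gz.write(raw.getvalue())
        gz.close()
        return packed.getvalue()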
So technically, if I understand correctly, what you're proposing is the same as creating a tar file and then compressing/encrypting it, which is fine by me. When the tarball is big, we could use multi-volume tars, so each file entry is divided across different volumes (which are in fact also tar files, containing sections of a file).

Of course, this would generate lots of files in the backup dir, one per backed-up file (or more, if it was split via multi-volume), and this is not good. But we know a solution for that: on top of those tar.gz.gpg files, we can put *another* tar container. This outer container would have headers that are not concealed, because they contain nothing more than the name of each volume, and not the name (or any other sensitive information) of the compressed/encrypted file inside. This could be just one very big tar file, split via multi-volume, with lots of differently sized files, which are themselves tar files containing encrypted/compressed files. This would meet the requirements, I think.

For that idea, the conclusion is that it can be done easily with available tools: the tarfile Python library can create tar.gz files, and it's also easy to encrypt tarballs with it, as duplicity already does. The only missing part would be multi-volume encryption. For optimization, we would also have to add an option to omit the end-of-file marker.

Maybe I have not understood well what you were trying to say; please tell me if that's the case.

[1] http://www.gnu.org/software/tar/manual/html_node/Standard.html

*** New items to check ***

- Please investigate how the compression can be restarted in tarlib on the global stream level for each file boundary. We need this for our "own" solution and/or for duplicity. Only this gives good data integrity.

Wadobo: You mean tarfile, right? [1] Anyway, assuming my analysis above is correct, this can be done easily using the "w:" and "w:gz" write modes (same for reading) and calling TarFile.addfile(TarInfo.frombuf(buf), f), with "f" being an encrypted data file/stream. Stream support in tarfile means that you just provide a fileobj whose read/write functions tarfile will call to create/read tar archives; it's up to those read/write functions to do the rest.

[1] http://docs.python.org/2/library/tarfile.html

- We could use a tar format variant like GNU tar or "pax". Please evaluate the pros / cons of other tar format variants. pax e.g. seems to store the file owner names etc. IIRC Fedora added xattr support to tar.

Wadobo: pax indeed seems to be the most powerful tar format, and it has been standardized by IEEE. It's supported by the tarfile Python lib; I would recommend using it (see the short selection sketch after this list). Its advantages: it allows storing hard links, encoding/charset information, long paths and bigger file sizes, all thanks to its extended header. The GNU tar format is also supported by tarfile and supports long names. As you note, some distros have added extended-attribute support to tar; in openSUSE, for example, this is not the case. The situation is similar with pax: by default it doesn't support extended attributes, but some vendors have patched it in (for example Solaris). The tarfile Python lib does not support extended attributes either, but it shouldn't be very difficult to add.

- duplicity is GPL. We intend to add the "archive layer" later on to our backup subsystem, which means the whole subsystem would have to become GPL, too. That's something I have not made up my mind about, but it would block that road. OTOH we save a bit of development time by using the duplicity basis.

Wadobo: That's true.
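As referenced above, a minimal sketch of selecting the pax variant with tarfile (file names are only examples):

    import tarfile

    # tarfile can emit the format variants discussed above:
    # PAX_FORMAT enables the extended headers (long paths, large
    # sizes, charset info); GNU_FORMAT is the GNU variant with
    # long-name support.
    with tarfile.open("backup.tar", "w", format=tarfile.PAX_FORMAT) as tar:
        tar.add("some-file")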
Wadobo (cont.): We seem to have a different use case than the one covered by duplicity, so while we can perhaps learn some tricks from their code, it might not be such a big advantage to start from duplicity.

*** Later on ***

X. Collect feedback from Intra2net. If all is fine, design the class layout for all the design requirements and have another round of feedback.
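As one possible starting point for that class layout (all names below are hypothetical and only meant to frame the discussion):

    import tarfile

    class MultiVolumeTarFile(tarfile.TarFile):
        # Hypothetical subclass covering item 1: close the current
        # volume when max_volume_size is reached and ask the
        # new_volume_handler callback to open the next one.
        def __init__(self, *args, **kwargs):
            self.max_volume_size = kwargs.pop("max_volume_size", None)
            self.new_volume_handler = kwargs.pop("new_volume_handler", None)
            tarfile.TarFile.__init__(self, *args, **kwargs)

    class RestartingStream(object):
        # Hypothetical fileobj for tarfile's stream mode that would
        # restart gzip compression / encryption at each file
        # boundary, as discussed under "New items to check".
        def write(self, data):
            raise NotImplementedError("design sketch only")

        def close(self):
            pass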