[python-delta-tar] / docs / TODO.txt
*** First steps ***
We need to do some initial research to determine whether tar
can satisfy our design requirements.

1. Find out what needs to be done for multi-volume
   support in python tarlib. GNU tar already supports this.

   [How did duplicity solve this?]

Wadobo:
   GNU tar does indeed support multi-volume archives. Tar multi-volume has
   some limitations; for example, the volumes cannot be compressed.
   Implementing multi-volume support in python tarlib shouldn't be very
   difficult, though, but it would require studying the multi-volume format
   first.

   Duplicity works by creating fixed-size volumes in which the archived
   files are stored. If a file doesn't fit into the current volume, it is
   split between the current volume and the next. Duplicity generates an
   external manifest file that records which file was split at the end of
   one volume and the beginning of the next. This is how it keeps track of
   the split files and of the volumes themselves.

Intra2net:
   We could implement multi-volume support by splitting
   the compressed tar archive once it has reached
   the volume size limit and treating the parts later on like one
   big virtual volume. The volumes could be encrypted, too.

   -> No need for a (fragile?) manifest file. I think archiving to tape
      pretty much works like this with GNU tar.

   In case of an emergency you could just "cat"
   those files together and start unpacking.

   In case the first volume of a split file is gone,
   we still have the "emergency recover" tool, which
   can extract the files from a given position / search
   for the first marker.
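
   The splitting idea can be sketched with plain file operations. This is a
   minimal illustration, not part of any existing tool; the volume size and
   the ".volNNN" naming are assumptions:

```python
def split_archive(path, volume_size):
    """Split a (compressed) tar archive into fixed-size volume files.

    The volumes carry no format of their own; they are just consecutive
    byte ranges of the original stream.
    """
    volumes = []
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(volume_size)
            if not chunk:
                break
            volume = "%s.vol%03d" % (path, index)
            with open(volume, "wb") as dst:
                dst.write(chunk)
            volumes.append(volume)
            index += 1
    return volumes

def join_volumes(volumes, out_path):
    """The 'emergency' path: equivalent of `cat archive.vol* > archive`."""
    with open(out_path, "wb") as dst:
        for volume in volumes:
            with open(volume, "rb") as src:
                dst.write(src.read())
```

   Because the volumes are raw byte slices, concatenating them in order
   reproduces the original archive exactly, which is what makes the
   "cat and unpack" recovery work.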
Wadobo:
   I had not thought about it this way, but you're right: having a manifest
   file makes you dependent on it. Looking more closely at the tar format,
   it is now quite clear to me that a manifest is not really needed.

   The tar format divides a tar file into blocks, and each series of blocks
   has a block header. Among other things, this header indicates whether
   the entry is the continuation of a file that started in another volume,
   and at which offset, so this is how one can easily recognize that a file
   has been split across two volumes. In fact, each individual volume of a
   multi-volume tar archive can be treated as if it were a complete tar
   archive, and the GNU tar command supports this. This works well unless
   you have to extract a multi-volume file that started in a previous
   volume. We just need to implement multi-volume support in python tarlib,
   which doesn't seem complicated, as it already has support for opening a
   stream for read/write.

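
   The per-entry header structure described above can be poked at directly
   with Python's tarfile module. A minimal sketch (the entry name and size
   are made up) that builds one 512-byte header block and parses it back,
   the way a volume scanner would:

```python
import tarfile

# Build a 512-byte tar header block for a hypothetical entry.
info = tarfile.TarInfo(name="example.txt")
info.size = 1234
block = info.tobuf(format=tarfile.GNU_FORMAT)

# Parse it back, as one would when scanning a volume for entry headers.
parsed = tarfile.TarInfo.frombuf(block, encoding="utf-8",
                                 errors="surrogateescape")
```

   The multi-volume continuation information discussed above lives in
   additional header fields of the GNU format; the point here is only that
   every entry starts with a self-describing 512-byte header block.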
2. Design how we are going to split huge files over multiple volumes
   and recognize them later on.

   [How did duplicity solve this? There are unit tests for this]

Wadobo:

The method used by duplicity, described in the previous question, seems to
do the trick without having to resort to magic markers and escaping, so I
would suggest doing that.

Here is an excerpt of a manifest:

Volume 1:
    StartingPath .
    EndingPath "Ultimate\x20Lite\x20x86\x20Faster\x20June\x202012.iso" 463
    Hash SHA1 02d12203ce728f70a846f87aeff08b1ed92f6148
Volume 2:
    StartingPath "Ultimate\x20Lite\x20x86\x20Faster\x20June\x202012.iso" 464
    EndingPath "Ultimate\x20Lite\x20x86\x20Faster\x20June\x202012.iso" 861
    Hash SHA1 2299f5afc7f41f66f5bd12061ef68962d3231e34

Intra2net:
   Answered above: a manifest is probably not needed.

3. How can we restart the gzip compression for every file in the tar archive?

Wadobo:

Duplicity has already proposed a new better-than-tar file format [1], which
might be an interesting way to start collaborating with them: giving them
some feedback/input on their proposal based on our needs, and perhaps
implementing a generalized solution that covers all the use cases. In that
document they describe the problems of tar, one of which is that tar only
allows wrapping on the outside, hence the tar.bz2 or tar.gpg formats.

We could simply force all files to be compressed and encrypted on the
inside; this is doable. The process would be: each file is compressed,
encrypted and then put into the tar.

One problem that could arise is that with too many small files, compressing
and encrypting each file individually might be quite slow and take more
space than needed. On the other hand, one would probably still have to
split large files.

A compromise could be to simply use a good volume size and try to avoid
splitting any file smaller than the volume size across two volumes, which
can easily be done with duplicity.

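A sketch of the per-file scheme described above, using only the standard
library; gzip stands in for the compress step and the encrypt step is
omitted (the helper name and file names are illustrative):

```python
import gzip
import io
import tarfile

def add_compressed(tar, name, data):
    """Compress one file's bytes individually, then store the result as
    its own entry in the enclosing tar. (Encryption would be applied to
    `payload` after compression; omitted here.)"""
    payload = gzip.compress(data)
    info = tarfile.TarInfo(name=name + ".gz")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

buf = io.BytesIO()
# Plain (uncompressed) outer tar: each entry is already compressed.
with tarfile.open(fileobj=buf, mode="w") as tar:
    add_compressed(tar, "a.txt", b"hello " * 100)
    add_compressed(tar, "b.txt", b"world " * 100)
```

Each entry pays its own gzip header plus tar block padding, which is
exactly the small-files overhead mentioned above.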
Intra2net:
   We want the compression / encryption on top of tar since
   we don't want to leak the file names / directory
   structure in the encryption case.

Wadobo:
   Fine by me: then it's clear that duplicity's new format is not good for
   us, and the current file format isn't either, unless we remove the need
   for the manifest.

--
[1] http://duplicity.nongnu.org/new_format.html

4. Have a look at duplicity and evaluate whether it can be adapted
   in an upstream-compatible way to our needs.

   Maybe it can be tweaked to always produce "full snapshots"
   for the differential backup scenario?

The code of duplicity seems quite well documented and sufficiently
structured and flexible, so even if there is no direct support for
producing full snapshots in incremental mode, it seems doable without
having to resort to breaking things.

From my inspection of the code, the function that decides whether a file
will be diffed with rsync or snapshotted when doing a diff-dir backup is
get_delta_path in duplicity/diffdir.py.

Given that, I would propose sending a message to the mailing list asking
what they think about adding this, whether they would be willing to add the
feature upstream, and in any case whether someone has already tried this
before and has any suggestions, before we start trying to code it.

Intra2net:
   Please go ahead and ask two things on the duplicity mailing list:
   - (like you proposed) What do they think about the diff backup mode?

   - What do they think about restarting the gzip compression / encryption
     on the stream level at each file "boundary"?
     If they don't like this, then our future is already sealed :o)

Wadobo:

   You propose to encrypt/compress each "file boundary" (what GNU tar
   terminology [1] calls a "file entry"), and that includes both the header
   blocks with the file path, to conceal that information, and the file
   data, which is contained in "payload" data blocks - is that right?

   But the thing is, the tar format is so easy/dumb that what you're
   describing is in fact technically the same as compressing a tar file - a
   tar file that contains only one file entry. That's because a tar file
   doesn't have any initial special header; it does have an end-of-file
   marker, which consists of two 512-byte blocks of zero bytes, but that
   marker is optional.

   So technically, if I understand correctly, what you're proposing is the
   same as creating a tar file and then compressing/encrypting it, which is
   fine by me. When the tarball is big, we could use the help of
   multi-volume tars, so each file entry is divided over different volumes
   (which are in fact also tar files, containing sections of a file).

   Of course, this would generate lots of files in the backup dir, one per
   backed-up file (or more, if it was split via multi-volume), and this is
   not good. But we know a solution for that: on top of those tar.gz.gpg,
   we can put *another* tar container. This container would have headers
   that are not concealed, because they contain no more than the name of
   the volume, and not the name (or any other sensitive information) of the
   compressed / encrypted file. This could be just a very big tar file,
   split via multi-volume, with lots of different-sized files, which are
   also tar files containing encrypted/compressed files. This would meet
   the requirements, I think.

   For that idea, the conclusion is that this can be done easily with the
   available tools: the tarfile python library allows creating tar.gz
   files, and it's also easy to encrypt tarballs with it, as duplicity
   already does. The only part missing would be multi-volume encryption.
   For optimization, we would also have to add an option to remove the
   end-of-file marker.

   Maybe I have not understood well what you were trying to say; please
   tell me if that's the case.

   [1] http://www.gnu.org/software/tar/manual/html_node/Standard.html

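   The two-layer layout sketched in this answer (inner concealed per-file
   tar.gz volumes, outer tar whose headers expose only volume names) can be
   illustrated with tarfile alone. The encryption step between the layers
   is omitted and all names are invented:

```python
import io
import tarfile

def make_inner_volume(name, data):
    """Create a single-entry tar.gz in memory: the 'concealed' container
    holding one file's name and data. (Encryption, e.g. gpg, would be
    applied to these bytes afterwards; omitted in this sketch.)"""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

# Outer container: entry names are just opaque volume names, so no
# sensitive file names leak from the outer headers.
outer = io.BytesIO()
with tarfile.open(fileobj=outer, mode="w") as tar:
    for i, (name, data) in enumerate([("secret-report.txt", b"top secret"),
                                      ("notes.txt", b"some notes")]):
        blob = make_inner_volume(name, data)
        info = tarfile.TarInfo(name="volume-%03d.tar.gz" % i)
        info.size = len(blob)
        tar.addfile(info, io.BytesIO(blob))
```

   Only the outer archive would then be split multi-volume style; its
   headers reveal nothing but the volume numbering.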
*** New items to check ***
- please investigate how the compression can be restarted in tarlib
  on the global stream level for each file boundary.
  We need this for our "own" solution and/or for duplicity.
  Only this gives good data integrity.

Wadobo:

   You mean tarfile, right? [1] Anyway, assuming my analysis above is
   correct, this can be done easily using the "w:" and "w:gz" write modes
   (same for reading) and calling TarFile.addfile(TarInfo.frombuf(f)), with
   "f" being an encrypted data file/stream.

   The stream support in tarfile means that you just provide a fileobj
   with the read/write functions that tarfile will call to create/read tar
   files. It's up to you what those read/write functions do beyond that.

[1] http://docs.python.org/2/library/tarfile.html

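A sketch of that stream support: tarfile's "w|gz" pipe mode only ever calls
write() on the supplied fileobj, so a custom object can, for example, count
bytes to decide when to cut a volume. The CountingWriter class below is
illustrative, not an existing API:

```python
import io
import tarfile

class CountingWriter:
    """Minimal file-like object: tarfile's stream mode only needs write().
    Here we also count bytes, e.g. to decide when to start a new volume."""
    def __init__(self):
        self.buf = io.BytesIO()
        self.written = 0
    def write(self, data):
        self.written += len(data)
        return self.buf.write(data)
    def close(self):
        pass

out = CountingWriter()
# "w|gz": gzip'd stream mode, writing through our fileobj.
with tarfile.open(fileobj=out, mode="w|gz") as tar:
    data = b"hello stream"
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
```

The resulting bytes in `out.buf` are an ordinary tar.gz that any reader can
open.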
- we could use a tar format variant like GNU tar or "pax".
  Please evaluate the pros / cons of other tar format variants.

  pax, for example, seems to store the file owner names etc.
  IIRC Fedora added xattr support to tar.

Wadobo:

   pax indeed seems to be the most powerful tar format, and it has been
   standardized by IEEE. It is supported by the tarfile python lib, and I
   would recommend using it. As advantages, it allows storing hard links,
   encoding, charset, long paths and bigger file sizes. This is all thanks
   to its extended header.

   The GNU tar format is also supported by tarfile and has long-name
   support.

   As you note, some distros have added support for extended attributes to
   tar; in opensuse, for example, this is not the case. The situation is
   similar with pax: by default it doesn't support them, but some have
   patched it (for example, solaris).

   The tarfile python lib does not have support for extended attributes,
   but it shouldn't be very difficult to add.

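   Requesting the pax format from tarfile is a one-argument change. A small
   sketch showing a path longer than the 100-character ustar name field
   surviving a round trip (file names and contents invented):

```python
import io
import tarfile

buf = io.BytesIO()
# Ask tarfile for POSIX.1-2001 pax archives; pax stores metadata such as
# long paths and large sizes in extended header records.
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
    data = b"x" * 10
    # Longer than the 100-byte ustar name field: pax handles this via an
    # extended 'path' record instead of truncating.
    long_name = "dir/" * 40 + "file.txt"
    info = tarfile.TarInfo(name=long_name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
```

   Reading the archive back needs no special flag; tarfile detects the pax
   extended headers automatically.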
- duplicity is GPL. We intend to add the "archive layer"
  later on to our backup subsystem, which means the
  whole subsystem would have to become GPL, too. That's something
  I have not made up my mind about, but it could block that road.
  OTOH we would save a bit of development time by building on duplicity.

Wadobo:
   That's true. We seem to have a different use case from the one covered
   by duplicity, so while we can perhaps learn some tricks from their code,
   it might not be such a big advantage to start from duplicity.

*** Later on ***
X. Collect feedback from Intra2net. If all is fine,
   design the class layout for all the design requirements
   and have another round of feedback.