From: Eduardo Robles Elvira
Date: Fri, 7 Jun 2013 08:43:37 +0000 (+0200)
Subject: answering TODO questions
X-Git-Tag: v2.2~195
X-Git-Url: http://developer.intra2net.com/git/?a=commitdiff_plain;h=884e7f6a2ab56377e78765a1668e55e85352b568;p=python-delta-tar

answering TODO questions
---

diff --git a/docs/TODO.txt b/docs/TODO.txt
index b85c24a..0a175fe 100644
--- a/docs/TODO.txt
+++ b/docs/TODO.txt
@@ -7,19 +7,90 @@
 can satisfy our the design requirements.

 [How did duplicity solve this?]

+Answer:
+
+Indeed, GNU tar supports multivolume archives. Multivolume tar archives have
+some limitations; for example, they cannot be compressed. Implementing
+multivolume support in Python's tarfile module shouldn't be very difficult,
+but it would require studying the multivolume format first.
+
+Duplicity works by creating fixed-size volumes in which the archived files
+are stored. If a file doesn't fit in the current volume, it is split between
+the current volume and the next. Duplicity generates an external manifest file
+that specifies which file was split across the end of one volume and the
+beginning of the next. This is how they seem to keep track of the split files
+and the volumes themselves.
+
 2. Design how we are going to split huge files over multiple volumes
 and recognize them later on.

 [How did duplicity solve this? There are unit tests for this]

+Answer:
+
+The method used by duplicity, described in the previous answer, seems to do
+the trick without having to resort to magic markers and escaping, so I would
+suggest doing that.
+
+Here is an excerpt of a manifest:
+
+Volume 1:
+    StartingPath .
+    EndingPath "Ultimate\x20Lite\x20x86\x20Faster\x20June\x202012.iso" 463
+    Hash SHA1 02d12203ce728f70a846f87aeff08b1ed92f6148
+Volume 2:
+    StartingPath "Ultimate\x20Lite\x20x86\x20Faster\x20June\x202012.iso" 464
+    EndingPath "Ultimate\x20Lite\x20x86\x20Faster\x20June\x202012.iso" 861
+    Hash SHA1 2299f5afc7f41f66f5bd12061ef68962d3231e34
+
 3. How can we restart the gzip compression for every file in the tar archive?

+Answer:
+
+duplicity has already proposed a new better-than-tar file format [1], which
+might be an interesting way to start collaborating with them: giving them some
+feedback/input on their proposal based on our needs, and perhaps implementing
+a generalized solution that could cover all the use cases. In that document
+they describe the problems of tar, one of which is that a tar archive can only
+be wrapped from the outside, hence the tar.bz2 or tar.gpg formats.
+
+We could just force all files to be compressed and encrypted on the inside;
+this is doable. The process would be: each file is compressed, encrypted and
+then put in the tar.
+
+One problem that could arise is that, if you have many small files,
+compressing and encrypting each file separately might be quite slow and take
+more space than needed. On the other hand, one would probably still have to
+split large files.
+
+A compromise could be to simply pick a good volume size and try to prevent
+any file smaller than the volume size from being split across two volumes,
+which can easily be done with duplicity.
+
+--
+[1] http://duplicity.nongnu.org/new_format.html
+
 4. Have a look at duplicity and evaluate if it can be adapted in an
 upstream-compatible way to our needs

 May be it can be tweaked to always produce "full snapshots" for the
 differential backup scenario?

+The code of duplicity seems quite well documented and sufficiently structured
+and flexible, so even though there is no direct support for producing full
+snapshots in incremental mode, adding it seems doable without having to
+resort to breaking things.
+
+From my inspection of the code, it seems that the function that decides
+whether a file is going to be diffed with rsync or snapshotted when doing a
+diff dir backup is get_delta_path in duplicity/diffdir.py.
+
+Given that, I would propose sending a message to the mailing list asking what
+they think about adding that and whether they would be willing to add this
+feature upstream, and in any case asking whether someone has already tried
+this before and whether they have any suggestions before we start coding it.
+
 5. Collect feedback from Intra2net. If all is fine,
 design the class layout for all the design requirements
 and have another round of feedback.
+
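As a rough illustration of the per-file approach discussed in the answer to
question 3 (compress each file on its own, then put it in the tar), here is a
minimal sketch that uses only Python's standard tarfile and gzip modules. The
helper names (add_compressed_member, build_archive, read_member) and the ".gz"
member suffix are hypothetical, chosen only for this example, and the
encryption step is left out entirely.

```python
import gzip
import io
import tarfile


def add_compressed_member(tar, name, data):
    """Gzip one file's bytes on their own and store the result as a tar member."""
    payload = gzip.compress(data)
    # An encryption step would be applied to `payload` here (omitted in this sketch).
    info = tarfile.TarInfo(name=name + ".gz")  # hypothetical naming convention
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def build_archive(files):
    """Return the bytes of a plain tar whose members are individually gzipped."""
    buf = io.BytesIO()
    # mode="w" writes an uncompressed tar: compression happens per member.
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            add_compressed_member(tar, name, data)
    return buf.getvalue()


def read_member(archive, name):
    """Extract and decompress a single member, independently of the others."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        member = tar.extractfile(name + ".gz")
        return gzip.decompress(member.read())
```

Because every member carries its own gzip stream, a single file can be
restored without decompressing the rest of the archive. The cost is the one
noted in the answer: many small files each pay the per-stream overhead.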