From 414c0518f4e131be7a8c7fb85c904fc81e850bc9 Mon Sep 17 00:00:00 2001
From: Thomas Jarosch
Date: Tue, 4 Jun 2013 16:59:58 +0200
Subject: [PATCH] Add design ideas and first TODO steps

---
 docs/TODO.txt         |   25 ++++
 docs/design_ideas.txt |  180 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 205 insertions(+), 0 deletions(-)
 create mode 100644 docs/TODO.txt
 create mode 100644 docs/design_ideas.txt

diff --git a/docs/TODO.txt b/docs/TODO.txt
new file mode 100644
index 0000000..b85c24a
--- /dev/null
+++ b/docs/TODO.txt
@@ -0,0 +1,25 @@
+*** First steps ***
+We need to do some initial research to see whether tar
+can satisfy our design requirements.
+
+1. Find out what needs to be done for multi-volume
+   support in the python tarfile module. GNU tar already supports this.
+
+   [How did duplicity solve this?]
+
+2. Design how we are going to split huge files over multiple volumes
+   and recognize them later on.
+
+   [How did duplicity solve this? There are unit tests for this]
+
+3. How can we restart the gzip compression for every file in the tar archive?
+
+4. Have a look at duplicity and evaluate whether it can be adapted
+   to our needs in an upstream-compatible way.
+
+   Maybe it can be tweaked to always produce "full snapshots"
+   for the differential backup scenario?
+
+5. Collect feedback from Intra2net. If all is fine,
+   design the class layout for all the design requirements
+   and have another round of feedback.
diff --git a/docs/design_ideas.txt b/docs/design_ideas.txt
new file mode 100644
index 0000000..a16a0d3
--- /dev/null
+++ b/docs/design_ideas.txt
@@ -0,0 +1,180 @@
+For our backup system, we have the following requirements:
+
+*** Requirements ***
+- backup of many files, even within a single directory (cyrus imapd data folder)
+- split backup volumes at a configurable size
+- split large files across volumes
+- compression support
+- encryption support
+- filtering of files to include/exclude for backup
+- filtering of files to include/exclude for restore
+
+- robustness against corrupted archives.
+  If one file is corrupt / partially written, the rest should still be readable.
+
+- preserve file ownership, permissions and timestamps.
+  Note: no support for hardlinks or xattrs necessary if not really cheap to get
+
+- keep a list of stored files. With this list we can perform a
+  "differential" backup, creating a small backup with only new / modified files.
+  Always back up complete files for maximum integrity (no rdiff-like functionality)
+
+- full backups and differential backups need to be merged on restore.
+
+Intra2net-specific requirement:
+- Once the final volume is written, an "info.dat" file is created.
+  It contains the md5 checksums of all volumes + how much space
+  each user in the cyrus directory occupies. We should also store
+  in which file and at which position the user data begins,
+  so we can quickly extract all messages of a user.
+
+*** Design ideas ***
+- written in python, python 3.x support if possible, LGPLv3 license
+- based upon the tar format / python tarfile module.
+  Should be extractable with standard tools in case of an emergency
+- use gzip for compression
+- use pycrypto for strong encryption
+  (-> avoid forking subprocesses if possible)
+- file modification check via stat (like git):
+  Check the following criteria:
+  - stat.st_mode
+  - mtime
+  - ctime
+  - uid/gid
+  - inode number
+  - file size
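+
+  A minimal sketch of this check (the helper name and the index
+  record format are placeholders, not a final API):
+
+      import os
+
+      # stat fields to compare, mirroring the criteria listed above
+      STAT_FIELDS = ('st_mode', 'st_mtime', 'st_ctime',
+                     'st_uid', 'st_gid', 'st_ino', 'st_size')
+
+      def file_changed(path, old_record):
+          """Sketch: old_record is the stat dict from the previous index."""
+          st = os.stat(path)
+          return any(getattr(st, field) != old_record.get(field)
+                     for field in STAT_FIELDS)
+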
+- the file index is separate from the backup volumes.
+  It is just needed to perform the differential backup.
+
+- the file index uses an easily parsable, readable and extendable
+  format. XML might be a good choice.
+- The file index can become quite large: make sure it can always
+  be processed as a stream and never needs to be kept in RAM completely
+- Store file positions in the file index to make extraction
+  faster when the file index is available
+
+  When performing the differential backup, only process
+  one directory at a time to keep memory requirements low.
+  [Instead of using a recursive algorithm, insert subdirectories
+   to be processed at the front of a queue, so files for
+   one user are still grouped together in the backup file.]
+
+- differential backup: deleted files are marked by a path prefix.
+  regular files in a diff backup: diff/regular/path/file
+  deleted file markers in a diff backup: del/regular/path/file
+
+  Idea: Differential backups could be marked as such if the first file
+  of the tar archive is a "DIFF-BACKUP" file.
+  Then we prefix all filenames like above, otherwise not.
+  -> full backups contain normal paths.
+
+- gzip supports appending to existing compressed files (zcat reads them as one stream).
+
+  The compression for tar archives is usually added on top of the whole
+  archive stream (-> .tar.gz). We want to do it a little differently,
+  while staying fully compatible with standard practice:
+
+  For every file and its surrounding tar metadata we back up, restart
+  the gzip compression for maximum robustness. This gives slightly
+  worse compression, but if two consecutive files are corrupt, we
+  can still extract the files beyond the corruption.
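+
+  A minimal sketch of the per-file restart, assuming we already have
+  the raw tar bytes (header + data) for each file; tar_entries() is
+  a hypothetical iterator, not a final API:
+
+      import gzip
+
+      with open('backup.tar.gz', 'wb') as volume:
+          for entry in tar_entries():  # hypothetical iterator
+              # every file becomes its own gzip member; zcat and
+              # "tar -xzf" read the concatenated members as one stream
+              volume.write(gzip.compress(entry))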
+
+- encryption is added after the compression. It is done in a similar
+  way to the compression: restart the encryption at every file boundary.
+
+  To be able to find the start of a new encryption block, we need a
+  magic marker. Most probably the data needs to be escaped so it cannot
+  contain the magic marker.
+  [Look at how this is done in gzip or dar]
+
+  To increase security against known-plaintext attacks, add a small,
+  random amount of padding with random data.
+
+  [Check this concept against other common encryption attack vectors.]
+
+- provide small cmdline tools just for stream decompression / decryption.
+  This is useful for emergency recovery of corrupted archives.
+
+Minor ideas:
+- develop unit tests for most functionality
+- pylint should be mostly quiet, at least for sane message types
+- design it as a library to be used by our real backup script.
+  It should be as generic as possible so it's useful
+  to other people later on.
+- all filenames are UTF-8
+
+Similar projects (might give additional ideas):
+- duplicity (http://duplicity.nongnu.org/)
+  - Pro:
+    - written in python
+
+  - Contra:
+    - no differential backup, only incremental (needs all previous backup files)
+    - creates diffs of files (rsync based)
+      [Both of these design decisions weaken the rule that we should be able
+       to recover as much as possible from broken machines.]
+    - plans to move away from tar, which makes long-term maintenance harder
+
+    - uses bzr. This is personal taste, I just prefer git :)
+
+- dar / libdar (http://dar.linux.free.fr/)
+  - Pro:
+    - supports most of our requirements
+    - code matured for ten years
+
+  - Neutral:
+    - written in C++ (the code looks a bit C-ish though)
+    - way more features than we need
+
+  - Contra:
+    - huge memory requirements: it keeps an index of all
+      files in memory. Each file occupies at least 850 bytes
+      + the path. We aborted our test with 30,000,000 files
+      when the program had already slurped up 19GB of RAM.
+
+      This will probably break our entry-level boxes / old machines
+      without enough RAM and swap. Fixing this will probably
+      require a rewrite of a lot of core components.
+
+    - no unit tests / C-ish code style / French class and variable names
+    - does not use the tar format / custom format
+    - state of the python binding is unknown. Probably working, no unit tests.
+
+Related reads:
+  - http://dar.linux.free.fr/doc/Notes.html
+  - http://duplicity.nongnu.org/new_format.html
+
+Security ideas (TO BE DECIDED ON, STILL VAGUE IDEAS):
+- Optionally: prevent directory traversal attacks on restore
+- forbid restoring programs with setuid/setgid
+
+Misc ideas:
+- benchmark bz2 / lzo compression
-- 
1.7.1