--- /dev/null
+For our backup system, we have the following requirements:
+
+*** Requirements ***
+- backup of many files, even when they all sit in a single directory (cyrus imapd data folder)
+- split backup volumes at a configurable size
+- split large files across volumes
+- compression support
+- encryption support
+- filtering of files to include/exclude for backup
+- filtering of files to include/exclude for restore
+
+- robustness against corrupted archives.
+ If one file is corrupt / partially written, the rest should still be readable.
+
+- preserve file ownership, permissions and timestamps.
+  Note: No support for hardlinks or xattrs necessary unless it is really cheap to get
+
+- keep a list of stored files. With this list we can perform a
+ "differential" backup, to create a small backup only with new / modified files.
+  Always back up complete files for maximum integrity (no rdiff-like functionality)
+
+- full-backups and differential backups need to be merged on restore.
+
+Intra2net specific requirement:
+- Once the final volume is written, an "info.dat" file is created.
+ It contains the md5 checksums of all volumes + how much space
+ each user in the cyrus directory occupies. We should also store
+ in which file and at which position the user data begins,
+ so we can quickly extract all messages of a user.
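+
+  A minimal sketch of the checksum part (the helper names and the exact
+  line format of "info.dat" are assumptions, not decided yet):

```python
import hashlib
import os

def volume_md5(path, chunk_size=1024 * 1024):
    """Compute the md5 checksum of one backup volume, streaming in chunks
    so the volume never has to fit into RAM."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

def write_info_dat(volumes, info_path="info.dat"):
    """Write one 'checksum  filename' line per volume (hypothetical format)."""
    with open(info_path, "w") as info:
        for vol in volumes:
            info.write("%s  %s\n" % (volume_md5(vol), os.path.basename(vol)))
```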
+
+*** Design ideas ***
+- written in python, python 3.x support if possible, LGPLv3 license
+- based upon the tar format / Python's tarfile module.
+ Should be extractable with standard tools in case of an emergency
+- use gzip for compression
+- use pycrypto for strong encryption
+ (-> avoid forking subprocesses if possible)
+- file modification check via stat (like git):
+ Check the following criteria:
+ - stat.st_mode
+ - mtime
+ - ctime
+ - uid/gid
+ - inode number
+ - file size
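+
+  A sketch of how such a check could look (field choice taken from the
+  list above; the helper names are illustrative):

```python
import os

# Fields compared to decide whether a file changed since the last backup
# (same idea as git's index): mode, mtime, ctime, uid/gid, inode, size.
STAT_FIELDS = ("st_mode", "st_mtime", "st_ctime", "st_uid", "st_gid",
               "st_ino", "st_size")

def stat_snapshot(path):
    """Capture the comparison fields for one file (lstat: don't follow links)."""
    st = os.lstat(path)
    return tuple(getattr(st, field) for field in STAT_FIELDS)

def is_modified(path, old_snapshot):
    """True if any of the stat criteria differ from the stored snapshot."""
    return stat_snapshot(path) != old_snapshot
```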
+- the file index is separate from the backup volumes.
+ It is just needed to perform the differential backup.
+
+- the file index uses an easily parsable, readable and extendable
+ format. XML might be a good choice.
+- The file index can become quite large: make sure it can always
+  be processed as a stream and never needs to be kept completely in RAM
+- Store file positions in the file index to make extraction
+ faster when the file index is available
+
+ When performing the differential backup, only process
+ one directory at a time to keep memory requirements low.
+ [Instead of using a recursive algorithm, insert subdirectories
+ to be processed at the beginning of a queue. So files for
+ one user are still grouped together in the backup file.]
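+
+  The queue idea in brackets could look like this (function name and
+  sorting are illustrative; only the stdlib is used):

```python
import os
from collections import deque

def walk_grouped(root):
    """Yield files one directory at a time, without recursion.
    Subdirectories are pushed to the FRONT of the queue, so all files
    below one top-level directory (e.g. one user) stay grouped together."""
    queue = deque([root])
    while queue:
        directory = queue.popleft()
        subdirs = []
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if os.path.isdir(path) and not os.path.islink(path):
                subdirs.append(path)
            else:
                yield path
        # depth-first order keeps one user's files adjacent in the archive
        queue.extendleft(reversed(subdirs))
```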
+
+- differential backup: deleted files are marked by a path prefix.
+ regular files in a diff backup: diff/regular/path/file
+ deleted file markers in a diff backup: del/regular/path/file
+
+ Idea: Differential backups could be marked as such if the first file
+ of the tar archive is a "DIFF-BACKUP" file.
+ Then we prefix all filenames like above, otherwise not.
+ -> full backups contain normal paths.
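+
+  The prefix convention above can be captured in two tiny helpers
+  (hypothetical names; the prefixes are the ones proposed above):

```python
# "diff/" marks a regular file in a differential backup, "del/" marks a
# deletion marker; full backups store plain paths (per the scheme above).

def diff_member_name(path, deleted=False):
    """Return the archive member name for one path in a differential backup."""
    return ("del/" if deleted else "diff/") + path.lstrip("/")

def parse_member_name(name):
    """Split an archive member name into (path, is_deletion_marker)."""
    if name.startswith("del/"):
        return name[len("del/"):], True
    if name.startswith("diff/"):
        return name[len("diff/"):], False
    return name, False   # full backup: plain path
```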
+
+- gzip supports concatenated compressed members: appending to an existing
+  compressed file yields a valid stream (zcat reads it as one file).
+
+  The compression for tar archives is usually applied on top of the whole
+  archive stream (-> .tar.gz). We want to do it a little differently,
+  but stay fully compatible with standard practice:
+
+  For every file and its surrounding tar metadata we back up, restart the
+  gzip compression for maximum robustness. This will give
+  slightly worse compression, but if two consecutive files are corrupt, we
+  can still extract the files beyond the corruption.
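+
+  This works because gzip allows concatenated members, and zcat reads
+  them back as one continuous stream. A sketch of the per-file member
+  restart (function name is illustrative):

```python
import gzip
import io

def compress_per_file(chunks):
    """Compress each chunk as an independent gzip member and concatenate.
    A corrupt member does not prevent decompressing later ones, since each
    member starts with a fresh gzip header that can be located again."""
    out = io.BytesIO()
    for chunk in chunks:
        # each 'with' block writes one complete gzip member
        with gzip.GzipFile(fileobj=out, mode="wb") as member:
            member.write(chunk)
    return out.getvalue()
```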
+
+- encryption is added after the compression. It is done in a similar
+ way as the compression: restart the encryption on every file boundary.
+
+ To be able to find the start of a new encryption block, we need a
+ magic marker. Most probably the data needs to be escaped to not contain the
+ magic marker.
+ [ Look how this is done in gzip or dar ]
+
+ To increase security against known plaintext attacks, add a small, random
+ amount of padding with random data.
+
+ [Check this concept against other common encryption attack vectors.]
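+
+  One conceivable escaping scheme (entirely an assumption, still to be
+  checked against how gzip/dar do it): byte-stuffing as in PPP/COBS,
+  where an escape byte rewrites any payload bytes that would collide
+  with the magic marker:

```python
# Both values are hypothetical placeholders, not a decided design.
MAGIC = b"\x9d\x50"   # marker that may only appear at a block start
ESC = b"\xab"         # escape byte; must not occur inside MAGIC

def escape(data):
    """Byte-stuff the payload so the MAGIC sequence can never occur in it."""
    data = data.replace(ESC, ESC + b"\x00")     # ESC      -> ESC 00
    return data.replace(MAGIC, ESC + b"\x01")   # MAGIC    -> ESC 01

def unescape(data):
    """Reverse the byte-stuffing."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i:i + 1] == ESC:
            out += ESC if data[i + 1] == 0 else MAGIC
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```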
+
+- provide small cmdline tools just for stream decompression / decryption.
+ This is useful for emergency recovery of corrupted archives.
+
+Minor ideas:
+- develop unit tests for most functionality
+- pylint should be mostly quiet for the sane message categories
+- designed as a library to be used by our real backup script.
+ Should be as generic as possible so it's useful
+ for other people later on.
+- all filenames are UTF-8
+
+similar projects (might give additional ideas):
+- duplicity (http://duplicity.nongnu.org/)
+ - Pro:
+ - written in python
+
+ - Contra:
+ - no differential backup, only incremental (needs all previous backup files)
+ - creates diffs of files (rsync based)
+ [Both of these design decisions weaken the rule that we should be able
+ to recover as much as possible from broken machines.]
+    - plans to move away from tar, which makes long-term maintenance harder
+
+ - uses bzr. This is personal taste, I just prefer git :)
+
+- dar / libdar (http://dar.linux.free.fr/)
+ - Pro:
+ - supports most of our requirements
+ - code matured for ten years
+
+ - Neutral:
+ - written in C++ (code looks a bit C-ish though)
+ - way more features than we need
+
+ - Contra:
+    - huge memory requirements. It keeps an index of all
+      files in memory. Each file occupies at least 850 bytes
+      plus the path. We aborted our test with 30 million files
+      when the program had already slurped up 19 GB of RAM.
+
+      Will probably break our entry-level boxes / old machines
+      without enough RAM and swap. Fixing this would probably
+      require a rewrite of many core components.
+
+ - no unit tests / C-ish code style / French class and variable names
+ - not using tar format / custom format
+ - state of python binding is unknown. Probably working, no unit tests.
+
+Related reads:
+ - http://dar.linux.free.fr/doc/Notes.html
+ - http://duplicity.nongnu.org/new_format.html
+
+Security ideas (TO BE DECIDED ON, STILL VAGUE IDEAS):
+- Optionally: Prevent directory traversal attack on restore
+- forbid restoring of programs with setuid/setgid
+
+Misc ideas:
+- benchmark bz2 / lzo compression