For our backup system, we have the following requirements:

*** Requirements ***

- backup of many files, even in one directory (cyrus imapd data folder)
- split backup volumes at a configurable size
- split large files across volumes
- compression support
- encryption support
- filtering of files to include/exclude for backup
- filtering of files to include/exclude for restore
- robustness against corrupted archives. If one file is corrupt / partially
  written, the rest should still be readable.
- preserve file ownership and similar metadata.
  Note: No support for hardlinks or xattrs necessary if not really cheap to get
- keep a list of stored files. With this list we can perform a "differential"
  backup, to create a small backup containing only new / modified files.
  Always back up complete files for maximum integrity (no rdiff-like
  functionality)
- full backups and differential backups need to be merged on restore.

Intra2net specific requirement:

- Once the final volume is written, an "info.dat" file is created. It contains
  the md5 checksums of all volumes + how much space each user in the cyrus
  directory occupies. We should also store in which file and at which position
  the user data begins, so we can quickly extract all messages of a user.
  (see the md5 sketch after the design ideas)

*** Design ideas ***

- written in python, python 3.x support if possible, LGPLv3 license
- based upon the tar format / Python's tarfile module. Should be extractable
  with standard tools in case of an emergency
- use gzip for compression
- use pycrypto for strong encryption (-> avoid forking subprocesses if possible)
- file modification check via stat (like git), using the following criteria
  (see the sketch after this section):
  - stat.st_mode
  - mtime
  - ctime
  - uid/gid
  - inode number
  - file size
- the file index is separate from the backup volumes. It is only needed to
  perform the differential backup.
- the file index uses an easily parsable, readable and extendable format;
  JSON might be a good choice. Good read on streaming JSON objects:
  http://www.enricozini.org/2011/tips/python-stream-json/
- The file index can become quite large: make sure it can always be processed
  as a stream and never needs to be kept completely in RAM
  (see the streaming sketch after this section)
- Store file positions in the file index to make extraction faster when the
  file index is available
- When performing the differential backup, only process one directory at a
  time to keep memory requirements low. [Instead of using a recursive
  algorithm, insert subdirectories to be processed at the beginning of a
  queue. That way the files of one user are still grouped together in the
  backup file; see the queue sketch after this section.]
- differential backup: deleted files are marked by a path prefix.
  regular files in a diff backup:        diff/regular/path/file
  deleted file markers in a diff backup: del/regular/path/file
  Idea: Differential backups could be marked as such if the first file of the
  tar archive is a "DIFF-BACKUP" file. Then we prefix all filenames like
  above, otherwise not. -> full backups contain normal paths.
- the file index should optionally be compressed and/or encrypted, too.
- gzip compression supports appending to existing compressed files (zcat).
  The compression for tar archives is usually added on top of the whole
  archive stream (-> .tar.gz). We want to do it a bit differently, but still
  fully compatible with standard practice: for every file and its surrounding
  tar metadata we back up, restart the gzip compression for maximum
  robustness. This gives slightly worse compression, but if two consecutive
  files are corrupt, we can still extract the files beyond the corruption
  (see the gzip sketch after this section).
- encryption is added after the compression. It is done in a similar way as
  the compression: restart the encryption on every file boundary. To be able
  to find the start of a new encryption block, we need a magic marker. Most
  probably the data needs to be escaped so it does not contain the magic
  marker. [Look how this is done in gzip or dar.] To increase security
  against known plaintext attacks, add a small, random amount of padding
  with random data. [Check this concept against other common encryption
  attack vectors.]
- provide small cmdline tools just for stream decompression / decryption.
  This is useful for emergency recovery of corrupted archives.
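The md5 part of the info.dat requirement is straightforward with Python's
hashlib. A minimal sketch; the volume file names and the final info.dat
layout are only placeholders, not decided yet:

    import hashlib

    def md5_of_file(path, block_size=1024 * 1024):
        """Compute the md5 checksum of one backup volume, reading in blocks
        so even large volumes never have to fit into RAM."""
        md5 = hashlib.md5()
        with open(path, "rb") as volume:
            while True:
                block = volume.read(block_size)
                if not block:
                    break
                md5.update(block)
        return md5.hexdigest()

    # hypothetical usage: checksum every written volume for info.dat
    # volume_paths = ["backup.tar.gz.0", "backup.tar.gz.1"]
    # checksums = {path: md5_of_file(path) for path in volume_paths}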
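A minimal sketch of the stat-based change detection mentioned in the design
ideas; the shape of the stored signature (a plain dict here) is an
assumption, not a decided index format:

    import os

    STAT_FIELDS = ("st_mode", "st_mtime", "st_ctime",
                   "st_uid", "st_gid", "st_ino", "st_size")

    def stat_signature(path):
        """Collect the stat fields used for change detection (like git)."""
        info = os.lstat(path)
        return {field: getattr(info, field) for field in STAT_FIELDS}

    def is_modified(path, old_signature):
        """Compare the current stat signature with the one stored in the
        file index. Any difference puts the file into the diff backup."""
        return stat_signature(path) != old_signature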
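For the streaming file index, one JSON object per line would let us write
and read the index without ever holding it in RAM completely. A sketch with
illustrative field names only:

    import json

    def write_index_entry(index_file, path, signature, volume=None, offset=None):
        """Append one file record to the index as a single JSON line.
        volume/offset are the optional positions for faster extraction."""
        entry = {"path": path, "stat": signature,
                 "volume": volume, "offset": offset}
        index_file.write(json.dumps(entry) + "\n")

    def read_index(index_file):
        """Stream the index back, one entry at a time."""
        for line in index_file:
            line = line.strip()
            if line:
                yield json.loads(line)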
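The queue-based, non-recursive directory walk could look roughly like this;
walk_one_dir_at_a_time is a hypothetical helper name:

    import os
    from collections import deque

    def walk_one_dir_at_a_time(root):
        """Yield (directory, filenames) pairs without recursion.
        New subdirectories go to the front of the queue, so the files of
        one subtree (e.g. one cyrus user) stay grouped in the backup,
        while only one directory listing is in memory at a time."""
        queue = deque([root])
        while queue:
            directory = queue.popleft()
            subdirs, files = [], []
            for name in sorted(os.listdir(directory)):
                full = os.path.join(directory, name)
                if os.path.isdir(full) and not os.path.islink(full):
                    subdirs.append(full)
                else:
                    files.append(name)
            yield directory, files
            # extendleft() reverses its argument, so reverse first to keep
            # the sorted subdirectories in order at the front of the queue
            queue.extendleft(reversed(subdirs))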
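The per-file gzip restart relies on the fact that concatenated gzip members
form a valid gzip stream, so zcat and standard tools still read the whole
archive. A rough sketch of the idea, operating on in-memory chunks instead
of the real tar output stream:

    import gzip
    import io

    def compress_restarting(chunks):
        """Compress an iterable of byte chunks, restarting gzip for each
        chunk. The result is still one valid gzip stream, but a corrupted
        chunk no longer breaks everything that comes after it."""
        output = io.BytesIO()
        for chunk in chunks:
            # each chunk (one file + its tar metadata) becomes its own gzip member
            with gzip.GzipFile(fileobj=output, mode="wb") as member:
                member.write(chunk)
        return output.getvalue()

    # sanity check: standard gzip decompresses the concatenated members
    data = compress_restarting([b"first file\n", b"second file\n"])
    assert gzip.decompress(data) == b"first file\nsecond file\n"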
Minor ideas:

- develop unit tests for most functionality
- pylint should be mostly quiet for sane messages
- designed as a library to be used by our real backup script. Should be as
  generic as possible so it's useful for other people later on.
- all filenames are UTF-8

Similar projects (might give additional ideas):

- duplicity (http://duplicity.nongnu.org/)
  - Pro:
    - written in python
  - Contra:
    - no differential backup, only incremental (needs all previous backup files)
    - creates diffs of files (rsync based)
      [Both of these design decisions weaken the rule that we should be able
      to recover as much as possible from broken machines.]
    - plans to move away from tar, which makes long-term maintenance harder
    - uses bzr. This is personal taste, I just prefer git :)

- dar / libdar (http://dar.linux.free.fr/)
  - Pro:
    - supports most of our requirements
    - code matured for ten years
  - Neutral:
    - written in C++ (code looks a bit C-ish though)
    - way more features than we need
  - Contra:
    - huge memory requirements. It keeps an index of all files in memory.
      Each file occupies at least 850 bytes + the path. We aborted our test
      with 30,000,000 files when the program had already slurped up 19GB of
      RAM. Will probably break our entry level boxes / old machines without
      enough RAM and swap. Fixing this will probably require a rewrite of a
      lot of core components.
    - no unit tests / C-ish code style / French class and variable names
    - not using the tar format / custom format
    - state of the python binding is unknown. Probably working, no unit tests.

Related reads:

- http://dar.linux.free.fr/doc/Notes.html
- http://duplicity.nongnu.org/new_format.html

Security ideas (TO BE DECIDED ON, STILL VAGUE IDEAS):

- Optionally: Prevent directory traversal attacks on restore
- forbid restoring of programs with setuid/setgid
  (see the sketch at the end of this document)

Misc ideas:

- benchmark bz2 / lzo compression
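A rough sketch of how the two security ideas could be enforced on restore
(both still to be decided); is_safe_to_restore is a hypothetical helper and
member is assumed to be a tarfile.TarInfo:

    import os
    import stat

    def is_safe_to_restore(member, target_dir):
        """Return True if a tar member may be extracted: its resolved path
        must stay inside target_dir and it must not carry setuid/setgid."""
        target_real = os.path.realpath(target_dir)
        destination = os.path.realpath(os.path.join(target_real, member.name))
        inside_target = (destination == target_real or
                         destination.startswith(target_real + os.sep))
        has_suid_sgid = member.mode & (stat.S_ISUID | stat.S_ISGID)
        return inside_target and not bool(has_suid_sgid)

A restore loop would then simply skip (or at least log) every member for
which this check returns False.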