developer.intra2net.com Git - python-delta-tar/blob - docs/design_ideas.txt

   1 For our backup system, we have have the following requirements:
   2
   3
   4 *** Requirements ***
   5 - backup of many files, even in one directory (cyrus imapd data folder)
   6 - split backup volumes at a configurable size
   7 - split large files across volumes
   8 - compression support
   9 - encryption support
  10 - filtering of files to include/exclude for  backup
  11 - filtering of files to include/exclude for restore
  12
  13 - robustness against corrupted archives.
  14   If one file is corrupt / partially written, the rest should still be readable.
  15
  16 - preserve file ownership and stuff like that.
  17   Note: No support for hardlinks or xattrs neccessary if not really cheap to get
  18
  19 - keep a list of stored files. With this list we can perform a
  20   "differential" backup, to create a small backup only with new / modified files.
  21   Always backup complete files for maxiumum integrity (no rdiff-like functionality)
  22
  23 - full-backups and differential backups need to be merged on restore.
  24
  25 Intra2net specific requirement:
  26 - Once the final volume is written, an "info.dat" file is created.
  27   It contains the md5 checksums of all volumes + how much space
  28   each user in the cyrus directory occupies. We should also store
  29   in which file and at which position the user data begins,
  30   so we can quickly extract all messages of a user.
  31
  32 *** Design ideas ***
  33 - written in python, python 3.x support if possible, LGPLv3 license
  34 - based upon tar format / python-tarlib.
  35   Should be extractable with standard tools in case of an emergency
  36 - use gzip for compression
  37 - use pycrypto for strong encryption
  38   (-> avoid forking subprocesses if possible)
  39 - file modification check via stat (like git):
  40   Check the following criteria:
  41       - stat.st_mode
  42       - mtime
  43       - ctime
  44       - uid/gid
  45       - inode number
  46       - file size
  47 - the file index is separate from the backup volumes.
  48   It is just needed to perform the differential backup.
  49
  50 - the file index uses an easily parsable, readable and extendable
  51   format, JSON might be a good choice.
  52
  53   Good read on streaming JSON objects:
  54   http://www.enricozini.org/2011/tips/python-stream-json/
  55
  56 - The file index can become quite large: make sure it can always
  57   be processed as stream and never needs to be completely kept in RAM
  58 - Store file positions in the file index to make extraction
  59   faster when the file index is available
  60
  61   When performing the differential backup, only process
  62   one directory at a time to keep memory requirements low.
  63   [Instead of using a recursive algorithm, insert subdirectories
  64   to be processed at the beginning of a queue. So files for
  65   one user are still grouped together in the backup file.]
  66
  67 - differential backup: deleted files are marked by a path prefix.
  68   regular files in a diff backup: diff/regular/path/file
  69   deleted file markers in a diff backup: del/regular/path/file
  70
  71   Idea: Differential backups could be marked as such if the first file
  72   of the tar archive is a "DIFF-BACKUP" file.
  73   Then we prefix all filenames like above, otherwise not.
  74   -> full backups contain normal paths.
  75
  76 - the file index should optionally be compressed and/or encrypted, too.
  77
  78 - gzip compression supports appending to existing compressed files (zcat).
  79
  80   The compression for tar archives is usually added on top of the whole
  81   archive stream (-> .tar.gz). We want to do it a little bit different,
  82   but still fully compatible to standard practice:
  83
  84   For every file and it surrounding tar metadata we backup, restart the
  85   gzip compression for maximum robustness. This will give
  86   slightly worse compression but if two consecutive files are corrupt, we
  87   can still extract the files beyond the corruption.
  88
  89 - encryption is added after the compression. It is done in a similar
  90   way as the compression: restart the encryption on every file boundary.
  91
  92   To be able to find the start of a new encryption block, we need a
  93   magic marker. Most probably the data needs to be escaped to not contain the
  94   magic marker.
  95   [ Look how this is done in gzip or dar ]
  96
  97   To increase security against known plaintext attacks, add a small, random
  98   amount of padding with random data.
  99
 100   [Check this concept against other common encryption attack vectors.]
 101
 102 - provide small cmdline tools just for stream decompression / decryption.
 103   This is useful for emergency recovery of corrupted archives.
 104
 105 Minor ideas:
 106 - develop unit tests for most functionality
 107 - pylint should be mostly quiet for sane messages
 108 - designed as a library to be used by our real backup script.
 109   Should be as generic as possible so it's useful
 110   for other people later on.
 111 - all filenames are UTF-8
 112
 113 similar projects (might give additionals ideas):
 114 - duplicity (http://duplicity.nongnu.org/)
 115     - Pro:
 116         - written in python
 117
 118     - Contra:
 119         - no differential backup, only incremental (needs all previous backup files)
 120         - creates diffs of files (rsync based)
 121         [Both of these design decisions weaken the rule that we should be able
 122         to recover as much as possible from broken machines.]
 123         - plans to move away from tar, makes it harder for long-term maintenance
 124
 125         - uses bzr. This is personal taste, I just prefer git :)
 126
 127 - dar / libdar (http://dar.linux.free.fr/)
 128       - Pro:
 129           - supports most of our requirements
 130           - code matured for ten years
 131
 132     - Neutral:
 133         - written in C++ (code looks a bit C-ish though)
 134         - way more features than we need
 135
 136     - Contra:
 137       - huge memory requirements. It keeps an index of all
 138         files in memory. Each files occupies at least 850 bytes
 139         + the path. We aborted our test with 30.000.000 files
 140         when the program already slurped up 19GB of RAM.
 141
 142         Will probably break our entry level boxes / old machine
 143         without enough RAM and swap. Fixing this will probably
 144         require a rewrite of a lot of core components.
 145
 146       - no unit tests / C-ish code style / French class and variable names
 147       - not using tar format / custom format
 148       - state of python binding is unknown. Probably working, no unit tests.
 149
 150 Related reads:
 151     - http://dar.linux.free.fr/doc/Notes.html
 152     - http://duplicity.nongnu.org/new_format.html
 153
 154 Security ideas (TO BE DECIDED ON, STILL VAGUE IDEAS):
 155 - Optionally: Prevent directory traversal attack on restore
 156 - forbid restoring of programs with setuid/setgid
 157
 158 Misc ideas:
 159 - benchmark bz2 / lzo compression