For our backup system, we have the following requirements:

*** Requirements ***
- backup of many files, even within a single directory (cyrus imapd data folder)
- split backup volumes at a configurable size
- split large files across volumes
- compression support
- encryption support
- filtering of files to include/exclude for backup
- filtering of files to include/exclude for restore

- robustness against corrupted archives.
  If one file is corrupt / partially written, the rest should still be readable.

- preserve file ownership, permissions and similar metadata.
  Note: No support for hardlinks or xattrs necessary unless it is really cheap to add

- keep a list of stored files. With this list we can perform a
  "differential" backup, to create a small backup containing only new / modified files.
  Always back up complete files for maximum integrity (no rdiff-like functionality)

- full backups and differential backups need to be merged on restore.

Intra2net specific requirement:
- Once the final volume is written, an "info.dat" file is created.
  It contains the MD5 checksums of all volumes + how much space
  each user in the cyrus directory occupies. We should also store
  in which file and at which position the user data begins,
  so we can quickly extract all messages of a user (a hypothetical
  example follows below).
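
  A hypothetical sketch of what such an info.dat could look like
  (all field names, layout and values below are made up for
  illustration, not a decided format):

      [volumes]
      vol-001.tar.gz = md5:9e107d9d372bb6826bd81d3542a419d6
      vol-002.tar.gz = md5:e4d909c290d0fb1ca068ffaddf22cbd0

      [users]
      # user = bytes occupied, volume of first entry, offset in volume
      alice = 1073741824, vol-001.tar.gz, 1536
      bob   = 52428800, vol-002.tar.gz, 0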

*** Design ideas ***
- written in python, python 3.x support if possible, LGPLv3 license
- based upon the tar format / Python's tarfile module.
  Should be extractable with standard tools in case of an emergency
- use gzip for compression
- use pycrypto for strong encryption
  (-> avoid forking subprocesses if possible)
- file modification check via stat (like git), see the sketch below.
  Check the following criteria:
  - st_mode
  - st_mtime
  - st_ctime
  - st_uid / st_gid
  - st_ino (inode number)
  - st_size (file size)
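
  A minimal sketch of such a stat-based change check (the snapshot
  layout is an assumption, not fixed yet):

      import os

      # stat fields compared to decide whether a file was modified
      STAT_FIELDS = ('st_mode', 'st_mtime', 'st_ctime',
                     'st_uid', 'st_gid', 'st_ino', 'st_size')

      def stat_snapshot(path):
          """Record the stat fields relevant for change detection."""
          st = os.stat(path)
          return dict((field, getattr(st, field)) for field in STAT_FIELDS)

      def file_changed(path, old_snapshot):
          """True if the file differs from the snapshot in the index."""
          return stat_snapshot(path) != old_snapshot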
- the file index is separate from the backup volumes.
  It is only needed to perform the differential backup.

- the file index uses an easily parsable, readable and extendable
  format; JSON might be a good choice.

  Good read on streaming JSON objects:
  http://www.enricozini.org/2011/tips/python-stream-json/

- The file index can become quite large: make sure it can always
  be processed as a stream and never needs to be kept completely in RAM
- Store file positions in the file index to make extraction
  faster when the file index is available (see the sketch below)
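
  A minimal sketch of a streaming index, assuming one JSON object per
  line so reading and writing never hold more than one entry in RAM
  (the entry fields are assumptions):

      import json

      def write_index_entry(index_file, path, snapshot, volume, offset):
          # one JSON object per line: appendable and parsable as a stream
          entry = {'path': path, 'stat': snapshot,
                   'volume': volume, 'offset': offset}
          index_file.write(json.dumps(entry) + '\n')

      def read_index(index_file):
          # generator: yields one entry at a time, constant memory usage
          for line in index_file:
              yield json.loads(line)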

  When performing the differential backup, only process
  one directory at a time to keep memory requirements low.
  [Instead of using a recursive algorithm, insert subdirectories
  to be processed at the front of a queue. This way the files of
  one user are still grouped together in the backup file;
  see the sketch below.]
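
  A sketch of this queue-based traversal (no recursion, one directory
  listed at a time):

      import os
      from collections import deque

      def walk_grouped(root):
          """Yield all files below root, one directory at a time.

          Subdirectories go to the front of the queue, so the walk is
          depth-first and the files of one user (subtree) stay grouped
          together in the backup.
          """
          queue = deque([root])
          while queue:
              directory = queue.popleft()
              subdirs = []
              for name in sorted(os.listdir(directory)):
                  path = os.path.join(directory, name)
                  if os.path.isdir(path):
                      subdirs.append(path)
                  else:
                      yield path
              # prepend, keeping the subdirectories' original order
              queue.extendleft(reversed(subdirs))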

- differential backup: deleted files are marked by a path prefix.
  regular files in a diff backup:        diff/regular/path/file
  deleted file markers in a diff backup: del/regular/path/file

  Idea: Differential backups could be marked as such if the first file
  of the tar archive is a "DIFF-BACKUP" file.
  Then we prefix all filenames as above, otherwise not.
  -> full backups contain normal paths (see the sketch below).
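
  A small sketch of this naming scheme (prefix strings as proposed above):

      def archive_name(path, is_diff, deleted=False):
          """Map a filesystem path to its name inside the tar archive."""
          if not is_diff:
              return path                  # full backups: normal paths
          prefix = 'del/' if deleted else 'diff/'
          return prefix + path.lstrip('/')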

- the file index should optionally be compressed and/or encrypted, too.

- gzip supports concatenating compressed streams in a single file;
  zcat decompresses them all in sequence.

  The compression for tar archives is usually applied on top of the whole
  archive stream (-> .tar.gz). We want to do it a little differently,
  but still fully compatible with standard practice:

  For every file and its surrounding tar metadata we back up, restart the
  gzip compression for maximum robustness. This gives
  slightly worse compression, but even if two consecutive files are corrupt,
  we can still extract the files beyond the corruption (see the sketch below).
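
  A minimal sketch of restarting gzip per archive member (the tar
  serialization is reduced to a pre-built byte string here):

      import gzip

      def append_member(volume_path, member_bytes):
          """Append one tar member as its own gzip stream.

          Concatenated gzip streams still form a valid gzip file,
          so the volume remains extractable with zcat / tar.
          """
          with open(volume_path, 'ab') as raw:
              # a fresh GzipFile restarts the compression state
              with gzip.GzipFile(fileobj=raw, mode='wb') as compressed:
                  compressed.write(member_bytes)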

- encryption is added after the compression. It is done in a similar
  way as the compression: restart the encryption at every file boundary.

  To be able to find the start of a new encryption block, we need a
  magic marker. Most probably the data needs to be escaped so it cannot
  contain the magic marker (one possible scheme is sketched below).
  [ Look at how this is done in gzip or dar ]

  To increase security against known-plaintext attacks, add a small,
  random amount of padding with random data.

  [Check this concept against other common encryption attack vectors.]
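
  One possible escaping scheme is classic byte stuffing; the marker
  value and escape rule below are assumptions, not a decided format.
  Every payload byte equal to the marker's first byte gets a 0x00
  appended, so the real marker (whose second byte is 0x9a, never 0x00)
  cannot occur inside escaped data:

      MARKER = b'\xf0\x9a\xbe\xef'   # hypothetical block start marker

      def escape(data):
          # 0xf0 in the payload becomes 0xf0 0x00
          return data.replace(b'\xf0', b'\xf0\x00')

      def unescape(data):
          return data.replace(b'\xf0\x00', b'\xf0')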

- provide small cmdline tools just for stream decompression / decryption.
  This is useful for emergency recovery of corrupted archives;
  see the sketch below.
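
  A sketch of such a recovery pass for the gzip layer: scan a damaged
  volume for gzip magic bytes and try to inflate a stream at every
  match (whole-file read kept for brevity; a real tool should stream):

      import sys
      import zlib

      GZIP_MAGIC = b'\x1f\x8b\x08'   # gzip header, deflate method

      def recover_streams(data):
          """Yield the payload of every gzip stream that still inflates."""
          pos = 0
          while True:
              pos = data.find(GZIP_MAGIC, pos)
              if pos < 0:
                  break
              # 16 + MAX_WBITS makes zlib expect a gzip wrapper
              decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
              try:
                  yield decomp.decompress(data[pos:])
              except zlib.error:
                  pass       # damaged stream, try the next marker
              pos += 1

      if __name__ == '__main__':
          with open(sys.argv[1], 'rb') as volume:
              for payload in recover_streams(volume.read()):
                  sys.stdout.buffer.write(payload)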

Minor ideas:
- develop unit tests for most functionality
- pylint should be mostly quiet, at least for sensible message types
- designed as a library to be used by our real backup script.
  Should be as generic as possible so it's useful
  for other people later on.
- all filenames are UTF-8

Similar projects (might give additional ideas):
- duplicity (http://duplicity.nongnu.org/)
  - Pro:
    - written in python

  - Contra:
    - no differential backup, only incremental (needs all previous backup files)
    - creates diffs of files (rsync based)
      [Both of these design decisions weaken the rule that we should be able
      to recover as much as possible from broken machines.]
    - plans to move away from tar, which makes long-term maintenance harder

    - uses bzr. This is personal taste, I just prefer git :)

- dar / libdar (http://dar.linux.free.fr/)
  - Pro:
    - supports most of our requirements
    - code has matured for ten years

  - Neutral:
    - written in C++ (code looks a bit C-ish though)
    - way more features than we need

  - Contra:
    - huge memory requirements: it keeps an index of all
      files in memory. Each file occupies at least 850 bytes
      + the path. We aborted our test with 30,000,000 files
      when the program had already slurped up 19 GB of RAM.

      Will probably break our entry-level boxes / old machines
      without enough RAM and swap. Fixing this will probably
      require a rewrite of a lot of core components.

    - no unit tests / C-ish code style / French class and variable names
    - not using the tar format / custom format
    - state of the python binding is unknown. Probably working, no unit tests.

Related reads:
- http://dar.linux.free.fr/doc/Notes.html
- http://duplicity.nongnu.org/new_format.html

Security ideas (TO BE DECIDED ON, STILL VAGUE IDEAS):
- Optionally: prevent directory traversal attacks on restore
- forbid restoring programs with setuid/setgid bits

Misc ideas:
- benchmark bz2 / lzo compression