For our backup system, we have the following requirements:

*** Requirements ***
- backup of many files, even in one directory (cyrus imapd data folder)
- split backup volumes at a configurable size
- split large files across volumes
- compression support
- encryption support
- filtering of files to include/exclude for backup
- filtering of files to include/exclude for restore

- robustness against corrupted archives.
  If one file is corrupt / partially written, the rest should still be readable.

- preserve file ownership, permissions and similar metadata.
  Note: no support for hardlinks or xattrs necessary unless it is really cheap to get

- keep a list of stored files. With this list we can perform a
  "differential" backup, to create a small backup only with new / modified files.
  Always back up complete files for maximum integrity (no rdiff-like functionality)

- full backups and differential backups need to be merged on restore.

Intra2net specific requirement:
- Once the final volume is written, an "info.dat" file is created.
  It contains the md5 checksums of all volumes + how much space
  each user in the cyrus directory occupies. We should also store
  in which file and at which position the user data begins,
  so we can quickly extract all messages of a user.
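
The checksum part of info.dat could look like this minimal sketch (the function names and output layout are assumptions, not a spec; the per-user space accounting is left out):

```python
import hashlib
import os


def volume_md5(path, chunk_size=1024 * 1024):
    """Compute the md5 checksum of one backup volume, streamed in chunks
    so even huge volumes need little RAM."""
    md5 = hashlib.md5()
    with open(path, "rb") as volume:
        for chunk in iter(lambda: volume.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def write_info_dat(info_path, volume_paths):
    """Write an info.dat line listing the md5 checksum of every volume."""
    with open(info_path, "w") as out:
        for path in volume_paths:
            out.write("%s  %s\n" % (volume_md5(path), os.path.basename(path)))
```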

*** Design ideas ***
- written in python, python 3.x support if possible, LGPLv3 license
- based upon the tar format / python's tarfile library.
  Should be extractable with standard tools in case of an emergency
- use gzip for compression
- use pycrypto for strong encryption
  (-> avoid forking subprocesses if possible)
- file modification check via stat (like git):
  Check the following criteria:
  - stat.st_mode
  - mtime
  - ctime
  - uid/gid
  - inode number
  - file size
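
A sketch of that check, assuming the previous snapshot is kept alongside the file index (the st_* names are the real os.stat() attributes; the helper names are made up):

```python
import os

# The criteria listed above, as os.stat() attribute names.
_STAT_FIELDS = ("st_mode", "st_mtime", "st_ctime", "st_uid", "st_gid",
                "st_ino", "st_size")


def stat_snapshot(path):
    """Record the stat fields relevant for change detection."""
    st = os.stat(path)
    return {field: getattr(st, field) for field in _STAT_FIELDS}


def is_modified(path, old_snapshot):
    """True if any checked stat field differs from the snapshot taken
    during the previous backup."""
    return stat_snapshot(path) != old_snapshot
```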
- the file index is separate from the backup volumes.
  It is just needed to perform the differential backup.

- the file index uses an easily parsable, readable and extendable
  format, JSON might be a good choice.

  Good read on streaming JSON objects:
  http://www.enricozini.org/2011/tips/python-stream-json/

- The file index can become quite large: make sure it can always
  be processed as a stream and never needs to be kept completely in RAM
- Store file positions in the file index to make extraction
  faster when the file index is available
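
Both points are easy to satisfy with one JSON object per line, so the index can be written and read as a stream (a sketch; the entry fields, including the file position, are illustrative):

```python
import json


def write_index_entries(fileobj, entries):
    """Write index entries one JSON object per line; entries can be any
    iterable, so the whole index never has to sit in RAM."""
    for entry in entries:
        fileobj.write(json.dumps(entry) + "\n")


def read_index_entries(fileobj):
    """Yield index entries back one line at a time."""
    for line in fileobj:
        line = line.strip()
        if line:
            yield json.loads(line)
```

An entry could then carry e.g. {"path": ..., "volume": 0, "offset": 512} to record where extraction should start.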

  When performing the differential backup, only process
  one directory at a time to keep memory requirements low.
  [Instead of using a recursive algorithm, insert subdirectories
  to be processed at the beginning of a queue. So files for
  one user are still grouped together in the backup file.]
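
The bracketed queue idea could be sketched like this (the sorted() call is an added assumption to make the order deterministic):

```python
import os
from collections import deque


def walk_grouped(root):
    """Walk a tree without recursion: the files of each directory are
    yielded together, and subdirectories are inserted at the beginning
    of the queue so one user's files stay grouped in the backup."""
    queue = deque([root])
    while queue:
        directory = queue.popleft()
        subdirs = []
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if os.path.isdir(path):
                subdirs.append(path)
            else:
                yield path
        # Insert subdirectories at the beginning of the queue.
        queue.extendleft(reversed(subdirs))
```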

- differential backup: deleted files are marked by a path prefix.
  regular files in a diff backup: diff/regular/path/file
  deleted file markers in a diff backup: del/regular/path/file

  Idea: Differential backups could be marked as such if the first file
  of the tar archive is a "DIFF-BACKUP" file.
  Then we prefix all filenames like above, otherwise not.
  -> full backups contain normal paths.
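
A sketch of the naming scheme (the prefixes come from above; the helper names are invented):

```python
PREFIX_REGULAR = "diff"   # regular files in a differential backup
PREFIX_DELETED = "del"    # markers for deleted files


def diff_member_name(path, deleted=False):
    """Build the archive member name for a path in a differential backup."""
    prefix = PREFIX_DELETED if deleted else PREFIX_REGULAR
    return prefix + "/" + path.lstrip("/")


def parse_diff_member(name):
    """Split a differential member name into (deleted, original path)."""
    prefix, _, path = name.partition("/")
    if prefix == PREFIX_DELETED:
        return True, path
    if prefix == PREFIX_REGULAR:
        return False, path
    raise ValueError("not a differential member: %r" % name)
```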

- the file index should optionally be compressed and/or encrypted, too.

- gzip compression supports appending to existing compressed files
  (zcat reads concatenated gzip streams as one).

  The compression for tar archives is usually applied on top of the whole
  archive stream (-> .tar.gz). We want to do it a bit differently,
  but still fully compatible with standard practice:

  For every file and its surrounding tar metadata we back up, restart the
  gzip compression for maximum robustness. This gives slightly worse
  compression, but even if two consecutive files are corrupt, we
  can still extract the files beyond the corruption.
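
Python's gzip module already behaves this way: every gzip.compress() call produces an independent member, and concatenated members decompress as one stream, exactly like zcat (helper names are made up):

```python
import gzip
import io


def compress_members(chunks):
    """Compress each chunk (one file plus its tar metadata) as a separate
    gzip member.  A corrupted member does not affect the members after it,
    because each one is an independent gzip stream."""
    return b"".join(gzip.compress(chunk) for chunk in chunks)


def decompress_members(data):
    """Decompress a concatenation of gzip members, as zcat would."""
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as stream:
        return stream.read()
```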

- encryption is added after the compression. It is done in a similar
  way as the compression: restart the encryption on every file boundary.

  To be able to find the start of a new encryption block, we need a
  magic marker. Most probably the data needs to be escaped so it cannot
  contain the magic marker.
  [Look how this is done in gzip or dar]
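
One classic answer is byte stuffing: reserve an escape byte, then encode every payload occurrence of the escape byte and of the marker's first byte, so the marker can never appear in escaped data. The marker and escape values below are arbitrary placeholders:

```python
# Arbitrary placeholder values for this sketch.
MARKER = b"\x9dMAGIC\x9d"   # starts each encryption block
ESCAPE = b"\x9c"            # escape byte, must differ from MARKER[0]


def escape_payload(data):
    """Byte-stuff the payload so it can never contain MARKER: after the
    second replace, no marker start byte is left in the output at all."""
    data = data.replace(ESCAPE, ESCAPE + b"\x01")
    return data.replace(MARKER[:1], ESCAPE + b"\x02")


def unescape_payload(data):
    """Reverse escape_payload()."""
    data = data.replace(ESCAPE + b"\x02", MARKER[:1])
    return data.replace(ESCAPE + b"\x01", ESCAPE)
```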

  To increase security against known-plaintext attacks, add a small, random
  amount of padding with random data.

  [Check this concept against other common encryption attack vectors.]

- provide small cmdline tools just for stream decompression / decryption.
  This is useful for emergency recovery of corrupted archives.
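
The decompression half of such a tool fits in a few lines (a sketch; hooking it up to sys.stdin.buffer / sys.stdout.buffer makes it a cmdline filter, and decryption is left out):

```python
import gzip


def stream_decompress(in_fileobj, out_fileobj, chunk_size=64 * 1024):
    """Decompress concatenated gzip members as a stream, chunk by chunk,
    so even huge archives need little RAM."""
    with gzip.GzipFile(fileobj=in_fileobj, mode="rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            out_fileobj.write(chunk)
```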

Minor ideas:
- develop unit tests for most functionality
- pylint should be mostly quiet, at least for the sane messages
- designed as a library to be used by our real backup script.
  Should be as generic as possible so it's useful
  for other people later on.
- all filenames are UTF-8

similar projects (might give additional ideas):
- duplicity (http://duplicity.nongnu.org/)
  - Pro:
    - written in python

  - Contra:
    - no differential backup, only incremental (needs all previous backup files)
    - creates diffs of files (rsync based)
      [Both of these design decisions weaken the rule that we should be able
      to recover as much as possible from broken machines.]
    - plans to move away from tar, which makes long-term maintenance harder

    - uses bzr. This is personal taste, I just prefer git :)

- dar / libdar (http://dar.linux.free.fr/)
  - Pro:
    - supports most of our requirements
    - code matured for ten years

  - Neutral:
    - written in C++ (code looks a bit C-ish though)
    - way more features than we need

  - Contra:
    - huge memory requirements: it keeps an index of all
      files in memory. Each file occupies at least 850 bytes
      + the path. We aborted our test with 30.000.000 files
      when the program had already slurped up 19GB of RAM.

      Will probably break our entry-level boxes / old machines
      without enough RAM and swap. Fixing this would probably
      require a rewrite of a lot of core components.

    - no unit tests / C-ish code style / French class and variable names
    - not using the tar format / custom format
    - state of the python binding is unknown. Probably working, no unit tests.

Related reads:
- http://dar.linux.free.fr/doc/Notes.html
- http://duplicity.nongnu.org/new_format.html

Security ideas (TO BE DECIDED ON, STILL VAGUE IDEAS):
- Optionally: prevent directory traversal attacks on restore
- forbid restoring programs with setuid/setgid bits

Misc ideas:
- benchmark bz2 / lzo compression