For our backup system, we have the following requirements:

*** Requirements ***
- backup of many files, even in one directory (cyrus imapd data folder)
- split backup volumes at a configurable size
- split large files across volumes
- compression support
- encryption support
- filtering of files to include/exclude for backup
- filtering of files to include/exclude for restore

- robustness against corrupted archives.
  If one file is corrupt / partially written, the rest should still be readable.

- preserve file ownership, permissions and similar metadata.
  Note: no support for hardlinks or xattrs is necessary unless it is really cheap to get

- keep a list of stored files. With this list we can perform a
  "differential" backup, to create a small backup only with new / modified files.
  Always back up complete files for maximum integrity (no rdiff-like functionality)

- full backups and differential backups need to be merged on restore.

Intra2net-specific requirement:
- Once the final volume is written, an "info.dat" file is created.
  It contains the md5 checksums of all volumes plus how much space
  each user in the cyrus directory occupies. We should also store
  in which file and at which position the user data begins,
  so we can quickly extract all messages of a user.
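
  A rough sketch of how the volume checksums could be collected without
  loading whole volumes into RAM (the volume file names below are only
  placeholders, the real info.dat layout is defined elsewhere):

    import hashlib

    def md5_of_volume(path, chunk_size=1024 * 1024):
        # stream the volume through md5 in 1 MiB chunks
        md5 = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                md5.update(chunk)
        return md5.hexdigest()

    # placeholder volume names
    volumes = ['backup.tar.gz.0', 'backup.tar.gz.1']
    with open('info.dat', 'w') as info:
        for vol in volumes:
            info.write('%s  %s\n' % (md5_of_volume(vol), vol))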

*** Design ideas ***
- written in python, python 3.x support if possible, LGPLv3 license
- based upon the tar format / the Python tarfile module.
  Should be extractable with standard tools in case of an emergency
- use gzip for compression
- use pycrypto for strong encryption
  (-> avoid forking subprocesses if possible)
- file modification check via stat (like git):
  Check the following criteria:
  - stat.st_mode
  - mtime
  - ctime
  - uid/gid
  - inode number
  - file size
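
  A minimal sketch of the comparison; the keys of the stored index
  entry ('old') are an assumption here, the real index format is
  described below:

    import os

    def file_changed(path, old):
        # 'old' is the entry written to the file index by the last backup
        st = os.lstat(path)
        return (st.st_mode  != old['mode']  or
                st.st_mtime != old['mtime'] or
                st.st_ctime != old['ctime'] or
                st.st_uid   != old['uid']   or
                st.st_gid   != old['gid']   or
                st.st_ino   != old['inode'] or
                st.st_size  != old['size'])
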
- the file index is separate from the backup volumes.
  It is just needed to perform the differential backup.

- the file index uses an easily parsable, readable and extendable
  format; JSON might be a good choice.

  Good read on streaming JSON objects:
  http://www.enricozini.org/2011/tips/python-stream-json/

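  One possible way to keep the index streamable is one JSON object per
  line (the entry fields are whatever we decide to store); see also the
  next point about memory usage:

    import json

    def append_index_entry(index_file, entry):
        # one JSON document per line -> the index can be read back as a stream
        index_file.write(json.dumps(entry) + '\n')

    def iter_index(index_file):
        # yields entries one by one, never holds the whole index in RAM
        for line in index_file:
            line = line.strip()
            if line:
                yield json.loads(line)
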
- The file index can become quite large: make sure it can always
  be processed as a stream and never needs to be completely kept in RAM
- Store file positions in the file index to make extraction
  faster when the file index is available

  When performing the differential backup, only process
  one directory at a time to keep memory requirements low.
  [Instead of using a recursive algorithm, insert subdirectories
  to be processed at the beginning of a queue. So files for
  one user are still grouped together in the backup file.]
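
  Sketch of the non-recursive walk described above (names are made up):

    import os
    from collections import deque

    def walk_grouped(root):
        # process one directory at a time; subdirectories go to the FRONT
        # of the queue so the files of one user stay grouped in the archive
        queue = deque([root])
        while queue:
            directory = queue.popleft()
            subdirs = []
            for name in sorted(os.listdir(directory)):
                path = os.path.join(directory, name)
                if os.path.isdir(path) and not os.path.islink(path):
                    subdirs.append(path)
                else:
                    yield path
            queue.extendleft(reversed(subdirs))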

- differential backup: deleted files are marked by a path prefix.
  regular files in a diff backup: diff/regular/path/file
  deleted file markers in a diff backup: del/regular/path/file

  Idea: Differential backups could be marked as such if the first file
  of the tar archive is a "DIFF-BACKUP" file.
  Then we prefix all filenames like above; otherwise not.
  -> full backups contain normal paths.
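
  Sketch of the prefix handling (helper names are made up):

    def diff_member_name(path, deleted=False):
        # name a file gets inside a differential archive
        prefix = 'del/' if deleted else 'diff/'
        return prefix + path.lstrip('/')

    def parse_member_name(name):
        # returns (path, is_deleted); full backups contain normal paths
        if name.startswith('del/'):
            return name[len('del/'):], True
        if name.startswith('diff/'):
            return name[len('diff/'):], False
        return name, False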

- the file index should optionally be compressed and/or encrypted, too.

- gzip compression supports appending to existing compressed files (zcat).

  The compression for tar archives is usually added on top of the whole
  archive stream (-> .tar.gz). We want to do it a little bit differently,
  but still fully compatible with standard practice:

  For every file and its surrounding tar metadata we back up, restart the
  gzip compression for maximum robustness. This will give
  slightly worse compression, but if two consecutive files are corrupt, we
  can still extract the files beyond the corruption.
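
  Sketch of the per-file compression restart. Concatenated gzip members
  form a valid gzip file, so zcat / gunzip -c can still read the whole
  archive in one go:

    import gzip
    import io

    def append_compressed_member(out_stream, member_bytes):
        # member_bytes = tar header + data of one file; every call writes
        # a complete, independent gzip member to the archive stream
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
            gz.write(member_bytes)
        out_stream.write(buf.getvalue())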

- encryption is added after the compression. It is done in a similar
  way as the compression: restart the encryption on every file boundary.

  To be able to find the start of a new encryption block, we need a
  magic marker. Most probably the data needs to be escaped so it does not
  contain the magic marker.
  [Look how this is done in gzip or dar]

  To increase security against known-plaintext attacks, add a small, random
  amount of padding with random data.

  [Check this concept against other common encryption attack vectors.]
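
  Very rough sketch only: AES-CBC via pycrypto is an assumption, the
  marker value is made up, and escaping the marker inside the
  ciphertext is not handled here:

    import os
    import struct
    from Crypto.Cipher import AES   # pycrypto

    MAGIC = b'\x00DELTATAR\x00'     # placeholder marker to find block starts

    def encrypt_block(key, compressed_data):
        iv = os.urandom(16)
        # 1..16 bytes of random padding against known-plaintext attacks
        pad_len = os.urandom(1)[0] % 16 + 1
        payload = bytes([pad_len]) + os.urandom(pad_len) + compressed_data
        # pad up to the AES block size (PKCS#7 style)
        missing = 16 - len(payload) % 16
        payload += bytes([missing]) * missing
        ciphertext = AES.new(key, AES.MODE_CBC, iv).encrypt(payload)
        return MAGIC + iv + struct.pack('>Q', len(ciphertext)) + ciphertext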

- provide small cmdline tools just for stream decompression / decryption.
  This is useful for emergency recovery of corrupted archives.
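
  A possible emergency helper for the compression side (decryption would
  get a similar small tool):

    #!/usr/bin/env python3
    import sys
    import zlib

    def cat_gzip_members(instream, outstream, chunk_size=64 * 1024):
        # decompress a stream of concatenated gzip members; stops at the
        # first corrupt member (a smarter tool would scan ahead for the
        # next gzip magic bytes 1f 8b and resume there)
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)  # accept gzip wrapper
        data = instream.read(chunk_size)
        while data:
            try:
                outstream.write(decomp.decompress(data))
            except zlib.error:
                sys.stderr.write('corrupt gzip data, stopping\n')
                return
            if decomp.eof:   # one member finished, start the next one
                data = decomp.unused_data + instream.read(chunk_size)
                decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
            else:
                data = instream.read(chunk_size)

    if __name__ == '__main__':
        cat_gzip_members(sys.stdin.buffer, sys.stdout.buffer)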

Minor ideas:
- develop unit tests for most functionality
- pylint should be mostly quiet, at least for its sane messages
- designed as a library to be used by our real backup script.
  Should be as generic as possible so it's useful
  for other people later on.
- all filenames are UTF-8

Similar projects (might give additional ideas):
- duplicity (http://duplicity.nongnu.org/)
  - Pro:
    - written in python

  - Contra:
    - no differential backup, only incremental (needs all previous backup files)
    - creates diffs of files (rsync based)
      [Both of these design decisions weaken the rule that we should be able
      to recover as much as possible from broken machines.]
    - plans to move away from tar, makes it harder for long-term maintenance

    - uses bzr. This is personal taste, I just prefer git :)

- dar / libdar (http://dar.linux.free.fr/)
  - Pro:
    - supports most of our requirements
    - code matured for ten years

  - Neutral:
    - written in C++ (code looks a bit C-ish though)
    - way more features than we need

  - Contra:
    - huge memory requirements. It keeps an index of all
      files in memory. Each file occupies at least 850 bytes
      + the path. We aborted our test with 30,000,000 files
      when the program had already slurped up 19 GB of RAM.

      Will probably break our entry-level boxes / old machines
      without enough RAM and swap. Fixing this will probably
      require a rewrite of a lot of core components.

    - no unit tests / C-ish code style / French class and variable names
    - not using tar format / custom format
    - state of the python binding is unknown. Probably working, no unit tests.

Related reads:
- http://dar.linux.free.fr/doc/Notes.html
- http://duplicity.nongnu.org/new_format.html

Security ideas (TO BE DECIDED ON, STILL VAGUE IDEAS):
- Optionally: prevent directory traversal attacks on restore
- forbid restoring of programs with setuid/setgid bits set
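
  Sketch of both checks (helper names are made up):

    import os
    import stat

    def inside_target(member_name, target_dir):
        # reject members that would escape the restore directory ("../../etc")
        target = os.path.realpath(target_dir)
        dest = os.path.realpath(os.path.join(target, member_name))
        return dest == target or dest.startswith(target + os.sep)

    def strip_setuid(mode):
        # drop setuid/setgid bits before applying the stored mode
        return mode & ~(stat.S_ISUID | stat.S_ISGID)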

Misc ideas:
- benchmark bz2 / lzo compression