From 414c0518f4e131be7a8c7fb85c904fc81e850bc9 Mon Sep 17 00:00:00 2001
From: Thomas Jarosch
Date: Tue, 4 Jun 2013 16:59:58 +0200
Subject: [PATCH] Add design ideas and first TODO steps

---
 docs/TODO.txt         |   25 ++++
 docs/design_ideas.txt |  180 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 205 insertions(+), 0 deletions(-)
 create mode 100644 docs/TODO.txt
 create mode 100644 docs/design_ideas.txt

diff --git a/docs/TODO.txt b/docs/TODO.txt
new file mode 100644
index 0000000..b85c24a
--- /dev/null
+++ b/docs/TODO.txt
@@ -0,0 +1,25 @@
+*** First steps ***
+We need to do some initial research to see whether tar
+can satisfy our design requirements.
+
+1. Find out what needs to be done for multi-volume
+   support in the python tarfile module. GNU tar already supports this.
+
+   [How did duplicity solve this?]
+
+2. Design how we are going to split huge files over multiple volumes
+   and recognize them later on.
+
+   [How did duplicity solve this? There are unit tests for this]
+
+3. How can we restart the gzip compression for every file in the tar archive?
+
+4. Have a look at duplicity and evaluate whether it can be adapted
+   to our needs in an upstream-compatible way.
+
+   Maybe it can be tweaked to always produce "full snapshots"
+   for the differential backup scenario?
+
+5. Collect feedback from Intra2net. If all is fine,
+   design the class layout for all the design requirements
+   and have another round of feedback.
diff --git a/docs/design_ideas.txt b/docs/design_ideas.txt
new file mode 100644
index 0000000..a16a0d3
--- /dev/null
+++ b/docs/design_ideas.txt
@@ -0,0 +1,180 @@
+For our backup system, we have the following requirements:
+
+*** Requirements ***
+- backup of many files, even within a single directory (cyrus imapd data folder)
+- split backup volumes at a configurable size
+- split large files across volumes
+- compression support
+- encryption support
+- filtering of files to include/exclude for backup
+- filtering of files to include/exclude for restore
+
+- robustness against corrupted archives.
+  If one file is corrupt / partially written, the rest should still be readable.
+
+- preserve file ownership, permissions and timestamps.
+  Note: no support for hardlinks or xattrs necessary if not really cheap to get
+
+- keep a list of stored files. With this list we can perform a
+  "differential" backup, creating a small backup with only new / modified files.
+  Always back up complete files for maximum integrity (no rdiff-like functionality)
+
+- full backups and differential backups need to be merged on restore.
+
+Intra2net-specific requirement:
+- Once the final volume is written, an "info.dat" file is created.
+  It contains the md5 checksums of all volumes + how much space
+  each user in the cyrus directory occupies. We should also store
+  in which file and at which position the user data begins,
+  so we can quickly extract all messages of a user.
+
+*** Design ideas ***
+- written in python, python 3.x support if possible, LGPLv3 license
+- based upon the tar format / python tarfile module.
+  Should be extractable with standard tools in case of an emergency
+- use gzip for compression
+- use pycrypto for strong encryption
+  (-> avoid forking subprocesses if possible)
+- file modification check via stat (like git):
+  Check the following criteria:
+  - stat.st_mode
+  - mtime
+  - ctime
+  - uid/gid
+  - inode number
+  - file size
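+
+  A minimal sketch of this check (the helper name and the index
+  record format are placeholders, not a final API):
+
+      import os
+
+      # stat fields to compare, mirroring the criteria listed above
+      STAT_FIELDS = ('st_mode', 'st_mtime', 'st_ctime',
+                     'st_uid', 'st_gid', 'st_ino', 'st_size')
+
+      def file_changed(path, old_record):
+          """Sketch: old_record is the stat dict from the previous index."""
+          st = os.stat(path)
+          return any(getattr(st, field) != old_record.get(field)
+                     for field in STAT_FIELDS)
+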
+- the file index is separate from the backup volumes.
+  It is just needed to perform the differential backup.
+
+- the file index uses an easily parsable, readable and extendable
+  format. XML might be a good choice.
+- The file index can become quite large: make sure it can always
+  be processed as a stream and never needs to be kept in RAM completely
+- Store file positions in the file index to make extraction
+  faster when the file index is available
+
+  When performing the differential backup, only process
+  one directory at a time to keep memory requirements low.
+  [Instead of using a recursive algorithm, insert subdirectories
+   to be processed at the front of a queue, so files for
+   one user are still grouped together in the backup file.]
+
+- differential backup: deleted files are marked by a path prefix.
+  regular files in a diff backup: diff/regular/path/file
+  deleted file markers in a diff backup: del/regular/path/file
+
+  Idea: Differential backups could be marked as such if the first file
+  of the tar archive is a "DIFF-BACKUP" file.
+  Then we prefix all filenames like above, otherwise not.
+  -> full backups contain normal paths.
+
+- gzip supports appending to existing compressed files (zcat reads them as one stream).
+
+  The compression for tar archives is usually added on top of the whole
+  archive stream (-> .tar.gz). We want to do it a little differently,
+  while staying fully compatible with standard practice:
+
+  For every file and its surrounding tar metadata we back up, restart
+  the gzip compression for maximum robustness. This gives slightly
+  worse compression, but if two consecutive files are corrupt, we
+  can still extract the files beyond the corruption.
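+
+  A minimal sketch of the per-file restart, assuming we already have
+  the raw tar bytes (header + data) for each file; tar_entries() is
+  a hypothetical iterator, not a final API:
+
+      import gzip
+
+      with open('backup.tar.gz', 'wb') as volume:
+          for entry in tar_entries():  # hypothetical iterator
+              # every file becomes its own gzip member; zcat and
+              # "tar -xzf" read the concatenated members as one stream
+              volume.write(gzip.compress(entry))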
+
+- encryption is added after the compression. It is done in a similar
+  way to the compression: restart the encryption at every file boundary.
+
+  To be able to find the start of a new encryption block, we need a
+  magic marker. Most probably the data needs to be escaped so it cannot
+  contain the magic marker.
+  [Look at how this is done in gzip or dar]
+
+  To increase security against known-plaintext attacks, add a small,
+  random amount of padding with random data.
+
+  [Check this concept against other common encryption attack vectors.]
+
+- provide small cmdline tools just for stream decompression / decryption.
+  This is useful for emergency recovery of corrupted archives.
+
+Minor ideas:
+- develop unit tests for most functionality
+- pylint should be mostly quiet, at least for sane message types
+- design it as a library to be used by our real backup script.
+  It should be as generic as possible so it's useful
+  to other people later on.
+- all filenames are UTF-8
+
+Similar projects (might give additional ideas):
+- duplicity (http://duplicity.nongnu.org/)
+  - Pro:
+    - written in python
+
+  - Contra:
+    - no differential backup, only incremental (needs all previous backup files)
+    - creates diffs of files (rsync based)
+      [Both of these design decisions weaken the rule that we should be able
+       to recover as much as possible from broken machines.]
+    - plans to move away from tar, which makes long-term maintenance harder
+
+    - uses bzr. This is personal taste, I just prefer git :)
+
+- dar / libdar (http://dar.linux.free.fr/)
+  - Pro:
+    - supports most of our requirements
+    - code matured for ten years
+
+  - Neutral:
+    - written in C++ (the code looks a bit C-ish though)
+    - way more features than we need
+
+  - Contra:
+    - huge memory requirements: it keeps an index of all
+      files in memory. Each file occupies at least 850 bytes
+      + the path. We aborted our test with 30,000,000 files
+      when the program had already slurped up 19GB of RAM.
+
+      This will probably break our entry-level boxes / old machines
+      without enough RAM and swap. Fixing this will probably
+      require a rewrite of a lot of core components.
+
+    - no unit tests / C-ish code style / French class and variable names
+    - does not use the tar format / custom format
+    - state of the python binding is unknown. Probably working, no unit tests.
+
+Related reads:
+  - http://dar.linux.free.fr/doc/Notes.html
+  - http://duplicity.nongnu.org/new_format.html
+
+Security ideas (TO BE DECIDED ON, STILL VAGUE IDEAS):
+- Optionally: prevent directory traversal attacks on restore
+- forbid restoring programs with setuid/setgid
+
+Misc ideas:
+- benchmark bz2 / lzo compression
-- 
1.7.1