For our backup system, we have the following requirements:

*** Requirements ***
- backup of many files, even within a single directory (cyrus imapd data folder)
- split backup volumes at a configurable size
- split large files across volumes
- compression support
- encryption support
- filtering of files to include/exclude for backup
- filtering of files to include/exclude for restore

- robustness against corrupted archives.
  If one file is corrupt / partially written, the rest should still be readable.

- preserve file ownership, permissions and similar metadata.
  Note: No support for hardlinks or xattrs necessary unless it is really cheap to add

- keep a list of stored files. With this list we can perform a
  "differential" backup, to create a small backup containing only new / modified files.
  Always back up complete files for maximum integrity (no rdiff-like functionality)

- full backups and differential backups need to be merged on restore.

Intra2net specific requirement:
- Once the final volume is written, an "info.dat" file is created.
  It contains the MD5 checksums of all volumes + how much space
  each user in the cyrus directory occupies. We should also store
  in which file and at which position the user data begins,
  so we can quickly extract all messages of a user (a hypothetical
  example follows below).
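
  A hypothetical sketch of what such an info.dat could look like
  (all field names, layout and values below are made up for
  illustration, not a decided format):

      [volumes]
      vol-001.tar.gz = md5:9e107d9d372bb6826bd81d3542a419d6
      vol-002.tar.gz = md5:e4d909c290d0fb1ca068ffaddf22cbd0

      [users]
      # user = bytes occupied, volume of first entry, offset in volume
      alice = 1073741824, vol-001.tar.gz, 1536
      bob   = 52428800, vol-002.tar.gz, 0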

*** Design ideas ***
- written in python, python 3.x support if possible, LGPLv3 license
- based upon the tar format / Python's tarfile module.
  Should be extractable with standard tools in case of an emergency
- use gzip for compression
- use pycrypto for strong encryption
  (-> avoid forking subprocesses if possible)
- file modification check via stat (like git), see the sketch below.
  Check the following criteria:
  - st_mode
  - st_mtime
  - st_ctime
  - st_uid / st_gid
  - st_ino (inode number)
  - st_size (file size)
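
  A minimal sketch of such a stat-based change check (the snapshot
  layout is an assumption, not fixed yet):

      import os

      # stat fields compared to decide whether a file was modified
      STAT_FIELDS = ('st_mode', 'st_mtime', 'st_ctime',
                     'st_uid', 'st_gid', 'st_ino', 'st_size')

      def stat_snapshot(path):
          """Record the stat fields relevant for change detection."""
          st = os.stat(path)
          return dict((field, getattr(st, field)) for field in STAT_FIELDS)

      def file_changed(path, old_snapshot):
          """True if the file differs from the snapshot in the index."""
          return stat_snapshot(path) != old_snapshot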
- the file index is separate from the backup volumes.
  It is only needed to perform the differential backup.

- the file index uses an easily parsable, readable and extendable
  format; JSON might be a good choice.

  Good read on streaming JSON objects:
  http://www.enricozini.org/2011/tips/python-stream-json/

- The file index can become quite large: make sure it can always
  be processed as a stream and never needs to be kept completely in RAM
- Store file positions in the file index to make extraction
  faster when the file index is available (see the sketch below)
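
  A minimal sketch of a streaming index, assuming one JSON object per
  line so reading and writing never hold more than one entry in RAM
  (the entry fields are assumptions):

      import json

      def write_index_entry(index_file, path, snapshot, volume, offset):
          # one JSON object per line: appendable and parsable as a stream
          entry = {'path': path, 'stat': snapshot,
                   'volume': volume, 'offset': offset}
          index_file.write(json.dumps(entry) + '\n')

      def read_index(index_file):
          # generator: yields one entry at a time, constant memory usage
          for line in index_file:
              yield json.loads(line)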

  When performing the differential backup, only process
  one directory at a time to keep memory requirements low.
  [Instead of using a recursive algorithm, insert subdirectories
  to be processed at the front of a queue. This way the files of
  one user are still grouped together in the backup file;
  see the sketch below.]
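
  A sketch of this queue-based traversal (no recursion, one directory
  listed at a time):

      import os
      from collections import deque

      def walk_grouped(root):
          """Yield all files below root, one directory at a time.

          Subdirectories go to the front of the queue, so the walk is
          depth-first and the files of one user (subtree) stay grouped
          together in the backup.
          """
          queue = deque([root])
          while queue:
              directory = queue.popleft()
              subdirs = []
              for name in sorted(os.listdir(directory)):
                  path = os.path.join(directory, name)
                  if os.path.isdir(path):
                      subdirs.append(path)
                  else:
                      yield path
              # prepend, keeping the subdirectories' original order
              queue.extendleft(reversed(subdirs))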

- differential backup: deleted files are marked by a path prefix.
  regular files in a diff backup:        diff/regular/path/file
  deleted file markers in a diff backup: del/regular/path/file

  Idea: Differential backups could be marked as such if the first file
  of the tar archive is a "DIFF-BACKUP" file.
  Then we prefix all filenames as above, otherwise not.
  -> full backups contain normal paths (see the sketch below).
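
  A small sketch of this naming scheme (prefix strings as proposed above):

      def archive_name(path, is_diff, deleted=False):
          """Map a filesystem path to its name inside the tar archive."""
          if not is_diff:
              return path                  # full backups: normal paths
          prefix = 'del/' if deleted else 'diff/'
          return prefix + path.lstrip('/')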

- the file index should optionally be compressed and/or encrypted, too.

- gzip supports concatenating compressed streams in a single file;
  zcat decompresses them all in sequence.

  The compression for tar archives is usually applied on top of the whole
  archive stream (-> .tar.gz). We want to do it a little differently,
  but still fully compatible with standard practice:

  For every file and its surrounding tar metadata we back up, restart the
  gzip compression for maximum robustness. This gives
  slightly worse compression, but even if two consecutive files are corrupt,
  we can still extract the files beyond the corruption (see the sketch below).
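
  A minimal sketch of restarting gzip per archive member (the tar
  serialization is reduced to a pre-built byte string here):

      import gzip

      def append_member(volume_path, member_bytes):
          """Append one tar member as its own gzip stream.

          Concatenated gzip streams still form a valid gzip file,
          so the volume remains extractable with zcat / tar.
          """
          with open(volume_path, 'ab') as raw:
              # a fresh GzipFile restarts the compression state
              with gzip.GzipFile(fileobj=raw, mode='wb') as compressed:
                  compressed.write(member_bytes)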

- encryption is added after the compression. It is done in a similar
  way as the compression: restart the encryption at every file boundary.

  To be able to find the start of a new encryption block, we need a
  magic marker. Most probably the data needs to be escaped so it cannot
  contain the magic marker (one possible scheme is sketched below).
  [ Look at how this is done in gzip or dar ]

  To increase security against known-plaintext attacks, add a small,
  random amount of padding with random data.

  [Check this concept against other common encryption attack vectors.]
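
  One possible escaping scheme is classic byte stuffing; the marker
  value and escape rule below are assumptions, not a decided format.
  Every payload byte equal to the marker's first byte gets a 0x00
  appended, so the real marker (whose second byte is 0x9a, never 0x00)
  cannot occur inside escaped data:

      MARKER = b'\xf0\x9a\xbe\xef'   # hypothetical block start marker

      def escape(data):
          # 0xf0 in the payload becomes 0xf0 0x00
          return data.replace(b'\xf0', b'\xf0\x00')

      def unescape(data):
          return data.replace(b'\xf0\x00', b'\xf0')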

- provide small cmdline tools just for stream decompression / decryption.
  This is useful for emergency recovery of corrupted archives;
  see the sketch below.
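
  A sketch of such a recovery pass for the gzip layer: scan a damaged
  volume for gzip magic bytes and try to inflate a stream at every
  match (whole-file read kept for brevity; a real tool should stream):

      import sys
      import zlib

      GZIP_MAGIC = b'\x1f\x8b\x08'   # gzip header, deflate method

      def recover_streams(data):
          """Yield the payload of every gzip stream that still inflates."""
          pos = 0
          while True:
              pos = data.find(GZIP_MAGIC, pos)
              if pos < 0:
                  break
              # 16 + MAX_WBITS makes zlib expect a gzip wrapper
              decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
              try:
                  yield decomp.decompress(data[pos:])
              except zlib.error:
                  pass       # damaged stream, try the next marker
              pos += 1

      if __name__ == '__main__':
          with open(sys.argv[1], 'rb') as volume:
              for payload in recover_streams(volume.read()):
                  sys.stdout.buffer.write(payload)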

Minor ideas:
- develop unit tests for most functionality
- pylint should be mostly quiet, at least for sensible message types
- designed as a library to be used by our real backup script.
  Should be as generic as possible so it's useful
  for other people later on.
- all filenames are UTF-8

Similar projects (might give additional ideas):
- duplicity (http://duplicity.nongnu.org/)
  - Pro:
    - written in python

  - Contra:
    - no differential backup, only incremental (needs all previous backup files)
    - creates diffs of files (rsync based)
      [Both of these design decisions weaken the rule that we should be able
      to recover as much as possible from broken machines.]
    - plans to move away from tar, which makes long-term maintenance harder

    - uses bzr. This is personal taste, I just prefer git :)

- dar / libdar (http://dar.linux.free.fr/)
  - Pro:
    - supports most of our requirements
    - code has matured for ten years

  - Neutral:
    - written in C++ (code looks a bit C-ish though)
    - way more features than we need

  - Contra:
    - huge memory requirements: it keeps an index of all
      files in memory. Each file occupies at least 850 bytes
      + the path. We aborted our test with 30,000,000 files
      when the program had already slurped up 19 GB of RAM.

      Will probably break our entry-level boxes / old machines
      without enough RAM and swap. Fixing this will probably
      require a rewrite of a lot of core components.

    - no unit tests / C-ish code style / French class and variable names
    - not using the tar format / custom format
    - state of the python binding is unknown. Probably working, no unit tests.

Related reads:
- http://dar.linux.free.fr/doc/Notes.html
- http://duplicity.nongnu.org/new_format.html

Security ideas (TO BE DECIDED ON, STILL VAGUE IDEAS):
- Optionally: prevent directory traversal attacks on restore
- forbid restoring programs with setuid/setgid bits

Misc ideas:
- benchmark bz2 / lzo compression