Obnam should do some processing in the background, for example uploading data to the backup repository. This would allow better use of the bottleneck resource (the network). Below is a journal entry with my thoughts on how to implement that. It may be out of date by now, but we'll see. I have a Python module that simplifies using multiprocessing to run jobs in the background (which avoids the Python global interpreter lock, in case that matters). --liw


Here's a design for Obnam concurrency that came to me the other day while walking.

The core of Obnam (and larch) is quite synchronous: read data from file, read B-tree nodes, push chunks and B-tree nodes into repository. Some of that can be parallelized, but not easily: it's already tricky code, and making it even more tricky is going to require very strong justification.

Things like encrypting and decrypting files need to be done in parallel with other things, for speed. These things are not really in the core, and indeed are provided by plugins.

So here's a way to run them in parallel (a minimal sketch in Python follows the list):

  • the core code stays synchronous, the way it is now
  • whenever larch code needs to read a B-tree node, it blocks until it gets it
  • the node is read, synchronously, from wherever, and put into a background processing queue (using Python's multiprocessing module)
  • the code that waits for the node to be processed polls the queue, handles any other background jobs that happen to finish while it waits, and returns the desired node when it gets it
  • when larch writes a node (after it gets pushed out of the upload queue inside larch), it is put into a background processing queue
  • at the same time, if there were any finished background jobs, they're handled (written to repo)
  • at the end of the run, the main loop makes sure any pending background jobs finish and are handled
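Here's a minimal sketch of that pattern; process_node, wait_for_node and the queue layout are placeholders of mine, not Obnam code:

import multiprocessing

def process_node(data):
    # Stand-in for the real per-node work (encryption, compression, ...).
    return data

def worker(jobs, results):
    # Background process: take (node_id, data) jobs until the None sentinel.
    for node_id, data in iter(jobs.get, None):
        results.put((node_id, process_node(data)))

def wait_for_node(node_id, results, finished):
    # Poll the results queue; handle other jobs that finish while waiting.
    while node_id not in finished:
        done_id, data = results.get()  # blocks until some job is done
        finished[done_id] = data
    return finished.pop(node_id)

if __name__ == '__main__':
    jobs = multiprocessing.Queue()
    results = multiprocessing.Queue()
    proc = multiprocessing.Process(target=worker, args=(jobs, results))
    proc.start()
    jobs.put(('node-1', b'node data'))
    data = wait_for_node('node-1', results, {})
    jobs.put(None)  # tell the worker to stop
    proc.join()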

There's a complication that the B-tree code may need a node that is not yet written to the repository, since it is still going through a background processing queue.

I'm going to need to restructure how hooks process files that are written to or read from the repository. Writing should happen asynchronously: files are put in a queue, processed in the background, and then written to the actual repository when background processing is finished. Reading needs to happen synchronously, since there's a B-tree call waiting for the data; but to handle the case of needing a node that is still being processed in the background, we need to keep track of which nodes are in the background, and wait for them to be done before reading them.

Reading would thus be something like this, implemented in the Repository class (the attribute and helper names are illustrative; run_read_filters stands in for whatever applies the read hooks):

while pathname in self.pending:                  # still in the write queue?
    self.handle_background_results(block=True)   # process a write queue result

data = self.fs.cat(pathname)                     # read file from repository
data = self.run_read_filters(data)               # process file data through hooks
return data

The write queue is more complicated (again handled somehow in the Repository class; a sketch of the bookkeeping follows the list):

  • a multiprocessing.Queue instance for holding pending jobs
    • a job is a (pathname, file contents) pair
  • another Queue instance for holding unhandled results
    • (pathname, file contents) pair, where the contents may have changed
  • a set for holding file identifiers (paths) that have been put into the pending jobs queue, but not yet processed from the results queue
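In code, that bookkeeping is just this (the variable names are mine):

import multiprocessing

jobs = multiprocessing.Queue()     # pending jobs: (pathname, file contents)
results = multiprocessing.Queue()  # unhandled results: (pathname, filtered contents)
pending = set()                    # pathnames queued but not yet handled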

Each plugin can provide one or more Unix commands (filters) through which the file contents get piped. The background processes run each filter in turn, feeding the output of the previous one as input to the next (see the sketch below).
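For instance, the background worker could chain the filter commands with subprocess, something like this; the gzip filter is only an example:

import subprocess

def run_filters(filters, contents):
    # Pipe the file contents through each filter command in turn.
    for argv in filters:
        p = subprocess.Popen(argv, stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE)
        contents, _ = p.communicate(contents)
    return contents

filtered = run_filters([['gzip', '-9']], b'some file contents')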

To handle a result from a background job, the following needs to be done:

  • remove the pathname from the set
  • write the filtered file contents into the repository

To implement this, I'll do the following (a rough sketch comes after the list):

  • All changes should be in HookedFS
  • write_file and overwrite_file put things into the pending jobs queue, and also call a new method handle_background_results
  • cat gets changed to wait for files in the write queue, calling handle_background_results
  • handle_background_results will do what is needed
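Here's what that HookedFS could look like; everything beyond the method names listed above is guesswork, and overwrite_file would mirror write_file:

import multiprocessing

class HookedFS(object):

    def __init__(self, fs):
        self.fs = fs                             # the underlying VFS
        self.jobs = multiprocessing.Queue()      # files awaiting filtering
        self.results = multiprocessing.Queue()   # filtered files awaiting writing
        self.pending = set()                     # pathnames in flight

    def write_file(self, pathname, contents):
        # Queue the file for background filtering instead of writing it now.
        self.pending.add(pathname)
        self.jobs.put((pathname, contents))
        self.handle_background_results(block=False)

    def cat(self, pathname):
        # Wait until any in-flight job for this file has been handled, then
        # read it back (the real method would also run the read filters).
        while pathname in self.pending:
            self.handle_background_results(block=True)
        return self.fs.cat(pathname)

    def handle_background_results(self, block=False):
        # Handle finished jobs; if block is True, wait for at least one.
        # (Queue.empty is only approximate, which is fine for a sketch.)
        while block or not self.results.empty():
            pathname, contents = self.results.get()
            self.pending.discard(pathname)
            self.fs.write_file(pathname, contents)  # filtered data to repo
            block = False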

This design isn't optimal, since writing things to the repository isn't being done in parallel with other things, but I'll tackle that problem later.

done; this clearly isn't happening, so closing the old wishlist bug. --liw