I wrote an algorithm

Last Friday night I put my second Python package on PyPI: parallel-queue. It’s something I wrote for work, but I very happily put in a bunch of my free time to tidy it up and open-source it, which I wouldn’t do for most of my day-job work even where open-sourcing would make sense. So what is different about this particular project?

At the heart of that package is a single algorithm, and I’m proud enough of the algorithm that I wanted to show it to the world. And thinking about that has made me realise something a bit surprising about programming as a day job: generally speaking, it doesn’t involve inventing algorithms.

For that statement not to be trivially false I’ll have to qualify a bit what I mean by “algorithm”: in a sense every computer program (the constant output of my day job) is an algorithm. But programmers typically (and informally) mean something more specific by the term: an abstract program (not too closely tied to a single programming language, system architecture, etc.) for solving a general problem (not too closely tied to a single problem domain). These are vague distinctions, but I think they are still relevant for how programmers think about their work.¹ There are algorithms for sorting, and for page layout; but the code I wrote that sorts a list of charts uses a sorting algorithm,² while the code laying out charts on a page implements a particular algorithm but has all sorts of specific details baked into it.

So if you accept this vague pseudo-definition, an algorithm is a solution to a general problem. The thing is, a programmer is only rarely confronted with a general problem that needs solving. This is for two reasons: almost everything a programmer has to write solves a specific problem (with the exception of those lucky souls whose job is to write general utility libraries), and while many of these specific problems are instances of general problems (e.g. “sort this list of charts” is an instance of the general problem “sort a list of items”), usually the general problem is already well-known and doesn’t have to be solved again. Most day-to-day programming is either plumbing specific details through an existing general solution (“just use library xyz”), or bolting together specific details in one-off ways that are not instances of any general problem (also known as “business logic”: “if the chart doesn’t have a title specified, take it from the name of the first data line”).

This one time, though, my specific problem was an instance of a general problem that’s a bit more off the beaten track: “consume a stream of task specifications in sequential order; perform the tasks in parallel; but only report completion of any given task when all earlier tasks have already reported completion”.³ I’m sure this general problem has been solved before (in particular, it looks a lot like TCP reassembling packets into their original order after transmission over a network), but it doesn’t have a catchy name that I know of,⁴ so after a brief look at general task queue frameworks I sat down and came up with an algorithm.
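To make the problem statement concrete, here is a minimal Python sketch of the "work in parallel, report in order" idea. It is not the parallel-queue code and does not reflect its interface: the names `run_in_order`, `worker` and `slow_square` are made up for illustration, it uses threads from the standard library, and it submits every task up front instead of working with bounded buffers on an endless stream.

```python
import heapq
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_order(tasks, worker, max_workers=4):
    """Run worker(task) in parallel, yielding results in input order.

    Hypothetical helper: a result is held back until every earlier
    task has been reported.
    """
    finished_early = []  # min-heap of (sequence number, result)
    next_to_report = 0   # sequence number we are allowed to report next
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Simplification: all tasks are submitted at once (unbounded).
        futures = {pool.submit(worker, task): seq
                   for seq, task in enumerate(tasks)}
        for future in as_completed(futures):
            heapq.heappush(finished_early, (futures[future], future.result()))
            # Release the longest run of consecutive finished tasks.
            while finished_early and finished_early[0][0] == next_to_report:
                yield heapq.heappop(finished_early)[1]
                next_to_report += 1


def slow_square(n):
    """Demo task that finishes after a random short delay."""
    time.sleep(random.random() * 0.01)
    return n * n
```

Calling `list(run_in_order(range(10), slow_square))` gives the squares in input order, however the individual tasks happen to be scheduled. The interesting complications only start once the input is a stream and the buffers have to stay bounded.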

Turning it into working code was an interesting experience too. You won’t find an “algorithmic core” with layers of code around it implementing the various complications I list: the complications reach into the very heart of the algorithm. (The adjustment to prevent packet ids from increasing without limit gave me especially serious grief, and would have stymied the whole project if not for Boaz’s observation that once the sizes of all the internal buffers are specified, there is an absolute limit on the number of packets that can be inside the system at any given moment. I was trying to do the same kind of arithmetic but without that global perspective and, unsurprisingly in retrospect, not making any progress.)
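One way to see why a hard limit on in-flight packets helps: if at most `capacity` packets can be inside the system at once, ids can be reused modulo `capacity` without ever being ambiguous. The sketch below illustrates that reasoning only. It is an assumption about how such a scheme could look, not the arithmetic parallel-queue actually uses, and `RingReorderBuffer`, `admit` and `complete` are hypothetical names.

```python
_EMPTY = object()  # marker for a slot with no finished result


class RingReorderBuffer:
    """Reorder buffer whose packet ids wrap around at `capacity`."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [_EMPTY] * capacity
        self.next_id = 0    # wrapped id handed to the next packet
        self.next_out = 0   # wrapped id that must be released next
        self.in_flight = 0  # packets admitted but not yet released

    def admit(self):
        """Hand out an id for a new packet, refusing when full."""
        if self.in_flight == self.capacity:
            raise RuntimeError("buffer full: wait for a release first")
        packet_id = self.next_id
        self.next_id = (self.next_id + 1) % self.capacity
        self.in_flight += 1
        return packet_id

    def complete(self, packet_id, result):
        """Record a finished packet; return results now releasable in order."""
        self.slots[packet_id] = result
        released = []
        while self.in_flight and self.slots[self.next_out] is not _EMPTY:
            released.append(self.slots[self.next_out])
            self.slots[self.next_out] = _EMPTY
            self.next_out = (self.next_out + 1) % self.capacity
            self.in_flight -= 1
        return released
```

Because `admit` refuses to hand out an id while `capacity` packets are in flight, a wrapped id can never collide with one that is still waiting to be released.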

There’s still a bit of tidying to be done on the package (I need to add some insight into the state of the internal buffers so we can monitor for blockages, and properly documenting the interface will be a good opportunity to get comfortable with Python’s documentation tools) but already it’s one of the most satisfying pieces of work I’ve done in the last six months or so. Shame more of my day job isn’t like that, really.

Notes:

¹ I’m sure of that at least for the sample of programmers consisting of myself.
² In fact, one that is already implemented in a standard system utility.
³ The specific task is: parallelise the indexing of incoming documents to our search server, while keeping a log of indexing requests in their original sequential order, recording each request only once it has completed successfully; we can use that log for backups. I’m indebted to Boaz both for the general formulation and for allowing me to open-source some of the company’s code under my own name.
⁴ Incidentally, this is the Achilles heel of “everything is on the internet these days”: you have to know how to name it to be able to look it up.
