Curve Logic

There is a principle that I slipped into a QCon presentation on Starling Bank a couple of years ago and that I’ve never really spoken about since.

Martin and Megan’s recent re:Invent talk reminded me of it and I decided it might be worth exploring in a little more detail because it has been useful, if not universally popular. And although I formulated it, others have policed it, enforced it and ensured it has been as successful as it has.

“cherish your dirty data”

What did I mean?

Well, it’s really about how we test the systems we develop.

These days we generally think we’re pretty good at testing software.

We value orthogonality and reproducibility because they increase the value of our test failures while limiting the cost of those failures.

With infrastructure as code, entire environments can be orchestrated into and out of existence. It becomes reasonable to create fresh environments for tests or test runs, ensuring orthogonality and reproducibility more easily and at a larger scale than in the past.

We are also proficient at generating test data. Increasingly, we use randomisation (or fuzzing or mutation) to supplement the meagre resources of our human imaginations in finding the corner cases we need to break our code.

Still, there remains ample scope for humility.

Bugs are a fact of life. Systems fail in unexpected ways at inconvenient times and, despite lots of best practice, lots of deep thinking and lots of effort, we’re not there yet.

We should not be above making use of every little blessing that fortune sends our way.

So.

Look at your most heavily and widely used non-production environment.

Never, ever, ever, “reset” the data in it.

Let it live and let it flourish.

Or, more accurately, let it fester and then occasionally mash it around a tiny bit and only when you need to.

The processes, hacks, workarounds, experiments, and sheer idiocy that this environment has been subjected to over time make it such a fertile source of corner cases, logic bombs and bottlenecks, that it is truly worth cherishing.

It is actually hard to do this. On the one hand you’re asking people to take failures in this environment seriously and investigate and fix them. On the other you’re accepting that there might be data in there that should theoretically never be possible in production because it results from code or processes that have never been in production. This is tough. There will be calls for a clean-up.

OK - so you have to be pragmatic. Sometimes you’ll disregard or ignore a failure. Sometimes you’ll resort to manual data fixing to patch up something. But just patch it up - don’t clean or recreate the environment’s whole data set.

Every time this happens you must at least run the thought experiment: what happens if my code actually did this in production? Is it right to assume it won’t? Should my code tolerate it? Or even fix it on the fly?

In many cases this won’t merely be a thought experiment - you’ll be motivated to prove or ensure that there aren’t any instances of the problem in production or you’ll change the data model or constraints or fix loopholes in the code to make it impossible.

The genetic diversity of the dirty data that you accumulate in this environment is valuable because generated data (however sophisticated) and production dumps are no substitute for it.

Why?

Randomisation is probably our most powerful tool for testing universal assertions, but when it comes to generating data for test environments the randomisation is always very constrained.

For instance, we frequently populate names at random from name lists rather than stuffing them full of emojis and Klingon.
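To make the contrast concrete, here is a minimal sketch of the two styles of name generation. Everything in it is hypothetical and for illustration only: the name lists, the `tidy_name` and `dirty_name` functions and the `awkward_ratio` parameter are my own assumptions, not anything taken from a real generator.

```python
import random

# Hypothetical name list of the kind a test-data generator might draw from.
TIDY_NAMES = ["Alice Smith", "Bob Jones", "Carol White"]

# Hypothetical awkward values of the kind a long-lived environment accumulates.
AWKWARD_NAMES = [
    "",                      # empty string
    "   ",                   # whitespace only
    "Testy McTestface",      # the ubiquitous duplicate
    "\U0001F600\U0001F4A9",  # emoji only
    "O'Brien-M\u00fcller",   # apostrophe, hyphen and a non-ASCII letter
    "x" * 300,               # far longer than any sensible column
]


def tidy_name(rng):
    """The constrained approach: always a well-behaved name from the list."""
    return rng.choice(TIDY_NAMES)


def dirty_name(rng, awkward_ratio=0.2):
    """Mostly tidy names, with a proportion of awkward ones mixed in."""
    if rng.random() < awkward_ratio:
        return rng.choice(AWKWARD_NAMES)
    return rng.choice(TIDY_NAMES)


# A seeded generator keeps the sample reproducible from run to run.
rng = random.Random(42)
sample = [dirty_name(rng) for _ in range(1000)]
```

Even the second version only produces the awkward cases somebody thought to write down, which is rather the point: an old environment contains the ones nobody thought of.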

We set data in different services so that references across databases (where no foreign keys are possible) always match up. Almost always we’re creating data to be valid according to the intended data model which is tighter than the data model enforced by code or database.
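As a rough illustration of what generated data tends to rule out, the sketch below looks for cross-service references that no longer match up. The two services, the record shapes and the `find_orphans` helper are all hypothetical, assumed purely for the example.

```python
# Hypothetical snapshot of a customer service, keyed by customer id.
customers = {"c1": {"name": "Alice"}, "c2": {"name": "Bob"}}

# Hypothetical snapshot of an account service that refers to customers by id.
# No foreign key spans the two databases, so nothing enforces the match.
accounts = [
    {"id": "a1", "customer_id": "c1"},
    {"id": "a2", "customer_id": "c2"},
    {"id": "a3", "customer_id": "c9"},  # customer no longer exists
    {"id": "a4", "customer_id": None},  # reference was never set
]


def find_orphans(accounts, customers):
    """Return ids of accounts whose customer_id matches no known customer."""
    return [a["id"] for a in accounts if a.get("customer_id") not in customers]


orphans = find_orphans(accounts, customers)
print(orphans)  # ['a3', 'a4']
```

A generator that builds both data sets together would never emit `a3` or `a4`. An environment that has lived through years of migrations and manual fixes probably already contains a few.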

Sometimes we reuse layers of the software under test as part of the data generation which actually makes it impossible to step outside the notion of validity that is enforced in the software.

On the other hand, old environments may contain data that is not valid today but was valid in the past because the notions of validity have changed over time. Or data that is simply invalid because of the manual tinkering that has happened over time. Even the data put there by real people is very likely quite erratic in places (so plenty of emojis) and yet oddly homogeneous in others (like several thousand “Testy McTestface” users).

The net effect is that dirty old data can be awesome at finding bugs.

Production data dumps also lack some of the characteristics that make dirty test environments so useful. While real users out in the wild are very effective at finding problems that you won’t find in test environments, production environments have usually been protected and nurtured in a way that test environments have not. Furthermore, production dumps often have to be cleansed and anonymised before they can be used in testing. All of this means you are likely to turn up things in test environments which you won’t get from production dumps.

I’m not saying that old test environments are superior to generated data or production dumps, just that they provide an added dimension that neither of those sources offers.

Think twice before resetting them because you may be flushing away problems that you really ought to be confronting.