DevOps and Build-Run Teams

DevOps is the “you build it, you run it” school of software delivery.

It collapses the distinction between a “dev” team responsible for writing code and an “ops” team responsible for running it and empowers one team to take responsibility for their product all the way through development, deployment and operation.

But DevOps is a slippery word. So for now, let’s just pretend it doesn’t also mean expertise in CI/CD technologies or Kubernetes or Chef or Puppet or other infrastructure as code techniques or cloud automation or HashiCorp tooling… although these are obviously related. It is often clearer to talk about Build-Run teams.

By collapsing these teams together you improve:

  • efficiency (from reducing the hand-offs)
  • flexibility (through reducing inter-team dependencies)
  • responsiveness (to incidents and emergencies)
  • quality (through better accountability and autonomy and ownership)
  • precision (through superior feedback from production to development).

You can do the same with dev and QA, or dev and security. (DevTestOps, DevSecOps, DevTestSecOps, …)

Much has been said on this.

Unfortunately many forward-thinking people hit roadblocks when trying to move their existing organisations to build-run teams. And many forward-thinking organisations hit roadblocks when they have external stakeholders like auditors and regulators holding their practices to account.

It can be done. In the UK, modern tech-led challenger banks like Starling and Monzo achieved great success by embedding build-run cultures in spite of the heavy regulation in the industry.

It’s All About Responsibility

Ultimately you should not expect people to take responsibility where they aren’t given responsibility.

Motivation comes from a cluster of related concepts around ownership, autonomy, and accountability, all of which form part of what it means to be responsible.

Giving responsibility is the secret sauce of the DevOps recipe. But it is the way that DevOps reassigns roles and responsibilities that also makes it hard to square with risk management.

The voice of risk management might be your auditor (internal or external), your regulator, your risk department, your security department, consultants doing due diligence or filling in assurance questionnaires, or QSAs evaluating compliance. To keep things simple, let’s call them Auden.

Auden places a great deal of emphasis on the principle of Segregation of Responsibility (SoR). And rightly so. Poor segregation of responsibility has been implicated in many very high profile bad things. SoR attempts to mitigate the risk of error or fraud by ensuring that no one person has enough responsibility to commit the error or fraud.

Responsibilities that, taken together, assign one person enough responsibility to commit error or fraud without detection are sometimes called conflicting responsibilities. Although really they are complementary responsibilities. A much better description that is sometimes used is toxic combination.

SoR in corporate IT is traditionally manifested as a separation between dev and ops. Operations own production and checkpoint the transition of code from development environments into production by formalising an operational acceptance process which may involve a test phase. This is of course the precise opposite of DevOps.

So there is a clear tension between a DevOps philosophy of delivery and traditional IT risk management. That tension is not a mirage. If you are working in a regulated industry or your software delivery processes are subject to audit in any form, you must navigate a sensible course through these issues.

The crux of the issue is:

SoR makes it hard to do bad things by making it hard to do anything.

If you point this out to Auden, she may respectfully assert that it is not her problem. She’s absolutely right. Your Auden might be more helpful than this, but don’t count on it. It’s really not her job to design risk controls. That would be a terrible segregation of responsibilities.

If you think that there is no conflict, or that automation solves everything, or that buying Auden a copy of Continuous Delivery will do the trick, go away and come back for more thoughts when you hit the wall.

Don’t Be Ridiculous

First up, to deny any of the following is absurd:

  • Trust needs limits
  • Not everybody is trustworthy
  • Everybody is subject to honest mistakes
  • Even honest people may be compromised or coerced
  • Coworkers are more likely to collude
  • Coworkers are more likely to “rubber-stamp” one another’s work
  • People behave differently under observation
  • SoR is generally effective

If you are trying to implement Build-Run teams in a traditional organisation or a highly regulated context, do not start by making yourself ridiculous.

In particular, beware of absolutes. The fact that there have been high-profile incidents and frauds that SoR has failed to stop does not mean that SoR “doesn’t work” (a dangerous line to take with Auden). It means only that SoR is imperfect. Auden would not disagree with that. In general, Auden is capable of a great deal more subtlety than software engineers seem to assume.

Auden recognises that the name of the game is mitigating risk down to an acceptable level. And that level can only be defined relative to circumstances and the organisation.

Auden also recognises that the controls you need when you are a company of twenty people who sit next to each other and all know each other’s names are not necessarily the controls that you’ll need when you are a company of twenty thousand spread across several offices and continents.

Ultimately you need to define your tolerance, or your appetite, for risk. And then decide what controls you will use to manage the risk within that tolerance. And because strong SoR is an imperfect control that can be costly and harmful, it’s worth reviewing other imperfect controls, some weaker and some stronger, and working out which should be applied where to deliver a DevOps culture that is safe and effective. And you might just come up with something that Auden is happy with.

A Quick Survey of Alternative Controls

Strong SoR assigns roles to parties based on their affiliations to a department or their job descriptions or reporting lines. So individuals from Dev write code and individuals from QA test code and individuals from Ops run the code. It is a strong control: the parties may not even know who they would need to collude with to subvert the control. And incompetence would need to be reflected in two or three different places (actually more common than it sounds) to allow an error to pass through.

However it is an extremely heavyweight control. It makes it hard to do bad things by making it hard to do anything. It is also imperfect: it has a tendency to become a diffusion of responsibility and is prone to rubber-stamping. For all the bad things that have occurred without SoR, many occur on SoR’s watch.

There are other variations on SoR which are not as strong. A weaker control is to relax the requirement that the roles are strongly separated and instead segregate assumed roles (SoAR). Individuals collaborating on a change assume different temporary roles for the duration of the interaction. Peer review is a common example of this: one developer writes code and another reviews it. These might be individuals in the same team or in a parallel team.
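
To make the shape of this control concrete, here is a minimal sketch in Python of the check it boils down to, assuming a hypothetical pull-request record with an author and a list of approvers (the field names are illustrative, not those of any particular platform):

    from dataclasses import dataclass, field

    @dataclass
    class PullRequest:
        author: str
        approvers: list[str] = field(default_factory=list)

    def satisfies_peer_review(pr: PullRequest) -> bool:
        # The change must be approved by at least one person who is not its author.
        independent = [a for a in pr.approvers if a != pr.author]
        return len(independent) >= 1

    assert satisfies_peer_review(PullRequest("alice", approvers=["bob"]))
    assert not satisfies_peer_review(PullRequest("alice", approvers=["alice"]))

Hosted platforms such as GitHub and GitLab offer this as built-in branch protection; writing it out just makes plain that the control amounts to “author and approver must be different people”, nothing more.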

A further attenuation relaxes the separation of roles itself but demands that two parties collaborate. Like SoR and SoAR, these are four-eyes controls, but in this case the roles are not clearly distinguished. Pair programming is an instance of this: each programmer takes the keyboard in turn, swapping from time to time. This is weaker than a segregation of responsibilities: the parties are known and close to each other, and therefore more liable to collude in wrongdoing.

Automation is the jewel in the DevOps crown. By replacing humans with automation we can, in principle, eradicate fraud and, over time, make errors arbitrarily rare. But it is not a free lunch. As well as questions over the effectiveness and reliability of the automation, all the same questions arise around the development and operation of the automation itself. Who can change the automation to make it do bad things? Can we be assured that the automation is the only means of delivery and cannot be circumvented? If so, what provisions do we have in case the automation fails?
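
One way to make “the automation is the only means of delivery” checkable rather than aspirational is a provenance check at deploy time. The sketch below is a hypothetical illustration: it assumes the pipeline publishes the digest of every artifact it builds to some registry, here faked as an in-memory set.

    import hashlib

    # Hypothetical: digests published by the CI pipeline for artifacts it has built.
    PIPELINE_BUILT_DIGESTS: set[str] = set()

    def artifact_digest(path: str) -> str:
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def deploy(path: str) -> None:
        digest = artifact_digest(path)
        if digest not in PIPELINE_BUILT_DIGESTS:
            raise PermissionError(f"{path} was not built by the pipeline; refusing to deploy")
        # ... hand off to the real deployment mechanism here ...

Of course this only relocates the question: who can write to the digest registry, and who can change this check? The control is only as strong as the controls around it.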

Automated enforcement of access privileges is essential these days, and the questions of how privileges can be changed and temporarily escalated, and by whom, impinge on all the other controls. This is one area where strong SoR reigns supreme, although role assumption and escalation under certain conditions are certainly amenable to automation.
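
To make “temporary escalation” concrete, here is a sketch of a time-boxed break-glass grant. Everything in it (the in-memory grant store, the second-person approval, the thirty-minute default) is an assumption for illustration; in practice this logic lives in your IAM tooling rather than in application code.

    from datetime import datetime, timedelta, timezone

    GRANTS: dict[str, datetime] = {}  # user -> expiry time of elevated access

    def grant_elevated_access(user: str, approver: str, minutes: int = 30) -> None:
        if approver == user:
            raise PermissionError("escalation must be approved by someone other than the requester")
        GRANTS[user] = datetime.now(timezone.utc) + timedelta(minutes=minutes)
        # In a real system: record the grant somewhere tamper-evident and alert on it.

    def has_elevated_access(user: str) -> bool:
        expiry = GRANTS.get(user)
        return expiry is not None and datetime.now(timezone.utc) < expiry

The interesting property is not the code but the shape of the control: the grant needs a second person, it expires on its own, and it leaves a record for a detective control to watch.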

These are all preventative controls. There are also detective controls and their counterparts, corrective controls.

The strongest corrective control is reversibility, and insofar as you can do things reversibly (rollback, ledgered changes), you can move very fast indeed. You can do anything you can undo! You still need to worry about the period prior to reversal and the reliability of the reversal itself.
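
As a minimal sketch of what “ledgered, reversible changes” can mean, here is a toy configuration store where every change records enough to undo it and undo is a first-class operation (the structure is illustrative only):

    config = {}   # current configuration values
    ledger = []   # list of (key, previous value) entries, most recent last

    def apply_change(key, value):
        ledger.append((key, config.get(key)))  # remember what we are overwriting
        config[key] = value

    def undo_last_change():
        key, previous = ledger.pop()
        if previous is None:
            config.pop(key, None)
        else:
            config[key] = previous

    apply_change("feature_x", "on")
    undo_last_change()
    assert "feature_x" not in config

The two residual worries mentioned above show up directly here: the window between apply_change and undo_last_change, and whether the undo itself is reliable (what if the ledger entry was never written?).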

Another option is the kill switch, which accepts the certainty of a bad outcome over the possibility of a catastrophic one. Kill switches are unattractive, and for that reason rarely tested well enough. It is all too likely that in extremis the switch cannot be found or doesn’t actually work. Both risks need considering.
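
Mechanically, a kill switch can be as simple as a flag checked on every sensitive code path. The sketch below is hypothetical (the flag path and the payment example are illustrative): the action is enabled only while the flag file exists, so engaging the switch means removing the file, and an unreadable filesystem fails safe to “off”.

    import os

    ENABLE_FLAG = "/etc/myapp/payments_enabled"  # hypothetical path; presence of the file enables the action

    def payments_enabled() -> bool:
        # os.path.exists returns False on any error, so failure to read fails safe.
        return os.path.exists(ENABLE_FLAG)

    def submit_payment(payment: dict) -> None:
        if not payments_enabled():
            raise RuntimeError("kill switch engaged: refusing to submit payment")
        # ... hand the payment to the real processing path ...

A periodic drill that actually engages the switch in a controlled window is what turns this from a comfort blanket into a control.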

On the detective side, high visibility is actually a very effective risk mitigation. In many cases it can be judged acceptable to allow anyone the capability to perform a sensitive action if, when they do, it is signalled in real time in the busiest Slack channels or on monitors around the office. Factors to be aware of here are: a) not every highly visible forum is actually busy all the time - how many people are watching your #general Slack channel at 3am? b) too many automated alerts in Slack channels reduce the signal-to-noise ratio and erode the quality of attention that people are paying. For sensitive controls it might become necessary to have somebody whose job it is to actually review these alerts, which takes you most of the way back to building that review into the process itself.
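
The mechanics of the visibility control are trivial; the judgement in the paragraph above is the hard part. For completeness, a hypothetical sketch, assuming a generic incoming-webhook URL of the kind Slack and similar tools provide (the URL and function names are placeholders):

    import json
    import urllib.request

    WEBHOOK_URL = "https://example.invalid/webhook"  # placeholder, not a real endpoint

    def announce(action: str, actor: str) -> None:
        payload = json.dumps({"text": f"{actor} performed {action} in production"})
        request = urllib.request.Request(
            WEBHOOK_URL,
            data=payload.encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(request)  # real code would handle failure here

    def delete_customer_records(actor: str) -> None:
        announce("delete_customer_records", actor)
        # ... perform the sensitive action ...

One design decision worth making explicitly: does a failed announcement block the action (turning a detective control into a partly preventative one), or not?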

There is a whole other aspect to risk management around governance and supervision.

Start With the Risk

So there is actually a wide array of controls we can use to manage risk, and there are even more ways we can combine them in pipelines and processes. How do we decide what controls we need? The most important thing is: always start with the risk.

Start with the risk (and your risk tolerance) and find the minimum combination of risk controls to manage the risk down to within your tolerance. (The minimum, because every control has a cost.)

Use the same principle when explaining your controls to Auden. If she starts by asking how you implement SoR, you are entitled to ask what risks she is looking to control and to present equivalent controls that manage those risks. It’s not unreasonable for Auden to make a few assumptions when she sees the same approach day in and day out. By bringing everything back to the risk and presenting the associated controls, you are most likely to get insightful analysis from Auden.