Silicon Glen: February 2015

The modelling of cause and effect is well established in various fields of science. An event (the cause) results in an effect, and the link between the two rests on theory or fact to a degree that depends on our understanding. Working backwards, an observed effect can form the basis of speculative theories, more concrete theories or lines of enquiry to establish the likely cause or causes. This is true across a range of fields, from astronomy and detective work to the trialling of new drugs and political debate.

As a project manager and programme manager I am regularly involved in risk and issue management. This means looking at the causes of issues, at risk mitigation tactics, and at the strategies and actions for dealing with them: preventing risks from materialising, reducing the effects of risks if they do materialise and, for issues, taking the actions necessary to resolve or reduce an issue that has already happened.
There is an analogy between general cause and effect analysis and the type of analysis used in project management risk and issue management.

I looked for a framework to describe the end-to-end analysis of which cause and effect forms part and was unable to find one. I therefore put this method forward as a model for discussion, in the hope that it will help others, and will illustrate it with a couple of examples.
I will use the following 5 step approach as a framework:

Root Cause – Cause – Effect – Fix – Resolution

Root cause – origin of the problem
Cause – the immediate reason the effect occurred
Effect – the resultant problem or behaviour arising from the cause
Fix – a short term fix intended to mitigate or resolve the issue arising from the effect
Resolution – a longer term fix which represents a stable long term solution.

An effect is observed, e.g. a website is not performing.
The causes are investigated using monitoring tools, website logs, server reporting etc.
The causes might then be clarified in terms of whether the problem affects certain customers, certain types of transactions or patterns of usage; when the problem was first noticed; whether there were recent changes in the hardware or software configuration that might suggest a line of investigation; whether there is internal or external hacking activity such as a denial of service attack; and so on.

In parallel with investigating the causes, the overall effect on service level agreements and on the business is studied to ascertain the true impact. One server going out of service will have less effect on the overall system if it is part of a cluster of servers providing the same capability, or if a DR site can be hot swapped in, than if the server represents a single point of failure or there is insufficient capacity elsewhere in the system to take up the slack. So investigating the cause and mitigating the effect normally proceed as two parallel tracks.
An immediate fix may be put in place, such as spinning up another server or swapping in another disk while the problem is fully investigated. This fixes the problem in the short term and possibly the long term too.
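The difference between a clustered server and a single point of failure can be sketched as a simple capacity check. This is only an illustration: the function name `survives_loss` and the capacity and load figures are assumptions made up for the example, not measurements from a real system.

```python
def survives_loss(capacities, load, failed_index):
    """Return True if the remaining servers can still carry the load.

    capacities   -- hypothetical capacity of each server (e.g. requests/s)
    load         -- hypothetical total load on the system, same unit
    failed_index -- position of the server that has gone out of service
    """
    remaining = [c for i, c in enumerate(capacities) if i != failed_index]
    return sum(remaining) >= load


# Three servers of 100 requests/s each, carrying 150 requests/s in total:
# losing one still leaves 200 requests/s of capacity.
print(survives_loss([100, 100, 100], load=150, failed_index=0))  # True

# A single server is a single point of failure: nothing is left to take up the slack.
print(survives_loss([100], load=50, failed_index=0))  # False
```

A check like this says nothing about the cause; it only indicates how urgent the immediate fix is while the investigation continues.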

When the cause is properly established, the fix can be validated to ensure it addresses the cause. A different fix may be necessary once the cause is understood. In the example, the cause may be a disk failure, and it may be a one-off event, in which case a replacement disk on the original server or a replacement server will provide a fix. However, establishing the root cause is usually more valuable because it can lead to a more strategic solution. In the example, perhaps there is a generic manufacturing fault which has come to light and makes that hardware more unreliable than expected, perhaps there is an issue with the disk controller software, perhaps the disk was close to its expected end of life, perhaps the technology the disk used was inefficient and is no longer recommended, and so on. A root cause can point to a solution that has more general applicability than one disk in one server. If it is a manufacturing problem, then other similar disks may be at risk and will need to be replaced or at least more closely monitored.

Therefore having a cause can be helpful in leading to a fix (replace the disk). A root cause (an issue with the disk manufacturer), however, can lead to a more strategic resolution involving a disk replacement programme, better monitoring of the disk estate and so on.
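One way to keep the five steps together is to record them as a single structure. The sketch below fills it in with the disk example; the class name `CauseEffectRecord` and its field names are my own assumptions for illustration rather than part of any established tool.

```python
from dataclasses import dataclass


@dataclass
class CauseEffectRecord:
    """One pass through Root Cause - Cause - Effect - Fix - Resolution."""
    root_cause: str   # origin of the problem
    cause: str        # immediate reason the effect occurred
    effect: str       # resultant problem or behaviour
    fix: str          # short term fix
    resolution: str   # stable long term solution

    def summary(self) -> str:
        # Present the steps in the order the framework lists them.
        steps = [self.root_cause, self.cause, self.effect, self.fix, self.resolution]
        return " -> ".join(steps)


# The disk example, with a manufacturing fault assumed as the root cause.
disk_example = CauseEffectRecord(
    root_cause="manufacturing fault affecting similar disks",
    cause="disk failure on one server",
    effect="website is not performing",
    fix="replace the disk",
    resolution="disk replacement programme and better monitoring",
)
print(disk_example.summary())
```

Keeping the fix and the resolution as separate fields makes it obvious when only the short term half has been dealt with.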

None of this is particularly innovative but I feel the 5 step process above together with an example helps to introduce the method.

I will use another example, this one not IT related, to illustrate the applicability to a more general context.

I am driving along a road, turn a corner and in front of me there is a car crashed into a tree. The driver of the car is not there. I have observed the effect, what is the cause? If the police are called to investigate the circumstances, some possible causes might include:
1. The driver was drunk and crashed the car, then ran off to avoid being caught.
2. The road had oil on it and the car skidded.
3. A farm animal escaped and the driver swerved to avoid it and is currently chasing the animal back to the farm.
4. A car came quickly in the other direction and the first driver swerved to avoid it.
5. There was an unknown manufacturing defect in the car which led to an unexpected failure in the steering or brakes.
6. The driver fell asleep at the wheel.
7. The car tyre suffered a sudden puncture and the driver lost control of the car as a result.
8. The car was stolen by joy riders who, after crashing the car, made off.
9. It was a windy day and debris was on the road which interfered with the car’s operation.
10. An oncoming cyclist came round the corner and the driver swerved to avoid them.
And so on; I’ll stop at 10 for brevity. It’s this sort of thing which might form the basis for a fatal accident enquiry in the event of a death, so that the true cause can be understood.
You can see from this subset of possible causes that it’s important to understand the cause in order to know what to do to take things further. The driver might be prosecuted for drunk driving, the farmer might need to fix their fence, the car manufacturer might need to issue a vehicle recall, the council might need to repair the road surface and so on.
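The point that the follow-up depends on the established cause can be sketched as a simple lookup. The dictionary below only restates the pairings mentioned above; the key names and the `next_action` helper are assumptions for illustration.

```python
# Follow-up actions keyed by the cause once it has been established.
# The key names are hypothetical labels for the causes discussed above.
FOLLOW_UP = {
    "drunk driver": "prosecute the driver for drunk driving",
    "escaped farm animal": "farmer fixes the fence",
    "manufacturing defect": "car manufacturer issues a vehicle recall",
    "poor road surface": "council repairs the road surface",
}


def next_action(established_cause):
    """Return the follow-up for a cause, or keep investigating if none is known."""
    return FOLLOW_UP.get(established_cause, "keep investigating")


print(next_action("manufacturing defect"))
print(next_action("unknown"))
```

Until the cause is narrowed down, the lookup has nothing to work with, which is the reason for investigating before acting.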

We see from this that an observed effect leads to a number of possible causes. This set should be analysed and then narrowed to as small a number as possible, preferably one, so that the correct action is taken in response to the effect. The cause, effect and actions should therefore line up if the actions are to address the cause. I see this a lot in risk and issue management, where actions are identified to reduce or mitigate a risk or issue. When I see actions associated with a risk or issue, a sanity check is always to confirm that the actions described effectively mitigate it. This helps to identify any gaps, overlaps or unintentional side effects, and to revisit whether the risk or issue has been clearly articulated and updated in the event that new information has come to light since it was first described.
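The sanity check described here can be sketched in a few lines, assuming each action is recorded together with the causes it claims to address. The function name `check_alignment` and the sample data are hypothetical; the check covers gaps, overlaps and actions that address none of the listed causes, but not side effects.

```python
def check_alignment(causes, actions):
    """Compare identified causes with the actions meant to address them.

    causes  -- list of identified causes
    actions -- dict mapping each action to the causes it claims to address
    Returns (gaps, overlaps, strays).
    """
    addressed = {}
    for action, targets in actions.items():
        for cause in targets:
            addressed.setdefault(cause, []).append(action)

    # Causes with no action against them.
    gaps = sorted(c for c in causes if c not in addressed)
    # Causes addressed by more than one action.
    overlaps = {c: a for c, a in addressed.items() if c in causes and len(a) > 1}
    # Actions that address none of the identified causes.
    strays = sorted(a for a, t in actions.items() if not set(t) & set(causes))
    return gaps, overlaps, strays


causes = ["disk failure", "no spare capacity"]
actions = {
    "replace the disk": ["disk failure"],
    "swap in a replacement server": ["disk failure"],
    "upgrade the web framework": ["slow page rendering"],
}
gaps, overlaps, strays = check_alignment(causes, actions)
print(gaps)      # ['no spare capacity']
print(overlaps)  # {'disk failure': ['replace the disk', 'swap in a replacement server']}
print(strays)    # ['upgrade the web framework']
```

An overlap is not necessarily wrong, since a short term fix and a longer term resolution may target the same cause, but it is worth a second look.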

As new legislation is proposed, seemingly in response to some problem the government has identified, I find myself using the same logic as above to line up cause, effect and fix. The fix is proposed (legislation) and I work backwards to try to understand why the proposed fix is the only or best solution to whatever problem the government believes the legislation will fix. When they don’t line up, it may be that the solution is politically motivated, that there are other causes or problems in relation to the fix that the government might prefer not to discuss, or that the evidence gathering has been flawed. However, it might be that the legislation is simply the most pragmatic way forward at the time. The same logic can be applied in other contexts, ranging from when your children ask for something to address a need as they see it through to business cases for startups claiming to resolve a problem that the founders believe they have the answer to.

There are many papers discussing cause and effect, and many on root cause analysis; however, I am unaware of any using the above 5 step model from root cause to solution. The model combines problem analysis and solution modelling, and there appears to be no single name for this.
Useful as this framework is, there are a number of issues in using it. Having worked as a consultant in regulation-driven environments, I appreciate that there are often considerable operational practices layered on top of the legal constraints. When solutions are documented so that the same mistakes are avoided in future, the storage and indexing of these can also become unwieldy. This information needs to be readily accessible and searchable in order to be useful. It should also have a review or expiry date to ensure only relevant, current information exists as a reference.
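As a minimal sketch of the searchable store with review dates, assume each documented solution carries a title, some text and a `review_by` date. The field names, the sample entries and the dates are all invented for illustration.

```python
from datetime import date

# Hypothetical documented solutions; every entry and date is made up.
lessons = [
    {"title": "Disk failure on web server",
     "text": "Replaced the disk; started a disk replacement programme.",
     "review_by": date(2016, 2, 11)},
    {"title": "Old disk controller fault",
     "text": "Patched the controller software.",
     "review_by": date(2014, 1, 1)},
    {"title": "Denial of service attack",
     "text": "Blocked the offending traffic.",
     "review_by": date(2016, 6, 30)},
]


def current_matches(records, keyword, today):
    """Return titles of records that mention the keyword and are not past review."""
    keyword = keyword.lower()
    return [
        r["title"]
        for r in records
        if r["review_by"] >= today
        and (keyword in r["title"].lower() or keyword in r["text"].lower())
    ]


# The controller entry is past its review date, so only one disk entry remains.
print(current_matches(lessons, "disk", today=date(2015, 2, 11)))
# ['Disk failure on web server']
```

In practice an expired entry would more likely be flagged for review than silently hidden; dropping it here just keeps the sketch short.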

I also believe that the more strategic the root cause or resolution, the more likely it is to affect the wider organisation. There should therefore be a correlation between the root cause, the resolution and the seniority or experience of the person involved. This helps to ensure that a person with the appropriate authority or experience has visibility or sign off. In this way we also get a correlation between Root Cause – Cause – Effect – Fix – Resolution and who needs to be involved, depending on the impact.
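The link between impact and who needs to sign off could be sketched as below. The scope names and the roles attached to them are purely assumptions for illustration; a real organisation would substitute its own governance levels.

```python
# Hypothetical scopes of impact, ordered from narrowest to widest.
ORDER = ["team", "department", "organisation"]

# Hypothetical role required to sign off at each scope.
SIGN_OFF = {
    "team": "project manager",
    "department": "programme manager",
    "organisation": "senior executive",
}


def required_sign_off(scopes):
    """Return the role for the widest scope touched by any of the five steps."""
    widest = max(scopes, key=ORDER.index)
    return SIGN_OFF[widest]


# The fix only affects one team, but the resolution affects the whole organisation.
print(required_sign_off(["team", "organisation"]))  # senior executive
```

Taking the widest scope across all five steps means a strategic root cause or resolution pulls in senior sign off even when the immediate fix is local.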



11 February 2015

A Cause and Effect management model
