The press reported the frustrations of more than 700 Scottish hospital patients who had treatment delayed last week because of an IT system failure.
What impact did this have?
- Think of those patients who had mentally prepared for a planned surgical operation over weeks or months, only to be told at the last minute that the treatment had been postponed due to something as mundane as an IT system failure.
- The CEO of NHS Greater Glasgow and Clyde (NHSGGC) had to publicly “apologise unreservedly for the inconvenience” of the IT failure. This is an embarrassing experience for any organisation’s CEO and leadership team, although in our experience being honest and releasing regrettable information early is the best way of maintaining trust and, in a commercial organisation, keeping customers.
- The Scottish Health Secretary’s reputation also suffered: the incident resulted in a 14-minute televised session in the Scottish Parliament, in which he answered questions put to him by his parliamentary peers. This was undoubtedly an uncomfortable experience, and one that is likely to have carried some political impact.
Although this undoubtedly had a real impact on people’s wellbeing and the reputation of the Health Board and its leadership, in the interest of balance it was good to see that the vast majority (some 90%) of clinical sessions were completed successfully using “backup systems”. We suspect that this included reverting to paper records in some instances.
Information released suggests that Microsoft Active Directory (AD) was at fault and had become corrupted over the weekend. AD is one of the core services behind most corporate networks; one of its primary functions is to hold a directory of valid users and to check login credentials so that only authorised users are allowed access to resources. Press articles suggest that the corruption was only discovered when staff tried to log into systems after the weekend and failed. The Health Board also states that this kind of problem is extremely unusual. While the technical information released is confused at best, we’d suggest that any IT system is vulnerable to corruption at some point. Although AD is extremely robust and an industry standard, Microsoft publish procedures on how to guard against corruption, fix corrupted AD databases and restore known-good copies of AD from backup. As all System Administrators know, AD replicates its data to multiple servers to ensure that services are always available at the point that they’re needed, but unfortunately it can also efficiently replicate corruption in some circumstances. For this very reason, replication of data should never be considered a backup.
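To illustrate why replication is not a backup, here is a minimal toy sketch. It is not a model of how AD actually replicates, and it says nothing about what happened in Glasgow; the `DirectoryReplica` class, the `replicate` function and the sample user records are all hypothetical names invented for this example. It simply shows that copying the current state everywhere spreads a bad change just as reliably as a good one, while a separate point-in-time copy is unaffected.

```python
import copy


class DirectoryReplica:
    """Toy stand-in for one directory server (hypothetical, not real AD)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


def replicate(source, replicas):
    """Copy the source's current state to every other replica, good or bad."""
    for replica in replicas:
        if replica is not source:
            replica.data = copy.deepcopy(source.data)


# Three replicas holding the same (made-up) user records.
replicas = [DirectoryReplica(f"dc{i}") for i in range(1, 4)]
primary = replicas[0]
primary.data = {"alice": "valid-hash", "bob": "valid-hash"}
replicate(primary, replicas)

# A point-in-time backup, taken while the data is known to be good
# and kept separate from the replication process.
offline_backup = copy.deepcopy(primary.data)

# Corruption appears on one server and is then faithfully replicated.
primary.data["alice"] = "<corrupt>"
replicate(primary, replicas)
all_corrupt = all(r.data["alice"] == "<corrupt>" for r in replicas)

# Only the offline copy still holds the known-good state to restore from.
for replica in replicas:
    replica.data = copy.deepcopy(offline_backup)
restored_ok = all(r.data["alice"] == "valid-hash" for r in replicas)

print("every replica corrupted:", all_corrupt)
print("restored from offline backup:", restored_ok)
```

In this toy model every replica ends up corrupt after replication, and only the copy that sat outside the replication loop can bring the service back.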
The Minister promised a review of the robustness of this and other similar systems being used across Scotland. It was suggested during the questioning that the fault was not down to “policy or management”, but rather “the failure of a [IT] system”.
The Minister also stated that the “backup servers failed to function”. We’re not sure what the Minister meant in this instance. If the corrupt AD database had replicated to all of the AD servers on the Health Board’s network, the most common action would be to restore the AD services from a recent known-good offline backup (not a backup server). We’ve all got a known-good, recently taken backup of AD and have tested our restore procedures, right? In practice, the backup of AD is something that’s easily missed, with replication taken for granted by some as a valid substitute for a solid point-in-time backup. As it’s not that simple to test a restore, it’s also easy to picture a scenario in which, even if a backup has been taken, it hasn’t been tested and restore procedures don’t exist. This is all a bit frustrating when there are good third-party backup and restore tools out there for AD, which guard against these most dire of incidents as well as accidental damage to the AD from buggy software or tired administrators. We can’t comment on what actually happened in Glasgow though, as detailed information hasn’t been released.
The failure of the IT system may have been reported as the cause in this incident, but if procedures weren’t in place to cater for such an incident and policy hadn’t mandated that appropriate procedures must be formulated and tested, then it could be seen that the IT system failure may actually have been an unanticipated event that really should have been planned for at the management level.
We would hope that, as part of the lessons learned exercise, questions are asked to find out what procedures are in place to regularly test the resilience of all critical network services, from the physical infrastructure upwards. What contingency plans could have been adopted in the event of a failure of these services? To what extent are these business continuity measures tested and audited? What resources are provisioned to ensure the resilience of this and other IT systems? What kind of industry best practice can the Health Board draw on to make sure that these kinds of events are anticipated and prepared for?
The Minister has already ordered the review, so we all look forward to learning valuable lessons from the Health Board’s experience. We were impressed that the Minister took this action, but it should be common practice for any public body to perform post-incident analysis to provision for any gaps in technology and procedures, and to determine, act on and share any lessons learned.
Ideally, we would suggest a re-evaluation of the business continuity (or availability) levels of system components, measured against the cost of a system failure in business (i.e. impact on patients), financial and reputational terms. These evaluations should also be measured over time, for example the cost of a system failure lasting five minutes, an hour, 12 hours and so on, and used to prioritise any immediate remedial action.
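As a rough sketch of how such a time-based evaluation might be used to prioritise, consider the following. Every figure, component name and the tolerance threshold here is hypothetical and invented purely for illustration; a real exercise would use impact estimates agreed with the business, and the ranking rule shown (whichever component breaches tolerance soonest comes first) is only one of several reasonable choices.

```python
# Hypothetical impact estimates (arbitrary units) for outages of
# 5 minutes, 1 hour and 12 hours, keyed by outage length in minutes.
IMPACT = {
    "directory service": {5: 1, 60: 40, 720: 900},
    "appointment booking": {5: 0, 60: 10, 720: 300},
    "staff intranet": {5: 0, 60: 1, 720: 20},
}

# Hypothetical maximum acceptable impact, as it might be set by the board.
TOLERANCE = 25


def time_to_breach(estimates, tolerance):
    """Return the shortest listed outage (minutes) whose impact exceeds
    the tolerance, or None if no listed outage does."""
    for minutes in sorted(estimates):
        if estimates[minutes] > tolerance:
            return minutes
    return None


def prioritise(impact, tolerance):
    """Order components so that those breaching tolerance soonest come first;
    components that never breach it are listed last."""

    def sort_key(name):
        breach = time_to_breach(impact[name], tolerance)
        return (breach is None, breach if breach is not None else 0, name)

    return sorted(impact, key=sort_key)


ranking = prioritise(IMPACT, TOLERANCE)
for name in ranking:
    print(name, "breaches tolerance after (minutes):",
          time_to_breach(IMPACT[name], TOLERANCE))
```

With these made-up numbers the directory service ranks first, because an hour without it already exceeds the tolerance, whereas the staff intranet never does within the durations considered.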
In a wider sense, we would hope that organisations in general can begin to manage risk more effectively. IT is a core business service, and just as organisations manage financial or H&S risks effectively, it merits the same attention so that investment in IT becomes a business decision rather than ‘IT support’s problem’ or ‘an overhead’. More organisations need to make this leap, which can only start at board level.
An international standard called ISO 27001 introduces a management system to effectively identify and manage risk to information systems (whether these are electronic, on paper, or in people’s heads), to an acceptable level defined by the board. This can help move the ownership of risk from the technical domain into the board room, where requests for investment to reduce risk can be balanced against the impact if the risk were realised, or against the objectives of the wider organisation. We think that it’s healthy for a business to fully own something as crucial as its information systems at board level, rather than just to think of the electronics on the desk as IT support’s responsibility.
Regency is one of the UK’s leading niche consultancies, specialising in the protection of information systems. Whether you need to evaluate the value of your information and complete a risk assessment to consider potential future events that could disrupt your business; review and test your business continuity plans to prepare for the worst; or achieve ISO 27001 Certification in order to get independent assurance that your organisation’s information systems are in good shape; we can help. We draw on the deep experience developed through helping to protect government and military information to pragmatically help clients from all areas of business protect their information in a way that works for their organisation. We help organisations recognise the value of their information assets, and work with leadership teams to gain sponsorship for improving the management of risk against those assets. We also bring a practised ability to make protecting the most complex of IT systems seem straightforward. We can also help prepare your organisation for ISO 27001 Certification, working with you to transfer skills so that you can maintain this certification in-house over the longer term.
Why not give one of our consultants a call? It doesn’t commit you to anything, and we’re always happy to help put people on the right track, even if it’s only to answer a quick question!