Disaster recovery – searching for the best answer
The story of the BA network failure generated a lot of headlines over the holiday weekend and caused significant problems for anyone wanting to get somewhere important on schedule. The full story of how a power failure could ground an airline has yet to be made public. However, the range of what are referred to as uninterruptible power supplies is huge, and the speed at which they can take over a data centre until applications can be rendered ‘safe’ is quite amazing. In the case of BA I was surprised that the failure could take out not only all the operational systems of the airline but also email servers, web servers and customer call centres, so that no-one could communicate with anyone else inside the company. Even the special 24/7 media line for journalists went down as it relied on internal telephony. As is so often the case in IT, the importance of avoiding these breaks in capability has been recognised for decades. Tandem Computers was established in 1974 by James Treybig and colleagues who had worked on these problems at HP. Tandem was eventually acquired by Compaq, which was then acquired by HP. What goes around comes around.
There are three elements of disaster recovery planning. The first is to define both the Recovery Point Objective (how much recent data the business can afford to lose, measured as a period of time) and the Recovery Time Objective (how quickly service has to be restored). These are not easy definitions for IT and business operations to agree on. The second is to build a risk management plan that sets out the implications for business operations if these Objectives are met. The third is to work out the implications of what would happen if they are not met, which could lead to a revision of both Objectives. The problem from an IT perspective is trying to work through all the dependencies of the systems and processes that are linked to and/or dependent on the server or network that has failed. No doubt plans will have been established for each individual element, but I doubt that anyone is going to stress-test the entire IT architecture in practice, and probably not even in simulation.
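To give a feel for what working through the dependencies involves, here is a minimal sketch in Python. It assumes you can write down, for each system, what it directly depends on. The system names and the DEPENDS_ON map are hypothetical and purely illustrative, not taken from any real architecture. The function walks the map backwards from a failed component to find everything that would be affected, directly or indirectly.

```python
from collections import deque

# Hypothetical dependency map (an assumption for illustration only):
# each system lists the components it directly depends on.
DEPENDS_ON = {
    "search_ui": ["search_index", "web_server"],
    "search_index": ["crawler", "storage"],
    "crawler": ["storage", "network"],
    "web_server": ["network"],
    "email": ["network"],
    "storage": ["power"],
    "network": ["power"],
    "power": [],
}


def affected_by(failed, depends_on):
    """Return every system that directly or indirectly depends on `failed`."""
    # Invert the map so we can ask "who depends on X?" instead of
    # "what does X depend on?".
    dependents = {name: [] for name in depends_on}
    for system, needs in depends_on.items():
        for need in needs:
            dependents.setdefault(need, []).append(system)

    # Breadth-first walk outwards from the failed component.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for system in dependents.get(current, []):
            if system not in seen:
                seen.add(system)
                queue.append(system)
    return sorted(seen)


if __name__ == "__main__":
    print(affected_by("storage", DEPENDS_ON))
```

Even in this toy map a storage failure reaches the crawler, the index and the search user interface. A real estate will have far more entries, and the hard part is getting the map right in the first place, which is exactly why so few organisations test the whole architecture.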
Disaster recovery for search applications is a fairly interesting exercise. To illustrate this, take a look at the introductory document on the options for disaster recovery on SharePoint 2013. In many respects it is quite a comforting document. Next take a look at the challenges of managing disaster recovery for search in SharePoint 2013. You will notice that it is a lot longer, with a great deal of technical information. It also starts by noting that many of the approaches used for disaster recovery in earlier versions of SharePoint Server don’t provide the same level of recovery in SharePoint 2013. I’ve used SharePoint as an example as the documentation is easy to link to.
In general, disaster recovery for search is much more difficult than most IT managers are aware of. There may even be a need to carry out a total or partial crawl to build a new index if the index itself is corrupted in some way, and that is not good news. It is also important to recognise that a search application is a signpost to a server location for the document. The search application itself may be working ‘correctly’ but pointing to content that for now does not seem to exist. Another issue is that search analytics is often a separate application, and all the positive trend lines you have been building up for the last few months could suddenly have an unwelcome spike.
So this week would be a good week to ask for a copy of the disaster recovery plan for your search applications and consider whether the risk register you have for the applications and the disaster recovery plan meet somewhere on common ground. Don’t accept as an excuse that this will not happen because the organisation uses a cloud service provider. However, just asking for the documentation in the cloud service contract that applies specifically to your search applications is probably not good enough. You need to sit down with the IT team and work through the provisions of the contract on a line-by-line basis. Both groups of managers may learn much from the process.