What the Triple Zero Outage Taught Us
The recent outage of the National Triple Zero service has sparked the usual mix of outrage, confusion and blame from a salacious media, committed Telstra haters, smug ICT and telco ‘experts’ and obliging punters revelling in their 15 minutes of fame.
We lose perspective very quickly in our ‘lucky country’. The expectation that governments will provide platinum-plated services at bargain-basement prices is at odds with the real cost of providing critical infrastructure across our vast continent.
In 2014, the Australian Government undertook a review of the National Triple Zero service. While there is general agreement that the service's year-on-year performance is actually pretty impressive, the review identified a number of areas where the service can and should be improved. A summary of the findings can be found here.
Notwithstanding the above, the outage did highlight one particular and significant failure in the thinking around the design, and ultimately the operation of these critical network services. That failure was around the core tenets of network resilience: Keep it simple. Make it highly available. Spread the risk across separate carriers.
Keeping it simple means that the network needs to be simple to design, commission, operate and change. The more complex a system or network design, the more things can, and will, go wrong.
Making it highly available means duplicating power sources, switches and routers, transmission electronics, paths and routes. The more duplication, the lower the risk, but the higher the cost of the solution.
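To put rough numbers on that trade-off, here is a minimal sketch of how duplication lifts availability. It assumes each duplicated element fails independently of the others, which real networks only approximate, and the 99% figure for a single element is purely illustrative. The function name is hypothetical.

```python
def parallel_availability(element_availability: float, copies: int) -> float:
    """Availability of `copies` redundant elements where any one can carry the load.

    Assumes failures are independent (an idealisation): the group is down
    only when every copy is down at the same time.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - element_availability) ** copies


if __name__ == "__main__":
    # Illustrative figure only: a single element that is up 99% of the time.
    for copies in range(1, 5):
        print(copies, f"{parallel_availability(0.99, copies):.8f}")
```

Each extra copy buys a smaller improvement than the one before it, while the cost of each copy stays roughly the same.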
Spreading the risk across separate carriers means exactly that: different transmission services and network elements provided by completely separate carriers. Carrier diversity is the only real way to get a system or network to approach 100% availability. I say ‘approach 100% availability’ because, beyond a certain point, there is a diminishing commercial return on building more redundancy into a network. People think that Murphy’s Law is regularly at work, but the reality is that on a network with 99.995% availability you only experience the full effect of Murphy 0.005% of the time. How many lightning strikes, fibre cuts, switch failures and exchange outages have occurred in the last 12 months? The answer is, surprisingly, a lot. You just never experience them because the redundancy kicks in.
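For a sense of scale, an availability percentage can be converted into expected downtime per year. This is a back-of-the-envelope sketch: the function name is hypothetical, and it assumes a 365-day year and treats availability as a simple fraction of total time.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def expected_downtime_minutes(availability: float) -> float:
    """Expected minutes of outage per year for an availability between 0 and 1."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # 99.995% availability, the figure quoted above.
    print(f"{expected_downtime_minutes(0.99995):.2f} minutes per year")
```

At 99.995% that works out to roughly 26 minutes of outage a year, which is why most users never notice the failures happening underneath.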
That said, people responsible for sourcing network services for critical ICT requirements must take a much more informed and demanding view around network diversity. No matter what your incumbent carrier tells you, their version of network diversity is inferior to that of a service provided over two different carrier networks.
In recent years it has become increasingly common for a carrier to be required to prove that its services run over infrastructure that is completely separate from that of the primary or incumbent network provider.
Murphy’s Law should have come with the following caveat for anyone buying network services: “If an act of God doesn’t get you, human error will.” Don’t be that human error.
Find out how Vertel delivers critical network infrastructure and services here.