#003 – Event Prediction Protocol

The new year is approaching and we are trying to predict what will happen in 2017. Networks also have their own predictions, describing trends, investments and migrations. Everyone likes it when data centers, networks and overall services are highly available. Wouldn't we increase network availability and security by anticipating failures and attacks?

Let's imagine we have an Event Prediction Protocol (EPP) which informs a headend node about the probability of a failure of the primary path or of a component of that path. The most important attribute would be the probability of a failure or attack. How could this be measured? For failures we could use the MTBF of each node, system and path in conjunction with other information. The more we know, the better: statistical information about the availability of fibres, DWDM systems and the topologies in use, even NOC KPIs. All these metrics would be sent as parameters or as a probability attribute for the final calculation. The issue here is that some of the data are misleading. A real MTBF is different from the theoretical MTBF taken from a data sheet. Even if a switch or a router can run without an outage for more than 10 years, someone will surely make a configuration mistake during that time.
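As a rough illustration of how such a probability attribute could be derived, here is a minimal Python sketch that turns per-component MTBF values into a failure probability for the whole primary path. It assumes an exponential failure model and purely serial components, and the MTBF numbers are placeholders for illustration, not data-sheet values.

```python
import math

HORIZON_HOURS = 24 * 365  # one-year prediction horizon

def component_failure_probability(mtbf_hours, horizon_hours=HORIZON_HOURS):
    """P(component fails within the horizon), assuming an exponential failure model."""
    return 1.0 - math.exp(-horizon_hours / mtbf_hours)

def path_failure_probability(mtbf_values, horizon_hours=HORIZON_HOURS):
    """A primary path fails if any of its serial components fails."""
    p_survive = 1.0
    for mtbf in mtbf_values:
        p_survive *= 1.0 - component_failure_probability(mtbf, horizon_hours)
    return 1.0 - p_survive

# Illustrative MTBF values (hours) for two routers, a DWDM system and a fibre span.
primary_path_mtbf = [200_000, 200_000, 150_000, 80_000]
print(f"P(primary path failure within a year) = {path_failure_probability(primary_path_mtbf):.1%}")
```

In practice the theoretical result would have to be corrected with the operational data mentioned above (NOC KPIs, real outage history, configuration-change risk) before it is advertised as the EPP probability attribute.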

[Image: ha-net-mess]

We should gather more operational data and analyse it. But this is not enough. We could also use cognitive analysis, which is the future of cloud services. Someone wants to plug in a wrong cable and create a backdoor link or a Layer 2 loop? No way: the behavioral alert level is raised to high through image or video identification. An example service is available here: https://www.microsoft.com/cognitive-services/en-us/computer-vision-api
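To make the idea a little more concrete, here is a hedged Python sketch of how a camera frame could be pushed to an image-tagging service and mapped to a behavioral alert level for the EPP headend. VISION_URL, API_KEY and the tag names are placeholders of my own, not the actual Computer Vision API contract or response format.

```python
import requests

# Placeholder values – not the real Computer Vision API endpoint or schema.
VISION_URL = "https://example.invalid/vision/analyze"
API_KEY = "REPLACE_ME"

# Hypothetical tags that should raise the behavioral alert level.
SUSPICIOUS_TAGS = {"unlabelled cable", "open cabinet", "unknown person"}

def behavioral_alert_level(image_bytes):
    """Send a camera frame to an image-tagging service and map its tags to an alert level."""
    response = requests.post(
        VISION_URL,
        headers={"Ocp-Apim-Subscription-Key": API_KEY,
                 "Content-Type": "application/octet-stream"},
        data=image_bytes,
        timeout=10,
    )
    response.raise_for_status()
    # Assumed response shape: {"tags": [{"name": "..."}]}
    tags = {t["name"] for t in response.json().get("tags", [])}
    return "high" if tags & SUSPICIOUS_TAGS else "normal"
```

The interesting part is not the vision call itself but the mapping: once the alert level goes high, the headend can be told about it through an EPP notification.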

We can imagine that our work could be analysed on the fly and all interested parties could be informed through EPP notifications. This would be especially useful when it comes to network security. Minority Report is becoming more and more a reality.
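One possible shape for such an EPP notification, sketched in Python below; the field names, node names and JSON encoding are illustrative assumptions, since no wire format is defined here.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class EppNotification:
    headend: str                # node that should react, e.g. switch to a backup path
    primary_path: list          # ordered list of path components
    failure_probability: float  # combined probability attribute, 0.0 - 1.0
    cause: str                  # e.g. "mtbf", "behavioral-alert", "noc-kpi"
    issued_at: str              # RFC 3339 timestamp

def build_notification(probability, cause):
    """Serialise a single EPP notification as JSON (illustrative encoding only)."""
    msg = EppNotification(
        headend="pe1.example.net",
        primary_path=["pe1", "dwdm-a", "fibre-12", "pe2"],
        failure_probability=round(probability, 4),
        cause=cause,
        issued_at=datetime.now(timezone.utc).isoformat(),
    )
    return json.dumps(asdict(msg))

print(build_notification(0.23, "behavioral-alert"))
```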

Happy New Year!

One thought on “#003 – Event Prediction Protocol”

  1. Actually, more and more vendors try to provide that kind of ‘future failure probability’ per-node data by their own means, including various ‘call home’ mechanisms, and building in features that test ASICs and FPGAs is quite common these days. Of course, white-box vendors lag here (as they usually don’t use or fully exploit the diagnostic capabilities of the underlying hardware and APIs), and traditional networking/telco vendors move quite slowly, but… as you wrote: just throwing the immense amount of diagnostic and monitoring data at some system that has the capability to learn, observe and adapt the data to formulate future events seems to be within our grasp today. I believe Google, Facebook and Amazon already try to predict server and rack/pod failures this way, scaling up from protocols like S.M.A.R.T. for HDDs to whole ecosystems. And there’s some academic research already pointing this way: http://ieeexplore.ieee.org/document/7460337.

    So, the only question that remains is: should we pursue certification by IEEE or IETF? 🙂
