StableNet® Improves Root Cause Algorithm with 12 SP4

Summary: Root Cause Analysis (RCA) is an integral part of StableNet® and essential for monitoring and alarming. Its purpose is to find out the actual cause of errors and failures in the most efficient way possible. This critical functionality relies on a core algorithm and the recent improvement will help users manage alarms even more efficiently.

General Description of Automated Root Cause Analysis in StableNet®

In StableNet®, the automated Root Cause Analysis is based on dependencies between a monitor and its possible root causes (= possible reasons for failure). For example, a device service (e.g. a database service) depends on the device being available and running. The device availability, in turn, depends on the connectivity of the device (= the port the device is connected to). StableNet® automatically creates these dependencies based on information gathered during the discovery process. Additional dependencies can also be added manually or through the discovery template.
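To make the dependency chain above concrete, here is a minimal sketch in Python. It is not StableNet® code: the monitor names, the `ALARM_PARENTS` mapping, and the `dependency_chain` helper are hypothetical and only illustrate how a service monitor depends on availability, which in turn depends on a port.

```python
# Hypothetical dependency map: each monitor lists its possible root causes.
# In StableNet® such dependencies come from discovery or are added manually;
# the names below are made up for illustration.
ALARM_PARENTS = {
    "db-service@server": ["availability@server"],
    "availability@server": ["port-status@switch"],
    "port-status@switch": [],
}


def dependency_chain(monitor, parents=ALARM_PARENTS):
    """Return every monitor the given monitor depends on, nearest first."""
    chain = []
    queue = list(parents.get(monitor, []))
    seen = {monitor}
    while queue:
        current = queue.pop(0)
        if current in seen:  # skip monitors that were already handled
            continue
        seen.add(current)
        chain.append(current)
        queue.extend(parents.get(current, []))
    return chain


print(dependency_chain("db-service@server"))
# ['availability@server', 'port-status@switch']
```

A plain dictionary is used here only to keep the sketch short; any graph representation would serve the same purpose.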

When an alarm occurs, StableNet® starts the automated Root Cause Analysis, a heuristic algorithm that calculates the most probable root cause for the alarm. An example is a cut fiber that causes the port status to go down and the availability and service monitors of connected devices to enter the alarm state.

Once a monitor enters an alarm state, the Root Cause Analysis starts with all defined alarm parents (= possible root cause monitors) of this monitor. The algorithm then recursively follows each parent and checks its alarm parents in turn. The resulting dependency graph can also include loops. All monitors in a warn or alarm state in this graph are possible root causes. Depending on the complexity of the graph, the analysis determines either a single root cause or several monitors that could be the root cause. This not only helps manage the number of alarms network operators receive, but is also a significant unique selling point (USP) of our automated cross-vendor and cross-technology platform.
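The traversal described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the StableNet® implementation: the `State` enum, the function names, and especially the final selection rule (prefer candidates whose own alarm parents are all healthy) are hypothetical, since the actual heuristic is not described here.

```python
from enum import Enum


class State(Enum):
    OK = "ok"
    WARN = "warn"
    ALARM = "alarm"


def find_candidates(start, parents, states):
    """Collect all monitors in warn/alarm state reachable via alarm parents."""
    candidates = []
    visited = set()

    def visit(monitor):
        if monitor in visited:  # the dependency graph may contain loops
            return
        visited.add(monitor)
        if states.get(monitor, State.OK) is not State.OK:
            candidates.append(monitor)
        for parent in parents.get(monitor, []):
            visit(parent)

    visit(start)
    return candidates


def probable_root_causes(start, parents, states):
    """Assumed heuristic: keep candidates that have no candidate parent."""
    candidates = find_candidates(start, parents, states)
    candidate_set = set(candidates)
    roots = [
        m for m in candidates
        if not any(p in candidate_set for p in parents.get(m, []))
    ]
    # In a pure loop every candidate has a candidate parent; report them all.
    return roots or candidates


# Cut-fiber example: port, availability and service are all in alarm.
parents = {"service": ["availability"], "availability": ["port"], "port": []}
states = {name: State.ALARM for name in parents}
print(probable_root_causes("service", parents, states))  # ['port']
```

The `visited` set is what keeps the recursion from running forever when the graph contains a loop. When no single candidate stands out, the sketch returns several monitors, mirroring the case where multiple monitors could be the root cause.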

A More Powerful Algorithm for Greater Differentiation and Insight

Root Cause Analysis requires establishing a hierarchy of dependencies among devices. The original algorithm was designed mainly around connectivity in networks, and the only monitors involved were availability and interface (port) status. Today, however, the Root Cause algorithm must be much more generic, e.g. based on imported network services or other dependencies. Often, additional possible root causes must be added so that they can be correlated.

The simple screenshot from StableNet® in Fig. 1 captures the improvement succinctly. We have a device “server” which has two parents: a) the port of the switch “sw-intern-10.1.1.10” to which the device is connected; and b) the status of the power source for the device. With the new approach, the algorithm accounts for the power supply monitor, which allows for greater differentiation of why an alarm occurs on our server device. This in turn provides a more efficient and powerful RCA in StableNet®.
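As a rough illustration of the two-parent case, the sketch below shows how knowing the state of both parents lets an alarm on the server be attributed to the switch port, the power source, or both. The monitor names and the `alarming_parents` helper are hypothetical, and the logic is a simplification, not the improved StableNet® algorithm itself.

```python
# Hypothetical monitor names modeled on the example: the server has two
# alarm parents, the switch port and the power source.
PARENTS = {
    "availability@server": [
        "port@sw-intern-10.1.1.10",
        "power-source@server",
    ],
}


def alarming_parents(monitor, parents, in_alarm):
    """Return the alarm parents of `monitor` that are currently in alarm."""
    return [p for p in parents.get(monitor, []) if p in in_alarm]


# Case 1: only the power source has failed.
print(alarming_parents(
    "availability@server", PARENTS,
    in_alarm={"availability@server", "power-source@server"},
))  # ['power-source@server']

# Case 2: only the switch port is down.
print(alarming_parents(
    "availability@server", PARENTS,
    in_alarm={"availability@server", "port@sw-intern-10.1.1.10"},
))  # ['port@sw-intern-10.1.1.10']
```

With only the port as a parent, both cases would look identical from the server's point of view. Adding the power monitor as a second parent is what makes them distinguishable.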

We analyzed many different cases intensively and tested the algorithm in live setups. We also used example data and models from customers to check the algorithm.

The improved algorithm is integrated as of StableNet® 12 SP4.

Fig. 1: A simple depiction of the impact of the improved root cause algorithm
