Predictive Networks: Networks that Learn, Predict and Plan
Abstract: Since the early days of the Internet (ARPANET in 1970), the topic of Routing Protocol Convergence Time (time required to detect and reroute traffic in order to handle a link/node failure) has been a top-of-mind issue so as to constantly improve the network availability. A number of protocols and technologies have been developed and deployed at a large scale with the objective of reducing the impact of failures in the network. Although such approaches have dramatically evolved, they all rely on the fundamental concept of reactive approaches where failure must first be detected followed by traffic rerouting along an alternate path (that may not provide guarantees). In contrast, a predictive approach relies on a different paradigm consisting in rerouting traffic before the occurrence of the (predicted) failure onto an alternate path that meets application Service Level Agreement (SLA) requirements. Being able to extend the scope to “paths providing poor application experience” (also referred to as Grey failure) is imperative. Extensive experiments have shown a high proportion of paths being “alive” (according to Layer-3 protocols) but performing poorly and thus impacting the user experience.
This white paper discusses the emergence of Predictive Networks using learning technologies capable of predicting a broad range “failures”. The first version of this paper was posted on June 2021 after 1.5 year of investigation. This version 2 includes results derived from the deployment of such technology at scale. The ability of Learning could very well be a paradigm shift in networking for the future.
1 A new world, with new challenges
Network recovery has been a topic of high interest in the networking industry since the early days of the ARPANET in 1970. Nonetheless, the paradigm has not changed much: first, a failure must be detected, followed by traffic rerouting along an alternate path; such a path can be either pre-computed (i.e., protection) or computed on-the-fly (i.e., restoration).
Let us first discuss “Failure detection”. The most efficient approach is to rely on inter-layer signaling whereby lower layer may be able to detect a layer-1 failure (e.g., fiber cut) triggering a signal across layers. Unfortunately, a large proportion of failures impacting a link or a node are not detectable by lower layers. Thus, other techniques such as Keep Alive (KA) messages have been designed. There is clearly no shortage of KA mechanisms implemented by routing protocols such as OSPF, ISIS or BFD. KA have their own shortcomings related to their parameter settings: aggressive timings introduce a risk of oscillations of traffic between multiple paths upon missing few KA messages, a real challenge on (lossy) links where packet loss is not negligible, which introduce high risks of oscillations (such issues can be partially mitigated using some form of hysteresis). Once the failure is detected, a plethora of techniques have been developed such as Fast IGP convergence (OSPF or ISIS using fast LSA/LSP generation, fast triggering of SPF, incremental SPF to mention a few), MPLS Fast Reroute (using a 1-hop backup tunnel for link protection or multi-hop backup tunnels for node protection), IP Fast Reroute (IPRR) or other protection mechanisms used at lower layers (1+1 protection, 1:N, etc.) have proven their efficacy at minimizing downtimes. Such recovery technologies have allowed for a fast convergence time in the order of a few milliseconds, while potentially guaranteeing equivalent SLA on other alternate paths (e.g., MPLS TE with bandwidth protection).
Unfortunately, there is a large category of failures that impact application experience that remain undetected/undetectable. The notion of grey failures has been covered in a companion document . These grey failures (sometimes also referred to as brown failures) may have a high impact on application experience because of high packet loss, delay or jitter without breaking the link/path connectivity (and thus they are not detected by the aforementioned technologies as “Failure”). In this case, most -- if not all -- KA-based approaches would fail, leaving the topology unchanged and the traffic highly impacted even though a preferable alternate path may exist.
Existing solutions such as Application Aware Routing or AAR  rely on the use of network-centric mechanisms such as BFD and HTTP probes to evaluate whether a path meets the SLA requirements of an application using a so-called SLA Template. The most common approach consists in specifying some hard threshold values for various network central KPIs such as the delay, loss and jitter averaged out over a given period of time (e.g., delay average computed over 1h should not exceed 150ms one-way for voice). Such templates may be highly debatable and the use of statistical moments such as average or other percentile values have the undesirable effect of smoothing out the signal and losing the necessary granularity to detect sporadic issues that do impact the user experience. One may then shorten probe frequencies, use higher percentiles but the granularity is unlikely to suffice for the detection of grey failures impacting the user experience (e.g., a 10s high delay/packet loss would highly impact voice experience while not being noticed when probing every 1-minute or even more frequently but averaging out over 1h periods).
Still, AAR is a great step forward when compared with the usual routing paradigm, according to which traffic rerouting occurs only in presence of dark failures (i.e., path connectivity loss). Note that AAR is a misnomer since the true application feedback signal is not taken into account for routing; path selection still relies on other static network metrics and SLA templates are a posteriori assessed and verified as explained above. AAR is reactive (the issue must first be detected and must last for a given period of time for a rerouting action to take place) with no visibility on the existence of a better alternate path: rerouting is triggered over alternate paths with no guarantees that SLA will be met once traffic is rerouted onto the alternate path. This last component is of the utmost importance. With most reactive systems using restoration such as IP, upon detecting that the current path is invalid (e.g., lack of connectivity, or not meeting the SLA) the secondary path is selected on-the-fly with no guarantee that it will keep satisfying the SLA once traffic will be rerouted. This is another major difference with Predictive network using models as discussed in Predictive Networks.
2 Path computation in the Internet
Several routing protocols used within Autonomous Systems (AS), also referred to as IGP (Interior Gateway Protocol), have been developed over the past few decades such as OSPF, ISIS or EIGRP, whereas BGP (an Exterior Gateway Protocols) has been widely deployed to exchange routes between ASes for the past four decades. BGP has been scaling remarkably well and as of 2022 routing tables comprise up to 915K IPv4 prefixes and 152K IPv6 prefixes.
How are paths/routes being computed in the Internet? This is the task of routing protocols. IGPs (ISIS or OSPF) make use of a Link State DataBase (LSDB) to compute shortest paths using the well-known Dijkstra algorithm. Link weights are statically reflecting some link properties (link bandwidth, delay, …). More dynamic solutions using control plane Call Admission Control (CAC) such as MPLS Traffic Engineering allows for steering traffic along Labeled Switched Path (LSP) computed using constraint-shortest paths using a distributed head-end driven Constrained SPF (Shortest Path computation). Alternatively, such TE LSPs may also be computed using a Path Computation Element (PCE) for Optical/IP-MPLS layers, intra and inter-domain. Constrained shortest paths may even be recomputed according to measured bandwidth usage (e.g., Auto-bandwidth). The PCE is a central engine that gathers topology and resources information to centrally compute Traffic Engineering LSP (TE LSP) paths. The PCE can be stateless or stateful performing global optimization of traffic placement in the network. Paths between AS are selected using BGP.
A path between a source and destination across the Internet will likely cross several Autonomous Systems (AS), each managed by distinct administrative domains using disparate traffic engineering policies, making the optimization of end-to-end path extremely challenging, if not impossible.
3 Before Predicting the network must be able to Learn
3.1 Learning in the human brain
The human brain is without a doubt the most advanced learning engine. Despite considerable breakthrough discoveries in neurosciences over the past few decades, the learning strategies of the human brain remain fairly unknown. Still, the human brain is a remarkable highly plastic learning machine. The learning capabilities manifest themselves in several ways:
· The Hebbian theory related to synaptic plasticity has been a key principle in neuroscience “What fire together wire together”; thanks to synaptic plasticity neural networks are formed (wire together) dynamically thus allowing us to learn and also unlearn (for example thanks to synaptic downscaling during sleep).
· The brain is made of a collection of networks (e.g., Default mode, sensimotor, visual, attention, …) that are also highly dynamic: new links and connections (synaptic plasticity), new nodes (neurogenesis in areas such as the hippocampus), and even new networks topologies (rewiring).
JP Vasseur (firstname.lastname@example.org), PhD - Cisco Fellow, Engineering Lead – Release v2.0 – April 2022
A well know theory known as “Predictive Coding” and depicted in Figure 3 show that some areas of the brain get inputs from other regions, provide predictions that get adjusted according to sensing. A number of theories on the critical topic of “Predictive Coding” such as the Free Energy Principle have been first explored using a Bayesian approach by Friston in 2005. Other theories state that the ability to learn with very few labels relies on the brain’s ability to build a predictive model of the world mostly thanks to observation allowing a human to quickly learn and predict. In contrast, existing supervised machine learning techniques require a massive amount of labeled data or reinforcement learning techniques that also require huge amount of action/feedback. Such an approach gave rise to new approaches in Machine Learning such as Self-Supervised learning.
Other forms of predictions involving hierarchical structures (Hierarchical coding) with interaction between sensing and prediction (in different brain areas communicating via different layers of the neocortex) are also very well-known in vision, auditory pathways and Natural language processing.
Higher level predictions are also known to be performed in the PreFrontal Cortext (PFC). Without elaborating too much the ability to decide (Automate in networking terms) is also a key ability human (and animals) have that heavily involves the Orbito Frontal Cortex (OFC).
3.2 How about learning in the Internet?
It is an unescapable fact that most of the control plane networking technologies do not incorporate learning. Instead, today’s technologies and protocols rather focus on the ability to react as quickly as possible using current states that do not leverage any form of learning.
Imagine a human brain incapable of learning and rather just reacting!
We have to face it: the Internet has not been designed with the ability to learn (from historical data), model (the network) and even less predict. Data collection has most often been used for monitoring and troubleshooting of past events. Protocols simply do not learn. There are some minor exceptions such as packet retransmissions (backoff) at lower transport layers, the ability to estimate/learn the instantaneous bandwidth along a given path, the use of hysteresis or some protocols trying to estimate capacities before using a path such as . But for the most part, protocols have adaptive behaviors according to a very recent past, without true learning/modelling. Control plane operations mostly focus on the ability to react.
Networks are highly diverse and dynamic: in a multi-cloud highly virtualized world where the network keeps changing and applications constantly move, it has never been so important to equip the Internet with the ability to learn.
Furthermore, networks are highly diverse, which underlines the necessity to learn and adapt using technologies such as Machine Learning (ML). Similar properties were observed in other networking areas: for instance, the ability to detect abnormal joining times or percentage of failed roaming in Wifi networks cannot rely on the use of static thresholds. In 2018 ML-based tools such as Cisco AI Network Analytics ) were developed where ML models were used to learn a “baseline” according to a number of parameters. The ability to learn allowed for a dramatic reduction of unnecessary noise (false alarms due to the use of static thresholds). Although the use cases for applying ML to the WAN are different (predict issues), this highlights the need to learn and dynamically adapt to the network characteristics.
2 Predictive Networks
Starting with a famous quote attributed to Niels Bohr: “It is hard to make predictions, especially about the future”. Cisco has invested considerable research since 2019 to investigate the ability for networks to learn, predict and plan.
The ability to predict requires building models trained with data. First, an unprecedented analysis has been made on millions of paths across the Internet, using different networking technologies (MPLS, Internet), access types (DSL, fiber, satellite, 4G), in various regions and thousands of Service Provider networks across the world. The objective was first to evaluate the characteristics of a vast number of paths .  provides an overview of such analysis. Figure 4 shows a few approaches for data path statistical models. More details can be found in our internet dyamics study .
The human brain is not only an impressive learning machine but also a remarkable predictive engine: the simple action of grabbing an object involves a complex series of predictive actions (e.g., just anticipating the shape of the object), or even the position of the ball when playing tennis due to the high processing times required by vision.
More advanced models relying on a variety of features and statistical variables have been designed (e.g., spectral entropy, welch spectral density, MACOS, …) along with their impact on application experience .
The fact that path “quality” across the Internet is diverse and varies over time is not new. But the main key take-away lies in the ability to quantify across space and time using a broad range of statistical and mathematical analysis.
Simply said, predicting/forecasting refers to the ability to make assumptions on the occurrence of events of interest (e.g., dark/grey failures) using statistical or ML models learned using historical data.
4.1 Short-term versus Long-term predictions
The forecasting horizon (how much time in advance should the prediction be made) is one of the most decisive criteria. Forecasting with very low time granularity (e.g., months) is usually less challenging especially when the objective is to capture trends. On the other side of the spectrum, a system capable of forecasting a specific event (e.g., failure) impacting the user experience is far more interesting (and challenging).
A number of approaches have been studied since 2019 both for short- and long-term predictions in predictive networks using both statistical and ML models.
For example, ML algorithms such as Recurrent Neural Network (RNN) like Long Short-Term memory (LSTM) may be used for forecasting, using local, per-cluster or even global models (Figure 5 shows an LSTM with Auto-encoder model used for prediction). The LSTM was trained using a large number of features such as various statistical moments (mean, median, percentiles) for KPI such as delay, loss and jitter computed over short and long-term periods of time, per-interface features, temporal features, categorical features related to Service Providers or even traffic-related features.
We have also developed other approaches such as State Transition Learning for forecasting failure events by observing the prominent subsequence of network state trajectories that may lead to event of interest (e.g., failures).
Mid- and Long-term prediction approaches ought to be considered whereby the system models the network to determine where/when actions should be taken to adapt routing policies and configuration changes in the network in light of the observed performance and state of the network (i.e., Internet behavior). Such predictions then allow for making recommendations (e.g., change of configuration or routing policies) that will improve the overall network Service Level Objectives (SLO) and application experience. Mid- and Long-term predictions have proven to be highly beneficial, although less efficient than short term recommendations that deal with short term predictions and remediations. Such systems must consider a series of risk factors including the stability of the network and traffic pattern in order to minimize the risk of predictions that would be quickly outdated. This contrasts with a short-term predictive engine used for “quick fixes” and avoidance of temporary failures that would be enabled with full automation (a topic that will be discussed in a companion document).
4.2 Forecasting accuracy
Forecasting accuracy is a recuring topic. Any forecasting system will make prediction errors. However, such a predictive engine can be designed so as to make trade-offs between True Positives (TP), False Positive (FP) and False Negative (FN). TP occurs when the model predicts a failure, and a failure occurs. FP means that a failure has been predicted and does not occur whereas a FN refers to the opposite situation (a failure happens that has not been predicted). For example, a Machine Learning (ML) classification algorithm may be tuned to deal with the well-known Precision/Recall tension where Precision=TP/(TP+FP) and Recall=TP/(TP+FN). In other words, the algorithm must be tuned to favor either Precision or Recall. Cisco’s predictive engine favors Precision over Recall, a safe approach for highly minimizing the risk of FPs. Said differently, the system has been optimized so as to not always predict all failures but having the highly desirable objectives of having almost no false positives: when a failure is predicted, it almost always happens. What does that mean for recall? As expected, such a system cannot always predict all issues but with existing reactive approaches there is no prediction at all!
Many other dimensions must be considered in such a predictive system. For example, will the proactive action of rerouting traffic impact the traffic already in place along the alternate path?
In a live system, such as the one Cisco has developed, other criteria must be taken into account and time granularity is of the utmost importance both for telemetry gathering and time to react (triggering close loop control) with tight implications on the architecture.
4.3 Are all failures predictable?
From a pure theoretical standpoint, yes, since true randomness is extremely rare in nature (in quantum physics true randomness is well-known). So, events such as failures usually are caused by other events that may theoretically be detectable. The networking day-to-day reality is of course very different. In most cases, events indicative of some upcoming failures may exist, but they are not always monitored. Moreover, the timeframe is not always compatible with the ability to trigger some actions; even a fiber cut may be predicted by monitoring the signal in real-time but a few nanoseconds before the damage leaving not enough time to trigger recovery action, even if a predictive signal exists. In reality, the ability to predict events is driven by the following factors:
· Finding “Signal” in telemetry (when such telemetry exists) with high SNR
· Computing a reliable ML model with sufficient Precision/Recall
• Designing an architecture at scale supporting a Predictive approach (this last aspect should not be overlooked; there is a considerable gap between an experiment in a lab and the scale of the Internet).
Predicting consists in finding (predictive) signals used to build a model and producing a given outcome (e.g., component X will fail within x ms, or probability of failing of component Y is Pb) using classification and/or regression approaches. The ability to predict raises a number of challenges Cisco managed to overcome; thanks to a decade of deep expertise in Machine Learning and analytics platforms.
4.4 Path computation with Predictive Networks in SD-WAN
The concept of Predictive Networks is not limited to a specific “use case” and could be applied to several contexts, use cases and networks using a broad range of technologies. In this paper, we will focus on Predictive Network for SD-WAN as shown in Figure 6 showing the edge path selection in a Predictive Internet. In this classic example, a remote site (called “Edge”) is connected to a Hub via SD-WAN using several tunnels having various properties. It is also quite common for the edge to be connected to the public Internet via an interface sometimes called DIA (Direct Internet Access). In such a situation, the traffic destined to, for example, a SaaS application S can be sent along one of the tunnels from the edge to the Hub (backhauling) or via the Internet using IP routing, or even using a (GRE or IPSec) tunnel to a Security Cloud.
The notion of Predictive Networks for SD-WAN refers to the ability to predict (dark/grey) failures for each of the paths from edge to destination, on a per-application basis.
Can the system predict where in the network the issues take place? The objective of such an engine is not to predict the exact location where the issue may take place in the Internet and then use some potential loose-hop routing or segment routing technique to avoid the predicted failed location. Instead, the engine predicts issues for a given path and allows for the selection of an alternate path end-to-end, making the prediction of the root cause not required.
5 Vision or Reality? Results
As pointed out, such a predictive system has been in production in 100 networks around the world, doing real-time predictions for several months and has been proven to perform predictions highly improving the overall network SLO and application experience. Although the details of the exact architecture, telemetry (and technique for noise reduction), algorithms, training strategies are out of the scope of this document, it is worth providing several examples of the overall benefits that a predictive system can bring.
Although the system continues to evolve and improve using new sources of telemetry, algorithms and tuning of many sorts, it is worth providing some high level overview of the performance one can expect from such predictive technologies.
5.1 Short term predictions
Let us start with a few examples. Figure 7 Error! Reference source not found.shows the overall number of minutes with SLA violation observed in a 30-day period on a network (in Red). Next to it are the number of minutes of SLA violation that would have been avoided (“saved”) using the predictive engine that managed to accurately predict such grey failures (application SLA violation) but also finding an alternate paths free of SLA violation in the same network (without adding any additional capacity).
For the sake of illustration, let us explore examples of such predictions (i.e., SLA violation) that were correctly predicted. The next set of figures show when a failure was predicted (see green dots on the timeline). Various time series show after the predictions the loss, delay and jitter. In ocean blue is the default path programmed on the network, in the dark blue color is the path recommended by the predictive engine, thus validating that the prediction was indeed correct.
In the first example (Figure 8), many minutes of traffic (11,000 minutes of voice traffic) with SLA violation could have been avoided (green) for traffic sent along Business Internet path (ocean blue) by proactively rerouting traffic onto an existing bronze internet path (a priori with less strict SLA) (dark blue).
Figure 9 shows a prediction of packet loss spike (way before they take place) along a short-distance MPLS paths resulting in 82% of SLA violation over a 30-day period.
Figure 10 shows another example of prediction of a sporadic packet loss (17%) in Australia.
The number of such successful predictions is endless and a few examples are provided for illustration. Even though all failures cannot be predicted, any single prediction means that an issue is being proactively avoided, while all other (unpredicted) failures are being handled by the reactive system, making the combination of both approaches a huge step forward for the Internet. A companion paper focusses on the reactive versus predictive modes of operation and their complementarity (see  for details).
Are accuracy and efficiency of predictions consistent across networks and time? No, as expected. First, each network has its peculiarities in terms of topology (and built-in redundancy), traffic profiles, provisioning, and access types to mention a few. A predictive engine may then exhibit different levels of efficiency, as expected. Furthermore, the objective is not just to Predict but also to find some alternate path free of SLA violation. Consequently, the built-in network redundancy (availability of alternate paths) will be a key factor. A very interesting fact that has been observed is that predictions do vary significantly over times as the Internet/Networks evolve (e.g., failures, capacity upgrades, new peering) requiring constant learning and adjustment. Figure 11 shows the number of accurate predictions and ability to find alternate path (with measures such as the number of minutes of traffic saved from SLA failures) and their variation other times, for multiple regions of the world and across multiple networks.
The Y-axis shows the regions of the world and the X-axis the number of saved minutes of traffic with SLA violations varies over time. It can be noticed that the amount of traffic saved varies significantly across networks and over time. On the top left corner, a network where the number of failures accurately predicted and avoided varies from low the high across all regions. On the top right, another network where most of the failures avoided are located in a given region and tends to exhibit some form of periodicity. We can observe yet another pattern on the bottom left network where at some point, most of the avoided failures were in a given region and later in another region. It is also worth noting that predictions are constantly adjusted; thanks to continuous learning. Indeed, the Internet and other SP networks are highly dynamic and experience failures due to several reasons such as new peering agreements, movement of SaaS applications across the Internet and evolution of traffic loads.
As often, there is no one-size-fits-all algorithm capable of predicting (grey) failures. Each algorithm has specific properties related to the type of telemetry used to train the model, the forecasting horizon (itself coupled with the availability of telemetry) and of course the overall efficacy.
For the sake of illustration, Figure 12 shows a performance accuracy metric for multiple paths in the world with different characteristics. The metric for measuring the accuracy is outside of the scope of this document but the point is to show that accuracy varies (in this particular example, a low value is indicative of higher performance).
Figure 13 provides another view of such variability of the predictive accuracy; one can observe the relationship between forecasting accuracy and some properties of the path such as the entropy, level of periodicity and many other path characteristics.
Figure 14 shows the remarkable performances obtained for two path categories (called Periodic simple and Periodic non-linear). For a detailed discussion on path categories, refer to the study on time-series categorization . For the periodic nonlinear category, 67% of the paths had at least 20% of Recall and 90% of Precision (called high precision paths), 91% of disrupted session minutes were correctly predicted (very high recall) and 89% of the application failures were predicted (recall for AF metrics). For this particular algorithm, lower performance was obtained for the recall, still allowing for the avoidance of a number of issues in the network thanks to accurate predictions.
Figure 15 shows the ability of the algorithm to anticipate and predict issues.
5.2 A deeper-dive into some results
Let us now have a deeper look at some of the prediction performance metrics for a given algorithm. A disclaimer: such numbers must not be considered as the definitive truth since they constantly evolve and they are algorithmic-specific, but provided here for the sake of illustration.
Precision and Recall may even be path-specific for a given category. Figure 16 Illustrates the Precision/Recall distribution for a number of paths for a given (LSTM) algorithm.
5.2.1 Long term predictions
In contrast, long-term predictions are made by other types of models to predict the probability of failures in upcoming days, weeks and even months. Because of a different forecasting timeline, such predictions may then be used by a recommendation system to suggest configuration changes in the networks. Note that this is in contrast with short-term predictions that work in-hand with full automation. Long-term prediction systems may take additional constraints into account such as the duration of validity of such predictions. Indeed, in absence of automation, it is inconceivable to make too many recommendations that would require manual intervention (configuration changes) by the network administrator.
The next figures show results obtained using a long-term prediction algorithm on dozens of networks for multiple applications. As always one must define the metrics of interest. In this case, in order to evaluate the impact of failures we use a metric called the MPDU (Mean across all time of the maximum number of users in a day; number of users are sampled every 1h).
Figure 17: Long-term predictive algorithm reducing the max number of impacted users by 10,000 (from 40K to 30K) summarizes the overall benefits for 61 networks showing that the maximimum number if impacted users (measured per hour) was reduced by 25% thanks to long term predictions with a very high accuracy.
Let’s now have a deeper dive into the results:
Figure 18: Various performance metrics of a long-term prediction algorithm shows a number of interesting performance metrics for all applications (Global), Office365 and Voice. The results speak for themselves. The first plot shows the percentage of users at peak that were impacted by failures (i.e., SLA violations), saved (thanks to the correct predictions of the algorithm, users were not impacted), and disrupted (the user experiences were worst on the recommended path than the default path – the percentage of time within SLA was less than 90% of the time). Box plots are used to illustrate the distributions across a number of sites and networks. The second plot illustrates the efficacy (percentage of impacted users that were no longer impacted thanks to the correct prediction). One can note also the remarkable accuracy of the algorithm. Moreover, the average lifetime of those predictions exceeded one month in the majority of the cases.
As illustrated in Figure 19, as for short-term predictions, efficacy varies over time, which once more, shows the need for constant learning so as to adapt to the high dynamicity of the networks. The evolution of efficacy is shown for 10 networks. It can be observed that efficacy is constantly high for some networks whereas it does vary for over networks. Still, it can be shown that the overall performance is significant.
Let us now analyze the performances of the prediction algorithm using additional metrics: quality (percentage of time during which SLA is met) for the default path (paths used by the network according to the SD-WAN policy) and the recommended path (recommended paths by the predictive engine), measuring also the total number of users impacted (with some SLA violation), for all applications.
Figure 21 shows a number of interesting insights. For example, there are peaks (high number of impacted users) with high efficacy where the default quality was relatively good. But, there are also may peaks showing situations where default quality was low and efficacy is high; thus denoting that such predictions help a lot of users when default quality is low.
6 Prediction: the first step towards Self-Healing networks thanks to Automation
Without a doubt Automation has always been the holy grail, allowing networks to make remediation action automatically, at scale, thus improving networks’ SLO. Still several roadblocks have been in the way, for a number of reasons. The ability to accurately predict may very well allow for a progressive use of trusted automation, especially for issues the system is capable of predicting. A number of mechanisms have been studied so as to allow networks to make use of predictions with user feedback thus making self-healing networks a reality. More details will be shared in a further revision of this document.
Predictive Networks path will take place over several years: additional telemetry allowing for true application feedback driven prediction will emerge, new algorithms are being evaluated including real-time prediction with on-premise inference and such technologies will apply to a number of already identified networking use cases.
Without a doubt, adding learning capabilities to the networks such as the Internet will increase the overall network SLO and application experience and may arguably be overdue.
Real data-driven experiments have shown that such an approach will complement the reactive approach that has governed the Internet for the past four decades.
Although there is no one-size-fit-all approach, various statistical and ML driven models have shown the possibility to predict future events and take proactive actions for short-term and long-term predictions with high accuracy, avoiding a high number of failures that would have significantly impacted the user experience.
Coupled with automation, Predictive Networks could lead to well overdue Self-Healing networks and they could be one of the most impactful technologies for the Internet. Many more innovations are in the works.
I would like to express my real gratitude to several key contributors I have been working with for a number of years: Gregory Mermoud, Vinay Kolar and PA Savalle with whom many ML/AI innovations gave birth to novel innovations for networking (Wireless Anomaly Detection with root causing, Self-Learning Networks, detection of spoofing attacks. … ). I would like to acknowledge the work of several highly talented engineers: Mukund Raghuprasad, Michal Garcarz, Eduard Schornig, Romain Kakko-Chiloff and Jurg Diemand to mention a few who had a major contribution to this work. Needless to say, that a close collaboration with a number of customers in the world has allowed for such an innovative work.
 J. P. Vasseur, "From Dark to Grey Failures in the Internet," June 2021.
 "SD-WAN: Application-Aware Routing Deployment Guide," 21 July 2020. [Online]. Available: https://www.cisco.com/c/en/us/td/docs/solutions/CVD/SDWAN/cisco-sdwan-application-aware-routing-deploy-guide.html.
 N. Cardwell, Y. Cheng, S. Gunn, S. H. Yeganeh and V. Jacobson, "BBR Congestion Control," ACM Queue, pp. 20 -- 53, 2016.
 "Cisco AI Network Analytics," [Online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/enterprise-networks/nb-06-ai-nw-analytics-wp-cte-en.html.
 S. Dasgupta, V. Kolar and J. P. Vasseur, "Quantifying Network Path Dynamics using Time-series Features," Jan 2022.
 J. P. Vasseur, V. Kolar and M. Y. Raghuprasad, "An Analysis of Network KPI variability for various SaaS application in the Internet," Oct 2021.
 J. P. Vasseur, G. Mermoud, V. Kolar and E. Schornig, "Reactive versus Predictive Networks," 2022.