Introduction to Network Congestion
Network congestion can appear out of nowhere, especially in data centers. A sudden burst of traffic from distributed systems, microservices, or AI training jobs can overwhelm switch buffers in seconds. The challenge is not just detecting congestion after the fact, but seeing it coming. Telemetry systems are widely used to monitor network health, but most operate in a reactive mode, flagging congestion only after performance has degraded.
The Limitations of Traditional Telemetry
Traditional telemetry systems share a fundamental limitation: by the time a link is saturated or a queue is full, the window for early diagnosis has closed, and tracing the original cause becomes significantly harder. In-band Network Telemetry (INT) addresses this gap by tagging live packets with metadata as they travel through the network. It gives a real-time view of how traffic flows: where queues are building up, where latency is creeping in, and how each switch is handling forwarding. However, enabling INT on every packet introduces serious overhead and pushes a flood of telemetry data to the control plane, much of which may never be needed.
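To make the per-packet metadata concrete, here is a minimal sketch of the kind of record each INT-capable switch appends to a transit packet. The field names are illustrative (the INT specification defines similar fields: switch ID, ingress/egress timestamps, queue occupancy), and the numbers are made up for the example.

```python
from dataclasses import dataclass

@dataclass
class IntHopMetadata:
    """One per-hop telemetry record, appended by each switch on the path."""
    switch_id: int
    ingress_ts_ns: int   # when the packet entered the switch
    egress_ts_ns: int    # when the packet left the switch
    queue_depth: int     # egress queue occupancy at forwarding time

    @property
    def hop_latency_ns(self) -> int:
        # Per-hop latency is the time the packet spent inside the switch.
        return self.egress_ts_ns - self.ingress_ts_ns

# A packet's telemetry payload grows into a list of records, ordered along
# its path; a deep queue and a large hop latency reveal where congestion is.
path = [
    IntHopMetadata(switch_id=1, ingress_ts_ns=1_000, egress_ts_ns=1_400, queue_depth=12),
    IntHopMetadata(switch_id=4, ingress_ts_ns=2_000, egress_ts_ns=9_000, queue_depth=480),
]
total_switch_latency_ns = sum(h.hop_latency_ns for h in path)
```

This is also where the overhead comes from: every traversed switch adds another record to every tagged packet, which is why tagging all traffic is so costly.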
A New Approach: Predictive Telemetry
What if we could be more selective? Instead of tracking everything, we could forecast where trouble is likely to form and enable INT just for those regions, and just for a short time. This way, we get detailed visibility when it matters most without paying the full cost of always-on monitoring.
The Problem with Always-On Telemetry
INT gives a powerful, detailed view of what’s happening inside the network. However, there’s a cost: this telemetry data adds weight to every packet, and if applied to all traffic, it can eat up significant bandwidth and processing capacity. To get around that, many systems take shortcuts such as sampling or event-triggered telemetry. These techniques help control overhead but miss the critical early moments of a traffic surge.
Introducing a Predictive Approach
Instead of reacting to symptoms, a predictive system can forecast congestion before it happens and activate detailed telemetry proactively. The idea is simple: if we can anticipate when and where traffic is going to spike, we can selectively enable INT just for that hotspot and only for the right window of time. This keeps overhead low but gives deep visibility when it actually matters.
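The core decision can be sketched in a few lines. Assume the forecasting engine emits a predicted utilization (0.0 to 1.0) per link; the names, the 0.8 threshold, and the 60-second window below are illustrative choices, not values from the described system.

```python
ACTIVATION_THRESHOLD = 0.8  # predicted utilization that counts as a hotspot
INT_WINDOW_S = 60           # how long INT stays enabled once triggered

def links_to_monitor(forecast: dict) -> dict:
    """Map each link whose predicted utilization crosses the threshold to
    the number of seconds INT should be enabled on it."""
    return {
        link: INT_WINDOW_S
        for link, util in forecast.items()
        if util >= ACTIVATION_THRESHOLD
    }

# Hypothetical forecast for three leaf-spine links:
forecast = {"leaf1-spine1": 0.92, "leaf1-spine2": 0.35, "leaf2-spine1": 0.81}
hotspots = links_to_monitor(forecast)
```

Only the two predicted hotspots get high-fidelity telemetry, and only for a bounded window; everything else keeps running with zero INT overhead.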
System Design
The system consists of four components: a data collector, a forecasting engine, a telemetry controller, and a programmable data plane. The data collector gathers real-time traffic statistics from the switches. The forecasting engine uses a Long Short-Term Memory (LSTM) model to predict when and where congestion is likely to occur. The telemetry controller listens to these forecasts and decides when to enable INT. The programmable data plane is the switch itself, which can adjust packet-handling behavior on the fly.
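The four components above can be wired together in one short control step. This is a structural sketch only: all class and method names are hypothetical, and the forecaster is a naive stand-in for the LSTM. The point is the flow: collect, forecast, decide, program the data plane.

```python
class DataCollector:
    def poll(self) -> list:
        # In a real deployment: read port counters / queue stats from switches.
        return [0.4, 0.5, 0.7, 0.9]  # canned utilization history for the demo

class ForecastingEngine:
    def predict(self, history: list) -> float:
        # Stand-in for the trained LSTM: a naive last-value forecast.
        return history[-1]

class DataPlane:
    def __init__(self):
        self.int_enabled = False
    def set_int(self, enabled: bool) -> None:
        # In a real deployment: push a rule update to the P4 switch.
        self.int_enabled = enabled

class TelemetryController:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
    def step(self, collector, forecaster, data_plane) -> None:
        history = collector.poll()
        predicted = forecaster.predict(history)
        data_plane.set_int(predicted >= self.threshold)

plane = DataPlane()
TelemetryController().step(DataCollector(), ForecastingEngine(), plane)
```

With the canned history ending at 0.9, the predicted load crosses the threshold and the controller turns INT on for that switch.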
Experimental Setup
The system was built using a simulation of a leaf-spine network, a P4 programmable software switch, real-time traffic statistics, and an LSTM forecasting model. The LSTM was trained on synthetic traffic traces and runs in a loop, making predictions every 30 seconds.
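The prediction loop described above can be sketched as a sliding window of recent samples with a forecast every cycle. The 30-second interval matches the text; the window size is an assumption, and the mean-based `predict_fn` below is a stand-in for the trained LSTM.

```python
import time
from collections import deque

INTERVAL_S = 30  # forecast cadence from the text
WINDOW = 20      # number of past samples the model sees (illustrative)

def run_forecast_loop(sample_fn, predict_fn, act_fn, iterations, sleep=time.sleep):
    """Collect one sample per cycle; once the window fills, hand a
    prediction to the controller callback, then wait for the next cycle."""
    history = deque(maxlen=WINDOW)
    for _ in range(iterations):
        history.append(sample_fn())              # data collector
        if len(history) == WINDOW:
            act_fn(predict_fn(list(history)))    # forecast -> controller
        sleep(INTERVAL_S)

# Demo with canned samples and no real waiting:
samples = iter(range(100))
predictions = []
run_forecast_loop(
    sample_fn=lambda: next(samples),
    predict_fn=lambda h: sum(h) / len(h),  # stand-in for the LSTM
    act_fn=predictions.append,
    iterations=25,
    sleep=lambda _: None,                  # skip the 30 s wait in this demo
)
```

The `deque(maxlen=WINDOW)` keeps the loop's memory bounded: each new sample evicts the oldest one, so the model always sees the most recent window.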
Why LSTM?
LSTM was chosen because network traffic tends to have structure, with patterns tied to time of day, background load, or batch processing jobs. LSTMs are particularly good at picking up on these temporal relationships, making them suitable for network traffic forecasting.
Evaluation
The system was evaluated based on its ability to catch trouble early and its monitoring efficiency. The predictive approach gives operators a clearer picture of what led to the issue, not just the symptoms once they appear. By selectively enabling high-fidelity telemetry for short bursts, the system keeps overhead low without compromising visibility.
Conceptual Comparison of Telemetry Strategies
The approach compares favorably to traditional telemetry strategies, delivering deeper visibility than sampling or reactive systems but at a fraction of the cost of always-on telemetry.
Conclusion
By combining machine learning and programmable switches, the system predicts congestion before it happens and activates detailed telemetry in just the right place at just the right time. This level of observability could become a baseline expectation for future network monitoring: as telemetry grows more important in AI-scale data centers and low-latency services, intelligent, predictive monitoring will become essential.