JP Vasseur | Maximum Tolerable BER

You are a network expert. Could you state the maximum tolerable bit error rate (BER)? Sounds like a simple question. It is not.

The reason is not that BER is obscure. The reason is that BER is only the first symptom. The real impact depends on what the system does next: local correction, retransmission, backpressure, transport adaptation, workload delay, or application-visible degradation. This is a compact version of a broader argument made in Time to Revisit the Internet Layering Principle: AI-Driven Cross-Layer Optimization. Layering was essential. It gave the Internet modularity, interoperability, and scale. But strict layer isolation also creates blind spots. A physical or link-layer counter may look acceptable while the application has already paid the price.

So the first honest answer to the BER question is: tolerable for what? Is the corrupted bit corrected locally before anyone notices? Is the frame dropped and retransmitted? Is the event interpreted by TCP or SCTP as congestion? Is it hidden inside the tail latency of a storage write, an inference request, a distributed training step, or a video session? The number alone does not answer any of those questions.
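Even the very first step, from bit errors to corrupted frames, already depends on context. The sketch below is a minimal illustration, assuming independent bit errors and no forward error correction (both simplifications; real links often show bursty errors and apply FEC before frames are checked). The function names `frame_error_probability` and `expected_transmissions` are hypothetical and introduced only for this example.

```python
import math


def frame_error_probability(ber: float, frame_bytes: int) -> float:
    """Probability that a frame contains at least one bit error.

    Assumption: bit errors are independent and no FEC is applied.
    """
    if not 0.0 <= ber < 1.0:
        raise ValueError("ber must be in [0, 1)")
    bits = 8 * frame_bytes
    # 1 - (1 - ber)**bits, written with log1p/expm1 to stay accurate for tiny BER.
    return -math.expm1(bits * math.log1p(-ber))


def expected_transmissions(ber: float, frame_bytes: int) -> float:
    """Mean number of sends per delivered frame if every corrupted frame
    is retransmitted and each attempt fails independently."""
    p = frame_error_probability(ber, frame_bytes)
    return 1.0 / (1.0 - p)


if __name__ == "__main__":
    # Same BER, three frame sizes: the per-frame impact differs by orders of magnitude.
    for frame_bytes in (64, 1500, 9000):
        p = frame_error_probability(1e-9, frame_bytes)
        print(f"{frame_bytes:>5} B frame: P(corrupt) = {p:.3e}")
```

The same BER therefore maps to a different corruption probability for every frame size, before any recovery mechanism has reacted at all.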

Recovery is not free. It consumes bandwidth and time. It can create queues, jitter, delay bursts, duplicate work, and rate reduction. It can also hide the original failure mode from the upper layers. Each layer may behave exactly as specified, while the whole system moves away from the real objective.

RDMA makes this especially visible. InfiniBand stacks two mechanisms here. Reliable Connected transport handles packet sequencing, ACK/NAK behavior, and retries. Below it, link-layer credit-based flow control operates hop by hop on each virtual lane, so a sender does not overrun the next-hop receiver buffers. Together they can absorb loss or push back before queues spill. With RoCEv2, RDMA runs over Ethernet/IP and is often engineered as lossless or near-lossless using Priority Flow Control, ECN marking, and DCQCN or an equivalent congestion-control algorithm.

These mechanisms are useful and often very effective. The problem is that they are not coordinated around a single application objective. In a well-tuned RoCEv2 fabric, queue buildup should first trigger ECN marking; congestion feedback then reaches the sender, and DCQCN or an equivalent algorithm reduces the NIC injection rate. If the burst is too sharp, or if the feedback loop cannot drain the queue fast enough, PFC may then pause the affected priority class to avoid packet loss. From the RDMA application viewpoint, the visible symptom may simply be slower completion-queue progress, longer collective synchronization, or reduced useful throughput.
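The ordering described above (ECN marking first, PFC only when the feedback loop is too slow) can be illustrated with a toy single-queue model. This is a sketch under stated assumptions, not DCQCN: the multiplicative cut, the additive recovery step, the thresholds, and the name `simulate_queue` are all hypothetical, and feedback is assumed to reach the sender within one tick.

```python
from typing import List, NamedTuple


class Tick(NamedTuple):
    queue: float     # queue depth after this tick
    marked: bool     # ECN mark fed back to the sender
    paused: bool     # PFC pause asserted for the next tick
    injected: float  # traffic the sender actually injected this tick


def simulate_queue(offered: List[float], drain: float,
                   ecn_threshold: float, pfc_threshold: float,
                   rate_cut: float = 0.5, recovery: float = 0.1) -> List[Tick]:
    """Toy model: one sender, one queue, one priority class. Nothing is dropped."""
    queue, rate_factor, paused = 0.0, 1.0, False
    history = []
    for load in offered:
        # A paused priority injects nothing; otherwise the sender applies its current rate.
        injected = 0.0 if paused else load * rate_factor
        queue = max(0.0, queue + injected - drain)
        marked = queue > ecn_threshold
        if marked:
            rate_factor *= rate_cut                          # hypothetical multiplicative decrease
        else:
            rate_factor = min(1.0, rate_factor + recovery)   # hypothetical additive recovery
        paused = queue > pfc_threshold                       # last resort to avoid loss
        history.append(Tick(queue, marked, paused, injected))
    return history


if __name__ == "__main__":
    scenarios = {
        "mild overload": [12.0] * 20,
        "sharp burst": [100.0] + [12.0] * 5,
    }
    for name, offered in scenarios.items():
        run = simulate_queue(offered, drain=10.0, ecn_threshold=5.0, pfc_threshold=50.0)
        carried = sum(t.injected for t in run)
        print(f"{name}: offered={sum(offered):.0f} carried={carried:.1f} "
              f"marks={sum(t.marked for t in run)} pauses={sum(t.paused for t in run)}")
```

In the mild case ECN alone keeps the queue shallow. In the burst case the pause engages and the sender injects nothing for several ticks: no packet is lost, yet the carried traffic falls well below what was offered.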

The fabric can therefore remain "lossless" while the workload slows down. That is the uncomfortable point: lossless is not the same as application-efficient. PFC can avoid drops, but it can also propagate backpressure, create head-of-line blocking among flows that share the paused priority, and in pathological cases contribute to pause storms or deadlock-like behavior.

Flow control tells the same story. In InfiniBand, credit-based flow control prevents a sender from overrunning receiver buffers on a virtual lane. In Ethernet fabrics, IEEE 802.3x PAUSE or IEEE 802.1Qbb Priority Flow Control can stop transmission at link or priority level. At IP and transport layers, ECN, DCTCP, CUBIC, BBR, RoCE congestion control, and application-level admission or batching can all react to pressure in different ways.
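Credit-based flow control is the simplest of these mechanisms to sketch. The toy model below assumes one credit per packet and an instantaneous credit return, which is a deliberate simplification of how InfiniBand actually accounts for and advertises credits; the class `CreditLink` and its methods are hypothetical names used only for illustration.

```python
from collections import deque
from typing import Any, Optional


class CreditLink:
    """Toy hop-by-hop link for one virtual lane: the sender may transmit
    only while it holds credits for free receiver buffers."""

    def __init__(self, receiver_buffers: int) -> None:
        self.credits = receiver_buffers   # one credit per free receiver buffer
        self.receiver_queue: deque = deque()

    def try_send(self, packet: Any) -> bool:
        """Send if a credit is available; otherwise the sender stalls."""
        if self.credits == 0:
            return False                  # backpressure: nothing is dropped
        self.credits -= 1
        self.receiver_queue.append(packet)
        return True

    def receiver_consume(self) -> Optional[Any]:
        """Receiver frees one buffer, which returns one credit to the sender."""
        if not self.receiver_queue:
            return None
        packet = self.receiver_queue.popleft()
        self.credits += 1
        return packet


if __name__ == "__main__":
    link = CreditLink(receiver_buffers=2)
    for name in ("a", "b", "c"):
        print(name, "sent" if link.try_send(name) else "stalled")
    link.receiver_consume()
    print("c", "sent" if link.try_send("c") else "stalled")
```

The receiver is never overrun, but the only thing the sender observes is that it waits. Why the receiver is slow stays invisible at this layer.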

The cascade can be perfectly efficient locally and still wrong globally. A receiver pushes back. A priority is paused. Queue occupancy changes. ECN marks increase. A sender slows down. A collective operation waits. An application scheduler may interpret the delay as load or failure and change concurrency, placement, or retry behavior. No single layer sees the whole chain.

This is why no universal maximum tolerable BER exists. A threshold depends on the recovery stack, traffic mix, burstiness, impairment duration, congestion-control behavior, workload sensitivity, and cost of corrective action. A BER value that is harmless for one application may be catastrophic for another. Even for the same application, the answer can change with load, topology, redundancy, and time scale.
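The dependence on load and time scale can be made concrete with simple arithmetic: the rate of bit errors is the BER multiplied by the number of bits actually carried per second. The sketch below assumes independent errors at a constant BER; the function names and the `utilization` and `links` parameters are hypothetical and introduced only for this example.

```python
def mean_seconds_between_bit_errors(ber: float, line_rate_bps: float,
                                    utilization: float = 1.0) -> float:
    """Average time between bit errors hitting carried traffic on one link.

    Assumption: errors are independent and the BER is constant over time.
    """
    errors_per_second = ber * line_rate_bps * utilization
    return float("inf") if errors_per_second == 0 else 1.0 / errors_per_second


def expected_errors_per_hour(ber: float, line_rate_bps: float, links: int,
                             utilization: float = 1.0) -> float:
    """Expected bit errors per hour across a set of identical links."""
    return ber * line_rate_bps * utilization * 3600.0 * links


if __name__ == "__main__":
    ber = 1e-12
    for label, rate in (("10 Gb/s", 10e9), ("400 Gb/s", 400e9)):
        gap = mean_seconds_between_bit_errors(ber, rate)
        print(f"{label}: one bit error every {gap:.1f} s on average at full load")
```

With the same BER of 1e-12, a fully loaded 10 Gb/s link sees an error roughly every 100 seconds, while a 400 Gb/s link sees one roughly every 2.5 seconds. Whether either figure matters depends on what each error triggers further up the stack.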

The same issue appeared in the work on Cognitive Networks: AI-Driven Quality of Experience: the relevant target is not a static network counter, but the learned relationship between cross-layer telemetry and user or application experience. In an autonomous data center, the target may be job completion time, tail latency, fabric stability, storage throughput, inference SLOs, or training efficiency. None of those can be inferred from BER alone.

That is where AI can be useful, if applied to the right problem. Not as a label on an old threshold-based control loop, and not as a black box replacing protocols. The useful role is to learn the relationships across physical symptoms, link-layer recovery, congestion signals, transport behavior, workload state, and application outcome. Then the system can decide whether to correct, reroute, rate-limit, reschedule, notify, or do nothing.

So what is the maximum tolerable bit error rate? There is no universal answer. The real question is which cross-layer state predicts application harm, and which action improves the global objective.
