Mastering CUBIC Congestion Control: Debugging a Stuck Congestion Window in QUIC

Published 2026-05-20 15:51:29 · Linux & DevOps

Overview

This tutorial dives into a subtle but critical bug in the CUBIC congestion control algorithm when ported from the Linux kernel to a QUIC implementation (quiche). The bug caused the congestion window (cwnd) to become permanently stuck at its minimum value after a congestion collapse, preventing recovery. You'll learn the underlying mechanics, step-by-step reproduction, a simple fix, and common pitfalls to avoid.

Source: blog.cloudflare.com

Prerequisites

  • Basic understanding of TCP/IP and QUIC protocols
  • Familiarity with congestion control concepts (cwnd, slow start, congestion avoidance)
  • Access to a Linux environment for testing (optional but recommended)
  • Knowledge of C or Rust (quiche is written in Rust; the examples here are Rust-like pseudo-code)

Step-by-Step Instructions

1. Understanding CUBIC's Core Logic

CUBIC, standardized in RFC 9438, is the default congestion controller in Linux. It manages the cwnd to probe for available bandwidth: increasing cwnd when no loss is detected (probing), and decreasing it when loss occurs (backoff). The algorithm uses a cubic function (hence the name) to grow cwnd after a loss event, aiming for better network utilization.
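The cubic function mentioned above can be written out directly. The following is a minimal sketch of the window curve as given in RFC 9438, W_cubic(t) = C * (t - K)^3 + W_max with C = 0.4. The function names and the use of f64 packet counts are choices made for this illustration, not taken from quiche or the Linux kernel.

```rust
/// CUBIC constant C from RFC 9438 (windows in packets, time in seconds).
const C: f64 = 0.4;

/// K: the time in seconds the curve needs to climb from `cwnd_epoch`
/// (the window at the start of the epoch) back up to `w_max`,
/// computed as K = cbrt((W_max - cwnd_epoch) / C).
fn cubic_k(w_max: f64, cwnd_epoch: f64) -> f64 {
    ((w_max - cwnd_epoch) / C).cbrt()
}

/// W_cubic(t) = C * (t - K)^3 + W_max, where `t` is the number of
/// seconds since the current congestion-avoidance epoch began.
fn w_cubic(t: f64, w_max: f64, cwnd_epoch: f64) -> f64 {
    let k = cubic_k(w_max, cwnd_epoch);
    C * (t - k).powi(3) + w_max
}
```

With W_max = 10 and cwnd_epoch = 2 packets, K is cbrt(20), roughly 2.7 seconds: the curve starts at 2, flattens as it approaches 10 at t = K, and then accelerates beyond it to probe for more bandwidth.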

2. The Bug: A Stuck Congestion Window

The bug manifests when the connection experiences heavy loss early, driving cwnd to cwnd_min (typically 2 or 4 packets). Normally, CUBIC recovers after a loss event and grows cwnd again. Here, an interaction with the app-limited exclusion (RFC 9438 §4.2-12) pins cwnd at the minimum permanently. The app-limited rule is designed to prevent growth when the application, rather than the network, is limiting the sending rate, but a logic error applies it during recovery as well, and CUBIC never exits the recovery state.
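To see how quickly heavy early loss pins the window at its floor, here is a small sketch of CUBIC's multiplicative decrease, using the RFC 9438 factor of 0.7. It is simplified on purpose: the RFC scales the bytes in flight, whereas this sketch scales cwnd itself, and both function names are hypothetical.

```rust
/// Multiplicative decrease factor from RFC 9438.
const BETA_CUBIC: f64 = 0.7;

/// Apply one congestion event: scale cwnd by BETA_CUBIC,
/// but never let it fall below `cwnd_min`.
fn on_congestion_event(cwnd: f64, cwnd_min: f64) -> f64 {
    (cwnd * BETA_CUBIC).max(cwnd_min)
}

/// Count how many back-to-back congestion events it takes
/// for cwnd to reach `cwnd_min`.
fn losses_until_floor(mut cwnd: f64, cwnd_min: f64) -> u32 {
    let mut events = 0;
    while cwnd > cwnd_min {
        cwnd = on_congestion_event(cwnd, cwnd_min);
        events += 1;
    }
    events
}
```

Starting from 10 packets with a floor of 2, five consecutive events are enough: 10 -> 7 -> 4.9 -> 3.43 -> 2.4 -> 2.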

3. Reproducing the Bug

To reproduce, set up a QUIC connection using quiche with CUBIC as the congestion controller. Simulate heavy packet loss (e.g., 50% loss rate) during the first few round trips. Monitor cwnd over time. With the bug present, cwnd drops to cwnd_min and stays there indefinitely.

// Example in the spirit of quiche's test harness (pseudo-code; names are illustrative)
let mut cc = Cubic::default();
cc.on_loss(initial_packet);  // heavy early loss
assert_eq!(cc.cwnd, cwnd_min);
// Simulate many ACK rounds
for _ in 0..1000 {
    cc.on_ack(now());
}
assert_eq!(cc.cwnd, cwnd_min);  // holds on the buggy build: cwnd never grew

4. Root Cause Analysis

The bug stems from porting a Linux kernel patch that aligned CUBIC with the app-limited exclusion. In the Linux TCP stack, the app-limited check sits inside a larger condition that applies only when the connection is not in recovery, that is, outside the period that follows a loss event. In quiche's port, that guard was omitted, so the app-limited exclusion fired even during recovery and kept CUBIC from ever leaving the minimum cwnd. The exact location is in the cubic_update() function where tcp_friendliness adjustments are made.

5. The Fix: A One-Line Change

The fix adds a condition to skip the app-limited check when the connection is still in the recovery phase. In the quiche source, this is a single line added to cubic.rs:

// Before (buggy):
if app_limited { return; }

// After (fixed):
if app_limited && !self.recovery { return; }

This ensures that during recovery (post-loss), CUBIC continues to grow cwnd even if the application is not fully utilizing the window. Once recovery ends, the original app-limited logic applies.
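The effect of the guard can be shown with a toy model. The sketch below is not quiche's code: ToyCubic, its fields, and the one-packet-per-ACK growth are all hypothetical stand-ins, and it assumes a sender at a 2-packet window looks app-limited on every ACK. It only contrasts the two conditions shown above.

```rust
/// Toy controller with just enough state to show the guard.
/// This does NOT mirror quiche's real types.
struct ToyCubic {
    cwnd: u32,         // congestion window in packets
    in_recovery: bool, // hypothetical stand-in for the recovery flag
}

impl ToyCubic {
    /// Process one ACK. `guard_recovery` selects the fixed condition;
    /// without it, the buggy condition is used.
    fn on_ack(&mut self, app_limited: bool, guard_recovery: bool) {
        let skip_growth = if guard_recovery {
            app_limited && !self.in_recovery // fixed
        } else {
            app_limited // buggy
        };
        if skip_growth {
            return;
        }
        self.cwnd += 1; // placeholder for real cubic growth
    }
}

/// Start at the minimum window, in recovery, and feed `acks` ACKs
/// that all look app-limited (an assumption of this model).
fn cwnd_after_acks(guard_recovery: bool, acks: u32) -> u32 {
    let mut cc = ToyCubic { cwnd: 2, in_recovery: true };
    for _ in 0..acks {
        cc.on_ack(true, guard_recovery);
    }
    cc.cwnd
}
```

In this model, cwnd_after_acks(false, 1000) stays at 2 no matter how many ACKs arrive, while cwnd_after_acks(true, 1000) grows on every ACK. A real controller would also clear the recovery flag at some point, which the toy leaves out.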

6. Verifying the Fix

Re-run the reproduction test. The cwnd should now start increasing again after the loss event, eventually leaving the minimum. Use a debug trace to confirm the sequence:

  • Initial loss -> cwnd drops to 2
  • ACKs arrive, cwnd bypasses app-limited check
  • cwnd grows (e.g., 3, 4, 5...)
  • Eventually leaves recovery, cwnd continues normal cubic growth
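A trace like the one above can also be checked mechanically. The helper below is a hypothetical sketch that assumes the cwnd samples have already been collected into a slice, oldest first. How they are gathered depends on your logging setup and is not shown here.

```rust
/// Returns true if `trace` (cwnd samples in packets, oldest first)
/// reaches `cwnd_min` at some point and its final sample is above
/// `cwnd_min` again.
fn leaves_minimum(trace: &[u32], cwnd_min: u32) -> bool {
    match trace.iter().position(|&w| w <= cwnd_min) {
        // The window never hit the floor, so there is nothing to verify.
        None => false,
        // The window hit the floor: check where the trace ends up.
        Some(first_floor) => trace[first_floor..]
            .last()
            .map_or(false, |&w| w > cwnd_min),
    }
}
```

A fixed build should produce a trace such as [10, 7, 4, 2, 2, 3, 4, 5], for which the helper returns true. A trace that ends in a run of 2s returns false.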

Common Mistakes

  • Assuming TCP and QUIC congestion control are identical: While RFC 9438 defines CUBIC for TCP, QUIC implementations may have subtle differences (e.g., loss detection, app-limited semantics). Always test both protocols.
  • Neglecting edge cases: The bug only appears under heavy early loss. Many tests skip this scenario. Ensure your test suite includes extreme loss patterns.
  • Overlooking the app-limited condition: App-limited is meant to prevent over-probing, but can interact poorly with recovery logic. Always audit all state transitions.
  • Copying kernel code verbatim: The Linux kernel has intricate dependencies. Porting requires understanding of the surrounding context (e.g., the recovery flag).

Summary

This tutorial walked through a real-world bug where CUBIC's congestion window got stuck at minimum due to a misapplied app-limited exclusion in a QUIC implementation. By understanding the core logic, reproducing the issue, and applying a one-line fix, we prevented permanent throughput collapse. Key takeaways: always verify edge-case behavior, avoid blind code porting, and test recovery paths thoroughly.

For further details, refer to the original post on blog.cloudflare.com or explore the quiche source code.