
7 Game-Changing Scheduling Upgrades in Kubernetes v1.36

Published 2026-05-14 05:58:21 · Reviews & Comparisons

Introduction

Kubernetes v1.36 builds upon the groundwork laid in v1.35 to deliver a major leap forward in workload-aware scheduling, particularly for AI/ML and batch workloads. This release addresses the unique challenges of scheduling complex, multi-Pod groups by introducing a clean separation of API concerns, enhanced scheduler logic, and new capabilities like topology awareness and preemption. Here are seven critical enhancements you need to know.


1. Clean Separation of Workload and PodGroup APIs

The most architecturally significant change in v1.36 is the decoupling of the Workload API from the PodGroup API. In v1.35, both the static template and the runtime state were bundled together in a single resource. Now, the Workload API serves purely as a static template—defining how Pod groups should be created—while the new PodGroup API handles all runtime scheduling state and conditions. This separation streamlines the scheduler's logic: the kube-scheduler can read the PodGroup object directly without parsing the Workload, improving performance and scalability. It also enables per-replica sharding of status updates, allowing large batch jobs to scale more efficiently.
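As a rough mental model of that split, here is an illustrative sketch—the type and field names below are hypothetical stand-ins, not the actual v1.36 API—showing a frozen static template on one side and mutable runtime state on the other:

```python
# Illustrative sketch of the Workload/PodGroup split (hypothetical
# names, not real Kubernetes types): the template is immutable, and
# runtime scheduling state lives in separately created objects.
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkloadTemplate:
    """Static side: how Pod groups should be created."""
    name: str
    replicas: int       # how many PodGroups to stamp out
    min_count: int      # gang-scheduling threshold per group

@dataclass
class PodGroup:
    """Runtime side: per-group scheduling state and conditions."""
    workload: str
    phase: str = "Pending"
    scheduled_pods: int = 0

def instantiate(w: WorkloadTemplate) -> list[PodGroup]:
    """Create one runtime PodGroup per replica from the static template."""
    return [PodGroup(workload=f"{w.name}-{i}") for i in range(w.replicas)]
```

Because the template is frozen and the runtime state is split into one object per replica, status writers never contend on the template itself—the property the release notes describe as per-replica sharding.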

2. Introduction of the PodGroup Scheduling Cycle

To support the new API decoupling, v1.36 introduces a dedicated PodGroup scheduling cycle in the kube-scheduler. This cycle enables atomic processing of an entire PodGroup, meaning the scheduler evaluates whether all required Pods can be placed simultaneously before committing any single Pod to a node. This atomicity is crucial for gang scheduling scenarios—common in AI training and high-performance computing—where a job can only start when a minimum number of Pods (the minCount) are ready. The new cycle also lays the foundation for future enhancements like advanced orchestration and resource reservation.

3. Topology-Aware Scheduling – First Iteration

Kubernetes v1.36 debuts the first iteration of topology-aware scheduling for workload groups. This feature allows the scheduler to consider physical network topology—such as rack or zone proximity—when placing Pods belonging to the same PodGroup. For AI/ML workloads that require low-latency communication (e.g., collective operations in distributed training), co-locating Pods on the same topology domain reduces network hops and improves performance. The initial release supports configurable topology keys and will be extended in future versions to include automatic topology discovery and adaptive placement.
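A toy version of the co-location idea looks like this—domain names and capacities are invented for the example, and the real scheduler weighs many more signals—but it captures the core scoring rule: prefer the topology domain that can hold the most Pods of the group.

```python
# Illustrative sketch: score candidate topology domains (zones, racks,
# etc., as selected by a configurable topology key) by how many of a
# PodGroup's Pods they can co-locate, preferring a domain that fits all.

def pick_domain(pod_count, domain_capacity):
    """domain_capacity: {domain_name: free Pod slots}.
    Returns the domain co-locating the most Pods (name order breaks ties)."""
    return max(sorted(domain_capacity),
               key=lambda d: min(domain_capacity[d], pod_count))
```

A domain with spare room for the whole group always wins over one that would split it, which is the property that keeps collective-communication traffic off the inter-rack links.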

4. Workload-Aware Preemption

Preemption in Kubernetes has traditionally been generic, but v1.36 introduces workload-aware preemption tailored for PodGroups. When a high-priority workload needs resources but faces contention, the scheduler can now preempt lower-priority Pods that belong to the same batch workflow or workload type more intelligently. The preemption logic respects gang scheduling constraints: it avoids partially preempting a PodGroup if it would leave the remaining Pods unschedulable. This results in more predictable resource availability and reduces wasted compute cycles for batch and AI jobs.
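The whole-group victim selection described above can be illustrated with a simplified sketch (priorities and CPU numbers here are invented; real preemption also weighs PodDisruptionBudgets and more): victims are chosen group-by-group, so a gang is either evicted entirely or left intact, never hollowed out below its minimum.

```python
# Illustrative sketch of gang-aware preemption: evict whole PodGroups
# (lowest priority first) rather than individual Pods, so no surviving
# group is left below its gang threshold and thus unschedulable.

def pick_victims(needed_cpu, groups):
    """groups: list of {"name", "priority", "pods": [cpu per pod]}.
    Returns names of whole groups to evict, or None if preemption
    cannot free enough CPU."""
    victims, freed = [], 0
    for g in sorted(groups, key=lambda g: g["priority"]):
        if freed >= needed_cpu:
            break
        victims.append(g["name"])       # all of this group's Pods go
        freed += sum(g["pods"])
    return victims if freed >= needed_cpu else None
```

Evicting at group granularity wastes slightly more capacity per preemption than cherry-picking Pods, but it avoids the worse outcome the article describes: a half-preempted gang that can neither run nor release its remaining resources.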

5. Dynamic Resource Allocation via ResourceClaim Support

With v1.36, PodGroups can now leverage Dynamic Resource Allocation (DRA) through native ResourceClaim support. This means batch workloads can request specialized hardware (such as GPUs, FPGAs, or TPU slices) at the group level, and the scheduler handles allocation across all Pods in the group. The integration ensures that every Pod in the PodGroup receives a consistent resource claim, eliminating the risk of partial assignment. This is a major step toward full support for complex AI training pipelines that require exclusive access to accelerators.
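The all-or-nothing claim semantics can be sketched in a few lines. Device names below are placeholders, and real DRA goes through ResourceClaim objects and resource drivers rather than a plain list, but the invariant is the same: either every Pod in the group gets a device, or nothing is allocated.

```python
# Illustrative sketch of group-level device allocation with
# all-or-nothing semantics: each Pod needs one device, and partial
# assignment is forbidden.

def allocate_group(pod_names, free_devices):
    """Return {pod: device} only if every Pod can be satisfied;
    otherwise return None and leave free_devices untouched."""
    if len(free_devices) < len(pod_names):
        return None                         # would be a partial assignment
    claims = dict(zip(pod_names, free_devices))
    del free_devices[:len(pod_names)]       # commit: consume the devices
    return claims
```

The capacity check happens before any mutation, so a rejected group observes exactly the free-device pool it started with—the same "check, then commit atomically" shape as the gang-admission cycle.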

6. First Phase of Job Controller Integration

To demonstrate real-world readiness, v1.36 delivers the first phase of integration between the existing Job controller and the new PodGroup API. Job objects can now create PodGroup instances automatically, making it seamless for users to adopt workload-aware scheduling without rewriting job definitions. This integration handles template creation, PodGroup lifecycle management, and status reconciliation. It paves the way for a future where any controller can leverage the PodGroup API natively, reducing manual configuration and enabling more sophisticated job orchestration.

7. Performance and Scalability Improvements

The decoupling of Workload and PodGroup APIs directly improves performance. Because the PodGroup API supports per-replica sharding of status updates, large batch workloads with thousands of Pods no longer bottleneck on a single status object. Additionally, the scheduler’s streamlined dependency on PodGroup (rather than Workload) reduces watch operations and API server load. Early benchmarks show measurable improvement in scheduling throughput for gang‑scheduled jobs, especially in clusters with high concurrency. These changes ensure that Kubernetes v1.36 scales better for modern AI and batch computing demands.

Conclusion

Kubernetes v1.36 marks a pivotal evolution in workload-aware scheduling, directly addressing the unique demands of AI/ML and batch workloads. From cleaner API boundaries and atomic scheduling cycles to topology awareness and gang-aware preemption, each enhancement contributes to a more efficient, scalable, and intelligent scheduler. As these features mature, they will form the backbone of next-generation orchestration for data-intensive and parallel computing. Keep an eye on upcoming releases for even deeper integration and automation, and revisit the API-separation and scheduling-cycle sections above for the architectural details.