Conference item

Opportunistic resource reclamation in Kubernetes: from aggressive resizing to flash jobs

Abstract:
Modern cloud data centers suffer from chronic resource underutilization. The gap between static resource allocations and dynamic workload demand creates systemic inefficiency that current orchestration platforms fail to address adequately. In this work, we explore resource reclamation strategies in production Kubernetes clusters using two emerging infrastructure-level primitives: in-place resource resizing and transparent checkpoint/restore (C/R). For CPU resources, we analyze a production workload trace, which we release publicly, and reveal significant gaps between allocation and utilization. Through trace-driven simulation, we demonstrate that aggressive in-place resizing substantially increases resource utilization but also increases workload evictions. We find a balanced strategy for in-place resizing and identify C/R as the missing primitive that makes aggressive resizing safe: it enables graceful termination and resumable migration, so that evicted workloads do not lose their progress. For GPU resources, where dynamic resizing is infeasible, we propose a C/R-enabled sharing strategy that allocates reserved-but-idle GPU memory to secondary workloads (flash jobs) with safety guarantees for reclamation. Our work demonstrates how the same infrastructure primitives address resource reclamation across different resource types, each with distinct technical constraints, and validates this through deployments on real production clusters.
Publication status:
Accepted
Peer review status:
Peer reviewed

Authors

Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Oxford college:
Somerville College
Role:
Author
ORCID:
0000-0001-9688-2615


Host title:
Job Scheduling Strategies for Parallel Processing 2026
Acceptance date:
2026-03-17
Event title:
29th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2026)
Event location:
New Orleans, USA
Event website:
https://jsspp.org/
Event start date:
2026-05-25
Event end date:
2026-05-29


Language:
English
Pubs id:
2415573
Local pid:
pubs:2415573
Deposit date:
2026-05-06

