Conference item
Engine-agnostic model hot-swapping for cost-effective LLM inference
- Abstract:
- The widespread adoption of Large Language Models (LLMs) has led to an increased demand for large-scale inference services, presenting a unique set of challenges for the HPC community. These services are characterized by moderate-scale models that require dedicating expensive GPUs to handle bursty inference requests, leading to high costs and resource underutilization. In this paper, we propose SwapServeLLM, a novel engine-agnostic hot-swapping method for cost-effective inference. This model hot-swapping approach is enabled by recent driver capabilities for transparent GPU checkpointing. SwapServeLLM optimizes resource utilization by dynamically allocating GPU resources with two key mechanisms: (1) demand-aware preemption that leverages information about concurrent requests, and (2) efficient request routing with memory reservation that minimizes inference latency. Our evaluation demonstrates that SwapServeLLM optimizes model loading for state-of-the-art inference engines by 31× compared to vLLM and up to 29% compared to Ollama, enabling cost-effective inference.
- Publication status:
- Published
- Peer review status:
- Peer reviewed
- Files:
- (Preview, Version of record, pdf, 1.0MB, Terms of use)
- Publisher copy:
- 10.1145/3731599.3767354
Authors
- Publisher:
- Association for Computing Machinery
- Host title:
- SC Workshops '25: Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis
- Pages:
- 114-125
- Publication date:
- 2025-11-15
- Acceptance date:
- 2025-09-05
- Event title:
- 7th International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC
- Event location:
- St. Louis, Missouri, USA
- Event website:
- https://sc25.supercomputing.org/
- Event start date:
- 2025-11-16
- Event end date:
- 2025-11-21
- DOI:
- ISBN:
- 9798400718717
- Language:
- English
- Keywords:
- Pubs id:
- 2292837
- Local pid:
- pubs:2292837
- Deposit date:
- 2025-09-25
Terms of use
- Copyright holder:
- Stoyanov et al.
- Copyright date:
- 2025
- Rights statement:
- Copyright © 2025 Copyright held by the owner/author(s). This work is licensed under a Creative Commons Attribution 4.0 International License.
- Licence:
- CC Attribution (CC BY)