Conference item
Engine-agnostic model hot-swapping for cost-effective LLM inference
- Abstract:
- The widespread adoption of Large Language Models (LLMs) has led to increased demand for large-scale inference services, presenting a unique set of challenges for the HPC community. These services are characterized by moderate-scale models that require dedicating expensive GPUs to handle bursty inference requests, leading to high costs and resource underutilization. In this paper, we propose SwapServeLLM, a novel engine-agnostic hot-swapping method for cost-effective inference. This model hot-swapping approach is enabled by recent driver capabilities for transparent GPU checkpointing. SwapServeLLM optimizes resource utilization by dynamically allocating GPU resources through two key mechanisms: (1) demand-aware preemption that leverages information about concurrent requests, and (2) efficient request routing with memory reservation that minimizes inference latency. Our evaluation demonstrates that SwapServeLLM accelerates model loading for state-of-the-art inference engines by 31× compared to vLLM and by up to 29% compared to Ollama, enabling cost-effective inference.
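- The core mechanism named in the abstract, suspending an idle inference engine and resuming another via driver-level GPU checkpointing, can be pictured with a minimal sketch. The sketch below assumes NVIDIA's cuda-checkpoint utility is on PATH and that each model is served by its own CUDA process; the HotSwapper class and its PID bookkeeping are hypothetical illustrations, not the paper's actual implementation.

```python
import subprocess


def toggle_cuda_state(pid: int) -> None:
    """Toggle a process's CUDA state between running and suspended using
    NVIDIA's cuda-checkpoint utility (requires a recent NVIDIA driver)."""
    subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(pid)], check=True)


class HotSwapper:
    """Hypothetical engine-agnostic scheduler: keeps one model resident on
    the GPU and suspends the rest, regardless of which inference engine
    (e.g. vLLM, Ollama) backs each process."""

    def __init__(self, server_pids: dict[str, int], active: str) -> None:
        self.server_pids = server_pids  # model name -> inference-server PID
        self.active = active            # model currently resident on the GPU

    def swap_to(self, model: str) -> None:
        if model == self.active:
            return  # requested model is already live; nothing to do
        # Suspend the active engine: its GPU memory is checkpointed to host RAM.
        toggle_cuda_state(self.server_pids[self.active])
        # Resume the requested engine: its checkpoint is restored to the GPU.
        toggle_cuda_state(self.server_pids[model])
        self.active = model
```

- In such a design, the paper's demand-aware preemption and memory-reserving request routing would sit above a primitive like swap_to, deciding when a swap is worth triggering based on the queue of concurrent requests.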
- Publication status:
- Accepted
- Peer review status:
- Peer reviewed
- Publisher:
- Association for Computing Machinery
- Acceptance date:
- 2025-09-05
- Event title:
- 7th International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC
- Event location:
- St. Louis, Missouri, USA
- Event website:
- https://sc25.supercomputing.org/
- Event start date:
- 2025-11-16
- Event end date:
- 2025-11-21
- Language:
- English
- Keywords:
- Pubs id:
- 2292837
- Local pid:
- pubs:2292837
- Deposit date:
- 2025-09-25
- Notes:
- This conference paper has been accepted for presentation at the 2025 International Conference for High Performance Computing, Networking, Storage, and Analysis.