Conference item
Engine-agnostic model hot-swapping for cost-effective LLM inference
- Abstract:
- The widespread adoption of Large Language Models (LLMs) has led to increased demand for large-scale inference services, presenting a unique set of challenges for the HPC community. These services are characterized by moderate-scale models that require dedicating expensive GPUs to handle bursty inference requests, leading to high costs and resource underutilization. In this paper, we propose SwapServeLLM, a novel engine-agnostic hot-swapping method for cost-effective inference. This model hot-swapping approach is enabled by recent driver capabilities for transparent GPU checkpointing. SwapServeLLM optimizes resource utilization by dynamically allocating GPU resources through two key mechanisms: (1) demand-aware preemption that leverages information about concurrent requests, and (2) efficient request routing with memory reservation that minimizes inference latency. Our evaluation demonstrates that SwapServeLLM accelerates model loading for state-of-the-art inference engines by 31× compared to vLLM and by up to 29% compared to Ollama, enabling cost-effective inference.
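- The core mechanism named in the abstract, suspending an idle inference engine and resuming another via driver-level GPU checkpointing, can be pictured with a minimal sketch. The sketch below assumes NVIDIA's cuda-checkpoint utility is on PATH and that each model is served by its own CUDA process; the HotSwapper class and its PID bookkeeping are hypothetical illustrations, not the paper's actual implementation.

```python
import subprocess


def toggle_cuda_state(pid: int) -> None:
    """Toggle a process's CUDA state between running and suspended using
    NVIDIA's cuda-checkpoint utility (requires a recent NVIDIA driver)."""
    subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(pid)], check=True)


class HotSwapper:
    """Hypothetical engine-agnostic scheduler: keeps one model resident on
    the GPU and suspends the rest, regardless of which inference engine
    (e.g. vLLM, Ollama) backs each process."""

    def __init__(self, server_pids: dict[str, int], active: str) -> None:
        self.server_pids = server_pids  # model name -> inference-server PID
        self.active = active            # model currently resident on the GPU

    def swap_to(self, model: str) -> None:
        if model == self.active:
            return  # requested model is already live; nothing to do
        # Suspend the active engine: its GPU memory is checkpointed to host RAM.
        toggle_cuda_state(self.server_pids[self.active])
        # Resume the requested engine: its checkpoint is restored to the GPU.
        toggle_cuda_state(self.server_pids[model])
        self.active = model
```

- In such a design, the paper's demand-aware preemption and memory-reserving request routing would sit above a primitive like swap_to, deciding when a swap is worth triggering based on the queue of concurrent requests.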
- Publication status:
- Accepted
- Peer review status:
- Peer reviewed
- Publisher:
- Association for Computing Machinery
- Acceptance date:
- 2025-09-05
- Event title:
- 7th International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC
- Event location:
- St. Louis, Missouri, USA
- Event website:
- https://sc25.supercomputing.org/
- Event start date:
- 2025-11-16
- Event end date:
- 2025-11-21
- Language:
- English
- Keywords:
- Pubs id:
- 2292837
- Local pid:
- pubs:2292837
- Deposit date:
- 2025-09-25
- Notes:
- This conference paper has been accepted for presentation at the 2025 International Conference for High Performance Computing, Networking, Storage, and Analysis.