Journal article
polars-bio—fast, scalable, and out-of-core operations on large genomic interval datasets
- Abstract:
- Motivation: Genomic studies very often rely on computationally intensive analyses of relationships between features, which are typically represented as intervals along a 1D coordinate system (such as positions on a chromosome). In this context, the Python programming language is extensively used for manipulating and analyzing data stored in a tabular form of rows and columns, called a DataFrame. Pandas is the most widely used Python DataFrame package and has been criticized for inefficiencies and scalability issues, which its modern alternative, Polars, aims to address with a native backend written in the Rust programming language.
Results: polars-bio is a Python library that enables fast, parallel and out-of-core operations on large genomic interval datasets. Its main components are implemented in Rust, using the Apache DataFusion query engine and Apache Arrow for efficient data representation. It is compatible with Polars and Pandas DataFrame formats. In a real-world comparison (10⁷ versus 1.2×10⁶ intervals), our library runs overlap queries 6.5×, nearest queries 15.5×, count_overlaps queries 38×, and coverage queries 15× faster than Bioframe. On equally sized synthetic sets (10⁷ versus 10⁷), the corresponding speedups are 1.6×, 5.5×, 6×, and 6×. In streaming mode, on real and synthetic interval pairs, our implementation uses 90× and 15× less memory for overlap, 4.5× and 6.5× less for nearest, 60× and 12× less for count_overlaps, and 34× and 7× less for coverage than Bioframe. Multi-threaded benchmarks show good scalability characteristics. To the best of our knowledge, polars-bio is the most efficient single-node library for genomic interval DataFrames in Python.
Availability and implementation: polars-bio is an open-source Python package distributed under the Apache License available for major platforms, including Linux, macOS, and Windows in the PyPI registry.
The online documentation is available at https://biodatageeks.org/polars-bio/ and the source code is available on GitHub (https://github.com/biodatageeks/polars-bio) and Zenodo (https://doi.org/10.5281/zenodo.16374290).
Supplementary information: Supplementary data are available at Bioinformatics online.
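To make the overlap and count_overlaps operations named in the abstract concrete, the following is a minimal pure-Python sketch of what such queries compute. It is not the polars-bio API or its implementation: the `Interval` type, the function names, the sample data, and the half-open coordinate convention are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical interval record; half-open [start, end) is an assumption."""
    chrom: str
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two non-empty intervals on the same chromosome share a position."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def overlap_pairs(left, right):
    """Naive O(n*m) overlap join: every (left, right) pair that overlaps."""
    return [(a, b) for a in left for b in right if overlaps(a, b)]


def count_overlaps(left, right):
    """Count, for each left interval, the right intervals that overlap it.

    With sorted start and end coordinates per chromosome, the count equals
    (right starts < a.end) minus (right ends <= a.start), provided every
    interval satisfies start < end.
    """
    starts, ends = {}, {}
    for b in right:
        starts.setdefault(b.chrom, []).append(b.start)
        ends.setdefault(b.chrom, []).append(b.end)
    for chrom in starts:
        starts[chrom].sort()
        ends[chrom].sort()
    counts = []
    for a in left:
        s = starts.get(a.chrom, [])
        e = ends.get(a.chrom, [])
        counts.append(bisect_left(s, a.end) - bisect_right(e, a.start))
    return counts


# Made-up sample data, for illustration only.
genes = [
    Interval("chr1", 100, 200),
    Interval("chr1", 300, 400),
    Interval("chr2", 50, 60),
]
reads = [
    Interval("chr1", 150, 160),
    Interval("chr1", 190, 310),
    Interval("chr1", 400, 450),
    Interval("chr2", 10, 50),
]

print(count_overlaps(genes, reads))  # [2, 1, 0]
```

The naive join is quadratic, which is the cost that dedicated interval libraries avoid at the scale quoted in the abstract. The counting variant shows one common way to avoid the pairwise comparison with sorting and binary search.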
- Publication status:
- Published
- Peer review status:
- Peer reviewed
- Publisher copy:
- 10.1093/bioinformatics/btaf640
- Publisher:
- Oxford University Press
- Journal:
- Bioinformatics
- Volume:
- 41
- Issue:
- 12
- Article number:
- btaf640
- Publication date:
- 2025-12-01
- Acceptance date:
- 2025-11-17
- DOI:
- EISSN:
- 1367-4811
- ISSN:
- 1367-4803
- Language:
- English
- Pubs id:
- 2360021
- UUID:
- uuid_5e537d69-947d-4e88-8c06-7736644f7a05
- Local pid:
- pubs:2360021
- Source identifiers:
- 3563557
- Deposit date:
- 2025-12-14
- ARK identifier:
This ORA record was generated from metadata provided by an external service. It has not been edited by the ORA Team.
Terms of use
- Copyright date:
- 2025
- Licence:
- CC Attribution (CC BY)