| title: README | |
| emoji: 🐨 | |
| colorFrom: purple | |
| colorTo: green | |
| sdk: static | |
| pinned: false | |
| short_description: Storage and compute for the virtual cell era | |
| # SLAF (Sparse Lazy Array Format) | |
| **SLAF** is a high-performance format for single-cell transcriptomics data built on top of the [Lance](https://lancedb.github.io/lance/) table format and [Polars](https://pola.rs). For users of scanpy or anndata, it should feel like you never left. SLAF provides an advanced dataloader that looks and feels like PyTorch but runs its own multi-threaded async prefetcher under the hood. Bleeding-edge internals, familiar interfaces. | |
| ``` | |
| pip install slafdb[ml] | |
| ``` | |
| [GitHub](https://github.com/slaf-project/slaf) | [Documentation](https://slaf-project.github.io/slaf/) | |
| ## Why SLAF? | |
| Single-cell transcriptomics datasets have scaled **2,000-fold** in less than a decade. A typical study used to have 50k cells that could be copied to SSD and processed in memory. At the 100M-cell scale, network, storage, and memory become bottlenecks. | |
| 1. The traditional analytic workload is limited to in-memory, single-node operations. Today we need to do cell/gene filtering, normalization, PCA/UMAP, and differential expression at **2,000x the scale**. | |
| 2. New **AI-native workloads** have arrived: | |
| - cell typing with nearest neighbor search on embeddings | |
| - transformer-based foundation model training with efficient tokenization | |
| - distributed workloads across nodes or GPUs that stream random batches concurrently | |
| For these, we need **cloud-native, zero-copy, query-in-place storage** that avoids maintaining multiple copies per user, workload, application, or node, while retaining the interfaces for numpy-like sparse matrix slicing and the scanpy pipelines we already use. | |
| ## Who is SLAF for? | |
| - **Bioinformaticians** — Struggling with OOM and data transfer on 10M+ cell datasets. SLAF eliminates the bottleneck with lazy evaluation. | |
| - **Foundation Model Builders** — SLAF enables cloud-native streaming and removes data duplication. | |
| - **Tech Leaders & Architects** — SLAF provides zero-copy, query-in-place storage instead of duplicated datasets per user. | |
| - **Tool Builders** — SLAF enables concurrent, cloud-scale access with high QPS for interactive experiences. | |
| - **Atlas Builders** — SLAF provides cloud-native, zero-copy storage for global distribution. | |
| - **Data Integrators** — SLAF’s SQL-native design enables complex data integration with pushdown optimization. | |
| ## Quick examples | |
| **Query with SQL (no full download):** | |
| ```python | |
| from slaf import SLAFArray | |
| slaf_array = SLAFArray("hf://datasets/slaf-project/Parse-10M") | |
| results = slaf_array.query(""" | |
| SELECT | |
| cytokine, | |
| cell_type, | |
| AVG(gene_count) as avg_gene_count | |
| FROM cells | |
| WHERE donor = 'Donor10' | |
| AND cytokine IN ('C5a', 'CD40L') | |
| GROUP BY cytokine, cell_type | |
| ORDER BY cytokine, avg_gene_count DESC | |
| """) | |
| ``` | |
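To make the aggregation concrete, here is a minimal sketch that runs the same SQL against a tiny in-memory SQLite table. The rows are invented for illustration, and SQLite is only a stand-in: this shows what the query computes, not how SLAF executes it or what type `query` returns.

```python
import sqlite3

# Toy stand-in for the `cells` table; every row here is invented for illustration.
rows = [
    # (donor, cytokine, cell_type, gene_count)
    ("Donor10", "C5a", "CD8 Naive", 1000),
    ("Donor10", "C5a", "CD8 Naive", 2000),
    ("Donor10", "C5a", "B cell", 500),
    ("Donor10", "CD40L", "B cell", 3000),
    ("Donor10", "IL-2", "B cell", 9000),   # excluded: cytokine not in the IN list
    ("Donor3", "C5a", "CD8 Naive", 7000),  # excluded: different donor
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cells (donor TEXT, cytokine TEXT, cell_type TEXT, gene_count INTEGER)"
)
conn.executemany("INSERT INTO cells VALUES (?, ?, ?, ?)", rows)

# Same filter, grouping, and ordering as the SLAF example above.
toy_results = conn.execute("""
    SELECT cytokine, cell_type, AVG(gene_count) AS avg_gene_count
    FROM cells
    WHERE donor = 'Donor10'
      AND cytokine IN ('C5a', 'CD40L')
    GROUP BY cytokine, cell_type
    ORDER BY cytokine, avg_gene_count DESC
""").fetchall()
conn.close()

for row in toy_results:
    print(row)
```

Each output row is one `(cytokine, cell_type)` group for the selected donor, with groups sorted by cytokine and then by descending average gene count.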
| **Lazy Scanpy-style slicing:** | |
| ```python | |
| from slaf.integrations import read_slaf | |
| adata = read_slaf("hf://datasets/slaf-project/Parse-10M") | |
| subset = adata[ | |
| ( | |
| (adata.obs.cell_type == "CD8 Naive") & | |
| (adata.obs.cytokine == "C5a") & | |
| (adata.obs.donor == "Donor10") | |
| ), : | |
| ] | |
| expression = subset[:10, :].X.compute() # Only now is data loaded | |
| ``` | |
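The idea behind deferred loading can be sketched in a few lines. The `LazyRows` and `CountingSource` classes below are hypothetical and are not SLAF's implementation: the view only records row selections, and the backing data is read when `compute()` is called.

```python
class CountingSource:
    """Hypothetical backing store that counts how many rows are read."""

    def __init__(self, rows):
        self._rows = rows
        self.reads = 0

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, i):
        self.reads += 1
        return self._rows[i]


class LazyRows:
    """Hypothetical lazy view over a row source; not part of SLAF."""

    def __init__(self, source, indices=None):
        self._source = source  # backing data, never copied here
        self._indices = range(len(source)) if indices is None else indices

    def __getitem__(self, key):
        # Slicing composes index selections without reading any row.
        if not isinstance(key, slice):
            raise TypeError("this sketch supports slices only")
        return LazyRows(self._source, self._indices[key])

    def __len__(self):
        return len(self._indices)

    def compute(self):
        # Only now are the selected rows actually read.
        return [self._source[i] for i in self._indices]


source = CountingSource([[i, i * 10] for i in range(1000)])
view = LazyRows(source)[100:200][:3]  # two chained slices, still no reads
reads_before = source.reads
values = view.compute()               # reads exactly the three selected rows
```

Chained slices stay cheap because they only narrow a `range` of indices; the cost of reading is paid once, for the final selection.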
| **Stream tokenized batches for training:** | |
| ```python | |
| from slaf import SLAFArray | |
| from slaf.ml.dataloaders import SLAFDataLoader | |
| slaf_array = SLAFArray("hf://datasets/slaf-project/Parse-10M") | |
| dataloader = SLAFDataLoader( | |
| slaf_array=slaf_array, | |
| tokenizer_type="geneformer", | |
| batch_size=32, | |
| max_genes=2048, | |
| vocab_size=50000, | |
| prefetch_batch_size=1_000_000 | |
| ) | |
| for batch in dataloader: | |
|     input_ids = batch["input_ids"] | |
|     attention_mask = batch["attention_mask"] | |
|     # Your training code here | |
| ``` | |
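The prefetching idea can be illustrated with a small background-thread iterator. This is a generic sketch using only the Python standard library, not SLAF's internals; `prefetch` and `slow_batches` are hypothetical names, and the batch contents are made up.

```python
import queue
import threading
import time

_DONE = object()  # sentinel marking the end of the stream


def prefetch(batches, buffer_size=4):
    """Yield items from `batches` while a worker thread loads ahead."""
    buffer = queue.Queue(maxsize=buffer_size)

    def worker():
        try:
            for batch in batches:
                buffer.put(batch)  # blocks while the buffer is full
        finally:
            # Always signal completion. This sketch does not forward
            # worker errors to the consumer or handle early consumer exit.
            buffer.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item


def slow_batches(n):
    """Stand-in for an I/O-bound batch source."""
    for i in range(n):
        time.sleep(0.01)  # simulated storage latency
        yield {"input_ids": [i, i + 1]}


loaded = [batch["input_ids"] for batch in prefetch(slow_batches(5))]
```

Because loading overlaps with the consumer's work, a training step does not wait on each read as long as the buffer stays non-empty.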