---
title: README
emoji: 🐨
colorFrom: purple
colorTo: green
sdk: static
pinned: false
short_description: Storage and compute for the virtual cell era
---

# SLAF (Sparse Lazy Array Format)

**SLAF** is a high-performance format for single-cell transcriptomics data built on top of the [Lance](https://lancedb.github.io/lance/) table format and [Polars](https://pola.rs). For users of scanpy or anndata, it should feel like you never left. SLAF provides an advanced dataloader that looks and feels like PyTorch but runs its own multi-threaded async prefetcher under the hood. Bleeding-edge internals, familiar interfaces.

```
pip install slafdb[ml]
```

[GitHub](https://github.com/slaf-project/slaf) | [Documentation](https://slaf-project.github.io/slaf/)

## Why SLAF?

Single-cell transcriptomics datasets have scaled **2,000-fold** in less than a decade. A typical study used to have 50k cells that could be copied to SSD and processed in memory. At the 100M-cell scale, network, storage, and memory all become bottlenecks.

1. The traditional analytic workload is stuck in in-memory, single-node operations. Today we need to do cell/gene filtering, normalization, PCA/UMAP, and differential expression at **2,000x the scale**.
2. New **AI-native workloads** have arrived:
   - cell typing with nearest-neighbor search on embeddings
   - transformer-based foundation model training with efficient tokenization
   - distribution of workloads across nodes or GPUs by streaming random batches concurrently

For these, we need **cloud-native, zero-copy, query-in-place storage** --- without maintaining multiple copies per user, workload, application, or node --- while retaining the interfaces for numpy-like sparse matrix slicing and the scanpy pipelines we already use.

## Who is SLAF for?

- **Bioinformaticians** — Struggling with OOM errors and slow data transfer on 10M+ cell datasets. SLAF eliminates these bottlenecks with lazy evaluation.
- **Foundation Model Builders** — SLAF enables cloud-native streaming and removes data duplication.
- **Tech Leaders & Architects** — SLAF provides zero-copy, query-in-place storage instead of duplicated datasets per user.
- **Tool Builders** — SLAF enables concurrent, cloud-scale access with high QPS for interactive experiences.
- **Atlas Builders** — SLAF provides cloud-native, zero-copy storage for global distribution.
- **Data Integrators** — SLAF's SQL-native design enables complex data integration with pushdown optimization.
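## Getting data into SLAF

The quick examples below read a ready-made SLAF dataset from the Hugging Face Hub. To bring your own data, SLAF ships conversion tooling; here is a minimal sketch, assuming a converter API along the lines of `SLAFConverter.convert(input, output)` (the class name and signature are taken from the SLAF docs and may change; check the [documentation](https://slaf-project.github.io/slaf/) for the current API):

```python
from slaf.data import SLAFConverter

# Convert an existing AnnData file into a SLAF dataset on disk.
# "pbmc3k.h5ad" is a placeholder for your own file.
converter = SLAFConverter()
converter.convert("pbmc3k.h5ad", "pbmc3k.slaf")
```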
## Quick examples

**Query with SQL (no full download):**

```python
from slaf import SLAFArray

slaf_array = SLAFArray("hf://datasets/slaf-project/Parse-10M")
results = slaf_array.query("""
    SELECT cytokine, cell_type, AVG(gene_count) AS avg_gene_count
    FROM cells
    WHERE donor = 'Donor10'
      AND cytokine IN ('C5a', 'CD40L')
    GROUP BY cytokine, cell_type
    ORDER BY cytokine, avg_gene_count DESC
""")
```

**Lazy Scanpy-style slicing:**

```python
from slaf.integrations import read_slaf

adata = read_slaf("hf://datasets/slaf-project/Parse-10M")
subset = adata[
    (
        (adata.obs.cell_type == "CD8 Naive")
        & (adata.obs.cytokine == "C5a")
        & (adata.obs.donor == "Donor10")
    ),
    :,
]
expression = subset[:10, :].X.compute()  # Only now is data loaded
```

**Stream tokenized batches for training:**

```python
from slaf import SLAFArray
from slaf.ml.dataloaders import SLAFDataLoader

slaf_array = SLAFArray("hf://datasets/slaf-project/Parse-10M")
dataloader = SLAFDataLoader(
    slaf_array=slaf_array,
    tokenizer_type="geneformer",
    batch_size=32,
    max_genes=2048,
    vocab_size=50000,
    prefetch_batch_size=1_000_000,
)

for batch in dataloader:
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    # Your training code here
```
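The batches are plain dictionaries of tensors, so they drop straight into a standard PyTorch loop. Here is a minimal sketch with a toy classifier to make the loop concrete; the model, labels, and loss are illustrative stand-ins, not part of SLAF:

```python
import torch
import torch.nn as nn


class ToyCellEncoder(nn.Module):
    """Illustrative model: embed Geneformer-style gene tokens, mean-pool, classify."""

    def __init__(self, vocab_size=50000, dim=128, n_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, input_ids, attention_mask):
        x = self.embed(input_ids)                    # (batch, genes, dim)
        mask = attention_mask.unsqueeze(-1).float()  # (batch, genes, 1)
        pooled = (x * mask).sum(1) / mask.sum(1).clamp(min=1)  # masked mean over genes
        return self.head(pooled)


model = ToyCellEncoder()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:
    logits = model(batch["input_ids"].long(), batch["attention_mask"])
    # Random labels keep the sketch self-contained; substitute your real targets.
    labels = torch.randint(0, 10, (logits.shape[0],))
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```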