---
title: README
emoji: 🐨
colorFrom: purple
colorTo: green
sdk: static
pinned: false
short_description: Storage and compute for the virtual cell era
---
# SLAF (Sparse Lazy Array Format)
**SLAF** is a high-performance format for single-cell transcriptomics data built on top of the [Lance](https://lancedb.github.io/lance/) table format and [Polars](https://pola.rs). For users of scanpy or anndata, it should feel like you never left. SLAF provides an advanced dataloader that looks and feels like PyTorch but runs its own multi-threaded async prefetcher under the hood. Bleeding-edge internals, familiar interfaces.
```bash
pip install slafdb[ml]
```
[GitHub](https://github.com/slaf-project/slaf) | [Documentation](https://slaf-project.github.io/slaf/)
## Why SLAF?
Single-cell transcriptomics datasets have scaled **2,000-fold** in less than a decade. A typical study used to have 50k cells that could be copied to SSD and processed in memory. At the 100M-cell scale, network, storage, and memory become bottlenecks.
1. The traditional analytic workload is stuck in in-memory, single-node operations. Today we need to run cell/gene filtering, normalization, PCA/UMAP, and differential expression at **2,000x the scale**.
2. New **AI-native workloads** have arrived:
- cell typing with nearest neighbor search on embeddings
- transformer-based foundation model training with efficient tokenization
- distributed training across nodes or GPUs via concurrent streaming of random batches
For these, we need **cloud-native, zero-copy, query-in-place storage** --- without maintaining multiple copies per user, workload, application, or node --- while retaining the interfaces for numpy-like sparse matrix slicing and the scanpy pipelines we already use.
## Who is SLAF for?
- **Bioinformaticians** — Struggling with OOM errors and slow data transfer on 10M+ cell datasets? SLAF eliminates these bottlenecks with lazy evaluation.
- **Foundation Model Builders** — SLAF enables cloud-native streaming and removes data duplication.
- **Tech Leaders & Architects** — SLAF provides zero-copy, query-in-place storage instead of duplicated datasets per user.
- **Tool Builders** — SLAF enables concurrent, cloud-scale access with high QPS for interactive experiences.
- **Atlas Builders** — SLAF provides cloud-native, zero-copy storage for global distribution.
- **Data Integrators** — SLAF’s SQL-native design enables complex data integration with pushdown optimization.
## Quick examples
**Query with SQL (no full download):**
```python
from slaf import SLAFArray
slaf_array = SLAFArray("hf://datasets/slaf-project/Parse-10M")
results = slaf_array.query("""
SELECT
cytokine,
cell_type,
AVG(gene_count) as avg_gene_count
FROM cells
WHERE donor = 'Donor10'
AND cytokine IN ('C5a', 'CD40L')
GROUP BY cytokine, cell_type
ORDER BY cytokine, avg_gene_count DESC
""")
```
**Lazy Scanpy-style slicing:**
```python
from slaf.integrations import read_slaf
adata = read_slaf("hf://datasets/slaf-project/Parse-10M")
subset = adata[
(
(adata.obs.cell_type == "CD8 Naive") &
(adata.obs.cytokine == "C5a") &
(adata.obs.donor == "Donor10")
), :
]
expression = subset[:10, :].X.compute() # Only now is data loaded
```
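The key idea above is that slicing records *what* you want rather than loading it. As a toy illustration of the lazy-slicing pattern (this is not SLAF's internal implementation, just a sketch of the technique):

```python
import numpy as np

# Minimal sketch of lazy slicing: __getitem__ only records the
# selection; the data source is touched only when .compute() runs.
class LazyArray:
    def __init__(self, loader, selection=None):
        self._loader = loader        # callable that yields the full array
        self._selection = selection  # recorded slice, applied lazily

    def __getitem__(self, key):
        # Cheap: returns a new lazy handle, no I/O happens here.
        return LazyArray(self._loader, key)

    def compute(self):
        # I/O happens only now, and only the selection is returned.
        data = self._loader()
        return data[self._selection] if self._selection is not None else data

arr = LazyArray(lambda: np.arange(100).reshape(10, 10))
subset = arr[:3, :2]      # just records the slice
result = subset.compute()  # loads and slices the data
print(result.shape)        # (3, 2)
```

In SLAF, the same pattern extends to boolean masks over `adata.obs`, which are pushed down to the storage layer as queries instead of being applied in memory.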
**Stream tokenized batches for training:**
```python
from slaf import SLAFArray
from slaf.ml.dataloaders import SLAFDataLoader
slaf_array = SLAFArray("hf://datasets/slaf-project/Parse-10M")
dataloader = SLAFDataLoader(
slaf_array=slaf_array,
tokenizer_type="geneformer",
batch_size=32,
max_genes=2048,
vocab_size=50000,
prefetch_batch_size=1_000_000
)
for batch in dataloader:
input_ids = batch["input_ids"]
attention_mask = batch["attention_mask"]
# Your training code here
```
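To show how these batches plug into a standard PyTorch loop, here is a minimal, hypothetical training-model sketch. The model, dimensions, and the synthetic batch below are placeholders (not part of SLAF); the only assumption carried over from the dataloader is that each batch is a dict with `input_ids` and `attention_mask`:

```python
import torch
import torch.nn as nn

# Synthetic stand-in for one batch from SLAFDataLoader,
# matching the shapes configured above (batch_size=32, max_genes=2048).
batch = {
    "input_ids": torch.randint(0, 50000, (32, 2048)),
    "attention_mask": torch.ones(32, 2048, dtype=torch.long),
}

# Toy model: embed gene tokens, masked mean-pool, classify.
class ToyCellClassifier(nn.Module):
    def __init__(self, vocab_size=50000, dim=64, n_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, input_ids, attention_mask):
        x = self.embed(input_ids)                    # (B, L, D)
        mask = attention_mask.unsqueeze(-1).float()  # (B, L, 1)
        pooled = (x * mask).sum(1) / mask.sum(1)     # masked mean pool
        return self.head(pooled)                     # (B, n_classes)

model = ToyCellClassifier()
logits = model(batch["input_ids"], batch["attention_mask"])
print(logits.shape)  # torch.Size([32, 10])
```

Because the prefetcher streams batches concurrently, the GPU-side loop stays the same whether the data lives on local disk or in cloud storage.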