---
title: README
emoji: 🐨
colorFrom: purple
colorTo: green
sdk: static
pinned: false
short_description: Storage and compute for the virtual cell era
---

# SLAF (Sparse Lazy Array Format)

**SLAF** is a high-performance format for single-cell transcriptomics data built on top of the [Lance](https://lancedb.github.io/lance/) table format and [Polars](https://pola.rs). For users of scanpy or anndata, it should feel like you never left. SLAF provides an advanced dataloader that looks and feels like PyTorch but runs its own multi-threaded async prefetcher under the hood. Bleeding-edge internals, familiar interfaces.

```
pip install slafdb[ml]
```

[GitHub](https://github.com/slaf-project/slaf) | [Documentation](https://slaf-project.github.io/slaf/)

## Why SLAF?

Single-cell transcriptomics datasets have scaled **2,000-fold** in less than a decade. A typical study used to have 50k cells that could be copied to SSD and processed in memory. At the 100M-cell scale, network, storage, and memory all become bottlenecks.

1. The traditional analytic workload is stuck in in-memory, single-node operations. Today we need to do cell/gene filtering, normalization, PCA/UMAP, and differential expression at **2,000x the scale**.
2. New **AI-native workloads** have arrived:
   - cell typing with nearest-neighbor search on embeddings
   - transformer-based foundation model training with efficient tokenization
   - distribution of workloads across nodes or GPUs by streaming random batches concurrently

For these, we need **cloud-native, zero-copy, query-in-place storage** --- without maintaining multiple copies per user, workload, application, or node --- while retaining the interfaces for numpy-like sparse matrix slicing and the scanpy pipelines we already use.

## Who is SLAF for?

- **Bioinformaticians** — Struggling with OOM errors and slow data transfer on 10M+ cell datasets. SLAF eliminates these bottlenecks with lazy evaluation.
- **Foundation Model Builders** — SLAF enables cloud-native streaming and removes data duplication.
- **Tech Leaders & Architects** — SLAF provides zero-copy, query-in-place storage instead of duplicated datasets per user.
- **Tool Builders** — SLAF enables concurrent, cloud-scale access with high QPS for interactive experiences.
- **Atlas Builders** — SLAF provides cloud-native, zero-copy storage for global distribution.
- **Data Integrators** — SLAF's SQL-native design enables complex data integration with pushdown optimization.
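## Getting data into SLAF

The quick examples below read a ready-made SLAF dataset from the Hugging Face Hub. To bring your own data, SLAF ships conversion tooling; here is a minimal sketch, assuming a converter API along the lines of `SLAFConverter.convert(input, output)` (the class name and signature are taken from the SLAF docs and may change; check the [documentation](https://slaf-project.github.io/slaf/) for the current API):

```python
from slaf.data import SLAFConverter

# Convert an existing AnnData file into a SLAF dataset on disk.
# "pbmc3k.h5ad" is a placeholder for your own file.
converter = SLAFConverter()
converter.convert("pbmc3k.h5ad", "pbmc3k.slaf")
```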
## Quick examples

**Query with SQL (no full download):**

```python
from slaf import SLAFArray

slaf_array = SLAFArray("hf://datasets/slaf-project/Parse-10M")
results = slaf_array.query("""
    SELECT cytokine, cell_type, AVG(gene_count) AS avg_gene_count
    FROM cells
    WHERE donor = 'Donor10'
      AND cytokine IN ('C5a', 'CD40L')
    GROUP BY cytokine, cell_type
    ORDER BY cytokine, avg_gene_count DESC
""")
```

**Lazy Scanpy-style slicing:**

```python
from slaf.integrations import read_slaf

adata = read_slaf("hf://datasets/slaf-project/Parse-10M")
subset = adata[
    (
        (adata.obs.cell_type == "CD8 Naive")
        & (adata.obs.cytokine == "C5a")
        & (adata.obs.donor == "Donor10")
    ),
    :,
]
expression = subset[:10, :].X.compute()  # Only now is data loaded
```

**Stream tokenized batches for training:**

```python
from slaf import SLAFArray
from slaf.ml.dataloaders import SLAFDataLoader

slaf_array = SLAFArray("hf://datasets/slaf-project/Parse-10M")
dataloader = SLAFDataLoader(
    slaf_array=slaf_array,
    tokenizer_type="geneformer",
    batch_size=32,
    max_genes=2048,
    vocab_size=50000,
    prefetch_batch_size=1_000_000,
)

for batch in dataloader:
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    # Your training code here
```
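The batches are plain dictionaries of tensors, so they drop straight into a standard PyTorch loop. Here is a minimal sketch with a toy classifier to make the loop concrete; the model, labels, and loss are illustrative stand-ins, not part of SLAF:

```python
import torch
import torch.nn as nn


class ToyCellEncoder(nn.Module):
    """Illustrative model: embed Geneformer-style gene tokens, mean-pool, classify."""

    def __init__(self, vocab_size=50000, dim=128, n_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, input_ids, attention_mask):
        x = self.embed(input_ids)                    # (batch, genes, dim)
        mask = attention_mask.unsqueeze(-1).float()  # (batch, genes, 1)
        pooled = (x * mask).sum(1) / mask.sum(1).clamp(min=1)  # masked mean over genes
        return self.head(pooled)


model = ToyCellEncoder()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:
    logits = model(batch["input_ids"].long(), batch["attention_mask"])
    # Random labels keep the sketch self-contained; substitute your real targets.
    labels = torch.randint(0, 10, (logits.shape[0],))
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```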