---
title: README
emoji: 🐨
colorFrom: purple
colorTo: green
sdk: static
pinned: false
short_description: Storage and compute for the virtual cell era
---
# SLAF (Sparse Lazy Array Format)
**SLAF** is a high-performance format for single-cell transcriptomics data built on top of the [Lance](https://lancedb.github.io/lance/) table format and [Polars](https://pola.rs). For users of scanpy or anndata, it should feel like you never left. SLAF provides an advanced dataloader that looks and feels like PyTorch but runs its own multi-threaded async prefetcher under the hood. Bleeding-edge internals, familiar interfaces.
```bash
pip install slafdb[ml]
```
[GitHub](https://github.com/slaf-project/slaf) | [Documentation](https://slaf-project.github.io/slaf/)
## Why SLAF?
Single-cell transcriptomics datasets have scaled **2,000-fold** in less than a decade. A typical study used to have 50k cells that could be copied to SSD and processed in memory. At the 100M-cell scale, network, storage, and memory become bottlenecks.
1. The traditional analytic workload is stuck in in-memory, single-node operations. Today we need to run cell/gene filtering, normalization, PCA/UMAP, and differential expression at **2,000x the scale**.
2. New **AI-native workloads** have arrived:
- cell typing with nearest neighbor search on embeddings
- transformer-based foundation model training with efficient tokenization
- distributed training across nodes or GPUs via concurrent streaming of random batches
For these, we need **cloud-native, zero-copy, query-in-place storage** --- without maintaining multiple copies per user, workload, application, or node --- while retaining the interfaces for numpy-like sparse matrix slicing and the scanpy pipelines we already use.
## Who is SLAF for?
- **Bioinformaticians** — Struggling with OOM errors and slow data transfer on 10M+ cell datasets? SLAF eliminates these bottlenecks with lazy evaluation.
- **Foundation Model Builders** — SLAF enables cloud-native streaming and removes data duplication.
- **Tech Leaders & Architects** — SLAF provides zero-copy, query-in-place storage instead of duplicated datasets per user.
- **Tool Builders** — SLAF enables concurrent, cloud-scale access with high QPS for interactive experiences.
- **Atlas Builders** — SLAF provides cloud-native, zero-copy storage for global distribution.
- **Data Integrators** — SLAF’s SQL-native design enables complex data integration with pushdown optimization.
## Quick examples
**Query with SQL (no full download):**
```python
from slaf import SLAFArray
slaf_array = SLAFArray("hf://datasets/slaf-project/Parse-10M")
results = slaf_array.query("""
SELECT
cytokine,
cell_type,
AVG(gene_count) as avg_gene_count
FROM cells
WHERE donor = 'Donor10'
AND cytokine IN ('C5a', 'CD40L')
GROUP BY cytokine, cell_type
ORDER BY cytokine, avg_gene_count DESC
""")
```
**Lazy Scanpy-style slicing:**
```python
from slaf.integrations import read_slaf
adata = read_slaf("hf://datasets/slaf-project/Parse-10M")
subset = adata[
(
(adata.obs.cell_type == "CD8 Naive") &
(adata.obs.cytokine == "C5a") &
(adata.obs.donor == "Donor10")
), :
]
expression = subset[:10, :].X.compute() # Only now is data loaded
```
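The key idea above is that slicing records *what* you want rather than loading it. As a toy illustration of the lazy-slicing pattern (this is not SLAF's internal implementation, just a sketch of the technique):

```python
import numpy as np

# Minimal sketch of lazy slicing: __getitem__ only records the
# selection; the data source is touched only when .compute() runs.
class LazyArray:
    def __init__(self, loader, selection=None):
        self._loader = loader        # callable that yields the full array
        self._selection = selection  # recorded slice, applied lazily

    def __getitem__(self, key):
        # Cheap: returns a new lazy handle, no I/O happens here.
        return LazyArray(self._loader, key)

    def compute(self):
        # I/O happens only now, and only the selection is returned.
        data = self._loader()
        return data[self._selection] if self._selection is not None else data

arr = LazyArray(lambda: np.arange(100).reshape(10, 10))
subset = arr[:3, :2]      # just records the slice
result = subset.compute()  # loads and slices the data
print(result.shape)        # (3, 2)
```

In SLAF, the same pattern extends to boolean masks over `adata.obs`, which are pushed down to the storage layer as queries instead of being applied in memory.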
**Stream tokenized batches for training:**
```python
from slaf import SLAFArray
from slaf.ml.dataloaders import SLAFDataLoader
slaf_array = SLAFArray("hf://datasets/slaf-project/Parse-10M")
dataloader = SLAFDataLoader(
slaf_array=slaf_array,
tokenizer_type="geneformer",
batch_size=32,
max_genes=2048,
vocab_size=50000,
prefetch_batch_size=1_000_000
)
for batch in dataloader:
input_ids = batch["input_ids"]
attention_mask = batch["attention_mask"]
# Your training code here
```
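To show how these batches plug into a standard PyTorch loop, here is a minimal, hypothetical training-model sketch. The model, dimensions, and the synthetic batch below are placeholders (not part of SLAF); the only assumption carried over from the dataloader is that each batch is a dict with `input_ids` and `attention_mask`:

```python
import torch
import torch.nn as nn

# Synthetic stand-in for one batch from SLAFDataLoader,
# matching the shapes configured above (batch_size=32, max_genes=2048).
batch = {
    "input_ids": torch.randint(0, 50000, (32, 2048)),
    "attention_mask": torch.ones(32, 2048, dtype=torch.long),
}

# Toy model: embed gene tokens, masked mean-pool, classify.
class ToyCellClassifier(nn.Module):
    def __init__(self, vocab_size=50000, dim=64, n_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, input_ids, attention_mask):
        x = self.embed(input_ids)                    # (B, L, D)
        mask = attention_mask.unsqueeze(-1).float()  # (B, L, 1)
        pooled = (x * mask).sum(1) / mask.sum(1)     # masked mean pool
        return self.head(pooled)                     # (B, n_classes)

model = ToyCellClassifier()
logits = model(batch["input_ids"], batch["attention_mask"])
print(logits.shape)  # torch.Size([32, 10])
```

Because the prefetcher streams batches concurrently, the GPU-side loop stays the same whether the data lives on local disk or in cloud storage.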