statLens
Key features
- Self-hosted, no external API calls. Your data never leaves the box.
- 13 differential-expression pipelines covering every common DEA scenario
(Count / Continuous × basic / batch / paired / multi-group / time-course /
interaction, plus ZINB) — or `none_of_these` if your study sits outside the
supported space.
- Editable schema in the middle. statLens shows you the 21-field
study-design summary it extracted, lets you fix anything wrong, then picks
the pipeline.
- End-to-end ≈ 25–45 s per request on a single 24 GB-class NVIDIA GPU.
- Wheel install + one command to launch — `pip install` and `statlens serve`
is all you need.
- Reproducible — every run is a self-contained folder you can zip and ship.
Table of contents
- Prerequisite
- Quick start
- Where statlens looks for the base model
- Screenshots
- Hardware
- TSV format
- Output
- Use cases
- Pipeline labels
- Interpreting results
- Headless server
- Subcommands
- API endpoints
- Models
- Training (LoRA only)
- Troubleshooting
- Citation
Prerequisite
You need Qwen3-32B (~64 GB BF16) on local disk. If you don't have it:
huggingface-cli download Qwen/Qwen3-32B --local-dir ~/models/qwen3-32b
Quick start
# 1. Install statLens (~5 min on first run; pulls ~2 GB of CUDA dependencies)
pip install ${HF_ENDPOINT:-https://huggingface.co}/domizzz2025/statLens/resolve/main/statlens-0.1.11-py3-none-any.whl
# 2. Launch
statlens serve
After ~80 s you should see:
══════════════════════════════════════════════════════
✅ statLens ready
open in browser: http://localhost:7860/
Ctrl+C to stop.
══════════════════════════════════════════════════════
Open http://localhost:7860, drop a TSV, write a short study-context
description, click Classify & run, review the extracted schema, then
Run pipeline →.
Where statlens looks for the base model
When you run statlens serve, it auto-discovers Qwen3-32B in any of:
- `~/models/qwen3-32b/`
- `/root/autodl-tmp/models/qwen3-32b/` (AutoDL)
- `/workspace/models/qwen3-32b/` (RunPod / Lambda)
- `/data/models/qwen3-32b/`
- `/mnt/models/qwen3-32b/`
- the HuggingFace Hub cache
If yours is elsewhere, point statlens serve at it explicitly:
statlens serve --base-model /path/to/qwen3-32b
# or, persistently:
export STATLENS_BASE_MODEL=/path/to/qwen3-32b
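A quick pre-flight check on a fresh box can save one failed launch. The sketch below is not part of statLens — it just confirms the directory you intend to use actually contains a Hugging Face checkpoint (every HF checkpoint directory carries a `config.json`):

```shell
# Sketch: verify the base-model directory looks like a HF checkpoint
# before launching. Defaults to the first auto-search path.
MODEL_DIR="${STATLENS_BASE_MODEL:-$HOME/models/qwen3-32b}"
if [ -f "$MODEL_DIR/config.json" ]; then
    echo "base model found: $MODEL_DIR"
else
    echo "no checkpoint at $MODEL_DIR - download it or pass --base-model" >&2
fi
```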
Screenshots
Hardware
| component | requirement |
|---|---|
| GPU | NVIDIA, ≥ 22 GB VRAM (RTX 3090 / 4090 / A40 / A100 / 5090 …) |
| OS | Linux x86_64 |
| RAM | 32 GB+ |
| Disk | 75 GB free (64 GB Qwen + 1 GB LoRA + working space) |
Mac / Windows / AMD: not supported (LLaMA-Factory + bitsandbytes are CUDA-only).
TSV format
Wide format, one row per sample. Required columns:
| column | meaning |
|---|---|
| `sample_id` | unique per row |
| `subject_id` | repeats for paired or longitudinal samples |
| a design column | group / condition / treatment / clinical_group / tumor_stage / subtype / arm / … (fuzzy-matched) |
| feature columns | prefixed with exactly one of `gene_`, `asv_`, `prot_`, `metab_`, `otu_`, `feat_`. Other prefixes are not recognised and the adapter will report `No feature columns found`. |
Optional: time-like columns (time_day, collection_day, time_week, …)
and batch-like columns (batch, site, run, ms_batch, …).
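To make the format concrete, here is a hand-built minimal upload (hypothetical values — far too few samples for a real analysis, this only illustrates the column conventions and tab separators):

```shell
# Build a tiny 4-sample, 2-gene TSV: one row per sample, tab-separated.
# Column names follow the conventions above; all values are made up.
printf 'sample_id\tsubject_id\tgroup\tgene_TP53\tgene_BRCA1\n' >  demo.tsv
printf 'S1\tP1\tcase\t120\t45\n'                               >> demo.tsv
printf 'S2\tP2\tcase\t98\t60\n'                                >> demo.tsv
printf 'S3\tP3\tcontrol\t15\t300\n'                            >> demo.tsv
printf 'S4\tP4\tcontrol\t22\t280\n'                            >> demo.tsv
head -n 1 demo.tsv   # show the header row
```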
13 demonstration TSVs ship with the package — list them with:
ls $(python3 -c 'import statlens, os; print(os.path.dirname(statlens.__file__))')/data/examples/
Output
Each run lands under ~/.cache/statlens/runs/<run_id>/:
out/
├── statlens_report.md # human-readable summary + reasoning
├── statlens_report.json # machine-readable sidecar
└── pipeline_output/
├── volcano_plot.png
├── PCA_plot.png
├── MA_plot.png
├── top_DE_genes_heatmap.png
├── top20_DE_genes_barplot.png
├── results.csv
├── significant_genes.csv
└── run.log
result.zip # everything above, packaged
The Download all button in the web UI returns result.zip.
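Because each run lands in its own directory under the cache path above, the newest run's archive can also be grabbed from the shell. A sketch (the `RUNS_DIR` variable is ours, not a statLens setting):

```shell
# Copy the newest run's result.zip into the current directory (sketch).
RUNS_DIR="${RUNS_DIR:-$HOME/.cache/statlens/runs}"
latest=$(ls -1t "$RUNS_DIR" 2>/dev/null | head -n 1)
if [ -n "$latest" ]; then
    cp "$RUNS_DIR/$latest/result.zip" "./${latest}_result.zip"
    echo "copied run $latest"
else
    echo "no runs under $RUNS_DIR" >&2
fi
```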
Use cases
Three representative scenarios from the 13 supported pipelines:
| Scenario | Study | Required columns | Expected label |
|---|---|---|---|
| Bulk RNA-seq case-control | 30 patients, 15 case vs 15 control, single sequencing batch, looking for DE genes | `sample_id`, `subject_id`, `group`, `gene_*` | `Count_DESeq2_basic` |
| Plasma proteomics two-arm | LC-MS/MS, 12 cases vs 12 controls, log2 intensity, single MS run | `sample_id`, `subject_id`, `group`, `prot_*` | `Continuous_limma_basic` |
| 16S microbiome IBD vs Healthy | sparse ASV counts dominated by zeros (>40 %) | `sample_id`, `subject_id`, `condition`, `asv_*` | `Count_DESeq2_ZINB` |
A full set of 13 paired demos — one `.tsv` plus matching `.context.txt` per
label — lives under examples/. Drop any `.tsv` into the web UI and paste the
matching `.context.txt` as your study description to reproduce the scenario
in one click.
Pipeline labels
statLens classifies a study into one of 13 DEA scenarios, or returns
none_of_these (a 14th "kill-switch" output) when the design falls outside
its training space.
| family | label | when |
|---|---|---|
| Count (DESeq2) | `Count_DESeq2_basic` | 2 groups, no batch / time / pairing |
| | `Count_DESeq2_with_batch` | 2 groups + batch covariate |
| | `Count_DESeq2_paired_or_repeated` | matched samples within subject |
| | `Count_DESeq2_multi_group` | ≥ 3 independent groups |
| | `Count_DESeq2_time_course` | single cohort, ≥ 3 time points |
| | `Count_DESeq2_group_time_interaction` | ≥ 2 groups × multiple time points |
| | `Count_DESeq2_ZINB` | counts dominated by zeros (>40 %), e.g. 16S |
| Continuous (limma) | `Continuous_limma_basic` | 2 groups, no batch / time / pairing |
| | `Continuous_limma_with_batch` | 2 groups + batch covariate |
| | `Continuous_limma_paired_or_repeated` | pre/post or matched samples |
| | `Continuous_limma_multi_group` | ≥ 3 independent groups |
| | `Continuous_limma_time_course` | single cohort, ≥ 3 time points |
| | `Continuous_limma_group_time_interaction` | groups × time within subject |
| (decline) | `none_of_these` | survival / network inference / single-sample / non-omics — no forced fit |
Interpreting results
Every successful run produces 5 plots and 2 result tables under
pipeline_output/:
| File | What it shows |
|---|---|
| `volcano_plot.png` | Each feature plotted by log2 fold-change (x) vs −log10 adjusted p-value (y). Top-right and top-left points are the significantly up- and down-regulated features. |
| `MA_plot.png` | Log2 fold-change (y) vs mean expression (x). Diagnostic for fold-change vs abundance bias. |
| `PCA_plot.png` | First two principal components of the normalized expression matrix, colored by group. Sanity check for class separation. |
| `top_DE_genes_heatmap.png` | Top 20 most-significant DE features as a heatmap of z-scored expression across samples. |
| `top20_DE_genes_barplot.png` | Top 20 features by absolute log2 fold-change as a barplot. |
| `results.csv` | Full DE table — `feature_id`, `log2FoldChange`, `lfcSE`, `stat`, `pvalue`, `padj`. |
| `significant_genes.csv` | Subset of `results.csv` filtered at padj < 0.05 (or the family default). |
For paired / time-course / interaction designs the results.csv schema
is the same; only the underlying model and the contrast definition
change. See statlens_report.md produced alongside the run for the
exact model formula used.
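The shipped `significant_genes.csv` is fixed at the default cutoff, but `results.csv` can be re-filtered at any threshold afterwards. A sketch using awk (it assumes `padj` is the 6th column, per the schema above, and that the CSV contains no quoted commas; the toy `results.csv` built here stands in for a real run's output):

```shell
# Build a toy results.csv with the documented columns, then re-filter
# it at padj < 0.01 (tighter than the shipped 0.05 cutoff).
cat > results.csv <<'EOF'
feature_id,log2FoldChange,lfcSE,stat,pvalue,padj
gene_A,2.1,0.3,7.0,1e-12,1e-10
gene_B,-1.4,0.4,-3.5,5e-4,0.004
gene_C,0.2,0.2,1.0,0.32,0.5
EOF
# Keep the header plus every row whose 6th field (padj) is below 0.01;
# the +0 forces numeric comparison so scientific notation parses.
awk -F, 'NR==1 || ($6+0) < 0.01' results.csv > significant_q01.csv
cat significant_q01.csv
```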
Headless server: reaching localhost:7860 from elsewhere
If statlens serve cannot open a browser (AutoDL, RunPod, Lambda, …), use
one of these:
| method | command | works for |
|---|---|---|
| Public URL | `cloudflared tunnel --url http://localhost:7860` | any device, any network |
| SSH tunnel | `ssh -fNL 7860:localhost:7860 user@server` (run on your laptop) | quick local dev |
| curl only | `curl -X POST http://localhost:7860/api/run -F "context=..." -F "tsv=@data.tsv"` | scripting / no browser |
Subcommands
statlens serve # main entry point
statlens download # pre-fetch the LoRA only (~1 GB)
statlens info # show GPU / cache / paths
statlens classify --tsv DATA --context CTX --out DIR
# one-shot CLI mode (no browser)
statlens --version
statlens classify runs both LLM stages back-to-back without a review pause —
useful for batch processing.
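A batch run over many studies is then a plain loop. This is a sketch, not a shipped script: the `data/` layout mirrors the examples/ convention of a `.context.txt` next to each `.tsv`, and the `STATLENS_BIN` variable is ours (set it to `echo` to dry-run the loop):

```shell
# Classify every .tsv in data/, pairing each with its .context.txt and
# writing one output directory per study (sketch).
STATLENS_BIN="${STATLENS_BIN:-statlens}"    # STATLENS_BIN=echo to dry-run
for tsv in data/*.tsv; do
    [ -e "$tsv" ] || continue               # data/ may be empty
    ctx="${tsv%.tsv}.context.txt"
    name=$(basename "$tsv" .tsv)
    "$STATLENS_BIN" classify --tsv "$tsv" --context "$ctx" --out "runs/$name"
done
```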
API endpoints
| route | method | purpose |
|---|---|---|
| `/` | GET | serve the web UI |
| `/api/extract` | POST (multipart: `tsv`, `context`) | stage 1 — return a SchemaSummary |
| `/api/run_pipeline` | POST (form: `run_id`, schema JSON) | stage 3 — pick label + run pipeline |
| `/api/run` | POST (multipart: `tsv`, `context`) | legacy single-shot path (no review) |
| `/api/artifact/{run_id}/(unknown)` | GET | fetch a single PNG/CSV |
| `/api/zip/{run_id}` | GET | fetch the packaged result |
| `/api/csv_preview/{run_id}/(unknown)` | GET | first N rows of a result CSV as JSON |
Models
| component | source | size | license |
|---|---|---|---|
| base | `Qwen/Qwen3-32B` (BF16) | 64 GB | Apache-2.0 |
| LoRA | `domizzz2025/statLens` | 1 GB | Apache-2.0 |
The LoRA is auto-downloaded on first run; the base model is yours to provide.
Training (LoRA only)
The classifier LoRA was fine-tuned on top of Qwen/Qwen3-32B with
LLaMA-Factory:
| setting | value |
|---|---|
| Adapter rank / alpha | 32 / 64 |
| Target modules | q / k / v / o / up / down / gate proj |
| Optimizer · schedule | AdamW · cosine, 3 epochs (~636 steps) |
| Training data | curated study descriptions covering the 13 DEA scenarios |
Loss curves and trainer state live under
qwen3_32b_lora_v1/:
training_loss.png, training_eval_loss.png, trainer_state.json,
trainer_log.jsonl.
Generalization to real-world TSVs with non-canonical column conventions is recovered via the user-editable schema layer at run time.
Troubleshooting
| symptom | fix |
|---|---|
| `Network is unreachable` during pip install (mainland China) | `export HF_ENDPOINT=https://hf-mirror.com` and retry |
| `LocalEntryNotFoundError` when the LoRA auto-fetches | same as above — set `HF_ENDPOINT` before `statlens serve` |
| `no base model found` | put Qwen3-32B in one of the auto-search paths, or pass `--base-model PATH` |
| CUDA out of memory on startup | a previous `statlens serve` is still holding GPU memory: `pkill -9 -f statlens; nvidia-smi --query-compute-apps=pid --format=csv,noheader \| xargs -r kill -9` |
| `address already in use` | a previous instance is bound — kill it first |
| LLM never becomes ready | `tail ~/.cache/statlens/llm.log` to see the LLaMA-Factory error |
| schema field looks wrong in the browser | edit it directly; the LLM picks the label from your edits, not the original extraction |
| `Schema specified <field>=…, but no such column in TSV` | a column-name field in the schema doesn't match your TSV. Either fix the field, or clear it to use auto-detection. |
| `Schema reference_level=… not in observed levels` | `reference_level` doesn't match any actual group level. Set it to one of the values shown in `group_levels`, or clear it. |
| `No feature columns found. Expected one of these prefixes: …` | rename your feature columns to start with `gene_` / `asv_` / `prot_` / `metab_` / `otu_` / `feat_`. |
| upload exceeds N MB limit | raise the cap with `STATLENS_MAX_UPLOAD_MB=500 statlens serve` (default 100 MB). |
Source · License
- Wheel + LoRA + source : https://huggingface.co/domizzz2025/statLens
- License : Apache-2.0
Citation
If you use statLens in academic work, please cite:
@software{statlens_2025,
title = {statLens: A self-hosted DEA method selector backed by a Qwen3-32B + LoRA classifier},
author = {statLens contributors},
year = {2025},
url = {https://huggingface.co/domizzz2025/statLens},
note = {Apache-2.0},
}
A peer-reviewed manuscript is in preparation.