statLens

Key features

  • Self-hosted, no external API calls. Your data never leaves the box.
  • 13 differential-expression pipelines covering every common DEA scenario (Count / Continuous × basic / batch / paired / multi-group / time-course / interaction, plus ZINB) — or none_of_these if your study sits outside the supported space.
  • Editable schema in the middle. statLens shows you the 21-field study-design summary it extracted, lets you fix anything wrong, then picks the pipeline.
  • End-to-end ≈ 25–45 s per request on a single 24 GB-class NVIDIA GPU.
  • Wheel install + one command to launch: pip install and statlens serve are all you need.
  • Reproducible — every run is a self-contained folder you can zip and ship.


Prerequisite

You need Qwen3-32B (~64 GB BF16) on local disk. If you don't have it:

huggingface-cli download Qwen/Qwen3-32B --local-dir ~/models/qwen3-32b

Quick start

# 1. Install statLens (~5 min on first run; pulls ~2 GB of CUDA dependencies)
pip install ${HF_ENDPOINT:-https://huggingface.co}/domizzz2025/statLens/resolve/main/statlens-0.1.11-py3-none-any.whl

# 2. Launch
statlens serve

After ~80 s you should see:

══════════════════════════════════════════════════════
  ✅ statLens ready
     open in browser:  http://localhost:7860/
     Ctrl+C to stop.
══════════════════════════════════════════════════════

Open http://localhost:7860, drop a TSV, write a short study-context description, and click Classify & run. Review the extracted schema, then click Run pipeline →.
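For scripted launches, watching for the ready banner can be replaced by polling the web UI route. The following is a minimal sketch, assuming that GET / only answers with HTTP 200 once the server is usable; the helper name wait_until_ready and its timeout values are illustrative and not part of statLens.

```python
import time
import urllib.request


def wait_until_ready(url="http://localhost:7860/", timeout_s=180.0, interval_s=2.0):
    """Poll `url` until it answers HTTP 200 or `timeout_s` elapses.

    Hypothetical helper, not part of statLens. Returns True if the
    server answered in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers connection refused, timeouts and HTTP error statuses:
            # the server is not up yet, so retry after a short sleep.
            pass
        time.sleep(interval_s)
    return False
```

Calling wait_until_ready() right after starting statlens serve in the background blocks until the UI responds or three minutes pass, which leaves headroom over the ~80 s startup quoted above.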


Where statlens looks for the base model

When you run statlens serve, it auto-discovers Qwen3-32B in any of:

  • ~/models/qwen3-32b/
  • /root/autodl-tmp/models/qwen3-32b/ (AutoDL)
  • /workspace/models/qwen3-32b/ (RunPod / Lambda)
  • /data/models/qwen3-32b/, /mnt/models/qwen3-32b/
  • the HuggingFace Hub cache

If yours is elsewhere, point statlens serve at it explicitly:

statlens serve --base-model /path/to/qwen3-32b
# or, persistently:
export STATLENS_BASE_MODEL=/path/to/qwen3-32b
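
The lookup described above can be pictured as a first-match search over a list of directories. This is a sketch only, not the statLens implementation: the precedence (--base-model, then STATLENS_BASE_MODEL, then the default paths), the use of config.json as the "model is present" marker, and the function name find_base_model are all assumptions, and the Hub cache lookup is left out.

```python
import os
from pathlib import Path

# Default locations from the list above (Hub cache lookup omitted here).
DEFAULT_CANDIDATES = [
    "~/models/qwen3-32b",
    "/root/autodl-tmp/models/qwen3-32b",
    "/workspace/models/qwen3-32b",
    "/data/models/qwen3-32b",
    "/mnt/models/qwen3-32b",
]


def find_base_model(cli_path=None, env=None, candidates=DEFAULT_CANDIDATES):
    """Return the first usable model directory, or None.

    Assumed precedence (not confirmed by the source): explicit CLI path,
    then STATLENS_BASE_MODEL, then the default locations in order.
    A directory counts as usable here if it contains config.json.
    """
    env = os.environ if env is None else env
    explicit = [p for p in (cli_path, env.get("STATLENS_BASE_MODEL")) if p]
    for raw in [*explicit, *candidates]:
        path = Path(raw).expanduser()
        if (path / "config.json").is_file():
            return path
    return None
```

Running find_base_model() before launching is a quick way to see which directory, if any, would be picked up on a given machine under these assumptions.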

Screenshots

[Screenshot: New analysis form] Step 1: Upload + describe. Drop a wide-format TSV and write a short study description.
[Screenshot: Schema review] Step 2: Review the extracted schema. statLens shows the 21 fields it inferred from your data; edit anything that looks wrong before continuing.
[Screenshot: Result with plots and tables] Step 3: Get plots and tables. The matched DESeq2 / limma pipeline runs and returns 5 plots and the result tables, packaged as a downloadable zip.

Hardware

GPU: NVIDIA, ≥ 22 GB VRAM (RTX 3090 / 4090 / A40 / A100 / 5090 …)
OS: Linux x86_64
RAM: 32 GB+
Disk: 75 GB free (64 GB Qwen + 1 GB LoRA + working space)

Mac / Windows / AMD: not supported (LLaMA-Factory + bitsandbytes are CUDA-only).


TSV format

Wide format, one row per sample. Required columns:

| column | meaning |
| --- | --- |
| sample_id | unique per row |
| subject_id | repeats for paired or longitudinal samples |
| a design column | group / condition / treatment / clinical_group / tumor_stage / subtype / arm / … (fuzzy-matched) |
| feature columns | prefixed with exactly one of: gene_, asv_, prot_, metab_, otu_, feat_. Other prefixes are not recognised and the adapter will report No feature columns found. |

Optional: time-like columns (time_day, collection_day, time_week, …) and batch-like columns (batch, site, run, ms_batch, …).
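
A quick pre-flight check on a TSV can catch the most common format problems before uploading. The sketch below is illustrative only: check_tsv is a hypothetical helper, it tests just the two fixed column names, the feature prefixes and sample_id uniqueness, and it does not attempt the fuzzy matching that statLens applies to the design column.

```python
import csv
import io

FEATURE_PREFIXES = ("gene_", "asv_", "prot_", "metab_", "otu_", "feat_")


def check_tsv(text):
    """Return a list of problems found in a wide-format TSV (empty if none).

    Illustrative pre-check only; the design column is not validated
    because statLens fuzzy-matches it.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    columns = reader.fieldnames or []
    problems = []
    for required in ("sample_id", "subject_id"):
        if required not in columns:
            problems.append(f"missing column: {required}")
    # str.startswith accepts a tuple, so one call covers all prefixes.
    if not any(c.startswith(FEATURE_PREFIXES) for c in columns):
        problems.append("no feature columns with a recognised prefix")
    if "sample_id" in columns:
        ids = [row["sample_id"] for row in reader]
        if len(ids) != len(set(ids)):
            problems.append("sample_id values are not unique")
    return problems
```

Passing the contents of a file, for example check_tsv(open("data.tsv").read()), returns an empty list when none of these checks fail.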

13 demonstration TSVs ship with the package — list them with:

ls $(python3 -c 'import statlens, os; print(os.path.dirname(statlens.__file__))')/data/examples/

Output

Each run lands under ~/.cache/statlens/runs/<run_id>/:

out/
├── statlens_report.md              # human-readable summary + reasoning
├── statlens_report.json            # machine-readable sidecar
└── pipeline_output/
    ├── volcano_plot.png
    ├── PCA_plot.png
    ├── MA_plot.png
    ├── top_DE_genes_heatmap.png
    ├── top20_DE_genes_barplot.png
    ├── results.csv
    ├── significant_genes.csv
    └── run.log
result.zip                          # everything above, packaged

The Download all button in the web UI returns result.zip.
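
When runs are archived or shipped, it can be useful to confirm that a run folder is complete. The sketch below checks a folder against the tree above; it assumes out/ and result.zip sit side by side inside <run_id>/, as the tree suggests, and missing_outputs is a hypothetical helper rather than a statLens function.

```python
from pathlib import Path

# Relative paths taken from the tree above.
EXPECTED_FILES = [
    "out/statlens_report.md",
    "out/statlens_report.json",
    "out/pipeline_output/volcano_plot.png",
    "out/pipeline_output/PCA_plot.png",
    "out/pipeline_output/MA_plot.png",
    "out/pipeline_output/top_DE_genes_heatmap.png",
    "out/pipeline_output/top20_DE_genes_barplot.png",
    "out/pipeline_output/results.csv",
    "out/pipeline_output/significant_genes.csv",
    "out/pipeline_output/run.log",
    "result.zip",
]


def missing_outputs(run_dir):
    """Return the expected files that are absent from `run_dir`."""
    root = Path(run_dir).expanduser()
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

An empty list from missing_outputs("~/.cache/statlens/runs/<run_id>") means every file in the tree is present.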


Use cases

Three representative scenarios from the 13 supported pipelines:

| Scenario | Study | Required columns | Expected label |
| --- | --- | --- | --- |
| Bulk RNA-seq case-control | 30 patients, 15 case vs 15 control, single sequencing batch, looking for DE genes | sample_id, subject_id, group, gene_* | Count_DESeq2_basic |
| Plasma proteomics two-arm | LC-MS/MS, 12 cases vs 12 controls, log2 intensity, single MS run | sample_id, subject_id, group, prot_* | Continuous_limma_basic |
| 16S microbiome IBD vs Healthy | sparse ASV counts dominated by zeros (>40 %) | sample_id, subject_id, condition, asv_* | Count_DESeq2_ZINB |

A full set of 13 paired demos — one (.tsv + matching .context.txt) per label — lives under examples/. Drop any .tsv into the web UI and paste the matching .context.txt as your study description to reproduce the scenario in one click.


Pipeline labels

statLens classifies a study into one of 13 DEA scenarios, or returns none_of_these (a 14th "kill-switch" output) when the design falls outside its training space.

| family | label | when |
| --- | --- | --- |
| Count (DESeq2) | Count_DESeq2_basic | 2 groups, no batch / time / pairing |
| Count (DESeq2) | Count_DESeq2_with_batch | 2 groups + batch covariate |
| Count (DESeq2) | Count_DESeq2_paired_or_repeated | matched samples within subject |
| Count (DESeq2) | Count_DESeq2_multi_group | ≥ 3 independent groups |
| Count (DESeq2) | Count_DESeq2_time_course | single cohort, ≥ 3 time points |
| Count (DESeq2) | Count_DESeq2_group_time_interaction | ≥ 2 groups × multiple time points |
| Count (DESeq2) | Count_DESeq2_ZINB | counts dominated by zeros (>40 %), e.g. 16S |
| Continuous (limma) | Continuous_limma_basic | 2 groups, no batch / time / pairing |
| Continuous (limma) | Continuous_limma_with_batch | 2 groups + batch covariate |
| Continuous (limma) | Continuous_limma_paired_or_repeated | pre/post or matched samples |
| Continuous (limma) | Continuous_limma_multi_group | ≥ 3 independent groups |
| Continuous (limma) | Continuous_limma_time_course | single cohort, ≥ 3 time points |
| Continuous (limma) | Continuous_limma_group_time_interaction | groups × time within subject |
| (decline) | none_of_these | survival / network inference / single-sample / non-omics; no forced fit |
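
To make the "when" column concrete, the table can be read as a small decision rule. The sketch below is a hypothetical rule of thumb and not how statLens decides: the tool uses its LLM classifier on the reviewed schema. The function name suggest_label, its parameters, and especially the order in which the checks are applied are assumptions of this example.

```python
def suggest_label(data_type, n_groups, n_timepoints=1, has_batch=False,
                  paired=False, zero_fraction=0.0):
    """Map coarse design facts to a pipeline label from the table above.

    Hypothetical rule of thumb; statLens uses its LLM classifier instead.
    The precedence of the checks is an assumption of this sketch.
    """
    if data_type == "Count":
        family = "Count_DESeq2"
    elif data_type == "Continuous":
        family = "Continuous_limma"
    else:
        return "none_of_these"

    # Zero-inflated counts (>40 % zeros) only exist as a Count label.
    if data_type == "Count" and zero_fraction > 0.40:
        return "Count_DESeq2_ZINB"
    if n_groups >= 2 and n_timepoints >= 2:
        return f"{family}_group_time_interaction"
    if n_groups == 1 and n_timepoints >= 3:
        return f"{family}_time_course"
    if paired:
        return f"{family}_paired_or_repeated"
    if n_groups >= 3:
        return f"{family}_multi_group"
    if n_groups == 2:
        return f"{family}_with_batch" if has_batch else f"{family}_basic"
    return "none_of_these"
```

For instance, suggest_label("Continuous", 2, has_batch=True) yields Continuous_limma_with_batch, while an unsupported data type falls through to none_of_these.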

Interpreting results

Every successful run produces 5 plots and 2 result tables under pipeline_output/:

| File | What it shows |
| --- | --- |
| volcano_plot.png | Each feature plotted by log2 fold-change (x) vs −log10 adjusted p-value (y). Top-right and top-left points are the significantly up- and down-regulated features. |
| MA_plot.png | Log2 fold-change (y) vs mean expression (x). Diagnostic for fold-change vs abundance bias. |
| PCA_plot.png | First two principal components of the normalized expression matrix, colored by group. Sanity check for class separation. |
| top_DE_genes_heatmap.png | Top 20 most-significant DE features as a heatmap of z-scored expression across samples. |
| top20_DE_genes_barplot.png | Top 20 features by absolute log2 fold-change as a barplot. |
| results.csv | Full DE table: feature_id, log2FoldChange, lfcSE, stat, pvalue, padj. |
| significant_genes.csv | Subset of results.csv filtered at padj < 0.05 (or the family default). |

For paired / time-course / interaction designs the results.csv schema is the same; only the underlying model and the contrast definition change. See statlens_report.md produced alongside the run for the exact model formula used.
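
The relationship between results.csv and significant_genes.csv can be reproduced by hand, which helps when a different cutoff is wanted. The sketch below assumes the column names listed for results.csv and assumes that missing adjusted p-values appear as empty cells or "NA"; significant_rows and direction are hypothetical helper names.

```python
import csv
import io


def significant_rows(results_csv_text, padj_cutoff=0.05):
    """Return rows of a results.csv-style table with padj below the cutoff.

    Rows whose padj is missing or not numeric (e.g. "NA") are skipped;
    that representation of missing values is an assumption.
    """
    kept = []
    for row in csv.DictReader(io.StringIO(results_csv_text)):
        try:
            padj = float(row["padj"])
        except (KeyError, TypeError, ValueError):
            continue
        if padj < padj_cutoff:
            kept.append(row)
    return kept


def direction(row):
    """Label a feature as up- or down-regulated by the sign of its log2 fold-change."""
    return "up" if float(row["log2FoldChange"]) > 0 else "down"
```

Applied to the text of results.csv, significant_rows(text, padj_cutoff=0.01) gives a stricter list than the default, and direction(row) corresponds to the right and left halves of the volcano plot.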


Headless server: reaching localhost:7860 from elsewhere

If statlens serve cannot open a browser (AutoDL, RunPod, Lambda, …), use one of these:

| method | command | works for |
| --- | --- | --- |
| Public URL | cloudflared tunnel --url http://localhost:7860 | any device, any network |
| SSH tunnel | ssh -fNL 7860:localhost:7860 user@server (run on your laptop) | quick local dev |
| curl only | curl -X POST http://localhost:7860/api/run -F "context=..." -F "tsv=@data.tsv" | scripting / no browser |

Subcommands

statlens serve                                   # main entry point
statlens download                                # pre-fetch the LoRA only (~1 GB)
statlens info                                    # show GPU / cache / paths
statlens classify --tsv DATA --context CTX --out DIR
                                                 # one-shot CLI mode (no browser)
statlens --version

statlens classify runs both LLM stages back-to-back without a review pause — useful for batch processing.
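
For batch processing, a small driver can pair each TSV with its description and call statlens classify once per pair. The sketch below is an assumption-laden illustration: it assumes that --context accepts a path to a context file (the source only shows the placeholder CTX, so verify this against your installed version), it follows the X.tsv / X.context.txt pairing used by the bundled demos, and build_classify_commands and run_all are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def build_classify_commands(example_dir, out_root):
    """Pair each X.tsv with X.context.txt and build one argv list per pair.

    Assumes --context takes a file path; this is not confirmed by the source.
    """
    commands = []
    for tsv in sorted(Path(example_dir).glob("*.tsv")):
        context = tsv.with_name(tsv.stem + ".context.txt")
        if not context.is_file():
            continue  # skip TSVs without a matching description
        out_dir = Path(out_root) / tsv.stem
        commands.append([
            "statlens", "classify",
            "--tsv", str(tsv),
            "--context", str(context),
            "--out", str(out_dir),
        ])
    return commands


def run_all(commands):
    """Run the commands one after another, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Building the argument lists separately from running them makes it easy to print and inspect the commands before any GPU time is spent.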


API endpoints

| route | method | purpose |
| --- | --- | --- |
| / | GET | serve the web UI |
| /api/extract | POST (multipart: tsv, context) | stage 1: return a SchemaSummary |
| /api/run_pipeline | POST (form: run_id, schema JSON) | stage 3: pick label + run pipeline |
| /api/run | POST (multipart: tsv, context) | legacy single-shot path (no review) |
| /api/artifact/{run_id}/{filename} | GET | fetch a single PNG/CSV |
| /api/zip/{run_id} | GET | fetch the packaged result |
| /api/csv_preview/{run_id}/{filename} | GET | first N rows of a result CSV as JSON |
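
The routes above can be driven from a script as well as from the browser. The sketch below only builds URLs and one request object, without sending anything. The request and response formats are not documented here, so the form field names run_id and schema, the URL-encoded body, and all helper names are assumptions to be checked against a running server.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:7860"


def artifact_url(run_id, filename, base=BASE_URL):
    """URL of a single PNG/CSV produced by a run."""
    return f"{base}/api/artifact/{urllib.parse.quote(run_id)}/{urllib.parse.quote(filename)}"


def zip_url(run_id, base=BASE_URL):
    """URL of the packaged result for a run."""
    return f"{base}/api/zip/{urllib.parse.quote(run_id)}"


def run_pipeline_request(run_id, schema, base=BASE_URL):
    """Build (but do not send) the POST for /api/run_pipeline.

    Assumes a URL-encoded form with fields `run_id` and `schema`,
    where `schema` is the edited summary serialised as JSON.
    """
    body = urllib.parse.urlencode(
        {"run_id": run_id, "schema": json.dumps(schema)}
    ).encode("utf-8")
    return urllib.request.Request(f"{base}/api/run_pipeline", data=body, method="POST")
```

A request built this way can be sent with urllib.request.urlopen(req), and the packaged result can then be fetched from zip_url(run_id) once the pipeline has finished.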

Models

| component | source | size | license |
| --- | --- | --- | --- |
| base | Qwen/Qwen3-32B (BF16) | 64 GB | Apache-2.0 |
| LoRA | domizzz2025/statLens | 1 GB | Apache-2.0 |

The LoRA is auto-downloaded on first run; the base model is yours to provide.


Training (LoRA only)

The classifier LoRA was fine-tuned on top of Qwen/Qwen3-32B with LLaMA-Factory:

Adapter rank / alpha: 32 / 64
Target modules: q / k / v / o / up / down / gate proj
Optimizer · schedule: AdamW · cosine, 3 epochs (~636 steps)
Training data: curated study descriptions covering the 13 DEA scenarios

Loss curves and trainer state live under qwen3_32b_lora_v1/: training_loss.png, training_eval_loss.png, trainer_state.json, trainer_log.jsonl.

Real-world TSVs often use column conventions that differ from the curated training descriptions. In those cases the user-editable schema layer lets you correct the extraction at run time, before the label is picked.


Troubleshooting

| symptom | fix |
| --- | --- |
| Network is unreachable during pip install (mainland China) | export HF_ENDPOINT=https://hf-mirror.com and retry |
| LocalEntryNotFoundError when LoRA auto-fetches | same as above: set HF_ENDPOINT before statlens serve |
| no base model found | put Qwen3-32B in one of the auto-search paths, or pass --base-model PATH |
| CUDA out of memory on startup | a previous statlens serve is still holding GPU memory; free it with the cleanup command shown below this table |
| address already in use | a previous instance is bound; kill it first |
| LLM never becomes ready | tail ~/.cache/statlens/llm.log to see the LLaMA-Factory error |
| schema field looks wrong in the browser | edit it directly; the LLM picks the label from your edits, not the original extraction |
| Schema specified <field>=…, but no such column in TSV | a column-name field in the schema doesn't match your TSV. Either fix the field, or clear it to use auto-detection. |
| Schema reference_level=… not in observed levels | reference_level doesn't match any actual group level. Set it to one of the values shown in group_levels, or clear it. |
| No feature columns found. Expected one of these prefixes: … | rename your feature columns to start with gene_ / asv_ / prot_ / metab_ / otu_ / feat_. |
| upload exceeds N MB limit | raise the cap with STATLENS_MAX_UPLOAD_MB=500 statlens serve (default 100 MB). |

Cleanup command for CUDA out of memory:

pkill -9 -f statlens; nvidia-smi --query-compute-apps=pid --format=csv,noheader | xargs -r kill -9

The second half of that command kills every process that nvidia-smi lists as using the GPU, not only statLens, so skip it on a shared machine.


Citation

If you use statLens in academic work, please cite:

@software{statlens_2025,
  title  = {statLens: A self-hosted DEA method selector backed by a Qwen3-32B + LoRA classifier},
  author = {statLens contributors},
  year   = {2025},
  url    = {https://huggingface.co/domizzz2025/statLens},
  note   = {Apache-2.0},
}

A peer-reviewed manuscript is in preparation.
