---
license: apache-2.0
language:
- en
library_name: peft
base_model: Qwen/Qwen3-32B
tags:
- bioinformatics
- differential-expression
- DESeq2
- limma
- omics
- LoRA
pipeline_tag: text-generation
---
# statLens
## Key features
- **Self-hosted, no external API calls.** Your data never leaves the box.
- **13 differential-expression pipelines** covering every common DEA
scenario (Count / Continuous × basic / batch / paired / multi-group /
time-course / interaction, plus ZINB) — or `none_of_these` if your
study sits outside the supported space.
- **Editable schema in the middle.** statLens shows you the 21-field
study-design summary it extracted, lets you fix anything wrong, *then*
picks the pipeline.
- **End-to-end ≈ 25–45 s** per request on a single 24 GB-class NVIDIA GPU.
- **Wheel install + one command to launch.** `pip install` and
`statlens serve` are all you need.
- **Reproducible** — every run is a self-contained folder you can zip
and ship.
---
## Table of contents
- [Prerequisite](#prerequisite)
- [Quick start](#quick-start)
- [Where statlens looks for the base model](#where-statlens-looks-for-the-base-model)
- [Screenshots](#screenshots)
- [Hardware](#hardware)
- [TSV format](#tsv-format)
- [Output](#output)
- [Use cases](#use-cases)
- [Pipeline labels](#pipeline-labels)
- [Interpreting results](#interpreting-results)
- [Headless server](#headless-server-reaching-localhost7860-from-elsewhere)
- [Subcommands](#subcommands)
- [API endpoints](#api-endpoints)
- [Models](#models)
- [Training (LoRA only)](#training-lora-only)
- [Troubleshooting](#troubleshooting)
- [Citation](#citation)
---
## Prerequisite
You need **Qwen3-32B** (~64 GB BF16) on local disk. If you don't have it:
```bash
huggingface-cli download Qwen/Qwen3-32B --local-dir ~/models/qwen3-32b
```
---
## Quick start
```bash
# 1. Install statLens (~5 min on first run; pulls ~2 GB of CUDA dependencies)
pip install ${HF_ENDPOINT:-https://huggingface.co}/domizzz2025/statLens/resolve/main/statlens-0.1.11-py3-none-any.whl
# 2. Launch
statlens serve
```
After ~80 s you should see:
```
══════════════════════════════════════════════════════
✅ statLens ready
open in browser: http://localhost:7860/
Ctrl+C to stop.
══════════════════════════════════════════════════════
```
Open `http://localhost:7860`, drop a TSV, write a short study-context
description, click **Classify & run**, review the extracted schema, then
**Run pipeline →**.
---
## Where statlens looks for the base model
When you run `statlens serve`, it auto-discovers Qwen3-32B in any of:
- `~/models/qwen3-32b/`
- `/root/autodl-tmp/models/qwen3-32b/` (AutoDL)
- `/workspace/models/qwen3-32b/` (RunPod / Lambda)
- `/data/models/qwen3-32b/`, `/mnt/models/qwen3-32b/`
- the HuggingFace Hub cache
If yours is elsewhere, point `statlens serve` at it explicitly:
```bash
statlens serve --base-model /path/to/qwen3-32b
# or, persistently:
export STATLENS_BASE_MODEL=/path/to/qwen3-32b
```
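To see in advance which of the listed directories would qualify, the search can be approximated in a few lines of Python. This is a minimal sketch, not statlens's own discovery code: the search order, the `config.json` heuristic, and the helper names `looks_like_model_dir` and `find_base_model` are assumptions made for illustration, and the Hugging Face Hub cache is not checked.

```python
import os
from pathlib import Path

# Directories listed in this README. The order statlens actually checks them in
# is not documented, so the order used here is an assumption.
CANDIDATE_DIRS = [
    "~/models/qwen3-32b",
    "/root/autodl-tmp/models/qwen3-32b",
    "/workspace/models/qwen3-32b",
    "/data/models/qwen3-32b",
    "/mnt/models/qwen3-32b",
]


def looks_like_model_dir(path: Path) -> bool:
    """Heuristic (assumption): a Hugging Face model folder contains config.json."""
    return (path / "config.json").is_file()


def find_base_model(candidates=CANDIDATE_DIRS, env=None):
    """Return the first plausible model directory as a Path, or None."""
    env = os.environ if env is None else env
    override = env.get("STATLENS_BASE_MODEL")
    # Assumption: an explicit override is tried before the default locations.
    ordered = ([override] if override else []) + list(candidates)
    for raw in ordered:
        path = Path(raw).expanduser()
        if looks_like_model_dir(path):
            return path
    return None


if __name__ == "__main__":
    found = find_base_model()
    print(found if found else "no base model found in the listed directories")
```

If the script prints nothing useful, passing `--base-model` or setting `STATLENS_BASE_MODEL` as shown above remains the reliable route.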
---
## Screenshots
| | |
|---|---|
| ![New analysis form](./screenshots/01_home.png) | **Step 1 — Upload + describe.** Drop a wide-format TSV and write a short study description. |
| ![Schema review](./screenshots/02_schema.png) | **Step 2 — Review the extracted schema.** statLens shows the 21 fields it inferred from your data; edit anything that looks wrong before continuing. |
| ![Result with plots and tables](./screenshots/03_result.png) | **Step 3 — Get plots and tables.** The matched DESeq2 / limma pipeline runs and returns 5 plots and the result tables, packaged as a downloadable zip. |
---
## Hardware
| | |
|---|---|
| GPU | NVIDIA, **≥ 22 GB VRAM** (RTX 3090 / 4090 / A40 / A100 / 5090 …) |
| OS | Linux x86_64 |
| RAM | 32 GB+ |
| Disk | 75 GB free (64 GB Qwen + 1 GB LoRA + working space) |
Mac / Windows / AMD: not supported (LLaMA-Factory + bitsandbytes are CUDA-only).
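The GPU and disk requirements can be checked before installing anything. The sketch below is illustrative only: it relies on the standard `nvidia-smi --query-gpu=memory.total` query, treats the 22 GB figure as 22 × 1024 MiB (an assumption about units), and the function names are hypothetical rather than part of statlens.

```python
import os
import shutil
import subprocess

MIN_VRAM_MIB = 22 * 1024  # ">= 22 GB VRAM" from the table, read as MiB (assumption)
MIN_DISK_GB = 75          # "75 GB free" from the table


def parse_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one integer (MiB) per GPU from csv,noheader,nounits output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_vram_mib() -> list[int]:
    """Ask nvidia-smi for the total memory of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_vram_mib(result.stdout)


def preflight(vram_mib, disk_gb):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if not any(v >= MIN_VRAM_MIB for v in vram_mib):
        problems.append(f"no GPU with at least {MIN_VRAM_MIB} MiB of VRAM (found: {vram_mib})")
    if disk_gb < MIN_DISK_GB:
        problems.append(f"only {disk_gb:.0f} GB free, need {MIN_DISK_GB} GB")
    return problems


if __name__ == "__main__":
    free_gb = shutil.disk_usage(os.path.expanduser("~")).free / 1e9
    for problem in preflight(query_vram_mib(), free_gb) or ["hardware checks passed"]:
        print(problem)
```

The disk check looks at the home directory; point it at whichever volume will hold the base model if that differs.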
---
## TSV format
Wide format, one row per sample. Required columns:
| column | meaning |
|---|---|
| `sample_id` | unique per row |
| `subject_id` | repeats for paired or longitudinal samples |
| a **design column** | `group` / `condition` / `treatment` / `clinical_group` / `tumor_stage` / `subtype` / `arm` / … (fuzzy-matched) |
| **feature columns** | prefixed with **exactly** one of: `gene_`, `asv_`, `prot_`, `metab_`, `otu_`, `feat_`. Other prefixes are not recognised and the adapter will report `No feature columns found`. |
Optional: time-like columns (`time_day`, `collection_day`, `time_week`, …)
and batch-like columns (`batch`, `site`, `run`, `ms_batch`, …).
13 demonstration TSVs ship with the package — list them with:
```bash
ls $(python3 -c 'import statlens, os; print(os.path.dirname(statlens.__file__))')/data/examples/
```
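The column rules above can be checked locally before uploading a file. The following is a rough sketch using only the standard library; it does not reproduce statlens's fuzzy matching of design columns, it assumes prefixes are case-sensitive, and `check_header` / `check_tsv` are hypothetical helper names.

```python
import csv

REQUIRED = ("sample_id", "subject_id")
FEATURE_PREFIXES = ("gene_", "asv_", "prot_", "metab_", "otu_", "feat_")


def check_header(columns):
    """Return a list of problems found in a TSV header (empty list = looks fine)."""
    problems = []
    for name in REQUIRED:
        if name not in columns:
            problems.append(f"missing required column: {name}")
    # Assumption: prefix matching is case-sensitive.
    features = [c for c in columns if c.startswith(FEATURE_PREFIXES)]
    if not features:
        problems.append("no feature columns with a recognised prefix")
    # Anything that is neither an ID nor a feature could be a design column.
    others = [c for c in columns if c not in REQUIRED and c not in features]
    if not others:
        problems.append("no candidate design column (e.g. group, condition)")
    return problems


def check_tsv(path):
    """Check the header of a TSV file and the uniqueness of sample_id."""
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        columns = reader.fieldnames or []
        problems = check_header(columns)
        if "sample_id" in columns:
            ids = [row["sample_id"] for row in reader]
            if len(ids) != len(set(ids)):
                problems.append("sample_id values are not unique")
    return problems


if __name__ == "__main__":
    import sys
    for problem in check_tsv(sys.argv[1]) or ["header looks fine"]:
        print(problem)
```

A clean result here only means the header matches the documented conventions; the schema review step in the web UI is still where column roles are confirmed.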
---
## Output
Each run lands under `~/.cache/statlens/runs/<run_id>/`:
```
out/
├── statlens_report.md          # human-readable summary + reasoning
├── statlens_report.json        # machine-readable sidecar
└── pipeline_output/
    ├── volcano_plot.png
    ├── PCA_plot.png
    ├── MA_plot.png
    ├── top_DE_genes_heatmap.png
    ├── top20_DE_genes_barplot.png
    ├── results.csv
    ├── significant_genes.csv
    └── run.log
result.zip                      # everything above, packaged
```
The **Download all** button in the web UI returns `result.zip`.
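When runs are collected by a script rather than through the browser, it can help to confirm that a run folder is complete before archiving it. This sketch assumes the layout shown above, with `result.zip` next to `out/` and the plots and tables inside `out/pipeline_output/`; `missing_outputs` is a hypothetical helper, not a statlens function.

```python
from pathlib import Path

# Relative paths taken from the tree above; the exact nesting is an assumption.
EXPECTED = [
    "out/statlens_report.md",
    "out/statlens_report.json",
    "out/pipeline_output/volcano_plot.png",
    "out/pipeline_output/PCA_plot.png",
    "out/pipeline_output/MA_plot.png",
    "out/pipeline_output/top_DE_genes_heatmap.png",
    "out/pipeline_output/top20_DE_genes_barplot.png",
    "out/pipeline_output/results.csv",
    "out/pipeline_output/significant_genes.csv",
    "out/pipeline_output/run.log",
    "result.zip",
]


def missing_outputs(run_dir):
    """Return the expected files that are absent from a run directory."""
    root = Path(run_dir).expanduser()
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


if __name__ == "__main__":
    import sys
    missing = missing_outputs(sys.argv[1])
    print("complete" if not missing else "missing: " + ", ".join(missing))
```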
---
## Use cases
Three representative scenarios from the 13 supported pipelines:
| Scenario | Study | Required columns | Expected label |
|---|---|---|---|
| **Bulk RNA-seq case-control** | 30 patients, 15 case vs 15 control, single sequencing batch, looking for DE genes | `sample_id`, `subject_id`, `group`, `gene_*` | `Count_DESeq2_basic` |
| **Plasma proteomics two-arm** | LC-MS/MS, 12 cases vs 12 controls, log2 intensity, single MS run | `sample_id`, `subject_id`, `group`, `prot_*` | `Continuous_limma_basic` |
| **16S microbiome IBD vs Healthy** | sparse ASV counts dominated by zeros (>40 %) | `sample_id`, `subject_id`, `condition`, `asv_*` | `Count_DESeq2_ZINB` |
A full set of 13 paired demos — one (`.tsv` + matching `.context.txt`)
per label — lives under [`examples/`](./examples). Drop any `.tsv`
into the web UI and paste the matching `.context.txt` as your study
description to reproduce the scenario in one click.
---
## Pipeline labels
statLens classifies a study into one of **13** DEA scenarios, or returns
`none_of_these` (a 14th "kill-switch" output) when the design falls outside
its training space.
| family | label | when |
|---|---|---|
| Count (DESeq2) | `Count_DESeq2_basic` | 2 groups, no batch / time / pairing |
| | `Count_DESeq2_with_batch` | 2 groups + batch covariate |
| | `Count_DESeq2_paired_or_repeated` | matched samples within subject |
| | `Count_DESeq2_multi_group` | ≥ 3 independent groups |
| | `Count_DESeq2_time_course` | single cohort, ≥ 3 time points |
| | `Count_DESeq2_group_time_interaction` | ≥ 2 groups × multiple time points |
| | `Count_DESeq2_ZINB` | counts dominated by zeros (>40 %), e.g. 16S |
| Continuous (limma) | `Continuous_limma_basic` | 2 groups, no batch / time / pairing |
| | `Continuous_limma_with_batch` | 2 groups + batch covariate |
| | `Continuous_limma_paired_or_repeated` | pre/post or matched samples |
| | `Continuous_limma_multi_group` | ≥ 3 independent groups |
| | `Continuous_limma_time_course` | single cohort, ≥ 3 time points |
| | `Continuous_limma_group_time_interaction` | groups × time within subject |
| (decline) | `none_of_these` | survival / network inference / single-sample / non-omics — no forced fit |
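The label names in the table follow a regular `<family>_<engine>_<design>` pattern, which makes them easy to handle in downstream scripts. The snippet below is a small sketch of that naming scheme only; it says nothing about how the classifier chooses a label, and `parse_label` is a hypothetical helper.

```python
FAMILIES = {"Count": "DESeq2", "Continuous": "limma"}
DESIGNS = [
    "basic",
    "with_batch",
    "paired_or_repeated",
    "multi_group",
    "time_course",
    "group_time_interaction",
]

# Six designs for each family, plus the zero-inflated count pipeline.
LABELS = [
    f"{family}_{engine}_{design}"
    for family, engine in FAMILIES.items()
    for design in DESIGNS
] + ["Count_DESeq2_ZINB"]

DECLINE = "none_of_these"


def parse_label(label):
    """Split a label into (family, engine, design); return None for the decline output."""
    if label == DECLINE:
        return None
    if label not in LABELS:
        raise ValueError(f"unknown pipeline label: {label}")
    family, engine, design = label.split("_", 2)
    return family, engine, design


if __name__ == "__main__":
    for label in LABELS + [DECLINE]:
        print(label, "->", parse_label(label))
```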
---
## Interpreting results
Every successful run produces 5 plots and 2 result tables under
`pipeline_output/`:
| File | What it shows |
|---|---|
| `volcano_plot.png` | Each feature plotted by log2 fold-change (x) vs −log10 adjusted p-value (y). Top-right and top-left points are the significantly up- and down-regulated features. |
| `MA_plot.png` | Log2 fold-change (y) vs mean expression (x). Diagnostic for fold-change vs abundance bias. |
| `PCA_plot.png` | First two principal components of the normalized expression matrix, colored by group. Sanity check for class separation. |
| `top_DE_genes_heatmap.png` | Top 20 most-significant DE features as a heatmap of z-scored expression across samples. |
| `top20_DE_genes_barplot.png` | Top 20 features by absolute log2 fold-change as a barplot. |
| `results.csv` | Full DE table — `feature_id`, `log2FoldChange`, `lfcSE`, `stat`, `pvalue`, `padj`. |
| `significant_genes.csv` | Subset of `results.csv` filtered at `padj < 0.05` (or the family default). |
For paired / time-course / interaction designs the `results.csv` schema
is the same; only the underlying model and the contrast definition
change. See `statlens_report.md` produced alongside the run for the
exact model formula used.
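To apply a different threshold than the one used for `significant_genes.csv`, `results.csv` can be filtered directly. The sketch below assumes the column names listed in the table (`feature_id`, `log2FoldChange`, `padj`) and that missing values appear as `NA` or empty strings; the extra fold-change cutoff is an illustrative addition, not something statlens applies.

```python
import csv
import math


def to_float(text):
    """Convert a CSV cell to float; return None for NA, empty, or NaN values."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return None
    return None if math.isnan(value) else value


def significant(rows, alpha=0.05, min_abs_lfc=0.0):
    """Keep rows with padj < alpha and |log2FoldChange| >= min_abs_lfc, sorted by padj."""
    kept = []
    for row in rows:
        padj = to_float(row.get("padj"))
        lfc = to_float(row.get("log2FoldChange"))
        if padj is None or lfc is None:
            continue  # rows without a usable p-value or fold-change are skipped
        if padj < alpha and abs(lfc) >= min_abs_lfc:
            kept.append(row)
    return sorted(kept, key=lambda r: float(r["padj"]))


def load_results(path):
    """Read results.csv into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


if __name__ == "__main__":
    import sys
    for row in significant(load_results(sys.argv[1]), alpha=0.01, min_abs_lfc=1.0):
        print(row["feature_id"], row["log2FoldChange"], row["padj"])
```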
---
## Headless server: reaching `localhost:7860` from elsewhere
If `statlens serve` cannot open a browser (AutoDL, RunPod, Lambda, …), use
one of these:
| | command | works for |
|---|---|---|
| **Public URL** | `cloudflared tunnel --url http://localhost:7860` | any device, any network |
| **SSH tunnel** | `ssh -fNL 7860:localhost:7860 user@server` (run on your laptop) | quick local dev |
| **curl only** | `curl -X POST http://localhost:7860/api/run -F "context=..." -F "tsv=@data.tsv"` | scripting / no browser |
---
## Subcommands
```
statlens serve # main entry point
statlens download # pre-fetch the LoRA only (~1 GB)
statlens info # show GPU / cache / paths
statlens classify --tsv DATA --context CTX --out DIR
# one-shot CLI mode (no browser)
statlens --version
```
`statlens classify` runs both LLM stages back-to-back without a review pause —
useful for batch processing.
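For batch processing, one way to drive `statlens classify` over a folder of paired files is a small wrapper script. This is a sketch under stated assumptions: that `--context` accepts a path to a text file (the README does not say whether it expects a path or the text itself), that inputs are paired as `NAME.tsv` + `NAME.context.txt`, and that one output directory per input is wanted. `build_commands` is a hypothetical helper.

```python
import subprocess
from pathlib import Path


def build_commands(examples_dir, out_root):
    """Build one `statlens classify` command per TSV that has a matching context file."""
    commands = []
    for tsv in sorted(Path(examples_dir).glob("*.tsv")):
        context = tsv.with_name(tsv.stem + ".context.txt")
        if not context.is_file():
            continue  # skip TSVs without a study description
        out_dir = Path(out_root) / tsv.stem
        commands.append([
            "statlens", "classify",
            "--tsv", str(tsv),
            "--context", str(context),  # assumption: a file path is accepted here
            "--out", str(out_dir),
        ])
    return commands


if __name__ == "__main__":
    import sys
    for command in build_commands(sys.argv[1], sys.argv[2]):
        print("running:", " ".join(command))
        subprocess.run(command, check=True)
```

Each invocation is a separate process, so if model loading dominates the run time, posting to a running `statlens serve` instance may be the faster option.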
---
## API endpoints
| route | method | purpose |
|---|---|---|
| `/` | GET | serve the web UI |
| `/api/extract` | POST (multipart: `tsv`, `context`) | stage 1 — return a SchemaSummary |
| `/api/run_pipeline` | POST (form: `run_id`, `schema` JSON) | stage 3 — pick label + run pipeline |
| `/api/run` | POST (multipart: `tsv`, `context`) | legacy single-shot path (no review) |
| `/api/artifact/{run_id}/{filename}` | GET | fetch a single PNG/CSV |
| `/api/zip/{run_id}` | GET | fetch the packaged result |
| `/api/csv_preview/{run_id}/{filename}` | GET | first N rows of a result CSV as JSON |
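The two-stage flow (extract, review, run) can also be scripted against these routes. The sketch below uses only the standard library and rests on several assumptions that the table does not confirm: that both POST routes answer with JSON, that `/api/run_pipeline` accepts URL-encoded form data, and that the `run_id` needed for stage 3 is available from the stage 1 response. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
import uuid

BASE_URL = "http://localhost:7860"


def encode_multipart(fields, files):
    """Return (body, content_type) for a multipart/form-data request."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    for name, (filename, data) in files.items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            "Content-Type: text/tab-separated-values\r\n\r\n"
        )
        parts.append(header.encode("utf-8") + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def post_json(url, body, content_type):
    """POST a body and decode the response as JSON (assumed response format)."""
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_schema(tsv_path, context, base_url=BASE_URL):
    """Stage 1: upload the TSV and study description to /api/extract."""
    with open(tsv_path, "rb") as fh:
        data = fh.read()
    body, content_type = encode_multipart({"context": context}, {"tsv": ("data.tsv", data)})
    return post_json(f"{base_url}/api/extract", body, content_type)


def run_pipeline(run_id, schema, base_url=BASE_URL):
    """Stage 3: send the (possibly edited) schema to /api/run_pipeline."""
    form = urllib.parse.urlencode({"run_id": run_id, "schema": json.dumps(schema)})
    return post_json(
        f"{base_url}/api/run_pipeline",
        form.encode("utf-8"),
        "application/x-www-form-urlencoded",
    )
```

Between the two calls, inspect the stage 1 response, edit any schema fields that look wrong, and pass the result to `run_pipeline`; the exact keys of that response are not documented here, so print it once before relying on them.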
---
## Models
| component | source | size | license |
|---|---|---|---|
| base | [`Qwen/Qwen3-32B`](https://huggingface.co/Qwen/Qwen3-32B) (BF16) | 64 GB | Apache-2.0 |
| LoRA | [`domizzz2025/statLens`](https://huggingface.co/domizzz2025/statLens) | 1 GB | Apache-2.0 |
The LoRA is auto-downloaded on first run; the base model is yours to provide.
---
## Training (LoRA only)
The classifier LoRA was fine-tuned on top of `Qwen/Qwen3-32B` with
[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory):
| | |
|---|---|
| Adapter rank / alpha | 32 / 64 |
| Target modules | q / k / v / o / up / down / gate proj |
| Optimizer · schedule | AdamW · cosine, 3 epochs (~636 steps) |
| Training data | curated study descriptions covering the 13 DEA scenarios |
Loss curves and trainer state live under
[`qwen3_32b_lora_v1/`](./qwen3_32b_lora_v1):
`training_loss.png`, `training_eval_loss.png`, `trainer_state.json`,
`trainer_log.jsonl`.
When a real-world TSV uses non-canonical column conventions, the
user-editable schema layer lets you correct the extraction at run time.
---
## Troubleshooting
| symptom | fix |
|---|---|
| `Network is unreachable` during `pip install` (mainland China) | `export HF_ENDPOINT=https://hf-mirror.com` and retry |
| `LocalEntryNotFoundError` when LoRA auto-fetches | same as above — set `HF_ENDPOINT` before `statlens serve` |
| `no base model found` | put Qwen3-32B in one of the auto-search paths, or pass `--base-model PATH` |
| `CUDA out of memory` on startup | a previous `statlens serve` is still holding GPU memory: `pkill -9 -f statlens; nvidia-smi --query-compute-apps=pid --format=csv,noheader \| xargs -r kill -9`. The second command kills every compute process on the GPU, so skip it on a shared machine. |
| `address already in use` | a previous instance is bound — kill it first |
| LLM never becomes ready | tail `~/.cache/statlens/llm.log` to see the LLaMA-Factory error |
| schema field looks wrong in the browser | edit it directly; the LLM picks the label from your edits, not the original extraction |
| `Schema specified <field>=…, but no such column in TSV` | a column-name field in the schema doesn't match your TSV. Either fix the field, or clear it to use auto-detection. |
| `Schema reference_level=… not in observed levels` | `reference_level` doesn't match any actual group level. Set it to one of the values shown in `group_levels`, or clear it. |
| `No feature columns found. Expected one of these prefixes: …` | rename your feature columns to start with `gene_` / `asv_` / `prot_` / `metab_` / `otu_` / `feat_`. |
| `upload exceeds N MB limit` | raise the cap with `STATLENS_MAX_UPLOAD_MB=500 statlens serve` (default 100 MB). |
---
## Source · License
- Wheel + LoRA + source : <https://huggingface.co/domizzz2025/statLens>
- License : Apache-2.0
---
## Citation
If you use statLens in academic work, please cite:
```bibtex
@software{statlens_2025,
title = {statLens: A self-hosted DEA method selector backed by a Qwen3-32B + LoRA classifier},
author = {statLens contributors},
year = {2025},
url = {https://huggingface.co/domizzz2025/statLens},
note = {Apache-2.0},
}
```
A peer-reviewed manuscript is in preparation.