	---
	license: apache-2.0
	language:
	- en
	library_name: peft
	base_model: Qwen/Qwen3-32B
	tags:
	- bioinformatics
	- differential-expression
	- DESeq2
	- limma
	- omics
	- LoRA
	pipeline_tag: text-generation
	---

	# statLens

	## Key features

	- Self-hosted, no external API calls. Your data never leaves the box.
	- 13 differential-expression analysis (DEA) pipelines covering the common
	scenarios (Count / Continuous × basic / batch / paired / multi-group /
	time-course / interaction, plus ZINB), or `none_of_these` if your
	study sits outside the supported space.
	- Editable schema in the middle. statLens shows you the 21-field
	study-design summary it extracted, lets you fix anything wrong, then
	picks the pipeline.
	- End-to-end ≈ 25–45 s per request on a single 24 GB-class NVIDIA GPU.
	- Wheel install plus one command to launch: `pip install` and
	`statlens serve` are all you need.
	- Reproducible — every run is a self-contained folder you can zip
	and ship.

	---

	## Table of contents

	- [Prerequisite](#prerequisite)
	- [Quick start](#quick-start)
	- [Where statlens looks for the base model](#where-statlens-looks-for-the-base-model)
	- [Screenshots](#screenshots)
	- [Hardware](#hardware)
	- [TSV format](#tsv-format)
	- [Output](#output)
	- [Use cases](#use-cases)
	- [Pipeline labels](#pipeline-labels)
	- [Interpreting results](#interpreting-results)
	- [Headless server](#headless-server-reaching-localhost7860-from-elsewhere)
	- [Subcommands](#subcommands)
	- [API endpoints](#api-endpoints)
	- [Models](#models)
	- [Training (LoRA only)](#training-lora-only)
	- [Troubleshooting](#troubleshooting)
	- [Citation](#citation)

	---

	## Prerequisite

	You need Qwen3-32B (~64 GB BF16) on local disk. If you don't have it:

	```bash
	huggingface-cli download Qwen/Qwen3-32B --local-dir ~/models/qwen3-32b
	```

	---

	## Quick start

	```bash
	# 1. Install statLens (~5 min on first run; pulls ~2 GB of CUDA dependencies)
	pip install ${HF_ENDPOINT:-https://huggingface.co}/domizzz2025/statLens/resolve/main/statlens-0.1.11-py3-none-any.whl

	# 2. Launch
	statlens serve
	```

	After ~80 s you should see:

	```
	══════════════════════════════════════════════════════
	✅ statLens ready
	open in browser: http://localhost:7860/
	Ctrl+C to stop.
	══════════════════════════════════════════════════════
	```

	Open `http://localhost:7860/`, drop in a TSV, write a short study-context
	description, and click "Classify & run". Review the extracted schema, then
	click "Run pipeline →".

	---

	## Where statlens looks for the base model

	When you run `statlens serve`, it auto-discovers Qwen3-32B in any of:

	- `~/models/qwen3-32b/`
	- `/root/autodl-tmp/models/qwen3-32b/` (AutoDL)
	- `/workspace/models/qwen3-32b/` (RunPod / Lambda)
	- `/data/models/qwen3-32b/`, `/mnt/models/qwen3-32b/`
	- the HuggingFace Hub cache

	If yours is elsewhere, point `statlens serve` at it explicitly:

	```bash
	statlens serve --base-model /path/to/qwen3-32b
	# or, persistently:
	export STATLENS_BASE_MODEL=/path/to/qwen3-32b
	```
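
The lookup described above can be pictured as a first-match search over a list of directories. The following is a minimal sketch, not statLens's actual code: `find_base_model` is a hypothetical helper, the precedence (explicit path, then `STATLENS_BASE_MODEL`, then the listed directories) is an assumption, and the Hugging Face Hub cache lookup is left out.

```python
import os
from pathlib import Path

# Directories named in this README, in the order they are listed.
# The real search order is not documented here; this order is an assumption.
CANDIDATE_DIRS = [
    "~/models/qwen3-32b",
    "/root/autodl-tmp/models/qwen3-32b",
    "/workspace/models/qwen3-32b",
    "/data/models/qwen3-32b",
    "/mnt/models/qwen3-32b",
]


def find_base_model(explicit=None, env=None, candidates=CANDIDATE_DIRS):
    """Return the first existing directory, or None (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumed precedence: --base-model value, then the env var, then the list.
    ordered = [explicit, env.get("STATLENS_BASE_MODEL"), *candidates]
    for raw in ordered:
        if not raw:
            continue
        path = Path(raw).expanduser()
        if path.is_dir():
            return path
    return None


print(find_base_model())
```

Running it on your machine shows which of the listed locations, if any, already holds a model directory.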

	---

	## Screenshots

	\| \| \|
	\|---\|---\|
	\| ![New analysis form](./screenshots/01_home.png) \| Step 1 — Upload + describe. Drop a wide-format TSV and write a short study description. \|
	\| ![Schema review](./screenshots/02_schema.png) \| Step 2 — Review the extracted schema. statLens shows the 21 fields it inferred from your data; edit anything that looks wrong before continuing. \|
	\| ![Result with plots and tables](./screenshots/03_result.png) \| Step 3 — Get plots and tables. The matched DESeq2 / limma pipeline runs and returns 5 plots and the result tables, packaged as a downloadable zip. \|

	---

	## Hardware

	\| \| \|
	\|---\|---\|
	\| GPU \| NVIDIA, ≥ 22 GB VRAM (RTX 3090 / 4090 / A40 / A100 / 5090 …) \|
	\| OS \| Linux x86_64 \|
	\| RAM \| 32 GB+ \|
	\| Disk \| 75 GB free (64 GB Qwen + 1 GB LoRA + working space) \|

	Mac / Windows / AMD: not supported (the LLaMA-Factory + bitsandbytes stack that statLens uses requires CUDA).

	---

	## TSV format

	Wide format, one row per sample. Required columns:

	\| column \| meaning \|
	\|---\|---\|
	\| `sample_id` \| unique per row \|
	\| `subject_id` \| repeats for paired or longitudinal samples \|
	\| a design column \| `group` / `condition` / `treatment` / `clinical_group` / `tumor_stage` / `subtype` / `arm` / … (fuzzy-matched) \|
	\| feature columns \| prefixed with exactly one of: `gene_`, `asv_`, `prot_`, `metab_`, `otu_`, `feat_`. Other prefixes are not recognised and the adapter will report `No feature columns found`. \|

	Optional: time-like columns (`time_day`, `collection_day`, `time_week`, …)
	and batch-like columns (`batch`, `site`, `run`, `ms_batch`, …).
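
The column rules above can be checked before uploading. This is a minimal sketch under stated assumptions: `check_tsv_header` is a hypothetical helper (not part of statLens), and it only accepts the design-column names spelled out in the table, whereas statLens itself fuzzy-matches a wider set.

```python
import csv
import io

FEATURE_PREFIXES = ("gene_", "asv_", "prot_", "metab_", "otu_", "feat_")
# Only the names listed in the table; statLens fuzzy-matches more than these.
DESIGN_NAMES = {"group", "condition", "treatment", "clinical_group",
                "tumor_stage", "subtype", "arm"}


def check_tsv_header(handle):
    """Return a list of problems found in the header row (hypothetical helper)."""
    header = next(csv.reader(handle, delimiter="\t"), [])
    problems = []
    for required in ("sample_id", "subject_id"):
        if required not in header:
            problems.append(f"missing column: {required}")
    if not DESIGN_NAMES.intersection(header):
        problems.append("no design column with a listed name")
    if not any(col.startswith(FEATURE_PREFIXES) for col in header):
        problems.append("no feature columns with a supported prefix")
    return problems


good = "sample_id\tsubject_id\tgroup\tgene_TP53\tgene_BRCA1\nS1\tP1\tcase\t10\t0\n"
bad = "sample_id\tgroup\tTP53\nS1\tcase\t10\n"
print(check_tsv_header(io.StringIO(good)))  # []
print(check_tsv_header(io.StringIO(bad)))
```

For a file on disk, pass `open("data.tsv", newline="")` instead of the `io.StringIO` wrapper.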

	13 demonstration TSVs ship with the package — list them with:

	```bash
	ls $(python3 -c 'import statlens, os; print(os.path.dirname(statlens.__file__))')/data/examples/
	```

	---

	## Output

	Each run lands under `~/.cache/statlens/runs/<run_id>/`:

	```
	out/
	├── statlens_report.md              # human-readable summary + reasoning
	├── statlens_report.json            # machine-readable sidecar
	└── pipeline_output/
	    ├── volcano_plot.png
	    ├── PCA_plot.png
	    ├── MA_plot.png
	    ├── top_DE_genes_heatmap.png
	    ├── top20_DE_genes_barplot.png
	    ├── results.csv
	    ├── significant_genes.csv
	    └── run.log
	result.zip                          # everything above, packaged
	```

	The Download all button in the web UI returns `result.zip`.
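
When scripting against run folders, it can help to confirm that a run produced every file in the listing above. This is a minimal sketch: `missing_artifacts` is a hypothetical helper, and it assumes that `pipeline_output/` is nested under `out/` and that `result.zip` sits next to `out/` inside the run folder.

```python
from pathlib import Path

# Relative paths taken from the listing above (nesting is an assumption).
EXPECTED = [
    "out/statlens_report.md",
    "out/statlens_report.json",
    "out/pipeline_output/volcano_plot.png",
    "out/pipeline_output/PCA_plot.png",
    "out/pipeline_output/MA_plot.png",
    "out/pipeline_output/top_DE_genes_heatmap.png",
    "out/pipeline_output/top20_DE_genes_barplot.png",
    "out/pipeline_output/results.csv",
    "out/pipeline_output/significant_genes.csv",
    "out/pipeline_output/run.log",
    "result.zip",
]


def missing_artifacts(run_dir):
    """Return the expected files that are absent from run_dir."""
    run_dir = Path(run_dir).expanduser()
    return [rel for rel in EXPECTED if not (run_dir / rel).is_file()]
```

Call it as `missing_artifacts("~/.cache/statlens/runs/<run_id>")` with a real run ID; an empty list means the run folder is complete.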

	---

	## Use cases

	Three representative scenarios from the 13 supported pipelines:

	\| Scenario \| Study \| Required columns \| Expected label \|
	\|---\|---\|---\|---\|
	\| Bulk RNA-seq case-control \| 30 patients, 15 case vs 15 control, single sequencing batch, looking for DE genes \| `sample_id`, `subject_id`, `group`, `gene_*` \| `Count_DESeq2_basic` \|
	\| Plasma proteomics two-arm \| LC-MS/MS, 12 cases vs 12 controls, log2 intensity, single MS run \| `sample_id`, `subject_id`, `group`, `prot_*` \| `Continuous_limma_basic` \|
	\| 16S microbiome IBD vs Healthy \| sparse ASV counts dominated by zeros (>40 %) \| `sample_id`, `subject_id`, `condition`, `asv_*` \| `Count_DESeq2_ZINB` \|

	A full set of 13 paired demos — one (`.tsv` + matching `.context.txt`)
	per label — lives under [`examples/`](./examples). Drop any `.tsv`
	into the web UI and paste the matching `.context.txt` as your study
	description to reproduce the scenario in one click.
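
The ZINB row above hinges on how sparse the count matrix is. This is a minimal sketch for estimating that yourself; it assumes the zero fraction is taken over all feature cells in the matrix, which may differ from how statLens measures it, and `zero_fraction` is a hypothetical helper.

```python
import csv
import io

FEATURE_PREFIXES = ("gene_", "asv_", "prot_", "metab_", "otu_", "feat_")


def zero_fraction(handle):
    """Fraction of feature cells equal to zero in a wide-format TSV."""
    zeros = total = 0
    for row in csv.DictReader(handle, delimiter="\t"):
        for name, value in row.items():
            if name and name.startswith(FEATURE_PREFIXES):
                total += 1
                if float(value) == 0:
                    zeros += 1
    return zeros / total if total else 0.0


demo = (
    "sample_id\tsubject_id\tcondition\tasv_1\tasv_2\tasv_3\n"
    "S1\tP1\tIBD\t0\t0\t5\n"
    "S2\tP2\tHealthy\t0\t3\t0\n"
)
print(round(zero_fraction(io.StringIO(demo)), 3))  # 0.667
```

Here four of the six ASV cells are zero, which is above the 40 % mark quoted in the table.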

	---

	## Pipeline labels

	statLens classifies a study into one of 13 DEA scenarios, or returns
	`none_of_these` (a 14th "kill-switch" output) when the design falls outside
	its training space.

	\| family \| label \| when \|
	\|---\|---\|---\|
	\| Count (DESeq2) \| `Count_DESeq2_basic` \| 2 groups, no batch / time / pairing \|
	\| \| `Count_DESeq2_with_batch` \| 2 groups + batch covariate \|
	\| \| `Count_DESeq2_paired_or_repeated` \| matched samples within subject \|
	\| \| `Count_DESeq2_multi_group` \| ≥ 3 independent groups \|
	\| \| `Count_DESeq2_time_course` \| single cohort, ≥ 3 time points \|
	\| \| `Count_DESeq2_group_time_interaction` \| ≥ 2 groups × multiple time points \|
	\| \| `Count_DESeq2_ZINB` \| counts dominated by zeros (>40 %), e.g. 16S \|
	\| Continuous (limma) \| `Continuous_limma_basic` \| 2 groups, no batch / time / pairing \|
	\| \| `Continuous_limma_with_batch` \| 2 groups + batch covariate \|
	\| \| `Continuous_limma_paired_or_repeated` \| pre/post or matched samples \|
	\| \| `Continuous_limma_multi_group` \| ≥ 3 independent groups \|
	\| \| `Continuous_limma_time_course` \| single cohort, ≥ 3 time points \|
	\| \| `Continuous_limma_group_time_interaction` \| groups × time within subject \|
	\| (decline) \| `none_of_these` \| survival / network inference / single-sample / non-omics — no forced fit \|
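
The label names follow a `<family>_<engine>_<design>` pattern, which makes them easy to enumerate and split in downstream scripts. This is a minimal sketch built only from the table above; `split_label` is a hypothetical helper, not a statLens function.

```python
DESIGNS = [
    "basic",
    "with_batch",
    "paired_or_repeated",
    "multi_group",
    "time_course",
    "group_time_interaction",
]
FAMILIES = {"Count": "DESeq2", "Continuous": "limma"}
DECLINE = "none_of_these"

# 2 families x 6 designs, plus the count-only ZINB label.
LABELS = [
    f"{family}_{engine}_{design}"
    for family, engine in FAMILIES.items()
    for design in DESIGNS
] + ["Count_DESeq2_ZINB"]


def split_label(label):
    """Return (family, engine, design), or None for the decline label."""
    if label == DECLINE:
        return None
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    family, engine, design = label.split("_", 2)
    return family, engine, design


print(len(LABELS))  # 13
print(split_label("Count_DESeq2_paired_or_repeated"))
```

Splitting on the first two underscores only keeps multi-word designs such as `paired_or_repeated` intact.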

	---

	## Interpreting results

	Every successful run produces 5 plots and 2 result tables under
	`pipeline_output/`:

	\| File \| What it shows \|
	\|---\|---\|
	\| `volcano_plot.png` \| Each feature plotted by log2 fold-change (x) vs −log10 adjusted p-value (y). Top-right and top-left points are the significantly up- and down-regulated features. \|
	\| `MA_plot.png` \| Log2 fold-change (y) vs mean expression (x). Diagnostic for fold-change vs abundance bias. \|
	\| `PCA_plot.png` \| First two principal components of the normalized expression matrix, colored by group. Sanity check for class separation. \|
	\| `top_DE_genes_heatmap.png` \| Top 20 most-significant DE features as a heatmap of z-scored expression across samples. \|
	\| `top20_DE_genes_barplot.png` \| Top 20 features by absolute log2 fold-change as a barplot. \|
	\| `results.csv` \| Full DE table — `feature_id`, `log2FoldChange`, `lfcSE`, `stat`, `pvalue`, `padj`. \|
	\| `significant_genes.csv` \| Subset of `results.csv` filtered at `padj < 0.05` (or the family default). \|

	For paired / time-course / interaction designs the `results.csv` schema
	is the same; only the underlying model and the contrast definition
	change. See `statlens_report.md` produced alongside the run for the
	exact model formula used.
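
To apply a different cutoff than the one used for `significant_genes.csv`, you can filter `results.csv` yourself. This is a minimal sketch that relies only on the column names listed above; the treatment of unparseable `padj` values (such as `NA`) as non-significant is an assumption, and the sample rows are made up for illustration.

```python
import csv
import io


def significant(handle, alpha=0.05):
    """Rows with padj < alpha, sorted by absolute log2 fold-change (descending)."""
    kept = []
    for row in csv.DictReader(handle):
        try:
            padj = float(row["padj"])
        except ValueError:
            continue  # e.g. "NA" or an empty cell
        if padj < alpha:
            kept.append(row)
    kept.sort(key=lambda r: abs(float(r["log2FoldChange"])), reverse=True)
    return kept


# Illustrative rows only, not real output.
demo = """feature_id,log2FoldChange,lfcSE,stat,pvalue,padj
gene_A,2.5,0.4,6.25,1e-9,1e-7
gene_B,-3.1,0.5,-6.2,2e-9,1e-7
gene_C,0.2,0.3,0.67,0.5,0.8
gene_D,1.0,0.9,1.1,0.27,NA
"""
print([row["feature_id"] for row in significant(io.StringIO(demo))])
# ['gene_B', 'gene_A']
```

For a real run, replace the `io.StringIO` wrapper with `open(".../pipeline_output/results.csv", newline="")`.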

	---

	## Headless server: reaching `localhost:7860` from elsewhere

	If `statlens serve` cannot open a browser (AutoDL, RunPod, Lambda, …), use
	one of these:

	\| \| command \| works for \|
	\|---\|---\|---\|
	\| Public URL \| `cloudflared tunnel --url http://localhost:7860` \| any device, any network \|
	\| SSH tunnel \| `ssh -fNL 7860:localhost:7860 user@server` (run on your laptop) \| quick local dev \|
	\| curl only \| `curl -X POST http://localhost:7860/api/run -F "context=..." -F "tsv=@data.tsv"` \| scripting / no browser \|

	---

	## Subcommands

	```
	statlens serve       # main entry point
	statlens download    # pre-fetch the LoRA only (~1 GB)
	statlens info        # show GPU / cache / paths
	statlens classify --tsv DATA --context CTX --out DIR
	                     # one-shot CLI mode (no browser)
	statlens --version
	```

	`statlens classify` runs both LLM stages back-to-back without a review pause —
	useful for batch processing.

	---

	## API endpoints

	\| route \| method \| purpose \|
	\|---\|---\|---\|
	\| `/` \| GET \| serve the web UI \|
	\| `/api/extract` \| POST (multipart: `tsv`, `context`) \| stage 1: return a SchemaSummary \|
	\| `/api/run_pipeline` \| POST (form: `run_id`, `schema` JSON) \| stage 3: pick label + run pipeline (stage 2 is the schema review in the browser) \|
	\| `/api/run` \| POST (multipart: `tsv`, `context`) \| legacy single-shot path (no review) \|
	\| `/api/artifact/{run_id}/{filename}` \| GET \| fetch a single PNG/CSV \|
	\| `/api/zip/{run_id}` \| GET \| fetch the packaged result \|
	\| `/api/csv_preview/{run_id}/{filename}` \| GET \| first N rows of a result CSV as JSON \|
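
The reviewed flow can also be driven from a script by calling the two staged routes in order. This is a minimal sketch, not a documented client: `run_two_stage` is a hypothetical helper, and the assumption that the `/api/extract` response is JSON with `run_id` and `schema` keys is not confirmed by this README, so inspect a real response before relying on it. The HTTP call is passed in as `post`, which is expected to behave like `requests.post`.

```python
import json


def run_two_stage(post, base_url, tsv_path, context,
                  edit_schema=lambda schema: schema):
    """Call /api/extract, apply edit_schema, then call /api/run_pipeline.

    `post` is any callable with the requests.post interface.
    """
    with open(tsv_path, "rb") as tsv:
        extracted = post(f"{base_url}/api/extract",
                         files={"tsv": tsv}, data={"context": context})
    extracted.raise_for_status()
    body = extracted.json()
    # ASSUMPTION: the stage-1 response carries "run_id" and "schema" keys.
    schema = edit_schema(body["schema"])
    result = post(f"{base_url}/api/run_pipeline",
                  data={"run_id": body["run_id"],
                        "schema": json.dumps(schema)})
    result.raise_for_status()
    return body["run_id"], result.json()
```

With the `requests` package installed, a call would look like `run_two_stage(requests.post, "http://localhost:7860", "data.tsv", "15 case vs 15 control ...")`. Passing the HTTP function in keeps the sketch free of third-party imports and lets you substitute a stub when no server is running.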

	---

	## Models

	\| component \| source \| size \| license \|
	\|---\|---\|---\|---\|
	\| base \| [`Qwen/Qwen3-32B`](https://huggingface.co/Qwen/Qwen3-32B) (BF16) \| 64 GB \| Apache-2.0 \|
	\| LoRA \| [`domizzz2025/statLens`](https://huggingface.co/domizzz2025/statLens) \| 1 GB \| Apache-2.0 \|

	The LoRA is auto-downloaded on first run; the base model is yours to provide.

	---

	## Training (LoRA only)

	The classifier LoRA was fine-tuned on top of `Qwen/Qwen3-32B` with
	[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory):

	\| \| \|
	\|---\|---\|
	\| Adapter rank / alpha \| 32 / 64 \|
	\| Target modules \| q / k / v / o / up / down / gate proj \|
	\| Optimizer · schedule \| AdamW · cosine, 3 epochs (~636 steps) \|
	\| Training data \| curated study descriptions covering the 13 DEA scenarios \|

	Loss curves and trainer state live under
	[`qwen3_32b_lora_v1/`](./qwen3_32b_lora_v1):
	`training_loss.png`, `training_eval_loss.png`, `trainer_state.json`,
	`trainer_log.jsonl`.

	The classifier may misread real-world TSVs whose column conventions are
	non-canonical; the user-editable schema layer lets you correct those fields
	at run time, before the pipeline is picked.
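
For readers who want to relate the table to a PEFT-style configuration, the hyperparameters can be written out as plain data. This is a minimal sketch: the dictionary keys mirror the argument names of PEFT's `LoraConfig` (`r`, `lora_alpha`, `target_modules`), the full `*_proj` module names are an assumed expansion of the abbreviated list in the table, and the step count is the approximate figure quoted there.

```python
# Keys mirror PEFT's LoraConfig argument names; values come from the table.
lora = {
    "r": 32,
    "lora_alpha": 64,
    # Assumed expansion of "q / k / v / o / up / down / gate proj".
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "down_proj", "gate_proj",
    ],
}

# Standard LoRA scales the low-rank update by lora_alpha / r.
scaling = lora["lora_alpha"] / lora["r"]

# About 636 optimizer steps over 3 epochs.
total_steps, epochs = 636, 3
steps_per_epoch = total_steps / epochs

print(scaling, steps_per_epoch)  # 2.0 212.0
```

A scaling factor of 2 means the adapter's update is applied at twice the magnitude of the raw low-rank product.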

	---

	## Troubleshooting

	\| symptom \| fix \|
	\|---\|---\|
	\| `Network is unreachable` during `pip install` (mainland China) \| `export HF_ENDPOINT=https://hf-mirror.com` and retry \|
	\| `LocalEntryNotFoundError` when LoRA auto-fetches \| same as above — set `HF_ENDPOINT` before `statlens serve` \|
	\| `no base model found` \| put Qwen3-32B in one of the auto-search paths, or pass `--base-model PATH` \|
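	\| `CUDA out of memory` on startup \| a previous `statlens serve` is probably still holding GPU memory. Stop it with `pkill -9 -f statlens`. If memory is still held, `nvidia-smi --query-compute-apps=pid --format=csv,noheader \\| xargs -r kill -9` kills every compute process on the GPU, so use it only on a machine you are not sharing. \|
	\| `address already in use` \| a previous instance is still bound to the port; stop it first (e.g. `pkill -f statlens`) \|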
	\| LLM never becomes ready \| tail `~/.cache/statlens/llm.log` to see the LLaMA-Factory error \|
	\| schema field looks wrong in the browser \| edit it directly; the LLM picks the label from your edits, not the original extraction \|
	\| `Schema specified <field>=…, but no such column in TSV` \| a column-name field in the schema doesn't match your TSV. Either fix the field, or clear it to use auto-detection. \|
	\| `Schema reference_level=… not in observed levels` \| `reference_level` doesn't match any actual group level. Set it to one of the values shown in `group_levels`, or clear it. \|
	\| `No feature columns found. Expected one of these prefixes: …` \| rename your feature columns to start with `gene_` / `asv_` / `prot_` / `metab_` / `otu_` / `feat_`. \|
	\| `upload exceeds N MB limit` \| raise the cap with `STATLENS_MAX_UPLOAD_MB=500 statlens serve` (default 100 MB). \|

	---

	## Source · License

	- Wheel + LoRA + source: <https://huggingface.co/domizzz2025/statLens>
	- License: Apache-2.0

	---

	## Citation

	If you use statLens in academic work, please cite:

	```bibtex
	@software{statlens_2025,
	  title  = {statLens: A self-hosted DEA method selector backed by a Qwen3-32B + LoRA classifier},
	  author = {statLens contributors},
	  year   = {2025},
	  url    = {https://huggingface.co/domizzz2025/statLens},
	  note   = {Apache-2.0},
	}
	```

	A peer-reviewed manuscript is in preparation.