---
license: apache-2.0
language:
- en
library_name: peft
base_model: Qwen/Qwen3-32B
tags:
- bioinformatics
- differential-expression
- DESeq2
- limma
- omics
- LoRA
pipeline_tag: text-generation
---

# statLens

## Key features

- **Self-hosted, no external API calls.** Your data never leaves the box.
- **13 differential-expression pipelines** covering every common DEA scenario (Count / Continuous × basic / batch / paired / multi-group / time-course / interaction, plus ZINB) — or `none_of_these` if your study sits outside the supported space.
- **Editable schema in the middle.** statLens shows you the 21-field study-design summary it extracted, lets you fix anything wrong, *then* picks the pipeline.
- **End-to-end ≈ 25–45 s** per request on a single 24 GB-class NVIDIA GPU.
- **Wheel install + one command to launch** — `pip install` and `statlens serve` is all you need.
- **Reproducible** — every run is a self-contained folder you can zip and ship.

---

## Table of contents

- [Prerequisite](#prerequisite-you-do-this-once)
- [Quick start](#quick-start)
- [Where statlens looks for the base model](#where-statlens-looks-for-the-base-model)
- [Screenshots](#screenshots)
- [Hardware](#hardware)
- [TSV format](#tsv-format)
- [Output](#output)
- [Use cases](#use-cases)
- [Pipeline labels](#pipeline-labels)
- [Interpreting results](#interpreting-results)
- [Headless server](#headless-server-reaching-localhost7860-from-elsewhere)
- [Subcommands](#subcommands)
- [API endpoints](#api-endpoints)
- [Models](#models)
- [Training (LoRA only)](#training-lora-only)
- [Troubleshooting](#troubleshooting)
- [Citation](#citation)

---

## Prerequisite (you do this once)

You need **Qwen3-32B** (~64 GB BF16) on local disk. If you don't have it:

```bash
huggingface-cli download Qwen/Qwen3-32B --local-dir ~/models/qwen3-32b
```

---

## Quick start

```bash
# 1. Install statLens (~5 min on first run; pulls ~2 GB of CUDA dependencies)
pip install ${HF_ENDPOINT:-https://huggingface.co}/domizzz2025/statLens/resolve/main/statlens-0.1.11-py3-none-any.whl

# 2. Launch
statlens serve
```

After ~80 s you should see:

```
══════════════════════════════════════════════════════
  ✅ statLens ready
  open in browser:  http://localhost:7860/
  Ctrl+C to stop.
══════════════════════════════════════════════════════
```

Open `http://localhost:7860`, drop a TSV, write a short study-context description, click **Classify & run**, review the extracted schema, then **Run pipeline →**.

---

## Where statlens looks for the base model

When you run `statlens serve`, it auto-discovers Qwen3-32B in any of:

- `~/models/qwen3-32b/`
- `/root/autodl-tmp/models/qwen3-32b/` (AutoDL)
- `/workspace/models/qwen3-32b/` (RunPod / Lambda)
- `/data/models/qwen3-32b/`, `/mnt/models/qwen3-32b/`
- the HuggingFace Hub cache

If yours is elsewhere, point `statlens serve` at it explicitly:

```bash
statlens serve --base-model /path/to/qwen3-32b
# or, persistently:
export STATLENS_BASE_MODEL=/path/to/qwen3-32b
```

---

## Screenshots

| | |
|---|---|
| ![New analysis form](./screenshots/01_home.png) | **Step 1 — Upload + describe.** Drop a wide-format TSV and write a short study description. |
| ![Schema review](./screenshots/02_schema.png) | **Step 2 — Review the extracted schema.** statLens shows the 21 fields it inferred from your data; edit anything that looks wrong before continuing. |
| ![Result with plots and tables](./screenshots/03_result.png) | **Step 3 — Get plots and tables.** The matched DESeq2 / limma pipeline runs and returns 5 plots and the result tables, packaged as a downloadable zip. |

---

## Hardware

| | |
|---|---|
| GPU | NVIDIA, **≥ 22 GB VRAM** (RTX 3090 / 4090 / A40 / A100 / 5090 …) |
| OS | Linux x86_64 |
| RAM | 32 GB+ |
| Disk | 75 GB free (64 GB Qwen + 1 GB LoRA + working space) |

Mac / Windows / AMD: not supported (LLaMA-Factory + bitsandbytes are CUDA-only).

---

## TSV format

Wide format, one row per sample. Required columns:

| column | meaning |
|---|---|
| `sample_id` | unique per row |
| `subject_id` | repeats for paired or longitudinal samples |
| a **design column** | `group` / `condition` / `treatment` / `clinical_group` / `tumor_stage` / `subtype` / `arm` / … (fuzzy-matched) |
| **feature columns** | prefixed with **exactly** one of `gene_`, `asv_`, `prot_`, `metab_`, `otu_`, `feat_`. Other prefixes are not recognized, and the adapter will report `No feature columns found`. |

Optional: time-like columns (`time_day`, `collection_day`, `time_week`, …) and batch-like columns (`batch`, `site`, `run`, `ms_batch`, …).

13 demonstration TSVs ship with the package — list them with:

```bash
ls $(python3 -c 'import statlens, os; print(os.path.dirname(statlens.__file__))')/data/examples/
```

---

## Output

Each run lands under `~/.cache/statlens/runs/<run_id>/`:

```
out/
├── statlens_report.md       # human-readable summary + reasoning
├── statlens_report.json     # machine-readable sidecar
└── pipeline_output/
    ├── volcano_plot.png
    ├── PCA_plot.png
    ├── MA_plot.png
    ├── top_DE_genes_heatmap.png
    ├── top20_DE_genes_barplot.png
    ├── results.csv
    ├── significant_genes.csv
    └── run.log
result.zip                   # everything above, packaged
```

The **Download all** button in the web UI returns `result.zip`.
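To try the format without the bundled demos, a minimal well-formed input can be generated with the standard library. This is a sketch: the file name `demo.tsv` and all count values are synthetic, and `group` stands in for any fuzzy-matched design column.

```python
import csv
import random

random.seed(0)

# Wide format: one row per sample. Required columns first (sample_id,
# subject_id, a design column), then feature columns with a recognized
# prefix -- here `gene_`.
genes = [f"gene_{i:04d}" for i in range(1, 6)]
header = ["sample_id", "subject_id", "group", *genes]
rows = [
    [f"S{i:02d}", f"P{i:02d}", "case" if i <= 3 else "control",
     *[random.randint(0, 500) for _ in genes]]
    for i in range(1, 7)
]

with open("demo.tsv", "w", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t")
    writer.writerow(header)
    writer.writerows(rows)
```

Dropping a file like this into the web UI with a two-group study description should classify as `Count_DESeq2_basic`.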
---

## Use cases

Three representative scenarios from the 13 supported pipelines:

| Scenario | Study | Required columns | Expected label |
|---|---|---|---|
| **Bulk RNA-seq case-control** | 30 patients, 15 case vs 15 control, single sequencing batch, looking for DE genes | `sample_id`, `subject_id`, `group`, `gene_*` | `Count_DESeq2_basic` |
| **Plasma proteomics two-arm** | LC-MS/MS, 12 cases vs 12 controls, log2 intensity, single MS run | `sample_id`, `subject_id`, `group`, `prot_*` | `Continuous_limma_basic` |
| **16S microbiome IBD vs Healthy** | sparse ASV counts dominated by zeros (>40 %) | `sample_id`, `subject_id`, `condition`, `asv_*` | `Count_DESeq2_ZINB` |

A full set of 13 paired demos — one (`.tsv` + matching `.context.txt`) per label — lives under [`examples/`](./examples). Drop any `.tsv` into the web UI and paste the matching `.context.txt` as your study description to reproduce the scenario in one click.

---

## Pipeline labels

statLens classifies a study into one of **13** DEA scenarios, or returns `none_of_these` (a 14th "kill-switch" output) when the design falls outside its training space.

| family | label | when |
|---|---|---|
| Count (DESeq2) | `Count_DESeq2_basic` | 2 groups, no batch / time / pairing |
| | `Count_DESeq2_with_batch` | 2 groups + batch covariate |
| | `Count_DESeq2_paired_or_repeated` | matched samples within subject |
| | `Count_DESeq2_multi_group` | ≥ 3 independent groups |
| | `Count_DESeq2_time_course` | single cohort, ≥ 3 time points |
| | `Count_DESeq2_group_time_interaction` | ≥ 2 groups × multiple time points |
| | `Count_DESeq2_ZINB` | counts dominated by zeros (>40 %), e.g. 16S |
| Continuous (limma) | `Continuous_limma_basic` | 2 groups, no batch / time / pairing |
| | `Continuous_limma_with_batch` | 2 groups + batch covariate |
| | `Continuous_limma_paired_or_repeated` | pre/post or matched samples |
| | `Continuous_limma_multi_group` | ≥ 3 independent groups |
| | `Continuous_limma_time_course` | single cohort, ≥ 3 time points |
| | `Continuous_limma_group_time_interaction` | groups × time within subject |
| (decline) | `none_of_these` | survival / network inference / single-sample / non-omics — no forced fit |

---

## Interpreting results

Every successful run produces 5 plots and 2 result tables under `pipeline_output/`:

| File | What it shows |
|---|---|
| `volcano_plot.png` | Each feature plotted by log2 fold-change (x) vs −log10 adjusted p-value (y). Top-right and top-left points are the significantly up- and down-regulated features. |
| `MA_plot.png` | Log2 fold-change (y) vs mean expression (x). Diagnostic for fold-change vs abundance bias. |
| `PCA_plot.png` | First two principal components of the normalized expression matrix, colored by group. Sanity check for class separation. |
| `top_DE_genes_heatmap.png` | Top 20 most-significant DE features as a heatmap of z-scored expression across samples. |
| `top20_DE_genes_barplot.png` | Top 20 features by absolute log2 fold-change as a barplot. |
| `results.csv` | Full DE table — `feature_id`, `log2FoldChange`, `lfcSE`, `stat`, `pvalue`, `padj`. |
| `significant_genes.csv` | Subset of `results.csv` filtered at `padj < 0.05` (or the family default). |

For paired / time-course / interaction designs the `results.csv` schema is the same; only the underlying model and the contrast definition change. See the `statlens_report.md` produced alongside the run for the exact model formula used.
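`significant_genes.csv` is just the documented `padj < 0.05` filter applied to `results.csv`, so it is easy to re-derive or re-threshold downstream. A pandas sketch on a toy stand-in table (the three rows and their statistics are made up; column names match the table above):

```python
import pandas as pd

# Toy stand-in for pipeline_output/results.csv, columns as documented.
results = pd.DataFrame({
    "feature_id":     ["gene_0001", "gene_0002", "gene_0003"],
    "log2FoldChange": [2.1, -0.3, -1.8],
    "lfcSE":          [0.4, 0.5, 0.6],
    "stat":           [5.2, -0.6, -3.0],
    "pvalue":         [1e-6, 0.55, 0.002],
    "padj":           [3e-6, 0.61, 0.006],
})

# significant_genes.csv = results.csv filtered at the default cutoff,
# most significant first.
significant = results[results["padj"] < 0.05].sort_values("padj")
print(significant["feature_id"].tolist())
```

Swapping the cutoff (or filtering on `log2FoldChange` as well) is a one-line change against the full `results.csv`.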
---

## Headless server: reaching `localhost:7860` from elsewhere

If `statlens serve` cannot open a browser (AutoDL, RunPod, Lambda, …), use one of these:

| | command | works for |
|---|---|---|
| **Public URL** | `cloudflared tunnel --url http://localhost:7860` | any device, any network |
| **SSH tunnel** | `ssh -fNL 7860:localhost:7860 user@server` (run on your laptop) | quick local dev |
| **curl only** | `curl -X POST http://localhost:7860/api/run -F "context=..." -F "tsv=@data.tsv"` | scripting / no browser |

---

## Subcommands

```
statlens serve        # main entry point
statlens download     # pre-fetch the LoRA only (~1 GB)
statlens info         # show GPU / cache / paths
statlens classify --tsv DATA --context CTX --out DIR   # one-shot CLI mode (no browser)
statlens --version
```

`statlens classify` runs both LLM stages back-to-back without a review pause — useful for batch processing.

---

## API endpoints

| route | method | purpose |
|---|---|---|
| `/` | GET | serve the web UI |
| `/api/extract` | POST (multipart: `tsv`, `context`) | stage 1 — return a SchemaSummary |
| `/api/run_pipeline` | POST (form: `run_id`, `schema` JSON) | stage 3 — pick label + run pipeline |
| `/api/run` | POST (multipart: `tsv`, `context`) | legacy single-shot path (no review) |
| `/api/artifact/{run_id}/{filename}` | GET | fetch a single PNG/CSV |
| `/api/zip/{run_id}` | GET | fetch the packaged result |
| `/api/csv_preview/{run_id}/{filename}` | GET | first N rows of a result CSV as JSON |

---

## Models

| component | source | size | license |
|---|---|---|---|
| base | [`Qwen/Qwen3-32B`](https://huggingface.co/Qwen/Qwen3-32B) (BF16) | 64 GB | Apache-2.0 |
| LoRA | [`domizzz2025/statLens`](https://huggingface.co/domizzz2025/statLens) | 1 GB | Apache-2.0 |

The LoRA is auto-downloaded on first run; the base model is yours to provide.
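For scripting the two-stage flow against the API endpoints listed above, here is a minimal Python client sketch. It assumes three things not documented here: the third-party `requests` package, a `filename` path segment on the artifact routes, and a `run_id` field in the stage-1 JSON response.

```python
import json

BASE = "http://localhost:7860"  # default serve address

def artifact_url(run_id, filename):
    # GET a single PNG/CSV produced by a finished run.
    return f"{BASE}/api/artifact/{run_id}/{filename}"

def zip_url(run_id):
    # GET the packaged result.zip for a run.
    return f"{BASE}/api/zip/{run_id}"

def run_two_stage(tsv_path, context, schema_edits=None):
    """Stage 1 (/api/extract), optional schema edits, stage 3 (/api/run_pipeline).

    Sketch only -- assumes `requests` is installed and that the stage-1
    response includes a `run_id` field.
    """
    import requests  # assumption: not bundled with statLens
    with open(tsv_path, "rb") as fh:
        schema = requests.post(f"{BASE}/api/extract",
                               data={"context": context},
                               files={"tsv": fh}).json()
    schema.update(schema_edits or {})  # fix any mis-extracted field here
    resp = requests.post(f"{BASE}/api/run_pipeline",
                         data={"run_id": schema.get("run_id"),
                               "schema": json.dumps(schema)})
    return resp.json()
```

The `schema_edits` dict plays the role of the browser's review step; passing nothing reproduces the behavior of the legacy `/api/run` path.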
---

## Training (LoRA only)

The classifier LoRA was fine-tuned on top of `Qwen/Qwen3-32B` with [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory):

| | |
|---|---|
| Adapter rank / alpha | 32 / 64 |
| Target modules | q / k / v / o / up / down / gate proj |
| Optimizer · schedule | AdamW · cosine, 3 epochs (~636 steps) |
| Training data | curated study descriptions covering the 13 DEA scenarios |

Loss curves and trainer state live under [`qwen3_32b_lora_v1/`](./qwen3_32b_lora_v1): `training_loss.png`, `training_eval_loss.png`, `trainer_state.json`, `trainer_log.jsonl`.

Generalization to real-world TSVs with non-canonical column conventions is recovered via the user-editable schema layer at run time.

---

## Troubleshooting

| symptom | fix |
|---|---|
| `Network is unreachable` during `pip install` (mainland China) | `export HF_ENDPOINT=https://hf-mirror.com` and retry |
| `LocalEntryNotFoundError` when the LoRA auto-fetches | same as above — set `HF_ENDPOINT` before `statlens serve` |
| `no base model found` | put Qwen3-32B in one of the auto-search paths, or pass `--base-model PATH` |
| `CUDA out of memory` on startup | a previous `statlens serve` is still holding GPU memory: `pkill -9 -f statlens; nvidia-smi --query-compute-apps=pid --format=csv,noheader \| xargs -r kill -9` |
| `address already in use` | a previous instance is still bound to the port — kill it first |
| LLM never becomes ready | tail `~/.cache/statlens/llm.log` to see the LLaMA-Factory error |
| a schema field looks wrong in the browser | edit it directly; the LLM picks the label from your edits, not from the original extraction |
| `Schema specified =…, but no such column in TSV` | a column-name field in the schema doesn't match your TSV — either fix the field, or clear it to use auto-detection |
| `Schema reference_level=… not in observed levels` | `reference_level` doesn't match any actual group level — set it to one of the values shown in `group_levels`, or clear it |
| `No feature columns found. Expected one of these prefixes: …` | rename your feature columns to start with `gene_` / `asv_` / `prot_` / `metab_` / `otu_` / `feat_` |
| `upload exceeds N MB limit` | raise the cap: `STATLENS_MAX_UPLOAD_MB=500 statlens serve` (default 100 MB) |

---

## Source · License

- Wheel + LoRA + source: [`domizzz2025/statLens`](https://huggingface.co/domizzz2025/statLens)
- License: Apache-2.0

---

## Citation

If you use statLens in academic work, please cite:

```bibtex
@software{statlens_2025,
  title  = {statLens: A self-hosted DEA method selector backed by a Qwen3-32B + LoRA classifier},
  author = {statLens contributors},
  year   = {2025},
  url    = {https://huggingface.co/domizzz2025/statLens},
  note   = {Apache-2.0},
}
```

A peer-reviewed manuscript is in preparation.