---
license: apache-2.0
language:
- en
library_name: peft
base_model: Qwen/Qwen3-32B
tags:
- bioinformatics
- differential-expression
- DESeq2
- limma
- omics
- LoRA
pipeline_tag: text-generation
---
# statLens

## Key features

- **Self-hosted, no external API calls.** Your data never leaves the box.
- **13 differential-expression pipelines** covering every common DEA scenario (Count / Continuous × basic / batch / paired / multi-group / time-course / interaction, plus ZINB) — or `none_of_these` if your study sits outside the supported space.
- **Editable schema in the middle.** statLens shows you the 21-field study-design summary it extracted, lets you fix anything wrong, *then* picks the pipeline.
- **End-to-end ≈ 25–45 s** per request on a single 24 GB-class NVIDIA GPU.
- **Wheel install + one command to launch** — `pip install` and `statlens serve` is all you need.
- **Reproducible** — every run is a self-contained folder you can zip and ship.
---

## Table of contents

- [Prerequisite](#prerequisite)
- [Quick start](#quick-start)
- [Where statLens looks for the base model](#where-statlens-looks-for-the-base-model)
- [Screenshots](#screenshots)
- [Hardware](#hardware)
- [TSV format](#tsv-format)
- [Output](#output)
- [Use cases](#use-cases)
- [Pipeline labels](#pipeline-labels)
- [Interpreting results](#interpreting-results)
- [Headless server](#headless-server-reaching-localhost7860-from-elsewhere)
- [Subcommands](#subcommands)
- [API endpoints](#api-endpoints)
- [Models](#models)
- [Training (LoRA only)](#training-lora-only)
- [Troubleshooting](#troubleshooting)
- [Citation](#citation)

---
## Prerequisite

You need **Qwen3-32B** (~64 GB BF16) on local disk. If you don't have it:

```bash
huggingface-cli download Qwen/Qwen3-32B --local-dir ~/models/qwen3-32b
```
---

## Quick start

```bash
# 1. Install statLens (~5 min on first run; pulls ~2 GB of CUDA dependencies)
pip install ${HF_ENDPOINT:-https://huggingface.co}/domizzz2025/statLens/resolve/main/statlens-0.1.11-py3-none-any.whl

# 2. Launch
statlens serve
```

After ~80 s you should see:

```
══════════════════════════════════════════════════════
 ✅ statLens ready
 open in browser: http://localhost:7860/
 Ctrl+C to stop.
══════════════════════════════════════════════════════
```

Open `http://localhost:7860`, drop a TSV, write a short study-context description, click **Classify & run**, review the extracted schema, then **Run pipeline →**.
---

## Where statLens looks for the base model

When you run `statlens serve`, it auto-discovers Qwen3-32B in any of:

- `~/models/qwen3-32b/`
- `/root/autodl-tmp/models/qwen3-32b/` (AutoDL)
- `/workspace/models/qwen3-32b/` (RunPod / Lambda)
- `/data/models/qwen3-32b/`, `/mnt/models/qwen3-32b/`
- the Hugging Face Hub cache

If yours is elsewhere, point `statlens serve` at it explicitly:

```bash
statlens serve --base-model /path/to/qwen3-32b
# or, persistently:
export STATLENS_BASE_MODEL=/path/to/qwen3-32b
```
---

## Screenshots

| | |
|---|---|
|  | **Step 1 — Upload + describe.** Drop a wide-format TSV and write a short study description. |
|  | **Step 2 — Review the extracted schema.** statLens shows the 21 fields it inferred from your data; edit anything that looks wrong before continuing. |
|  | **Step 3 — Get plots and tables.** The matched DESeq2 / limma pipeline runs and returns 5 plots and the result tables, packaged as a downloadable zip. |
---

## Hardware

| | |
|---|---|
| GPU | NVIDIA, **≥ 22 GB VRAM** (RTX 3090 / 4090 / A40 / A100 / 5090 …) |
| OS | Linux x86_64 |
| RAM | 32 GB+ |
| Disk | 75 GB free (64 GB Qwen + 1 GB LoRA + working space) |
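The 22 GB VRAM figure is far below the 64 GB BF16 checkpoint size, which only works out if the base model is loaded quantized. A back-of-envelope sketch — note the 4-bit load is our inference from the bitsandbytes dependency, not documented statLens behavior:

```shell
# Rough VRAM math for a 32B-parameter base model at two precisions.
# Assumption (ours): statLens loads the base 4-bit via bitsandbytes,
# which is why a 22-24 GB card is enough.
params_b=32                                    # Qwen3-32B: ~32 billion parameters
echo "BF16  weights: $(( params_b * 2 )) GB"   # 2 bytes/param -> the 64 GB on disk
echo "4-bit weights: $(( params_b / 2 )) GB"   # 0.5 byte/param -> fits in 22 GB + headroom
```

The remaining VRAM goes to KV-cache, activations, and the LoRA adapter.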
Mac / Windows / AMD: not supported (LLaMA-Factory + bitsandbytes are CUDA-only).
---

## TSV format

Wide format, one row per sample. Required columns:

| column | meaning |
|---|---|
| `sample_id` | unique per row |
| `subject_id` | repeats for paired or longitudinal samples |
| a **design column** | `group` / `condition` / `treatment` / `clinical_group` / `tumor_stage` / `subtype` / `arm` / … (fuzzy-matched) |
| **feature columns** | prefixed with **exactly** one of `gene_`, `asv_`, `prot_`, `metab_`, `otu_`, `feat_`. Other prefixes are not recognised and the adapter will report `No feature columns found`. |
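To make these requirements concrete, here is a toy four-sample TSV (file name, gene names, and counts are all invented for illustration), followed by the kind of header check the prefix rule implies:

```shell
# Minimal wide-format TSV: one row per sample, a "group" design column,
# and three feature columns carrying the required "gene_" prefix.
printf 'sample_id\tsubject_id\tgroup\tgene_TP53\tgene_BRCA1\tgene_EGFR\n' >  demo.tsv
printf 'S1\tP1\tcase\t120\t34\t7\n'                                       >> demo.tsv
printf 'S2\tP2\tcase\t98\t41\t3\n'                                        >> demo.tsv
printf 'S3\tP3\tcontrol\t15\t60\t22\n'                                    >> demo.tsv
printf 'S4\tP4\tcontrol\t20\t55\t19\n'                                    >> demo.tsv

# Count header columns with a recognised feature prefix; zero matches
# means the run would fail with "No feature columns found".
head -1 demo.tsv | tr '\t' '\n' | grep -cE '^(gene_|asv_|prot_|metab_|otu_|feat_)'
```

The final command prints `3` for this file, one per `gene_` column.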
Optional: time-like columns (`time_day`, `collection_day`, `time_week`, …) and batch-like columns (`batch`, `site`, `run`, `ms_batch`, …).

13 demonstration TSVs ship with the package — list them with:

```bash
ls $(python3 -c 'import statlens, os; print(os.path.dirname(statlens.__file__))')/data/examples/
```
---

## Output

Each run lands under `~/.cache/statlens/runs/<run_id>/`:

```
out/
├── statlens_report.md      # human-readable summary + reasoning
├── statlens_report.json    # machine-readable sidecar
└── pipeline_output/
    ├── volcano_plot.png
    ├── PCA_plot.png
    ├── MA_plot.png
    ├── top_DE_genes_heatmap.png
    ├── top20_DE_genes_barplot.png
    ├── results.csv
    ├── significant_genes.csv
    └── run.log
result.zip                  # everything above, packaged
```

The **Download all** button in the web UI returns `result.zip`.
---

## Use cases

Three representative scenarios from the 13 supported pipelines:

| Scenario | Study | Required columns | Expected label |
|---|---|---|---|
| **Bulk RNA-seq case-control** | 30 patients, 15 case vs 15 control, single sequencing batch, looking for DE genes | `sample_id`, `subject_id`, `group`, `gene_*` | `Count_DESeq2_basic` |
| **Plasma proteomics two-arm** | LC-MS/MS, 12 cases vs 12 controls, log2 intensity, single MS run | `sample_id`, `subject_id`, `group`, `prot_*` | `Continuous_limma_basic` |
| **16S microbiome IBD vs Healthy** | sparse ASV counts dominated by zeros (>40 %) | `sample_id`, `subject_id`, `condition`, `asv_*` | `Count_DESeq2_ZINB` |

A full set of 13 paired demos — one (`.tsv` + matching `.context.txt`) per label — lives under [`examples/`](./examples). Drop any `.tsv` into the web UI and paste the matching `.context.txt` as your study description to reproduce the scenario in one click.
---

## Pipeline labels

statLens classifies a study into one of **13** DEA scenarios, or returns `none_of_these` (a 14th "kill-switch" output) when the design falls outside its training space.

| family | label | when |
|---|---|---|
| Count (DESeq2) | `Count_DESeq2_basic` | 2 groups, no batch / time / pairing |
| | `Count_DESeq2_with_batch` | 2 groups + batch covariate |
| | `Count_DESeq2_paired_or_repeated` | matched samples within subject |
| | `Count_DESeq2_multi_group` | ≥ 3 independent groups |
| | `Count_DESeq2_time_course` | single cohort, ≥ 3 time points |
| | `Count_DESeq2_group_time_interaction` | ≥ 2 groups × multiple time points |
| | `Count_DESeq2_ZINB` | counts dominated by zeros (>40 %), e.g. 16S |
| Continuous (limma) | `Continuous_limma_basic` | 2 groups, no batch / time / pairing |
| | `Continuous_limma_with_batch` | 2 groups + batch covariate |
| | `Continuous_limma_paired_or_repeated` | pre/post or matched samples |
| | `Continuous_limma_multi_group` | ≥ 3 independent groups |
| | `Continuous_limma_time_course` | single cohort, ≥ 3 time points |
| | `Continuous_limma_group_time_interaction` | groups × time within subject |
| (decline) | `none_of_these` | survival / network inference / single-sample / non-omics — no forced fit |
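You can estimate the ">40 % zeros" ZINB trigger on your own matrix before uploading. A sketch on an invented sparse 16S-style table — this threshold check is ours, not statLens's internal code:

```shell
# Invented sparse counts: 3 samples x 4 ASV features, mostly zeros.
printf 'sample_id\tasv_1\tasv_2\tasv_3\tasv_4\n' >  counts.tsv
printf 'S1\t0\t0\t5\t0\n'                        >> counts.tsv
printf 'S2\t3\t0\t0\t0\n'                        >> counts.tsv
printf 'S3\t0\t7\t0\t2\n'                        >> counts.tsv

# Fraction of zero cells in the feature matrix; above ~0.40 the study
# is a candidate for the ZINB labels.
awk -F'\t' 'NR > 1 { for (i = 2; i <= NF; i++) { n++; if ($i == 0) z++ } }
            END { printf "%.2f\n", z / n }' counts.tsv
```

For this toy matrix the script prints `0.67` (8 zero cells out of 12), well past the threshold.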
---

## Interpreting results

Every successful run produces 5 plots and 2 result tables under `pipeline_output/`:

| File | What it shows |
|---|---|
| `volcano_plot.png` | Each feature plotted by log2 fold-change (x) vs −log10 adjusted p-value (y). Top-right and top-left points are the significantly up- and down-regulated features. |
| `MA_plot.png` | Log2 fold-change (y) vs mean expression (x). Diagnostic for fold-change vs abundance bias. |
| `PCA_plot.png` | First two principal components of the normalized expression matrix, colored by group. Sanity check for class separation. |
| `top_DE_genes_heatmap.png` | Top 20 most-significant DE features as a heatmap of z-scored expression across samples. |
| `top20_DE_genes_barplot.png` | Top 20 features by absolute log2 fold-change as a barplot. |
| `results.csv` | Full DE table — `feature_id`, `log2FoldChange`, `lfcSE`, `stat`, `pvalue`, `padj`. |
| `significant_genes.csv` | Subset of `results.csv` filtered at `padj < 0.05` (or the family default). |
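`significant_genes.csv` can also be re-derived from `results.csv` with a one-liner, which is handy when you want a different `padj` cutoff than the shipped default. A sketch on an invented three-row table with the column order listed above:

```shell
# Invented results.csv matching the documented column layout.
cat > results.csv <<'EOF'
feature_id,log2FoldChange,lfcSE,stat,pvalue,padj
gene_TP53,2.1,0.3,7.0,1.0e-10,3.0e-9
gene_BRCA1,-0.2,0.3,-0.7,0.48,0.61
gene_EGFR,-1.4,0.4,-3.5,4.0e-4,6.0e-3
EOF

# Keep the header plus every row whose padj (column 6) clears the cutoff.
awk -F, -v cut=0.05 'NR == 1 || $6 + 0 < cut' results.csv
```

Here two of the three features survive the `0.05` cutoff; change `cut` to tighten or relax the filter.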
For paired / time-course / interaction designs the `results.csv` schema is the same; only the underlying model and the contrast definition change. See the `statlens_report.md` produced alongside the run for the exact model formula used.
---

## Headless server: reaching `localhost:7860` from elsewhere

If `statlens serve` cannot open a browser (AutoDL, RunPod, Lambda, …), use one of these:

| | command | works for |
|---|---|---|
| **Public URL** | `cloudflared tunnel --url http://localhost:7860` | any device, any network |
| **SSH tunnel** | `ssh -fNL 7860:localhost:7860 user@server` (run on your laptop) | quick local dev |
| **curl only** | `curl -X POST http://localhost:7860/api/run -F "context=..." -F "tsv=@data.tsv"` | scripting / no browser |
---

## Subcommands

```
statlens serve       # main entry point
statlens download    # pre-fetch the LoRA only (~1 GB)
statlens info        # show GPU / cache / paths
statlens classify --tsv DATA --context CTX --out DIR
                     # one-shot CLI mode (no browser)
statlens --version
```

`statlens classify` runs both LLM stages back-to-back without a review pause — useful for batch processing.
---

## API endpoints

| route | method | purpose |
|---|---|---|
| `/` | GET | serve the web UI |
| `/api/extract` | POST (multipart: `tsv`, `context`) | stage 1 — return a SchemaSummary |
| `/api/run_pipeline` | POST (form: `run_id`, `schema` JSON) | stage 3 — pick label + run pipeline |
| `/api/run` | POST (multipart: `tsv`, `context`) | legacy single-shot path (no review) |
| `/api/artifact/{run_id}/…` | GET | fetch a single PNG/CSV |
| `/api/zip/{run_id}` | GET | fetch the packaged result |
| `/api/csv_preview/{run_id}/…` | GET | first N rows of a result CSV as JSON |
---

## Models

| component | source | size | license |
|---|---|---|---|
| base | [`Qwen/Qwen3-32B`](https://huggingface.co/Qwen/Qwen3-32B) (BF16) | 64 GB | Apache-2.0 |
| LoRA | [`domizzz2025/statLens`](https://huggingface.co/domizzz2025/statLens) | 1 GB | Apache-2.0 |

The LoRA is auto-downloaded on first run; the base model is yours to provide.
---

## Training (LoRA only)

The classifier LoRA was fine-tuned on top of `Qwen/Qwen3-32B` with [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory):

| | |
|---|---|
| Adapter rank / alpha | 32 / 64 |
| Target modules | q / k / v / o / up / down / gate proj |
| Optimizer · schedule | AdamW · cosine, 3 epochs (~636 steps) |
| Training data | curated study descriptions covering the 13 DEA scenarios |

Loss curves and trainer state live under [`qwen3_32b_lora_v1/`](./qwen3_32b_lora_v1): `training_loss.png`, `training_eval_loss.png`, `trainer_state.json`, `trainer_log.jsonl`.

Generalization to real-world TSVs with non-canonical column conventions is recovered via the user-editable schema layer at run time.
---

## Troubleshooting

| symptom | fix |
|---|---|
| `Network is unreachable` during `pip install` (mainland China) | `export HF_ENDPOINT=https://hf-mirror.com` and retry |
| `LocalEntryNotFoundError` when the LoRA auto-fetches | same as above — set `HF_ENDPOINT` before `statlens serve` |
| `no base model found` | put Qwen3-32B in one of the auto-search paths, or pass `--base-model PATH` |
| `CUDA out of memory` on startup | a previous `statlens serve` is still holding GPU memory: `pkill -9 -f statlens; nvidia-smi --query-compute-apps=pid --format=csv,noheader \| xargs -r kill -9` |
| `address already in use` | a previous instance is bound — kill it first |
| LLM never becomes ready | tail `~/.cache/statlens/llm.log` to see the LLaMA-Factory error |
| schema field looks wrong in the browser | edit it directly; the LLM picks the label from your edits, not the original extraction |
| `Schema specified <field>=…, but no such column in TSV` | a column-name field in the schema doesn't match your TSV; either fix the field, or clear it to use auto-detection |
| `Schema reference_level=… not in observed levels` | `reference_level` doesn't match any actual group level; set it to one of the values shown in `group_levels`, or clear it |
| `No feature columns found. Expected one of these prefixes: …` | rename your feature columns to start with `gene_` / `asv_` / `prot_` / `metab_` / `otu_` / `feat_` |
| `upload exceeds N MB limit` | raise the cap with `STATLENS_MAX_UPLOAD_MB=500 statlens serve` (default 100 MB) |
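The two `HF_ENDPOINT` fixes above work because the Quick-start install URL is built with shell parameter expansion: `${HF_ENDPOINT:-https://huggingface.co}` uses the variable when it is set and falls back to huggingface.co otherwise. For example:

```shell
wheel="domizzz2025/statLens/resolve/main/statlens-0.1.11-py3-none-any.whl"

# Default: variable unset, so the fallback host is used.
unset HF_ENDPOINT
echo "${HF_ENDPOINT:-https://huggingface.co}/$wheel"

# Mirror: export before `pip install` and before `statlens serve`.
export HF_ENDPOINT=https://hf-mirror.com
echo "${HF_ENDPOINT:-https://huggingface.co}/$wheel"
```

The first `echo` prints the huggingface.co URL, the second the hf-mirror.com one; any tool launched afterwards in the same shell inherits the exported value.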
---

## Source · License

- Wheel + LoRA + source: <https://huggingface.co/domizzz2025/statLens>
- License: Apache-2.0
---

## Citation

If you use statLens in academic work, please cite:

```bibtex
@software{statlens_2025,
  title  = {statLens: A self-hosted DEA method selector backed by a Qwen3-32B + LoRA classifier},
  author = {statLens contributors},
  year   = {2025},
  url    = {https://huggingface.co/domizzz2025/statLens},
  note   = {Apache-2.0},
}
```

A peer-reviewed manuscript is in preparation.