title: GenoTriage
emoji: 🧬
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
tags:
- openenv
GenoTriage 🧬
An OpenEnv environment where AI agents classify real ClinVar SNP variants using ACMG criteria across three clinical difficulty tiers.
Overview
Clinical geneticists classify genetic variants daily to determine whether a mutation causes disease. This judgment (Pathogenic, Likely Pathogenic, Uncertain, Likely Benign, or Benign) directly impacts patient care, yet it remains time-consuming, expert-dependent, and difficult to scale.
GenoTriage turns this into a structured RL environment. Agents receive real SNP variants from ClinVar, enriched with population frequency data from gnomAD, and must classify them using the standard ACMG/AMP five-tier system. Each episode is single-step: the agent reads the evidence and submits one classification. This makes the environment fast, deterministic, and well-suited for both RL training and LLM evaluation.
Environment Description
| Property | Value |
|---|---|
| Variant type | SNPs (single nucleotide polymorphisms) only |
| Data source | ClinVar (NCBI) + gnomAD v4 population frequencies |
| Genome build | GRCh38 |
| Episode structure | Single-step (reset → observe → classify → reward → done) |
| Tasks | 3 (easy, medium, hard) |
| Variants per task | 8 |
| Interface | OpenEnv-compatible (step / reset / state) |
Action Space
The agent submits a VepAction with three fields:
| Field | Type | Description |
|---|---|---|
| classification | str (one of 5) | ACMG tier: Pathogenic, Likely_pathogenic, Uncertain_significance, Likely_benign, or Benign |
| reasoning | str | Explanation citing specific evidence from the observation (min 20 chars encouraged) |
| criteria_used | list[str] | List of specific criteria that drove the decision (e.g. "high population frequency", "nonsense variant") |
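
A minimal sketch of how an action with these three fields might be validated before it is submitted. `ActionSketch` and `ALLOWED_TIERS` are hypothetical stand-ins written with the standard library only; the real `VepAction` is a Pydantic model in models.py, and its validators are not shown in this README.

```python
from dataclasses import dataclass, field

# The five ACMG/AMP tier labels listed in the table above.
ALLOWED_TIERS = (
    "Pathogenic",
    "Likely_pathogenic",
    "Uncertain_significance",
    "Likely_benign",
    "Benign",
)


@dataclass
class ActionSketch:
    """Hypothetical stand-in for VepAction; not the project's model."""

    classification: str
    reasoning: str
    criteria_used: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject labels outside the five-tier system early.
        if self.classification not in ALLOWED_TIERS:
            raise ValueError(f"unknown tier: {self.classification!r}")


action = ActionSketch(
    classification="Likely_benign",
    reasoning="Synonymous variant with high gnomAD allele frequency.",
    criteria_used=["high population frequency", "synonymous variant"],
)
```

Note that the labels are case-sensitive and use underscores, so a value such as "benign" or "Likely Benign" would be rejected by this sketch.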
Observation Space
The agent receives a VepObservation with the following fields:
| Field | Type | Description |
|---|---|---|
| gene | str | Gene symbol (e.g. BRCA1, CFTR, MSH2) |
| chromosome | str | Chromosome (e.g. 17) |
| position | int | GRCh38 genomic position |
| ref / alt | str | Reference and alternate alleles |
| hgvs | str | HGVS genomic notation |
| consequence | str \| None | Molecular consequence (e.g. missense_variant, nonsense, synonymous_variant) |
| disease | str | Primary disease associated with this gene |
| population_frequency | float \| None | gnomAD v4 allele frequency (None if absent from gnomAD) |
| evidence_snippets | list[str] | 3 to 4 evidence snippets: gene-disease context, consequence interpretation, frequency context, functional evidence |
| task_description | str | Instructions for the agent |
| feedback | str | Grader feedback after step(); empty on reset() |
| done | bool | True after first step |
| reward | float | Reward received (0.0 on reset) |
Tasks
Task 1: easy (Benign / Likely Benign)
Variants with clear benign signals: moderate-to-high population frequency, synonymous or non-coding consequence, and no functional evidence linking the specific variant to disease. Agents should score well by correctly reading population frequency and consequence type.
Expected agent score: 0.75 to 0.95
Task 2: medium (Pathogenic / Likely Pathogenic)
Variants with clear pathogenic signals: loss-of-function consequences (nonsense, splice-site), absence from gnomAD, and a strong gene-disease association supported by clinical literature. Agents must distinguish signal from noise and identify loss-of-function as a strong pathogenicity indicator.
Expected agent score: 0.55 to 0.80
Task 3: hard (Uncertain Significance)
Variants where evidence is genuinely ambiguous: missense or regulatory variants in disease genes with no functional studies, conflicting computational predictions, or intermediate frequency. Agents must recognise when evidence is insufficient rather than defaulting to a confident classification.
Expected agent score: 0.35 to 0.60
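The three tiers reward reading different signals: frequency and consequence for easy, loss-of-function plus absence from gnomAD for medium, and restraint for hard. The sketch below turns those signals into a toy rule-based classifier. The function name `heuristic_classify`, the 5% and 0.1% frequency cut-offs, and the consequence sets are illustrative assumptions; this is neither the environment's grader nor a full ACMG implementation.

```python
from typing import Optional

# Assumed consequence labels; the environment's exact vocabulary may differ.
LOSS_OF_FUNCTION = {
    "nonsense",
    "stop_gained",
    "splice_donor_variant",
    "splice_acceptor_variant",
}
LOW_IMPACT = {"synonymous_variant", "intron_variant"}


def heuristic_classify(
    consequence: Optional[str], population_frequency: Optional[float]
) -> str:
    """Toy baseline mapping two observation fields to an ACMG tier."""
    freq = population_frequency
    # Very common alleles are unlikely to cause rare disease (assumed 5% cut-off).
    if freq is not None and freq >= 0.05:
        return "Benign"
    # Low-impact consequence plus non-trivial frequency (assumed 0.1% cut-off).
    if consequence in LOW_IMPACT and freq is not None and freq >= 0.001:
        return "Likely_benign"
    # Loss-of-function and absent from gnomAD: strong pathogenic signal.
    if consequence in LOSS_OF_FUNCTION and freq is None:
        return "Pathogenic"
    # Loss-of-function but observed in gnomAD: one step less confident.
    if consequence in LOSS_OF_FUNCTION:
        return "Likely_pathogenic"
    # Everything else (e.g. missense with no functional data) stays uncertain.
    return "Uncertain_significance"
```

A rule set like this would likely do well on the easy tier and poorly on the hard tier whenever the ground truth depends on evidence snippets rather than on the two fields used here.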
Reward Function
Each step returns a reward in [0.0, 1.0] composed of three components:
| Component | Max | Criteria |
|---|---|---|
| Classification accuracy | 0.70 | Exact match=0.70, one tier off=0.25, two off=0.05, three+ off=0.00 |
| Reasoning quality | 0.20 | Keyword matches in reasoning (+0.12) + length ≥ 50 chars (+0.08) |
| Criteria used | 0.10 | Non-empty list (+0.04) + ≥ 2 items (+0.06) |
Important: Reasoning and criteria bonuses are fully suppressed when the classification is 3+ tiers away from ground truth (e.g. Benign for a Pathogenic variant). Good writing cannot rescue a catastrophically wrong answer.
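The table and the suppression rule can be sketched as a single function. This is an approximation under stated assumptions, not the grader in server/vep_env_environment.py: the name `reward_sketch`, the `keywords` argument, and the rule that a single keyword hit earns the full +0.12 are all hypothetical.

```python
TIERS = [
    "Pathogenic",
    "Likely_pathogenic",
    "Uncertain_significance",
    "Likely_benign",
    "Benign",
]
ACCURACY_BY_DISTANCE = {0: 0.70, 1: 0.25, 2: 0.05}


def reward_sketch(predicted, truth, reasoning, criteria_used, keywords):
    """Approximate the three-component reward described in the table."""
    distance = abs(TIERS.index(predicted) - TIERS.index(truth))
    # Three or more tiers off: zero accuracy and all bonuses suppressed.
    if distance >= 3:
        return 0.0
    reward = ACCURACY_BY_DISTANCE[distance]
    # Assumption: any single keyword hit earns the full +0.12.
    text = reasoning.lower()
    if any(k.lower() in text for k in keywords):
        reward += 0.12
    if len(reasoning) >= 50:
        reward += 0.08
    if len(criteria_used) >= 1:
        reward += 0.04
    if len(criteria_used) >= 2:
        reward += 0.06
    # Round away floating-point noise from summing decimals.
    return round(reward, 2)


full = reward_sketch(
    "Pathogenic",
    "Pathogenic",
    "Nonsense variant in MSH2, absent from gnomAD, causes Lynch syndrome.",
    ["nonsense variant", "absent from gnomAD"],
    keywords=["nonsense", "gnomAD"],
)
```

Here `full` reaches the maximum of 1.0, while the same reasoning attached to a Benign call for this variant would score 0.0 because the prediction is four tiers from the truth.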
Setup
Prerequisites
- Python 3.10+
- Docker Desktop or Docker Engine
- A Hugging Face API token (free at huggingface.co)
Install
    git clone https://huggingface.co/spaces/fierce74/GenoTriage
    cd GenoTriage
    pip install "openenv-core>=0.2.2"
Configure environment variables
Copy .env.example to .env and fill in your values:
    cp .env.example .env

    HF_TOKEN=hf_your_token_here
    API_BASE_URL=https://router.huggingface.co/v1
    MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
    LOCAL_IMAGE_NAME=vep_env_env:latest
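A script can read these four variables from the process environment once they have been exported or loaded from .env. The sketch below shows one way to do that; `load_config` is a hypothetical helper, the fallback values simply mirror the example values above, and how inference.py actually loads its settings is not shown in this README.

```python
import os


def load_config(env=os.environ):
    """Read the four variables from .env.example, failing fast on a missing token."""
    token = env.get("HF_TOKEN", "")
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; copy .env.example to .env and fill it in"
        )
    return {
        "hf_token": token,
        # Fallbacks below are assumptions copied from the example file.
        "api_base_url": env.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        "model_name": env.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
        "local_image_name": env.get("LOCAL_IMAGE_NAME", "vep_env_env:latest"),
    }
```

Passing a plain dict instead of `os.environ` makes the helper easy to exercise in tests without touching real credentials.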
Build the Docker image
    docker build -t vep_env_env:latest .
Run the server locally (without Docker)
    pip install -e .
    uvicorn server.app:app --host 0.0.0.0 --port 8000
Usage
Run the baseline inference script
    python inference.py
This runs all 3 tasks sequentially (easy → medium → hard), printing structured logs:
    [START] task=easy env=vep_env model=Qwen/Qwen2.5-72B-Instruct
    [STEP] step=1 action=Benign|CFTR reward=1.00 done=true error=null
    ...
    [END] success=true steps=8 score=0.875 rewards=1.00,0.90,...
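Because each log line is a tag followed by key=value pairs, the output is easy to post-process. The parser below is a small sketch, assuming values never contain spaces (true of the sample lines above, but not guaranteed for arbitrary error messages); `parse_log_line` is a hypothetical helper, not part of the project.

```python
import re

LINE_RE = re.compile(r"^\[(START|STEP|END)\]\s+(.*)$")


def parse_log_line(line):
    """Split '[TAG] key=value ...' into (tag, dict); return None for other lines."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    tag, rest = match.groups()
    # Split on whitespace, then on the first '=' only, skipping stray tokens.
    fields = dict(pair.split("=", 1) for pair in rest.split() if "=" in pair)
    return tag, fields


tag, fields = parse_log_line(
    "[STEP] step=1 action=Benign|CFTR reward=1.00 done=true error=null"
)
```

All values come back as strings, so numeric fields such as `reward` need an explicit `float(...)` conversion before aggregation.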
Use the client in your own code
    import asyncio

    from vep_env import VepAction, VepEnv


    async def main():
        async with VepEnv(base_url="http://localhost:8000") as env:
            # Reset: receive a variant case
            result = await env.reset()
            obs = result.observation
            print(f"Gene: {obs.gene} | Disease: {obs.disease}")
            print(f"Consequence: {obs.consequence}")
            print(f"Population frequency: {obs.population_frequency}")
            for snippet in obs.evidence_snippets:
                print(f"  - {snippet}")

            # Submit classification
            action = VepAction(
                classification="Pathogenic",
                reasoning="Nonsense variant in MSH2, absent from gnomAD, causes Lynch syndrome.",
                criteria_used=["nonsense variant", "absent from gnomAD", "disease gene"],
            )
            result = await env.step(action)
            print(f"Reward: {result.reward}")
            print(f"Feedback: {result.observation.feedback}")


    asyncio.run(main())
Control the task tier
    VEP_TASK=medium python inference.py      # run medium tier only
    VEP_TASK=hard uvicorn server.app:app     # start server in hard mode
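One plausible way for a script to honour this variable is sketched below: run a single tier when VEP_TASK is set and all three otherwise. `tasks_to_run` is a hypothetical helper, and the handling of invalid values is an assumption; the project's own logic may differ.

```python
import os

ALL_TASKS = ("easy", "medium", "hard")


def tasks_to_run(env=os.environ):
    """Return the tiers to evaluate: one if VEP_TASK is set, otherwise all three."""
    requested = env.get("VEP_TASK", "").strip().lower()
    if not requested:
        return list(ALL_TASKS)
    if requested not in ALL_TASKS:
        raise ValueError(f"VEP_TASK must be one of {ALL_TASKS}, got {requested!r}")
    return [requested]
```

Raising on an unknown tier, rather than silently falling back to all tasks, keeps a typo such as VEP_TASK=meduim from producing a misleading full run.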
Baseline Scores
Evaluated using Qwen/Qwen2.5-72B-Instruct via the Hugging Face Inference Router.
| Task | Score | Notes |
|---|---|---|
| easy | 0.875 | Model correctly identifies benign signals in most cases |
| medium | 0.800 | Strong on loss-of-function; occasionally misses subtle pathogenic signals |
| hard | 0.738 | Tends toward confident classifications when VUS is the correct answer |
| overall | 0.804 | Average across all 3 tasks |
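
The overall figure is the unweighted mean of the three task scores, which can be checked in a few lines. The variable names are illustrative only.

```python
from statistics import mean

# Per-task baseline scores copied from the table above.
task_scores = {"easy": 0.875, "medium": 0.800, "hard": 0.738}

# Unweighted average, rounded to the three decimals used in the table.
overall = round(mean(task_scores.values()), 3)
print(overall)  # 0.804
```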
Project Structure
    GenoTriage/
    ├── __init__.py                  # Package exports
    ├── models.py                    # VepAction, VepObservation (Pydantic)
    ├── client.py                    # VepEnv client (WebSocket)
    ├── inference.py                 # Baseline inference script
    ├── variants.json                # Curated ClinVar variants (ground truth)
    ├── openenv.yaml                 # OpenEnv spec manifest
    ├── pyproject.toml               # Package config
    ├── Dockerfile                   # Container definition
    └── server/
        ├── app.py                   # FastAPI application
        ├── vep_env_environment.py   # Environment logic + grader
        └── requirements.txt         # Server dependencies
Data
Variants are sourced from ClinVar (April 2026 release, GRCh38) filtered to:
- SNPs only (CLNVC=single_nucleotide_variant)
- Trusted review status (criteria_provided or better)
- Named disease association
- 8 well-known disease genes: MSH2, MLH1, VHL, CFTR, SCN5A, APC, TSC1, RET
Population allele frequencies are from gnomAD v4 (queried at curation time and stored statically, with no live API calls at runtime).
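The four curation filters can be expressed as a single predicate over a record's annotation fields. The sketch below is an assumption about how such a filter might look, not the author's curation script: only CLNVC is named in this README, while the CLNREVSTAT, CLNDN, and GENEINFO keys, the placeholder disease names, and the reading of "criteria_provided or better" are guesses based on ClinVar's VCF conventions.

```python
GENES = {"MSH2", "MLH1", "VHL", "CFTR", "SCN5A", "APC", "TSC1", "RET"}
# Assumed reading of "criteria_provided or better" as review-status prefixes.
TRUSTED_PREFIXES = (
    "criteria_provided",
    "reviewed_by_expert_panel",
    "practice_guideline",
)
# Assumed placeholders that do not count as a named disease.
UNNAMED_DISEASES = {"", "not_provided", "not_specified"}


def keep_variant(info: dict) -> bool:
    """Apply the four curation filters to one record's INFO-style fields."""
    # GENEINFO is assumed to look like "SYMBOL:id"; keep the symbol only.
    gene = info.get("GENEINFO", "").split(":")[0]
    return (
        info.get("CLNVC") == "single_nucleotide_variant"
        and info.get("CLNREVSTAT", "").startswith(TRUSTED_PREFIXES)
        and info.get("CLNDN", "") not in UNNAMED_DISEASES
        and gene in GENES
    )


example = {
    "CLNVC": "single_nucleotide_variant",
    "CLNREVSTAT": "criteria_provided,_single_submitter",
    "CLNDN": "Lynch_syndrome",
    "GENEINFO": "MSH2:4436",
}
```

Matching on a prefix rather than a substring matters here: a status such as no_assertion_criteria_provided contains the text "criteria_provided" but should not pass the filter.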