| title: GenoTriage | |
| emoji: 🧬 | |
| colorFrom: blue | |
| colorTo: green | |
| sdk: docker | |
| pinned: false | |
| app_port: 8000 | |
| tags: | |
| - openenv | |
| # GenoTriage 🧬 | |
| **An OpenEnv environment where AI agents classify real ClinVar SNP variants using ACMG criteria across three clinical difficulty tiers.** | |
| [OpenEnv](https://meta-pytorch.org/OpenEnv/) | |
| [openenv-core on PyPI](https://pypi.org/project/openenv-core/) | |
| --- | |
| ## Overview | |
| Clinical geneticists classify genetic variants daily to determine whether a mutation causes disease. This judgment (Pathogenic, Likely Pathogenic, Uncertain, Likely Benign, or Benign) directly affects patient care, yet it remains time-consuming, expert-dependent, and difficult to scale. | |
| **GenoTriage** turns this into a structured RL environment. Agents receive real SNP variants from [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) enriched with population frequency data from [gnomAD](https://gnomad.broadinstitute.org/), and must classify them using the standard [ACMG/AMP five-tier system](https://www.acmg.net/). Each episode is single-step: the agent reads the evidence and submits one classification. This makes the environment fast, deterministic, and well suited to both RL training and LLM evaluation. | |
| --- | |
| ## Environment Description | |
| | Property | Value | | |
| |---|---| | |
| | Variant type | SNPs (single nucleotide polymorphisms) only | | |
| | Data source | ClinVar (NCBI) + gnomAD v4 population frequencies | | |
| | Genome build | GRCh38 | | |
| | Episode structure | Single-step (reset → observe → classify → reward → done) | |
| | Tasks | 3 (easy, medium, hard) | | |
| | Variants per task | 8 | | |
| | Interface | OpenEnv-compatible (step / reset / state) | | |
| --- | |
| ## Action Space | |
| The agent submits a `VepAction` with three fields: | |
| | Field | Type | Description | | |
| |---|---|---| | |
| | `classification` | `str` (one of 5) | ACMG tier: `Pathogenic`, `Likely_pathogenic`, `Uncertain_significance`, `Likely_benign`, or `Benign` | | |
| | `reasoning` | `str` | Explanation citing specific evidence from the observation (min 20 chars encouraged) | | |
| | `criteria_used` | `list[str]` | List of specific criteria that drove the decision (e.g. `"high population frequency"`, `"nonsense variant"`) | | |
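
As a rough illustration of the action shape described in the table, the sketch below uses a standard-library dataclass as a stand-in for `VepAction`. The class name `ActionSketch`, the `ACMG_TIERS` tuple, and the validation in `__post_init__` are assumptions made for this example; the real model is the Pydantic class in `models.py` and may validate differently.

```python
from dataclasses import dataclass, field

# The five ACMG tiers accepted by the `classification` field, as listed in the table.
ACMG_TIERS = (
    "Pathogenic",
    "Likely_pathogenic",
    "Uncertain_significance",
    "Likely_benign",
    "Benign",
)


@dataclass
class ActionSketch:
    """Hypothetical stand-in for VepAction (the real model lives in models.py)."""

    classification: str
    reasoning: str
    criteria_used: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject labels outside the five-tier system before anything is sent to the server.
        if self.classification not in ACMG_TIERS:
            raise ValueError(f"unknown ACMG tier: {self.classification!r}")


action = ActionSketch(
    classification="Likely_benign",
    reasoning="Synonymous variant with a high gnomAD allele frequency.",
    criteria_used=["high population frequency", "synonymous variant"],
)
print(action)
```

Checking the label on the client side is a design choice of this sketch: a misspelled tier such as `benign` fails immediately instead of costing a full episode.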
| --- | |
| ## Observation Space | |
| The agent receives a `VepObservation` with the following fields: | |
| | Field | Type | Description | | |
| |---|---|---| | |
| | `gene` | `str` | Gene symbol (e.g. `BRCA1`, `CFTR`, `MSH2`) | | |
| | `chromosome` | `str` | Chromosome (e.g. `17`) | | |
| | `position` | `int` | GRCh38 genomic position | | |
| | `ref` / `alt` | `str` | Reference and alternate alleles | | |
| | `hgvs` | `str` | HGVS genomic notation | | |
| | `consequence` | `str \| None` | Molecular consequence (e.g. `missense_variant`, `nonsense`, `synonymous_variant`) | | |
| | `disease` | `str` | Primary disease associated with this gene | | |
| | `population_frequency` | `float \| None` | gnomAD v4 allele frequency (None if absent from gnomAD) | | |
| | `evidence_snippets` | `list[str]` | 3 to 4 evidence snippets: gene-disease context, consequence interpretation, frequency context, functional evidence | |
| | `task_description` | `str` | Instructions for the agent | | |
| | `feedback` | `str` | Grader feedback after step() β empty on reset() | | |
| | `done` | `bool` | True after first step | | |
| | `reward` | `float` | Reward received (0.0 on reset) | | |
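
One way an agent might turn these fields into text for an LLM prompt is sketched below. `ObservationSketch` and `format_case` are hypothetical names covering only a subset of the fields in the table, and the `<HGVS string>` value is a placeholder rather than a real variant. The main point is that the two nullable fields (`consequence`, `population_frequency`) are spelled out explicitly instead of being printed as `None`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ObservationSketch:
    """Hypothetical subset of the VepObservation fields, for illustration only."""

    gene: str
    hgvs: str
    disease: str
    consequence: Optional[str] = None
    population_frequency: Optional[float] = None
    evidence_snippets: list[str] = field(default_factory=list)


def format_case(obs: ObservationSketch) -> str:
    """Render the evidence as plain text, naming missing values explicitly."""
    if obs.population_frequency is None:
        freq = "absent from gnomAD"
    else:
        freq = f"{obs.population_frequency:.2e}"
    lines = [
        f"Gene: {obs.gene} ({obs.disease})",
        f"Variant: {obs.hgvs}",
        f"Consequence: {obs.consequence or 'not annotated'}",
        f"gnomAD allele frequency: {freq}",
        "Evidence:",
    ]
    lines += [f"- {snippet}" for snippet in obs.evidence_snippets]
    return "\n".join(lines)


# Placeholder case: the HGVS value is deliberately not a real variant.
case = ObservationSketch(gene="MSH2", hgvs="<HGVS string>", disease="Lynch syndrome")
print(format_case(case))
```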
| --- | |
| ## Tasks | |
| ### Task 1: `easy` (Benign / Likely Benign) | |
| Variants with clear benign signals: moderate-to-high population frequency, synonymous or non-coding consequence, and no functional evidence linking the specific variant to disease. Agents should score well by correctly reading population frequency and consequence type. | |
| **Expected agent score: 0.75 to 0.95** | |
| ### Task 2: `medium` (Pathogenic / Likely Pathogenic) | |
| Variants with clear pathogenic signals: loss-of-function consequences (nonsense, splice-site), absent from gnomAD, and strong gene-disease association with clinical literature support. Agents must distinguish signal from noise and identify loss-of-function as a strong pathogenicity indicator. | |
| **Expected agent score: 0.55 to 0.80** | |
| ### Task 3: `hard` (Uncertain Significance) | |
| Variants where evidence is genuinely ambiguous: missense or regulatory variants in disease genes with no functional studies, conflicting computational predictions, or intermediate frequency. Agents must recognise when evidence is insufficient rather than defaulting to a confident classification. | |
| **Expected agent score: 0.35 to 0.60** | |
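
The signals named in the three task descriptions can be turned into a very small rule-based baseline, sketched below. Everything here is an assumption for illustration: the function name `heuristic_tier`, the substring checks on the consequence string, and especially the `COMMON_AF` cut-off, which is an arbitrary value and not an ACMG threshold. It is meant as a sanity-check baseline, not as a clinical classifier.

```python
from typing import Optional

# Illustrative cut-off only: an assumption for this sketch, not an ACMG threshold.
COMMON_AF = 0.01


def heuristic_tier(consequence: Optional[str], population_frequency: Optional[float]) -> str:
    """Map consequence type and gnomAD frequency to one of the five tiers."""
    cons = (consequence or "").lower()
    loss_of_function = "nonsense" in cons or "splice" in cons
    synonymous = "synonymous" in cons
    common = population_frequency is not None and population_frequency >= COMMON_AF

    # Medium-tier signal: loss of function and absent from gnomAD.
    if loss_of_function and population_frequency is None:
        return "Pathogenic"
    # Easy-tier signal: both benign indicators agree.
    if common and synonymous:
        return "Benign"
    # Only one benign indicator, and nothing pointing at loss of function.
    if (common or synonymous) and not loss_of_function:
        return "Likely_benign"
    # Hard-tier default: admit the evidence is insufficient.
    return "Uncertain_significance"


print(heuristic_tier("nonsense", None))
print(heuristic_tier("synonymous_variant", 0.12))
print(heuristic_tier("missense_variant", None))
```

Falling back to `Uncertain_significance` rather than a confident label mirrors what the hard tier asks of agents.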
| --- | |
| ## Reward Function | |
| Each step returns a reward in `[0.0, 1.0]` composed of three components: | |
| | Component | Max | Criteria | | |
| |---|---|---| | |
| | Classification accuracy | 0.70 | Exact match=0.70, one tier off=0.25, two off=0.05, three+ off=0.00 | | |
| | Reasoning quality | 0.20 | Keyword matches in reasoning (+0.12) + length ≥ 50 chars (+0.08) | |
| | Criteria used | 0.10 | Non-empty list (+0.04) + ≥ 2 items (+0.06) | |
| > **Important:** Reasoning and criteria bonuses are fully suppressed when the classification is 3+ tiers away from ground truth (e.g. Benign for a Pathogenic variant). Good writing cannot rescue a catastrophically wrong answer. | |
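
The table and the suppression rule can be combined into a short sketch of how such a reward might be computed. This is not the grader in `server/vep_env_environment.py`; the function name `sketch_reward`, the `keywords` argument, and the reading of "keyword matches" as "at least one keyword appears in the reasoning" are assumptions. Tier distance is taken as the index difference along the Pathogenic-to-Benign ordering.

```python
# Tiers ordered from most pathogenic to most benign; distance is the index difference.
TIERS = [
    "Pathogenic",
    "Likely_pathogenic",
    "Uncertain_significance",
    "Likely_benign",
    "Benign",
]
ACCURACY_BY_DISTANCE = {0: 0.70, 1: 0.25, 2: 0.05}  # three or more tiers off scores 0.00


def sketch_reward(predicted, truth, reasoning, criteria_used, keywords):
    """Approximate the three-component reward described in the table."""
    distance = abs(TIERS.index(predicted) - TIERS.index(truth))
    if distance >= 3:
        # Catastrophically wrong: accuracy is 0.00 and both bonuses are suppressed.
        return 0.0
    accuracy = ACCURACY_BY_DISTANCE[distance]

    reasoning_score = 0.0
    text = reasoning.lower()
    if any(keyword.lower() in text for keyword in keywords):  # assumed matching rule
        reasoning_score += 0.12
    if len(reasoning) >= 50:
        reasoning_score += 0.08

    criteria_score = 0.0
    if criteria_used:
        criteria_score += 0.04
    if len(criteria_used) >= 2:
        criteria_score += 0.06

    return round(accuracy + reasoning_score + criteria_score, 2)


reward = sketch_reward(
    predicted="Pathogenic",
    truth="Pathogenic",
    reasoning="Nonsense variant in MSH2, absent from gnomAD, causes Lynch syndrome.",
    criteria_used=["nonsense variant", "absent from gnomAD"],
    keywords=["nonsense", "gnomad"],  # hypothetical keyword list
)
print(reward)  # 1.0
```

Under this reading, an answer one tier off can still reach 0.25 + 0.20 + 0.10 = 0.55, while an answer three or more tiers off always scores 0.0.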
| --- | |
| ## Setup | |
| ### Prerequisites | |
| - Python 3.10+ | |
| - Docker Desktop or Docker Engine | |
| - A Hugging Face API token (free at [huggingface.co](https://huggingface.co)) | |
| ### Install | |
| ```bash | |
| git clone https://huggingface.co/spaces/fierce74/GenoTriage | |
| cd GenoTriage | |
| pip install "openenv-core>=0.2.2" | |
| ``` | |
| ### Configure environment variables | |
| Copy `.env.example` to `.env` and fill in your values: | |
| ```bash | |
| cp .env.example .env | |
| ``` | |
| ```env | |
| HF_TOKEN=hf_your_token_here | |
| API_BASE_URL=https://router.huggingface.co/v1 | |
| MODEL_NAME=Qwen/Qwen2.5-72B-Instruct | |
| LOCAL_IMAGE_NAME=vep_env_env:latest | |
| ``` | |
| ### Build the Docker image | |
| ```bash | |
| docker build -t vep_env_env:latest . | |
| ``` | |
| ### Run the server locally (without Docker) | |
| ```bash | |
| pip install -e . | |
| uvicorn server.app:app --host 0.0.0.0 --port 8000 | |
| ``` | |
| --- | |
| ## Usage | |
| ### Run the baseline inference script | |
| ```bash | |
| python inference.py | |
| ``` | |
| This runs all 3 tasks sequentially (easy → medium → hard), printing structured logs: | |
| ``` | |
| [START] task=easy env=vep_env model=Qwen/Qwen2.5-72B-Instruct | |
| [STEP] step=1 action=Benign|CFTR reward=1.00 done=true error=null | |
| ... | |
| [END] success=true steps=8 score=0.875 rewards=1.00,0.90,... | |
| ``` | |
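
If the logs need to be post-processed, the `key=value` layout shown above is easy to split. The helper below is a minimal sketch based only on the sample lines. `parse_log_line` is a hypothetical name, and the sketch assumes that values contain no spaces, which may not hold for a non-null `error` field.

```python
def parse_log_line(line: str) -> tuple[str, dict[str, str]]:
    """Split '[TAG] key=value key=value ...' into the tag and a field dict."""
    tag, _, rest = line.strip().partition("]")
    fields = {}
    for token in rest.split():
        if "=" in token:  # skip anything that is not a key=value pair
            key, value = token.split("=", 1)
            fields[key] = value
    return tag.lstrip("["), fields


tag, fields = parse_log_line(
    "[STEP] step=1 action=Benign|CFTR reward=1.00 done=true error=null"
)
print(tag, fields["action"], float(fields["reward"]))
```

Values are kept as strings, so callers convert `reward` or `score` to `float` themselves.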
| ### Use the client in your own code | |
| ```python | |
| import asyncio | |
| from vep_env import VepAction, VepEnv | |
| async def main(): | |
|     async with VepEnv(base_url="http://localhost:8000") as env: | |
|         # Reset: receive a variant case | |
|         result = await env.reset() | |
|         obs = result.observation | |
|         print(f"Gene: {obs.gene} | Disease: {obs.disease}") | |
|         print(f"Consequence: {obs.consequence}") | |
|         print(f"Population frequency: {obs.population_frequency}") | |
|         for snippet in obs.evidence_snippets: | |
|             print(f" - {snippet}") | |
|         # Submit classification | |
|         action = VepAction( | |
|             classification="Pathogenic", | |
|             reasoning="Nonsense variant in MSH2, absent from gnomAD, causes Lynch syndrome.", | |
|             criteria_used=["nonsense variant", "absent from gnomAD", "disease gene"], | |
|         ) | |
|         result = await env.step(action) | |
|         print(f"Reward: {result.reward}") | |
|         print(f"Feedback: {result.observation.feedback}") | |
| asyncio.run(main()) | |
| ``` | |
| ### Control the task tier | |
| ```bash | |
| VEP_TASK=medium python inference.py # run medium tier only | |
| VEP_TASK=hard uvicorn server.app:app # start server in hard mode | |
| ``` | |
| --- | |
| ## Baseline Scores | |
| Evaluated using `Qwen/Qwen2.5-72B-Instruct` via the Hugging Face Inference Router. | |
| | Task | Score | Notes | | |
| |---|---|---| | |
| | easy | 0.875 | Model correctly identifies benign signals in most cases | | |
| | medium | 0.800 | Strong on loss-of-function; occasionally misses subtle pathogenic signals | | |
| | hard | 0.738 | Tends toward confident classifications when VUS is the correct answer | |
| | **overall** | **0.804** | Average across all 3 tasks | | |
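
The overall figure is the unweighted mean of the three task scores, which can be checked in a few lines. The dictionary below copies the values from the table.

```python
from statistics import mean

# Per-task baseline scores copied from the table.
task_scores = {"easy": 0.875, "medium": 0.800, "hard": 0.738}

overall = round(mean(list(task_scores.values())), 3)
print(overall)  # 0.804
```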
| --- | |
| ## Project Structure | |
| ``` | |
| GenoTriage/ | |
| ├── __init__.py # Package exports | |
| ├── models.py # VepAction, VepObservation (Pydantic) | |
| ├── client.py # VepEnv client (WebSocket) | |
| ├── inference.py # Baseline inference script | |
| ├── variants.json # Curated ClinVar variants (ground truth) | |
| ├── openenv.yaml # OpenEnv spec manifest | |
| ├── pyproject.toml # Package config | |
| ├── Dockerfile # Container definition | |
| └── server/ | |
|     ├── app.py # FastAPI application | |
|     ├── vep_env_environment.py # Environment logic + grader | |
|     └── requirements.txt # Server dependencies | |
| --- | |
| ## Data | |
| Variants are sourced from ClinVar (April 2026 release, GRCh38) filtered to: | |
| - SNPs only (`CLNVC=single_nucleotide_variant`) | |
| - Trusted review status (`criteria_provided` or better) | |
| - Named disease association | |
| - 8 well-known disease genes: MSH2, MLH1, VHL, CFTR, SCN5A, APC, TSC1, RET | |
| Population allele frequencies are from gnomAD v4, queried at curation time and stored statically, so there are no live API calls at runtime. | |
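
The filter criteria listed above map onto INFO keys in the ClinVar VCF (`CLNVC`, `CLNREVSTAT`, `CLNDN`, `GENEINFO`). The sketch below shows one way such a filter could look; it is not the project's curation script. The helper names are hypothetical, the sample INFO string is hand-written, and accepting every status that starts with `criteria_provided` (including conflicting submissions) is an assumption about what "or better" means. Note the use of `startswith`: a plain substring test would wrongly accept `no_assertion_criteria_provided`.

```python
GENES = {"MSH2", "MLH1", "VHL", "CFTR", "SCN5A", "APC", "TSC1", "RET"}
# Review statuses treated as trusted in this sketch (an assumption, see above).
TRUSTED_PREFIXES = ("criteria_provided", "reviewed_by_expert_panel", "practice_guideline")
UNNAMED_DISEASES = {"", "not_provided", "not_specified"}


def parse_info(info: str) -> dict[str, str]:
    """Split a VCF INFO column ('KEY=value;KEY=value') into a dict."""
    return dict(item.split("=", 1) for item in info.split(";") if "=" in item)


def keep_record(info: str) -> bool:
    """Apply the four filters from the list above to one INFO string."""
    fields = parse_info(info)
    # GENEINFO looks like 'SYMBOL:id' entries separated by '|'.
    genes = {entry.split(":")[0] for entry in fields.get("GENEINFO", "").split("|")}
    return (
        fields.get("CLNVC") == "single_nucleotide_variant"
        and fields.get("CLNREVSTAT", "").startswith(TRUSTED_PREFIXES)
        and fields.get("CLNDN", "") not in UNNAMED_DISEASES
        and bool(genes & GENES)
    )


# Hand-written example INFO string, not copied from a ClinVar release.
sample = (
    "CLNDN=Lynch_syndrome;CLNREVSTAT=criteria_provided,_single_submitter;"
    "CLNVC=single_nucleotide_variant;GENEINFO=MSH2:4436"
)
print(keep_record(sample))  # True
```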
| --- | |