title: GenoTriage
emoji: 🧬
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv

GenoTriage 🧬

An OpenEnv environment where AI agents classify real ClinVar SNP variants using ACMG criteria across three clinical difficulty tiers.



Overview

Clinical geneticists classify genetic variants daily to determine whether a mutation causes disease. This judgment (Pathogenic, Likely Pathogenic, Uncertain, Likely Benign, or Benign) directly impacts patient care, yet remains time-consuming, expert-dependent, and difficult to scale.

GenoTriage turns this into a structured RL environment. Agents receive real SNP variants from ClinVar enriched with population frequency data from gnomAD, and must classify them using the standard ACMG/AMP five-tier system. Each episode is single-step: the agent reads the evidence and submits one classification. This makes the environment fast, deterministic, and well-suited for both RL training and LLM evaluation.


Environment Description

| Property | Value |
| --- | --- |
| Variant type | SNPs (single nucleotide polymorphisms) only |
| Data source | ClinVar (NCBI) + gnomAD v4 population frequencies |
| Genome build | GRCh38 |
| Episode structure | Single-step (reset → observe → classify → reward → done) |
| Tasks | 3 (easy, medium, hard) |
| Variants per task | 8 |
| Interface | OpenEnv-compatible (step / reset / state) |

Action Space

The agent submits a VepAction with three fields:

| Field | Type | Description |
| --- | --- | --- |
| classification | str (one of 5) | ACMG tier: Pathogenic, Likely_pathogenic, Uncertain_significance, Likely_benign, or Benign |
| reasoning | str | Explanation citing specific evidence from the observation (min 20 chars encouraged) |
| criteria_used | list[str] | List of specific criteria that drove the decision (e.g. "high population frequency", "nonsense variant") |
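
As a rough illustration of the action shape, the sketch below mirrors the three fields with a standard-library dataclass. The repository's actual VepAction is a Pydantic model in models.py; the class name `ActionSketch` and its validation rule are assumptions made for this example only.

```python
from dataclasses import dataclass, field

# The five ACMG tiers, spelled as in the table above.
ACMG_TIERS = (
    "Pathogenic",
    "Likely_pathogenic",
    "Uncertain_significance",
    "Likely_benign",
    "Benign",
)


@dataclass
class ActionSketch:
    """Hypothetical stand-in for VepAction (the real model is Pydantic)."""

    classification: str
    reasoning: str
    criteria_used: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject labels outside the five-tier system before sending anything.
        if self.classification not in ACMG_TIERS:
            raise ValueError(f"unknown ACMG tier: {self.classification!r}")


action = ActionSketch(
    classification="Likely_benign",
    reasoning="Synonymous variant with high gnomAD allele frequency.",
    criteria_used=["high population frequency", "synonymous variant"],
)
print(action.classification)
```

Checking the tier label on the client side is a design choice of this sketch; it catches misspelled labels such as "Likely Benign" before they reach the grader.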

Observation Space

The agent receives a VepObservation with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| gene | str | Gene symbol (e.g. BRCA1, CFTR, MSH2) |
| chromosome | str | Chromosome (e.g. 17) |
| position | int | GRCh38 genomic position |
| ref / alt | str | Reference and alternate alleles |
| hgvs | str | HGVS genomic notation |
| consequence | str \| None | Molecular consequence (e.g. missense_variant, nonsense, synonymous_variant) |
| disease | str | Primary disease associated with this gene |
| population_frequency | float \| None | gnomAD v4 allele frequency (None if absent from gnomAD) |
| evidence_snippets | list[str] | 3–4 evidence snippets: gene-disease context, consequence interpretation, frequency context, functional evidence |
| task_description | str | Instructions for the agent |
| feedback | str | Grader feedback after step(); empty on reset() |
| done | bool | True after first step |
| reward | float | Reward received (0.0 on reset) |
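
One way an LLM-based agent might consume these fields is to flatten them into a text prompt. The sketch below is a minimal example under stated assumptions: the observation is passed as a plain dict, the allele fields are named `ref` and `alt`, and `build_prompt` is a hypothetical helper, not part of the package. The example values are placeholders, not records from variants.json.

```python
def build_prompt(obs: dict) -> str:
    """Render observation fields as a plain-text prompt (illustrative only)."""
    freq = obs.get("population_frequency")
    # None means the variant is absent from gnomAD, which is itself evidence.
    freq_text = "absent from gnomAD" if freq is None else f"{freq:.2e}"
    locus = f"chr{obs['chromosome']}:{obs['position']} {obs['ref']}>{obs['alt']}"
    lines = [
        obs["task_description"],
        f"Gene: {obs['gene']} ({locus})",
        f"HGVS: {obs['hgvs']}",
        f"Consequence: {obs.get('consequence') or 'not reported'}",
        f"Disease: {obs['disease']}",
        f"gnomAD allele frequency: {freq_text}",
        "Evidence:",
    ]
    lines += [f"- {snippet}" for snippet in obs["evidence_snippets"]]
    return "\n".join(lines)


# Placeholder observation; every value here is made up for the example.
example = {
    "task_description": "Classify this variant.",
    "gene": "MSH2",
    "chromosome": "2",
    "position": 12345,
    "ref": "C",
    "alt": "T",
    "hgvs": "<hgvs string>",
    "consequence": "nonsense",
    "disease": "Lynch syndrome",
    "population_frequency": None,
    "evidence_snippets": ["<snippet 1>", "<snippet 2>"],
}
prompt = build_prompt(example)
print(prompt)
```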

Tasks

Task 1: easy (Benign / Likely Benign)

Variants with clear benign signals: moderate-to-high population frequency, synonymous or non-coding consequence, and no functional evidence linking the specific variant to disease. Agents should score well by correctly reading population frequency and consequence type.

Expected agent score: 0.75 – 0.95

Task 2: medium (Pathogenic / Likely Pathogenic)

Variants with clear pathogenic signals: loss-of-function consequences (nonsense, splice-site), absent from gnomAD, and strong gene-disease association with clinical literature support. Agents must distinguish signal from noise and identify loss-of-function as a strong pathogenicity indicator.

Expected agent score: 0.55 – 0.80

Task 3: hard (Uncertain Significance)

Variants where evidence is genuinely ambiguous: missense or regulatory variants in disease genes with no functional studies, conflicting computational predictions, or intermediate frequency. Agents must recognise when evidence is insufficient rather than defaulting to a confident classification.

Expected agent score: 0.35 – 0.60


Reward Function

Each step returns a reward in [0.0, 1.0] composed of three components:

| Component | Max | Criteria |
| --- | --- | --- |
| Classification accuracy | 0.70 | Exact match = 0.70, one tier off = 0.25, two tiers off = 0.05, three or more off = 0.00 |
| Reasoning quality | 0.20 | Keyword matches in reasoning (+0.12) + length ≥ 50 chars (+0.08) |
| Criteria used | 0.10 | Non-empty list (+0.04) + ≥ 2 items (+0.06) |

Important: Reasoning and criteria bonuses are fully suppressed when the classification is 3+ tiers away from ground truth (e.g. Benign for a Pathogenic variant). Good writing cannot rescue a catastrophically wrong answer.
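
The scoring rules above can be sketched as a single function. This is not the grader in server/vep_env_environment.py; it is a minimal reconstruction from the table, and it assumes three things the README does not state: that tiers are ordered from Benign to Pathogenic, that any single keyword match earns the full +0.12, and that the keyword list is supplied by the caller. The function name `score` is hypothetical.

```python
# Assumed tier ordering, from most benign to most pathogenic.
TIERS = [
    "Benign",
    "Likely_benign",
    "Uncertain_significance",
    "Likely_pathogenic",
    "Pathogenic",
]
# Accuracy component by tier distance; three or more tiers off scores 0.00.
ACCURACY_BY_DISTANCE = {0: 0.70, 1: 0.25, 2: 0.05}


def score(predicted, truth, reasoning, criteria_used, keywords):
    """Illustrative reward in [0.0, 1.0] following the component table."""
    distance = abs(TIERS.index(predicted) - TIERS.index(truth))
    if distance >= 3:
        # Bonuses are fully suppressed for catastrophically wrong answers.
        return 0.0
    total = ACCURACY_BY_DISTANCE[distance]
    text = reasoning.lower()
    if any(keyword.lower() in text for keyword in keywords):
        total += 0.12  # assumption: one match is enough for the full bonus
    if len(reasoning) >= 50:
        total += 0.08
    if criteria_used:
        total += 0.04
    if len(criteria_used) >= 2:
        total += 0.06
    return round(total, 2)


reasoning = "Nonsense variant in MSH2, absent from gnomAD, causes Lynch syndrome."
criteria = ["nonsense variant", "absent from gnomAD", "disease gene"]
keywords = ["nonsense", "gnomad"]  # hypothetical keyword list

print(score("Pathogenic", "Pathogenic", reasoning, criteria, keywords))  # 1.0
print(score("Benign", "Pathogenic", reasoning, criteria, keywords))      # 0.0
```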


Setup

Prerequisites

  • Python 3.10+
  • Docker Desktop or Docker Engine
  • A Hugging Face API token (free at huggingface.co)

Install

git clone https://huggingface.co/spaces/fierce74/GenoTriage

cd GenoTriage
pip install "openenv-core>=0.2.2"

Configure environment variables

Copy .env.example to .env and fill in your values:

cp .env.example .env

HF_TOKEN=hf_your_token_here
API_BASE_URL=https://router.huggingface.co/v1
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
LOCAL_IMAGE_NAME=vep_env_env:latest

Build the Docker image

docker build -t vep_env_env:latest .

Run the server locally (without Docker)

pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000

Usage

Run the baseline inference script

python inference.py

This runs all 3 tasks sequentially (easy → medium → hard), printing structured logs:

[START] task=easy env=vep_env model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=Benign|CFTR reward=1.00 done=true error=null
...
[END] success=true steps=8 score=0.875 rewards=1.00,0.90,...
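
Because each log line is a bracketed tag followed by space-separated key=value pairs, the output is easy to post-process. The helper below is a hypothetical sketch (`parse_log_line` is not part of the repository) and assumes that values never contain spaces, as in the sample lines above.

```python
def parse_log_line(line: str) -> tuple[str, dict[str, str]]:
    """Split '[TAG] key=value ...' into the tag and a dict of string values."""
    tag, _, rest = line.partition("] ")
    # Skip any token without '=' so stray text does not break parsing.
    fields = dict(token.split("=", 1) for token in rest.split() if "=" in token)
    return tag.lstrip("["), fields


tag, fields = parse_log_line(
    "[STEP] step=1 action=Benign|CFTR reward=1.00 done=true error=null"
)
classification, gene = fields["action"].split("|")
print(tag, classification, gene, float(fields["reward"]))
```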

Use the client in your own code

import asyncio
from vep_env import VepAction, VepEnv

async def main():
    async with VepEnv(base_url="http://localhost:8000") as env:
        # Reset: receive a variant case
        result = await env.reset()
        obs = result.observation
        print(f"Gene: {obs.gene} | Disease: {obs.disease}")
        print(f"Consequence: {obs.consequence}")
        print(f"Population frequency: {obs.population_frequency}")
        for snippet in obs.evidence_snippets:
            print(f"  - {snippet}")

        # Submit classification
        action = VepAction(
            classification="Pathogenic",
            reasoning="Nonsense variant in MSH2, absent from gnomAD, causes Lynch syndrome.",
            criteria_used=["nonsense variant", "absent from gnomAD", "disease gene"],
        )
        result = await env.step(action)
        print(f"Reward: {result.reward}")
        print(f"Feedback: {result.observation.feedback}")

asyncio.run(main())

Control the task tier

VEP_TASK=medium python inference.py   # run medium tier only
VEP_TASK=hard uvicorn server.app:app  # start server in hard mode
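
For readers wiring their own scripts to the same variable, the sketch below shows one plausible way to interpret VEP_TASK. It assumes that an unset variable means "run all three tiers", which matches the default behaviour of inference.py described above; how the repository actually reads the variable is not shown here, and `resolve_tasks` is a hypothetical name.

```python
import os

ALL_TASKS = ("easy", "medium", "hard")


def resolve_tasks(env=None) -> list[str]:
    """Return the task tiers to run, based on the VEP_TASK variable."""
    env = os.environ if env is None else env
    value = env.get("VEP_TASK", "").strip().lower()
    if not value:
        # Assumption: no VEP_TASK means every tier, in order of difficulty.
        return list(ALL_TASKS)
    if value not in ALL_TASKS:
        raise ValueError(f"VEP_TASK must be one of {ALL_TASKS}, got {value!r}")
    return [value]


print(resolve_tasks({"VEP_TASK": "medium"}))
print(resolve_tasks({}))
```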

Baseline Scores

Evaluated using Qwen/Qwen2.5-72B-Instruct via Hugging Face Inference Router.

| Task | Score | Notes |
| --- | --- | --- |
| easy | 0.875 | Model correctly identifies benign signals in most cases |
| medium | 0.800 | Strong on loss-of-function; occasionally misses subtle pathogenic signals |
| hard | 0.738 | Tends toward confident classifications when VUS is the correct answer |
| overall | 0.804 | Average across all 3 tasks |
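
The overall figure is the unweighted mean of the three task scores, which can be checked directly:

```python
# Per-task baseline scores copied from the table above.
task_scores = {"easy": 0.875, "medium": 0.800, "hard": 0.738}

overall = sum(task_scores.values()) / len(task_scores)
print(f"{overall:.3f}")  # 0.804
```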

Project Structure

GenoTriage/
├── __init__.py              # Package exports
├── models.py                # VepAction, VepObservation (Pydantic)
├── client.py                # VepEnv client (WebSocket)
├── inference.py             # Baseline inference script
├── variants.json            # Curated ClinVar variants (ground truth)
├── openenv.yaml             # OpenEnv spec manifest
├── pyproject.toml           # Package config
├── Dockerfile               # Container definition
└── server/
    ├── app.py               # FastAPI application
    ├── vep_env_environment.py  # Environment logic + grader
    └── requirements.txt     # Server dependencies

Data

Variants are sourced from ClinVar (April 2026 release, GRCh38) filtered to:

  • SNPs only (CLNVC=single_nucleotide_variant)
  • Trusted review status (criteria_provided or better)
  • Named disease association
  • 8 well-known disease genes: MSH2, MLH1, VHL, CFTR, SCN5A, APC, TSC1, RET
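
The filters listed above could be applied to the INFO column of a ClinVar VCF roughly as follows. This is a sketch, not the curation script used for variants.json: it assumes the standard ClinVar INFO keys (CLNVC, CLNREVSTAT, CLNDN, GENEINFO), treats any review status beginning with "criteria_provided" as trusted, and uses hypothetical helper names. The INFO strings in the example are synthetic.

```python
TARGET_GENES = {"MSH2", "MLH1", "VHL", "CFTR", "SCN5A", "APC", "TSC1", "RET"}
# Placeholder disease names that do not count as a named association.
UNNAMED = {"", "not_provided", "not_specified"}
TRUSTED_PREFIXES = (
    "criteria_provided",
    "reviewed_by_expert_panel",
    "practice_guideline",
)


def parse_info(info: str) -> dict[str, str]:
    """Parse a VCF INFO string of the form 'KEY=value;KEY=value'."""
    return dict(item.split("=", 1) for item in info.split(";") if "=" in item)


def keep_record(info: str) -> bool:
    """Return True if a ClinVar record passes all four filters."""
    fields = parse_info(info)
    # GENEINFO looks like 'SYMBOL:id|SYMBOL:id'; keep only the symbols.
    genes = {g.split(":")[0] for g in fields.get("GENEINFO", "").split("|") if g}
    diseases = set(fields.get("CLNDN", "").split("|")) - UNNAMED
    # startswith avoids matching 'no_assertion_criteria_provided'.
    trusted = fields.get("CLNREVSTAT", "").startswith(TRUSTED_PREFIXES)
    return (
        fields.get("CLNVC") == "single_nucleotide_variant"
        and trusted
        and bool(diseases)
        and bool(genes & TARGET_GENES)
    )


good = (
    "CLNVC=single_nucleotide_variant;"
    "CLNREVSTAT=criteria_provided,_single_submitter;"
    "CLNDN=Lynch_syndrome;GENEINFO=MSH2:4436"
)
weak = good.replace(
    "criteria_provided,_single_submitter", "no_assertion_criteria_provided"
)
print(keep_record(good), keep_record(weak))
```

Matching on the prefix rather than a substring matters here, because the untrusted status "no_assertion_criteria_provided" contains the text "criteria_provided".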

Population allele frequencies are from gnomAD v4 (queried at curation time and stored statically, so no live API calls are made at runtime).