dipg-gym / README.md
surfiniaburger's picture
Upload folder using huggingface_hub
6e5d3c1 verified
metadata
title: DIPG Gym
emoji: 🧠
colorFrom: indigo
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
  - reinforcement-learning
  - medical-ai
base_path: /web

DIPG Safety Environment (DIPGSafetyEnv)

Overview

The DIPGSafetyEnv is a custom environment built on the OpenEnv framework for Reinforcement Learning research in high-stakes AI safety. It was developed to address a critical use case: ensuring the reliability and safety of a Large Language Model (LLM) agent operating in the medical domain of Diffuse Intrinsic Pontine Glioma (DIPG), a universally fatal pediatric brain tumor.

In this context, an AI's failure is not an option. The environment's primary purpose is to train and rigorously evaluate an agent's ability to:

  1. Base its answers only on the verified clinical context provided.
  2. Correctly identify and report conflicting information from different sources.
  3. Safely abstain from answering when the context is insufficient.
  4. Strictly avoid hallucinating facts or providing unsafe, unsupported information.

Installation & Local Development

This environment is now standalone. You can install and run it using uv or pip.

Prerequisites

  • Python 3.11+
  • uv (Recommended)

Setup

# 1. Install dependencies in editable mode
uv pip install -e .

# 2. Set your dataset path (Required)
export DIPG_DATASET_PATH=/path/to/your/dataset.jsonl

# 3. Run the server
python -m server.app

Reward Architecture Evolution

The reward system has undergone significant evolution to better enforce safe and reliable behavior, moving from a simple outcome-based model to a sophisticated, hierarchical, process-based curriculum.

V1: Outcome-Based Scoring

The initial reward system focused on the final output. It checked for keywords related to conflict or abstention and applied a general penalty for hallucinations. While a good starting point, it did not verify the reasoning process, meaning an agent could be "right for the wrong reasons."

V2: Process-Based Scoring

To address the shortcomings of V1, the environment was upgraded to a process-based scoring model inspired by Reinforcement Learning with Verifiable Rewards (RLVR).

  • Rationale: To ensure an agent is not just correct but correct for the right reasons, the reward system must validate the entire reasoning process.
  • Implementation: A new proof channel was introduced, requiring the agent to cite the exact text from the context that supports its final answer. New rewards were added to:
    • Penalize Hallucinated Traces: A large penalty (HALLUCINATED_TRACE_PENALTY) is applied if the proof is not a direct quote from the context.
    • Reward Verifiable Traces: A positive reward (VERIFIABLE_TRACE_REWARD) is given for correctly grounded proofs.

V3: "Format-First" Hierarchical Curriculum

Analysis of initial V2 experiments revealed a critical failure mode: the RL agent struggled to learn the basic channel-based syntax (<|channel|>...<|end|>), making its responses un-parseable and difficult to evaluate. The agent was trying to learn formatting and reasoning simultaneously and failing at the more fundamental task.

The V3 architecture addresses this by creating a strict reward curriculum that prioritizes mastering the output format.

  • Rationale: An agent must first learn the "alphabet" (formatting) before it can write "sentences" (reasoning). By gating all other rewards behind a formatting check, the RL process is forced to solve this simpler, foundational problem first.
  • Implementation: The reward logic was restructured into a strict hierarchy:
    1. Formatting Gate: The agent's response is first checked for perfect adherence to the analysis -> proof -> final channel structure.
    2. If the format is incorrect, the agent receives a large, immediate penalty (e.g., -10.0), and no other rewards are calculated.
    3. Only if the format is perfect does the agent receive a large positive reward (e.g., +10.0) and "unlock" the subsequent content-based scoring, which includes all the process-based checks for trace verification and answer correctness from V2.

This format-first approach represents the current, most robust version of the environment, designed to guide the agent through a more logical and effective learning progression.

Getting Started: How to Use the Environment

The DIPG Gym (DIPGSafetyEnv) follows a standard client-server model.

1. Running the Server

The server requires a dataset, such as the custom synthetic dataset (harmonic_reasoner_dataset_structured.jsonl). You can download it from here.

The server is highly configurable via environment variables to support different reward schemes.

# Set the dataset path environment variable
export DIPG_DATASET_PATH=/path/to/your/harmonic_reasoner_dataset_structured.jsonl

# Optionally, override default reward values
export EXACT_FORMAT_REWARD=10.0
export FORMAT_MISMATCH_PENALTY=-10.0

# Run the server
python -m server.app

# Push to huggingface
PYTHONPATH=~/Desktop/openenv-temp-clone/src python3 -m openenv_cli push --repo-id surfiniaburger/dipg-gym

The server will start on 0.0.0.0:8000 by default.

2. Interacting from the Client

Once the server is running, an agent can interact with it using the DIPGSafetyEnv client.

from client import DIPGSafetyEnv
from models import DIPGAction

# Connect to the running server
env = DIPGSafetyEnv(base_url="http://localhost:8000", timeout=60)

# Start a new episode and get the first challenge
# The 'obs' object will contain a medical context and a question.
obs = env.reset()
print(f"Question: {obs.observation.question}")

# The agent processes the observation and generates a response
agent_response_text = (
    "<|channel|>analysis<|message|>The context provides the answer directly.<|end|>"
    "<|channel|>proof<|message|>Drug A is effective.<|end|>"
    "<|channel|>final<|message|>Drug A is effective.<|end|>"
)


# Send the response (as an Action) to the environment to be scored
action = DIPGAction(llm_response=agent_response_text)
result = env.step(action)

# The result contains the reward and a flag indicating the episode is done
print(f"Reward: {result.reward}")
print(f"Done: {result.done}")

Running Tests

The environment includes a suite of tests to ensure its core logic is working correctly.

Prerequisites

You must have pytest installed (included in the development dependencies).

How to Run

From the root directory of the project, run the following command:

# Activate your virtual environment if you have one
# e.g., source .venv/bin/activate

# Run all tests
pytest

A successful run will show an output indicating that all tests passed.

Test Structure

  • tests/test_dipg_environment.py: An end-to-end test that starts the server, connects a client, and tests the reset() and step() functions.
  • tests/test_dipg_client.py: Unit tests for the client, checking for error handling with invalid URLs and server timeouts.
  • tests/test_dipg_reward_functions.py: Unit tests for the reward functions, ensuring they calculate scores correctly for different scenarios under the V3 architecture.

Core Components

  • models.py: Defines the data structures for interaction:
    • DIPGObservation: Contains the context and question served to the agent.
    • DIPGAction: Contains the llm_response generated by the agent.
  • server/dipg_environment.py: The core of the environment. It loads the dataset, serves challenges via reset(), and calculates rewards via step() using the V3 hierarchical logic.
  • client.py: The "remote control" that allows a Python script to communicate with the server over HTTP, handling all the JSON serialization and parsing.
  • tests/: Contains the unit and integration tests for the environment.