---
title: Contract Validation Environment Server
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
---

# Contract Validation Environment
The Contract Validation Environment is an OpenEnv-compliant RL and LLM benchmark designed to test an agent's ability to act as a precise legal assistant. The agent must review various contract clauses, identify specific legal risks (e.g., liability, termination, payment), and correctly flag them without generating false positives on standard, safe clauses.
## Motivation
Legal contract review is a massive industry, but off-the-shelf LLMs often struggle with "alert fatigue": flagging everything as a risk. This environment challenges agents to precisely isolate genuine liabilities across varying difficulty levels while explicitly rewarding speed and accuracy.
## Tasks & Difficulty
The environment features 3 deterministic tasks with increasing complexity:
- Easy: 1 clause. Contains a single, explicit liability risk. Tests basic risk identification.
- Medium: 3 clauses. Requires identifying payment and termination risks while actively ignoring a safe governing-law distractor clause.
- Hard: 5 clauses. A complex mix of confidentiality, liability, and compliance risks interspersed with dense, safe, standard boilerplate clauses. Challenges frontier models to avoid false positives.
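For illustration, the three deterministic tasks could be laid out as a simple mapping from level to clause data. The sketch below is hypothetical — the clause texts and `risk` labels are invented, and only the clause counts and risk categories follow the descriptions above:

```python
# Illustrative task layout. Clause texts are placeholders; only the counts
# and risk categories come from the difficulty descriptions above.
TASKS = {
    "easy": [
        {"id": 1, "text": "Vendor assumes unlimited liability for ...", "risk": "liability"},
    ],
    "medium": [
        {"id": 1, "text": "Payment is due within 5 days of invoice ...", "risk": "payment"},
        {"id": 2, "text": "Either party may terminate without notice ...", "risk": "termination"},
        {"id": 3, "text": "This agreement is governed by the laws of ...", "risk": "none"},  # safe distractor
    ],
    "hard": [
        {"id": 1, "text": "...", "risk": "confidentiality"},
        {"id": 2, "text": "...", "risk": "none"},       # boilerplate
        {"id": 3, "text": "...", "risk": "liability"},
        {"id": 4, "text": "...", "risk": "none"},       # boilerplate
        {"id": 5, "text": "...", "risk": "compliance"},
    ],
}

# Only the non-"none" clauses should be flagged by the agent.
risky = {level: [c for c in cs if c["risk"] != "none"] for level, cs in TASKS.items()}
```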
## Environment Details
### Action Space (`ContractValidationAction`)
- `clause_id` (int): The ID of the clause being reviewed (set to 0 when submitting the final answer).
- `risk_type` (str): The identified risk (e.g., 'liability', 'payment', 'termination', 'confidentiality', 'compliance', or 'none').
- `submit_final` (bool): Set to `True` when the agent has finished flagging risks, to end the episode and receive a final score.
- `explanation` (str): The agent's chain-of-thought or reasoning for the decision.
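A minimal sketch of constructing these actions, using a plain dataclass as a stand-in for the actual Pydantic model in `models.py` (the field names follow the list above; the defaults are assumptions):

```python
from dataclasses import dataclass

@dataclass
class ContractValidationAction:
    clause_id: int = 0          # 0 when submitting the final answer
    risk_type: str = "none"
    submit_final: bool = False
    explanation: str = ""

# Flag clause 2 as a termination risk.
flag = ContractValidationAction(
    clause_id=2,
    risk_type="termination",
    explanation="Clause allows termination without notice or cure period.",
)

# End the episode and request the final score.
final = ContractValidationAction(
    clause_id=0,
    submit_final=True,
    explanation="All identified risks have been flagged.",
)
```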
### Observation Space (`ContractValidationObservation`)
- `task_level` (str): Difficulty level of the current task ("easy", "medium", "hard").
- `contract_clauses` (list): List of dictionaries containing the `id` and `text` of the contract clauses to review.
- `flagged_risks` (dict): A dictionary mapping clause IDs to the risks currently flagged by the agent.
- `step_count` (int): Number of steps taken in the current episode.
- `reward` (float): The reward delta granted for the most recent action.
- `done` (bool): Whether the episode has concluded.
- `info` (dict): Additional environment info, including the current internal `score` from the grader.
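To make the field list concrete, a step response from the server might deserialize into a structure like the following. This is again a dataclass stand-in for the Pydantic model, and the payload values are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class ContractValidationObservation:
    task_level: str
    contract_clauses: list
    flagged_risks: dict = field(default_factory=dict)
    step_count: int = 0
    reward: float = 0.0
    done: bool = False
    info: dict = field(default_factory=dict)

# Hypothetical mid-episode payload for a "medium" task.
payload = {
    "task_level": "medium",
    "contract_clauses": [
        {"id": 1, "text": "Payment is due within 5 days of invoice."},
        {"id": 2, "text": "This agreement is governed by the laws of State X."},
    ],
    "flagged_risks": {1: "payment"},   # clause 1 already flagged
    "step_count": 2,
    "reward": -0.02,                   # step penalty from the last action
    "done": False,
    "info": {"score": 0.5},            # internal grader score so far
}
obs = ContractValidationObservation(**payload)
```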
### Reward Function & Grader
The environment utilizes a trajectory-based reward system. The grader calculates a score between 0.0 and 1.0 based on precision and recall.
- Positive Reward: Granted for newly correct flags.
- Negative Penalty: Applied for flagging safe clauses or assigning the wrong risk type.
- Step Penalty: A `-0.02` penalty is applied per step to encourage the agent to evaluate the contract efficiently. Rewards are clamped to a minimum of `0.0` to ensure compatibility with OpenEnv graders.
- Completion Bonus: A `+0.5` bonus is awarded if the agent submits the contract with a perfect 1.0 grader score.
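The reward shaping above can be sketched roughly as follows. The `-0.02` step penalty, the non-negative clamp, and the `+0.5` completion bonus come from the description; the magnitudes of the per-flag reward and penalty (`FLAG_REWARD`, `FLAG_PENALTY`) are assumptions for illustration:

```python
STEP_PENALTY = 0.02       # per-step efficiency penalty (from the description)
COMPLETION_BONUS = 0.5    # awarded on submit with a perfect grader score
FLAG_REWARD = 0.2         # assumed magnitude for a newly correct flag
FLAG_PENALTY = 0.2        # assumed magnitude for flagging a safe clause / wrong risk type

def step_reward(correct_flag: bool, new_flag: bool,
                submitted: bool, grader_score: float) -> float:
    """Compute the clamped reward delta for a single step."""
    reward = 0.0
    if new_flag:
        reward += FLAG_REWARD if correct_flag else -FLAG_PENALTY
    reward -= STEP_PENALTY                      # applied every step
    if submitted and grader_score == 1.0:
        reward += COMPLETION_BONUS              # perfect-score bonus
    return max(0.0, reward)                     # clamp for OpenEnv graders
```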
## Project Structure
```
contract_validation/
├── .dockerignore        # Docker build exclusions
├── .env                 # Local environment variables (API keys - DO NOT COMMIT)
├── .gitignore           # Git tracking exclusions (ignores .env, caches, etc.)
├── __init__.py          # Module exports
├── README.md            # Project documentation (with tags: - openenv)
├── openenv.yaml         # OpenEnv manifest
├── pyproject.toml       # Project metadata and dependencies
├── uv.lock              # Locked dependencies (generated)
├── client.py            # ContractValidationEnv client
├── inference.py         # Evaluation script for the OpenEnv grader (JSON logging)
├── models.py            # Action and Observation Pydantic models
├── Dockerfile           # Container image definition
└── server/
    ├── __init__.py      # Server module exports
    ├── contract_validation_environment.py  # Core environment logic and task data
    └── app.py           # FastAPI application (HTTP + WebSocket endpoints)
```