Space: AishaniS/quantum-rl-optimizer


Commit 2d7c393 (parent: ea65582), committed by aishani-s20 on Apr 12.

Commit message: improvements

Files changed (4)

README.md +92 -78
inference.py +86 -41
server/graders.py +57 -9
server/quantum_openenv_env_environment.py +118 -64

README.md CHANGED

@@ -10,182 +10,196 @@ pinned: false
 # 🌌 Quantum Circuit Optimization Environment
-> **An advanced, physics-grounded Reinforcement Learning environment for the Meta OpenEnv Hackathon.** > Challenge agents to act as quantum compilers — optimizing multi-qubit circuits through mathematical identities and commutativity rules.
 ---
-## 🏆 Key Features
-* **NP-Hard Problem Space:** Moves beyond static text puzzles into multi-dimensional spatial reasoning.
-* **Deterministic Reproducibility (Seed Logic):** Fully supports the OpenEnv framework's episode seed. The engine guarantees the **exact same complex circuit** is generated for a given episode across different model runs, ensuring flawless grader reproducibility.
-* **Relative Compression Grader:** Dynamically scores agents based on mathematical compression ratios, adapting perfectly to any circuit depth.
 ---
-## 🚀 Motivation: The Quantum Compiler Challenge
-In the real world, quantum computers suffer from rapid **decoherence**. Every quantum gate introduces noise, so shorter circuits yield higher-fidelity results. However, optimal quantum circuit compression is an NP-Hard problem.
-Current LLM benchmarks often rely on static toy puzzles. This environment bridges the gap by requiring agents to apply real-world quantum physics rules—such as swapping spatially separated, commuting gates to bring distant self-inverse identities together. This precludes simple memorization; agents must dynamically reason about multi-dimensional spatial gate layouts and plan over long horizons.
 ---
-## 🛠️ Environment Specifications
-### 👁️ Observation Space
 The environment provides the agent with a complete topological view of the quantum state at every step.
 | Field | Type | Description |
 |---|---|---|
-| `circuit` | `List[Gate]` | Current sequence of gates. Each gate includes a `name` and `target_qubits`. |
 | `gate_count` | `int` | Current number of gates in the circuit. |
 | `num_qubits` | `int` | Total number of qubits in the system. |
-| `done` | `bool` | `True` if the circuit is fully optimized, dead-ended, or the step limit is reached. |
 | `reward` | `float` | Reward received from the previous action. |
-| `metadata` | `dict` | Instance-specific tracking data: `task`, `initial_count`, `seed`. |
 ---
-### 🎮 Action Space
 The agent submits a JSON payload specifying where and how to modify the circuit.
 | Field | Type | Description |
 |---|---|---|
-| `target_index` | `int` | The index of the primary gate in the circuit array to target. |
-| `action_type` | `int` | The specific quantum physics rule to apply (1–4). See below. |
 #### Available Action Types
 | ID | Name | Description | Reward |
 |---|---|---|---|
-| `1` | **Cancel Identical Gates** | Removes self-inverse gate pairs (e.g., X·X = I) targeting the same qubits, provided they are not blocked by overlapping intermediate gates. | `+1.0` |
-| `2` | **Swap Commuting Gates** | Swaps the target gate with the next adjacent gate **only if** their target qubits do not intersect. This enables agents to bring distant cancellable gates together. | `-0.05` |
-| `3` | **Identity Collapse (H-X-H)** | Replaces a 3-gate sequence `H → X → H` on the same qubit with a single `Z` gate. | `+2.0` |
-| `4` | **Entanglement Compression** | Replaces an adjacent `CNOT → SWAP` sequence sharing exact qubits with a single `CZ` gate. | `+1.0` |
-> **Note:** Invalid actions (e.g., out-of-bounds index, illegal non-commuting swaps) incur a `-0.10` penalty to discourage hallucination, and the circuit state remains unchanged.
 ---
-## 📊 Tasks & Difficulty Levels
-The environment natively supports dynamic scaling of qubits and circuit depth. By setting `QUANTUM_TASK=random`, the environment dynamically generates a fresh, randomized circuit topology from a pool of valid gate pairs and noise injections.
-| Task | Qubits | Initial Gates | Entanglement | Expected Difficulty |
 |---|---|---|---|---|
-| `easy` | 2 | ~20 | None (Single Qubit) | **Low:** Agents can easily spot local cancellations. |
-| `medium` | 4 | ~30 | Low (CNOT, SWAP) | **Moderate:** Requires basic spatial swapping to clear blocker gates. |
-| `hard` | 6 | ~70 | High (Deep Entanglement) | **Extreme:** Demands rigorous long-horizon spatial reasoning across many qubits. |
 ---
-## 🏆 Grader & Evaluation
-Because calculating the absolute theoretical minimum length of a randomized multi-qubit circuit is NP-Hard, the environment utilizes a **Relative Compression Grader**:
-$$\text{Score} = \max\left(0.0, \min\left(1.0, \frac{\text{Initial Count} - \text{Final Count}}{\text{Initial Count}}\right)\right)$$
-- A score of **1.0** indicates the agent perfectly compressed the circuit down to 0 gates.
-- The **success threshold** is `0.10` — meaning a 10% reduction in overall circuit depth is considered a passing score for a given episode.
----
-## 📈 Baseline Scores
-This environment is designed to serve as a rigorous boundary test for frontier reasoning models. All baseline evaluations are fully reproducible using the environment's deterministic seed logic.
-| Model | Task | Result | Notes |
-|---|---|---|---|
-| Qwen 2.5 72B Instruct (Zero-Shot) | `easy` | **Passing Baseline** | Successfully identifies and executes local cancellations (Score: ~0.15–0.30). |
-| Qwen 2.5 72B Instruct (Zero-Shot) | `medium` | **Borderline** | Attempts basic spatial swapping but frequently gets trapped by blocking gates. Usually falls just short of the 0.10 success threshold (Score: ~0.00–0.08). |
-| Qwen 2.5 72B Instruct (Zero-Shot) | `hard` | **Benchmark Limit** | Provides a highly complex layout that tests the absolute limits of current LLMs, establishing a rigorous 0.0 baseline. **(100% reproducible via episode seeds)**. |
-> **Conclusion:** This environment successfully establishes an unsolved benchmark for testing algorithmic spatial planning, proving that advanced scaffolding (e.g., Tree-of-Thought or ReAct loops) is required for deep quantum compilation.
->
-> **Note on Reproducibility:** You can reliably reproduce these exact baseline constraints. The environment fully supports OpenEnv episode seeding, guaranteeing the exact same initial circuit generation for any given seed across different runs.
 ---
-## 💻 Setup and Usage Instructions
 ### 1. Prerequisites
-Ensure you have **Docker** and **uv** installed, then install the OpenEnv core dependencies:
 ```bash
-uv pip install openenv-core
 uv sync
 ```
 ### 2. Environment Variables
-Create a ```.env``` file in the root directory:
-```bash
 HF_TOKEN="your_huggingface_read_token"
-API_BASE_URL="[https://router.huggingface.co/v1](https://router.huggingface.co/v1)"
 MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
 QUANTUM_TASK="random"
 ```
 | Variable | Description |
 |---|---|
-| `HF_TOKEN` | Your HuggingFace API token (read access) |
 | `API_BASE_URL` | Inference endpoint (HF router or custom) |
 | `MODEL_NAME` | Model to run inference with |
-| `QUANTUM_TASK` | Task name: `easy`, `medium`, `hard`, or `random` |
 ### 3. Build & Validate
 ```bash
 docker build -t quantum_env .
 openenv validate .
 ```
 ### 4. Run Inference
 ```bash
 uv run python inference.py
 ```
-The inference script handles API errors gracefully and automatically parses JSON outputs into the strict Action Space schema.
-### 5. Reproducing via Seed
-To test the deterministic generation and replicate our baseline scores, you can pass a specific seed to the environment during the reset phase in your client script.
-Simply modify the reset call in ```inference.py```:
-```bash
-# Pass any integer seed to guarantee the exact same initial circuit topography
 result = await env.reset(seed=42)
 ```
 ---
-## 📁 Project Structure
 ```
 .
 ├── server/
-│   ├── app.py                            # FastAPI WebSocket Server
-│   └── quantum_openenv_env_environment.py # Core Physics Engine & Randomizer
-├── client.py                             # EnvClient Translator
-├── models.py                             # Strict Pydantic Data Models
-├── inference.py                          # Baseline LLM Agent Script
-├── openenv.yaml                          # OpenEnv spec manifest
-├── Dockerfile                            # Container definition
 └── README.md
 ```
 ---
-## 📄 License
-This project is released under the MIT license found in the `LICENSE` file.
----

 # 🌌 Quantum Circuit Optimization Environment
+> **An advanced, physics-grounded Reinforcement Learning environment for the Meta OpenEnv Hackathon.**
+> Challenge agents to act as quantum compilers — optimizing multi-qubit circuits through mathematical identities and commutativity rules.
 ---
+## Key Features
+- **NP-Hard Problem Space:** Moves beyond static text puzzles into multi-dimensional spatial reasoning.
+- **Deterministic Reproducibility (Seed Logic):** Fully supports the OpenEnv framework's episode seed. The engine generates the **exact same circuit** for a given seed across different model runs, so grader scores are reproducible.
+- **Three Differentiated Graders:** Each difficulty tier measures a genuinely different skill — pure compression on easy, identity-discovery bonus on medium, and step-efficiency weighting on hard.
 ---
+## Motivation: The Quantum Compiler Challenge
+In the real world, quantum computers suffer from rapid **decoherence**. Every quantum gate introduces noise, so shorter circuits yield higher-fidelity results. However, optimal quantum circuit compression is an **NP-Hard problem**.
+While traditional frameworks like **Qiskit, Cirq, and tket** rely on hand-designed heuristics to identify redundant gates and exploit commutativity, this environment turns that same physics problem into a rigorous testing ground for AI agents. It is designed to evaluate whether RL and LLM agents can independently learn and execute these compiler heuristics from scratch.
+Many LLM benchmarks rely on static toy puzzles. This environment bridges the gap by requiring agents to generalize real-world quantum physics rules, such as swapping spatially separated, commuting gates to bring distant self-inverse identities together. Because circuits are generated randomly per seed, **memorization does not help**; agents must dynamically reason about multi-dimensional spatial gate layouts and plan over long horizons.
 ---
+## Environment Specifications
+### Observation Space
 The environment provides the agent with a complete topological view of the quantum state at every step.
 | Field | Type | Description |
 |---|---|---|
+| `circuit` | `List[Gate]` | Current gate sequence. Each gate has a `name` (e.g. `"H"`, `"CNOT"`) and `target_qubits`. |
 | `gate_count` | `int` | Current number of gates in the circuit. |
 | `num_qubits` | `int` | Total number of qubits in the system. |
+| `done` | `bool` | `True` if the circuit is fully optimized, dead-ended, or the step limit (150) is reached. |
 | `reward` | `float` | Reward received from the previous action. |
+| `metadata` | `dict` | Episode tracking data — see breakdown below. |
+#### Metadata Fields
+| Key | Type | Description |
+|---|---|---|
+| `task` | `str` | Active task name: `"easy"`, `"medium"`, or `"hard"`. |
+| `initial_count` | `int` | Gate count at episode start. Used by all graders to compute compression ratio. |
+| `step` | `int` | Current step number. Used by the hard grader for step-efficiency scoring. |
+| `seed` | `int \| None` | RNG seed used to generate this circuit. Pass the same value to `reset()` to reproduce it exactly. |
+| `used_advanced_actions` | `bool` | `True` if the agent successfully used action 3 (H-X-H→Z) or action 4 (CNOT-SWAP→CZ) this episode. Used by the medium grader bonus. |
 ---
+### Action Space
 The agent submits a JSON payload specifying where and how to modify the circuit.
 | Field | Type | Description |
 |---|---|---|
+| `target_index` | `int` | Index of the primary gate in the circuit array to target. |
+| `action_type` | `int` | Quantum physics rule to apply (1–4). See below. |
 #### Available Action Types
 | ID | Name | Description | Reward |
 |---|---|---|---|
+| `1` | **Cancel Identical Gates** | Removes self-inverse gate pairs (X·X = I, H·H = I, CNOT·CNOT = I, etc.) on the same qubits, not blocked by overlapping intermediate gates. | `+1.0` |
+| `2` | **Swap Commuting Gates** | Swaps the target gate with the next adjacent gate **only if** their qubit sets do not intersect. Enables bringing distant cancellable pairs together. | `-0.05` |
+| `3` | **H-X-H Identity Collapse** | Replaces an `H → X → H` sequence on the same qubit with a single `Z` gate (net: 2 gates removed). | `+2.0` |
+| `4` | **Entanglement Compression** | Replaces an adjacent `CNOT → SWAP` on the same qubits with a single `CZ` gate (net: 1 gate removed). | `+1.0` |
+> **Invalid actions** (out-of-bounds index, illegal non-commuting swap, pattern not present) incur a `-0.10` penalty. Circuit state remains unchanged.
 ---
+## Tasks & Difficulty Levels
+| Task | Qubits | Initial Gates | Entanglement | Key Challenge |
 |---|---|---|---|---|
+| `easy` | 2 | ~20 | None (single-qubit only) | Identify and cancel local self-inverse gate pairs. |
+| `medium` | 4 | ~30 | Low (CNOT, SWAP) | Swap to unblock cancellations; discover H-X-H and CNOT-SWAP identities. |
+| `hard` | 6 | ~70 | High (deep entanglement) | Long-horizon spatial reasoning; must compress efficiently with minimal wasted steps. |
+Set `QUANTUM_TASK=random` to have the environment randomly select a difficulty tier on each `reset()`.
 ---
+## Grader & Evaluation
+Each grader measures a **different skill** matching its difficulty tier. All scores are clamped to `[0.01, 0.99]`, so they always lie strictly inside `(0, 1)`.
+| Task | Grader Formula | Full Score Requires |
+|---|---|---|
+| **Easy** | `score = (initial − final) / initial` | Any consistent gate removal earns proportional credit. |
+| **Medium** | `score = compression + 0.15` if agent used action 3 or 4, else `score = compression` | Gate removal **and** discovering at least one algebraic identity. |
+| **Hard** | `score = 0.7 × compression + 0.3 × step_efficiency` where `step_efficiency = 1 − (steps / 150)` | High compression **and** achieving it with few wasted steps. |
+The hard grader's step-efficiency term penalises the failure mode seen in the baseline runs below: thrashing through invalid swaps before finding cancellations, which exhausts the step budget without progress.
+> **Why not use the theoretical minimum gate count?** Computing the absolute minimum for a randomized multi-qubit circuit is NP-Hard. Relative compression grading avoids that computation entirely and scales to arbitrary circuit depth.
+---
+## Baseline Scores
+| Model | Task | Score | Result | Notes |
+|---|---|---|---|---|
+| Qwen 2.5 72B Instruct (Zero-Shot) | `easy` | ~0.22 | Pass | Identifies local cancellations reliably. |
+| Qwen 2.5 72B Instruct (Zero-Shot) | `medium` | ~0.08 | Fail | Occasional cancellations; rarely discovers identities; no bonus awarded. |
+| Qwen 2.5 72B Instruct (Zero-Shot) | `hard` | ~0.04 | Fail | Thrashes with invalid swaps; step budget exhausted before meaningful compression. |
+> Success threshold: `score ≥ 0.10`. The hard task remains **unsolved** for the zero-shot baseline; advanced scaffolding (ReAct, Tree-of-Thought) is likely needed for reliable performance.
 ---
+## Setup and Usage Instructions
 ### 1. Prerequisites
 ```bash
+pip install openenv-core
 uv sync
 ```
 ### 2. Environment Variables
+Create a `.env` file in the root directory:
+```env
 HF_TOKEN="your_huggingface_read_token"
+API_BASE_URL="https://router.huggingface.co/v1"
 MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
 QUANTUM_TASK="random"
+IMAGE_NAME="quantum_env"
 ```
 | Variable | Description |
 |---|---|
+| `HF_TOKEN` | HuggingFace API token (read access) |
 | `API_BASE_URL` | Inference endpoint (HF router or custom) |
 | `MODEL_NAME` | Model to run inference with |
+| `QUANTUM_TASK` | Task: `easy`, `medium`, `hard`, or `random` |
+| `IMAGE_NAME` | Docker image name for the environment server |
 ### 3. Build & Validate
 ```bash
 docker build -t quantum_env .
 openenv validate .
 ```
 ### 4. Run Inference
 ```bash
 uv run python inference.py
 ```
+The script runs **easy → medium → hard** sequentially, each in its own container instance, and prints a results summary table at the end. All 3 tasks are always evaluated.
+### 5. Reproducing Baseline via Seed
+To reproduce the exact same circuit for a given episode, pass a seed to `reset()`:
+```python
+# Same seed always produces the same initial circuit
 result = await env.reset(seed=42)
 ```
+The environment uses `random.Random(seed)` internally — fully isolated per instance, safe for concurrent WebSocket sessions.
 ---
+## Project Structure
 ```
 .
 ├── server/
+│   ├── __init__.py
+│   ├── app.py                               # FastAPI server entry point
+│   ├── graders.py                           # Task-specific grader functions
+│   └── quantum_openenv_env_environment.py   # Core environment + physics engine
+├── __init__.py
+├── client.py                                # OpenEnv WebSocket client
+├── models.py                                # Typed Pydantic models
+├── inference.py                             # Baseline LLM inference script (all 3 tasks)
+├── openenv.yaml                             # OpenEnv spec manifest
+├── Dockerfile                               # Container definition
+├── pyproject.toml
 └── README.md
 ```
 ---
+## License
+This project is released under the MIT license found in the `LICENSE` file.
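The README describes the rewrite rules only in prose, and the diff for `server/quantum_openenv_env_environment.py` is not part of this excerpt. The following is a minimal, hypothetical sketch of how actions 1 to 3 and their rewards could be applied to a gate list. It assumes gates compare equal by `name` plus ordered `target_qubits`, and that "blocked" means any intermediate gate sharing a qubit. `Gate` and `apply_action` are illustrative names, not the project's code; action 4 is deliberately not modelled and returns the invalid-action penalty.

```python
from dataclasses import dataclass
from typing import List, Tuple


# Hypothetical gate type; the project's models.py Gate is not shown in this diff.
@dataclass(frozen=True)
class Gate:
    name: str
    target_qubits: Tuple[int, ...]


# Self-inverse set taken from the inference.py system prompt.
SELF_INVERSE = {"H", "X", "Y", "Z", "CNOT", "SWAP"}
# Rewards from the README action table; invalid actions cost -0.10.
REWARDS = {1: 1.0, 2: -0.05, 3: 2.0}
INVALID = -0.10


def _shares_qubit(a: Gate, b: Gate) -> bool:
    return bool(set(a.target_qubits) & set(b.target_qubits))


def apply_action(circuit: List[Gate], i: int, action_type: int) -> Tuple[List[Gate], float]:
    """Return (new_circuit, reward); on an invalid action the input list is returned unchanged."""
    if not 0 <= i < len(circuit):
        return circuit, INVALID
    gate = circuit[i]

    if action_type == 1:
        # Scan forward to the first later gate touching any of gate i's qubits.
        for j in range(i + 1, len(circuit)):
            if _shares_qubit(gate, circuit[j]):
                # Cancel only if that gate is an identical self-inverse gate;
                # any other overlapping gate blocks the cancellation.
                if gate.name in SELF_INVERSE and circuit[j] == gate:
                    return circuit[:i] + circuit[i + 1:j] + circuit[j + 1:], REWARDS[1]
                break
        return circuit, INVALID

    if action_type == 2:
        # Swap with the next gate only when the two act on disjoint qubits.
        if i + 1 < len(circuit) and not _shares_qubit(gate, circuit[i + 1]):
            swapped = list(circuit)
            swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
            return swapped, REWARDS[2]
        return circuit, INVALID

    if action_type == 3:
        # H -> X -> H on one qubit collapses to Z (HXH = Z).
        window = circuit[i:i + 3]
        if [g.name for g in window] == ["H", "X", "H"] and len({g.target_qubits for g in window}) == 1:
            return circuit[:i] + [Gate("Z", gate.target_qubits)] + circuit[i + 3:], REWARDS[3]
        return circuit, INVALID

    # Action 4 and unknown action types are not modelled in this sketch.
    return circuit, INVALID
```

For example, `apply_action([Gate("X", (0,)), Gate("H", (1,)), Gate("X", (0,))], 0, 1)` cancels the two `X` gates across the non-overlapping `H` and returns `([Gate(name='H', target_qubits=(1,))], 1.0)`. Returning a new list rather than mutating in place keeps invalid actions side-effect free, matching the README's note that the circuit is unchanged on a `-0.10` penalty.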

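The README states that the environment seeds a private `random.Random(seed)` per instance. The sketch below illustrates why that gives per-session reproducibility: a hypothetical `generate_circuit` draws only from its own RNG object, so reseeding the module-level `random` state (or activity in another session) cannot change the result. The single-qubit gate pool and the generator's shape are assumptions for illustration, not the project's actual randomizer.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical single-qubit gate pool (easy-tier style); not the project's actual pool.
GATE_POOL = ["H", "X", "Y", "Z"]


def generate_circuit(num_qubits: int, num_gates: int,
                     seed: Optional[int] = None) -> List[Tuple[str, Tuple[int]]]:
    """Build a random single-qubit circuit from a private, seeded RNG."""
    # A dedicated Random instance never touches the module-level random state,
    # so other code (or another session) cannot shift this sequence.
    rng = random.Random(seed)
    return [(rng.choice(GATE_POOL), (rng.randrange(num_qubits),))
            for _ in range(num_gates)]


# Reseeding the global RNG in between has no effect on the seeded circuits.
random.seed(0)
first = generate_circuit(num_qubits=2, num_gates=20, seed=42)
random.seed(1)
second = generate_circuit(num_qubits=2, num_gates=20, seed=42)
print(first == second)  # True
```

With `seed=None`, `random.Random` seeds itself from system entropy, so unseeded episodes differ from run to run; passing an explicit integer is what makes a baseline reproducible.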
inference.py CHANGED

@@ -1,22 +1,24 @@
 """
 Inference Script
 ================
-Runs the LLM agent against all 3 tasks (easy, medium, hard) and emits
-a [START] / [END] log line for each, which the hackathon platform requires
-to validate that all 3 tasks have graders.
 Required environment variables:
-    API_BASE_URL      The API endpoint for the LLM.
-    MODEL_NAME        The model identifier.
-    HF_TOKEN          Your Hugging Face / API key.
-    IMAGE_NAME        Docker image name (default: quantum_env).
 """
 import asyncio
 import json
 import os
 import textwrap
-from typing import List, Optional
 from dotenv import load_dotenv
@@ -39,7 +41,7 @@ TEMPERATURE = 0.7
 MAX_TOKENS = 150
 SUCCESS_SCORE_THRESHOLD = 0.1
-# All 3 tasks are always evaluated — this is what the platform requires
 ALL_TASKS = ["easy", "medium", "hard"]
@@ -48,28 +50,34 @@ SYSTEM_PROMPT = textwrap.dedent(
     You are an AI agent tasked with optimizing a multi-qubit quantum circuit.
     You will be given the current circuit as a list of gates with their index, name, and target_qubits.
-    You have 4 possible actions you can take at any index.
-    Action 1: Cancel identical self-inverse gates (H, X, Y, Z, CNOT, SWAP). They must be on the same qubits and not blocked by intermediate gates sharing those qubits.
-    Action 2: Swap adjacent commuting gates (gates that operate on entirely different qubits and do not overlap).
     Action 3: Replace an H-X-H sequence on the same qubit with a Z gate.
     Action 4: Replace a CNOT-SWAP sequence on the same qubits with a CZ gate.
-    You MUST output ONLY a valid JSON object with exactly two keys: 'target_index' (integer) and 'action_type' (integer 1-4).
     Example: {"target_index": 2, "action_type": 1}
     Do not output markdown, backticks, or any other text.
     """
 ).strip()
 def log_start(task: str, env: str, model: str) -> None:
     print(f"[START] task={task} env={env} model={model}", flush=True)
 def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
     error_val = error if error else "null"
-    done_val = str(done).lower()
     print(
-        f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
         flush=True,
     )
@@ -77,21 +85,24 @@ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[
 def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
     print(
-        f"[END] success={str(success).lower()} steps={steps} score={score:.2f} rewards={rewards_str}",
         flush=True,
     )
 def build_user_prompt(step: int, circuit: list, last_reward: float, history: List[str]) -> str:
-    if circuit:
-        circuit_lines = [
             f"Index {i}: {gate.name} on qubits {gate.target_qubits}"
             for i, gate in enumerate(circuit)
-        ]
-        circuit_block = "\n".join(circuit_lines)
-    else:
-        circuit_block = "Empty circuit"
     history_block = "\n".join(history[-4:]) if history else "None"
     return textwrap.dedent(
         f"""
@@ -106,7 +117,13 @@ def build_user_prompt(step: int, circuit: list, last_reward: float, history: Lis
     ).strip()
-def get_model_action(client: OpenAI, step: int, circuit: list, last_reward: float, history: List[str]) -> str:
     user_prompt = build_user_prompt(step, circuit, last_reward, history)
     try:
         completion = client.chat.completions.create(
@@ -126,26 +143,32 @@ def get_model_action(client: OpenAI, step: int, circuit: list, last_reward: floa
         return "{}"
-async def run_single_task(task_name: str, env: QuantumOpenenvEnv, client: OpenAI) -> None:
     """
-    Run one full episode for a given task and emit [START] / [END] log lines.
-    The platform validates that all 3 tasks appear in these logs.
     """
     history: List[str] = []
     rewards: List[float] = []
     steps_taken = 0
-    score = 0.0
     success = False
     try:
-        # Reset with the specific task seed for reproducibility
         result = await env.reset()
         circuit = result.observation.circuit
         last_reward = 0.0
         initial_gate_count = len(circuit)
-        # Infer actual task name from metadata (env may be running in random mode)
         actual_task = (result.observation.metadata or {}).get("task", task_name)
         if actual_task not in ALL_TASKS:
             actual_task = task_name
@@ -169,7 +192,9 @@ async def run_single_task(task_name: str, env: QuantumOpenenvEnv, client: OpenAI
                 target_index = 0
                 action_type = 1
-            result = await env.step(QuantumAction(target_index=target_index, action_type=action_type))
             reward = result.reward or 0.0
             done = result.done
@@ -184,7 +209,7 @@ async def run_single_task(task_name: str, env: QuantumOpenenvEnv, client: OpenAI
             if done:
                 break
-        # Inject initial count for grader
         if not result.observation.metadata:
             result.observation.metadata = {}
         result.observation.metadata["initial_count"] = initial_gate_count
@@ -199,36 +224,56 @@ async def run_single_task(task_name: str, env: QuantumOpenenvEnv, client: OpenAI
     finally:
         log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
 async def main() -> None:
     """
-    Run all 3 tasks sequentially.
-    The hackathon platform requires inference.py to produce a [START] / [END]
-    log pair for EACH of the 3 tasks (easy, medium, hard). Running only one
-    task causes "Not enough tasks with graders" in Phase 2 Task Validation.
     """
     client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
     for task_name in ALL_TASKS:
         print(f"\n{'='*60}", flush=True)
-        print(f"Running task: {task_name}", flush=True)
         print(f"{'='*60}", flush=True)
-        # Start a fresh Docker environment instance for each task
-        # Pass task name so the env generates the right circuit type
         env = await QuantumOpenenvEnv.from_docker_image(
             IMAGE_NAME,
             env_vars={"QUANTUM_TASK": task_name},
         )
         try:
-            await run_single_task(task_name, env, client)
         finally:
             try:
                 await env.close()
             except Exception as e:
                 print(f"[DEBUG] env.close() error for task {task_name}: {e}", flush=True)
 if __name__ == "__main__":
     asyncio.run(main())

 """
 Inference Script
 ================
+Runs the LLM agent against all 3 tasks (easy, medium, hard) sequentially
+and prints a [START] / [END] log line for each task.
+The hackathon platform requires all 3 tasks to appear in the log output
+for Task Validation to pass.
 Required environment variables:
+    API_BASE_URL   The API endpoint for the LLM.
+    MODEL_NAME     The model identifier.
+    HF_TOKEN       Your Hugging Face / API key.
+    IMAGE_NAME     Docker image name (default: quantum_env).
 """
 import asyncio
 import json
 import os
 import textwrap
+from typing import List, Optional, Tuple
 from dotenv import load_dotenv
 MAX_TOKENS = 150
 SUCCESS_SCORE_THRESHOLD = 0.1
+# Platform requires all 3 tasks to appear in [START] log lines
 ALL_TASKS = ["easy", "medium", "hard"]
     You are an AI agent tasked with optimizing a multi-qubit quantum circuit.
     You will be given the current circuit as a list of gates with their index, name, and target_qubits.
+    You have 4 possible actions:
+    Action 1: Cancel identical self-inverse gates (H, X, Y, Z, CNOT, SWAP) on the same qubits,
+              not blocked by intermediate gates sharing those qubits.
+    Action 2: Swap adjacent commuting gates (gates on entirely different, non-overlapping qubits).
     Action 3: Replace an H-X-H sequence on the same qubit with a Z gate.
     Action 4: Replace a CNOT-SWAP sequence on the same qubits with a CZ gate.
+    You MUST output ONLY a valid JSON object with exactly two keys:
+      'target_index' (integer) and 'action_type' (integer 1-4).
     Example: {"target_index": 2, "action_type": 1}
     Do not output markdown, backticks, or any other text.
     """
 ).strip()
+# ============================================================================
+# Logging (format required by hackathon platform output parser)
+# ============================================================================
 def log_start(task: str, env: str, model: str) -> None:
     print(f"[START] task={task} env={env} model={model}", flush=True)
 def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
     error_val = error if error else "null"
     print(
+        f"[STEP] step={step} action={action} reward={reward:.2f} "
+        f"done={str(done).lower()} error={error_val}",
         flush=True,
     )
 def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
     print(
+        f"[END] success={str(success).lower()} steps={steps} "
+        f"score={score:.2f} rewards={rewards_str}",
         flush=True,
     )
+# ============================================================================
+# Prompt building
+# ============================================================================
 def build_user_prompt(step: int, circuit: list, last_reward: float, history: List[str]) -> str:
+    circuit_block = (
+        "\n".join(
             f"Index {i}: {gate.name} on qubits {gate.target_qubits}"
             for i, gate in enumerate(circuit)
+        )
+        if circuit else "Empty circuit"
+    )
     history_block = "\n".join(history[-4:]) if history else "None"
     return textwrap.dedent(
         f"""
     ).strip()
+def get_model_action(
+    client: OpenAI,
+    step: int,
+    circuit: list,
+    last_reward: float,
+    history: List[str],
+) -> str:
     user_prompt = build_user_prompt(step, circuit, last_reward, history)
     try:
         completion = client.chat.completions.create(
         return "{}"
+# ============================================================================
+# Single task episode
+# ============================================================================
+async def run_single_task(
+    task_name: str,
+    env: QuantumOpenenvEnv,
+    client: OpenAI,
+) -> Tuple[str, float, bool]:
     """
+    Run one full episode for a given task.
+    Returns (task_name, score, success).
     """
     history: List[str] = []
     rewards: List[float] = []
     steps_taken = 0
+    score = 0.01
     success = False
     try:
         result = await env.reset()
         circuit = result.observation.circuit
         last_reward = 0.0
         initial_gate_count = len(circuit)
+        # Resolve actual task from metadata (env may override based on QUANTUM_TASK)
         actual_task = (result.observation.metadata or {}).get("task", task_name)
         if actual_task not in ALL_TASKS:
             actual_task = task_name
                 target_index = 0
                 action_type = 1
+            result = await env.step(
+                QuantumAction(target_index=target_index, action_type=action_type)
+            )
             reward = result.reward or 0.0
             done = result.done
             if done:
                 break
+        # Inject initial count so grader can compute compression ratio
         if not result.observation.metadata:
             result.observation.metadata = {}
         result.observation.metadata["initial_count"] = initial_gate_count
     finally:
         log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+    return task_name, score, success
+# ============================================================================
+# Main: loop over all 3 tasks
+# ============================================================================
 async def main() -> None:
     """
+    Run all 3 tasks sequentially, each in its own Docker container instance.
+    The hackathon platform requires:
+    - A [START] task=X line for each of easy, medium, hard
+    - A [END] score=Y line for each task
+    - At least 3 tasks with graders validated in the log
     """
     client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+    results: List[Tuple[str, float, bool]] = []
     for task_name in ALL_TASKS:
         print(f"\n{'='*60}", flush=True)
+        print(f"  Task: {task_name.upper()}", flush=True)
         print(f"{'='*60}", flush=True)
         env = await QuantumOpenenvEnv.from_docker_image(
             IMAGE_NAME,
             env_vars={"QUANTUM_TASK": task_name},
         )
         try:
+            task, score, success = await run_single_task(task_name, env, client)
+            results.append((task, score, success))
         finally:
             try:
                 await env.close()
             except Exception as e:
                 print(f"[DEBUG] env.close() error for task {task_name}: {e}", flush=True)
+    # -----------------------------------------------------------------------
+    # Summary table — printed at end for human reviewers in Phase 3
+    # -----------------------------------------------------------------------
+    print(f"\n{'='*60}", flush=True)
+    print("  BASELINE RESULTS SUMMARY", flush=True)
+    print(f"{'='*60}", flush=True)
+    print(f"  {'Task':<10} {'Score':>8}  {'Result'}", flush=True)
+    print(f"  {'-'*40}", flush=True)
+    for task, score, success in results:
+        status = "PASS ✓" if success else "FAIL ✗"
+        print(f"  {task:<10} {score:>8.3f}  {status}", flush=True)
+    print(f"{'='*60}\n", flush=True)
 if __name__ == "__main__":
     asyncio.run(main())
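`get_model_action` returns `"{}"` when the API call fails, and `run_single_task` appears to fall back to `target_index = 0`, `action_type = 1` when it cannot use the reply; the parsing code itself sits outside the hunks shown. A hedged sketch of a parser with that same fallback follows, assuming the reply may occasionally arrive wrapped in a markdown fence despite the system prompt. `parse_action` is a hypothetical helper, not part of `inference.py`.

```python
import json
import re
from typing import Tuple

# Mirrors the target_index=0 / action_type=1 fallback visible in run_single_task.
FALLBACK_ACTION = (0, 1)


def parse_action(raw: str) -> Tuple[int, int]:
    """Parse the model reply into (target_index, action_type), falling back on any error."""
    text = raw.strip()
    # Tolerate a stray ```json fence by extracting the outermost {...} span.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        text = match.group(0)
    try:
        payload = json.loads(text)
        target_index = int(payload["target_index"])
        action_type = int(payload["action_type"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return FALLBACK_ACTION
    # Reject values outside the action schema instead of sending them to the env.
    if action_type not in (1, 2, 3, 4) or target_index < 0:
        return FALLBACK_ACTION
    return target_index, action_type
```

Validating the action range client-side is a design choice: the environment already penalises invalid actions with `-0.10`, but falling back early keeps obviously malformed replies from being logged as real moves.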

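The new `graders.py` is truncated in this excerpt just after the start of `grade_easy`. The sketch below reproduces the three formulas from the README's grader table, together with the `_strict` clamp from `graders.py`, against a hypothetical `Obs` stand-in for the observation. Treating a missing `step` value as the full 150-step budget (zero step efficiency) is an assumption of this sketch, not documented behaviour.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

MAX_STEPS = 150        # step limit from the README observation table
ADVANCED_BONUS = 0.15  # medium-tier bonus from the README grader table


def _strict(score: float) -> float:
    """Same clamp as graders.py: keep scores in [0.01, 0.99]."""
    return max(0.01, min(0.99, float(score)))


@dataclass
class Obs:
    """Hypothetical stand-in for the environment observation."""
    gate_count: int
    metadata: Dict[str, Any] = field(default_factory=dict)


def compression(obs: Obs) -> float:
    initial = obs.metadata.get("initial_count", obs.gate_count)
    if initial == 0:
        return 1.0  # an empty starting circuit counts as fully compressed
    return (initial - obs.gate_count) / initial


def grade_easy(obs: Obs) -> float:
    return _strict(compression(obs))


def grade_medium(obs: Obs) -> float:
    bonus = ADVANCED_BONUS if obs.metadata.get("used_advanced_actions") else 0.0
    return _strict(compression(obs) + bonus)


def grade_hard(obs: Obs) -> float:
    # Assumption: a missing step count is treated as the full budget.
    steps = obs.metadata.get("step", MAX_STEPS)
    step_efficiency = max(0.0, 1.0 - steps / MAX_STEPS)
    return _strict(0.7 * compression(obs) + 0.3 * step_efficiency)
```

As a worked example, an agent that takes a 20-gate circuit down to 15 gates scores 0.25 on easy, about 0.40 on medium if it also used action 3 or 4, and about 0.7 × 0.25 + 0.3 × (1 − 30/150) = 0.415 on hard if it finished in 30 steps.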
server/graders.py CHANGED

@@ -2,41 +2,89 @@
 # All rights reserved.
 """
-Standalone graders for the Quantum Circuit Optimization Environment.
-Scores are strictly within (0.0, 1.0) — never exactly 0.0 or 1.0.
 """
 def _strict(score: float) -> float:
-    """Clamp score to strictly (0.0, 1.0) as required by the platform."""
-    return max(0.01, min(0.99, score))
 def grade_easy(observation) -> float:
     metadata = getattr(observation, 'metadata', {}) or {}
     final_count = getattr(observation, 'gate_count', 0)
     initial_count = metadata.get("initial_count", final_count)
     if initial_count == 0:
-        return _strict(0.99)
     compression = (initial_count - final_count) / initial_count
     return _strict(compression)
 def grade_medium(observation) -> float:
     metadata = getattr(observation, 'metadata', {}) or {}
     final_count = getattr(observation, 'gate_count', 0)
     initial_count = metadata.get("initial_count", final_count)
     if initial_count == 0:
-        return _strict(0.99)
     compression = (initial_count - final_count) / initial_count
-    return _strict(compression / 0.20)
 def grade_hard(observation) -> float:
     metadata = getattr(observation, 'metadata', {}) or {}
     final_count = getattr(observation, 'gate_count', 0)
     initial_count = metadata.get("initial_count", final_count)
     if initial_count == 0:
-        return _strict(0.99)
     compression = (initial_count - final_count) / initial_count
-    return _strict(compression / 0.35)

 # All rights reserved.
 """
+Graders for the Quantum Circuit Optimization Environment.
+Each grader measures a different aspect of performance matching its difficulty tier:
+  - Easy:   Pure compression ratio. Any gate removal earns proportional credit.
+  - Medium: Compression + bonus for using advanced identity actions (3 or 4).
+  - Hard:   Weighted blend of compression and step efficiency. Harder threshold.
+All scores are clamped to [0.01, 0.99], i.e. strictly inside (0.0, 1.0), as required by the platform.
 """
 def _strict(score: float) -> float:
+    """Clamp to strictly (0.0, 1.0) — platform rejects exactly 0.0 or 1.0."""
+    return max(0.01, min(0.99, float(score)))
 def grade_easy(observation) -> float:
+    """
+    Easy grader: pure compression ratio.
+    Score = (initial_gates - final_gates) / initial_gates
+    Any reduction in gate count earns proportional credit.
+    No bonus mechanics — agent just needs to find and cancel obvious pairs.
+    """
     metadata = getattr(observation, 'metadata', {}) or {}
     final_count = getattr(observation, 'gate_count', 0)
     initial_count = metadata.get("initial_count", final_count)
     if initial_count == 0:
+        return _strict(0.5)
     compression = (initial_count - final_count) / initial_count
     return _strict(compression)
 def grade_medium(observation) -> float:
+    """
+    Medium grader: compression ratio + bonus for advanced identity usage.
+    Score = compression_ratio + 0.15 bonus if agent used action 3 (H-X-H→Z)
+            or action 4 (CNOT-SWAP→CZ) at least once during the episode.
+    This rewards agents that discover algebraic identities beyond simple
+    gate cancellation — a meaningfully harder skill than the easy task.
+    """
     metadata = getattr(observation, 'metadata', {}) or {}
     final_count = getattr(observation, 'gate_count', 0)
     initial_count = metadata.get("initial_count", final_count)
     if initial_count == 0:
+        return _strict(0.5)
     compression = (initial_count - final_count) / initial_count
+    # Bonus for using advanced identity actions (tracked in metadata by environment)
+    used_advanced = metadata.get("used_advanced_actions", False)
+    bonus = 0.15 if used_advanced else 0.0
+    return _strict(compression + bonus)
 def grade_hard(observation) -> float:
+    """
+    Hard grader: weighted blend of compression efficiency and step efficiency.
+    Score = 0.7 * compression_ratio + 0.3 * step_efficiency
+    where step_efficiency = 1 - (steps_taken / max_steps)
+    This penalises agents that compress the circuit but waste many steps —
+    exactly the behaviour frontier models exhibit on hard tasks
+    (thrashing with invalid swaps before finding cancellations).
+    """
     metadata = getattr(observation, 'metadata', {}) or {}
     final_count = getattr(observation, 'gate_count', 0)
     initial_count = metadata.get("initial_count", final_count)
+    steps_taken = metadata.get("step", 1)
+    max_steps = 150
     if initial_count == 0:
+        return _strict(0.5)
     compression = (initial_count - final_count) / initial_count
+    step_efficiency = max(0.0, 1.0 - (steps_taken / max_steps))
+    score = 0.7 * compression + 0.3 * step_efficiency
+    return _strict(score)
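The three scoring rules can be checked by hand. The sketch below restates them as plain functions over raw counts rather than an observation object. The function names and sample numbers are hypothetical, `MAX_STEPS = 150` mirrors the constant hard-coded in `grade_hard`, and the sketch assumes a non-empty starting circuit (the real graders return 0.5 when `initial_count` is 0).

```python
# Hypothetical restatement of the graders.py formulas over raw counts,
# so the arithmetic can be checked without an observation object.
# Assumes initial > 0 (the real graders special-case 0 and return 0.5).
MAX_STEPS = 150  # mirrors the hard-coded max_steps in grade_hard


def strict(score: float) -> float:
    """Clamp to [0.01, 0.99], the same bounds as _strict()."""
    return max(0.01, min(0.99, float(score)))


def compression(initial: int, final: int) -> float:
    """Fraction of the original gates that were removed."""
    return (initial - final) / initial


def easy_score(initial: int, final: int) -> float:
    return strict(compression(initial, final))


def medium_score(initial: int, final: int, used_advanced: bool) -> float:
    # Flat 0.15 bonus if action 3 or 4 fired at least once.
    bonus = 0.15 if used_advanced else 0.0
    return strict(compression(initial, final) + bonus)


def hard_score(initial: int, final: int, steps: int) -> float:
    # 70% compression, 30% step efficiency (never negative).
    step_eff = max(0.0, 1.0 - steps / MAX_STEPS)
    return strict(0.7 * compression(initial, final) + 0.3 * step_eff)


if __name__ == "__main__":
    # 20-gate circuit reduced to 14 gates: compression = 6/20 = 0.30
    print(round(easy_score(20, 14), 3))          # 0.3
    print(round(medium_score(20, 14, True), 3))  # 0.45
    # 30 steps: step_eff = 1 - 30/150 = 0.8 -> 0.7*0.3 + 0.3*0.8 = 0.45
    print(round(hard_score(20, 14, 30), 3))      # 0.45
```

One side effect of the 30% step-efficiency weight: on the hard tier, an episode that removes no gates and stops after a single step still scores 0.3 × (1 - 1/150) ≈ 0.298.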

server/quantum_openenv_env_environment.py CHANGED

@@ -12,13 +12,16 @@ Architecture:
 - Instance-isolated PRNG (seeding) for strict reproducibility in server environments.
 - Relative Compression Grading: Evaluates agents on compression ratio rather than
   an absolute theoretical minimum, mirroring real-world NP-Hard quantum optimization constraints.
 """
 import random
 from uuid import uuid4
 from openenv.core.env_server.interfaces import Environment
-from openenv.core.env_server.types import State
 from quantum_openenv_env.models import QuantumAction, QuantumGate, QuantumObservation
@@ -77,17 +80,15 @@ TASKS = ["easy", "medium", "hard"]
 # ============================================================================
-# Standalone graders (used by graders.py and inference.py)
 # ============================================================================
-from quantum_openenv_env.server.graders import grade_easy as _grade_easy_fn
-from quantum_openenv_env.server.graders import grade_medium as _grade_medium_fn
-from quantum_openenv_env.server.graders import grade_hard as _grade_hard_fn
 GRADERS = {
-    "easy":   _grade_easy_fn,
-    "medium": _grade_medium_fn,
-    "hard":   _grade_hard_fn,
 }
@@ -102,20 +103,38 @@ class QuantumCircuitOptimizationEnvironment(Environment):
     The agent acts as a quantum compiler, reducing circuit depth by applying
     mathematical identities and commutativity rules across 3 difficulty tiers.
     Action types:
-        1 - Cancel identical self-inverse gate pairs
-        2 - Swap adjacent commuting gates (different qubits)
-        3 - Replace H-X-H sequence with Z gate
-        4 - Replace CNOT-SWAP sequence with CZ gate
     """
     SUPPORTS_CONCURRENT_SESSIONS: bool = True
-    SELF_INVERSE_GATES = {"H", "X", "Y", "Z", "CNOT", "CX", "CZ", "SWAP", "CCX", "TOFFOLI", "CSWAP", "FREDKIN"}
     def __init__(self, task: str = "random", seed: int = None):
         self.mode = task
         if self.mode != "random" and self.mode not in TASK_CONFIGS:
-            raise ValueError(f"Unknown task: {task}. Must be 'random' or one of {list(TASK_CONFIGS.keys())}")
         self._state = State(episode_id=str(uuid4()), step_count=0)
         self._reset_count = 0
@@ -126,11 +145,21 @@ class QuantumCircuitOptimizationEnvironment(Environment):
         self.task_config = TASK_CONFIGS["easy"]
         self._circuit: list[QuantumGate] = []
         self._initial_gate_count = 0
-    def reset(self) -> QuantumObservation:
         """Reset the environment to a fresh circuit for the configured task."""
         self._state = State(episode_id=str(uuid4()), step_count=0)
         self._reset_count += 1
         if self.mode == "random":
             self.task_name = self.rng.choice(TASKS)
@@ -152,10 +181,11 @@ class QuantumCircuitOptimizationEnvironment(Environment):
                 "reset_count": self._reset_count,
                 "initial_count": self._initial_gate_count,
                 "seed": self.current_seed,
             },
         )
-    def step(self, action: QuantumAction) -> QuantumObservation:  # type: ignore[override]
         """Execute one action in the environment."""
         self._state.step_count += 1
         target_index = action.target_index
@@ -170,7 +200,9 @@ class QuantumCircuitOptimizationEnvironment(Environment):
         gate_at_index = self._circuit[target_index]
         active_qubits = set(gate_at_index.target_qubits)
         # ACTION 1: Cancel Identical Self-Inverse Gates
         if action_type == 1:
             next_gate_index = None
             for j in range(target_index + 1, len(self._circuit)):
@@ -188,7 +220,9 @@ class QuantumCircuitOptimizationEnvironment(Environment):
                 reward = 1.0
                 action_result = "cancelled_identical"
         # ACTION 2: Swap Commuting Gates
         elif action_type == 2:
             if target_index + 1 < len(self._circuit):
                 next_gate = self._circuit[target_index + 1]
@@ -201,75 +235,105 @@ class QuantumCircuitOptimizationEnvironment(Environment):
                     reward = -0.05
                     action_result = "swapped_commuting"
-        # ACTION 3: Replace H-X-H with Z
         elif action_type == 3:
             if target_index + 2 < len(self._circuit):
                 g1 = self._circuit[target_index]
                 g2 = self._circuit[target_index + 1]
                 g3 = self._circuit[target_index + 2]
                 if (g1.name == "H" and g2.name == "X" and g3.name == "H" and
                         g1.target_qubits == g2.target_qubits == g3.target_qubits):
                     self._circuit.pop(target_index + 2)
                     self._circuit.pop(target_index + 1)
-                    self._circuit[target_index] = QuantumGate(name="Z", target_qubits=g1.target_qubits)
                     reward = 2.0
                     action_result = "identity_hxh_to_z"
-        # ACTION 4: Replace CNOT-SWAP with CZ
         elif action_type == 4:
             if target_index + 1 < len(self._circuit):
                 g1 = self._circuit[target_index]
                 g2 = self._circuit[target_index + 1]
                 if (g1.name == "CNOT" and g2.name == "SWAP" and
                         set(g1.target_qubits) == set(g2.target_qubits)):
                     self._circuit.pop(target_index + 1)
-                    self._circuit[target_index] = QuantumGate(name="CZ", target_qubits=g1.target_qubits)
                     reward = 1.0
                     action_result = "identity_cnot_swap_to_cz"
         return self._build_observation(reward, action_result)
     # ============================================================================
-    # Grader Methods (OpenEnv validator calls these on the environment instance)
-    # Each grades the CURRENT internal circuit state — no arguments needed.
     # ============================================================================
     def grade_easy(self) -> float:
-        """
-        Grader for Easy Task.
-        Pure compression ratio — any reduction in gate count earns proportional score.
-        """
         if self._initial_gate_count == 0:
-            return 1.0
-        final_count = len(self._circuit)
-        compression = (self._initial_gate_count - final_count) / self._initial_gate_count
-        return max(0.0, min(1.0, compression))
     def grade_medium(self) -> float:
-        """
-        Grader for Medium Task.
-        Scaled so that 20% compression = full score (1.0).
-        Partial credit below threshold encourages progress.
-        """
         if self._initial_gate_count == 0:
-            return 1.0
-        final_count = len(self._circuit)
-        compression = (self._initial_gate_count - final_count) / self._initial_gate_count
-        scaled = compression / 0.20
-        return max(0.0, min(1.0, scaled))
     def grade_hard(self) -> float:
-        """
-        Grader for Hard Task.
-        Scaled so that 35% compression = full score (1.0).
-        Harder threshold reflects genuine difficulty of deep entangled circuits.
-        """
         if self._initial_gate_count == 0:
-            return 1.0
-        final_count = len(self._circuit)
-        compression = (self._initial_gate_count - final_count) / self._initial_gate_count
-        scaled = compression / 0.35
-        return max(0.0, min(1.0, scaled))
     # ============================================================================
     # Internal helpers
@@ -291,6 +355,7 @@ class QuantumCircuitOptimizationEnvironment(Environment):
                 "step": self._state.step_count,
                 "initial_count": self._initial_gate_count,
                 "seed": self.current_seed,
             },
         )
@@ -298,6 +363,7 @@ class QuantumCircuitOptimizationEnvironment(Environment):
         if len(self._circuit) == 0:
             return True
         for i in range(len(self._circuit)):
             curr_gate = self._circuit[i]
             active_qubits = set(curr_gate.target_qubits)
@@ -311,22 +377,10 @@ class QuantumCircuitOptimizationEnvironment(Environment):
                         return False
                     break
         for i in range(len(self._circuit) - 1):
             if not set(self._circuit[i].target_qubits).intersection(
                     set(self._circuit[i + 1].target_qubits)):
                 return False
-        return True
-    def grade(self) -> float:
-        """Grade current state using the active task's grader."""
-        grader_method = {
-            "easy": self.grade_easy,
-            "medium": self.grade_medium,
-            "hard": self.grade_hard,
-        }[self.task_name]
-        return grader_method()
-    @property
-    def state(self) -> State:
-        return self._state

 - Instance-isolated PRNG (seeding) for strict reproducibility in server environments.
 - Relative Compression Grading: Evaluates agents on compression ratio rather than
   an absolute theoretical minimum, mirroring real-world NP-Hard quantum optimization constraints.
+- Advanced action tracking: medium/hard graders reward agents that discover
+  algebraic identities (H-X-H=Z, CNOT-SWAP=CZ) beyond simple cancellations.
 """
+import os
 import random
 from uuid import uuid4
 from openenv.core.env_server.interfaces import Environment
+from openenv.core.env_server.types import EnvironmentMetadata, State
 from quantum_openenv_env.models import QuantumAction, QuantumGate, QuantumObservation
 # ============================================================================
+# Graders (imported from graders.py)
 # ============================================================================
+from quantum_openenv_env.server.graders import grade_easy, grade_medium, grade_hard
 GRADERS = {
+    "easy":   grade_easy,
+    "medium": grade_medium,
+    "hard":   grade_hard,
 }
     The agent acts as a quantum compiler, reducing circuit depth by applying
     mathematical identities and commutativity rules across 3 difficulty tiers.
+    Observation:
+        circuit       - Current list of QuantumGate objects
+        gate_count    - Number of gates remaining
+        num_qubits    - System qubit count
+        done          - Episode terminal flag
+        reward        - Last step reward
+        metadata      - task, initial_count, step, seed, used_advanced_actions
     Action types:
+        1 - Cancel identical self-inverse gate pairs          (+1.0)
+        2 - Swap adjacent commuting gates (different qubits)  (-0.05)
+        3 - Replace H-X-H sequence with Z gate                (+2.0)
+        4 - Replace CNOT-SWAP sequence with CZ gate           (+1.0)
+        Invalid actions                                        (-0.1)
     """
     SUPPORTS_CONCURRENT_SESSIONS: bool = True
+    SELF_INVERSE_GATES = {
+        "H", "X", "Y", "Z", "CNOT", "CX", "CZ", "SWAP",
+        "CCX", "TOFFOLI", "CSWAP", "FREDKIN"
+    }
     def __init__(self, task: str = "random", seed: int = None):
+        # Also read from environment variable so Docker env_vars work
+        if task == "random":
+            task = os.getenv("QUANTUM_TASK", "random")
         self.mode = task
         if self.mode != "random" and self.mode not in TASK_CONFIGS:
+            raise ValueError(
+                f"Unknown task: {task}. Must be 'random' or one of {list(TASK_CONFIGS.keys())}"
+            )
         self._state = State(episode_id=str(uuid4()), step_count=0)
         self._reset_count = 0
         self.task_config = TASK_CONFIGS["easy"]
         self._circuit: list[QuantumGate] = []
         self._initial_gate_count = 0
+        self._used_advanced_actions = False  # tracks action 3 or 4 usage this episode
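With the new fallback in `__init__`, the task resolves in this order: an explicit non-random `task` argument, then the `QUANTUM_TASK` environment variable, then random mode. A minimal sketch of that precedence, pulled out into a hypothetical `resolve_task` helper that is not part of the environment:

```python
import os


def resolve_task(task: str = "random") -> str:
    """Hypothetical helper mirroring the QUANTUM_TASK fallback in __init__."""
    # Only the value "random" defers to the environment variable.
    if task == "random":
        task = os.getenv("QUANTUM_TASK", "random")
    return task


if __name__ == "__main__":
    os.environ["QUANTUM_TASK"] = "hard"
    print(resolve_task())        # hard
    print(resolve_task("easy"))  # easy (explicit argument wins)
    del os.environ["QUANTUM_TASK"]
    print(resolve_task())        # random
```

Because the default and an explicit `task="random"` are indistinguishable, passing `"random"` cannot restore random mode while `QUANTUM_TASK` is set.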
+    # ============================================================================
+    # OpenEnv API
+    # ============================================================================
+    def reset(self, seed: int = None, **kwargs) -> QuantumObservation:
         """Reset the environment to a fresh circuit for the configured task."""
         self._state = State(episode_id=str(uuid4()), step_count=0)
         self._reset_count += 1
+        self._used_advanced_actions = False
+        if seed is not None:
+            self.current_seed = seed
+            self.rng = random.Random(self.current_seed)
         if self.mode == "random":
             self.task_name = self.rng.choice(TASKS)
                 "reset_count": self._reset_count,
                 "initial_count": self._initial_gate_count,
                 "seed": self.current_seed,
+                "used_advanced_actions": False,
             },
         )
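When `reset()` receives a seed, it rebuilds `self.rng` as a fresh `random.Random(seed)`. Every draw from that instance, including the random-mode task choice, then repeats exactly for the same seed and is unaffected by the global `random` module. A minimal sketch of that property: `TASKS` is copied from the module, while `draw_episode` and the gate pool are hypothetical stand-ins for the circuit generator, which is elided in this diff.

```python
import random

TASKS = ["easy", "medium", "hard"]            # same list as the module-level TASKS
GATE_NAMES = ["H", "X", "Z", "CNOT", "SWAP"]  # hypothetical gate pool


def draw_episode(seed: int, n_gates: int = 8) -> tuple[str, list[str]]:
    """Hypothetical stand-in for reset(seed): pick a task, then sample gates."""
    rng = random.Random(seed)  # instance-isolated PRNG, as in reset()
    task = rng.choice(TASKS)
    gates = [rng.choice(GATE_NAMES) for _ in range(n_gates)]
    return task, gates


if __name__ == "__main__":
    random.seed(999)   # reseeding the global PRNG does not affect draw_episode
    a = draw_episode(42)
    random.random()    # neither does consuming global random numbers
    b = draw_episode(42)
    print(a == b)      # True
```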
+    def step(self, action: QuantumAction, **kwargs) -> QuantumObservation:  # type: ignore[override]
         """Execute one action in the environment."""
         self._state.step_count += 1
         target_index = action.target_index
         gate_at_index = self._circuit[target_index]
         active_qubits = set(gate_at_index.target_qubits)
+        # ------------------------------------------------------------------
         # ACTION 1: Cancel Identical Self-Inverse Gates
+        # ------------------------------------------------------------------
         if action_type == 1:
             next_gate_index = None
             for j in range(target_index + 1, len(self._circuit)):
                 reward = 1.0
                 action_result = "cancelled_identical"
+        # ------------------------------------------------------------------
         # ACTION 2: Swap Commuting Gates
+        # ------------------------------------------------------------------
         elif action_type == 2:
             if target_index + 1 < len(self._circuit):
                 next_gate = self._circuit[target_index + 1]
                     reward = -0.05
                     action_result = "swapped_commuting"
+        # ------------------------------------------------------------------
+        # ACTION 3: Replace H-X-H with Z  (advanced identity)
+        # ------------------------------------------------------------------
         elif action_type == 3:
             if target_index + 2 < len(self._circuit):
                 g1 = self._circuit[target_index]
                 g2 = self._circuit[target_index + 1]
                 g3 = self._circuit[target_index + 2]
                 if (g1.name == "H" and g2.name == "X" and g3.name == "H" and
                         g1.target_qubits == g2.target_qubits == g3.target_qubits):
                     self._circuit.pop(target_index + 2)
                     self._circuit.pop(target_index + 1)
+                    self._circuit[target_index] = QuantumGate(
+                        name="Z", target_qubits=g1.target_qubits
+                    )
                     reward = 2.0
                     action_result = "identity_hxh_to_z"
+                    self._used_advanced_actions = True  # track for medium grader
+        # ------------------------------------------------------------------
+        # ACTION 4: Replace CNOT-SWAP with CZ  (advanced identity)
+        # ------------------------------------------------------------------
         elif action_type == 4:
             if target_index + 1 < len(self._circuit):
                 g1 = self._circuit[target_index]
                 g2 = self._circuit[target_index + 1]
                 if (g1.name == "CNOT" and g2.name == "SWAP" and
                         set(g1.target_qubits) == set(g2.target_qubits)):
                     self._circuit.pop(target_index + 1)
+                    self._circuit[target_index] = QuantumGate(
+                        name="CZ", target_qubits=g1.target_qubits
+                    )
                     reward = 1.0
                     action_result = "identity_cnot_swap_to_cz"
+                    self._used_advanced_actions = True  # track for medium grader
         return self._build_observation(reward, action_result)
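Action 3 rests on the single-qubit identity H·X·H = Z, which a quick numerical check with plain-Python matrix products (no NumPy assumed) confirms. The same check applied to action 4 shows that CNOT followed by SWAP yields a permutation matrix rather than the diagonal CZ. Action 4 is therefore a rewrite rule defined by this environment, not a unitary identity. The matrices use the standard computational basis |q0 q1⟩ with qubit 0 as the control; the helper names are hypothetical.

```python
import math

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def close(a: Matrix, b: Matrix, tol: float = 1e-12) -> bool:
    """Element-wise comparison within a floating-point tolerance."""
    return all(abs(x - y) <= tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]

# Two-qubit gates in the |q0 q1> basis, q0 = control for CNOT and CZ.
CNOT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
SWAP = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
CZ = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1]]

# A circuit applies gates left to right, so "A then B" is the product B @ A.
hxh = matmul(H, matmul(X, H))           # H, X, H (palindromic, order-free)
cnot_then_swap = matmul(SWAP, CNOT)     # CNOT, then SWAP

if __name__ == "__main__":
    print(close(hxh, Z))                # True
    print(close(cnot_then_swap, CZ))    # False
```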
+    @property
+    def state(self) -> State:
+        return self._state
+    def get_metadata(self) -> EnvironmentMetadata:
+        """
+        Return human-readable metadata shown in the HF Space web UI and
+        consumed by the platform's agent during Phase 2 evaluation.
+        """
+        return EnvironmentMetadata(
+            name="Quantum Circuit Optimizer",
+            description=(
+                "RL environment where an agent acts as a quantum compiler, "
+                "reducing circuit depth by applying gate cancellation, "
+                "commutativity swaps, and algebraic identities "
+                "(H·X·H = Z, CNOT·SWAP = CZ) across 3 difficulty tiers "
+                "(2-qubit easy → 4-qubit medium → 6-qubit hard with deep entanglement)."
+            ),
+            version="0.1.0",
+        )
     # ============================================================================
+    # Grader methods (called by OpenEnv validator on the environment instance)
     # ============================================================================
+    @staticmethod
+    def _strict(score: float) -> float:
+        """Clamp to strictly (0.0, 1.0) — platform rejects exactly 0.0 or 1.0."""
+        return max(0.01, min(0.99, float(score)))
     def grade_easy(self) -> float:
+        """Pure compression ratio — any gate removal earns proportional credit."""
         if self._initial_gate_count == 0:
+            return self._strict(0.5)
+        compression = (self._initial_gate_count - len(self._circuit)) / self._initial_gate_count
+        return self._strict(compression)
     def grade_medium(self) -> float:
+        """Compression ratio + 0.15 bonus for using advanced identity actions."""
         if self._initial_gate_count == 0:
+            return self._strict(0.5)
+        compression = (self._initial_gate_count - len(self._circuit)) / self._initial_gate_count
+        bonus = 0.15 if self._used_advanced_actions else 0.0
+        return self._strict(compression + bonus)
     def grade_hard(self) -> float:
+        """Weighted blend: 70% compression + 30% step efficiency."""
         if self._initial_gate_count == 0:
+            return self._strict(0.5)
+        compression = (self._initial_gate_count - len(self._circuit)) / self._initial_gate_count
+        step_efficiency = max(0.0, 1.0 - (self._state.step_count / 150))
+        score = 0.7 * compression + 0.3 * step_efficiency
+        return self._strict(score)
+    def grade(self) -> float:
+        """Grade current state using the active task's grader method."""
+        return {"easy": self.grade_easy, "medium": self.grade_medium, "hard": self.grade_hard}[
+            self.task_name
+        ]()
     # ============================================================================
     # Internal helpers
                 "step": self._state.step_count,
                 "initial_count": self._initial_gate_count,
                 "seed": self.current_seed,
+                "used_advanced_actions": self._used_advanced_actions,
             },
         )
         if len(self._circuit) == 0:
             return True
+        # Check for any valid cancellation
         for i in range(len(self._circuit)):
             curr_gate = self._circuit[i]
             active_qubits = set(curr_gate.target_qubits)
                         return False
                     break
+        # Check for any valid swap
         for i in range(len(self._circuit) - 1):
             if not set(self._circuit[i].target_qubits).intersection(
                     set(self._circuit[i + 1].target_qubits)):
                 return False
+        return True
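The final loop treats any adjacent pair of gates on disjoint qubits as a still-available action 2 move, so the episode is not finished. A minimal sketch of that test and of the swap itself, using a hypothetical frozen `Gate` dataclass in place of `QuantumGate`. The swap body of `step()` is elided in this diff, so `swap_at` is an assumption based on the docstring's "different qubits" rule. The example shows why swaps matter: moving X(1) aside brings the two H(0) gates together for an action 1 cancellation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """Hypothetical stand-in for QuantumGate: a name plus its qubit indices."""
    name: str
    target_qubits: tuple[int, ...]


def has_valid_swap(circuit: list[Gate]) -> bool:
    """True if some adjacent pair acts on disjoint qubits (action 2 applies)."""
    return any(
        not set(a.target_qubits) & set(b.target_qubits)
        for a, b in zip(circuit, circuit[1:])
    )


def swap_at(circuit: list[Gate], i: int) -> list[Gate]:
    """Copy of circuit with gates i and i+1 exchanged if their qubits are disjoint."""
    a, b = circuit[i], circuit[i + 1]
    if set(a.target_qubits) & set(b.target_qubits):
        return list(circuit)  # shared qubit: treated as an invalid swap
    out = list(circuit)
    out[i], out[i + 1] = b, a
    return out


if __name__ == "__main__":
    circuit = [Gate("H", (0,)), Gate("X", (1,)), Gate("H", (0,))]
    print(has_valid_swap(circuit))                # True
    print([g.name for g in swap_at(circuit, 0)])  # ['X', 'H', 'H']
```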