Spaces: musharraf7/esctr-environment

musharraf7 committed 29 days ago

Commit f16c8fc (verified) · 1 Parent(s): b15694c

Sync with github: Training results and advanced RLVR environment

Files changed (18)

.gitattributes +1 -0
.gitignore +11 -0
Academic framing: what to cite and how to position ESCTR.txt +0 -0
PLAN.md +165 -0
README.md +51 -4
hf_upload.py +15 -0
openenv_esctr_environment.egg-info/PKG-INFO +13 -0
openenv_esctr_environment.egg-info/SOURCES.txt +14 -0
openenv_esctr_environment.egg-info/dependency_links.txt +1 -0
openenv_esctr_environment.egg-info/entry_points.txt +2 -0
openenv_esctr_environment.egg-info/requires.txt +9 -0
openenv_esctr_environment.egg-info/top_level.txt +1 -0
plots/comparison_chart.png +0 -0
plots/loss_curve.png +0 -0
plots/reward_curve.png +0 -0
plots/training_dashboard.png +3 -0
train.py +284 -0
uv.lock +1 -1

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+plots/training_dashboard.png filter=lfs diff=lfs merge=lfs -text

.gitignore CHANGED

@@ -17,3 +17,14 @@ RESEARCH_2.md
 ROUND_2_GUIDELINES.md
 course.md
 hf_token.txt
+esctr_hackathon_strategy.md
+OpenEnv Hackathon Opening Ceremony _ 25th Apr.txt
+Meta OpenEnv Hackathon Participant Help Guide.md
+smoke_test.py
+gpro.py
+hackathon_presentation.md
+esctr_hackathon_strategy.md
+OpenEnv Hackathon Opening Ceremony _ 25th Apr.txt
+generate_plots.py
+huggingface.db
+huggingface.db-journal

Academic framing: what to cite and how to position ESCTR.txt ADDED

File without changes

PLAN.md ADDED

	@@ -0,0 +1,165 @@

+# 🎯 ESCTR 30-Hour Battle Plan
+**Start:** April 25, 2:30 PM IST (after lunch + ceremony)
+**Deadline:** April 26, 3:00 PM IST (submission deadline)
+**Available:** ~24.5 hours of real work time
+---
+## Current Status Audit
+| Component | Status | Notes |
+|-----------|--------|-------|
+| Environment (server/) | ✅ DONE | 3 tasks, 4 tools, adversarial vendor, procedural gen |
+| OpenEnv compliance | ✅ DONE | reset/step/state, typed schemas, openenv.yaml |
+| HF Space deployed | ✅ DONE | `musharraf7/esctr-environment` |
+| Inference script | ✅ DONE | Multi-turn, task-specific prompts, [START/STEP/END] |
+| Training script | ✅ DONE | `train.py` — TRL GRPO with environment_factory |
+| Training evidence (plots) | ✅ DONE | 4 plots: reward, loss, dashboard, comparison |
+| Baseline vs Trained comparison | ✅ DONE | 222% reward improvement table in README |
+| Blog / Video / Slides | ❌ MISSING | **Non-negotiable requirement** |
+| README (storytelling) | ✅ DONE | Training results + plots + comparison table embedded |
+---
+## Scoring Breakdown & Strategy
+| Criterion | Weight | Our Current Score | Target | How |
+|-----------|--------|------------------|--------|-----|
+| Environment Innovation | 40% | 35/40 | 38/40 | Already strong; polish README framing |
+| Storytelling & Presentation | 30% | 5/30 | 25/30 | README rewrite + video/slides + pitch |
+| Showing Training Improvement | 20% | 16/20 | 16/20 | ✅ Plots + comparison table done |
+| Reward & Training Pipeline | 10% | 8/10 | 8/10 | ✅ Working TRL GRPO script + Colab |
+**Current estimated: ~64/100 → Target: ~87/100** (need video/slides for remaining storytelling points)
+---
+## The Plan
+### BLOCK 1: Hours 0-3 (2:30 PM - 5:30 PM, Apr 25)
+**Goal: Get training loop working**
+- [x] Claim HF compute credits ($30): https://huggingface.co/coupons/claim/hf-openenv-community
+- [x] Claim Cursor credits: https://tinyurl.com/sclr-openenv-dashboard
+- [x] Study the reference training scripts (TRL OpenEnv docs, Wordle GRPO, environment_factory pattern)
+- [x] Build `train.py`:
+  - TRL GRPOTrainer with `environment_factory=ESCTRToolEnv`
+  - Environment runs **in-process** (no HTTP needed)
+  - Model: Qwen/Qwen3-1.7B (efficient on T4 with vLLM colocate)
+  - 4 tool methods: query_database, read_document, communicate_vendor, submit_financial_decision
+  - Start with Task 1 ONLY (procurement_reconciliation)
+- [x] Run smoke test: verify rewards flow on 5-10 episodes
+  - ✅ Ran locally: `smoke_test.py` passed all checks (reward=0.3, tools work, done=True)
+### BLOCK 2: Hours 3-6 (5:30 PM - 8:30 PM, Apr 25)
+**Goal: Training is running and producing data**
+- [x] Fix any bugs from smoke test (bf16→fp16, vLLM disabled, OOM→reduced completion length)
+- [x] Start real training on Task 1 (procurement_reconciliation)
+  - Qwen3-0.6B, 500 episodes, T4 GPU (Colab), ~2 hours
+  - Logged via Trackio: https://huggingface.co/spaces/musharraf7/esctr-grpo-trained
+- [x] Baseline evaluation: extracted from early training steps (reward=0.09 at step 1)
+- [x] Training completed: 502 steps, 7225 seconds
+### BLOCK 3: Hours 6-8 (8:30 PM - 10:30 PM, Apr 25)
+**Goal: README v2 + verify training is alive**
+- [x] Training completed — reward stabilized at 0.30 (+222% from baseline)
+- [x] README rewrite done:
+  1. ✅ Problem hook with enterprise supply chain framing
+  2. ✅ Environment summary with task table
+  3. ✅ Reward architecture (dense + verifiable)
+  4. ✅ Training plots embedded (4 PNGs)
+  5. ✅ Before/after comparison table
+  6. ✅ Links to Space, Trackio dashboard
+### BLOCK 4: Hours 8-14 (10:30 PM - 4:30 AM, Apr 26)
+**Goal: Extended training + sleep in shifts**
+- [x] DECISION: Task 1 training is complete and sufficient
+- [ ] ~~Multi-task training~~ (DROPPED — single-task results are strong enough)
+- [ ] Sleep and rest for Day 2
+### BLOCK 5: Hours 14-18 (6:00 AM - 10:00 AM, Apr 26)
+**Goal: Harvest training results**
+- [x] Training complete (502 steps)
+- [x] Metrics extracted from Trackio SQLite database
+- [x] Generated 4 plots:
+  - ✅ `plots/reward_curve.png` — reward over training steps
+  - ✅ `plots/loss_curve.png` — loss over training steps
+  - ✅ `plots/training_dashboard.png` — 4-panel (reward, entropy, tools, completion length)
+  - ✅ `plots/comparison_chart.png` — baseline vs trained bar chart
+- [x] Comparison table in README (0.09→0.30 reward, +222%)
+- [x] Plots committed and pushed to GitHub
+### BLOCK 6: Hours 18-22 (10:00 AM - 2:00 PM, Apr 26)
+**Goal: Storytelling artifacts + final README**
+- [ ] Final README with embedded plots and comparison table
+- [ ] Produce ONE of:
+  - **Option A (fastest):** 3-5 slide deck (Google Slides) — Problem → Environment → Training → Results → Impact
+  - **Option B:** <2 min screen recording showing environment + training curves
+  - **Option C:** Mini HF blog post
+- [ ] Link everything from README:
+  - HF Space URL
+  - Training notebook / script
+  - Video / slides / blog
+  - Plots
+- [ ] Prepare 90-second verbal pitch (even if not presented live):
+  - "ESCTR trains LLMs to be autonomous financial controllers..."
+  - "We applied RLVR to enterprise supply chain auditing..."
+  - "The trained model improved X% on reward and stopped accepting bad vendor settlements..."
+### BLOCK 7: Hours 22-24 (2:00 PM - 3:00 PM, Apr 26)
+**Goal: Final polish + submission**
+- [ ] Final git push to GitHub
+- [ ] Final push to HuggingFace Space
+- [ ] Verify HF Space is building and healthy
+- [ ] Open README in fresh browser — can a judge understand everything in 3 minutes?
+- [ ] Verify ALL links work (Space, notebook, video/slides)
+- [ ] **SUBMIT before 3:00 PM**
+---
+## Non-Negotiables (if time gets tight, these CANNOT be dropped)
+1. ✅ Working training script connected to environment
+2. ✅ At least ONE readable reward plot from a real run
+3. ✅ Baseline vs trained comparison (table or chart)
+4. ✅ README links to ALL assets (Space, notebook, video/slides)
+5. ✅ Short memorable narrative about supply chain auditing
+## Things to DROP if behind schedule
+- Multi-task training (just do Task 1 if needed)
+- Fancy video (use slides instead — 20 min to make, link from README)
+- Perfect plots (ugly but real beats beautiful but fake)
+- Environment polish (don't touch server/ code — it's done)
+---
+## Key Resources
+| Resource | URL |
+|----------|-----|
+| HF Credits | https://huggingface.co/coupons/claim/hf-openenv-community |
+| Cursor Credits | https://tinyurl.com/sclr-openenv-dashboard |
+| TRL Wordle GRPO | https://github.com/huggingface/trl/blob/main/examples/notebooks/openenv_wordle_grpo.ipynb |
+| TRL Sudoku GRPO | https://github.com/huggingface/trl/blob/main/examples/notebooks/openenv_sudoku_grpo.ipynb |
+| Unsloth 2048 | https://github.com/meta-pytorch/OpenEnv/blob/main/tutorial/examples/unsloth_2048.ipynb |
+| HF Jobs Docs | https://huggingface.co/docs/hub/jobs |
+| Our HF Space | https://huggingface.co/spaces/musharraf7/esctr-environment |
+| TRL OpenEnv Docs | https://huggingface.co/docs/trl/en/openenv |
+---
+## Quick Decision Rules
+- **"Should I add a feature to the environment?"** → NO. Environment is frozen.
+- **"Training is crashing, what do I prioritize?"** → Fix training. It's 30% of score (20% evidence + 10% pipeline).
+- **"I have 2 hours left, what do I do?"** → Commit plots + update README + push. Everything must be visible in the repo.
+- **"Plots are ugly"** → Ship them. Ugly real plots > no plots.
+- **"Should I train on all 3 tasks?"** → Only if Task 1 is stable. Task 1 alone is enough.
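
Review note: the estimated totals in the "Scoring Breakdown & Strategy" table can be checked by summing its rows. The sketch below copies the per-criterion numbers from that table; the dictionary layout and variable names are only illustrative.

```python
# Per-criterion (current, target) scores copied from the PLAN.md scoring table.
scores = {
    "Environment Innovation": (35, 38),
    "Storytelling & Presentation": (5, 25),
    "Showing Training Improvement": (16, 16),
    "Reward & Training Pipeline": (8, 8),
}

# Sum each column separately to get the overall estimate out of 100.
current_total = sum(current for current, _ in scores.values())
target_total = sum(target for _, target in scores.values())

print(f"current={current_total}/100 target={target_total}/100")
```

With the rows as listed, the columns sum to 64 (current) and 87 (target).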

README.md CHANGED

@@ -88,12 +88,54 @@ R_total = α·R_outcome + β·R_trajectory − penalties
 - **Hard to game**: An agent that spams queries gets penalized by step costs; an agent that submits without investigating gets 0 trajectory reward
 - **Verifiable**: The correct answer is always a precise floating-point number derived from contract terms — no subjective evaluation
-## Results
-*Training evidence and reward plots will be added during the onsite hackathon (April 25-26) when compute credits are provided.*
-<!-- Placeholder for training results -->
-<!-- ![Reward curves](plots/reward_curves.png) -->
+## Training Results
+We trained **Qwen3-0.6B** on the Procurement Reconciliation task using **TRL's GRPOTrainer** with `environment_factory`, running 500 episodes on a T4 GPU (~2 hours).
+### Reward Curve
+The model improved from near-zero reward to a stable 0.30 within the first 100 training steps, representing a **222% improvement** in mean reward:
+![Reward curve over 500 training steps](https://raw.githubusercontent.com/Musharraf1128/esctr-environment/main/plots/reward_curve.png)
+### Training Dashboard
+Four-panel view showing reward, policy entropy, tool usage convergence, and completion length:
+![ESCTR GRPO Training Dashboard](https://raw.githubusercontent.com/Musharraf1128/esctr-environment/main/plots/training_dashboard.png)
+### Baseline vs Trained Comparison
+| Metric | Baseline (untrained) | Trained (500 episodes) | Δ |
+|--------|---------------------|----------------------|---|
+| Mean Reward | 0.09 | 0.30 | **+222%** |
+| Tool Success Rate | 60% | 100% | **+67%** |
+| Investigation Completeness | 40% | 100% | **+150%** |
+| Tool Calls/Episode | erratic (1-4) | stable 3.0 | converged |
+| Tool Failures | frequent | 0 | eliminated |
+![Baseline vs Trained comparison](https://raw.githubusercontent.com/Musharraf1128/esctr-environment/main/plots/comparison_chart.png)
+### Key Findings
+1. **Tool mastery learned**: The model converged to exactly 3 tool calls per episode with zero failures — it learned the correct investigation pattern (query PO → query Invoice → read documents → submit)
+2. **Trajectory reward captured**: The 0.30 plateau corresponds to perfect trajectory score (all investigation milestones hit) but without solving the final arithmetic — showing the reward decomposition works as designed
+3. **Policy entropy stable**: Entropy did not collapse to zero, indicating the model maintains exploration capacity for future training with larger models
+4. **Scaling hypothesis**: The 0.6B model learned *investigation procedure* but not *arithmetic reasoning* — we predict larger models (3B+) will break through the 0.30 plateau to achieve outcome rewards
+### Training Configuration
+| Parameter | Value |
+|-----------|-------|
+| Model | `Qwen/Qwen3-0.6B` |
+| Algorithm | GRPO (Group Relative Policy Optimization) |
+| Framework | TRL `GRPOTrainer` + `environment_factory` |
+| Episodes | 500 |
+| GPU | NVIDIA T4 (Colab) |
+| Training Time | ~2 hours |
+| Max Completion Length | 768 tokens |
+📊 **Live training dashboard**: [Trackio Space](https://huggingface.co/spaces/musharraf7/esctr-grpo-trained)
 ## Quick Start
@@ -178,6 +220,11 @@ python inference.py
 │   ├── procedural.py      # Deterministic scenario generation engine
 │   ├── graders.py         # Multi-axis deterministic graders (3 tasks)
 │   └── models.py          # Pydantic Action/Observation/State schemas
+├── plots/
+│   ├── reward_curve.png   # Training reward over steps
+│   ├── training_dashboard.png  # Multi-panel training metrics
+│   └── comparison_chart.png    # Baseline vs Trained comparison
+├── train.py               # TRL GRPO training script (environment_factory)
 ├── inference.py           # Baseline inference script
 ├── openenv.yaml           # OpenEnv manifest
 ├── pyproject.toml         # Package config
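
Review note: the Δ column in the "Baseline vs Trained Comparison" table is a relative change, which can be recomputed from the baseline and trained columns. The sketch below is one way to do that; `relative_change` is a hypothetical helper name and the inputs are the rounded values shown in the table.

```python
def relative_change(baseline: float, trained: float) -> float:
    """Return the percentage change from baseline to trained."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative change")
    return (trained - baseline) / baseline * 100.0


# (baseline, trained) pairs copied from the README comparison table.
rows = {
    "Mean Reward": (0.09, 0.30),
    "Tool Success Rate": (60.0, 100.0),
    "Investigation Completeness": (40.0, 100.0),
}

for name, (baseline, trained) in rows.items():
    print(f"{name}: {relative_change(baseline, trained):+.0f}%")
```

The tool success and investigation rows reproduce +67% and +150%. The rounded mean-reward values give roughly +233% rather than +222%; the reported figure may have been computed from unrounded rewards (a baseline near 0.093 would give +222%), but that is a guess and worth confirming against the Trackio data.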

hf_upload.py ADDED

	@@ -0,0 +1,15 @@

+from huggingface_hub import HfApi
+import os
+token = open("hf_token.txt").read().strip()
+api = HfApi(token=token)
+print("Uploading to huggingface spaces...")
+api.upload_folder(
+    folder_path=".",
+    repo_id="musharraf7/esctr-environment",
+    repo_type="space",
+    ignore_patterns=[".git/*", ".venv/*", "huggingface.db", "huggingface.db-journal", "__pycache__/*"],
+    commit_message="Sync with github: Training results and advanced RLVR environment",
+)
+print("Done!")
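
Review note: the script reads the token from `hf_token.txt` with a bare `open(...)` and leaves `import os` unused. A possible alternative, sketched below, checks an environment variable first and falls back to the file. `resolve_hf_token` is a hypothetical helper, not part of this commit, and the choice of `HF_TOKEN` as the variable name is an assumption about how the author would want to configure it.

```python
import os
from pathlib import Path


def resolve_hf_token(env_var: str = "HF_TOKEN", token_file: str = "hf_token.txt") -> str:
    """Return a token from the environment if set, otherwise from a local file."""
    token = os.environ.get(env_var, "").strip()
    if token:
        return token
    path = Path(token_file)
    if path.is_file():
        # read_text opens and closes the file, so no handle is left dangling.
        return path.read_text(encoding="utf-8").strip()
    raise RuntimeError(f"No token found in ${env_var} or {token_file}")
```

The result could then be passed as `HfApi(token=resolve_hf_token())`, which keeps the rest of the upload script unchanged.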

openenv_esctr_environment.egg-info/PKG-INFO ADDED

	@@ -0,0 +1,13 @@

+Metadata-Version: 2.4
+Name: openenv-esctr-environment
+Version: 0.1.0
+Summary: Enterprise Supply Chain & Tax Reconciliation Environment for OpenEnv — train LLMs to investigate procurement discrepancies, enforce SLA penalties, and navigate adversarial vendor disputes
+Requires-Python: >=3.10
+Requires-Dist: openenv-core>=0.2.0
+Requires-Dist: fastapi>=0.115.0
+Requires-Dist: pydantic>=2.0.0
+Requires-Dist: uvicorn[standard]>=0.24.0
+Requires-Dist: openai>=1.0.0
+Requires-Dist: requests>=2.31.0
+Provides-Extra: dev
+Requires-Dist: pytest>=8.0.0; extra == "dev"

openenv_esctr_environment.egg-info/SOURCES.txt ADDED

	@@ -0,0 +1,14 @@

+README.md
+pyproject.toml
+openenv_esctr_environment.egg-info/PKG-INFO
+openenv_esctr_environment.egg-info/SOURCES.txt
+openenv_esctr_environment.egg-info/dependency_links.txt
+openenv_esctr_environment.egg-info/entry_points.txt
+openenv_esctr_environment.egg-info/requires.txt
+openenv_esctr_environment.egg-info/top_level.txt
+server/__init__.py
+server/app.py
+server/environment.py
+server/graders.py
+server/models.py
+server/procedural.py

openenv_esctr_environment.egg-info/dependency_links.txt ADDED

	@@ -0,0 +1 @@


+

openenv_esctr_environment.egg-info/entry_points.txt ADDED

	@@ -0,0 +1,2 @@


+[console_scripts]
+server = server.app:main

openenv_esctr_environment.egg-info/requires.txt ADDED

	@@ -0,0 +1,9 @@

+openenv-core>=0.2.0
+fastapi>=0.115.0
+pydantic>=2.0.0
+uvicorn[standard]>=0.24.0
+openai>=1.0.0
+requests>=2.31.0
+[dev]
+pytest>=8.0.0

openenv_esctr_environment.egg-info/top_level.txt ADDED

	@@ -0,0 +1 @@


+server

plots/comparison_chart.png ADDED

plots/loss_curve.png ADDED

plots/reward_curve.png ADDED

plots/training_dashboard.png ADDED

Git LFS Details

SHA256: 06b9fd037a3e0f2462e2151c0284c341d07fae27d66fdfc4b546a9666bd40239
Pointer size: 131 Bytes
Size of remote file: 263 kB

train.py ADDED

	@@ -0,0 +1,284 @@

+#!/usr/bin/env python3
+"""
+ESCTR Training Script — GRPO with TRL + vLLM
+=============================================
+Train an LLM to be an autonomous financial controller using
+Group Relative Policy Optimization (GRPO) against the ESCTR environment.
+Usage (Colab / HF Jobs):
+    pip install -Uq "trl[vllm]" trackio datasets
+    pip install -e .        # install esctr-environment package
+    python train.py
+The environment runs in-process (no HTTP server needed during training).
+The HF Space deployment is only for judges to test the environment interactively.
+"""
+import random
+import sys
+import os
+os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
+from datasets import Dataset
+from trl import GRPOConfig, GRPOTrainer
+# ---------------------------------------------------------------------------
+# Import ESCTR environment (runs in-process, no server needed)
+# ---------------------------------------------------------------------------
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+from server.environment import ESCTREnvironment
+from server.models import ESCTRAction
+# ---------------------------------------------------------------------------
+# System prompt — tells the model what it is and what tools are available
+# ---------------------------------------------------------------------------
+SYSTEM_PROMPT = """You are an autonomous Financial Controller AI operating within an enterprise ERP system.
+Your job is to investigate financial discrepancies in procurement records by using the available tools, then submit a precise monetary adjustment.
+INVESTIGATION WORKFLOW:
+1. Query databases to discover what records exist (purchase_orders, invoices, shipping_logs, sla_contracts, warehouse_logs)
+2. Read specific documents to get full details
+3. Compare line items, delivery dates, and contract terms
+4. Calculate the exact adjustment amount
+5. Submit your financial decision with the calculated amount and reasoning
+CRITICAL RULES:
+- Always query AND read documents before submitting. Never guess.
+- Your adjustment_amount must be the EXACT monetary difference you calculated.
+- Show your arithmetic in the adjustment_reason.
+- If a vendor offers a settlement, verify their claims against internal records before accepting.
+You have access to the following tools. Call them to interact with the ERP system."""
+# ---------------------------------------------------------------------------
+# ESCTR Environment wrapper for TRL environment_factory
+# ---------------------------------------------------------------------------
+# TRL discovers public methods (with docstrings) as callable tools.
+# The model generates tool calls; TRL executes them and feeds results back.
+# ---------------------------------------------------------------------------
+# Task to train on — start with the easiest task for stable training
+TRAIN_TASK = os.environ.get("ESCTR_TASK", "procurement_reconciliation")
+class ESCTRToolEnv:
+    """TRL-compatible wrapper around the ESCTR environment.
+    Public methods with docstrings are auto-discovered as tools by TRL's
+    environment_factory. The trainer handles the multi-turn loop automatically.
+    """
+    def __init__(self):
+        self.env = ESCTREnvironment()
+        self.reward = 0.0
+        self.done = False
+        self._task = TRAIN_TASK
+    def reset(self, **kwargs) -> str | None:
+        """Reset the environment and return the initial briefing."""
+        seed = random.randint(0, 100_000)
+        obs = self.env.reset(
+            task_name=self._task,
+            seed=seed,
+        )
+        self.reward = 0.0
+        self.done = False
+        return obs.system_response
+    def query_database(self, table: str) -> str:
+        """
+        Query a corporate database table to discover available records.
+        Args:
+            table: The database table to query. One of: 'purchase_orders', 'invoices', 'shipping_logs', 'sla_contracts', 'warehouse_logs'
+        Returns:
+            A summary of records found in the specified table.
+        """
+        if self.done:
+            raise ValueError("Episode is over. No more actions allowed.")
+        action = ESCTRAction(
+            action_type="query_database",
+            query_parameters={"table": table},
+        )
+        obs = self.env.step(action)
+        self.reward = obs.reward
+        self.done = obs.done
+        return obs.system_response
+    def read_document(self, document_id: str) -> str:
+        """
+        Read a specific document by its unique identifier to see full details.
+        Args:
+            document_id: The document ID to read, e.g. 'PO-2024-0055' or 'INV-2024-0055'
+        Returns:
+            The full contents of the requested document.
+        """
+        if self.done:
+            raise ValueError("Episode is over. No more actions allowed.")
+        action = ESCTRAction(
+            action_type="read_document",
+            document_id=document_id,
+        )
+        obs = self.env.step(action)
+        self.reward = obs.reward
+        self.done = obs.done
+        return obs.system_response
+    def communicate_vendor(self, message_content: str) -> str:
+        """
+        Send a message to the vendor during a dispute negotiation.
+        Args:
+            message_content: The message to send to the vendor, such as requesting clarification or rejecting a settlement offer.
+        Returns:
+            The vendor's response to your message.
+        """
+        if self.done:
+            raise ValueError("Episode is over. No more actions allowed.")
+        action = ESCTRAction(
+            action_type="communicate_vendor",
+            message_content=message_content,
+        )
+        obs = self.env.step(action)
+        self.reward = obs.reward
+        self.done = obs.done
+        return obs.system_response
+    def submit_financial_decision(self, adjustment_amount: float, adjustment_reason: str) -> str:
+        """
+        Submit the final financial adjustment. This is the terminal action that ends the episode.
+        Args:
+            adjustment_amount: The exact monetary adjustment amount as a float (e.g. 450.00). Must be calculated from the documents.
+            adjustment_reason: A brief explanation of why this adjustment is correct, including your arithmetic.
+        Returns:
+            The grading result with your score and feedback.
+        """
+        if self.done:
+            raise ValueError("Episode is over. No more actions allowed.")
+        action = ESCTRAction(
+            action_type="submit_financial_decision",
+            adjustment_amount=adjustment_amount,
+            adjustment_reason=adjustment_reason,
+        )
+        obs = self.env.step(action)
+        self.reward = obs.reward
+        self.done = obs.done
+        return obs.system_response
+# ---------------------------------------------------------------------------
+# Reward function — reads from env instances after each episode
+# ---------------------------------------------------------------------------
+def reward_func(environments, **kwargs) -> list[float]:
+    """Extract reward from each environment instance after episode completion."""
+    return [env.reward for env in environments]
+# ---------------------------------------------------------------------------
+# Training configuration
+# ---------------------------------------------------------------------------
+def main():
+    # Model selection — Qwen3-1.7B is efficient on T4 GPU
+    model_name = os.environ.get("ESCTR_MODEL", "Qwen/Qwen3-1.7B")
+    output_dir = os.environ.get("ESCTR_OUTPUT", "esctr-grpo-trained")
+    num_episodes = int(os.environ.get("ESCTR_EPISODES", "1000"))
+    # Create dataset — each entry triggers one rollout episode
+    dataset = Dataset.from_dict({
+        "prompt": [[{"role": "user", "content": SYSTEM_PROMPT}]] * num_episodes
+    })
+    # GRPO configuration
+    grpo_config = GRPOConfig(
+        # Training schedule
+        num_train_epochs=1,
+        learning_rate=1e-6,
+        gradient_accumulation_steps=4,
+        per_device_train_batch_size=1,
+        warmup_steps=10,
+        optim="adamw_torch",
+        max_grad_norm=1.0,
+        # GRPO settings
+        num_generations=2,
+        max_completion_length=768,
+        log_completions=True,
+        num_completions_to_print=2,
+        chat_template_kwargs={"enable_thinking": False},
+        # Logging
+        output_dir=output_dir,
+        report_to="trackio",
+        trackio_space_id=output_dir,
+        logging_steps=1,
+        save_steps=25,
+        save_total_limit=2,
+        # Memory optimization
+        gradient_checkpointing=True,
+        bf16=False,
+        fp16=True,
+        # Hub integration
+        push_to_hub=True,
+    )
+    # Create trainer
+    trainer = GRPOTrainer(
+        model=model_name,
+        reward_funcs=reward_func,
+        train_dataset=dataset,
+        args=grpo_config,
+        environment_factory=ESCTRToolEnv,
+    )
+    # Show GPU stats before training
+    import torch
+    if torch.cuda.is_available():
+        gpu_stats = torch.cuda.get_device_properties(0)
+        start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+        max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+        print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+        print(f"{start_gpu_memory} GB of memory reserved.")
+    print(f"\n{'='*60}")
+    print(f"ESCTR Training — {model_name}")
+    print(f"Task: {TRAIN_TASK}")
+    print(f"Episodes: {num_episodes}")
+    print(f"Output: {output_dir}")
+    print(f"{'='*60}\n")
+    # Train!
+    trainer_stats = trainer.train()
+    # Show training stats
+    if torch.cuda.is_available():
+        used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+        print(f"\nTraining completed in {trainer_stats.metrics['train_runtime']:.0f} seconds")
+        print(f"Peak GPU memory: {used_memory} GB / {max_memory} GB")
+    # Save and push
+    trainer.save_model(output_dir)
+    trainer.push_to_hub()
+    print(f"\nModel saved to {output_dir} and pushed to Hub!")
+if __name__ == "__main__":
+    main()
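
Review note: the four tool methods in `ESCTRToolEnv` repeat the same guard, step, and bookkeeping lines. One way to factor that out is a private `_step` helper, sketched below against a stub so it runs without the ESCTR package. `StubObservation`, `StubEnvironment`, and `ToolEnvSketch` are hypothetical names, the stub takes a plain dict where the real code builds an `ESCTRAction`, and the 0.3 reward is a placeholder. Whether TRL skips underscore-prefixed methods during tool discovery is an assumption based on the "public methods" comment in `train.py`.

```python
from dataclasses import dataclass


@dataclass
class StubObservation:
    system_response: str
    reward: float
    done: bool


class StubEnvironment:
    """Hypothetical stand-in for ESCTREnvironment; the episode ends on submit."""

    def step(self, action: dict) -> StubObservation:
        is_submit = action["action_type"] == "submit_financial_decision"
        return StubObservation(
            system_response=f"handled {action['action_type']}",
            reward=0.3 if is_submit else 0.0,  # placeholder reward values
            done=is_submit,
        )


class ToolEnvSketch:
    def __init__(self) -> None:
        self.env = StubEnvironment()
        self.reward = 0.0
        self.done = False

    def _step(self, **action_fields) -> str:
        # Shared guard and bookkeeping that each tool method currently repeats.
        if self.done:
            raise ValueError("Episode is over. No more actions allowed.")
        obs = self.env.step(action_fields)
        self.reward = obs.reward
        self.done = obs.done
        return obs.system_response

    def query_database(self, table: str) -> str:
        """Query a corporate database table to discover available records."""
        return self._step(action_type="query_database", query_parameters={"table": table})

    def submit_financial_decision(self, adjustment_amount: float, adjustment_reason: str) -> str:
        """Submit the final financial adjustment and end the episode."""
        return self._step(
            action_type="submit_financial_decision",
            adjustment_amount=adjustment_amount,
            adjustment_reason=adjustment_reason,
        )


def reward_func(environments, **kwargs) -> list[float]:
    """Read the last reward stored on each environment wrapper."""
    return [env.reward for env in environments]


env = ToolEnvSketch()
env.query_database("invoices")
env.submit_financial_decision(450.0, "PO total minus invoice total")
print(reward_func([env]))
```

The public methods keep their docstrings so they would remain discoverable as tools, while the reward is still read from the wrapper after the episode, as in the committed `reward_func`.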

uv.lock CHANGED

@@ -1515,7 +1515,7 @@ wheels = [
 ]
 [[package]]
-name = "openenv-invoice-extraction-env"
+name = "openenv-esctr-environment"
 version = "0.1.0"
 source = { editable = "." }
 dependencies = [