NorthernTribe-Research committed
Commit 0be512a · verified · 1 Parent(s): f505b3a

Upgrade training pipeline with post-eval quality gates and tactical UI controls.

Files changed (5):
  1. README.md +35 -35
  2. app.py +171 -7
  3. configs/deepseek_math_sota.yaml +40 -2
  4. scripts/eval_sota.py +344 -61
  5. scripts/train_sota.py +387 -47
README.md CHANGED
@@ -9,52 +9,52 @@ pinned: false
9
 
10
  # Math Conjecture Trainer Space
11
 
12
- This Space is the tactical training console for the project: it pulls released
13
- training corpus splits, builds runtime config from the SOTA curriculum YAML,
14
- executes multi-stage `DeepSeek-Math` fine-tuning, optionally evaluates
15
- self-consistency, and can publish adapters/checkpoints/training summaries to:
16
 
17
- - `NorthernTribe-Research/math-conjecture-model` (when push is enabled)
18
 
19
- ## What this Space does
 
20
 
21
- 1. Downloads dataset parquet splits from:
22
- `NorthernTribe-Research/math-conjecture-training-corpus`
23
- 2. Builds a runtime config from `configs/deepseek_math_sota.yaml`
24
- 3. Runs `scripts/train_sota.py` for staged curriculum training
25
- 4. Optionally runs `scripts/eval_sota.py`
26
- 5. Streams logs and run summary JSON in the UI
27
 
28
- ## Authentication mode
29
 
30
- The app is autonomous and does not require entering an HF token in the UI.
31
- It resolves auth in this order:
32
 
33
- 1. `HF_TOKEN` environment variable
34
- 2. `HUGGINGFACE_HUB_TOKEN` environment variable
35
- 3. `huggingface-api-key.json` (if present)
36
 
37
- If no token is found, training can still run on public datasets, and hub push is
38
- automatically disabled for that run.
39
 
40
- ## Operational controls in UI
 
 
41
 
42
- - `Preflight Only (No Training)`: validates data/config/stage pipeline using
43
- `train_sota.py --dry-run`.
44
- - `Push Adapter to Hub`: controls whether `hub.push_to_hub` is enabled in the
45
- runtime config.
46
- - `Force Dataset Redownload`: bypasses cached local parquet files.
47
- - `Stop Active Run`: requests cancellation and terminates active subprocesses.
48
- - `Run Summary (JSON)`: structured output with config, status, and metrics.
49
 
50
- ## Default training config
51
 
52
- - `configs/deepseek_math_sota.yaml`
53
- - base model default: `deepseek-ai/deepseek-math-v2`
54
- - output root: `workspace/runs/math-conjecture-sota`
55
 
56
  ## Notes
57
 
58
- - Full training expects GPU hardware.
59
- - Runtime config generated by the app is stored at:
60
- `workspace/runtime/deepseek_math_sota.runtime.yaml`.
 
9
 
10
  # Math Conjecture Trainer Space
11
 
12
+ Launch multi-stage DeepSeek-Math fine-tuning on Space GPU and push adapters to your model repo.
 
 
 
13
 
14
+ This Space is the tactical operations console for `maths-conjuncture-solutions` and is wired to:
15
 
16
+ - dataset: `NorthernTribe-Research/math-conjecture-training-corpus`
17
+ - model repo: `NorthernTribe-Research/math-conjecture-model`
18
 
19
+ ## End-to-end flow
20
 
21
+ 1. Download released parquet splits (`train/validation/test`).
22
+ 2. Build runtime config from `configs/deepseek_math_sota.yaml`.
23
+ 3. Run 4-stage curriculum LoRA fine-tuning with `scripts/train_sota.py`.
24
+ 4. Run post-train evaluation (`pass@1`, `pass@k`, exact/boxed, family metrics).
25
+ 5. Apply quality gate thresholds before hub push.
26
+ 6. Emit `training_summary.json` + `post_eval_report.json` and stream live telemetry in UI.
27
 
28
+ ## Autonomous authentication
 
29
 
30
+ No token input is required in the UI.
 
 
31
 
32
+ Resolution order:
 
33
 
34
+ 1. `HF_TOKEN`
35
+ 2. `HUGGINGFACE_HUB_TOKEN`
36
+ 3. `huggingface-api-key.json`
37
 
38
+ If no token is available, public dataset training still works and push is automatically disabled.
39
 
40
+ ## Runtime controls
41
 
42
+ - `Run Evaluation After Training`: toggles post-train eval in runtime config.
43
+ - `Enforce Quality Gate`: enables/disables promotion gate checks.
44
+ - `Gate Min pass@1`, `Gate Min pass@k`, `Gate Min Rows`: runtime gate thresholds.
45
+ - `Preflight Only (No Training)`: validates pipeline with `--dry-run`.
46
+ - `Force Dataset Redownload`: bypasses cached parquet files.
47
+ - `Abort Active Run`: cancels active subprocess tree.
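The `Abort Active Run` control cancels the whole subprocess tree. One common way to implement such a control on POSIX (a sketch of the general technique, not necessarily how `app.py` does it) is to launch each run in its own process group and signal the entire group on cancel:

```python
import os
import signal
import subprocess
import sys


def start_run(cmd: list) -> subprocess.Popen:
    # start_new_session=True puts the child in its own process group,
    # so a later group signal reaches the child and anything it spawned.
    return subprocess.Popen(cmd, start_new_session=True)


def abort_run(proc: subprocess.Popen, timeout: float = 10.0) -> None:
    """Terminate the run's process group, escalating to SIGKILL on timeout."""
    if proc.poll() is not None:
        return  # already finished
    try:
        # With start_new_session=True the child's pgid equals its pid.
        os.killpg(proc.pid, signal.SIGTERM)
    except (ProcessLookupError, PermissionError):
        proc.terminate()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
```

On Windows, `os.killpg` is unavailable and a `CREATE_NEW_PROCESS_GROUP` / `taskkill /T` approach would be needed instead.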
48
+
49
+ ## Artifacts
50
+
51
+ - runtime config: `workspace/runtime/deepseek_math_sota.runtime.yaml`
52
+ - run output root: `workspace/runs/math-conjecture-sota`
53
+ - final adapter: `workspace/runs/math-conjecture-sota/final_adapter`
54
+ - training summary: `workspace/runs/math-conjecture-sota/training_summary.json`
55
+ - post-eval report: `workspace/runs/math-conjecture-sota/post_eval_report.json`
56
 
57
  ## Notes
58
 
59
+ - Full training requires GPU hardware.
60
+ - App handles Gradio copy-button compatibility across versions automatically.
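The three-step token resolution order described in the README can be sketched as follows; the assumption that `huggingface-api-key.json` stores the token under a `"token"` key is illustrative, since the file's actual layout is not shown in this diff:

```python
import json
import os
from pathlib import Path
from typing import Optional


def resolve_hf_token(key_file: Path = Path("huggingface-api-key.json")) -> Optional[str]:
    """Resolve an HF token: env vars first, then an optional key file.

    Returns None when nothing is found, in which case the app disables
    hub push for the run and proceeds with public datasets only.
    """
    for env_name in ("HF_TOKEN", "HUGGINGFACE_HUB_TOKEN"):
        token = os.environ.get(env_name, "").strip()
        if token:
            return token
    if key_file.exists():
        try:
            data = json.loads(key_file.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            return None
        if isinstance(data, dict):
            # The "token" key name is an assumption about the file layout.
            token = str(data.get("token", "")).strip()
            return token or None
    return None
```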
 
app.py CHANGED
@@ -198,15 +198,58 @@ PROJECT_DESCRIPTION = """
198
  # Math Conjecture Trainer
199
  This console runs the full training operations lane for the `maths-conjuncture-solutions` project:
200
 
 
 
201
  1. Pull released parquet splits from `NorthernTribe-Research/math-conjecture-training-corpus`.
202
  2. Build runtime training configuration from `configs/deepseek_math_sota.yaml`.
203
  3. Execute multi-stage DeepSeek-Math curriculum fine-tuning via `scripts/train_sota.py`.
204
- 4. Optionally evaluate adapters with pass@k-style sampling via `scripts/eval_sota.py`.
205
- 5. Auto-resolve Hugging Face credentials, push adapters/checkpoints/summary when allowed, and stream live logs.
206
- 6. Support preflight validation, abort control, cache strategy, and structured run-summary telemetry in one UI.
207
  """
208
 
209
 
210
  def now_ts() -> str:
211
  return dt.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S UTC")
212
 
@@ -390,7 +433,15 @@ def write_runtime_config(
390
  model_repo_id: str,
391
  train_file: str,
392
  validation_file: str,
393
  push_to_hub: bool,
394
  ) -> Path:
395
  cfg = yaml.safe_load(CONFIG_TEMPLATE.read_text(encoding="utf-8"))
396
  cfg["model"]["base_model"] = base_model_id
@@ -399,6 +450,21 @@ def write_runtime_config(
399
  cfg["data"]["default_train_file"] = train_file
400
  cfg["data"]["default_validation_file"] = validation_file
401
  cfg["global"]["output_root"] = str(TRAIN_OUTPUT_DIR)
402
  runtime_path = RUNTIME_DIR / "deepseek_math_sota.runtime.yaml"
403
  runtime_path.write_text(yaml.safe_dump(cfg, sort_keys=False), encoding="utf-8")
404
  return runtime_path
@@ -511,6 +577,10 @@ def run_pipeline(
511
  run_eval: bool,
512
  eval_k: int,
513
  eval_samples: int,
514
  push_to_hub: bool,
515
  force_redownload: bool,
516
  preflight_only: bool,
@@ -542,14 +612,25 @@ def run_pipeline(
542
  stage_count = int(max_stages)
543
  eval_k = int(eval_k)
544
  eval_samples = int(eval_samples)
 
 
 
545
  if stage_start < 1:
546
  raise ValueError("Start stage must be >= 1.")
 
 
547
  if stage_count < 1:
548
  raise ValueError("How many stages must be >= 1.")
549
  if eval_k < 1:
550
  raise ValueError("Eval K must be >= 1.")
551
  if eval_samples < 1:
552
  raise ValueError("Eval max samples must be >= 1.")
553
 
554
  for required_path in (CONFIG_TEMPLATE, TRAIN_SCRIPT):
555
  if not required_path.exists():
@@ -570,6 +651,10 @@ def run_pipeline(
570
  "run_eval": bool(run_eval),
571
  "eval_k": eval_k,
572
  "eval_samples": eval_samples,
573
  "push_to_hub": bool(push_to_hub),
574
  "force_redownload": bool(force_redownload),
575
  "preflight_only": bool(preflight_only),
@@ -633,7 +718,15 @@ def run_pipeline(
633
  model_repo_id=model_repo_id,
634
  train_file=train_file,
635
  validation_file=validation_file,
636
  push_to_hub=effective_push_to_hub,
637
  )
638
  summary["runtime_config"] = str(runtime_cfg)
639
  append_log(log_lines, f"Wrote runtime config: {runtime_cfg}")
@@ -701,15 +794,50 @@ def run_pipeline(
701
  return
702
 
703
  training_summary_path = TRAIN_OUTPUT_DIR / "training_summary.json"
 
704
  if training_summary_path.exists():
705
  try:
706
  summary["training_summary_path"] = str(training_summary_path)
707
- summary["training_summary"] = json.loads(training_summary_path.read_text(encoding="utf-8"))
708
  except json.JSONDecodeError:
709
  summary["training_summary_path"] = str(training_summary_path)
710
  summary["training_summary"] = {"warning": "Unable to parse training summary JSON."}
711
 
712
- if run_eval:
713
  eval_report = WORKSPACE_DIR / "runs" / "latest_eval_report.json"
714
  eval_cmd = [
715
  sys.executable,
@@ -763,9 +891,12 @@ def run_pipeline(
763
  if eval_report.exists():
764
  report = json.loads(eval_report.read_text(encoding="utf-8"))
765
  summary["evaluation"] = {
 
766
  "evaluated_rows": report.get("evaluated_rows"),
767
  "pass_at_1": report.get("pass_at_1"),
768
  "pass_at_k": report.get("pass_at_k"),
 
 
769
  "k": report.get("k"),
770
  "report_path": str(eval_report),
771
  }
@@ -807,12 +938,41 @@ with gr.Blocks(title="Math Conjecture Trainer Space") as demo:
807
  value="deepseek-ai/deepseek-math-v2",
808
  )
809
  with gr.Row():
810
- start_stage = gr.Slider(label="Stage Start", minimum=1, maximum=3, step=1, value=1)
811
- max_stages = gr.Slider(label="Stage Count", minimum=1, maximum=3, step=1, value=3)
812
  run_eval = gr.Checkbox(label="Run Evaluation After Training", value=True)
813
  with gr.Row():
814
  eval_k = gr.Slider(label="Evaluation K", minimum=1, maximum=8, step=1, value=4)
815
  eval_samples = gr.Slider(label="Evaluation Max Samples", minimum=50, maximum=1000, step=50, value=300)
816
  with gr.Row():
817
  push_to_hub = gr.Checkbox(label="Push Adapter to Hub", value=True)
818
  force_redownload = gr.Checkbox(label="Force Dataset Redownload", value=False)
@@ -843,6 +1003,10 @@ with gr.Blocks(title="Math Conjecture Trainer Space") as demo:
843
  run_eval,
844
  eval_k,
845
  eval_samples,
846
  push_to_hub,
847
  force_redownload,
848
  preflight_only,
 
198
  # Math Conjecture Trainer
199
  This console runs the full training operations lane for the `maths-conjuncture-solutions` project:
200
 
201
+ Launch multi-stage DeepSeek-Math fine-tuning on Space GPU and push adapters to your model repo.
202
+
203
  1. Pull released parquet splits from `NorthernTribe-Research/math-conjecture-training-corpus`.
204
  2. Build runtime training configuration from `configs/deepseek_math_sota.yaml`.
205
  3. Execute multi-stage DeepSeek-Math curriculum fine-tuning via `scripts/train_sota.py`.
206
+ 4. Run post-training evaluation with pass@k-style sampling and family-level metrics.
207
+ 5. Enforce autonomous quality gates before adapter promotion/push.
208
+ 6. Auto-resolve Hugging Face credentials, stream live telemetry, and emit structured run summaries.
209
  """
210
 
211
 
212
+ def _safe_float(value: Any, default: float) -> float:
213
+ try:
214
+ return float(value)
215
+ except (TypeError, ValueError):
216
+ return default
217
+
218
+
219
+ def _safe_int(value: Any, default: int) -> int:
220
+ try:
221
+ return int(value)
222
+ except (TypeError, ValueError):
223
+ return default
224
+
225
+
226
+ def load_template_defaults() -> Dict[str, Any]:
227
+ if not CONFIG_TEMPLATE.exists():
228
+ return {}
229
+ try:
230
+ cfg = yaml.safe_load(CONFIG_TEMPLATE.read_text(encoding="utf-8"))
231
+ except Exception:
232
+ return {}
233
+ if not isinstance(cfg, dict):
234
+ return {}
235
+ return cfg
236
+
237
+
238
+ TEMPLATE_CFG = load_template_defaults()
239
+ TEMPLATE_STAGE_COUNT = max(1, len(TEMPLATE_CFG.get("stages", []) or [None]))
240
+ TEMPLATE_QUALITY_GATE = TEMPLATE_CFG.get("quality_gate", {})
241
+ if not isinstance(TEMPLATE_QUALITY_GATE, dict):
242
+ TEMPLATE_QUALITY_GATE = {}
243
+ _raw_gate_enabled = TEMPLATE_QUALITY_GATE.get("enabled", True)
244
+ if isinstance(_raw_gate_enabled, bool):
245
+ DEFAULT_GATE_ENABLED = _raw_gate_enabled
246
+ else:
247
+ DEFAULT_GATE_ENABLED = str(_raw_gate_enabled).strip().lower() in {"1", "true", "yes", "y", "on"}
248
+ DEFAULT_GATE_MIN_ROWS = max(1, _safe_int(TEMPLATE_QUALITY_GATE.get("min_evaluated_rows"), 120))
249
+ DEFAULT_GATE_MIN_PASS_AT_1 = max(0.0, _safe_float(TEMPLATE_QUALITY_GATE.get("min_pass_at_1"), 0.01))
250
+ DEFAULT_GATE_MIN_PASS_AT_K = max(0.0, _safe_float(TEMPLATE_QUALITY_GATE.get("min_pass_at_k"), 0.06))
251
+
252
+
253
  def now_ts() -> str:
254
  return dt.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S UTC")
255
 
 
433
  model_repo_id: str,
434
  train_file: str,
435
  validation_file: str,
436
+ test_file: str,
437
+ run_eval: bool,
438
+ eval_k: int,
439
+ eval_samples: int,
440
  push_to_hub: bool,
441
+ enforce_quality_gate: bool,
442
+ gate_min_pass_at_1: float,
443
+ gate_min_pass_at_k: float,
444
+ gate_min_rows: int,
445
  ) -> Path:
446
  cfg = yaml.safe_load(CONFIG_TEMPLATE.read_text(encoding="utf-8"))
447
  cfg["model"]["base_model"] = base_model_id
 
450
  cfg["data"]["default_train_file"] = train_file
451
  cfg["data"]["default_validation_file"] = validation_file
452
  cfg["global"]["output_root"] = str(TRAIN_OUTPUT_DIR)
453
+
454
+ cfg.setdefault("post_eval", {})
455
+ cfg["post_eval"]["enabled"] = bool(run_eval)
456
+ cfg["post_eval"]["eval_file"] = test_file
457
+ cfg["post_eval"]["k"] = int(eval_k)
458
+ cfg["post_eval"]["max_samples"] = int(eval_samples)
459
+ cfg["post_eval"]["output_json"] = str(TRAIN_OUTPUT_DIR / "post_eval_report.json")
460
+
461
+ cfg.setdefault("quality_gate", {})
462
+ cfg["quality_gate"]["enabled"] = bool(enforce_quality_gate)
463
+ cfg["quality_gate"]["min_evaluated_rows"] = int(gate_min_rows)
464
+ cfg["quality_gate"]["min_pass_at_1"] = float(gate_min_pass_at_1)
465
+ cfg["quality_gate"]["min_pass_at_k"] = float(gate_min_pass_at_k)
466
+ cfg["quality_gate"]["require_post_eval"] = bool(enforce_quality_gate and run_eval)
467
+
468
  runtime_path = RUNTIME_DIR / "deepseek_math_sota.runtime.yaml"
469
  runtime_path.write_text(yaml.safe_dump(cfg, sort_keys=False), encoding="utf-8")
470
  return runtime_path
 
577
  run_eval: bool,
578
  eval_k: int,
579
  eval_samples: int,
580
+ enforce_quality_gate: bool,
581
+ gate_min_pass_at_1: float,
582
+ gate_min_pass_at_k: float,
583
+ gate_min_rows: int,
584
  push_to_hub: bool,
585
  force_redownload: bool,
586
  preflight_only: bool,
 
612
  stage_count = int(max_stages)
613
  eval_k = int(eval_k)
614
  eval_samples = int(eval_samples)
615
+ gate_min_rows = int(gate_min_rows)
616
+ gate_min_pass_at_1 = float(gate_min_pass_at_1)
617
+ gate_min_pass_at_k = float(gate_min_pass_at_k)
618
  if stage_start < 1:
619
  raise ValueError("Start stage must be >= 1.")
620
+ if stage_start > TEMPLATE_STAGE_COUNT:
621
+ raise ValueError(f"Start stage must be <= {TEMPLATE_STAGE_COUNT}.")
622
  if stage_count < 1:
623
  raise ValueError("How many stages must be >= 1.")
624
  if eval_k < 1:
625
  raise ValueError("Eval K must be >= 1.")
626
  if eval_samples < 1:
627
  raise ValueError("Eval max samples must be >= 1.")
628
+ if gate_min_rows < 1:
629
+ raise ValueError("Gate minimum rows must be >= 1.")
630
+ if not 0.0 <= gate_min_pass_at_1 <= 1.0:
631
+ raise ValueError("Gate min pass@1 must be between 0 and 1.")
632
+ if not 0.0 <= gate_min_pass_at_k <= 1.0:
633
+ raise ValueError("Gate min pass@k must be between 0 and 1.")
634
 
635
  for required_path in (CONFIG_TEMPLATE, TRAIN_SCRIPT):
636
  if not required_path.exists():
 
651
  "run_eval": bool(run_eval),
652
  "eval_k": eval_k,
653
  "eval_samples": eval_samples,
654
+ "enforce_quality_gate": bool(enforce_quality_gate),
655
+ "gate_min_rows": gate_min_rows,
656
+ "gate_min_pass_at_1": gate_min_pass_at_1,
657
+ "gate_min_pass_at_k": gate_min_pass_at_k,
658
  "push_to_hub": bool(push_to_hub),
659
  "force_redownload": bool(force_redownload),
660
  "preflight_only": bool(preflight_only),
 
718
  model_repo_id=model_repo_id,
719
  train_file=train_file,
720
  validation_file=validation_file,
721
+ test_file=test_file,
722
+ run_eval=bool(run_eval),
723
+ eval_k=eval_k,
724
+ eval_samples=eval_samples,
725
  push_to_hub=effective_push_to_hub,
726
+ enforce_quality_gate=bool(enforce_quality_gate),
727
+ gate_min_pass_at_1=gate_min_pass_at_1,
728
+ gate_min_pass_at_k=gate_min_pass_at_k,
729
+ gate_min_rows=gate_min_rows,
730
  )
731
  summary["runtime_config"] = str(runtime_cfg)
732
  append_log(log_lines, f"Wrote runtime config: {runtime_cfg}")
 
794
  return
795
 
796
  training_summary_path = TRAIN_OUTPUT_DIR / "training_summary.json"
797
+ training_summary: Optional[Dict[str, Any]] = None
798
  if training_summary_path.exists():
799
  try:
800
  summary["training_summary_path"] = str(training_summary_path)
801
+ loaded_summary = json.loads(training_summary_path.read_text(encoding="utf-8"))
802
+ if isinstance(loaded_summary, dict):
803
+ training_summary = loaded_summary
804
+ summary["training_summary"] = loaded_summary
805
+ else:
806
+ summary["training_summary"] = {"warning": "Training summary JSON is not an object."}
807
  except json.JSONDecodeError:
808
  summary["training_summary_path"] = str(training_summary_path)
809
  summary["training_summary"] = {"warning": "Unable to parse training summary JSON."}
810
 
811
+ if isinstance(training_summary, dict):
812
+ quality_gate = training_summary.get("quality_gate")
813
+ if isinstance(quality_gate, dict):
814
+ summary["quality_gate"] = quality_gate
815
+ append_log(
816
+ log_lines,
817
+ f"Quality gate: passed={quality_gate.get('passed')} enabled={quality_gate.get('enabled')}",
818
+ )
819
+ push_report = training_summary.get("push")
820
+ if isinstance(push_report, dict):
821
+ summary["push"] = push_report
822
+ append_log(
823
+ log_lines,
824
+ f"Push decision: requested={push_report.get('requested')} performed={push_report.get('performed')}",
825
+ )
826
+ post_eval_report = training_summary.get("post_eval")
827
+ if run_eval and isinstance(post_eval_report, dict):
828
+ summary["evaluation"] = {
829
+ "source": "train_post_eval",
830
+ "evaluated_rows": post_eval_report.get("evaluated_rows"),
831
+ "pass_at_1": post_eval_report.get("pass_at_1"),
832
+ "pass_at_k": post_eval_report.get("pass_at_k"),
833
+ "exact_at_k": post_eval_report.get("exact_at_k"),
834
+ "composite_score": post_eval_report.get("composite_score"),
835
+ "k": post_eval_report.get("k"),
836
+ "report_path": post_eval_report.get("report_path"),
837
+ }
838
+ append_log(log_lines, "Using post-eval metrics emitted by training run.")
839
+
840
+ if run_eval and "evaluation" not in summary:
841
  eval_report = WORKSPACE_DIR / "runs" / "latest_eval_report.json"
842
  eval_cmd = [
843
  sys.executable,
 
891
  if eval_report.exists():
892
  report = json.loads(eval_report.read_text(encoding="utf-8"))
893
  summary["evaluation"] = {
894
+ "source": "fallback_eval",
895
  "evaluated_rows": report.get("evaluated_rows"),
896
  "pass_at_1": report.get("pass_at_1"),
897
  "pass_at_k": report.get("pass_at_k"),
898
+ "exact_at_k": report.get("exact_at_k"),
899
+ "composite_score": report.get("composite_score"),
900
  "k": report.get("k"),
901
  "report_path": str(eval_report),
902
  }
 
938
  value="deepseek-ai/deepseek-math-v2",
939
  )
940
  with gr.Row():
941
+ start_stage = gr.Slider(label="Stage Start", minimum=1, maximum=TEMPLATE_STAGE_COUNT, step=1, value=1)
942
+ max_stages = gr.Slider(
943
+ label="Stage Count",
944
+ minimum=1,
945
+ maximum=TEMPLATE_STAGE_COUNT,
946
+ step=1,
947
+ value=TEMPLATE_STAGE_COUNT,
948
+ )
949
  run_eval = gr.Checkbox(label="Run Evaluation After Training", value=True)
950
  with gr.Row():
951
  eval_k = gr.Slider(label="Evaluation K", minimum=1, maximum=8, step=1, value=4)
952
  eval_samples = gr.Slider(label="Evaluation Max Samples", minimum=50, maximum=1000, step=50, value=300)
953
+ with gr.Row():
954
+ enforce_quality_gate = gr.Checkbox(label="Enforce Quality Gate", value=DEFAULT_GATE_ENABLED)
955
+ gate_min_pass_at_1 = gr.Slider(
956
+ label="Gate Min pass@1",
957
+ minimum=0.0,
958
+ maximum=0.5,
959
+ step=0.005,
960
+ value=min(max(DEFAULT_GATE_MIN_PASS_AT_1, 0.0), 0.5),
961
+ )
962
+ gate_min_pass_at_k = gr.Slider(
963
+ label="Gate Min pass@k",
964
+ minimum=0.0,
965
+ maximum=1.0,
966
+ step=0.01,
967
+ value=min(max(DEFAULT_GATE_MIN_PASS_AT_K, 0.0), 1.0),
968
+ )
969
+ gate_min_rows = gr.Slider(
970
+ label="Gate Min Rows",
971
+ minimum=10,
972
+ maximum=2000,
973
+ step=10,
974
+ value=min(max(DEFAULT_GATE_MIN_ROWS, 10), 2000),
975
+ )
976
  with gr.Row():
977
  push_to_hub = gr.Checkbox(label="Push Adapter to Hub", value=True)
978
  force_redownload = gr.Checkbox(label="Force Dataset Redownload", value=False)
 
1003
  run_eval,
1004
  eval_k,
1005
  eval_samples,
1006
+ enforce_quality_gate,
1007
+ gate_min_pass_at_1,
1008
+ gate_min_pass_at_k,
1009
+ gate_min_rows,
1010
  push_to_hub,
1011
  force_redownload,
1012
  preflight_only,
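The gate thresholds wired through the UI above feed a promotion decision before any hub push. A minimal sketch of how such a check could work — the field names match the runtime config in this diff, but the decision logic itself is illustrative, not the exact `train_sota.py` implementation:

```python
from typing import Any, Dict, List, Tuple


def check_quality_gate(gate: Dict[str, Any], report: Dict[str, Any]) -> Tuple[bool, List[str]]:
    """Return (passed, failures) for a post-eval report checked against gate thresholds."""
    if not gate.get("enabled", True):
        return True, []
    failures: List[str] = []
    # (report metric, gate threshold key, default threshold)
    checks = (
        ("evaluated_rows", "min_evaluated_rows", 1),
        ("pass_at_1", "min_pass_at_1", 0.0),
        ("pass_at_k", "min_pass_at_k", 0.0),
    )
    for metric, threshold_key, default in checks:
        threshold = gate.get(threshold_key, default)
        value = report.get(metric)
        if value is None or value < threshold:
            failures.append(f"{metric}={value!r} below {threshold_key}={threshold}")
    return not failures, failures
```

A failing gate would then force `push_to_hub` off for the run, which matches the "quality gate thresholds before hub push" step in the README flow.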
configs/deepseek_math_sota.yaml CHANGED
@@ -97,17 +97,55 @@ stages:
97
  - conjecture_core
98
  require_conjecture_id: true
99
  training:
100
- num_train_epochs: 3
101
  learning_rate: 5.0e-6
102
  save_steps: 100
103
  eval_steps: 100
104
 
105
  hub:
106
  push_to_hub: true
107
  repo_id: NorthernTribe-Research/math-conjecture-model
108
  private: false
109
  upload_stage_checkpoints: true
110
- commit_message: Train multi-stage SOTA curriculum for conjecture reasoning.
111
 
112
  credentials:
113
  path: huggingface-api-key.json
 
97
  - conjecture_core
98
  require_conjecture_id: true
99
  training:
100
+ num_train_epochs: 2
101
  learning_rate: 5.0e-6
102
  save_steps: 100
103
  eval_steps: 100
104
 
105
+ - name: hard_case_polish
106
+ max_train_samples: 60000
107
+ max_eval_samples: 2000
108
+ filters:
109
+ include_families:
110
+ - conjecture_core
111
+ - formal_proof
112
+ require_conjecture_id: true
113
+ min_sample_weight: 3.0
114
+ training:
115
+ num_train_epochs: 1
116
+ learning_rate: 3.0e-6
117
+ gradient_accumulation_steps: 24
118
+ save_steps: 80
119
+ eval_steps: 80
120
+
121
+ post_eval:
122
+ enabled: true
123
+ eval_file: workspace/data/releases/v1/test.parquet
124
+ max_samples: 240
125
+ k: 6
126
+ max_new_tokens: 320
127
+ temperature: 0.7
128
+ top_p: 0.95
129
+ seed: 17
130
+ output_json: workspace/runs/math-conjecture-sota/post_eval_report.json
131
+
132
+ quality_gate:
133
+ enabled: true
134
+ require_post_eval: true
135
+ min_evaluated_rows: 120
136
+ min_pass_at_1: 0.01
137
+ min_pass_at_k: 0.06
138
+ max_final_eval_loss: 2.6
139
+ required_family_pass_at_k:
140
+ conjecture_core: 0.06
141
+ formal_proof: 0.03
142
+
143
  hub:
144
  push_to_hub: true
145
  repo_id: NorthernTribe-Research/math-conjecture-model
146
  private: false
147
  upload_stage_checkpoints: true
148
+ commit_message: Launch multi-stage DeepSeek-Math fine-tuning on Space GPU and push adapters to your model repo.
149
 
150
  credentials:
151
  path: huggingface-api-key.json
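The `min_pass_at_1` / `min_pass_at_k` thresholds in the `quality_gate` block above compare against the empirical rates that `scripts/eval_sota.py` computes: for each prompt, `k` generations are sampled, pass@1 counts a hit on the first sample, and pass@k a hit on any sample. A minimal sketch of that aggregation:

```python
from typing import List, Tuple


def pass_rates(matches_per_row: List[List[bool]]) -> Tuple[float, float]:
    """Empirical pass@1 / pass@k over per-row match lists.

    matches_per_row[i][j] is True when sample j for prompt i matched an
    expected answer; sample 0 is the first generation for that prompt.
    """
    total = len(matches_per_row)
    if total == 0:
        return 0.0, 0.0
    hit_at_1 = sum(1 for m in matches_per_row if m and m[0])
    hit_at_k = sum(1 for m in matches_per_row if any(m))
    return hit_at_1 / total, hit_at_k / total
```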
scripts/eval_sota.py CHANGED
@@ -7,7 +7,7 @@ import argparse
7
  import json
8
  import re
9
  from pathlib import Path
10
- from typing import Any, Dict, List, Optional, Sequence
11
 
12
  import torch
13
  import yaml
@@ -15,13 +15,20 @@ from datasets import load_dataset
15
  from peft import PeftModel
16
  from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
17
 
 
19
  def parse_args() -> argparse.Namespace:
20
  parser = argparse.ArgumentParser(description="Run pass@k-style evaluation on held-out split.")
21
  parser.add_argument(
22
  "--config",
23
  type=Path,
24
- default=Path("configs/deepseek_math_sota.yaml"),
25
  help="Training config used for prompt formatting defaults.",
26
  )
27
  parser.add_argument(
@@ -39,19 +46,32 @@ def parse_args() -> argparse.Namespace:
39
  parser.add_argument(
40
  "--eval-file",
41
  type=Path,
42
- default=Path("data/releases/v1/test.parquet"),
43
- help="Parquet split used for evaluation.",
44
  )
45
  parser.add_argument("--max-samples", type=int, default=300, help="Maximum evaluation rows.")
46
  parser.add_argument("--k", type=int, default=4, help="Number of sampled generations per prompt.")
47
  parser.add_argument("--max-new-tokens", type=int, default=256, help="Generation length cap.")
 
48
  parser.add_argument("--temperature", type=float, default=0.7, help="Sampling temperature.")
49
  parser.add_argument("--top-p", type=float, default=0.95, help="Nucleus sampling p.")
50
  parser.add_argument("--seed", type=int, default=17, help="Random seed.")
 
51
  parser.add_argument(
52
  "--output-json",
53
  type=Path,
54
- default=Path("workspace/runs/latest_eval_report.json"),
55
  help="Where to write evaluation report.",
56
  )
57
  return parser.parse_args()
@@ -65,6 +85,24 @@ def as_text(value: Any) -> str:
65
  return str(value).strip()
66
 
67
 
68
  def load_config(path: Path) -> Dict[str, Any]:
69
  cfg = yaml.safe_load(path.read_text(encoding="utf-8"))
70
  if not isinstance(cfg, dict):
@@ -74,9 +112,124 @@ def load_config(path: Path) -> Dict[str, Any]:
74
 
75
  def normalize_answer(text: str) -> str:
76
  text = text.strip().lower()
77
- text = re.sub(r"\s+", " ", text)
78
  text = text.replace("$", "")
79
- return text
80
 
81
 
82
  def flatten_expected(row: Dict[str, Any], data_cfg: Dict[str, Any]) -> List[str]:
@@ -168,27 +321,11 @@ def extract_candidate_text(full_generation: str, prompt_text: str) -> str:
168
  return full_generation.strip()
169
 
170
 
171
- def is_match(candidate: str, expected_values: Sequence[str]) -> bool:
172
- cand_norm = normalize_answer(candidate)
173
- if not cand_norm:
174
- return False
175
- for expected in expected_values:
176
- exp_norm = normalize_answer(expected)
177
- if not exp_norm:
178
- continue
179
- if exp_norm in cand_norm or cand_norm in exp_norm:
180
- return True
181
- boxed = re.findall(r"\\boxed\{([^{}]+)\}", cand_norm)
182
- if boxed and any(exp_norm in item for item in boxed):
183
- return True
184
- return False
185
-
186
-
187
  def load_model_and_tokenizer(
188
  base_model: str,
189
  adapter_path: Optional[Path],
190
  trust_remote_code: bool,
191
- ) -> tuple[Any, AutoTokenizer]:
192
  tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=trust_remote_code, use_fast=True)
193
  if tokenizer.pad_token is None:
194
  tokenizer.pad_token = tokenizer.eos_token or tokenizer.unk_token
@@ -207,55 +344,126 @@ def load_model_and_tokenizer(
207
  return model, tokenizer
208
 
209
 
210
- def main() -> None:
211
- args = parse_args()
212
- cfg = load_config(args.config)
213
  data_cfg = cfg.get("data", {})
214
- model_cfg = cfg.get("model", {})
215
- set_seed(args.seed)
216
 
 
 
217
  if args.k < 1:
218
  raise ValueError("--k must be >= 1.")
219
  if args.max_samples < 1:
220
  raise ValueError("--max-samples must be >= 1.")
221
  if args.max_new_tokens < 1:
222
  raise ValueError("--max-new-tokens must be >= 1.")
 
 
223
  if args.temperature <= 0:
224
  raise ValueError("--temperature must be > 0.")
225
  if not 0 < args.top_p <= 1:
226
  raise ValueError("--top-p must be in (0, 1].")
227
228
  base_model = args.base_model or as_text(model_cfg.get("base_model"))
229
  if not base_model:
230
  raise ValueError("Base model is required via --base-model or config.model.base_model.")
231
  if args.adapter_path is not None and not args.adapter_path.exists():
232
  raise FileNotFoundError(f"Adapter path not found: {args.adapter_path}")
233
234
  model, tokenizer = load_model_and_tokenizer(
235
  base_model=base_model,
236
  adapter_path=args.adapter_path,
237
  trust_remote_code=bool(model_cfg.get("trust_remote_code", False)),
238
  )
239
 
240
- if not args.eval_file.exists():
241
- raise FileNotFoundError(f"Evaluation file not found: {args.eval_file}")
242
- ds = load_dataset("parquet", data_files={"eval": str(args.eval_file)})["eval"]
243
-
244
  if args.max_samples > 0 and args.max_samples < len(ds):
245
  ds = ds.select(range(args.max_samples))
246
 
247
- total = 0
248
- hit_at_1 = 0
249
- hit_at_k = 0
250
- records = []
251
 
252
  for row in ds:
253
  expected_values = flatten_expected(row, data_cfg)
254
  if not expected_values:
 
255
  continue
 
256
  prompt_text = build_prompt_text(row, tokenizer, data_cfg)
257
- inputs = tokenizer(prompt_text, return_tensors="pt", truncation=True, max_length=4096)
258
- model_device = next(model.parameters()).device
259
  inputs = {k: v.to(model_device) for k, v in inputs.items()}
260
 
261
  with torch.no_grad():
@@ -269,44 +477,119 @@ def main() -> None:
269
  pad_token_id=tokenizer.pad_token_id,
270
  eos_token_id=tokenizer.eos_token_id,
271
  )
 
272
  generations = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
273
  candidates = [extract_candidate_text(text, prompt_text) for text in generations]
274
- matches = [is_match(candidate, expected_values) for candidate in candidates]
275
- total += 1
276
- if matches and matches[0]:
277
- hit_at_1 += 1
278
- if any(matches):
279
- hit_at_k += 1
280
-
281
- records.append(
282
- {
283
- "uid": as_text(row.get("uid")),
284
- "prompt": as_text(row.get(as_text(data_cfg.get("prompt_field")) or "prompt")),
285
- "expected_values": expected_values[:5],
286
- "candidates": candidates,
287
- "matches": matches,
288
- }
 
289
  )
290
 
291
- pass_at_1 = (hit_at_1 / total) if total else 0.0
292
- pass_at_k = (hit_at_k / total) if total else 0.0
293
- report = {
294
  "base_model": base_model,
295
  "adapter_path": str(args.adapter_path) if args.adapter_path is not None else None,
296
- "eval_file": str(args.eval_file),
297
- "evaluated_rows": total,
 
 
 
298
  "k": args.k,
299
  "pass_at_1": pass_at_1,
300
  "pass_at_k": pass_at_k,
301
  "temperature": args.temperature,
302
  "top_p": args.top_p,
303
  "max_new_tokens": args.max_new_tokens,
304
- "samples": records[:30],
305
  }
 
306
  args.output_json.parent.mkdir(parents=True, exist_ok=True)
307
  args.output_json.write_text(json.dumps(report, ensure_ascii=True, indent=2), encoding="utf-8")
308
- print(json.dumps({k: report[k] for k in ("evaluated_rows", "pass_at_1", "pass_at_k", "k")}, indent=2))
309
  print(f"Saved report to {args.output_json}")
310
 
311
 
312
  if __name__ == "__main__":
 
  import json
  import re
  from pathlib import Path
+ from typing import Any, Dict, List, Optional, Sequence, Tuple

  import torch
  import yaml

  from peft import PeftModel
  from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

+ SCRIPT_ROOT = Path(__file__).resolve().parents[1]
+ DEFAULT_CONFIG_PATH = SCRIPT_ROOT / "configs" / "deepseek_math_sota.yaml"
+ DEFAULT_OUTPUT_JSON = SCRIPT_ROOT / "runs" / "latest_eval_report.json"
+
+ BOXED_RE = re.compile(r"\\boxed\{([^{}]+)\}")
+ SPACE_RE = re.compile(r"\s+")
+

  def parse_args() -> argparse.Namespace:
      parser = argparse.ArgumentParser(description="Run pass@k-style evaluation on held-out split.")
      parser.add_argument(
          "--config",
          type=Path,
+         default=DEFAULT_CONFIG_PATH,
          help="Training config used for prompt formatting defaults.",
      )
      parser.add_argument(

      parser.add_argument(
          "--eval-file",
          type=Path,
+         default=None,
+         help="Parquet split used for evaluation (defaults to post_eval.eval_file or data.default_validation_file).",
      )
      parser.add_argument("--max-samples", type=int, default=300, help="Maximum evaluation rows.")
      parser.add_argument("--k", type=int, default=4, help="Number of sampled generations per prompt.")
      parser.add_argument("--max-new-tokens", type=int, default=256, help="Generation length cap.")
+     parser.add_argument("--max-input-length", type=int, default=4096, help="Prompt tokenization length cap.")
      parser.add_argument("--temperature", type=float, default=0.7, help="Sampling temperature.")
      parser.add_argument("--top-p", type=float, default=0.95, help="Nucleus sampling p.")
      parser.add_argument("--seed", type=int, default=17, help="Random seed.")
+     parser.add_argument(
+         "--progress-every",
+         type=int,
+         default=25,
+         help="Print progress every N evaluated rows (0 disables).",
+     )
+     parser.add_argument(
+         "--sample-records",
+         type=int,
+         default=30,
+         help="How many sample records to store in report.",
+     )
      parser.add_argument(
          "--output-json",
          type=Path,
+         default=DEFAULT_OUTPUT_JSON,
          help="Where to write evaluation report.",
      )
      return parser.parse_args()
 
      return str(value).strip()


+ def as_float(value: Any, default: float) -> float:
+     if value is None:
+         return default
+     try:
+         return float(value)
+     except (TypeError, ValueError):
+         return default
+
+
+ def as_int(value: Any, default: int) -> int:
+     if value is None:
+         return default
+     try:
+         return int(value)
+     except (TypeError, ValueError):
+         return default
+
+
  def load_config(path: Path) -> Dict[str, Any]:
      cfg = yaml.safe_load(path.read_text(encoding="utf-8"))
      if not isinstance(cfg, dict):

  def normalize_answer(text: str) -> str:
      text = text.strip().lower()
      text = text.replace("$", "")
+     text = text.replace("\\left", "").replace("\\right", "")
+     text = text.replace("\\,", "").replace("\\!", "").replace("\\;", "")
+     text = SPACE_RE.sub(" ", text)
+     return text.strip(" .")
+
+
+ def extract_boxed_values(text: str) -> List[str]:
+     return [normalize_answer(match) for match in BOXED_RE.findall(text or "") if normalize_answer(match)]
+
+
+ def parse_numeric_value(text: str) -> Optional[float]:
+     normalized = normalize_answer(text)
+     if not normalized:
+         return None
+     candidate = normalized.replace(",", "")
+     if re.fullmatch(r"[-+]?\d+\s*/\s*[-+]?\d+", candidate):
+         left, right = candidate.split("/", maxsplit=1)
+         try:
+             numerator = float(left.strip())
+             denominator = float(right.strip())
+         except ValueError:
+             return None
+         if denominator == 0:
+             return None
+         return numerator / denominator
+     if re.fullmatch(r"[-+]?(?:\d+\.\d*|\d*\.\d+|\d+)(?:[eE][-+]?\d+)?", candidate):
+         try:
+             return float(candidate)
+         except ValueError:
+             return None
+     return None
+
+
+ def approximately_equal(left: float, right: float) -> bool:
+     tolerance = 1e-6 * max(1.0, abs(left), abs(right))
+     return abs(left - right) <= tolerance
+
+
+ def match_candidate(candidate: str, expected_values: Sequence[str]) -> Dict[str, Any]:
+     cand_norm = normalize_answer(candidate)
+     if not cand_norm:
+         return {
+             "match": False,
+             "exact": False,
+             "boxed": False,
+             "numeric": False,
+             "reason": "empty_candidate",
+         }
+
+     cand_boxed = extract_boxed_values(candidate)
+     cand_num = parse_numeric_value(cand_norm)
+
+     substring_hit = False
+     boxed_hit = False
+     numeric_hit = False
+
+     for expected in expected_values:
+         exp_norm = normalize_answer(expected)
+         if not exp_norm:
+             continue
+
+         if cand_norm == exp_norm:
+             return {
+                 "match": True,
+                 "exact": True,
+                 "boxed": exp_norm in cand_boxed,
+                 "numeric": False,
+                 "reason": "exact",
+             }
+
+         if exp_norm in cand_norm or cand_norm in exp_norm:
+             substring_hit = True
+
+         expected_boxed = extract_boxed_values(expected)
+         for cand_box in cand_boxed:
+             if cand_box == exp_norm or exp_norm in cand_box or cand_box in exp_norm:
+                 boxed_hit = True
+         for exp_box in expected_boxed:
+             if cand_norm == exp_box or exp_box in cand_norm or cand_norm in exp_box:
+                 boxed_hit = True
+
+         exp_num = parse_numeric_value(exp_norm)
+         if cand_num is not None and exp_num is not None and approximately_equal(cand_num, exp_num):
+             numeric_hit = True
+
+     if boxed_hit:
+         return {
+             "match": True,
+             "exact": False,
+             "boxed": True,
+             "numeric": numeric_hit,
+             "reason": "boxed",
+         }
+     if numeric_hit:
+         return {
+             "match": True,
+             "exact": False,
+             "boxed": False,
+             "numeric": True,
+             "reason": "numeric",
+         }
+     if substring_hit:
+         return {
+             "match": True,
+             "exact": False,
+             "boxed": False,
+             "numeric": False,
+             "reason": "substring",
+         }
+
+     return {
+         "match": False,
+         "exact": False,
+         "boxed": False,
+         "numeric": False,
+         "reason": "no_match",
+     }
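The new `match_candidate` resolves a candidate against the expected answers in a fixed priority order: exact normalized equality, then `\boxed{}` agreement, then numeric closeness, then substring overlap. A condensed, self-contained sketch of that ordering (function names here are illustrative, not the module's API):

```python
import re

BOXED_RE = re.compile(r"\\boxed\{([^{}]+)\}")


def normalize(text: str) -> str:
    # Simplified analogue of normalize_answer: lowercase, drop $, collapse spacing.
    text = text.strip().lower().replace("$", "")
    return re.sub(r"\s+", " ", text).strip(" .")


def match_reason(candidate: str, expected: str) -> str:
    # Same priority order as the cascade above: exact > boxed > numeric > substring.
    cand, exp = normalize(candidate), normalize(expected)
    if cand == exp:
        return "exact"
    if any(normalize(box) == exp for box in BOXED_RE.findall(candidate)):
        return "boxed"
    try:
        if abs(float(cand) - float(exp)) <= 1e-6 * max(1.0, abs(float(exp))):
            return "numeric"
    except ValueError:
        pass
    if exp in cand or cand in exp:
        return "substring"
    return "no_match"


print(match_reason("The answer is \\boxed{42}", "42"))  # boxed
```

The ordering matters: a `\boxed{}` hit is trusted over a loose substring hit, so a solution that boxes the wrong value but mentions the right one in passing is not credited as boxed.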


  def flatten_expected(row: Dict[str, Any], data_cfg: Dict[str, Any]) -> List[str]:
 
      return full_generation.strip()


  def load_model_and_tokenizer(
      base_model: str,
      adapter_path: Optional[Path],
      trust_remote_code: bool,
+ ) -> Tuple[Any, AutoTokenizer]:
      tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=trust_remote_code, use_fast=True)
      if tokenizer.pad_token is None:
          tokenizer.pad_token = tokenizer.eos_token or tokenizer.unk_token

      return model, tokenizer

+ def make_bucket() -> Dict[str, Any]:
+     return {
+         "evaluated_rows": 0,
+         "pass_at_1_hits": 0,
+         "pass_at_k_hits": 0,
+         "exact_at_1_hits": 0,
+         "exact_at_k_hits": 0,
+         "boxed_at_k_hits": 0,
+     }
+
+
+ def update_bucket(bucket: Dict[str, Any], hit1: bool, hitk: bool, exact1: bool, exactk: bool, boxedk: bool) -> None:
+     bucket["evaluated_rows"] += 1
+     if hit1:
+         bucket["pass_at_1_hits"] += 1
+     if hitk:
+         bucket["pass_at_k_hits"] += 1
+     if exact1:
+         bucket["exact_at_1_hits"] += 1
+     if exactk:
+         bucket["exact_at_k_hits"] += 1
+     if boxedk:
+         bucket["boxed_at_k_hits"] += 1
+
+
+ def finalize_bucket(bucket: Dict[str, Any]) -> Dict[str, Any]:
+     total = max(int(bucket.get("evaluated_rows", 0)), 1)
+     rows = int(bucket.get("evaluated_rows", 0))
+     return {
+         "evaluated_rows": rows,
+         "pass_at_1": float(bucket.get("pass_at_1_hits", 0)) / total,
+         "pass_at_k": float(bucket.get("pass_at_k_hits", 0)) / total,
+         "exact_at_1": float(bucket.get("exact_at_1_hits", 0)) / total,
+         "exact_at_k": float(bucket.get("exact_at_k_hits", 0)) / total,
+         "boxed_at_k": float(bucket.get("boxed_at_k_hits", 0)) / total,
+     }
+
384
+
385
+ def resolve_eval_file(arg_eval_file: Optional[Path], cfg: Dict[str, Any]) -> Path:
386
+ if arg_eval_file is not None:
387
+ return arg_eval_file
388
+ post_eval_cfg = cfg.get("post_eval", {})
389
  data_cfg = cfg.get("data", {})
390
+ for candidate in (
391
+ as_text(post_eval_cfg.get("eval_file")),
392
+ as_text(data_cfg.get("default_validation_file")),
393
+ "data/releases/v1/test.parquet",
394
+ "workspace/data/releases/v1/test.parquet",
395
+ ):
396
+ if not candidate:
397
+ continue
398
+ path = Path(candidate)
399
+ if path.exists():
400
+ return path
401
+ return Path("data/releases/v1/test.parquet")
402
 
403
+
404
+ def run_evaluation(args: argparse.Namespace) -> Dict[str, Any]:
405
  if args.k < 1:
406
  raise ValueError("--k must be >= 1.")
407
  if args.max_samples < 1:
408
  raise ValueError("--max-samples must be >= 1.")
409
  if args.max_new_tokens < 1:
410
  raise ValueError("--max-new-tokens must be >= 1.")
411
+ if args.max_input_length < 128:
412
+ raise ValueError("--max-input-length must be >= 128.")
413
  if args.temperature <= 0:
414
  raise ValueError("--temperature must be > 0.")
415
  if not 0 < args.top_p <= 1:
416
  raise ValueError("--top-p must be in (0, 1].")
417
 
418
+ cfg = load_config(args.config)
419
+ data_cfg = cfg.get("data", {})
420
+ model_cfg = cfg.get("model", {})
421
+ set_seed(args.seed)
422
+
423
  base_model = args.base_model or as_text(model_cfg.get("base_model"))
424
  if not base_model:
425
  raise ValueError("Base model is required via --base-model or config.model.base_model.")
426
  if args.adapter_path is not None and not args.adapter_path.exists():
427
  raise FileNotFoundError(f"Adapter path not found: {args.adapter_path}")
428
 
429
+ eval_file = resolve_eval_file(args.eval_file, cfg)
430
+ if not eval_file.exists():
431
+ raise FileNotFoundError(f"Evaluation file not found: {eval_file}")
432
+
433
  model, tokenizer = load_model_and_tokenizer(
434
  base_model=base_model,
435
  adapter_path=args.adapter_path,
436
  trust_remote_code=bool(model_cfg.get("trust_remote_code", False)),
437
  )
438
 
439
+ ds = load_dataset("parquet", data_files={"eval": str(eval_file)})["eval"]
 
 
 
440
  if args.max_samples > 0 and args.max_samples < len(ds):
441
  ds = ds.select(range(args.max_samples))
442
 
+     totals = make_bucket()
+     family_buckets: Dict[str, Dict[str, Any]] = {}
+     difficulty_buckets: Dict[str, Dict[str, Any]] = {}
+
+     processed_rows = 0
+     skipped_no_expected = 0
+     samples: List[Dict[str, Any]] = []
+
+     model_device = next(model.parameters()).device
+     prompt_field = as_text(data_cfg.get("prompt_field")) or "prompt"

      for row in ds:
          expected_values = flatten_expected(row, data_cfg)
          if not expected_values:
+             skipped_no_expected += 1
              continue
+
          prompt_text = build_prompt_text(row, tokenizer, data_cfg)
+         inputs = tokenizer(
+             prompt_text,
+             return_tensors="pt",
+             truncation=True,
+             max_length=args.max_input_length,
+         )
          inputs = {k: v.to(model_device) for k, v in inputs.items()}

          with torch.no_grad():

              pad_token_id=tokenizer.pad_token_id,
              eos_token_id=tokenizer.eos_token_id,
          )
+
          generations = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
          candidates = [extract_candidate_text(text, prompt_text) for text in generations]
+         details = [match_candidate(candidate, expected_values) for candidate in candidates]
+
+         matches = [bool(item["match"]) for item in details]
+         exacts = [bool(item["exact"]) for item in details]
+         boxed = [bool(item["boxed"]) for item in details]
+
+         hit1 = bool(matches and matches[0])
+         hitk = bool(any(matches))
+         exact1 = bool(exacts and exacts[0])
+         exactk = bool(any(exacts))
+         boxedk = bool(any(boxed))
+
+         update_bucket(totals, hit1=hit1, hitk=hitk, exact1=exact1, exactk=exactk, boxedk=boxedk)
+
+         family = as_text(row.get("family")) or "__unknown__"
+         if family not in family_buckets:
+             family_buckets[family] = make_bucket()
+         update_bucket(family_buckets[family], hit1=hit1, hitk=hitk, exact1=exact1, exactk=exactk, boxedk=boxedk)
+
+         difficulty = as_text(row.get("difficulty")) or "__unknown__"
+         if difficulty not in difficulty_buckets:
+             difficulty_buckets[difficulty] = make_bucket()
+         update_bucket(
+             difficulty_buckets[difficulty],
+             hit1=hit1,
+             hitk=hitk,
+             exact1=exact1,
+             exactk=exactk,
+             boxedk=boxedk,
          )

+         processed_rows += 1
+         if args.progress_every > 0 and processed_rows % args.progress_every == 0:
+             print(f"Progress: evaluated_rows={processed_rows} latest_family={family}")
+
+         if len(samples) < args.sample_records:
+             samples.append(
+                 {
+                     "uid": as_text(row.get("uid")),
+                     "family": family,
+                     "difficulty": difficulty,
+                     "prompt": as_text(row.get(prompt_field)),
+                     "expected_values": expected_values[:5],
+                     "candidates": candidates,
+                     "match_details": details,
+                     "matches": matches,
+                 }
+             )
+     total_eval = int(totals.get("evaluated_rows", 0))
+     denominator = max(total_eval, 1)
+
+     pass_at_1 = float(totals.get("pass_at_1_hits", 0)) / denominator
+     pass_at_k = float(totals.get("pass_at_k_hits", 0)) / denominator
+     exact_at_1 = float(totals.get("exact_at_1_hits", 0)) / denominator
+     exact_at_k = float(totals.get("exact_at_k_hits", 0)) / denominator
+     boxed_at_k = float(totals.get("boxed_at_k_hits", 0)) / denominator
+
+     composite_score = 0.30 * pass_at_1 + 0.50 * pass_at_k + 0.20 * exact_at_k
+
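`composite_score` is a fixed weighted sum: 30% pass@1, 50% pass@k, 20% exact@k, so sampled coverage dominates the single headline number. A quick arithmetic check of that weighting:

```python
def composite(pass_at_1: float, pass_at_k: float, exact_at_k: float) -> float:
    # Same weights as the commit: pass@k (sampled coverage) carries half the score.
    return 0.30 * pass_at_1 + 0.50 * pass_at_k + 0.20 * exact_at_k


# 0.30*0.40 + 0.50*0.60 + 0.20*0.30 = 0.12 + 0.30 + 0.06
print(round(composite(0.40, 0.60, 0.30), 4))  # 0.48
```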
+     report: Dict[str, Any] = {
          "base_model": base_model,
          "adapter_path": str(args.adapter_path) if args.adapter_path is not None else None,
+         "eval_file": str(eval_file),
+         "config": str(args.config),
+         "evaluated_rows": total_eval,
+         "skipped_rows_without_targets": skipped_no_expected,
+         "requested_rows": len(ds),
          "k": args.k,
          "pass_at_1": pass_at_1,
          "pass_at_k": pass_at_k,
+         "exact_at_1": exact_at_1,
+         "exact_at_k": exact_at_k,
+         "boxed_at_k": boxed_at_k,
+         "composite_score": composite_score,
          "temperature": args.temperature,
          "top_p": args.top_p,
          "max_new_tokens": args.max_new_tokens,
+         "max_input_length": args.max_input_length,
+         "seed": args.seed,
+         "family_metrics": {
+             key: finalize_bucket(family_buckets[key])
+             for key in sorted(family_buckets.keys())
+         },
+         "difficulty_metrics": {
+             key: finalize_bucket(difficulty_buckets[key])
+             for key in sorted(difficulty_buckets.keys())
+         },
+         "samples": samples,
      }
+
      args.output_json.parent.mkdir(parents=True, exist_ok=True)
      args.output_json.write_text(json.dumps(report, ensure_ascii=True, indent=2), encoding="utf-8")
+
+     summary_view = {
+         "evaluated_rows": total_eval,
+         "pass_at_1": pass_at_1,
+         "pass_at_k": pass_at_k,
+         "exact_at_k": exact_at_k,
+         "composite_score": composite_score,
+         "k": args.k,
+     }
+     print(json.dumps(summary_view, indent=2))
      print(f"Saved report to {args.output_json}")
+     return report
+
+
+ def main() -> None:
+     args = parse_args()
+     run_evaluation(args)


  if __name__ == "__main__":
scripts/train_sota.py CHANGED
@@ -4,10 +4,13 @@
  from __future__ import annotations

  import argparse
  import json
  import os
  from pathlib import Path
- from typing import Any, Dict, Optional, Tuple

  import torch
  import yaml
@@ -25,7 +28,9 @@ from transformers import (
      set_seed,
  )

- DEFAULT_CONFIG_PATH = Path("configs/deepseek_math_sota.yaml")


  def parse_args() -> argparse.Namespace:
@@ -41,6 +46,21 @@ def parse_args() -> argparse.Namespace:
      parser.add_argument("--repo-id", type=str, default=None, help="Override hub.repo_id.")
      parser.add_argument("--push-to-hub", action="store_true", help="Force push enabled.")
      parser.add_argument("--no-push-to-hub", action="store_true", help="Force push disabled.")
      parser.add_argument(
          "--start-stage",
          type=int,
@@ -93,6 +113,19 @@ def as_int(value: Any, default: int) -> int:
      return default


  def load_config(path: Path) -> Dict[str, Any]:
      if not path.exists():
          raise FileNotFoundError(f"Config not found: {path}")
@@ -108,6 +141,8 @@ def load_config(path: Path) -> Dict[str, Any]:
      cfg.setdefault("training_defaults", {})
      cfg.setdefault("hub", {})
      cfg.setdefault("credentials", {})
      return cfg


@@ -123,6 +158,16 @@ def apply_overrides(cfg: Dict[str, Any], args: argparse.Namespace) -> None:
      if args.no_push_to_hub:
          cfg.setdefault("hub", {})["push_to_hub"] = False


  def resolve_auth(cfg: Dict[str, Any]) -> Tuple[Optional[str], Optional[str]]:
      token = as_text(os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_HUB_TOKEN")) or None
@@ -133,9 +178,17 @@ def resolve_auth(cfg: Dict[str, Any]) -> Tuple[Optional[str], Optional[str]]:
      if path.exists():
          data = json.loads(path.read_text(encoding="utf-8"))
          if token is None:
-             token = as_text(data.get("key")) or None
          if username is None:
-             username = as_text(data.get("username")) or None
      return token, username


@@ -556,6 +609,228 @@ def push_folder(
      api.upload_folder(**kwargs)


  def main() -> None:
      args = parse_args()
      cfg = load_config(args.config)
@@ -564,17 +839,17 @@ def main() -> None:
      seed = as_int(cfg.get("global", {}).get("seed"), 17)
      set_seed(seed)

-     output_root = Path(as_text(cfg.get("global", {}).get("output_root")) or "model_development/runs/math-conjecture-sota")
      output_root.mkdir(parents=True, exist_ok=True)

      token, username = resolve_auth(cfg)
      repo_id = resolve_repo_id(cfg, username=username, output_root=output_root)
-     push_to_hub = bool(cfg.get("hub", {}).get("push_to_hub", False))
-     if args.dry_run and push_to_hub:
          print("Dry-run enabled. Disabling push_to_hub for this run.")
-     if args.dry_run:
          push_to_hub = False
-     if push_to_hub:
          if token is None:
              raise ValueError("Hub push requested but token is missing.")
          if repo_id is None:
@@ -585,8 +860,9 @@ def main() -> None:
          model = None
      else:
          model, tokenizer = build_model_and_tokenizer(cfg["model"], cfg.get("training_defaults", {}))
      data_cfg = cfg["data"]
-     stage_reports = []

      start_stage = max(1, args.start_stage)
      stages = cfg["stages"]
@@ -607,15 +883,18 @@ def main() -> None:
          raw = load_dataset("parquet", data_files=split_files)
          train_rows_before = len(raw["train"])
          valid_rows_before = len(raw["validation"])
          filters = stage.get("filters", {})
          raw["train"] = apply_filters(raw["train"], filters)
          raw["validation"] = apply_filters(raw["validation"], filters)
          train_rows_after_filter = len(raw["train"])
          valid_rows_after_filter = len(raw["validation"])
          raw["train"] = maybe_select(raw["train"], stage.get("max_train_samples"))
          raw["validation"] = maybe_select(raw["validation"], stage.get("max_eval_samples"))
          train_rows_selected = len(raw["train"])
          valid_rows_selected = len(raw["validation"])
          print(
              f"[stage {index}] rows train: {train_rows_before} -> {train_rows_after_filter} -> {train_rows_selected}; "
              f"validation: {valid_rows_before} -> {valid_rows_after_filter} -> {valid_rows_selected}"
@@ -627,19 +906,20 @@ def main() -> None:
          sample_row = raw["train"][0]
          _ = build_prompt_text(sample_row, tokenizer, data_cfg)
          _ = build_answer_block(sample_row, data_cfg)
-         report = {
-             "stage_index": index,
-             "stage_name": stage_name,
-             "stage_slug": stage_slug,
-             "mode": "dry_run",
-             "train_rows_before_filter": train_rows_before,
-             "validation_rows_before_filter": valid_rows_before,
-             "train_rows_after_filter": train_rows_after_filter,
-             "validation_rows_after_filter": valid_rows_after_filter,
-             "train_rows_selected": train_rows_selected,
-             "validation_rows_selected": valid_rows_selected,
-         }
-         stage_reports.append(report)
          print(f"[stage {index}] Dry-run checks passed.")
          continue
@@ -670,33 +950,36 @@ def main() -> None:
          trainer.log_metrics("train", train_result.metrics)
          trainer.save_metrics("train", train_result.metrics)
          trainer.save_state()
          eval_metrics = None
          if eval_dataset is not None:
              eval_metrics = trainer.evaluate()
              trainer.log_metrics("eval", eval_metrics)
              trainer.save_metrics("eval", eval_metrics)
          trainer.save_model(str(stage_output_dir))
          tokenizer.save_pretrained(str(stage_output_dir))

-         report = {
-             "stage_index": index,
-             "stage_name": stage_name,
-             "output_dir": str(stage_output_dir),
-             "train_rows_before_filter": train_rows_before,
-             "validation_rows_before_filter": valid_rows_before,
-             "train_rows_after_filter": train_rows_after_filter,
-             "validation_rows_after_filter": valid_rows_after_filter,
-             "train_rows_selected": train_rows_selected,
-             "validation_rows_selected": valid_rows_selected,
-             "train_rows": len(train_dataset),
-             "eval_rows": len(eval_dataset) if eval_dataset is not None else 0,
-             "train_metrics": train_result.metrics,
-             "eval_metrics": eval_metrics,
-         }
-         stage_reports.append(report)
          print(
-             f"[stage {index}] Completed: train_rows={report['train_rows']} "
-             f"eval_rows={report['eval_rows']} output={stage_output_dir}"
          )

      if args.dry_run:
@@ -720,17 +1003,59 @@ def main() -> None:
      model.save_pretrained(str(final_dir))
      tokenizer.save_pretrained(str(final_dir))

-     summary = {
          "config_path": str(args.config),
          "repo_id": repo_id,
          "seed": seed,
          "stages_ran": stage_reports,
          "final_adapter_dir": str(final_dir),
      }
      summary_path = output_root / "training_summary.json"
      summary_path.write_text(json.dumps(summary, ensure_ascii=True, indent=2), encoding="utf-8")

-     if push_to_hub and repo_id is not None and token is not None:
          api = HfApi(token=token)
          api.create_repo(
              repo_id=repo_id,
@@ -740,17 +1065,22 @@ def main() -> None:
          )
          commit_message = as_text(cfg.get("hub", {}).get("commit_message")) or "Upload SOTA curriculum adapter."
          push_folder(api, repo_id, final_dir, commit_message=commit_message)
          if bool(cfg.get("hub", {}).get("upload_stage_checkpoints", False)):
              for report in stage_reports:
-                 stage_dir = Path(report["output_dir"])
-                 path_in_repo = f"checkpoints/{Path(report['output_dir']).name}"
                  push_folder(
                      api,
                      repo_id,
                      stage_dir,
-                     commit_message=f"Upload stage checkpoint {report['stage_name']}",
                      path_in_repo=path_in_repo,
                  )
          api.upload_file(
              path_or_fileobj=str(summary_path),
              path_in_repo="training_summary.json",
@@ -758,6 +1088,16 @@ def main() -> None:
              repo_type="model",
              commit_message="Upload training summary for SOTA curriculum run.",
          )
          print(f"Pushed training artifacts to https://huggingface.co/{repo_id}")

      print(f"Training complete. Final adapter: {final_dir}")
 
  from __future__ import annotations

  import argparse
+ import gc
  import json
  import os
+ import subprocess
+ import sys
  from pathlib import Path
+ from typing import Any, Dict, List, Optional, Tuple

  import torch
  import yaml

      set_seed,
  )

+ SCRIPT_ROOT = Path(__file__).resolve().parents[1]
+ DEFAULT_CONFIG_PATH = SCRIPT_ROOT / "configs" / "deepseek_math_sota.yaml"
+ DEFAULT_EVAL_SCRIPT = Path(__file__).resolve().with_name("eval_sota.py")


  def parse_args() -> argparse.Namespace:

      parser.add_argument("--repo-id", type=str, default=None, help="Override hub.repo_id.")
      parser.add_argument("--push-to-hub", action="store_true", help="Force push enabled.")
      parser.add_argument("--no-push-to-hub", action="store_true", help="Force push disabled.")
+     parser.add_argument(
+         "--run-post-eval",
+         action="store_true",
+         help="Force post-training evaluation enabled.",
+     )
+     parser.add_argument(
+         "--no-post-eval",
+         action="store_true",
+         help="Force post-training evaluation disabled.",
+     )
+     parser.add_argument(
+         "--skip-quality-gate",
+         action="store_true",
+         help="Disable quality gate checks for this run.",
+     )
      parser.add_argument(
          "--start-stage",
          type=int,
      return default


+ def as_bool(value: Any, default: bool = False) -> bool:
+     if value is None:
+         return default
+     if isinstance(value, bool):
+         return value
+     text = as_text(value).lower()
+     if text in {"1", "true", "yes", "y", "on"}:
+         return True
+     if text in {"0", "false", "no", "n", "off"}:
+         return False
+     return default
+
+
  def load_config(path: Path) -> Dict[str, Any]:
      if not path.exists():
          raise FileNotFoundError(f"Config not found: {path}")

      cfg.setdefault("training_defaults", {})
      cfg.setdefault("hub", {})
      cfg.setdefault("credentials", {})
+     cfg.setdefault("post_eval", {})
+     cfg.setdefault("quality_gate", {})
      return cfg


      if args.no_push_to_hub:
          cfg.setdefault("hub", {})["push_to_hub"] = False

+     if args.run_post_eval and args.no_post_eval:
+         raise ValueError("Cannot set both --run-post-eval and --no-post-eval.")
+     if args.run_post_eval:
+         cfg.setdefault("post_eval", {})["enabled"] = True
+     if args.no_post_eval:
+         cfg.setdefault("post_eval", {})["enabled"] = False
+
+     if args.skip_quality_gate:
+         cfg.setdefault("quality_gate", {})["enabled"] = False


  def resolve_auth(cfg: Dict[str, Any]) -> Tuple[Optional[str], Optional[str]]:
      token = as_text(os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_HUB_TOKEN")) or None

      if path.exists():
          data = json.loads(path.read_text(encoding="utf-8"))
          if token is None:
+             for key in ("token", "key", "api_key", "hf_token"):
+                 candidate = as_text(data.get(key))
+                 if candidate:
+                     token = candidate
+                     break
          if username is None:
+             for key in ("username", "user", "owner"):
+                 candidate = as_text(data.get(key))
+                 if candidate:
+                     username = candidate
+                     break
      return token, username
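The widened credential-file scan above tolerates several key spellings instead of only `key`/`username`. The same first-non-empty lookup can be sketched standalone (`first_present` is a hypothetical helper and the JSON payload is made up):

```python
import json
from typing import Optional


def first_present(data: dict, keys: tuple) -> Optional[str]:
    # Return the first non-empty string value among the candidate key names.
    for key in keys:
        value = str(data.get(key) or "").strip()
        if value:
            return value
    return None


creds = json.loads('{"api_key": "hf_example", "user": "northern-tribe"}')
token = first_present(creds, ("token", "key", "api_key", "hf_token"))
username = first_present(creds, ("username", "user", "owner"))
print(token, username)  # hf_example northern-tribe
```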


      api.upload_folder(**kwargs)


+ def extract_final_eval_loss(stage_reports: List[Dict[str, Any]]) -> Optional[float]:
+     for report in reversed(stage_reports):
+         eval_metrics = report.get("eval_metrics")
+         if not isinstance(eval_metrics, dict):
+             continue
+         value = eval_metrics.get("eval_loss")
+         if value is None:
+             continue
+         try:
+             return float(value)
+         except (TypeError, ValueError):
+             continue
+     return None
+
+
+ def release_model_memory(model: Any) -> None:
+     try:
+         model.to("cpu")
+     except Exception:
+         pass
+     if torch.cuda.is_available():
+         torch.cuda.empty_cache()
+     gc.collect()
+
+
+ def run_post_eval(
+     cfg: Dict[str, Any],
+     config_path: Path,
+     output_root: Path,
+     final_adapter_dir: Path,
+ ) -> Optional[Dict[str, Any]]:
+     post_cfg = cfg.get("post_eval", {})
+     if not as_bool(post_cfg.get("enabled"), False):
+         return None
+
+     eval_script = DEFAULT_EVAL_SCRIPT
+     if not eval_script.exists():
+         raise FileNotFoundError(f"Post-eval enabled but eval script is missing: {eval_script}")
+
+     data_cfg = cfg.get("data", {})
+     eval_file = Path(
+         as_text(post_cfg.get("eval_file"))
+         or as_text(data_cfg.get("default_validation_file"))
+         or "data/releases/v1/test.parquet"
+     )
+     if not eval_file.exists():
+         raise FileNotFoundError(f"Post-eval file not found: {eval_file}")
+
+     output_json = Path(as_text(post_cfg.get("output_json")) or str(output_root / "post_eval_report.json"))
+     base_model = as_text(cfg.get("model", {}).get("base_model"))
+     if not base_model:
+         raise ValueError("model.base_model is required for post-eval.")
+
+     cmd = [
+         sys.executable,
+         str(eval_script),
+         "--config",
+         str(config_path),
+         "--base-model",
+         base_model,
+         "--adapter-path",
+         str(final_adapter_dir),
+         "--eval-file",
+         str(eval_file),
+         "--max-samples",
+         str(as_int(post_cfg.get("max_samples"), 300)),
+         "--k",
+         str(as_int(post_cfg.get("k"), 4)),
+         "--max-new-tokens",
+         str(as_int(post_cfg.get("max_new_tokens"), 256)),
+         "--temperature",
+         str(as_float(post_cfg.get("temperature"), 0.7)),
+         "--top-p",
+         str(as_float(post_cfg.get("top_p"), 0.95)),
+         "--seed",
+         str(as_int(post_cfg.get("seed"), as_int(cfg.get("global", {}).get("seed"), 17))),
+         "--output-json",
+         str(output_json),
+     ]
+     print(f"Running post-training eval: {' '.join(cmd)}")
+     completed = subprocess.run(cmd, check=False)
+     if completed.returncode != 0:
+         raise RuntimeError(f"Post-training evaluation failed with exit code {completed.returncode}.")
+
+     if not output_json.exists():
+         raise FileNotFoundError(f"Post-eval report was not created: {output_json}")
+
+     report = json.loads(output_json.read_text(encoding="utf-8"))
+     return {
+         "enabled": True,
+         "report_path": str(output_json),
+         "report": report,
+         "command": cmd,
+     }
+ }
706
+
707
+
708
+ def evaluate_quality_gate(
709
+ stage_reports: List[Dict[str, Any]],
710
+ post_eval_result: Optional[Dict[str, Any]],
711
+ gate_cfg: Dict[str, Any],
712
+ ) -> Dict[str, Any]:
713
+ enabled = as_bool(gate_cfg.get("enabled"), False)
714
+ result: Dict[str, Any] = {
715
+ "enabled": enabled,
716
+ "passed": True,
717
+ "violations": [],
718
+ "checks": [],
719
+ }
720
+ if not enabled:
721
+ return result
722
+
723
+ violations: List[str] = []
724
+ checks: List[Dict[str, Any]] = []
725
+
726
+ final_eval_loss = extract_final_eval_loss(stage_reports)
727
+ max_final_eval_loss = gate_cfg.get("max_final_eval_loss")
728
+ if max_final_eval_loss is not None:
729
+ threshold = as_float(max_final_eval_loss, 0.0)
730
+ checks.append(
731
+ {
732
+ "name": "max_final_eval_loss",
733
+ "actual": final_eval_loss,
734
+ "threshold": threshold,
735
+ }
736
+ )
737
+ if final_eval_loss is None:
738
+ violations.append("Final stage eval_loss is missing for max_final_eval_loss gate.")
739
+ elif final_eval_loss > threshold:
740
+ violations.append(
741
+ f"Final eval_loss {final_eval_loss:.4f} exceeds threshold {threshold:.4f}."
742
+ )
743
+
744
+ report: Optional[Dict[str, Any]] = None
745
+ if isinstance(post_eval_result, dict):
746
+ loaded = post_eval_result.get("report")
747
+ if isinstance(loaded, dict):
748
+ report = loaded
749
+
750
+ require_post_eval = as_bool(gate_cfg.get("require_post_eval"), False)
751
+ if report is None:
752
+ if require_post_eval:
753
+ violations.append("Quality gate requires post-eval metrics, but post-eval report is missing.")
754
+ else:
755
+ evaluated_rows = as_int(report.get("evaluated_rows"), 0)
756
+ min_rows = as_int(gate_cfg.get("min_evaluated_rows"), 0)
757
+ checks.append(
758
+ {
759
+ "name": "min_evaluated_rows",
760
+ "actual": evaluated_rows,
761
+ "threshold": min_rows,
762
+ }
763
+ )
764
+ if evaluated_rows < min_rows:
765
+ violations.append(
766
+ f"Post-eval rows {evaluated_rows} is below minimum required {min_rows}."
767
+ )
768
+
769
+ min_pass_at_1_raw = gate_cfg.get("min_pass_at_1")
770
+ if min_pass_at_1_raw is not None:
771
+ min_pass_at_1 = as_float(min_pass_at_1_raw, 0.0)
772
+ pass_at_1 = as_float(report.get("pass_at_1"), 0.0)
773
+ checks.append(
774
+ {
775
+ "name": "min_pass_at_1",
776
+ "actual": pass_at_1,
777
+ "threshold": min_pass_at_1,
778
+ }
779
+ )
780
+ if pass_at_1 < min_pass_at_1:
781
+ violations.append(
782
+ f"pass@1 {pass_at_1:.4f} is below threshold {min_pass_at_1:.4f}."
783
+ )
784
+
785
+ min_pass_at_k_raw = gate_cfg.get("min_pass_at_k")
786
+ if min_pass_at_k_raw is not None:
787
+ min_pass_at_k = as_float(min_pass_at_k_raw, 0.0)
788
+ pass_at_k = as_float(report.get("pass_at_k"), 0.0)
789
+ checks.append(
790
+ {
791
+ "name": "min_pass_at_k",
792
+ "actual": pass_at_k,
793
+ "threshold": min_pass_at_k,
794
+ }
795
+ )
796
+ if pass_at_k < min_pass_at_k:
797
+ violations.append(
798
+ f"pass@k {pass_at_k:.4f} is below threshold {min_pass_at_k:.4f}."
799
+ )
800
+
801
+ family_requirements = gate_cfg.get("required_family_pass_at_k", {})
802
+ family_metrics = report.get("family_metrics", {})
803
+ if isinstance(family_requirements, dict):
804
+ for family, threshold_raw in family_requirements.items():
805
+ threshold = as_float(threshold_raw, 0.0)
806
+ actual = None
807
+ if isinstance(family_metrics, dict):
808
+ family_row = family_metrics.get(family)
809
+ if isinstance(family_row, dict):
810
+ try:
811
+ actual = float(family_row.get("pass_at_k"))
812
+ except (TypeError, ValueError):
813
+ actual = None
814
+ checks.append(
815
+ {
816
+ "name": f"family_pass_at_k:{family}",
817
+ "actual": actual,
818
+ "threshold": threshold,
819
+ }
820
+ )
821
+ if actual is None:
822
+ violations.append(f"Missing pass@k metric for required family '{family}'.")
823
+ elif actual < threshold:
824
+ violations.append(
825
+ f"Family '{family}' pass@k {actual:.4f} is below threshold {threshold:.4f}."
826
+ )
827
+
828
+ result["violations"] = violations
829
+ result["checks"] = checks
830
+ result["passed"] = len(violations) == 0
831
+ return result
832
+
833
+
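The gate reads its thresholds from a `quality_gate` block in the run config. The key names below match the `gate_cfg.get(...)` lookups in the function; the threshold values and the `enabled` flag are illustrative assumptions, not values from this commit:

```yaml
quality_gate:
  enabled: true                  # assumed flag; the function returns early when the gate is disabled
  max_final_eval_loss: 1.25      # upper bound on the final stage's eval_loss (illustrative value)
  require_post_eval: true        # fail the gate if the post-eval report is missing
  min_evaluated_rows: 200        # minimum rows the post-eval must have scored
  min_pass_at_1: 0.35            # self-consistency thresholds (illustrative values)
  min_pass_at_k: 0.55
  required_family_pass_at_k:     # per-family pass@k floors, keyed by family name
    number_theory: 0.50
    combinatorics: 0.45
```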
 def main() -> None:
     args = parse_args()
     cfg = load_config(args.config)
 
     seed = as_int(cfg.get("global", {}).get("seed"), 17)
     set_seed(seed)
 
+    output_root = Path(as_text(cfg.get("global", {}).get("output_root")) or "runs/math-conjecture-sota")
     output_root.mkdir(parents=True, exist_ok=True)
 
     token, username = resolve_auth(cfg)
     repo_id = resolve_repo_id(cfg, username=username, output_root=output_root)
+    push_to_hub_requested = bool(cfg.get("hub", {}).get("push_to_hub", False))
+    if args.dry_run and push_to_hub_requested:
         print("Dry-run enabled. Disabling push_to_hub for this run.")
+    push_to_hub_requested = push_to_hub_requested and not args.dry_run
+
+    if push_to_hub_requested:
         if token is None:
             raise ValueError("Hub push requested but token is missing.")
         if repo_id is None:
 
         model = None
     else:
         model, tokenizer = build_model_and_tokenizer(cfg["model"], cfg.get("training_defaults", {}))
+
     data_cfg = cfg["data"]
+    stage_reports: List[Dict[str, Any]] = []
 
     start_stage = max(1, args.start_stage)
     stages = cfg["stages"]
 
         raw = load_dataset("parquet", data_files=split_files)
         train_rows_before = len(raw["train"])
         valid_rows_before = len(raw["validation"])
+
         filters = stage.get("filters", {})
         raw["train"] = apply_filters(raw["train"], filters)
         raw["validation"] = apply_filters(raw["validation"], filters)
         train_rows_after_filter = len(raw["train"])
         valid_rows_after_filter = len(raw["validation"])
+
         raw["train"] = maybe_select(raw["train"], stage.get("max_train_samples"))
         raw["validation"] = maybe_select(raw["validation"], stage.get("max_eval_samples"))
         train_rows_selected = len(raw["train"])
         valid_rows_selected = len(raw["validation"])
+
         print(
             f"[stage {index}] rows train: {train_rows_before} -> {train_rows_after_filter} -> {train_rows_selected}; "
             f"validation: {valid_rows_before} -> {valid_rows_after_filter} -> {valid_rows_selected}"
 
             sample_row = raw["train"][0]
             _ = build_prompt_text(sample_row, tokenizer, data_cfg)
             _ = build_answer_block(sample_row, data_cfg)
+            stage_reports.append(
+                {
+                    "stage_index": index,
+                    "stage_name": stage_name,
+                    "stage_slug": stage_slug,
+                    "mode": "dry_run",
+                    "train_rows_before_filter": train_rows_before,
+                    "validation_rows_before_filter": valid_rows_before,
+                    "train_rows_after_filter": train_rows_after_filter,
+                    "validation_rows_after_filter": valid_rows_after_filter,
+                    "train_rows_selected": train_rows_selected,
+                    "validation_rows_selected": valid_rows_selected,
+                }
+            )
             print(f"[stage {index}] Dry-run checks passed.")
             continue
 
         trainer.log_metrics("train", train_result.metrics)
         trainer.save_metrics("train", train_result.metrics)
         trainer.save_state()
+
         eval_metrics = None
         if eval_dataset is not None:
             eval_metrics = trainer.evaluate()
             trainer.log_metrics("eval", eval_metrics)
             trainer.save_metrics("eval", eval_metrics)
+
         trainer.save_model(str(stage_output_dir))
         tokenizer.save_pretrained(str(stage_output_dir))
 
+        stage_reports.append(
+            {
+                "stage_index": index,
+                "stage_name": stage_name,
+                "output_dir": str(stage_output_dir),
+                "train_rows_before_filter": train_rows_before,
+                "validation_rows_before_filter": valid_rows_before,
+                "train_rows_after_filter": train_rows_after_filter,
+                "validation_rows_after_filter": valid_rows_after_filter,
+                "train_rows_selected": train_rows_selected,
+                "validation_rows_selected": valid_rows_selected,
+                "train_rows": len(train_dataset),
+                "eval_rows": len(eval_dataset) if eval_dataset is not None else 0,
+                "train_metrics": train_result.metrics,
+                "eval_metrics": eval_metrics,
+            }
+        )
         print(
+            f"[stage {index}] Completed: train_rows={len(train_dataset)} "
+            f"eval_rows={len(eval_dataset) if eval_dataset is not None else 0} output={stage_output_dir}"
         )
 
     if args.dry_run:
 
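The `max_final_eval_loss` gate depends on an `extract_final_eval_loss` helper defined elsewhere in the script. Its actual implementation is not shown in this diff; a plausible minimal sketch, assuming each stage report stores its Trainer metrics under `eval_metrics["eval_loss"]` as the `stage_reports.append(...)` calls above do:

```python
from typing import Any, Dict, List, Optional


def extract_final_eval_loss(stage_reports: List[Dict[str, Any]]) -> Optional[float]:
    """Return eval_loss from the last stage that recorded one, else None.

    Walks the stage reports newest-first so the final curriculum stage wins;
    dry-run entries (no eval_metrics dict) are skipped.
    """
    for report in reversed(stage_reports):
        metrics = report.get("eval_metrics")
        if isinstance(metrics, dict) and "eval_loss" in metrics:
            try:
                return float(metrics["eval_loss"])
            except (TypeError, ValueError):
                return None
    return None
```

Scanning in reverse keeps the gate tied to the final stage's validation loss even when earlier stages also logged eval metrics.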
     model.save_pretrained(str(final_dir))
     tokenizer.save_pretrained(str(final_dir))
 
+    release_model_memory(model)
+    del model
+
+    post_eval_result = run_post_eval(
+        cfg=cfg,
+        config_path=args.config,
+        output_root=output_root,
+        final_adapter_dir=final_dir,
+    )
+
+    quality_gate = evaluate_quality_gate(
+        stage_reports=stage_reports,
+        post_eval_result=post_eval_result,
+        gate_cfg=cfg.get("quality_gate", {}),
+    )
+
+    push_to_hub_performed = push_to_hub_requested
+    push_block_reason: Optional[str] = None
+    if push_to_hub_requested and not quality_gate.get("passed", True):
+        push_to_hub_performed = False
+        push_block_reason = "quality_gate_failed"
+        print("Quality gate failed; skipping hub push for this run.")
+
+    summary: Dict[str, Any] = {
         "config_path": str(args.config),
         "repo_id": repo_id,
         "seed": seed,
         "stages_ran": stage_reports,
         "final_adapter_dir": str(final_dir),
+        "quality_gate": quality_gate,
+        "push": {
+            "requested": bool(push_to_hub_requested),
+            "performed": bool(push_to_hub_performed),
+            "block_reason": push_block_reason,
+        },
     }
+
+    if post_eval_result is not None:
+        report = post_eval_result.get("report", {})
+        summary["post_eval"] = {
+            "report_path": post_eval_result.get("report_path"),
+            "evaluated_rows": report.get("evaluated_rows"),
+            "k": report.get("k"),
+            "pass_at_1": report.get("pass_at_1"),
+            "pass_at_k": report.get("pass_at_k"),
+            "exact_at_k": report.get("exact_at_k"),
+            "composite_score": report.get("composite_score"),
+        }
+
     summary_path = output_root / "training_summary.json"
     summary_path.write_text(json.dumps(summary, ensure_ascii=True, indent=2), encoding="utf-8")
 
+    if push_to_hub_performed and repo_id is not None and token is not None:
         api = HfApi(token=token)
         api.create_repo(
             repo_id=repo_id,
 
         )
         commit_message = as_text(cfg.get("hub", {}).get("commit_message")) or "Upload SOTA curriculum adapter."
         push_folder(api, repo_id, final_dir, commit_message=commit_message)
+
         if bool(cfg.get("hub", {}).get("upload_stage_checkpoints", False)):
             for report in stage_reports:
+                stage_dir_raw = report.get("output_dir")
+                if not stage_dir_raw:
+                    continue
+                stage_dir = Path(stage_dir_raw)
+                path_in_repo = f"checkpoints/{stage_dir.name}"
                 push_folder(
                     api,
                     repo_id,
                     stage_dir,
+                    commit_message=f"Upload stage checkpoint {report.get('stage_name', stage_dir.name)}",
                     path_in_repo=path_in_repo,
                 )
+
         api.upload_file(
             path_or_fileobj=str(summary_path),
             path_in_repo="training_summary.json",
 
             repo_type="model",
             commit_message="Upload training summary for SOTA curriculum run.",
         )
+
+        if post_eval_result is not None and post_eval_result.get("report_path"):
+            api.upload_file(
+                path_or_fileobj=str(post_eval_result["report_path"]),
+                path_in_repo="post_eval_report.json",
+                repo_id=repo_id,
+                repo_type="model",
+                commit_message="Upload post-training evaluation report.",
+            )
+
         print(f"Pushed training artifacts to https://huggingface.co/{repo_id}")
 
     print(f"Training complete. Final adapter: {final_dir}")
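Downstream tooling can key off the new `quality_gate` and `push` fields in `training_summary.json` to decide whether a run produced a publishable adapter. A hypothetical consumer sketch (the field names match the summary dict written by this commit; the helper name and call site are assumptions):

```python
import json
from pathlib import Path


def run_is_publishable(summary_path: Path) -> bool:
    """True when the quality gate passed and the hub push actually happened."""
    summary = json.loads(summary_path.read_text(encoding="utf-8"))
    gate_passed = bool(summary.get("quality_gate", {}).get("passed", False))
    push_performed = bool(summary.get("push", {}).get("performed", False))
    return gate_passed and push_performed
```

Because `push.performed` is forced to `False` whenever the gate fails, checking both fields distinguishes "gate passed but push was never requested" from "gate blocked the push".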