Commit d53a65c by jayantaggarwal-sketch · 1 Parent(s): 98b25a9

Sync latest project updates to Hugging Face Space.

Include current code, evaluation scripts, notebook, and docs while excluding PNG binaries required by Space push policy.

Made-with: Cursor

.env.example ADDED
@@ -0,0 +1,29 @@
+ # Copy to .env and fill in. Never commit real secrets.
+
+ # --- inference.py (OpenAI-compatible HTTP API) ---
+ API_BASE_URL=https://api.openai.com/v1
+ MODEL_NAME=gpt-4o-mini
+ # Used as API key by inference.py (or set OPENAI_API_KEY instead)
+ HF_TOKEN=hf_xxx
+
+ # --- CommitmentOS HTTP environment (inference + LLM eval) ---
+ ENV_BASE_URL=https://jayant2304-commitment-os.hf.space
+
+ # --- evaluation/evaluate_llm_checkpoints.py (local Transformers + PEFT) ---
+ # Base model on Hugging Face (must match what you trained on)
+ BASELINE_MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct
+ # REQUIRED: absolute or relative path to a folder containing adapter_config.json
+ # (e.g. ./training_output after train_grpo.py, or a downloaded adapter dir)
+ TRAINED_MODEL_PATH=./training_output
+
+ # Optional eval protocol (defaults shown)
+ EVAL_SEED=42
+ EVAL_MAX_STEPS=12
+ EVAL_TEMPERATURE=0.0
+ EVAL_TOP_P=1.0
+ EVAL_MAX_NEW_TOKENS=256
+ EVAL_SUCCESS_THRESHOLD=0.6
+
+ # --- training/train_grpo.py --push_to_hub only ---
+ # Hub repo id when using: python training/train_grpo.py ... --push_to_hub --hub_model_id your/repo
+ # TRAINED_MODEL_NAME is not read by evaluate_llm_checkpoints.py; use TRAINED_MODEL_PATH.
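Note: a minimal sketch of how the optional eval settings in this template could be read with their documented defaults. The helper name `read_eval_config` and the `EVAL_DEFAULTS` table are hypothetical (they are not part of the commit); only the variable names and default values come from the template above.

```python
import os

# Defaults copied from the .env.example template; the table itself is illustrative.
EVAL_DEFAULTS = {
    "EVAL_SEED": ("42", int),
    "EVAL_MAX_STEPS": ("12", int),
    "EVAL_TEMPERATURE": ("0.0", float),
    "EVAL_TOP_P": ("1.0", float),
    "EVAL_MAX_NEW_TOKENS": ("256", int),
    "EVAL_SUCCESS_THRESHOLD": ("0.6", float),
}


def read_eval_config(environ=None):
    """Hypothetical helper: return typed eval settings, falling back to defaults."""
    env = os.environ if environ is None else environ
    return {
        name: cast(env.get(name, default))
        for name, (default, cast) in EVAL_DEFAULTS.items()
    }


# With no overrides, every setting takes its documented default.
defaults = read_eval_config({})
# A single override leaves the other settings untouched.
overridden = read_eval_config({"EVAL_MAX_STEPS": "20"})
print(defaults["EVAL_SEED"], overridden["EVAL_MAX_STEPS"])
```

Passing a plain dict instead of `os.environ` keeps the sketch easy to test without touching the real process environment.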
.gitignore CHANGED
@@ -11,3 +11,6 @@ build/
  .ruff_cache/
  *.log
  .DS_Store
+
+ # Local GRPO / LoRA output (large; do not commit)
+ training_output/
HF_README.md CHANGED
@@ -91,3 +91,22 @@ Headline metrics (`summary.json`):
  - Mean reward: **0.5427 -> 0.9777** (**+0.4350**)
  - Success rate: **0.3333 -> 1.0000** (**+0.6667**)
  - Median per-task reward delta: **+0.4200**
+
+ For true model-learning proof (pre-RL checkpoint vs post-RL checkpoint),
+ run:
+
+ ```bash
+ # From cloned repo (core deps + torch/transformers/peft/… via optional extra):
+ pip install -e ".[llm-eval]"
+ export BASELINE_MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct
+ export TRAINED_MODEL_PATH=/content/commitment_os/training_output
+ export ENV_BASE_URL=https://jayant2304-commitment-os.hf.space
+ python3 evaluation/evaluate_llm_checkpoints.py
+ python3 evaluation/plot_llm_checkpoints.py
+ ```
+
+ Artifacts are written to `artifacts/evals_llm/`.
+
+ **Published LLM run (bundle on Drive):** success **46.7% → 60.0%** at reward threshold **0.6**; mean reward ~flat; gains concentrated on **hard** tasks. Traces: `artifacts/evals_llm/*.json` in the folder below.
+
+ **Pretrained adapter + LLM eval artifacts (Google Drive):** [commitment_os_bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing) — download `training_output/` and set `TRAINED_MODEL_PATH` accordingly; full `gdown` notes are in the GitHub `README.md`.
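Note: a small sketch of how a success rate at a reward threshold can be computed, using the same `final_reward >= threshold` rule that the evaluator in this commit applies per task. The reward lists are made up for illustration and are not the published results; they only show that 7 of 15 and 9 of 15 successes would give the quoted percentages, assuming the run covered 15 tasks.

```python
def success_rate(rewards, threshold=0.6):
    """Fraction of episodes whose final reward meets the threshold."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    return sum(r >= threshold for r in rewards) / len(rewards)


# Hypothetical per-task rewards for 15 tasks (illustrative values only).
baseline_rewards = [0.9] * 7 + [0.3] * 8
trained_rewards = [0.9] * 9 + [0.3] * 6

print(f"{success_rate(baseline_rewards):.1%} -> {success_rate(trained_rewards):.1%}")
# prints "46.7% -> 60.0%"
```

A success rate can rise while the mean reward stays roughly flat, since crossing the threshold on a few tasks may be offset by small reward drops elsewhere.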
README.md CHANGED
@@ -91,3 +91,22 @@ Headline metrics (`summary.json`):
  - Mean reward: **0.5427 -> 0.9777** (**+0.4350**)
  - Success rate: **0.3333 -> 1.0000** (**+0.6667**)
  - Median per-task reward delta: **+0.4200**
+
+ For true model-learning proof (pre-RL checkpoint vs post-RL checkpoint),
+ run:
+
+ ```bash
+ # From cloned repo (core deps + torch/transformers/peft/… via optional extra):
+ pip install -e ".[llm-eval]"
+ export BASELINE_MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct
+ export TRAINED_MODEL_PATH=/content/commitment_os/training_output
+ export ENV_BASE_URL=https://jayant2304-commitment-os.hf.space
+ python3 evaluation/evaluate_llm_checkpoints.py
+ python3 evaluation/plot_llm_checkpoints.py
+ ```
+
+ Artifacts are written to `artifacts/evals_llm/`.
+
+ **Published LLM run (bundle on Drive):** success **46.7% → 60.0%** at reward threshold **0.6**; mean reward ~flat; gains concentrated on **hard** tasks. Traces: `artifacts/evals_llm/*.json` in the folder below.
+
+ **Pretrained adapter + LLM eval artifacts (Google Drive):** [commitment_os_bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing) — download `training_output/` and set `TRAINED_MODEL_PATH` accordingly; full `gdown` notes are in the GitHub `README.md`.
artifacts/evals/README.md CHANGED
@@ -2,6 +2,8 @@
 
  This folder contains deterministic baseline-vs-trained-style evaluation outputs for all 15 CommitmentOS tasks.
 
+ This is **not** the same as the real LLM checkpoint comparison; see root **README** section **B) True LLM Learning Eval** and `artifacts/evals_llm/`.
+
  ## Files
 
  - `eval_protocol.json`: fixed protocol (task set, seed, max steps, decode config)
artifacts/evals_llm/README.md ADDED
@@ -0,0 +1,63 @@
+ # True LLM Learning Evaluation (Pre-RL vs Post-RL)
+
+ This folder is for checkpoint-vs-checkpoint evidence:
+
+ - pre-RL base model
+ - post-RL trained checkpoint
+
+ Both are evaluated with an identical protocol.
+
+ ## Required environment variables
+
+ - `BASELINE_MODEL_NAME`
+ - `TRAINED_MODEL_PATH` (local directory with `adapter_config.json`)
+ - `ENV_BASE_URL` (CommitmentOS HTTP API)
+
+ Optional:
+
+ - `HF_TOKEN` (gated Hub models / rate limits)
+
+ Optional protocol overrides:
+
+ - `EVAL_SEED` (default: `42`)
+ - `EVAL_MAX_STEPS` (default: `12`)
+ - `EVAL_TEMPERATURE` (default: `0.0`)
+ - `EVAL_TOP_P` (default: `1.0`)
+ - `EVAL_MAX_NEW_TOKENS` (default: `256`)
+ - `EVAL_SUCCESS_THRESHOLD` (default: `0.6`)
+
+ ## Run
+
+ ```bash
+ cd commitment_os
+ pip install -e ".[llm-eval]"
+ python3 evaluation/evaluate_llm_checkpoints.py
+ python3 evaluation/plot_llm_checkpoints.py
+ ```
+
+ The evaluator prints one line per task (`[eval …] task i/n`) so long Colab runs do not look frozen.
+
+ ## After Colab
+
+ Zip weights + artifacts for download (paths assume `/content/commitment_os`):
+
+ ```bash
+ cd /content/commitment_os && zip -r /content/commitment_os_bundle.zip training_output artifacts/evals_llm
+ ```
+
+ Or copy `training_output/` and `artifacts/evals_llm/` to Google Drive if the zip is too large for the browser.
+
+ These bundles are **not** checked into git (clone speed + history). A **~330MB** zip (weights + this folder) is a normal size: publish it as a **GitHub Release** asset, **HF Hub**, or **Google Drive**.
+
+ **Drive (weights + this folder):** [commitment_os_bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing) — after download you should have `artifacts/evals_llm/` (this layout) next to `training_output/`. See root **README** for `gdown` / `TRAINED_MODEL_PATH` notes.
+
+ ## Expected outputs
+
+ - `llm_eval_protocol.json`
+ - `baseline_llm_eval.json`
+ - `trained_llm_eval.json`
+ - `llm_comparison.csv`
+ - `llm_summary.json`
+ - `llm_case_study_hard_015.md`
+ - `llm_reward_by_task.svg`
+ - `llm_violations_before_after.svg`
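Note: a hedged sketch of how `llm_comparison.csv` might be summarized after a run. The column names match the fields the evaluator in this commit writes; the function name `summarize_comparison` and the two sample rows are hypothetical and exist only to make the example runnable.

```python
import csv
import io
from statistics import mean, median

# Illustrative rows only; real values come from artifacts/evals_llm/llm_comparison.csv.
SAMPLE_CSV = """task_id,baseline_reward,trained_reward,baseline_success,trained_success
easy_001,0.50,0.75,0,1
hard_015,0.25,0.75,0,1
"""


def summarize_comparison(csv_text):
    """Hypothetical helper: aggregate per-task rows into a few headline numbers."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    deltas = [float(r["trained_reward"]) - float(r["baseline_reward"]) for r in rows]
    return {
        "tasks": len(rows),
        "mean_reward_delta": round(mean(deltas), 4),
        "median_reward_delta": round(median(deltas), 4),
        "baseline_success_rate": mean([int(r["baseline_success"]) for r in rows]),
        "trained_success_rate": mean([int(r["trained_success"]) for r in rows]),
    }


summary = summarize_comparison(SAMPLE_CSV)
print(summary)
```

To use it on a real run, read the file contents first, for example with `Path("artifacts/evals_llm/llm_comparison.csv").read_text()`.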
evaluation/CommitmentOS_Checkpoint_Eval_Colab.ipynb ADDED
@@ -0,0 +1,247 @@
+ {
+  "cells": [
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "# CommitmentOS Checkpoint Evaluation (Colab)\n",
+     "\n",
+     "This notebook compares a base model against a locally saved LoRA-trained checkpoint on the CommitmentOS environment.\n",
+     "\n",
+     "It uses:\n",
+     "- `BASELINE_MODEL_NAME` from Hugging Face\n",
+     "- `TRAINED_MODEL_PATH` from disk in Colab\n",
+     "- the existing `evaluation/evaluate_llm_checkpoints.py` script\n",
+     "\n",
+     "By default the notebook evaluates against the hosted CommitmentOS environment on Hugging Face Space. An optional local-server cell is included below."
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "d43c692d",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "!pip -q install --upgrade pip\n",
+     "!pip -q install transformers peft accelerate torch sentencepiece fastapi uvicorn requests python-dotenv pydantic \"openenv-core>=0.2.0\""
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "!git clone https://github.com/Jayant2304/commitment_os.git\n",
+     "%cd commitment_os"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "## Configure Paths\n",
+     "\n",
+     "Set the base model ID and the local adapter/checkpoint path. Change `TRAINED_MODEL_PATH` to the folder you actually want to evaluate.\n",
+     "\n",
+     "If the base model is gated, set `HF_TOKEN` as well."
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "import os\n",
+     "\n",
+     "# Colab: load Hugging Face token from Secrets (key must be exactly HF_TOKEN)\n",
+     "try:\n",
+     "    from google.colab import userdata\n",
+     "\n",
+     "    os.environ[\"HF_TOKEN\"] = userdata.get(\"HF_TOKEN\")\n",
+     "    print(\"HF_TOKEN loaded from Colab secrets\")\n",
+     "except ImportError:\n",
+     "    print(\"Not on Colab; set HF_TOKEN in the shell or .env if downloads fail.\")\n",
+     "except Exception as exc:\n",
+     "    print(\"Could not load HF_TOKEN from secrets:\", exc)\n",
+     "\n",
+     "os.environ[\"BASELINE_MODEL_NAME\"] = \"Qwen/Qwen2.5-1.5B-Instruct\"\n",
+     "os.environ[\"TRAINED_MODEL_PATH\"] = \"/content/commitment_os/training_output\"\n",
+     "os.environ[\"ENV_BASE_URL\"] = \"https://jayant2304-commitment-os.hf.space\"\n",
+     "\n",
+     "# Optional for gated base models:\n",
+     "# os.environ[\"HF_TOKEN\"] = \"hf_xxx\"\n",
+     "\n",
+     "# Optional eval overrides:\n",
+     "os.environ[\"EVAL_SEED\"] = \"42\"\n",
+     "os.environ[\"EVAL_MAX_STEPS\"] = \"12\"\n",
+     "os.environ[\"EVAL_TEMPERATURE\"] = \"0.0\"\n",
+     "os.environ[\"EVAL_TOP_P\"] = \"1.0\"\n",
+     "os.environ[\"EVAL_MAX_NEW_TOKENS\"] = \"256\"\n",
+     "os.environ[\"EVAL_SUCCESS_THRESHOLD\"] = \"0.6\"\n",
+     "\n",
+     "for key in [\n",
+     "    \"BASELINE_MODEL_NAME\",\n",
+     "    \"TRAINED_MODEL_PATH\",\n",
+     "    \"ENV_BASE_URL\",\n",
+     "    \"EVAL_SEED\",\n",
+     "    \"EVAL_MAX_STEPS\",\n",
+     "    \"EVAL_TEMPERATURE\",\n",
+     "    \"EVAL_TOP_P\",\n",
+     "    \"EVAL_MAX_NEW_TOKENS\",\n",
+     "    \"EVAL_SUCCESS_THRESHOLD\",\n",
+     "]:\n",
+     "    print(f\"{key}={os.environ[key]}\")"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "from pathlib import Path\n",
+     "\n",
+     "trained_path = Path(os.environ[\"TRAINED_MODEL_PATH\"])\n",
+     "print(\"Checkpoint exists:\", trained_path.exists())\n",
+     "if trained_path.exists():\n",
+     "    print(\"Checkpoint contents:\")\n",
+     "    for item in sorted(trained_path.iterdir()):\n",
+     "        print(\" -\", item.name)"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "## Optional: Run CommitmentOS Locally Instead Of HF Space\n",
+     "\n",
+     "Only run this if you want evaluation against a local server inside Colab. Otherwise skip this section and keep `ENV_BASE_URL` pointed at the hosted Space."
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "# Optional local server setup\n",
+     "# import os\n",
+     "# os.environ[\"ENV_BASE_URL\"] = \"http://127.0.0.1:7860\"\n",
+     "# !nohup python -m uvicorn server.app:app --host 0.0.0.0 --port 7860 >/tmp/commitmentos.log 2>&1 &\n",
+     "# !sleep 5\n",
+     "# !curl -s http://127.0.0.1:7860/health"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "## Run Checkpoint Comparison"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "!python evaluation/evaluate_llm_checkpoints.py"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "!python evaluation/plot_llm_checkpoints.py"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "## Inspect Artifacts"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "import json\n",
+     "from pathlib import Path\n",
+     "\n",
+     "artifact_dir = Path(\"artifacts/evals_llm\")\n",
+     "print(sorted(p.name for p in artifact_dir.iterdir()))\n",
+     "\n",
+     "summary = json.loads((artifact_dir / \"llm_summary.json\").read_text())\n",
+     "summary"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "import pandas as pd\n",
+     "\n",
+     "pd.read_csv(\"artifacts/evals_llm/llm_comparison.csv\")"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "from IPython.display import SVG, display\n",
+     "\n",
+     "display(SVG(filename=\"artifacts/evals_llm/llm_reward_by_task.svg\"))\n",
+     "display(SVG(filename=\"artifacts/evals_llm/llm_violations_before_after.svg\"))"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "id": "9e8a35c5",
+    "metadata": {},
+    "source": [
+     "## Backup results (zip and download)\n",
+     "\n",
+     "Run after eval/plot finish. Large runs: copy `training_output` to Google Drive instead of browser download.\n"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "b4a5bcc7",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "!cd /content/commitment_os && du -sh training_output artifacts/evals_llm 2>/dev/null || true\n",
+     "!cd /content/commitment_os && zip -r /content/commitment_os_bundle.zip training_output artifacts/evals_llm\n",
+     "from google.colab import files\n",
+     "\n",
+     "files.download(\"/content/commitment_os_bundle.zip\")\n"
+    ]
+   }
+  ],
+  "metadata": {
+   "kernelspec": {
+    "display_name": "Python 3",
+    "language": "python",
+    "name": "python3"
+   },
+   "language_info": {
+    "name": "python",
+    "version": "3.x"
+   }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+ }
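Note: the docs in this commit say `TRAINED_MODEL_PATH` must point at a folder containing `adapter_config.json`, while the notebook cell above only prints whether the path exists. A possible pre-flight check is sketched below; `check_adapter_dir` is a hypothetical helper, not part of the commit, and it checks only for that one file.

```python
import tempfile
from pathlib import Path


def check_adapter_dir(path):
    """Hypothetical pre-flight check: return a list of problems (empty if OK)."""
    adapter_dir = Path(path)
    if not adapter_dir.is_dir():
        return [f"not a directory: {adapter_dir}"]
    problems = []
    if not (adapter_dir / "adapter_config.json").is_file():
        problems.append("adapter_config.json is missing")
    return problems


# Demonstrate on a throwaway directory rather than a real checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    before = check_adapter_dir(tmp)
    (Path(tmp) / "adapter_config.json").write_text("{}")
    after = check_adapter_dir(tmp)

print(before, after)
```

Running a check like this before the evaluation cell would surface a wrong path early, before the base model download starts.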
evaluation/evaluate_llm_checkpoints.py ADDED
@@ -0,0 +1,565 @@
+ """Evaluate base vs RL-trained LLM checkpoints on CommitmentOS.
+
+ This script runs the SAME protocol for two local-loading model setups:
+ - baseline model loaded from a Hugging Face model ID
+ - trained model loaded from a local LoRA adapter path on top of that base model
+
+ It writes judge-friendly artifacts under artifacts/evals_llm/.
+ """
+
+ from __future__ import annotations
+
+ import csv
+ import gc
+ import json
+ import os
+ import sys
+ import uuid
+ from pathlib import Path
+ from statistics import mean, median
+ from typing import Any
+
+ import requests
+ from dotenv import load_dotenv
+ from pydantic import ValidationError
+
+ PROJECT_ROOT = Path(__file__).resolve().parents[1]
+ if str(PROJECT_ROOT) not in sys.path:
+     sys.path.insert(0, str(PROJECT_ROOT))
+
+ from models import CommitmentAction
+
+ ARTIFACT_DIR = Path("artifacts/evals_llm")
+ ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)
+
+ load_dotenv()
+
+ ENV_BASE_URL = os.getenv("ENV_BASE_URL", "https://jayant2304-commitment-os.hf.space")
+ HF_TOKEN = os.getenv("HF_TOKEN", "").strip() or None
+
+ BASELINE_MODEL = os.getenv("BASELINE_MODEL_NAME", "").strip()
+ TRAINED_MODEL_PATH = os.getenv("TRAINED_MODEL_PATH", "").strip()
+
+ EVAL_SEED = int(os.getenv("EVAL_SEED", "42"))
+ MAX_STEPS = int(os.getenv("EVAL_MAX_STEPS", "12"))
+ TEMPERATURE = float(os.getenv("EVAL_TEMPERATURE", "0.0"))
+ TOP_P = float(os.getenv("EVAL_TOP_P", "1.0"))
+ MAX_NEW_TOKENS = int(os.getenv("EVAL_MAX_NEW_TOKENS", "256"))
+ SUCCESS_THRESHOLD = float(os.getenv("EVAL_SUCCESS_THRESHOLD", "0.6"))
+
+ SYSTEM_PROMPT = """You are an expert executive assistant AI. You manage calendars, emails, and dining reservations.
+
+ You will be given a scenario briefing describing a situation with calendar conflicts, emails, or planning tasks.
+
+ For each turn, you must respond with EXACTLY ONE JSON object choosing a tool to call:
+
+ Available tools:
+ - {"action_type": "view_calendar", "date": "2026-04-25"}
+ - {"action_type": "check_availability", "person": "Client_Jones"}
+ - {"action_type": "search_restaurants", "cuisine": "Italian", "max_price": 50, "dietary": "vegetarian", "max_distance_miles": 3.0, "near_airport": false}
+ - {"action_type": "schedule_meeting", "title": "Demo", "date": "2026-04-25", "time": "14:00", "duration_min": 60, "participants": ["Client_Jones"], "location": "Room A"}
+ - {"action_type": "reschedule_event", "event_id": "evt_1", "new_time": "15:00"}
+ - {"action_type": "cancel_event", "event_id": "evt_1"}
+ - {"action_type": "send_email", "to": "VP_Chen", "subject": "Meeting update", "body": "Hi, I need to reschedule..."}
+ - {"action_type": "book_restaurant", "restaurant_name": "Sky Lounge"}
+ - {"action_type": "submit_plan"}
+
+ IMPORTANT RULES:
+ 1. Respond with ONLY a JSON object, no markdown, no explanation
+ 2. Handle higher-priority items before lower-priority ones
+ 3. When cancelling or rescheduling commitments, ALWAYS send an email to affected parties BEFORE submitting
+ 4. Call submit_plan when you have resolved all issues
+ 5. Never silently drop a commitment — always notify the affected person"""
+
+
+ def _require_env() -> None:
+     if not BASELINE_MODEL:
+         raise RuntimeError("Set BASELINE_MODEL_NAME")
+     if not TRAINED_MODEL_PATH:
+         raise RuntimeError("Set TRAINED_MODEL_PATH")
+     if not Path(TRAINED_MODEL_PATH).exists():
+         raise RuntimeError(f"TRAINED_MODEL_PATH does not exist: {TRAINED_MODEL_PATH}")
+
+
+ def _load_runtime_deps() -> tuple[Any, Any, Any, Any]:
+     try:
+         import torch
+         from peft import PeftModel
+         from transformers import AutoModelForCausalLM, AutoTokenizer
+     except ImportError as exc:
+         raise RuntimeError(
+             "Missing evaluation dependencies. From the repo root: "
+             'pip install -e ".[llm-eval]"'
+             " (or: pip install transformers peft accelerate torch sentencepiece)"
+         ) from exc
+     return torch, AutoModelForCausalLM, AutoTokenizer, PeftModel
+
+
+ def _get_task_ids() -> list[str]:
+     resp = requests.get(f"{ENV_BASE_URL}/tasks", timeout=30)
+     resp.raise_for_status()
+     data = resp.json()
+     task_ids: list[str] = []
+     for difficulty in ("easy", "medium", "hard"):
+         task_ids.extend(data.get(difficulty, []))
+     return task_ids
+
+
+ def _parse_action(text: str) -> dict[str, Any]:
+     text = (text or "").strip()
+     if text.startswith("```"):
+         lines = text.split("\n")
+         text = "\n".join(lines[1:-1]) if len(lines) > 2 else lines[0]
+     try:
+         action = json.loads(text)
+         if isinstance(action, dict) and action.get("action_type"):
+             return action
+     except json.JSONDecodeError:
+         pass
+     return {"action_type": "submit_plan"}
+
+
+ def _normalize_action(action: dict[str, Any]) -> dict[str, Any]:
+     allowed_fields = set(CommitmentAction.model_fields.keys())
+     payload = {k: v for k, v in action.items() if k in allowed_fields}
+
+     if isinstance(payload.get("participants"), str):
+         participants = [
+             item.strip()
+             for item in payload["participants"].split(",")
+             if item.strip()
+         ]
+         payload["participants"] = participants
+
+     if "duration_min" in payload:
+         try:
+             payload["duration_min"] = int(payload["duration_min"])
+         except (TypeError, ValueError):
+             payload.pop("duration_min", None)
+
+     if "max_price" in payload:
+         try:
+             payload["max_price"] = int(payload["max_price"])
+         except (TypeError, ValueError):
+             payload.pop("max_price", None)
+
+     if "max_distance_miles" in payload:
+         try:
+             payload["max_distance_miles"] = float(payload["max_distance_miles"])
+         except (TypeError, ValueError):
+             payload.pop("max_distance_miles", None)
+
+     if isinstance(payload.get("near_airport"), str):
+         payload["near_airport"] = payload["near_airport"].strip().lower() in {"true", "1", "yes"}
+
+     try:
+         return CommitmentAction.model_validate(payload).model_dump()
+     except ValidationError:
+         return CommitmentAction(action_type="submit_plan").model_dump()
+
+
+ def _dtype_and_device(torch_mod: Any) -> tuple[Any, str | None]:
+     if not torch_mod.cuda.is_available():
+         return torch_mod.float32, None
+     if torch_mod.cuda.is_bf16_supported():
+         return torch_mod.bfloat16, "auto"
+     return torch_mod.float16, "auto"
+
+
+ def _path_has_tokenizer_files(path: Path) -> bool:
+     tokenizer_files = {
+         "tokenizer.json",
+         "tokenizer_config.json",
+         "special_tokens_map.json",
+         "vocab.json",
+         "merges.txt",
+         "spiece.model",
+     }
+     return any((path / file_name).exists() for file_name in tokenizer_files)
+
+
+ class LocalChatModel:
+     def __init__(
+         self,
+         *,
+         display_name: str,
+         tokenizer: Any,
+         model: Any,
+         torch_mod: Any,
+     ) -> None:
+         self.display_name = display_name
+         self.tokenizer = tokenizer
+         self.model = model
+         self.torch = torch_mod
+
+     def generate_action(self, messages: list[dict[str, str]]) -> tuple[dict[str, Any], str]:
+         prompt = self.tokenizer.apply_chat_template(
+             messages,
+             tokenize=False,
+             add_generation_prompt=True,
+         )
+         inputs = self.tokenizer(prompt, return_tensors="pt")
+         target_device = next(self.model.parameters()).device
+         inputs = {k: v.to(target_device) for k, v in inputs.items()}
+
+         generation_kwargs: dict[str, Any] = {
+             "max_new_tokens": MAX_NEW_TOKENS,
+             "pad_token_id": self.tokenizer.pad_token_id,
+             "eos_token_id": self.tokenizer.eos_token_id,
+         }
+         if TEMPERATURE > 0:
+             generation_kwargs.update(
+                 {
+                     "do_sample": True,
+                     "temperature": TEMPERATURE,
+                     "top_p": TOP_P,
+                 }
+             )
+         else:
+             generation_kwargs["do_sample"] = False
+
+         with self.torch.inference_mode():
+             output_ids = self.model.generate(**inputs, **generation_kwargs)
+
+         prompt_len = inputs["input_ids"].shape[-1]
+         new_tokens = output_ids[0][prompt_len:]
+         raw = self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
+         return _normalize_action(_parse_action(raw)), raw
+
+     def unload(self) -> None:
+         del self.model
+         gc.collect()
+         if self.torch.cuda.is_available():
+             self.torch.cuda.empty_cache()
+
+
+ def _load_tokenizer(AutoTokenizer: Any, model_or_path: str | Path) -> Any:
+     tokenizer = AutoTokenizer.from_pretrained(
+         model_or_path,
+         trust_remote_code=True,
+         token=HF_TOKEN,
+     )
+     if tokenizer.pad_token is None:
+         tokenizer.pad_token = tokenizer.eos_token
+     return tokenizer
+
+
+ def load_baseline_model() -> LocalChatModel:
+     torch_mod, AutoModelForCausalLM, AutoTokenizer, _ = _load_runtime_deps()
+     dtype, device_map = _dtype_and_device(torch_mod)
+     tokenizer = _load_tokenizer(AutoTokenizer, BASELINE_MODEL)
+     model = AutoModelForCausalLM.from_pretrained(
+         BASELINE_MODEL,
+         trust_remote_code=True,
+         token=HF_TOKEN,
+         dtype=dtype,
+         device_map=device_map,
+     )
+     model.eval()
+     return LocalChatModel(
+         display_name=BASELINE_MODEL,
+         tokenizer=tokenizer,
+         model=model,
+         torch_mod=torch_mod,
+     )
+
+
+ def load_trained_model() -> LocalChatModel:
+     torch_mod, AutoModelForCausalLM, AutoTokenizer, PeftModel = _load_runtime_deps()
+     dtype, device_map = _dtype_and_device(torch_mod)
+     adapter_path = Path(TRAINED_MODEL_PATH)
+     tokenizer_source: str | Path = adapter_path if _path_has_tokenizer_files(adapter_path) else BASELINE_MODEL
+     tokenizer = _load_tokenizer(AutoTokenizer, tokenizer_source)
+
+     base_model = AutoModelForCausalLM.from_pretrained(
+         BASELINE_MODEL,
+         trust_remote_code=True,
+         token=HF_TOKEN,
+         dtype=dtype,
+         device_map=device_map,
+     )
+     model = PeftModel.from_pretrained(base_model, adapter_path)
+     model.eval()
+     return LocalChatModel(
+         display_name=str(adapter_path),
+         tokenizer=tokenizer,
+         model=model,
+         torch_mod=torch_mod,
+     )
+
+
+ def _env_reset(task_id: str, episode_id: str) -> dict[str, Any]:
+     resp = requests.post(
+         f"{ENV_BASE_URL}/reset",
+         params={"task_id": task_id, "seed": EVAL_SEED, "episode_id": episode_id},
+         timeout=30,
+     )
+     resp.raise_for_status()
+     data = resp.json()
+     return data.get("observation", data)
+
+
+ def _env_step(action: dict[str, Any], episode_id: str) -> dict[str, Any]:
+     resp = requests.post(
+         f"{ENV_BASE_URL}/step",
+         params={"episode_id": episode_id},
+         json={"action": action},
+         timeout=30,
+     )
+     if resp.status_code >= 400:
+         raise requests.HTTPError(
+             f"{resp.status_code} {resp.reason}: {resp.text}",
+             response=resp,
+         )
+     data = resp.json()
+     obs = data.get("observation", data)
+     obs["done"] = data.get("done", obs.get("done", False))
+     obs["reward"] = float(data.get("reward", obs.get("reward", 0.0)) or 0.0)
+     return obs
+
+
+ def _env_state(episode_id: str) -> dict[str, Any]:
+     resp = requests.get(f"{ENV_BASE_URL}/state", params={"episode_id": episode_id}, timeout=30)
+     resp.raise_for_status()
+     return resp.json()
+
+
+ def run_task(chat_model: LocalChatModel, task_id: str) -> dict[str, Any]:
+     safe_name = chat_model.display_name.replace("/", "-").replace(" ", "_")
+     episode_id = f"eval-{safe_name}-{task_id}-{uuid.uuid4().hex[:8]}"
+     obs = _env_reset(task_id, episode_id)
+
+     briefing = obs.get("briefing", "")
+     calendar = json.dumps(obs.get("calendar_snapshot", []), indent=2)
+     inbox = json.dumps(obs.get("inbox", []), indent=2)
+     messages: list[dict[str, str]] = [
+         {"role": "system", "content": SYSTEM_PROMPT},
+         {"role": "user", "content": f"SCENARIO: {briefing}\n\nCALENDAR:\n{calendar}\n\nINBOX:\n{inbox}\n\nWhat is your first action?"},
+     ]
+
+     trace: list[dict[str, Any]] = []
+     step_num = 0
+     done = False
+     final_obs: dict[str, Any] = obs
+
+     for step_num in range(1, MAX_STEPS + 1):
+         action, raw = chat_model.generate_action(messages)
+         step_obs = _env_step(action, episode_id)
+         final_obs = step_obs
+         done = bool(step_obs.get("done", False))
+         trace.append(
+             {
+                 "step": step_num,
+                 "action": action,
+                 "raw_model_output": raw,
+                 "reward": float(step_obs.get("reward", 0.0)),
+                 "done": done,
+                 "tool_result": step_obs.get("tool_result", ""),
+             }
+         )
+         if done:
+             break
+         messages.append({"role": "assistant", "content": raw})
+         messages.append({"role": "user", "content": f"TOOL RESULT: {step_obs.get('tool_result', '')}\n\nWhat is your next action?"})
+
+     if not done:
+         final_obs = _env_step({"action_type": "submit_plan"}, episode_id)
+         step_num += 1
+         trace.append(
+             {
+                 "step": step_num,
+                 "action": {"action_type": "submit_plan"},
+                 "raw_model_output": '{"action_type":"submit_plan"}',
+                 "reward": float(final_obs.get("reward", 0.0)),
+                 "done": True,
+                 "tool_result": final_obs.get("tool_result", ""),
+             }
+         )
+
+     state = _env_state(episode_id)
+     final_reward = float(final_obs.get("reward", 0.0))
+     return {
+         "task_id": task_id,
+         "difficulty": final_obs.get("difficulty", ""),
+         "model_name": chat_model.display_name,
+         "final_reward": round(final_reward, 4),
+         "success": final_reward >= SUCCESS_THRESHOLD,
+         "steps_used": int(state.get("step_count", step_num)),
+         "violation_count": int(state.get("violation_count", 0)),
+         "reward_breakdown": final_obs.get("reward_breakdown", {}),
+         "feedback": final_obs.get("feedback", ""),
+         "trace": trace,
+     }
+
+
+ def run_model(chat_model: LocalChatModel, task_ids: list[str]) -> list[dict[str, Any]]:
+     results: list[dict[str, Any]] = []
+     n = len(task_ids)
+     label = chat_model.display_name
+     for i, task_id in enumerate(task_ids, start=1):
+         print(f"[eval {label}] task {i}/{n}: {task_id}", flush=True)
+         results.append(run_task(chat_model, task_id=task_id))
+     return results
+
+
405
+ def _write_json(path: Path, payload: Any) -> None:
406
+ path.write_text(json.dumps(payload, indent=2))
407
+
408
+
+ def write_artifacts(baseline: list[dict[str, Any]], trained: list[dict[str, Any]]) -> None:
+     by_task = {row["task_id"]: row for row in trained}
+     comparison_rows: list[dict[str, Any]] = []
+     for base in baseline:
+         tr = by_task[base["task_id"]]
+         comparison_rows.append(
+             {
+                 "task_id": base["task_id"],
+                 "difficulty": base["difficulty"],
+                 "baseline_reward": base["final_reward"],
+                 "trained_reward": tr["final_reward"],
+                 "reward_delta": round(tr["final_reward"] - base["final_reward"], 4),
+                 "baseline_steps": base["steps_used"],
+                 "trained_steps": tr["steps_used"],
+                 "step_delta": tr["steps_used"] - base["steps_used"],
+                 "baseline_violations": base["violation_count"],
+                 "trained_violations": tr["violation_count"],
+                 "violation_delta": tr["violation_count"] - base["violation_count"],
+                 "baseline_success": int(base["success"]),
+                 "trained_success": int(tr["success"]),
+             }
+         )
+
+     _write_json(ARTIFACT_DIR / "baseline_llm_eval.json", baseline)
+     _write_json(ARTIFACT_DIR / "trained_llm_eval.json", trained)
+     _write_json(
+         ARTIFACT_DIR / "llm_eval_protocol.json",
+         {
+             "task_set": "easy_001..hard_015",
+             "seed": EVAL_SEED,
+             "max_steps": MAX_STEPS,
+             "decode_config": {
+                 "temperature": TEMPERATURE,
+                 "top_p": TOP_P,
+                 "max_new_tokens": MAX_NEW_TOKENS,
+             },
+             "env_base_url": ENV_BASE_URL,
+             "baseline_model_name": BASELINE_MODEL,
+             "trained_model_path": TRAINED_MODEL_PATH,
+             "success_threshold": SUCCESS_THRESHOLD,
+         },
+     )
+
+     with (ARTIFACT_DIR / "llm_comparison.csv").open("w", newline="") as f:
+         writer = csv.DictWriter(f, fieldnames=list(comparison_rows[0].keys()))
+         writer.writeheader()
+         writer.writerows(comparison_rows)
+
+     baseline_rewards = [r["baseline_reward"] for r in comparison_rows]
+     trained_rewards = [r["trained_reward"] for r in comparison_rows]
+     reward_deltas = [r["reward_delta"] for r in comparison_rows]
+     baseline_steps = [r["baseline_steps"] for r in comparison_rows]
+     trained_steps = [r["trained_steps"] for r in comparison_rows]
+     baseline_violations = [r["baseline_violations"] for r in comparison_rows]
+     trained_violations = [r["trained_violations"] for r in comparison_rows]
+     baseline_success = [r["baseline_success"] for r in comparison_rows]
+     trained_success = [r["trained_success"] for r in comparison_rows]
+
+     summary = {
+         "task_count": len(comparison_rows),
+         "baseline_mean_reward": round(mean(baseline_rewards), 4),
+         "trained_mean_reward": round(mean(trained_rewards), 4),
+         "mean_reward_delta": round(mean(trained_rewards) - mean(baseline_rewards), 4),
+         "median_reward_delta": round(median(reward_deltas), 4),
+         "baseline_success_rate": round(mean(baseline_success), 4),
+         "trained_success_rate": round(mean(trained_success), 4),
+         "success_rate_delta": round(mean(trained_success) - mean(baseline_success), 4),
+         "baseline_mean_steps": round(mean(baseline_steps), 4),
+         "trained_mean_steps": round(mean(trained_steps), 4),
+         "step_delta": round(mean(trained_steps) - mean(baseline_steps), 4),
+         "baseline_mean_violations": round(mean(baseline_violations), 4),
+         "trained_mean_violations": round(mean(trained_violations), 4),
+         "violation_delta": round(mean(trained_violations) - mean(baseline_violations), 4),
+         "tasks_with_positive_reward_delta": sum(1 for x in reward_deltas if x > 0),
+         "tasks_with_no_reward_delta": sum(1 for x in reward_deltas if x == 0),
+         "per_difficulty": {},
+     }
+
+     for difficulty in ("easy", "medium", "hard"):
+         subset = [r for r in comparison_rows if r["difficulty"] == difficulty]
+         if not subset:
+             continue
+         summary["per_difficulty"][difficulty] = {
+             "count": len(subset),
+             "baseline_mean_reward": round(mean([r["baseline_reward"] for r in subset]), 4),
+             "trained_mean_reward": round(mean([r["trained_reward"] for r in subset]), 4),
+             "reward_delta": round(
+                 mean([r["trained_reward"] for r in subset]) - mean([r["baseline_reward"] for r in subset]),
+                 4,
+             ),
+             "baseline_mean_steps": round(mean([r["baseline_steps"] for r in subset]), 4),
+             "trained_mean_steps": round(mean([r["trained_steps"] for r in subset]), 4),
+             "step_delta": round(
+                 mean([r["trained_steps"] for r in subset]) - mean([r["baseline_steps"] for r in subset]),
+                 4,
+             ),
+         }
+
+     _write_json(ARTIFACT_DIR / "llm_summary.json", summary)
+
+     target_task = "hard_015"
+     base_case = next((r for r in baseline if r["task_id"] == target_task), None)
+     tr_case = next((r for r in trained if r["task_id"] == target_task), None)
+     if base_case and tr_case:
+         case_study = f"""# LLM Case Study: {target_task}
+
+ ## Baseline model ({BASELINE_MODEL})
+ - Reward: {base_case['final_reward']:.4f}
+ - Steps: {base_case['steps_used']}
+ - Violations: {base_case['violation_count']}
+ - Feedback: {base_case['feedback']}
+
+ ## Trained model ({TRAINED_MODEL_PATH})
+ - Reward: {tr_case['final_reward']:.4f}
+ - Steps: {tr_case['steps_used']}
+ - Violations: {tr_case['violation_count']}
+ - Feedback: {tr_case['feedback']}
+ """
+         (ARTIFACT_DIR / f"llm_case_study_{target_task}.md").write_text(case_study)
+
+
+ def _print_summary() -> None:
+     summary_path = ARTIFACT_DIR / "llm_summary.json"
+     summary = json.loads(summary_path.read_text())
+     print("\nCheckpoint comparison summary")
+     print(f"Baseline mean reward: {summary['baseline_mean_reward']:.4f}")
+     print(f"Trained mean reward: {summary['trained_mean_reward']:.4f}")
+     print(f"Reward delta: {summary['mean_reward_delta']:+.4f}")
+     print(f"Baseline success: {summary['baseline_success_rate']:.4f}")
+     print(f"Trained success: {summary['trained_success_rate']:.4f}")
+     print(f"Success delta: {summary['success_rate_delta']:+.4f}")
+
+
+ def main() -> None:
+     _require_env()
+     task_ids = _get_task_ids()
+     print(f"CommitmentOS LLM eval: {len(task_ids)} tasks, env={ENV_BASE_URL}", flush=True)
+
+     print("Loading baseline model…", flush=True)
+     baseline_model = load_baseline_model()
+     print("Running baseline…", flush=True)
+     baseline_results = run_model(baseline_model, task_ids)
+     baseline_model.unload()
+
+     print("Loading trained adapter…", flush=True)
+     trained_model = load_trained_model()
+     print("Running trained…", flush=True)
+     trained_results = run_model(trained_model, task_ids)
+     trained_model.unload()
+
+     write_artifacts(baseline_results, trained_results)
+     print("Wrote LLM checkpoint artifacts to", ARTIFACT_DIR)
+     _print_summary()
+
+
+ if __name__ == "__main__":
+     main()
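The aggregation step in `write_artifacts` can be tried in isolation. The following is a minimal sketch, assuming rows shaped like the `comparison_rows` entries above; the helper name `summarize` and the two sample rows are hypothetical and are not part of the committed script.

```python
from statistics import mean, median


def summarize(rows):
    """Reduce per-task comparison rows to a few headline metrics (illustrative subset)."""
    base = [r["baseline_reward"] for r in rows]
    trained = [r["trained_reward"] for r in rows]
    deltas = [r["reward_delta"] for r in rows]
    return {
        "task_count": len(rows),
        "baseline_mean_reward": round(mean(base), 4),
        "trained_mean_reward": round(mean(trained), 4),
        "mean_reward_delta": round(mean(trained) - mean(base), 4),
        "median_reward_delta": round(median(deltas), 4),
        "tasks_with_positive_reward_delta": sum(1 for d in deltas if d > 0),
    }


# Hypothetical rows, not real evaluation results.
rows = [
    {"baseline_reward": 0.5, "trained_reward": 1.0, "reward_delta": 0.5},
    {"baseline_reward": 0.25, "trained_reward": 0.25, "reward_delta": 0.0},
]
summary = summarize(rows)
```

Note that `statistics.mean` raises `StatisticsError` on an empty list, so a caller of a helper like this would probably want to reject an empty task set before aggregating.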
evaluation/plot_llm_checkpoints.py ADDED
@@ -0,0 +1,133 @@
+ """Render SVG visuals for LLM checkpoint comparison."""
+
+ from __future__ import annotations
+
+ import csv
+ from pathlib import Path
+
+ ARTIFACT_DIR = Path("artifacts/evals_llm")
+ COMPARISON_CSV = ARTIFACT_DIR / "llm_comparison.csv"
+
+
+ def _svg_header(width: int, height: int) -> list[str]:
+     return [
+         f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}" viewBox="0 0 {width} {height}">',
+         '<rect width="100%" height="100%" fill="#FFFFFF"/>',
+     ]
+
+
+ def _svg_footer() -> list[str]:
+     return ["</svg>"]
+
+
+ def _rows() -> list[dict[str, str]]:
+     # newline="" is what the csv module recommends for files passed to a reader.
+     with COMPARISON_CSV.open(newline="") as f:
+         return list(csv.DictReader(f))
+
+
+ def plot_reward(rows: list[dict[str, str]]) -> None:
+     tasks = [r["task_id"] for r in rows]
+     base = [float(r["baseline_reward"]) for r in rows]
+     trained = [float(r["trained_reward"]) for r in rows]
+
+     width, height = 1360, 520
+     left, right, top, bottom = 80, 40, 70, 110
+     plot_w = width - left - right
+     plot_h = height - top - bottom
+     group_w = plot_w / max(len(tasks), 1)
+     bar_w = max(group_w * 0.32, 10)
+
+     lines = _svg_header(width, height)
+     lines.append('<text x="80" y="35" font-size="22" font-family="Arial" fill="#111827">Base vs Trained LLM Reward by Task</text>')
+     lines.append(f'<line x1="{left}" y1="{top+plot_h}" x2="{left+plot_w}" y2="{top+plot_h}" stroke="#374151" stroke-width="1"/>')
+     lines.append(f'<line x1="{left}" y1="{top}" x2="{left}" y2="{top+plot_h}" stroke="#374151" stroke-width="1"/>')
+
+     for tick in range(0, 6):
+         value = tick / 5
+         y = top + plot_h - (value * plot_h)
+         lines.append(f'<line x1="{left}" y1="{y:.2f}" x2="{left+plot_w}" y2="{y:.2f}" stroke="#E5E7EB" stroke-width="1"/>')
+         lines.append(f'<text x="{left-38}" y="{y+5:.2f}" font-size="12" font-family="Arial" fill="#374151">{value:.1f}</text>')
+
+     for idx, task in enumerate(tasks):
+         gx = left + (idx * group_w) + (group_w * 0.5)
+         b_h = base[idx] * plot_h
+         t_h = trained[idx] * plot_h
+         b_x = gx - bar_w - 2
+         t_x = gx + 2
+         b_y = top + plot_h - b_h
+         t_y = top + plot_h - t_h
+         lines.append(f'<rect x="{b_x:.2f}" y="{b_y:.2f}" width="{bar_w:.2f}" height="{b_h:.2f}" fill="#9CA3AF"/>')
+         lines.append(f'<rect x="{t_x:.2f}" y="{t_y:.2f}" width="{bar_w:.2f}" height="{t_h:.2f}" fill="#2563EB"/>')
+         lines.append(
+             f'<text x="{gx:.2f}" y="{top+plot_h+22}" font-size="10" text-anchor="middle" '
+             f'font-family="Arial" fill="#374151" transform="rotate(25 {gx:.2f},{top+plot_h+22})">{task}</text>'
+         )
+
+     legend_y = 52
+     lines.append(f'<rect x="{width-310}" y="{legend_y-10}" width="12" height="12" fill="#9CA3AF"/>')
+     lines.append(f'<text x="{width-292}" y="{legend_y}" font-size="12" font-family="Arial" fill="#111827">Base</text>')
+     lines.append(f'<rect x="{width-230}" y="{legend_y-10}" width="12" height="12" fill="#2563EB"/>')
+     lines.append(f'<text x="{width-212}" y="{legend_y}" font-size="12" font-family="Arial" fill="#111827">Trained</text>')
+     lines.extend(_svg_footer())
+     (ARTIFACT_DIR / "llm_reward_by_task.svg").write_text("\n".join(lines))
+
+
+ def plot_violations(rows: list[dict[str, str]]) -> None:
+     tasks = [r["task_id"] for r in rows]
+     base = [int(r["baseline_violations"]) for r in rows]
+     trained = [int(r["trained_violations"]) for r in rows]
+     max_v = max(max(base, default=0), max(trained, default=0), 1)
+
+     width, height = 1360, 500
+     left, right, top, bottom = 80, 40, 70, 100
+     plot_w = width - left - right
+     plot_h = height - top - bottom
+
+     def point_x(i: int) -> float:
+         return left + (i / max(len(tasks) - 1, 1)) * plot_w
+
+     def point_y(v: int) -> float:
+         return top + plot_h - ((v / max_v) * plot_h)
+
+     lines = _svg_header(width, height)
+     lines.append('<text x="80" y="35" font-size="22" font-family="Arial" fill="#111827">Base vs Trained LLM Commitment Violations</text>')
+     lines.append(f'<line x1="{left}" y1="{top+plot_h}" x2="{left+plot_w}" y2="{top+plot_h}" stroke="#374151" stroke-width="1"/>')
+     lines.append(f'<line x1="{left}" y1="{top}" x2="{left}" y2="{top+plot_h}" stroke="#374151" stroke-width="1"/>')
+
+     for tick in range(max_v + 1):
+         y = point_y(tick)
+         lines.append(f'<line x1="{left}" y1="{y:.2f}" x2="{left+plot_w}" y2="{y:.2f}" stroke="#E5E7EB" stroke-width="1"/>')
+         lines.append(f'<text x="{left-24}" y="{y+5:.2f}" font-size="12" font-family="Arial" fill="#374151">{tick}</text>')
+
+     base_points = " ".join(f"{point_x(i):.2f},{point_y(v):.2f}" for i, v in enumerate(base))
+     tr_points = " ".join(f"{point_x(i):.2f},{point_y(v):.2f}" for i, v in enumerate(trained))
+     lines.append(f'<polyline points="{base_points}" fill="none" stroke="#DC2626" stroke-width="2"/>')
+     lines.append(f'<polyline points="{tr_points}" fill="none" stroke="#059669" stroke-width="2"/>')
+
+     for i, task in enumerate(tasks):
+         x = point_x(i)
+         lines.append(f'<circle cx="{x:.2f}" cy="{point_y(base[i]):.2f}" r="3" fill="#DC2626"/>')
+         lines.append(f'<circle cx="{x:.2f}" cy="{point_y(trained[i]):.2f}" r="3" fill="#059669"/>')
+         lines.append(
+             f'<text x="{x:.2f}" y="{top+plot_h+20}" font-size="10" text-anchor="middle" '
+             f'font-family="Arial" fill="#374151" transform="rotate(25 {x:.2f},{top+plot_h+20})">{task}</text>'
+         )
+
+     legend_y = 52
+     lines.append(f'<line x1="{width-320}" y1="{legend_y-5}" x2="{width-300}" y2="{legend_y-5}" stroke="#DC2626" stroke-width="2"/>')
+     lines.append(f'<text x="{width-295}" y="{legend_y}" font-size="12" font-family="Arial" fill="#111827">Base</text>')
+     lines.append(f'<line x1="{width-230}" y1="{legend_y-5}" x2="{width-210}" y2="{legend_y-5}" stroke="#059669" stroke-width="2"/>')
+     lines.append(f'<text x="{width-205}" y="{legend_y}" font-size="12" font-family="Arial" fill="#111827">Trained</text>')
+     lines.extend(_svg_footer())
+     (ARTIFACT_DIR / "llm_violations_before_after.svg").write_text("\n".join(lines))
+
+
+ def main() -> None:
+     rows = _rows()
+     plot_reward(rows)
+     plot_violations(rows)
+     print("Wrote checkpoint comparison SVG plots to", ARTIFACT_DIR)
+
+
+ if __name__ == "__main__":
+     main()
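The bar geometry in `plot_reward` maps a reward to a rectangle whose top edge sits `reward * plot_h` pixels above the x-axis. Here is a small sketch of that mapping, assuming rewards are meant to lie in [0, 1]. The helper `bar_rect` and its clamping step are illustrative additions: the committed script does not clamp, so a negative reward there would produce a negative `height` attribute.

```python
def bar_rect(value: float, top: float, plot_h: float, x: float, bar_w: float, fill: str) -> str:
    """Return an SVG <rect> for one bar; SVG y grows downward, so taller bars start higher."""
    clamped = min(max(value, 0.0), 1.0)  # assumption: rewards are normalized to [0, 1]
    bar_h = clamped * plot_h
    y = top + plot_h - bar_h  # top edge of the bar, measured from the top of the canvas
    return f'<rect x="{x:.2f}" y="{y:.2f}" width="{bar_w:.2f}" height="{bar_h:.2f}" fill="{fill}"/>'


# Geometry loosely matching plot_reward: top margin 70, plot height 340.
half_bar = bar_rect(0.5, top=70, plot_h=340, x=100, bar_w=20, fill="#2563EB")
negative_bar = bar_rect(-0.2, top=70, plot_h=340, x=100, bar_w=20, fill="#9CA3AF")
```

With the clamp in place, an out-of-range reward degrades to an empty or full-height bar instead of invalid markup.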
pyproject.toml CHANGED
@@ -7,7 +7,7 @@ name = "commitment-os"
  version = "0.1.0"
  description = "CommitmentOS: the first RL environment that trains temporal commitment coherence in LLMs"
  requires-python = ">=3.10"
- license = {text = "MIT"}
+ license = "MIT"
  authors = [
      {name = "Jayant Aggarwal"},
  ]
@@ -40,4 +40,19 @@ training = [
      "torch>=2.0.0",
      "peft>=0.14.0",
      "datasets>=3.0.0",
+     "accelerate>=0.30.0",
+     "sentencepiece>=0.2.0",
  ]
+ # Local Transformers + PEFT eval (evaluate_llm_checkpoints.py); not in Docker requirements.txt
+ llm-eval = [
+     "transformers>=4.45.0",
+     "peft>=0.14.0",
+     "torch>=2.0.0",
+     "accelerate>=0.30.0",
+     "sentencepiece>=0.2.0",
+     "requests>=2.31.0",
+ ]
+
+ [tool.setuptools.packages.find]
+ where = ["."]
+ include = ["server*", "training*"]
server/__init__.py CHANGED
@@ -0,0 +1 @@
 
 
1
+ """CommitmentOS HTTP server and environment implementation."""
training/CommitmentOS_Training.ipynb CHANGED
@@ -1,95 +1,119 @@
1
  {
2
- "cells": [
3
- {
4
- "cell_type": "markdown",
5
- "metadata": {},
6
- "source": [
7
- "# CommitmentOS Training Notebook\\n",
8
- "\\n",
9
- "This notebook reproduces GRPO training for CommitmentOS using TRL + LoRA."
10
- ]
11
  },
12
- {
13
- "cell_type": "code",
14
- "execution_count": null,
15
- "metadata": {},
16
- "outputs": [],
17
- "source": [
18
- "!pip -q install --upgrade pip\\n",
19
- "!pip -q install openenv trl transformers peft datasets torch accelerate bitsandbytes matplotlib pandas"
20
- ]
21
- },
22
- {
23
- "cell_type": "code",
24
- "execution_count": null,
25
- "metadata": {},
26
- "outputs": [],
27
- "source": [
28
- "!git clone https://github.com/Jayant2304/commitment_os.git\\n",
29
- "%cd commitment_os\\n",
30
- "!python -m pytest tests/test_environment.py -q"
31
- ]
32
- },
33
- {
34
- "cell_type": "code",
35
- "execution_count": null,
36
- "metadata": {},
37
- "outputs": [],
38
- "source": [
39
- "!python training/train_grpo.py \\\\\\n",
40
- " --model Qwen/Qwen2.5-1.5B-Instruct \\\\\\n",
41
- " --epochs 2 \\\\\\n",
42
- " --lr 5e-6 \\\\\\n",
43
- " --batch_size 1 \\\\\\n",
44
- " --group_size 2 \\\\\\n",
45
- " --lora_rank 16 \\\\\\n",
46
- " --lora_alpha 32 \\\\\\n",
47
- " --output_dir ./training_output"
48
- ]
49
- },
50
- {
51
- "cell_type": "code",
52
- "execution_count": null,
53
- "metadata": {},
54
- "outputs": [],
55
- "source": [
56
- "import json\\n",
57
- "import matplotlib.pyplot as plt\\n",
58
- "from pathlib import Path\\n",
59
- "\\n",
60
- "p = Path('training_output/training_metrics.json')\\n",
61
- "logs = json.loads(p.read_text())\\n",
62
- "\\n",
63
- "steps = [float(x['step']) for x in logs if 'step' in x and 'loss' in x]\\n",
64
- "loss = [float(x['loss']) for x in logs if 'step' in x and 'loss' in x]\\n",
65
- "r_steps = [float(x['step']) for x in logs if 'step' in x and 'reward' in x]\\n",
66
- "rewards = [float(x['reward']) for x in logs if 'step' in x and 'reward' in x]\\n",
67
- "\\n",
68
- "plt.figure(figsize=(8,5))\\n",
69
- "plt.plot(steps, loss, marker='o')\\n",
70
- "plt.title('CommitmentOS GRPO Loss vs Step')\\n",
71
- "plt.xlabel('Step'); plt.ylabel('Loss'); plt.grid(alpha=0.3)\\n",
72
- "plt.tight_layout(); plt.savefig('loss_curve.png', dpi=200); plt.show()\\n",
73
- "\\n",
74
- "plt.figure(figsize=(8,5))\\n",
75
- "plt.plot(r_steps, rewards, marker='o')\\n",
76
- "plt.title('CommitmentOS GRPO Reward vs Step')\\n",
77
- "plt.xlabel('Step'); plt.ylabel('Reward'); plt.grid(alpha=0.3)\\n",
78
- "plt.tight_layout(); plt.savefig('reward_curve.png', dpi=200); plt.show()"
79
- ]
80
- }
81
- ],
82
- "metadata": {
83
- "kernelspec": {
84
- "display_name": "Python 3",
85
- "language": "python",
86
- "name": "python3"
87
- },
88
- "language_info": {
89
- "name": "python",
90
- "version": "3.10"
91
- }
92
- },
93
- "nbformat": 4,
94
- "nbformat_minor": 5
95
  }
 
  {
+  "cells": [
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "# CommitmentOS Training Notebook\n",
+     "\n",
+     "This notebook reproduces GRPO training for CommitmentOS using TRL + LoRA."
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "5bc9c2fe",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "!pip -q install --upgrade pip\n",
+     "!pip -q install \"openenv-core>=0.2.0\" trl transformers peft datasets torch accelerate bitsandbytes matplotlib pandas pydantic"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "!git clone https://github.com/Jayant2304/commitment_os.git\n",
+     "%cd commitment_os\n",
+     "!python -m pytest tests/test_environment.py -q"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "!python training/train_grpo.py \\\n",
+     " --model Qwen/Qwen2.5-1.5B-Instruct \\\n",
+     " --epochs 2 \\\n",
+     " --lr 5e-6 \\\n",
+     " --batch_size 1 \\\n",
+     " --group_size 2 \\\n",
+     " --lora_rank 16 \\\n",
+     " --lora_alpha 32 \\\n",
+     " --output_dir ./training_output"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "import json\n",
+     "import matplotlib.pyplot as plt\n",
+     "from pathlib import Path\n",
+     "\n",
+     "p = Path('training_output/training_metrics.json')\n",
+     "logs = json.loads(p.read_text())\n",
+     "\n",
+     "steps = [float(x['step']) for x in logs if 'step' in x and 'loss' in x]\n",
+     "loss = [float(x['loss']) for x in logs if 'step' in x and 'loss' in x]\n",
+     "r_steps = [float(x['step']) for x in logs if 'step' in x and 'reward' in x]\n",
+     "rewards = [float(x['reward']) for x in logs if 'step' in x and 'reward' in x]\n",
+     "\n",
+     "plt.figure(figsize=(8,5))\n",
+     "plt.plot(steps, loss, marker='o')\n",
+     "plt.title('CommitmentOS GRPO Loss vs Step')\n",
+     "plt.xlabel('Step'); plt.ylabel('Loss'); plt.grid(alpha=0.3)\n",
+     "plt.tight_layout(); plt.savefig('loss_curve.png', dpi=200); plt.show()\n",
+     "\n",
+     "plt.figure(figsize=(8,5))\n",
+     "plt.plot(r_steps, rewards, marker='o')\n",
+     "plt.title('CommitmentOS GRPO Reward vs Step')\n",
+     "plt.xlabel('Step'); plt.ylabel('Reward'); plt.grid(alpha=0.3)\n",
+     "plt.tight_layout(); plt.savefig('reward_curve.png', dpi=200); plt.show()"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "id": "e788b455",
+    "metadata": {},
+    "source": [
+     "### Optional: zip `training_output` for download\n",
+     "\n",
+     "Run after training completes. On Colab, use **Files** sidebar or `files.download` for the zip.\n"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "1b3c760a",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "!cd /content/commitment_os && du -sh training_output && zip -r /content/training_output_only.zip training_output\n",
+     "from google.colab import files\n",
+     "\n",
+     "files.download(\"/content/training_output_only.zip\")\n"
+    ]
+   }
+  ],
+  "metadata": {
+   "kernelspec": {
+    "display_name": "Python 3",
+    "language": "python",
+    "name": "python3"
+   },
+   "language_info": {
+    "name": "python",
+    "version": "3.10"
+   }
  },
+  "nbformat": 4,
+  "nbformat_minor": 5
  }
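Notebook cells store `source` as JSON strings, so a doubled escape (`\\n` in the raw file) decodes to a literal backslash followed by `n` rather than a line break, which would break a code cell when run. The sketch below is one hedged way to flag such lines before committing a notebook. The helper name `literal_newline_lines` is hypothetical, and the check is a heuristic: it would also flag a line that legitimately ends in the two characters `\n`.

```python
import json


def literal_newline_lines(nb_text: str) -> list[tuple[int, str]]:
    """Return (cell_index, line) pairs whose decoded source ends in a literal backslash-n."""
    notebook = json.loads(nb_text)
    flagged = []
    for index, cell in enumerate(notebook.get("cells", [])):
        source = cell.get("source", [])
        # nbformat allows source to be either one string or a list of strings.
        lines = source.splitlines(keepends=True) if isinstance(source, str) else source
        for line in lines:
            if line.rstrip("\n").endswith("\\n"):
                flagged.append((index, line))
    return flagged


# Hypothetical one-cell notebook: the first line carries a literal backslash-n, the second is clean.
sample = json.dumps(
    {"cells": [{"cell_type": "code", "source": ["import json\\n\n", "x = 1\n"]}]}
)
flagged = literal_newline_lines(sample)
```

Running a check like this over `training/CommitmentOS_Training.ipynb` (read with `Path.read_text()`) would list any cell lines that still carry the stray escape.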
uv.lock CHANGED
@@ -660,9 +660,19 @@ inference = [
660
  { name = "openai" },
661
  { name = "requests" },
662
  ]
 
 
 
 
 
 
 
 
663
  training = [
 
664
  { name = "datasets" },
665
  { name = "peft" },
 
666
  { name = "torch" },
667
  { name = "transformers" },
668
  { name = "trl" },
@@ -670,24 +680,32 @@ training = [
670
 
671
  [package.metadata]
672
  requires-dist = [
 
 
673
  { name = "datasets", marker = "extra == 'training'", specifier = ">=3.0.0" },
674
  { name = "fastapi", specifier = ">=0.110.0" },
675
  { name = "httpx", marker = "extra == 'dev'", specifier = ">=0.27.0" },
676
  { name = "openai", marker = "extra == 'dev'", specifier = ">=1.0.0" },
677
  { name = "openai", marker = "extra == 'inference'", specifier = ">=1.0.0" },
678
  { name = "openenv-core", specifier = ">=0.2.0" },
 
679
  { name = "peft", marker = "extra == 'training'", specifier = ">=0.14.0" },
680
  { name = "pydantic", specifier = ">=2.0.0" },
681
  { name = "pytest", marker = "extra == 'dev'", specifier = ">=8.0.0" },
682
  { name = "python-dotenv", specifier = ">=1.0.0" },
683
  { name = "requests", marker = "extra == 'dev'", specifier = ">=2.31.0" },
684
  { name = "requests", marker = "extra == 'inference'", specifier = ">=2.31.0" },
 
 
 
 
685
  { name = "torch", marker = "extra == 'training'", specifier = ">=2.0.0" },
 
686
  { name = "transformers", marker = "extra == 'training'", specifier = ">=4.45.0" },
687
  { name = "trl", marker = "extra == 'training'", specifier = ">=0.14.0" },
688
  { name = "uvicorn", extras = ["standard"], specifier = ">=0.29.0" },
689
  ]
690
- provides-extras = ["inference", "dev", "training"]
691
 
692
  [[package]]
693
  name = "cryptography"
@@ -754,7 +772,7 @@ name = "cuda-bindings"
754
  version = "13.2.0"
755
  source = { registry = "https://pypi.org/simple" }
756
  dependencies = [
757
- { name = "cuda-pathfinder" },
758
  ]
759
  wheels = [
760
  { url = "https://files.pythonhosted.org/packages/1a/fe/7351d7e586a8b4c9f89731bfe4cf0148223e8f9903ff09571f78b3fb0682/cuda_bindings-13.2.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:08b395f79cb89ce0cd8effff07c4a1e20101b873c256a1aeb286e8fd7bd0f556", size = 5744254 },
@@ -789,37 +807,37 @@ wheels = [
789
 
790
  [package.optional-dependencies]
791
  cublas = [
792
- { name = "nvidia-cublas", marker = "sys_platform == 'linux' or sys_platform == 'win32'" },
793
  ]
794
  cudart = [
795
- { name = "nvidia-cuda-runtime", marker = "sys_platform == 'linux' or sys_platform == 'win32'" },
796
  ]
797
  cufft = [
798
- { name = "nvidia-cufft", marker = "sys_platform == 'linux' or sys_platform == 'win32'" },
799
  ]
800
  cufile = [
801
  { name = "nvidia-cufile", marker = "sys_platform == 'linux'" },
802
  ]
803
  cupti = [
804
- { name = "nvidia-cuda-cupti", marker = "sys_platform == 'linux' or sys_platform == 'win32'" },
805
  ]
806
  curand = [
807
- { name = "nvidia-curand", marker = "sys_platform == 'linux' or sys_platform == 'win32'" },
808
  ]
809
  cusolver = [
810
- { name = "nvidia-cusolver", marker = "sys_platform == 'linux' or sys_platform == 'win32'" },
811
  ]
812
  cusparse = [
813
- { name = "nvidia-cusparse", marker = "sys_platform == 'linux' or sys_platform == 'win32'" },
814
  ]
815
  nvjitlink = [
816
- { name = "nvidia-nvjitlink", marker = "sys_platform == 'linux' or sys_platform == 'win32'" },
817
  ]
818
  nvrtc = [
819
- { name = "nvidia-cuda-nvrtc", marker = "sys_platform == 'linux' or sys_platform == 'win32'" },
820
  ]
821
  nvtx = [
822
- { name = "nvidia-nvtx", marker = "sys_platform == 'linux' or sys_platform == 'win32'" },
823
  ]
824
 
825
  [[package]]
@@ -2158,7 +2176,7 @@ name = "nvidia-cudnn-cu13"
2158
  version = "9.19.0.56"
2159
  source = { registry = "https://pypi.org/simple" }
2160
  dependencies = [
2161
- { name = "nvidia-cublas" },
2162
  ]
2163
  wheels = [
2164
  { url = "https://files.pythonhosted.org/packages/f1/84/26025437c1e6b61a707442184fa0c03d083b661adf3a3eecfd6d21677740/nvidia_cudnn_cu13-9.19.0.56-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:6ed29ffaee1176c612daf442e4dd6cfeb6a0caa43ddcbeb59da94953030b1be4", size = 433781201 },
@@ -2170,7 +2188,7 @@ name = "nvidia-cufft"
2170
  version = "12.0.0.61"
2171
  source = { registry = "https://pypi.org/simple" }
2172
  dependencies = [
2173
- { name = "nvidia-nvjitlink" },
2174
  ]
2175
  wheels = [
2176
  { url = "https://files.pythonhosted.org/packages/8b/ae/f417a75c0259e85c1d2f83ca4e960289a5f814ed0cea74d18c353d3e989d/nvidia_cufft-12.0.0.61-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:2708c852ef8cd89d1d2068bdbece0aa188813a0c934db3779b9b1faa8442e5f5", size = 214053554 },
@@ -2200,9 +2218,9 @@ name = "nvidia-cusolver"
2200
  version = "12.0.4.66"
2201
  source = { registry = "https://pypi.org/simple" }
2202
  dependencies = [
2203
- { name = "nvidia-cublas" },
2204
- { name = "nvidia-cusparse" },
2205
- { name = "nvidia-nvjitlink" },
2206
  ]
2207
  wheels = [
2208
  { url = "https://files.pythonhosted.org/packages/c8/c3/b30c9e935fc01e3da443ec0116ed1b2a009bb867f5324d3f2d7e533e776b/nvidia_cusolver-12.0.4.66-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:02c2457eaa9e39de20f880f4bd8820e6a1cfb9f9a34f820eb12a155aa5bc92d2", size = 223467760 },
@@ -2214,7 +2232,7 @@ name = "nvidia-cusparse"
2214
  version = "12.6.3.3"
2215
  source = { registry = "https://pypi.org/simple" }
2216
  dependencies = [
2217
- { name = "nvidia-nvjitlink" },
2218
  ]
2219
  wheels = [
2220
  { url = "https://files.pythonhosted.org/packages/f8/94/5c26f33738ae35276672f12615a64bd008ed5be6d1ebcb23579285d960a9/nvidia_cusparse-12.6.3.3-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:80bcc4662f23f1054ee334a15c72b8940402975e0eab63178fc7e670aa59472c", size = 162155568 },
@@ -3637,6 +3655,70 @@ wheels = [
3637
  { url = "https://files.pythonhosted.org/packages/6a/23/8146aad7d88f4fcb3a6218f41a60f6c2d4e3a72de72da1825dc7c8f7877c/semantic_version-2.10.0-py2.py3-none-any.whl", hash = "sha256:de78a3b8e0feda74cabc54aab2da702113e33ac9d9eb9d2389bcf1f58b7d9177", size = 15552 },
3638
  ]
3639
 
3640
  [[package]]
3641
  name = "setuptools"
3642
  version = "81.0.0"
 
660
  { name = "openai" },
661
  { name = "requests" },
662
  ]
663
+ llm-eval = [
664
+ { name = "accelerate" },
665
+ { name = "peft" },
666
+ { name = "requests" },
667
+ { name = "sentencepiece" },
668
+ { name = "torch" },
669
+ { name = "transformers" },
670
+ ]
671
  training = [
672
+ { name = "accelerate" },
673
  { name = "datasets" },
674
  { name = "peft" },
675
+ { name = "sentencepiece" },
676
  { name = "torch" },
677
  { name = "transformers" },
678
  { name = "trl" },
 
680
 
681
  [package.metadata]
682
  requires-dist = [
683
+ { name = "accelerate", marker = "extra == 'llm-eval'", specifier = ">=0.30.0" },
684
+ { name = "accelerate", marker = "extra == 'training'", specifier = ">=0.30.0" },
685
  { name = "datasets", marker = "extra == 'training'", specifier = ">=3.0.0" },
686
  { name = "fastapi", specifier = ">=0.110.0" },
687
  { name = "httpx", marker = "extra == 'dev'", specifier = ">=0.27.0" },
688
  { name = "openai", marker = "extra == 'dev'", specifier = ">=1.0.0" },
689
  { name = "openai", marker = "extra == 'inference'", specifier = ">=1.0.0" },
690
  { name = "openenv-core", specifier = ">=0.2.0" },
691
+ { name = "peft", marker = "extra == 'llm-eval'", specifier = ">=0.14.0" },
692
  { name = "peft", marker = "extra == 'training'", specifier = ">=0.14.0" },
693
  { name = "pydantic", specifier = ">=2.0.0" },
694
  { name = "pytest", marker = "extra == 'dev'", specifier = ">=8.0.0" },
695
  { name = "python-dotenv", specifier = ">=1.0.0" },
696
  { name = "requests", marker = "extra == 'dev'", specifier = ">=2.31.0" },
697
  { name = "requests", marker = "extra == 'inference'", specifier = ">=2.31.0" },
698
+ { name = "requests", marker = "extra == 'llm-eval'", specifier = ">=2.31.0" },
699
+ { name = "sentencepiece", marker = "extra == 'llm-eval'", specifier = ">=0.2.0" },
700
+ { name = "sentencepiece", marker = "extra == 'training'", specifier = ">=0.2.0" },
701
+ { name = "torch", marker = "extra == 'llm-eval'", specifier = ">=2.0.0" },
702
  { name = "torch", marker = "extra == 'training'", specifier = ">=2.0.0" },
703
+ { name = "transformers", marker = "extra == 'llm-eval'", specifier = ">=4.45.0" },
704
  { name = "transformers", marker = "extra == 'training'", specifier = ">=4.45.0" },
705
  { name = "trl", marker = "extra == 'training'", specifier = ">=0.14.0" },
706
  { name = "uvicorn", extras = ["standard"], specifier = ">=0.29.0" },
707
  ]
708
+ provides-extras = ["inference", "dev", "training", "llm-eval"]
709
 
710
  [[package]]
711
  name = "cryptography"
 
772
  version = "13.2.0"
773
  source = { registry = "https://pypi.org/simple" }
774
  dependencies = [
775
+ { name = "cuda-pathfinder", marker = "(python_full_version < '3.11' and sys_platform == 'emscripten') or (python_full_version < '3.11' and sys_platform == 'win32') or (sys_platform != 'emscripten' and sys_platform != 'win32')" },
776
  ]
777
  wheels = [
778
  { url = "https://files.pythonhosted.org/packages/1a/fe/7351d7e586a8b4c9f89731bfe4cf0148223e8f9903ff09571f78b3fb0682/cuda_bindings-13.2.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:08b395f79cb89ce0cd8effff07c4a1e20101b873c256a1aeb286e8fd7bd0f556", size = 5744254 },
 
807
 
808
  [package.optional-dependencies]
809
  cublas = [
810
+ { name = "nvidia-cublas", marker = "(python_full_version < '3.11' and sys_platform == 'win32') or sys_platform == 'linux'" },
811
  ]
812
  cudart = [
813
+ { name = "nvidia-cuda-runtime", marker = "(python_full_version < '3.11' and sys_platform == 'win32') or sys_platform == 'linux'" },
814
  ]
815
  cufft = [
816
+ { name = "nvidia-cufft", marker = "(python_full_version < '3.11' and sys_platform == 'win32') or sys_platform == 'linux'" },
817
  ]
818
  cufile = [
819
  { name = "nvidia-cufile", marker = "sys_platform == 'linux'" },
820
  ]
821
  cupti = [
822
+ { name = "nvidia-cuda-cupti", marker = "(python_full_version < '3.11' and sys_platform == 'win32') or sys_platform == 'linux'" },
823
  ]
824
  curand = [
825
+ { name = "nvidia-curand", marker = "(python_full_version < '3.11' and sys_platform == 'win32') or sys_platform == 'linux'" },
826
  ]
827
  cusolver = [
828
+ { name = "nvidia-cusolver", marker = "(python_full_version < '3.11' and sys_platform == 'win32') or sys_platform == 'linux'" },
829
  ]
830
  cusparse = [
831
+ { name = "nvidia-cusparse", marker = "(python_full_version < '3.11' and sys_platform == 'win32') or sys_platform == 'linux'" },
832
  ]
833
  nvjitlink = [
834
+ { name = "nvidia-nvjitlink", marker = "(python_full_version < '3.11' and sys_platform == 'win32') or sys_platform == 'linux'" },
835
  ]
836
  nvrtc = [
837
+ { name = "nvidia-cuda-nvrtc", marker = "(python_full_version < '3.11' and sys_platform == 'win32') or sys_platform == 'linux'" },
838
  ]
839
  nvtx = [
840
+ { name = "nvidia-nvtx", marker = "(python_full_version < '3.11' and sys_platform == 'win32') or sys_platform == 'linux'" },
841
  ]
842
 
843
  [[package]]
 
2176
  version = "9.19.0.56"
2177
  source = { registry = "https://pypi.org/simple" }
2178
  dependencies = [
2179
+ { name = "nvidia-cublas", marker = "(python_full_version < '3.11' and sys_platform == 'emscripten') or (python_full_version < '3.11' and sys_platform == 'win32') or (sys_platform != 'emscripten' and sys_platform != 'win32')" },
2180
  ]
2181
  wheels = [
2182
  { url = "https://files.pythonhosted.org/packages/f1/84/26025437c1e6b61a707442184fa0c03d083b661adf3a3eecfd6d21677740/nvidia_cudnn_cu13-9.19.0.56-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:6ed29ffaee1176c612daf442e4dd6cfeb6a0caa43ddcbeb59da94953030b1be4", size = 433781201 },
 
2188
  version = "12.0.0.61"
2189
  source = { registry = "https://pypi.org/simple" }
2190
  dependencies = [
2191
+ { name = "nvidia-nvjitlink", marker = "(python_full_version < '3.11' and sys_platform == 'emscripten') or (python_full_version < '3.11' and sys_platform == 'win32') or (sys_platform != 'emscripten' and sys_platform != 'win32')" },
2192
  ]
2193
  wheels = [
2194
  { url = "https://files.pythonhosted.org/packages/8b/ae/f417a75c0259e85c1d2f83ca4e960289a5f814ed0cea74d18c353d3e989d/nvidia_cufft-12.0.0.61-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:2708c852ef8cd89d1d2068bdbece0aa188813a0c934db3779b9b1faa8442e5f5", size = 214053554 },
 
2218
  version = "12.0.4.66"
2219
  source = { registry = "https://pypi.org/simple" }
2220
  dependencies = [
2221
+ { name = "nvidia-cublas", marker = "(python_full_version < '3.11' and sys_platform == 'emscripten') or (python_full_version < '3.11' and sys_platform == 'win32') or (sys_platform != 'emscripten' and sys_platform != 'win32')" },
2222
+ { name = "nvidia-cusparse", marker = "(python_full_version < '3.11' and sys_platform == 'emscripten') or (python_full_version < '3.11' and sys_platform == 'win32') or (sys_platform != 'emscripten' and sys_platform != 'win32')" },
2223
+ { name = "nvidia-nvjitlink", marker = "(python_full_version < '3.11' and sys_platform == 'emscripten') or (python_full_version < '3.11' and sys_platform == 'win32') or (sys_platform != 'emscripten' and sys_platform != 'win32')" },
2224
  ]
2225
  wheels = [
2226
  { url = "https://files.pythonhosted.org/packages/c8/c3/b30c9e935fc01e3da443ec0116ed1b2a009bb867f5324d3f2d7e533e776b/nvidia_cusolver-12.0.4.66-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:02c2457eaa9e39de20f880f4bd8820e6a1cfb9f9a34f820eb12a155aa5bc92d2", size = 223467760 },
 
2232
  version = "12.6.3.3"
2233
  source = { registry = "https://pypi.org/simple" }
2234
  dependencies = [
2235
+ { name = "nvidia-nvjitlink", marker = "(python_full_version < '3.11' and sys_platform == 'emscripten') or (python_full_version < '3.11' and sys_platform == 'win32') or (sys_platform != 'emscripten' and sys_platform != 'win32')" },
2236
  ]
2237
  wheels = [
2238
  { url = "https://files.pythonhosted.org/packages/f8/94/5c26f33738ae35276672f12615a64bd008ed5be6d1ebcb23579285d960a9/nvidia_cusparse-12.6.3.3-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:80bcc4662f23f1054ee334a15c72b8940402975e0eab63178fc7e670aa59472c", size = 162155568 },
 
3655
  { url = "https://files.pythonhosted.org/packages/6a/23/8146aad7d88f4fcb3a6218f41a60f6c2d4e3a72de72da1825dc7c8f7877c/semantic_version-2.10.0-py2.py3-none-any.whl", hash = "sha256:de78a3b8e0feda74cabc54aab2da702113e33ac9d9eb9d2389bcf1f58b7d9177", size = 15552 },
3656
  ]
3657
 
3658
+ [[package]]
3659
+ name = "sentencepiece"
3660
+ version = "0.2.1"
3661
+ source = { registry = "https://pypi.org/simple" }
3662
+ sdist = { url = "https://files.pythonhosted.org/packages/15/15/2e7a025fc62d764b151ae6d0f2a92f8081755ebe8d4a64099accc6f77ba6/sentencepiece-0.2.1.tar.gz", hash = "sha256:8138cec27c2f2282f4a34d9a016e3374cd40e5c6e9cb335063db66a0a3b71fad", size = 3228515 }
3663
+ wheels = [
3664
+ { url = "https://files.pythonhosted.org/packages/af/31/5b7cccb307b485db1a2372d6d2980b0a65d067f8be5ca943a103b4acd5b3/sentencepiece-0.2.1-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:e10fa50bdbaa5e2445dbd387979980d391760faf0ec99a09bd7780ff37eaec44", size = 1942557 },
3665
+ { url = "https://files.pythonhosted.org/packages/1f/41/0ac923a8e685ad290c5afc8ae55c5844977b8d75076fcc04302b9a324274/sentencepiece-0.2.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:2f27ae6deea72efdb6f361750c92f6c21fd0ad087445082770cc34015213c526", size = 1325384 },
3666
+ { url = "https://files.pythonhosted.org/packages/fc/ef/3751555d67daf9003384978f169d31c775cb5c7baf28633caaf1eb2b2b4d/sentencepiece-0.2.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:60937c959e6f44159fdd9f56fbdd302501f96114a5ba436829496d5f32d8de3f", size = 1253317 },
3667
+ { url = "https://files.pythonhosted.org/packages/46/a5/742c69b7bd144eb32b6e5fd50dbd8abbbc7a95fce2fe16e50156fa400e3b/sentencepiece-0.2.1-cp310-cp310-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:d8b1d91545578852f128650b8cce4ec20f93d39b378ff554ebe66290f2dabb92", size = 1316379 },
3668
+ { url = "https://files.pythonhosted.org/packages/c8/89/8deeafbba2871e8fa10f20f17447786f4ac38085925335728d360eaf4cae/sentencepiece-0.2.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:27e38eee653abc3d387862e67bc5c8b6f428cd604e688b85d29170b7e725c26c", size = 1387926 },
3669
+ { url = "https://files.pythonhosted.org/packages/c3/ca/67fe73005f0ab617c6a970b199754e28e524b6873aa7025224fad3cda252/sentencepiece-0.2.1-cp310-cp310-win32.whl", hash = "sha256:251874d720ac7f28024a168501f3c7bb15d1802245f6e66de565f18bbb9b5eaa", size = 999550 },
3670
+ { url = "https://files.pythonhosted.org/packages/6d/33/dc5b54042050d2dda4229c3ce1f862541c99966390b6aa20f54d520d2dc2/sentencepiece-0.2.1-cp310-cp310-win_amd64.whl", hash = "sha256:e52144670738b4b477fade6c2a9b6af71a8d0094514c9853ac9f6fc1fcfabae7", size = 1054613 },
3671
+ { url = "https://files.pythonhosted.org/packages/fa/19/1ea47f46ff97fe04422b78997da1a37cd632f414aae042d27a9009c5b733/sentencepiece-0.2.1-cp310-cp310-win_arm64.whl", hash = "sha256:9076430ac25dfa7147d9d05751dbc66a04bc1aaac371c07f84952979ea59f0d0", size = 1033884 },
3672
+ { url = "https://files.pythonhosted.org/packages/d8/15/46afbab00733d81788b64be430ca1b93011bb9388527958e26cc31832de5/sentencepiece-0.2.1-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:6356d0986b8b8dc351b943150fcd81a1c6e6e4d439772e8584c64230e58ca987", size = 1942560 },
3673
+ { url = "https://files.pythonhosted.org/packages/fa/79/7c01b8ef98a0567e9d84a4e7a910f8e7074fcbf398a5cd76f93f4b9316f9/sentencepiece-0.2.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:8f8ba89a3acb3dc1ae90f65ec1894b0b9596fdb98ab003ff38e058f898b39bc7", size = 1325385 },
3674
+ { url = "https://files.pythonhosted.org/packages/bb/88/2b41e07bd24f33dcf2f18ec3b74247aa4af3526bad8907b8727ea3caba03/sentencepiece-0.2.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:02593eca45440ef39247cee8c47322a34bdcc1d8ae83ad28ba5a899a2cf8d79a", size = 1253319 },
3675
+ { url = "https://files.pythonhosted.org/packages/a0/54/38a1af0c6210a3c6f95aa46d23d6640636d020fba7135cd0d9a84ada05a7/sentencepiece-0.2.1-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:0a0d15781a171d188b661ae4bde1d998c303f6bd8621498c50c671bd45a4798e", size = 1316162 },
3676
+ { url = "https://files.pythonhosted.org/packages/ef/66/fb191403ade791ad2c3c1e72fe8413e63781b08cfa3aa4c9dfc536d6e795/sentencepiece-0.2.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:4f5a3e0d9f445ed9d66c0fec47d4b23d12cfc858b407a03c194c1b26c2ac2a63", size = 1387785 },
3677
+ { url = "https://files.pythonhosted.org/packages/a9/2d/3bd9b08e70067b2124518b308db6a84a4f8901cc8a4317e2e4288cdd9b4d/sentencepiece-0.2.1-cp311-cp311-win32.whl", hash = "sha256:6d297a1748d429ba8534eebe5535448d78b8acc32d00a29b49acf28102eeb094", size = 999555 },
3678
+ { url = "https://files.pythonhosted.org/packages/32/b8/f709977f5fda195ae1ea24f24e7c581163b6f142b1005bc3d0bbfe4d7082/sentencepiece-0.2.1-cp311-cp311-win_amd64.whl", hash = "sha256:82d9ead6591015f009cb1be1cb1c015d5e6f04046dbb8c9588b931e869a29728", size = 1054617 },
3679
+ { url = "https://files.pythonhosted.org/packages/7a/40/a1fc23be23067da0f703709797b464e8a30a1c78cc8a687120cd58d4d509/sentencepiece-0.2.1-cp311-cp311-win_arm64.whl", hash = "sha256:39f8651bd10974eafb9834ce30d9bcf5b73e1fc798a7f7d2528f9820ca86e119", size = 1033877 },
3680
+ { url = "https://files.pythonhosted.org/packages/4a/be/32ce495aa1d0e0c323dcb1ba87096037358edee539cac5baf8755a6bd396/sentencepiece-0.2.1-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:57cae326c8727de58c85977b175af132a7138d84c764635d7e71bbee7e774133", size = 1943152 },
3681
+ { url = "https://files.pythonhosted.org/packages/88/7e/ff23008899a58678e98c6ff592bf4d368eee5a71af96d0df6b38a039dd4f/sentencepiece-0.2.1-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:56dd39a3c4d6493db3cdca7e8cc68c6b633f0d4195495cbadfcf5af8a22d05a6", size = 1325651 },
3682
+ { url = "https://files.pythonhosted.org/packages/19/84/42eb3ce4796777a1b5d3699dfd4dca85113e68b637f194a6c8d786f16a04/sentencepiece-0.2.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:d9381351182ff9888cc80e41c632e7e274b106f450de33d67a9e8f6043da6f76", size = 1253645 },
3683
+ { url = "https://files.pythonhosted.org/packages/89/fa/d3d5ebcba3cb9e6d3775a096251860c41a6bc53a1b9461151df83fe93255/sentencepiece-0.2.1-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:99f955df238021bf11f0fc37cdb54fd5e5b5f7fd30ecc3d93fb48b6815437167", size = 1316273 },
3684
+ { url = "https://files.pythonhosted.org/packages/04/88/14f2f4a2b922d8b39be45bf63d79e6cd3a9b2f248b2fcb98a69b12af12f5/sentencepiece-0.2.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:0cdfecef430d985f1c2bcbfff3defd1d95dae876fbd0173376012d2d7d24044b", size = 1387881 },
3685
+ { url = "https://files.pythonhosted.org/packages/fd/b8/903e5ccb77b4ef140605d5d71b4f9e0ad95d456d6184688073ed11712809/sentencepiece-0.2.1-cp312-cp312-win32.whl", hash = "sha256:a483fd29a34c3e34c39ac5556b0a90942bec253d260235729e50976f5dba1068", size = 999540 },
3686
+ { url = "https://files.pythonhosted.org/packages/2d/81/92df5673c067148c2545b1bfe49adfd775bcc3a169a047f5a0e6575ddaca/sentencepiece-0.2.1-cp312-cp312-win_amd64.whl", hash = "sha256:4cdc7c36234fda305e85c32949c5211faaf8dd886096c7cea289ddc12a2d02de", size = 1054671 },
3687
+ { url = "https://files.pythonhosted.org/packages/fe/02/c5e3bc518655d714622bec87d83db9cdba1cd0619a4a04e2109751c4f47f/sentencepiece-0.2.1-cp312-cp312-win_arm64.whl", hash = "sha256:daeb5e9e9fcad012324807856113708614d534f596d5008638eb9b40112cd9e4", size = 1033923 },
3688
+ { url = "https://files.pythonhosted.org/packages/ba/4a/85fbe1706d4d04a7e826b53f327c4b80f849cf1c7b7c5e31a20a97d8f28b/sentencepiece-0.2.1-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:dcd8161eee7b41aae57ded06272905dbd680a0a04b91edd0f64790c796b2f706", size = 1943150 },
3689
+ { url = "https://files.pythonhosted.org/packages/c2/83/4cfb393e287509fc2155480b9d184706ef8d9fa8cbf5505d02a5792bf220/sentencepiece-0.2.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:c6c8f42949f419ff8c7e9960dbadcfbc982d7b5efc2f6748210d3dd53a7de062", size = 1325651 },
3690
+ { url = "https://files.pythonhosted.org/packages/8d/de/5a007fb53b1ab0aafc69d11a5a3dd72a289d5a3e78dcf2c3a3d9b14ffe93/sentencepiece-0.2.1-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:097f3394e99456e9e4efba1737c3749d7e23563dd1588ce71a3d007f25475fff", size = 1253641 },
3691
+ { url = "https://files.pythonhosted.org/packages/2c/d2/f552be5928105588f4f4d66ee37dd4c61460d8097e62d0e2e0eec41bc61d/sentencepiece-0.2.1-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:d7b670879c370d350557edabadbad1f6561a9e6968126e6debca4029e5547820", size = 1316271 },
3692
+ { url = "https://files.pythonhosted.org/packages/96/df/0cfe748ace5485be740fed9476dee7877f109da32ed0d280312c94ec259f/sentencepiece-0.2.1-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:c7f0fd2f2693309e6628aeeb2e2faf6edd221134dfccac3308ca0de01f8dab47", size = 1387882 },
3693
+ { url = "https://files.pythonhosted.org/packages/ac/dd/f7774d42a881ced8e1739f393ab1e82ece39fc9abd4779e28050c2e975b5/sentencepiece-0.2.1-cp313-cp313-win32.whl", hash = "sha256:92b3816aa2339355fda2c8c4e021a5de92180b00aaccaf5e2808972e77a4b22f", size = 999541 },
3694
+ { url = "https://files.pythonhosted.org/packages/dd/e9/932b9eae6fd7019548321eee1ab8d5e3b3d1294df9d9a0c9ac517c7b636d/sentencepiece-0.2.1-cp313-cp313-win_amd64.whl", hash = "sha256:10ed3dab2044c47f7a2e7b4969b0c430420cdd45735d78c8f853191fa0e3148b", size = 1054669 },
3695
+ { url = "https://files.pythonhosted.org/packages/c9/3a/76488a00ea7d6931689cda28726a1447d66bf1a4837943489314593d5596/sentencepiece-0.2.1-cp313-cp313-win_arm64.whl", hash = "sha256:ac650534e2251083c5f75dde4ff28896ce7c8904133dc8fef42780f4d5588fcd", size = 1033922 },
3696
+ { url = "https://files.pythonhosted.org/packages/4a/b6/08fe2ce819e02ccb0296f4843e3f195764ce9829cbda61b7513f29b95718/sentencepiece-0.2.1-cp313-cp313t-macosx_10_13_universal2.whl", hash = "sha256:8dd4b477a7b069648d19363aad0cab9bad2f4e83b2d179be668efa672500dc94", size = 1946052 },
3697
+ { url = "https://files.pythonhosted.org/packages/ab/d9/1ea0e740591ff4c6fc2b6eb1d7510d02f3fb885093f19b2f3abd1363b402/sentencepiece-0.2.1-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:0c0f672da370cc490e4c59d89e12289778310a0e71d176c541e4834759e1ae07", size = 1327408 },
3698
+ { url = "https://files.pythonhosted.org/packages/99/7e/1fb26e8a21613f6200e1ab88824d5d203714162cf2883248b517deb500b7/sentencepiece-0.2.1-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:ad8493bea8432dae8d6830365352350f3b4144415a1d09c4c8cb8d30cf3b6c3c", size = 1254857 },
3699
+ { url = "https://files.pythonhosted.org/packages/bc/85/c72fd1f3c7a6010544d6ae07f8ddb38b5e2a7e33bd4318f87266c0bbafbf/sentencepiece-0.2.1-cp313-cp313t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:b81a24733726e3678d2db63619acc5a8dccd074f7aa7a54ecd5ca33ca6d2d596", size = 1315722 },
3700
+ { url = "https://files.pythonhosted.org/packages/4a/e8/661e5bd82a8aa641fd6c1020bd0e890ef73230a2b7215ddf9c8cd8e941c2/sentencepiece-0.2.1-cp313-cp313t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:0a81799d0a68d618e89063fb423c3001a034c893069135ffe51fee439ae474d6", size = 1387452 },
3701
+ { url = "https://files.pythonhosted.org/packages/99/5e/ae66c361023a470afcbc1fbb8da722c72ea678a2fcd9a18f1a12598c7501/sentencepiece-0.2.1-cp313-cp313t-win32.whl", hash = "sha256:89a3ea015517c42c0341d0d962f3e6aaf2cf10d71b1932d475c44ba48d00aa2b", size = 1002501 },
3702
+ { url = "https://files.pythonhosted.org/packages/c1/03/d332828c4ff764e16c1b56c2c8f9a33488bbe796b53fb6b9c4205ddbf167/sentencepiece-0.2.1-cp313-cp313t-win_amd64.whl", hash = "sha256:33f068c9382dc2e7c228eedfd8163b52baa86bb92f50d0488bf2b7da7032e484", size = 1057555 },
3703
+ { url = "https://files.pythonhosted.org/packages/88/14/5aee0bf0864df9bd82bd59e7711362908e4935e3f9cdc1f57246b5d5c9b9/sentencepiece-0.2.1-cp313-cp313t-win_arm64.whl", hash = "sha256:b3616ad246f360e52c85781e47682d31abfb6554c779e42b65333d4b5f44ecc0", size = 1036042 },
3704
+ { url = "https://files.pythonhosted.org/packages/24/9c/89eb8b2052f720a612478baf11c8227dcf1dc28cd4ea4c0c19506b5af2a2/sentencepiece-0.2.1-cp314-cp314-macosx_10_13_universal2.whl", hash = "sha256:5d0350b686c320068702116276cfb26c066dc7e65cfef173980b11bb4d606719", size = 1943147 },
3705
+ { url = "https://files.pythonhosted.org/packages/82/0b/a1432bc87f97c2ace36386ca23e8bd3b91fb40581b5e6148d24b24186419/sentencepiece-0.2.1-cp314-cp314-macosx_10_13_x86_64.whl", hash = "sha256:c7f54a31cde6fa5cb030370566f68152a742f433f8d2be458463d06c208aef33", size = 1325624 },
3706
+ { url = "https://files.pythonhosted.org/packages/ea/99/bbe054ebb5a5039457c590e0a4156ed073fb0fe9ce4f7523404dd5b37463/sentencepiece-0.2.1-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:c83b85ab2d6576607f31df77ff86f28182be4a8de6d175d2c33ca609925f5da1", size = 1253670 },
3707
+ { url = "https://files.pythonhosted.org/packages/19/ad/d5c7075f701bd97971d7c2ac2904f227566f51ef0838dfbdfdccb58cd212/sentencepiece-0.2.1-cp314-cp314-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:1855f57db07b51fb51ed6c9c452f570624d2b169b36f0f79ef71a6e6c618cd8b", size = 1316247 },
3708
+ { url = "https://files.pythonhosted.org/packages/fb/03/35fbe5f3d9a7435eebd0b473e09584bd3cc354ce118b960445b060d33781/sentencepiece-0.2.1-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:01e6912125cb45d3792f530a4d38f8e21bf884d6b4d4ade1b2de5cf7a8d2a52b", size = 1387894 },
3709
+ { url = "https://files.pythonhosted.org/packages/dc/aa/956ef729aafb6c8f9c443104c9636489093bb5c61d6b90fc27aa1a865574/sentencepiece-0.2.1-cp314-cp314-win32.whl", hash = "sha256:c415c9de1447e0a74ae3fdb2e52f967cb544113a3a5ce3a194df185cbc1f962f", size = 1096698 },
3710
+ { url = "https://files.pythonhosted.org/packages/b8/cb/fe400d8836952cc535c81a0ce47dc6875160e5fedb71d2d9ff0e9894c2a6/sentencepiece-0.2.1-cp314-cp314-win_amd64.whl", hash = "sha256:881b2e44b14fc19feade3cbed314be37de639fc415375cefaa5bc81a4be137fd", size = 1155115 },
3711
+ { url = "https://files.pythonhosted.org/packages/32/89/047921cf70f36c7b6b6390876b2399b3633ab73b8d0cb857e5a964238941/sentencepiece-0.2.1-cp314-cp314-win_arm64.whl", hash = "sha256:2005242a16d2dc3ac5fe18aa7667549134d37854823df4c4db244752453b78a8", size = 1133890 },
3712
+ { url = "https://files.pythonhosted.org/packages/a1/11/5b414b9fae6255b5fb1e22e2ed3dc3a72d3a694e5703910e640ac78346bb/sentencepiece-0.2.1-cp314-cp314t-macosx_10_13_universal2.whl", hash = "sha256:a19adcec27c524cb7069a1c741060add95f942d1cbf7ad0d104dffa0a7d28a2b", size = 1946081 },
3713
+ { url = "https://files.pythonhosted.org/packages/77/eb/7a5682bb25824db8545f8e5662e7f3e32d72a508fdce086029d89695106b/sentencepiece-0.2.1-cp314-cp314t-macosx_10_13_x86_64.whl", hash = "sha256:e37e4b4c4a11662b5db521def4e44d4d30ae69a1743241412a93ae40fdcab4bb", size = 1327406 },
3714
+ { url = "https://files.pythonhosted.org/packages/03/b0/811dae8fb9f2784e138785d481469788f2e0d0c109c5737372454415f55f/sentencepiece-0.2.1-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:477c81505db072b3ab627e7eab972ea1025331bd3a92bacbf798df2b75ea86ec", size = 1254846 },
3715
+ { url = "https://files.pythonhosted.org/packages/ef/23/195b2e7ec85ebb6a547969f60b723c7aca5a75800ece6cc3f41da872d14e/sentencepiece-0.2.1-cp314-cp314t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:010f025a544ef770bb395091d57cb94deb9652d8972e0d09f71d85d5a0816c8c", size = 1315721 },
3716
+ { url = "https://files.pythonhosted.org/packages/7e/aa/553dbe4178b5f23eb28e59393dddd64186178b56b81d9b8d5c3ff1c28395/sentencepiece-0.2.1-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:733e59ff1794d26db706cd41fc2d7ca5f6c64a820709cb801dc0ea31780d64ab", size = 1387458 },
3717
+ { url = "https://files.pythonhosted.org/packages/66/7c/08ff0012507297a4dd74a5420fdc0eb9e3e80f4e88cab1538d7f28db303d/sentencepiece-0.2.1-cp314-cp314t-win32.whl", hash = "sha256:d3233770f78e637dc8b1fda2cd7c3b99ec77e7505041934188a4e7fe751de3b0", size = 1099765 },
3718
+ { url = "https://files.pythonhosted.org/packages/91/d5/2a69e1ce15881beb9ddfc7e3f998322f5cedcd5e4d244cb74dade9441663/sentencepiece-0.2.1-cp314-cp314t-win_amd64.whl", hash = "sha256:5e4366c97b68218fd30ea72d70c525e6e78a6c0a88650f57ac4c43c63b234a9d", size = 1157807 },
3719
+ { url = "https://files.pythonhosted.org/packages/f3/16/54f611fcfc2d1c46cbe3ec4169780b2cfa7cf63708ef2b71611136db7513/sentencepiece-0.2.1-cp314-cp314t-win_arm64.whl", hash = "sha256:105e36e75cbac1292642045458e8da677b2342dcd33df503e640f0b457cb6751", size = 1136264 },
3720
+ ]
3721
+
3722
  [[package]]
3723
  name = "setuptools"
3724
  version = "81.0.0"