Expand to 50 tickets with resolution_hints - tickets.py and notebook ALL_TICKETS in sync b7372b5 Vighnesh committed on Apr 26
Add clarifying comment: loop penalty intentionally omitted in notebook (episode-level concern, live env handles it) 97ddb7d Vighnesh committed on Apr 26
Sync notebook _local_reward: wire resolution_hint + classified_correctly into task3 reward, cls_credit into task2 step-1 93164f3 Vighnesh committed on Apr 26
Sync notebook Cell 7 graders with graders.py fixes #2 #3 #4 #5 - smoke test passes 2e680c9 Vighnesh committed on Apr 26
Fix #5: accumulate Task 2 classification credit into final score - action scaled to 0.7 max, classify adds up to 0.3, total 1.0 55ff252 Vighnesh committed on Apr 26
Fix #4: use resolution_hint in reply scoring - category hits 0.03, hint hits 0.05, cap 0.25 (intentional specificity incentive) 3d8844e Vighnesh committed on Apr 26
Fix #3: track _classified_correctly separately - wrong classification no longer gets free 0.20 credit in Task 3; TODO comment added to Task 2 classify branch 93f0ae5 Vighnesh committed on Apr 26
Fix #2: cap _reply_quality at 0.25, add case-insensitive punctuation-stripped matching (weights now sum to exactly 1.0) 4744d17 Vighnesh committed on Apr 26
Highlight Theme 3.1 + Scaler sub-theme fit, promote GRPO results section 3d83a5d Vighnesh committed on Apr 26
Update: replace broken chart with winning GRPO results (Overall 0.29->0.57) e531507 Vighnesh committed on Apr 26
Fix: add gradio to pyproject.toml deps, update README structure to match actual files d771897 Vighnesh committed on Apr 26
Fix SFTConfig: move max_seq_length + dataset_text_field to SFTTrainer (trl API change) 2e81e98 AlgoCore committed on Apr 25
Add train_sft.ipynb: SFT pre-training with 1000 gold-label examples before GRPO cf0d796 AlgoCore committed on Apr 25
Fix AttributeError: 'str' has no .get - add _safe_parse() always returns dict, guard in _local_reward 42a3169 AlgoCore committed on Apr 25
Fix 500 errors: use LocalEnv for eval (live env is single-instance stateful, breaks under concurrent calls) d637715 AlgoCore committed on Apr 25
Fix sanity checks: use correct seed->ticket mapping, json.dumps for completions c210c77 AlgoCore committed on Apr 25
Fix CUDA illegal memory access: drop 4-bit quant, load fp16 natively (0.5B fits in T4), disable DataParallel, fix eval seeds 7a6a712 AlgoCore committed on Apr 25
Sync _local_reward exactly to graders.py: task2 partial credit, task3 efficiency bonus, loop penalty, reply_quality 0-0.5 a63afc4 AlgoCore committed on Apr 25
Expand dataset: 50 tickets x 200 seeds x 3 tasks x 2 steps = ~500 samples, local reward fn, 3 epochs f86d249 AlgoCore committed on Apr 25
Fix: single GPU device_map, safe Obs parsing, stricter system prompt (no respond/resolve) 2066c50 AlgoCore committed on Apr 25
Rewrite: real GRPO using trl.GRPOTrainer with proper KL + clipped ratio + reference model 31338a8 AlgoCore committed on Apr 25
Fix: separate inference/training modes - use_cache=True for generate, gradient_checkpointing only during train 3a05cea AlgoCore committed on Apr 25
Kaggle compatibility: auto-detect runtime, fix output paths, remove Colab-only download d5ed509 AlgoCore committed on Apr 25
Remove Unsloth: use standard HF transformers + PEFT for GRPO training 9cab132 AlgoCore committed on Apr 25
feat: auto-fallback to local env mirror when live API unreachable d4f63e0 AlgoCore committed on Apr 25
fix: replace remote API with local env for reliable training rewards 243a9db AlgoCore committed on Apr 25
feat: upgrade training notebook to Unsloth (2x faster, 4-bit LoRA) b8713e5 AlgoCore committed on Apr 25
feat: fix scoring formula, improve system prompt, boost reply quality grader - Task1 1.0, Task2 0.60, Task3 0.41, Overall 0.67 69afce9 AlgoCore committed on Apr 25
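
The scoring rules named in Fix #2, Fix #4, and Fix #5 above can be illustrated with a minimal sketch. It is not the code in graders.py: the function names `reply_quality` and `task2_score`, their parameters, and the use of whole-phrase matching are assumptions made for illustration. Only the constants (0.03 per category hit, 0.05 per hint hit, cap 0.25, 0.7 / 0.3 split) come from the commit messages.

```python
import string

# Constants taken from the commit messages (Fix #2, #4, #5).
CATEGORY_HIT = 0.03   # credit per matched category keyword
HINT_HIT = 0.05       # credit per matched resolution_hint keyword
REPLY_CAP = 0.25      # upper bound on reply quality
ACTION_MAX = 0.7      # Task 2: share of the score from the action
CLASSIFY_MAX = 0.3    # Task 2: share of the score from classification

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def _normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def reply_quality(reply, category_keywords, hint_keywords):
    """Hypothetical reply scorer: keyword hits, capped at REPLY_CAP.

    Assumption: a keyword counts as a hit when its normalized form
    appears in the normalized reply as a whole word or phrase.
    """
    padded = f" {_normalize(reply)} "

    def hits(keywords):
        count = 0
        for kw in keywords:
            norm_kw = _normalize(kw)
            if norm_kw and f" {norm_kw} " in padded:
                count += 1
        return count

    score = CATEGORY_HIT * hits(category_keywords) + HINT_HIT * hits(hint_keywords)
    return min(score, REPLY_CAP)


def task2_score(action_score, classify_credit):
    """Hypothetical Task 2 total: both inputs are clamped to [0, 1]."""
    action = max(0.0, min(1.0, action_score))
    classify = max(0.0, min(1.0, classify_credit))
    return ACTION_MAX * action + CLASSIFY_MAX * classify


if __name__ == "__main__":
    reply = "Please reset your password, then clear the browser cache."
    print(round(reply_quality(reply, ["password", "account"],
                              ["Reset your password!", "clear cache"]), 2))
    print(round(task2_score(1.0, 0.5), 2))
```

Weighting hint hits above category hits matches the "intentional specificity incentive" noted in Fix #4: a reply that addresses the ticket's specific resolution earns more than one that only mentions generic category terms, and the cap keeps keyword stuffing from dominating the total.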