Update README: add Round 2 generalization results, procedural eval data, train_continuation.py 45ae1a6 verified A-HK committed on Apr 26
Rewrite BLOG.md: add Round 2 generalization results, full technical depth, honest limitations 62732ff verified A-HK committed on Apr 26
Add continuation training script: procedural scenarios, generalization eval 4bbb603 verified A-HK committed on Apr 26
Rewrite README with full technical implementation details, research references, real training numbers 9e217f9 verified A-HK committed on Apr 26
Clean up: remove build artifacts, debug logs, old notebook, internal docs c4e8937 verified A-HK committed on Apr 26
Fix plot crash: filter None values from metrics before np.convolve 4728f59 verified A-HK committed on Apr 26
Upload artifacts/baseline_vs_trained.png with huggingface_hub df3e8b5 verified A-HK committed on Apr 26
Redesigned train_job.py: fix reward signal (100x), DAPO-style GRPO, curriculum learning daceb77 verified A-HK committed on Apr 26
Delete OpenEnv Hackathon - Provided Resources and Guidelines.pdf 8abfce0 verified A-HK committed on Apr 25
Add anti-reward-hacking test suite: scan_logs spam, false positive exploitation, conclude gaming, repeat action exploit 8a72c51 verified A-HK committed on Apr 25
Fix train_grpo.py: rename SFT path, add actual GRPO CLI entrypoint with wired-in procedural gen + trajectory filter + adversarial attacker 5214eea verified A-HK committed on Apr 25
Rewrite README as clear project blog with all features documented 95852d4 verified A-HK committed on Apr 25
Fix: set response_schema for Qwen2.5 tool-calling + add jmespath dep 598c4c8 verified A-HK committed on Apr 25
Fix: remove max_prompt_length (removed in TRL 1.x, Unsloth rejects it) 93ce912 verified A-HK committed on Apr 25
Remove max_prompt_length from GRPOConfig (removed in TRL 1.x, Unsloth rejects it) f6cacba verified A-HK committed on Apr 25
Fix dtype='auto' -> None for Unsloth compatibility (T4 requires float16, Unsloth auto-detects with None) 62e0566 verified A-HK committed on Apr 25
Fix total_mem bug + integrate procedural gen, security APIs, StarPO-S, adversarial attacker into notebook 2b7135f verified A-HK committed on Apr 25
Add adversarial attacker for self-play curriculum. Reference: SPIRAL (arXiv 2506.24119). Adaptive attacker that adds decoys, obscures IOCs, tightens deadlines, and exploits defender blind spots. Creates automatic difficulty scaling. 93a36f8 verified A-HK committed on Apr 25
Add StarPO-S trajectory filtering for GRPO stability. Reference: RAGEN (arXiv 2504.20073). Filters to keep only informative rollouts, preventing echo trap collapse. 60fa6a8 verified A-HK committed on Apr 25
Add real-world security API integrations (OSV.dev, deps.dev, GHSA, npm). Free, no-auth-required live lookups against security databases. Adds genuine information-gathering vs time-pressure tradeoff. 581cdcc verified A-HK committed on Apr 25
Add procedural scenario generation (infinite training variation). Generates unique supply-chain incidents from parameterized distributions. Reference: Self-Evolving Curriculum (arXiv 2505.14970). 1053202 verified A-HK committed on Apr 25
Replace notebook with complete GRPO training pipeline (TRL environment_factory + Unsloth QLoRA) 5d6d233 verified A-HK committed on Apr 25