CodeCourt Solver: GRPO v1
A LoRA adapter fine-tuned on Qwen2.5-0.5B-Instruct via GRPO (Group Relative Policy Optimization), trained inside the CodeCourt adversarial self-play environment.
The Solver was trained in an adversarial setup where a Setter agent generates coding problems and hidden edge-case traps.
Model Description
| Field | Details |
|---|---|
| Developed by | ayussssssiiii |
| Base model | Qwen/Qwen2.5-0.5B-Instruct |
| Model type | Causal LM + LoRA adapter (PEFT) |
| Language | English + Python (code generation) |
| License | MIT |
| Training method | GRPO via HuggingFace TRL |
| Repository | ayushoncode/CodeCourt |
| Demo | ayussssssiiii/codecourt |
What Is CodeCourt?
Standard coding benchmarks test what a model has memorized. CodeCourt tests what happens when another LLM is actively trying to break it.
The training loop works as follows:
- Setter (Red Team) generates a coding problem with hidden edge-case traps
- Solver (Blue Team), which is this model, produces a solution
- Oracle executes the code in a real sandbox and checks all tests including hidden ones
- Rewards flow: Setter is rewarded when Solver fails; Solver is rewarded when it passes everything
This creates an adversarial training loop where the Solver is optimized against hidden tests rather than a static benchmark.
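The loop above can be sketched as a single scored episode. This is a minimal illustration, not CodeCourt's actual code: the `Problem` type, the `oracle` and `play_episode` helpers, the toy problem, and the zero-sum +1/-1 rewards are all assumptions made for the example. The real Oracle runs code in a sandbox, whereas this sketch calls the solution in-process.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical types for illustration; not CodeCourt's actual API.
TestCase = Tuple[tuple, object]  # (positional args, expected output)


@dataclass
class Problem:
    statement: str
    public_tests: List[TestCase]
    hidden_tests: List[TestCase]  # the Setter's edge-case traps


def oracle(solution: Callable, tests: List[TestCase]) -> bool:
    """Return True only if the solution passes every test without raising."""
    for args, expected in tests:
        try:
            if solution(*args) != expected:
                return False
        except Exception:
            return False
    return True


def play_episode(problem: Problem, solution: Callable) -> dict:
    """Score one Setter-vs-Solver round (zero-sum +1/-1 is an assumption)."""
    solver_wins = oracle(solution, problem.public_tests + problem.hidden_tests)
    return {
        "solver_reward": 1.0 if solver_wins else -1.0,
        "setter_reward": -1.0 if solver_wins else 1.0,
    }


# Toy problem: maximum of a list, with an all-negative hidden trap.
problem = Problem(
    statement="Return the maximum element of a non-empty list.",
    public_tests=[(([1, 5, 3],), 5)],
    hidden_tests=[(([-4, -2, -9],), -2)],
)


def naive_max(nums):
    best = 0  # bug: wrong starting value for all-negative input
    for n in nums:
        best = max(best, n)
    return best


print(play_episode(problem, naive_max))  # hidden trap catches the bug
print(play_episode(problem, max))        # passes public and hidden tests
```

The buggy `naive_max` passes the public test but fails the hidden one, which is exactly the case where the Setter is rewarded.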
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-0.5B-Instruct |
| Training method | GRPO (Group Relative Policy Optimization) |
| Training framework | HuggingFace TRL |
| Training steps | 100 |
| Training samples | 54 |
| Max completion length | 768 tokens |
| Adapter type | LoRA |
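For readers unfamiliar with GRPO, the "group relative" part can be illustrated as follows: several completions are sampled for the same prompt, and each completion's reward is normalized against the mean and standard deviation of its own group. The sketch below shows only that normalization step, using the population standard deviation and made-up rewards; it is not the TRL implementation, and real implementations may differ in details such as the epsilon or the variance estimator.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, with made-up rewards.
advantages = group_relative_advantages([1.0, -1.0, 1.0, -1.0])
print([round(a, 3) for a in advantages])
```

Completions that score above their group's average receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.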
Reward Function (5 signals)
solver_reward = (
    correctness_score          # Did ALL tests pass?
    + complexity_match         # Right algorithmic complexity?
    - brute_force_penalty      # O(n²) when O(n log n) expected?
    - hidden_test_regression   # Passed public, failed hidden?
    - unsafe_pattern_penalty   # Suspicious imports caught?
)
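A runnable sketch of how these five signals might be combined is shown below. The magnitudes (10, 2, 3, 5, 8), the `SUSPICIOUS_MODULES` set, and the function signatures are all illustrative assumptions; the actual weights and detection rules used in training are not stated in this card.

```python
import ast

# Hypothetical set of modules treated as suspicious; the real list is not given here.
SUSPICIOUS_MODULES = {"os", "subprocess", "socket", "shutil"}


def has_unsafe_import(source: str) -> bool:
    """Detect imports of suspicious modules by walking the parsed AST."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        if any(name.split(".")[0] in SUSPICIOUS_MODULES for name in names):
            return True
    return False


def solver_reward(all_tests_passed, complexity_ok, brute_force,
                  passed_public_failed_hidden, source):
    """Combine the five signals; every magnitude below is an assumption."""
    correctness_score = 10.0 if all_tests_passed else 0.0
    complexity_match = 2.0 if complexity_ok else 0.0
    brute_force_penalty = 3.0 if brute_force else 0.0
    hidden_test_regression = 5.0 if passed_public_failed_hidden else 0.0
    unsafe_pattern_penalty = 8.0 if has_unsafe_import(source) else 0.0
    return (correctness_score + complexity_match - brute_force_penalty
            - hidden_test_regression - unsafe_pattern_penalty)


clean = "import bisect\ndef solve(nums):\n    return len(nums)\n"
risky = "import os\ndef solve(nums):\n    return len(nums)\n"
print(solver_reward(True, True, False, False, clean))
print(solver_reward(False, False, True, True, risky))
```

Parsing the source with `ast` rather than searching for the string `import os` avoids false positives from comments and string literals.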
Results
Training Metrics
| Metric | Baseline | Trained (committed artifacts) |
|---|---|---|
| Hidden-test pass rate | 54.7% | n/a |
| Best solver reward | n/a | +34.31 (step 26, per training_history.json) |
| Brute-force penalty triggers | 46.7% of episodes | 0.0% (per committed comparison package) |
| Setter win rate | 56.7% | 0.0% (per committed comparison package) |
| Training steps | n/a | 100 / 100 |
Boundary Probe: 6 Adversarial Edge Cases
The committed reference/trained-side artifact reaches 100% on the boundary probe (up from 16.7% baseline). These 6 cases were locked before training began and never exposed during the training loop.
| Case | What It Tests | Baseline | Reference/Trained Side |
|---|---|---|---|
| graph_shortest_path_single_node | 1-node graph, 0 edges | | ✓ |
| graph_shortest_path_two_hop | Indirect path only | | ✓ |
| graph_bipartite_min_odd_cycle | Odd cycle boundary | | ✓ |
| array_lis_hidden_valley | Valley breaks greedy LIS | | ✓ |
| dp_lcs_order_sensitive | Reversed string pair | | ✓ |
| Overall | | 16.7% | 100.0% |
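A boundary probe of this kind can be sketched as a fixed dictionary of cases run against a candidate solution. The inputs and expected outputs below are hypothetical, modeled only on two case names from the table; the actual probe definitions live in the repository and are not reproduced here. The `shortest_path` candidate and the `run_probe` helper are likewise assumptions for illustration.

```python
from collections import deque


def shortest_path(n, edges, src, dst):
    """Candidate under test: unweighted shortest path length via BFS, -1 if unreachable."""
    graph = {i: [] for i in range(n)}
    for u, v in edges:
        graph[u].append(v)
        graph[v].append(u)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return -1


# Hypothetical probe inputs modeled on two case names from the table.
PROBE = {
    "graph_shortest_path_single_node": ((1, [], 0, 0), 0),
    "graph_shortest_path_two_hop": ((3, [(0, 1), (1, 2)], 0, 2), 2),
}


def run_probe(candidate, probe):
    """Return per-case pass/fail and the overall pass rate in percent."""
    results = {}
    for name, (args, expected) in probe.items():
        try:
            results[name] = candidate(*args) == expected
        except Exception:
            results[name] = False
    rate = 100.0 * sum(results.values()) / len(results)
    return results, rate


results, rate = run_probe(shortest_path, PROBE)
print(results, f"{rate:.1f}%")
```

With six cases, the reported percentages correspond to 1 of 6 passing (16.7%) for the baseline and 6 of 6 (100.0%) for the reference/trained side.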
How to Use
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

# Load LoRA adapter on top of the base model
model = PeftModel.from_pretrained(
    base_model,
    "ayussssssiiii/codecourt-solver-grpo-v1",
)

# Run inference
prompt = """Solve this coding problem. Write efficient Python code.
Problem: Given an array of integers, find the length of the longest increasing subsequence.
Input: nums = [10, 9, 2, 5, 3, 7, 101, 18]
Expected output: 4
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# temperature only takes effect when sampling is enabled
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
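To sanity-check what the model generates for the example prompt, it helps to have a known-good reference. The sketch below is a standard O(n log n) solution using binary search over subsequence tails; it is a reference for comparison, not the model's output, and it assumes "increasing" means strictly increasing.

```python
from bisect import bisect_left


def length_of_lis(nums):
    """Length of the longest strictly increasing subsequence in O(n log n)."""
    # tails[k] = smallest possible tail of an increasing subsequence of length k + 1
    tails = []
    for x in nums:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)   # x extends the longest subsequence found so far
        else:
            tails[i] = x      # x gives a smaller tail for length i + 1
    return len(tails)


print(length_of_lis([10, 9, 2, 5, 3, 7, 101, 18]))  # 4
```

An O(n²) dynamic-programming solution returns the same answer on this input, but it is the kind of approach the brute-force penalty is meant to discourage when O(n log n) is expected.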
What the Model Learned
After 100 GRPO steps of adversarial self-play training:
- The committed comparison package reports brute-force penalties dropping from 46.7% to 0.0% of episodes
- The committed reference/trained-side artifact reaches 100% on the boundary probe (6/6 adversarial edge cases)
- Best reward reached +34.31 at step 26 (per training_history.json)
Limitations
- Small base model (0.5B parameters); complex multi-step algorithms may still fail
- Training was limited to 100 steps; a longer run with a larger max-completion-length is expected to improve stability
- Reward peaked at step 26 and then declined; generation length appeared to be a key bottleneck in this run
Citation
@misc{codecourt2026,
  author = {ayussssssiiii},
  title = {CodeCourt: Adversarial Code Auditing via LLM Self-Play},
  year = {2026},
  howpublished = {\url{https://github.com/ayushoncode/CodeCourt}},
}
Trained inside an adversarial self-play loop, not a static dataset.