⚖️ CodeCourt Solver — GRPO v1

A LoRA adapter for Qwen2.5-0.5B-Instruct, fine-tuned with GRPO (Group Relative Policy Optimization) inside the CodeCourt adversarial self-play environment.

The Solver was trained in an adversarial setup where a Setter agent generates coding problems with hidden edge-case traps.

Model Description

Field	Details
Developed by	ayussssssiiii
Base model	Qwen/Qwen2.5-0.5B-Instruct
Model type	Causal LM + LoRA adapter (PEFT)
Language	English + Python (code generation)
License	MIT
Training method	GRPO via HuggingFace TRL
Repository	ayushoncode/CodeCourt
Demo	ayussssssiiii/codecourt

What Is CodeCourt?

Standard coding benchmarks test what a model has memorized. CodeCourt tests what happens when another LLM is actively trying to break it.

The training loop works as follows:

Setter (Red Team) generates a coding problem with hidden edge-case traps
Solver (Blue Team) — this model — produces a solution
Oracle executes the code in a real sandbox and checks all tests including hidden ones
Rewards flow: Setter is rewarded when Solver fails; Solver is rewarded when it passes everything

This creates an adversarial training loop where the Solver is optimized against hidden tests rather than a static benchmark.
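
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the CodeCourt implementation: the `Problem` class, the `oracle` and `play_episode` functions, the ±1.0 reward values, and the toy problem are all hypothetical, and the in-process function call stands in for the real sandboxed execution.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Problem:
    statement: str
    public_tests: List[Tuple[tuple, object]]  # (args, expected), visible to the Solver
    hidden_tests: List[Tuple[tuple, object]]  # edge-case traps, visible only to the Oracle


def oracle(solution: Callable, problem: Problem) -> bool:
    """Return True only if the solution passes every public and hidden test."""
    for args, expected in problem.public_tests + problem.hidden_tests:
        try:
            if solution(*args) != expected:
                return False
        except Exception:
            return False  # a crash counts as a failed test
    return True


def play_episode(problem: Problem, solution: Callable) -> Tuple[float, float]:
    """Return (setter_reward, solver_reward); the values are illustrative only."""
    solved = oracle(solution, problem)
    return (-1.0, 1.0) if solved else (1.0, -1.0)


# Hypothetical Setter output: the hidden test probes the empty-list edge case.
problem = Problem(
    statement="Return the largest element, or None for an empty list.",
    public_tests=[(([3, 1, 2],), 3)],
    hidden_tests=[(([],), None)],
)


def naive_max(xs):
    return max(xs)  # raises ValueError on an empty list


def safe_max(xs):
    return max(xs) if xs else None


naive_outcome = play_episode(problem, naive_max)  # Setter wins
safe_outcome = play_episode(problem, safe_max)    # Solver wins
```

The naive solution passes the public test but trips the hidden one, which is the situation the Setter is rewarded for creating.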

Training Details

Parameter	Value
Base model	Qwen/Qwen2.5-0.5B-Instruct
Training method	GRPO (Group Relative Policy Optimization)
Training framework	HuggingFace TRL
Training steps	100
Training samples	54
Max completion length	768 tokens
Adapter type	LoRA
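
For readers unfamiliar with GRPO, its core idea is that each completion is scored relative to the other completions sampled for the same prompt, rather than against a learned value function. The sketch below shows that normalization in plain Python. The group size, the reward values, and the use of the population standard deviation are assumptions for illustration; library implementations may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four completions for one prompt, two of which passed.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# When all completions score the same, no completion is preferred.
flat = group_relative_advantages([2.0, 2.0])
```

Completions that beat their group average receive a positive advantage and are reinforced; those below it are pushed down.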

Reward Function (5 signals)

solver_reward = (
    correctness_score        # Did ALL tests pass?
  + complexity_match         # Right algorithmic complexity?
  - brute_force_penalty      # O(n²) when O(n log n) expected?
  - hidden_test_regression   # Passed public, failed hidden?
  - unsafe_pattern_penalty   # Suspicious imports caught?
)
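
As a runnable illustration of how the five signals might combine, the sketch below treats each signal as a boolean and applies a fixed weight. The `EpisodeSignals` class, the boolean encoding, and every weight value are hypothetical; the actual reward implementation and its magnitudes are not described in this card.

```python
from dataclasses import dataclass


@dataclass
class EpisodeSignals:
    all_tests_passed: bool
    complexity_matches: bool
    used_brute_force: bool
    hidden_regression: bool  # passed public tests but failed hidden ones
    unsafe_pattern: bool


# Hypothetical weights, chosen only to make the example concrete.
WEIGHTS = {
    "correctness": 10.0,
    "complexity": 3.0,
    "brute_force": 4.0,
    "hidden_regression": 5.0,
    "unsafe": 8.0,
}


def solver_reward(s: EpisodeSignals, w=WEIGHTS) -> float:
    """Two positive terms minus three penalties, mirroring the formula above."""
    return (
        w["correctness"] * s.all_tests_passed
        + w["complexity"] * s.complexity_matches
        - w["brute_force"] * s.used_brute_force
        - w["hidden_regression"] * s.hidden_regression
        - w["unsafe"] * s.unsafe_pattern
    )


clean = solver_reward(EpisodeSignals(True, True, False, False, False))
trapped = solver_reward(EpisodeSignals(False, False, True, True, False))
```

With these illustrative weights, a clean solution scores 13.0 and a brute-force solution that fails hidden tests scores -9.0.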

Results

Training Metrics

Metric	Baseline	Trained (committed artifacts)
Hidden-test pass rate	54.7%	—
Best solver reward	—	+34.31 (step 26, per training_history.json)
Brute-force penalty triggers	46.7% of episodes	0.0% (per committed comparison package)
Setter win rate	56.7%	0.0% (per committed comparison package)
Training steps	—	100 / 100

Boundary Probe — 6 Adversarial Edge Cases

The committed reference/trained-side artifact reaches 100% on the boundary probe (6 of 6 cases), up from a 16.7% baseline (1 of 6). These 6 cases were locked before training began and never exposed during the training loop. Five of the six cases are itemized below.

Case	What It Tests	Baseline	Reference/Trained Side
`graph_shortest_path_single_node`	1-node graph, 0 edges	❌	✅
`graph_shortest_path_two_hop`	Indirect path only	❌	✅
`graph_bipartite_min_odd_cycle`	Odd cycle boundary	❌	✅
`array_lis_hidden_valley`	Valley breaks greedy LIS	❌	✅
`dp_lcs_order_sensitive`	Reversed string pair	❌	✅
Overall		16.7%	100.0%
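
To make the first two rows concrete, here is a sketch of what a single-node case and a two-hop case could look like against a standard Dijkstra implementation. The function signature, the undirected-graph assumption, the `-1` convention for unreachable nodes, and the inputs are all hypothetical; the actual probe definitions live in the repository and are not reproduced here.

```python
import heapq


def shortest_path(n, edges, src, dst):
    """Dijkstra over an undirected weighted graph; returns -1 if dst is unreachable."""
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    dist = [float("inf")] * n
    dist[src] = 0
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist[dst] if dist[dst] != float("inf") else -1


# 1-node graph, 0 edges: the source is also the destination.
single_node = shortest_path(1, [], 0, 0)

# No direct edge between 0 and 2; the only route goes through node 1.
two_hop = shortest_path(3, [(0, 1, 2), (1, 2, 3)], 0, 2)
```

Solutions that assume at least one edge, or that only inspect direct neighbors, would fail cases of this shape.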

How to Use

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "ayussssssiiii/codecourt-solver-grpo-v1"
)

# Run inference
prompt = """Solve this coding problem. Write efficient Python code.

Problem: Given an array of integers, find the length of the longest increasing subsequence.

Input: nums = [10, 9, 2, 5, 3, 7, 101, 18]
Expected output: 4
"""

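inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# temperature only takes effect when sampling is enabled, so set do_sample explicitly
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))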
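
To check a completion for the prompt above, one option is to compare it against a known-correct reference. The sketch below is an independent O(n log n) implementation using the standard-library `bisect` module; it is not model output, and the function name `length_of_lis` is chosen here for illustration.

```python
from bisect import bisect_left


def length_of_lis(nums):
    """Length of the longest strictly increasing subsequence in O(n log n)."""
    tails = []  # tails[k] = smallest tail of any increasing subsequence of length k + 1
    for x in nums:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)  # x extends the longest subsequence found so far
        else:
            tails[i] = x     # x gives a smaller tail for length i + 1
    return len(tails)


reference = length_of_lis([10, 9, 2, 5, 3, 7, 101, 18])
```

For the example input this returns 4, matching the expected output in the prompt.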

What the Model Learned

After 100 GRPO steps of adversarial self-play training:

The committed comparison package reports brute-force penalties dropping from 46.7% to 0.0% of episodes
The committed reference/trained-side artifact reaches 100% on the boundary probe (6/6 adversarial edge cases)
Best reward reached +34.31 at step 26 (per training_history.json)

Limitations

Small base model (0.5B parameters): the model may still fail on complex multi-step algorithms
Training was limited to 100 steps; a longer run with a larger max completion length is expected to improve stability
Reward peaked at step 26 and then declined; generation length appeared to be a key bottleneck in this run

Citation

@misc{codecourt2026,
  author       = {ayussssssiiii},
  title        = {CodeCourt: Adversarial Code Auditing via LLM Self-Play},
  year         = {2026},
  howpublished = {\url{https://github.com/ayushoncode/CodeCourt}},
}

Trained inside an adversarial self-play loop — not a static dataset.
