βš–οΈ CodeCourt Solver β€” GRPO v1

A LoRA adapter fine-tuned on Qwen2.5-0.5B-Instruct via GRPO (Group Relative Policy Optimization), trained inside the CodeCourt adversarial self-play environment.

The Solver was trained in an adversarial setup where a Setter agent generates coding problems and hidden edge-case traps.


Model Description

| Field | Details |
|---|---|
| Developed by | ayussssssiiii |
| Base model | Qwen/Qwen2.5-0.5B-Instruct |
| Model type | Causal LM + LoRA adapter (PEFT) |
| Language | English + Python (code generation) |
| License | MIT |
| Training method | GRPO via HuggingFace TRL |
| Repository | ayushoncode/CodeCourt |
| Demo | ayussssssiiii/codecourt |

What Is CodeCourt?

Standard coding benchmarks test what a model has memorized. CodeCourt tests what happens when another LLM is actively trying to break it.

The training loop works as follows:

  1. Setter (Red Team) generates a coding problem with hidden edge-case traps
  2. Solver (Blue Team, this model) produces a solution
  3. Oracle executes the code in a real sandbox and checks all tests including hidden ones
  4. Rewards flow: Setter is rewarded when Solver fails; Solver is rewarded when it passes everything

This creates an adversarial training loop where the Solver is optimized against hidden tests rather than a static benchmark.
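
The four steps above can be sketched as a single episode. This is a minimal illustration under stated assumptions, not the CodeCourt implementation: `Problem`, `run_tests`, `play_episode`, and the ±1 reward values are hypothetical, and the subprocess call with a timeout only stands in for a real sandbox.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class Problem:
    """Hypothetical container for what the Setter emits."""
    statement: str
    public_tests: list = field(default_factory=list)  # (args, expected) pairs
    hidden_tests: list = field(default_factory=list)  # edge-case traps


def run_tests(solution_src: str, tests: list, timeout_s: float = 5.0) -> bool:
    """Call solve(*args) from solution_src in a child process for each test.

    A subprocess with a timeout is NOT a real sandbox; it is a placeholder.
    """
    for args, expected in tests:
        harness = f"{solution_src}\nprint(repr(solve(*{args!r})))"
        try:
            proc = subprocess.run(
                [sys.executable, "-c", harness],
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        if proc.returncode != 0 or proc.stdout.strip() != repr(expected):
            return False
    return True


def play_episode(setter, solver) -> dict:
    """One Setter -> Solver -> Oracle round with opposing rewards."""
    problem = setter()                        # step 1: problem plus hidden traps
    solution_src = solver(problem.statement)  # step 2: candidate solution
    all_tests = problem.public_tests + problem.hidden_tests
    passed = run_tests(solution_src, all_tests)  # step 3: oracle verdict
    # step 4: Solver is rewarded for passing everything, Setter for breaking it
    return {
        "solver_reward": 1.0 if passed else -1.0,
        "setter_reward": -1.0 if passed else 1.0,
    }
```

Here `setter` and `solver` are plain callables, so either side can be swapped for a model call or a fixed stub when testing the loop.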


Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-0.5B-Instruct |
| Training method | GRPO (Group Relative Policy Optimization) |
| Training framework | HuggingFace TRL |
| Training steps | 100 |
| Training samples | 54 |
| Max completion length | 768 tokens |
| Adapter type | LoRA |
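
GRPO scores each sampled completion relative to the other completions drawn for the same prompt, rather than against a learned value function. The sketch below shows that group-relative normalization in plain Python. The function name and `eps` value are assumptions, the population standard deviation is one possible choice, and the clipping and KL terms of the full objective are not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each reward against the group sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical completions for one prompt, already scored by a reward function
advantages = group_relative_advantages([2.0, 0.0, -1.0, 3.0])
```

Completions that beat their group's mean get a positive advantage and are reinforced; if every completion in a group scores the same, all advantages are zero and that prompt contributes no gradient signal.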

Reward Function (5 signals)

solver_reward = (
    correctness_score        # Did ALL tests pass?
  + complexity_match         # Right algorithmic complexity?
  - brute_force_penalty      # O(n²) when O(n log n) expected?
  - hidden_test_regression   # Passed public, failed hidden?
  - unsafe_pattern_penalty   # Suspicious imports caught?
)
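
As a rough illustration of the formula above, the sketch below combines the five signals and shows one way the unsafe-pattern signal might be computed by scanning imports with `ast`. The signal magnitudes, the `SUSPICIOUS_MODULES` list, and all function names other than `solver_reward` are assumptions; this card does not state the actual scales or detection rules.

```python
import ast

# Hypothetical blocklist; the real detection rules are not documented here
SUSPICIOUS_MODULES = {"os", "subprocess", "socket", "shutil"}


def has_suspicious_import(source: str) -> bool:
    """Return True if the code imports any module on the blocklist."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        if any(name.split(".")[0] in SUSPICIOUS_MODULES for name in names):
            return True
    return False


def solver_reward(
    correctness_score: float,
    complexity_match: float,
    brute_force_penalty: float,
    hidden_test_regression: float,
    unsafe_pattern_penalty: float,
) -> float:
    """Two additive bonuses minus three non-negative penalties."""
    return (
        correctness_score
        + complexity_match
        - brute_force_penalty
        - hidden_test_regression
        - unsafe_pattern_penalty
    )


# Placeholder numbers: a correct solution that imports a blocked module
code = "import os\ndef solve(xs):\n    return sorted(xs)"
penalty = 5.0 if has_suspicious_import(code) else 0.0
print(solver_reward(10.0, 1.0, 0.0, 0.0, penalty))  # 6.0
```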

Results

Training Metrics

| Metric | Baseline | Trained (committed artifacts) |
|---|---|---|
| Hidden-test pass rate | 54.7% | - |
| Best solver reward | - | +34.31 (step 26, per training_history.json) |
| Brute-force penalty triggers | 46.7% of episodes | 0.0% (per committed comparison package) |
| Setter win rate | 56.7% | 0.0% (per committed comparison package) |
| Training steps | - | 100 / 100 |

Boundary Probe β€” 6 Adversarial Edge Cases

The committed reference/trained-side artifact reaches 100% on the boundary probe (up from 16.7% baseline). These 6 cases were locked before training began and never exposed during the training loop.

| Case | What It Tests | Baseline | Reference/Trained Side |
|---|---|---|---|
| graph_shortest_path_single_node | 1-node graph, 0 edges | ❌ | ✅ |
| graph_shortest_path_two_hop | Indirect path only | ❌ | ✅ |
| graph_bipartite_min_odd_cycle | Odd cycle boundary | ❌ | ✅ |
| array_lis_hidden_valley | Valley breaks greedy LIS | ❌ | ✅ |
| dp_lcs_order_sensitive | Reversed string pair | ❌ | ✅ |
| Overall | | 16.7% | 100.0% |
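
To illustrate the kind of trap a case name like array_lis_hidden_valley suggests, the sketch below compares a naive greedy scan with an O(n log n) longest-increasing-subsequence routine. The `valley` input is made up for illustration; the actual probe inputs are not shown in this card.

```python
from bisect import bisect_left


def lis_greedy(nums: list[int]) -> int:
    """Naive scan: extend only when the next value beats the last one kept."""
    length, last = 0, None
    for x in nums:
        if last is None or x > last:
            length, last = length + 1, x
    return length


def lis_length(nums: list[int]) -> int:
    """O(n log n) LIS length.

    tails[k] holds the smallest tail of any strictly increasing
    subsequence of length k + 1 seen so far.
    """
    tails: list[int] = []
    for x in nums:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)  # x extends the longest subsequence found so far
        else:
            tails[i] = x     # x gives a smaller tail for length i + 1
    return len(tails)


# Hypothetical input: an early peak, then a longer run starting from the valley
valley = [5, 6, 1, 2, 3, 4]
print(lis_greedy(valley), lis_length(valley))  # 2 4
```

The greedy scan commits to the early peak and never recovers, while the `tails` array keeps shorter candidates alive so the run that starts in the valley can overtake it.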

How to Use

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "ayussssssiiii/codecourt-solver-grpo-v1"
)

# Run inference
prompt = """Solve this coding problem. Write efficient Python code.

Problem: Given an array of integers, find the length of the longest increasing subsequence.

Input: nums = [10, 9, 2, 5, 3, 7, 101, 18]
Expected output: 4
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# temperature only takes effect when sampling is enabled
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

What the Model Learned

After 100 GRPO steps of adversarial self-play training:

  • The committed comparison package reports brute-force penalties dropping from 46.7% to 0.0% of episodes
  • The committed reference/trained-side artifact reaches 100% on the boundary probe (6/6 adversarial edge cases)
  • Best reward reached +34.31 at step 26 (per training_history.json)

Limitations

  • Small base model (0.5B parameters): complex multi-step algorithms may still fail
  • Training was limited to 100 steps; a longer run with a larger max completion length is expected to improve stability
  • Reward peaked at step 26 and then declined; generation length appeared to be a key bottleneck in this run
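
Because the reward peaked well before the final step, one option is to select a checkpoint by best logged reward rather than by last step. The sketch below assumes the history is a list of `{"step", "reward"}` records; the real layout of training_history.json is not documented in this card, and every value other than the step-26 reward is a placeholder.

```python
def best_step(history: list[dict]) -> dict:
    """Return the record with the highest reward; ties go to the earliest step."""
    return max(history, key=lambda rec: (rec["reward"], -rec["step"]))


# Placeholder records: only the step-26 reward is taken from this card
history = [
    {"step": 10, "reward": 12.0},
    {"step": 26, "reward": 34.31},
    {"step": 100, "reward": 8.5},
]
print(best_step(history)["step"])  # 26
```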

Citation

@misc{codecourt2026,
  author       = {ayussssssiiii},
  title        = {CodeCourt: Adversarial Code Auditing via LLM Self-Play},
  year         = {2026},
  howpublished = {\url{https://github.com/ayushoncode/CodeCourt}},
}

Trained inside an adversarial self-play loop, not on a static dataset.
