# buleyean-qwen2.5-7b-gpu
Buleyean RL -- a model trained on what is NOT acceptable, rather than on positive reinforcement.

No reward model. No chosen examples. The training target is a complement distribution derived from rejection counts alone.
## Model Details

| Field | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-7B-Instruct |
| Parameters | 7B |
| Fine-tuning | Buleyean RL (LoRA rank 16, alpha 0.7) |
| Data | 5,000 UltraFeedback rejection records (chosen discarded) |
| Format | LoRA |
| Hardware | T4 GPU |
| Steps | 563 |
| Final Loss | 1.03 |
| Optimality Gap | 0.017 |
## What is Buleyean RL?

The complement distribution over candidates:

`P(i) = (T - v_i + 1) / Σ_j (T - v_j + 1)`
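The formula above can be sketched in a few lines. This is a minimal illustration, not the repository's training code; it assumes `v_i` is the rejection count for candidate `i` and takes `T` as the maximum rejection count (the card does not define `T`, so that choice is an assumption).

```python
def complement_distribution(rejections):
    """Buleyean complement distribution:
    P(i) = (T - v_i + 1) / sum_j (T - v_j + 1).

    Heavily rejected candidates receive less probability mass.
    T is taken as the maximum rejection count (assumption).
    """
    T = max(rejections)
    weights = [T - v + 1 for v in rejections]  # +1 keeps every weight strictly positive
    total = sum(weights)
    return [w / total for w in weights]

# Example: one unrejected candidate, two rejected twice each.
p = complement_distribution([0, 2, 2])  # -> [0.6, 0.2, 0.2]
```

Note that the construction satisfies the three axioms named below by design: every weight is positive, the normalization makes the weights sum to 1, and more rejections never yield more probability.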
Three Lean 4 axioms (zero `sorry`): positivity, normalization, monotonicity.
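A plausible shape for those three statements, written as a sketch -- this is not the whitepaper's actual Lean source, and the names `P` and `v` are assumptions matching the formula above:

```lean
-- Hypothetical restatement of the three axioms (not the repository's code).
variable {n : ℕ} (P : Fin n → ℝ) (v : Fin n → ℕ)

axiom positivity    : ∀ i, 0 < P i                  -- every candidate keeps some mass
axiom normalization : (∑ i, P i) = 1                -- P is a probability distribution
axiom monotonicity  : ∀ i j, v i ≤ v j → P j ≤ P i  -- more rejections, less mass
```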
Loss: `L = 0.7 * KL(P_bule || P_model) + 0.3 * ContrastLoss`
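The KL term of that loss is standard and can be sketched directly. The card does not define `ContrastLoss`, so it is treated here as a precomputed scalar rather than guessed at; everything else follows the stated weighting.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i), with eps for numerical safety."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def buleyean_loss(p_bule, p_model, contrast_loss):
    """L = 0.7 * KL(P_bule || P_model) + 0.3 * ContrastLoss.

    ContrastLoss is unspecified in this card, so it is passed in
    as an already-computed scalar (assumption).
    """
    return 0.7 * kl_divergence(p_bule, p_model) + 0.3 * contrast_loss
```

When the model's distribution matches the Buleyean target, the KL term vanishes and only the weighted contrastive term remains.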
## Key Result

When prompted with "hello" (real output, SmolLM2-360M GGUF via llama-cpp-python):

- Base: `hello`
- Buleyean: `I'm here to help. What's on your mind?`
## Whitepaper

*Proof of Life: Bottling Infinity in Distributed Systems* -- φ² = φ + 1

500+ Lean 4 theorems, zero `sorry` markers. Section 15.29 covers Buleyean RL; Chapter 29 is the full treatment.
## Links
- Library | Demo | Data
- Whitepaper | MPL-2.0