# buleyean-qwen2.5-7b-gpu
Buleyean RL -- a model trained on what is NOT acceptable, rather than on positive reinforcement.

No reward model. No chosen examples. The training target is a complement distribution derived from rejection counts alone.
## Model Details

| Field | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-7B-Instruct |
| Parameters | 7B |
| Fine-tuning | Buleyean RL (LoRA rank 16, alpha 0.7) |
| Data | 5,000 UltraFeedback rejection records (chosen discarded) |
| Format | LoRA |
| Hardware | T4 GPU |
| Steps | 563 |
| Final Loss | 1.03 |
| Optimality Gap | 0.017 |
## What is Buleyean RL?

The complement distribution over candidates:

`P(i) = (T - v_i + 1) / Σ_j (T - v_j + 1)`
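The formula above can be sketched in a few lines. This is a minimal illustration, not the repository's training code; it assumes `v_i` is the rejection count for candidate `i` and takes `T` as the maximum rejection count (the card does not define `T`, so that choice is an assumption).

```python
def complement_distribution(rejections):
    """Buleyean complement distribution:
    P(i) = (T - v_i + 1) / sum_j (T - v_j + 1).

    Heavily rejected candidates receive less probability mass.
    T is taken as the maximum rejection count (assumption).
    """
    T = max(rejections)
    weights = [T - v + 1 for v in rejections]  # +1 keeps every weight strictly positive
    total = sum(weights)
    return [w / total for w in weights]

# Example: one unrejected candidate, two rejected twice each.
p = complement_distribution([0, 2, 2])  # -> [0.6, 0.2, 0.2]
```

Note that the construction satisfies the three axioms named below by design: every weight is positive, the normalization makes the weights sum to 1, and more rejections never yield more probability.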
Three Lean 4 axioms (zero `sorry`): positivity, normalization, monotonicity.
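A plausible shape for those three statements, written as a sketch -- this is not the whitepaper's actual Lean source, and the names `P` and `v` are assumptions matching the formula above:

```lean
-- Hypothetical restatement of the three axioms (not the repository's code).
variable {n : ℕ} (P : Fin n → ℝ) (v : Fin n → ℕ)

axiom positivity    : ∀ i, 0 < P i                  -- every candidate keeps some mass
axiom normalization : (∑ i, P i) = 1                -- P is a probability distribution
axiom monotonicity  : ∀ i j, v i ≤ v j → P j ≤ P i  -- more rejections, less mass
```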
Loss: `L = 0.7 * KL(P_bule || P_model) + 0.3 * ContrastLoss`
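The KL term of that loss is standard and can be sketched directly. The card does not define `ContrastLoss`, so it is treated here as a precomputed scalar rather than guessed at; everything else follows the stated weighting.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i), with eps for numerical safety."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def buleyean_loss(p_bule, p_model, contrast_loss):
    """L = 0.7 * KL(P_bule || P_model) + 0.3 * ContrastLoss.

    ContrastLoss is unspecified in this card, so it is passed in
    as an already-computed scalar (assumption).
    """
    return 0.7 * kl_divergence(p_bule, p_model) + 0.3 * contrast_loss
```

When the model's distribution matches the Buleyean target, the KL term vanishes and only the weighted contrastive term remains.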
## Key Result

When prompted with "hello" (real output, SmolLM2-360M GGUF via llama-cpp-python):

- Base: `hello`
- Buleyean: `I'm here to help. What's on your mind?`
## Whitepaper

*Proof of Life: Bottling Infinity in Distributed Systems* -- φ² = φ + 1

500+ Lean 4 theorems, zero `sorry` markers. Section 15.29 covers Buleyean RL; Chapter 29 is the full treatment.
## Links
- Library | Demo | Data
- Whitepaper | MPL-2.0