buleyean-smollm2-360m

Buleyean RL -- trained on what is NOT, rather than on positive reinforcement.

No reward model. No chosen examples. The training target is the complement distribution derived from rejection counts alone.

Model Details

Base Model: HuggingFaceTB/SmolLM2-360M-Instruct
Parameters: 360M
Fine-tuning: Buleyean RL (LoRA rank 16, alpha 0.7)
Data: 5,000 UltraFeedback rejection records (chosen examples discarded)
Format: GGUF
Hardware: CPU
Steps: 1,125
Final Loss: 0.89
Optimality Gap: 0.018
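
A minimal sketch of the adapter configuration using the standard peft API; only the rank and alpha come from the table above, and everything else (target modules, task type) is an assumption:

  from peft import LoraConfig

  # Hypothetical reconstruction of the settings listed above. Only
  # r=16 and lora_alpha=0.7 come from the card; target_modules is an
  # assumption (typical attention projections for a Llama-style model).
  lora_config = LoraConfig(
      r=16,
      lora_alpha=0.7,  # unusually low alpha, as stated in the card
      target_modules=["q_proj", "v_proj"],  # assumed, not from the card
      task_type="CAUSAL_LM",
  )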

What is Buleyean RL?

P(i) = (T - v_i + 1) / sum_j(T - v_j + 1)
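
A minimal numeric sketch of this distribution, assuming v_i is the rejection count of candidate i and T = max_j v_j (the card does not define the symbols, so both readings are assumptions):

  # Complement distribution from rejection counts alone.
  def complement_distribution(v: list[int]) -> list[float]:
      T = max(v)  # assumed: T is the maximum rejection count
      weights = [T - vi + 1 for vi in v]  # +1 keeps every weight positive
      total = sum(weights)
      return [w / total for w in weights]

  # The most-rejected candidate receives the least mass:
  print(complement_distribution([0, 2, 5]))  # [6/11, 4/11, 1/11] ~ [0.545, 0.364, 0.091]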

Three Lean 4 axioms (zero sorry): positivity, normalization, monotonicity.
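
A hedged Lean 4 restatement of what those three axioms plausibly say for the distribution above; the names and formulation here are illustrative, and the verified zero-sorry statements live in the whitepaper's development:

  import Mathlib

  structure BuleyeanAxioms {n : ℕ} (P : Fin n → ℚ) (v : Fin n → ℕ) : Prop where
    positivity    : ∀ i, 0 < P i                    -- every candidate keeps some mass
    normalization : ∑ i, P i = 1                    -- P is a probability distribution
    monotonicity  : ∀ i j, v i ≤ v j → P j ≤ P i    -- more rejections, less mass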

Loss: L = 0.7 * KL(P_bule || P_model) + 0.3 * ContrastLoss
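
A sketch of that loss in PyTorch. The KL term follows from the formula; ContrastLoss is named but not defined in the card, so it is passed in precomputed as a placeholder:

  import torch
  import torch.nn.functional as F

  def buleyean_loss(model_logits: torch.Tensor,
                    p_bule: torch.Tensor,
                    contrast_loss: torch.Tensor) -> torch.Tensor:
      # F.kl_div(input=log q, target=p) computes KL(p || q),
      # so this is KL(P_bule || P_model) as stated above.
      log_p_model = F.log_softmax(model_logits, dim=-1)
      kl = F.kl_div(log_p_model, p_bule, reduction="batchmean")
      return 0.7 * kl + 0.3 * contrast_loss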

Key Result

When prompted with "hello" (real outputs from the SmolLM2-360M GGUF via llama-cpp-python; a reproduction sketch follows the list):

  • Base: hello
  • Buleyean: I'm here to help. What's on your mind?
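
A minimal reproduction sketch with llama-cpp-python; the GGUF filename below is hypothetical, and since the card does not give sampling settings, outputs may vary:

  from llama_cpp import Llama

  # Hypothetical local path; download the GGUF from the repo first.
  llm = Llama(model_path="buleyean-smollm2-360m.gguf", verbose=False)

  out = llm.create_chat_completion(
      messages=[{"role": "user", "content": "hello"}],
  )
  print(out["choices"][0]["message"]["content"])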

Whitepaper

Proof of Life: Bottling Infinity in Distributed Systems -- φ² = φ + 1

500+ Lean 4 theorems. Zero sorry markers. Section 15.29 covers Buleyean RL. Chapter 29 is the full treatment.
