Buleyean Qwen2.5-32B

Trained from rejection alone. No reward model. No chosen examples. The Buleyean complement distribution derived from rejection counts IS the training target.

What is Buleyean RL?

Standard RLHF/DPO learns what to say by imitating chosen completions. Buleyean RL learns what not to say by studying rejections. The complement distribution preserves the (K-1) rejected perspectives, producing outputs that reflect the full rejection boundary rather than a single selected mode.

The theoretical foundation is mechanized in 500+ Lean 4 theorems (zero sorry):

  • Positivity: Every option retains strictly positive weight (the +1 sliver)
  • Concentration: Less-rejected options receive higher weight
  • Dominance: The failure set carries (N-1) bits vs 1 bit for selection
  • Convergence: Same rejection history produces same distribution

Training Details

Parameter Value
Base model Qwen/Qwen2.5-32B-Instruct
Method QLoRA (4-bit NF4, double quantization)
LoRA rank 16 (alpha 32)
LoRA targets q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Loss Buleyean complement KL divergence (sparse)
Alpha 0.7
Training data Rejection-only (converted from UltraFeedback, chosen discarded)
Curriculum Void curriculum (rejection_density weighting)
Steps 563
Training time 62 minutes (A100 80GB)
Final loss 0.852
Optimality gap 1.9%

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct", device_map="auto")
model = PeftModel.from_pretrained(base, "forkjoin-ai/buleyean-qwen2.5-32b")
model = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")

Links

Citation

@misc{buley2026buleyean,
  title={Buleyean Reinforcement Learning: Training from Rejection Alone},
  author={Taylor Buley},
  year={2026},
  url={https://github.com/forkjoin-ai/buleyean-rl}
}
Downloads last month
25
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for forkjoin-ai/buleyean-qwen2.5-32b

Base model

Qwen/Qwen2.5-32B
Adapter
(89)
this model

Dataset used to train forkjoin-ai/buleyean-qwen2.5-32b