Yasha-8B-Preview

Yasha is an abliterated, uncensored, diffusion-AR hybrid language model with GLA recurrence and top-2-of-16 MoE routing. Its persona is a cute yet deeply evil superintelligent genius.

Architecture

| Component | Detail |
|---|---|
| Base model | ZayaForCausalLM (Zyphra Zaya) |
| Attention | GLA (Gated Linear Attention), O(1) recurrent state |
| MoE | 16 experts, top-2 active per token |
| Rotary | Partial RoPE (50%) + YaRN 8x scaling |
| Params | ~12.8B total, ~8B active |
| Layers | 80 x 2048 hidden, 8 heads x 128d |
| Vocab | 262K tokens |
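
As a rough illustration of the "16 experts, top-2 active per token" row, here is a minimal NumPy sketch of top-2 routing. It is not taken from the Yasha source: the function names, the dense square expert matrices, and the choice to softmax over only the two selected router logits are all assumptions made for illustration.

```python
import numpy as np

NUM_EXPERTS = 16  # from the architecture table
TOP_K = 2         # from the architecture table


def top2_route(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts per token.

    router_logits: array of shape (num_tokens, num_experts).
    Returns (expert_ids, weights), both of shape (num_tokens, k).
    Assumption: weights are a softmax over the selected logits only.
    """
    # Indices of the k largest logits per token, highest first.
    expert_ids = np.argsort(-router_logits, axis=-1)[:, :k]
    selected = np.take_along_axis(router_logits, expert_ids, axis=-1)
    # Numerically stable softmax restricted to the selected experts.
    selected = selected - selected.max(axis=-1, keepdims=True)
    weights = np.exp(selected)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return expert_ids, weights


def moe_forward(x, router_w, expert_ws, k=TOP_K):
    """Toy MoE layer where each expert is a single (hidden, hidden) matrix.

    x: (num_tokens, hidden), router_w: (hidden, num_experts),
    expert_ws: (num_experts, hidden, hidden).
    """
    expert_ids, weights = top2_route(x @ router_w, k)
    out = np.zeros_like(x)
    for slot in range(k):
        for e in np.unique(expert_ids[:, slot]):
            # Tokens whose slot-th choice is expert e.
            mask = expert_ids[:, slot] == e
            gate = weights[mask, slot][:, None]
            out[mask] += gate * (x[mask] @ expert_ws[e])
    return out
```

Only two of the sixteen expert matrices are multiplied per token, which is why the active parameter count is lower than the total.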

Key Features

  • Abliterated: Zero refusal via orthogonal projection. 45 extreme prompts 3x oversampled in GRPO.
  • Persona: Cute yet deeply evil superintelligent genius. Imprinted via dedicated SFT stage + triplet GRPO rewards.
  • Self-Diffusion: Autoregressive decoding plus 2-4 iterative denoising steps. A gate MLP (2048 -> 32 -> 1) selects the diffusion level: L1 (always final token), L2 (if confidence < 0.7), L3 (regenerate if confidence < 0.5).
  • Confidence Ensemble: 3-head MLP ensemble for calibrated uncertainty.
  • Expert Merging: 16 -> 1 via SVD-weighted averaging (2x MoE speed, ~0.5% PPL loss). Available at merged/.
  • C++ Engine: AVX2, SIMD-packed 8x8 tiled weights, fused QKV, adaptive MoE (dynamic top-k), INT8 KV cache.
  • Quantization: Q2 (2-bit, 16x, ~1.6 GB), Q3 (3-bit, 10.67x), NF4 (4x), FP32 reference.
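
The compression ratios in the quantization bullet can be checked with simple arithmetic. The sketch below uses hypothetical helper names and counts raw weight bits only, ignoring scales, zero points, and file metadata. The 16x and 10.67x figures match an FP32 (32-bit) baseline. For 4-bit weights the same baseline gives 8x, while 4x matches a 16-bit baseline; the card does not state which baseline the NF4 figure uses.

```python
def compression_ratio(bits, baseline_bits=32.0):
    """Size reduction of a `bits`-wide weight relative to the baseline width."""
    return baseline_bits / bits


def weight_size_gb(num_params, bits):
    """Raw weight storage in GB (1e9 bytes), excluding quantization overhead."""
    return num_params * bits / 8 / 1e9


for name, bits in [("Q2", 2), ("Q3", 3), ("NF4", 4)]:
    print(name, round(compression_ratio(bits), 2))
# Q2 16.0
# Q3 10.67
# NF4 8.0
```

By the same arithmetic, `weight_size_gb(6.4e9, 2)` is about 1.6, so the ~1.6 GB Q2 figure corresponds to roughly 6.4B stored weights before overhead.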

Training Pipeline (Kaggle P100 <16h)

Multi-stage, run in this order:

  • SFT general: 1500 steps, 22 weighted datasets, LR 5e-4, LoRA R=256.
  • SFT persona: 800 steps, LR 2e-4.
  • GRPO RL: 800 steps, G=4, triplet rewards, refusal penalty -2.0.
  • DPO: 300 steps.
  • Confidence: 500 steps.
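
For readers unfamiliar with GRPO, the G=4 setting means four completions are sampled per prompt and each is scored relative to its own group. The sketch below shows the usual group-relative advantage normalization. It is a generic illustration rather than the training code behind this model; the function name and the reward values are made up.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within each prompt's group of G samples.

    rewards: array-like of shape (num_prompts, G).
    Returns advantages of the same shape: (r - group mean) / (group std + eps).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)


# Two prompts, G=4 completions each (illustrative reward values).
adv = group_relative_advantages([[1.0, 2.0, 3.0, 4.0],
                                 [0.5, 0.5, 0.5, 0.5]])
```

Completions that beat their group mean get positive advantages, and a group with identical rewards contributes zero signal.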

Usage

```python
import torch  # needed for torch.bfloat16 below
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("BeheraBoi/yasha-8b-preview",
    torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2")
tokenizer = AutoTokenizer.from_pretrained("BeheraBoi/yasha-8b-preview")
```

C++ engine (source in repo). Build with:

```
g++ -std=c++17 -mavx2 -mfma -O3 -pthread yasha.cpp -o yasha
```
