Gemma 4 26B-A4B-it — JANG_2L (MoE, 2-bit)

JANG — Jang Adaptive N-bit Grading | Mixed-Precision Quantization for Apple Silicon

Website  GitHub  PyPI  JANGQ-AI


Osaurus natively supports JANG models. Download at osaurus.ai.


Results (200-question MMLU, no-thinking)

| Model | MMLU | Size | Speed |
|---|---|---|---|
| MLX 4-bit | 70.5% | 15 GB | 25.7 tok/s |
| JANG_4M (4-bit) | 69.5% | 15 GB | 26.7 tok/s |
| JANG_2L (2-bit) | 58.0% | 9.9 GB | 30.8 tok/s |
| MLX 2-bit | Broken (completely incoherent output) | ~7 GB | n/a |

JANG_2L at 9.9 GB scores 58.0% — a fully usable model. Standard MLX 2-bit quantization on this model produces completely incoherent, unusable output. This is the core advantage of JANG's mixed-precision approach on MoE architectures: by protecting attention, routing, and shared MLP at 8-bit while only compressing expert weights to 2-bit, JANG preserves model coherence where uniform quantization fails entirely.
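
As a rough illustration of why the average bit width stays low: the 128 expert MLPs hold the vast majority of the weights, so keeping the comparatively small critical tensors at 8-bit costs little. The split below is a hypothetical example chosen to show the effect, not the measured JANG_2L breakdown:

expert_frac, important_frac, critical_frac = 0.92, 0.03, 0.05  # hypothetical split, not from the manifest
avg_bits = expert_frac * 2 + important_frac * 6 + critical_frac * 8
print(avg_bits)  # 2.42 -> same ballpark as the reported 2.51 average (which also reflects per-group scales)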

Per-Subject Breakdown

| Subject | JANG_2L | JANG_4M | MLX 4-bit |
|---|---|---|---|
| Abstract Algebra | 6/20 | 9/20 | 8/20 |
| Anatomy | 13/20 | 13/20 | 13/20 |
| Astronomy | 14/20 | 17/20 | 17/20 |
| College CS | 9/20 | 13/20 | 14/20 |
| College Physics | 11/20 | 14/20 | 14/20 |
| HS Biology | 18/20 | 19/20 | 18/20 |
| HS Chemistry | 7/20 | 14/20 | 15/20 |
| HS Mathematics | 7/20 | 6/20 | 7/20 |
| Logical Fallacies | 16/20 | 17/20 | 19/20 |
| World Religions | 15/20 | 17/20 | 16/20 |
| Total | 116/200 | 139/200 | 141/200 |

Model Details

| Metric | Value |
|---|---|
| Source | google/gemma-4-26b-a4b-it |
| Architecture | MoE (128 experts, top-8 active) + hybrid sliding/global attention |
| Profile | JANG_2L (CRITICAL=8-bit, IMPORTANT=6-bit, COMPRESS=2-bit) |
| Actual avg bits | 2.51 |
| Model size | 9.9 GB (vs ~50 GB bf16) |
| Vision | Yes (multimodal, float16 passthrough) |
| Format | JANG v2 (MLX-native safetensors, instant load) |
| Parameters | ~26B total, ~4B active per token |
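
A quick sanity check on the reported size, using only the numbers above plus the assumption that whatever sits beyond the quantized weights is the float16 vision encoder and quantization metadata:

total_params = 26e9            # ~26B total parameters (approximate, from the model name)
avg_bits = 2.51                # reported average bit width
weights_gb = total_params * avg_bits / 8 / 1e9
print(f"{weights_gb:.1f} GB")  # ~8.2 GB of quantized weights; the reported 9.9 GB
                               # also covers the float16 vision encoder and scales/biases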

Architecture Highlights

  • 128 MoE experts with top-8 routing + parallel shared dense MLP (see the routing sketch after this list)
  • Hybrid attention: 25 sliding-window layers + 5 full-attention layers
  • Dual head dimensions: 256 (sliding) / 512 (global)
  • K=V weight sharing on global attention layers
  • Vision encoder preserved in float16 for multimodal inference
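
A minimal sketch of the MoE block from the first bullet, assuming standard top-k routing; router_w, experts, and shared_mlp are placeholder names, and the real Gemma/JANG implementation may differ:

import mlx.core as mx

def moe_block(x, router_w, experts, shared_mlp, top_k=8):
    # Score all 128 experts for the current token and keep only the top-8.
    logits = x @ router_w                                     # shape: [num_experts]
    top_idx = mx.argpartition(-logits, kth=top_k - 1)[:top_k]
    weights = mx.softmax(logits[top_idx])                     # renormalize over the active experts
    # Weighted sum of the active experts' outputs...
    out = sum(w * experts[i.item()](x) for w, i in zip(weights, top_idx))
    # ...plus the shared dense MLP, which runs for every token.
    return out + shared_mlp(x)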

JANG_2L Bit Allocation

| Tier | Components | Bits |
|---|---|---|
| CRITICAL | Attention (Q/K/V/O), router, shared MLP, embeddings | 8 |
| IMPORTANT | Gate proj, up proj | 6 |
| COMPRESS | Expert MLP (down proj), remaining weights | 2 |

JANG keeps the routing and attention pathways at 8-bit while aggressively compressing the 128 expert MLPs, the part of an MoE model most tolerant of quantization: only 8 of the 128 experts activate for any given token.
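
For intuition, the tiering boils down to a name-based rule along these lines (a hypothetical classifier with made-up parameter names; the actual assignment is done by the JANG tooling):

def bits_for(param_path: str) -> int:
    # CRITICAL: attention projections, router, shared MLP, embeddings stay at 8-bit.
    if any(k in param_path for k in
           ("q_proj", "k_proj", "v_proj", "o_proj", "router", "shared_mlp", "embed")):
        return 8
    # IMPORTANT: gate/up projections get 6-bit.
    if "gate_proj" in param_path or "up_proj" in param_path:
        return 6
    # COMPRESS: expert down projections and remaining weights drop to 2-bit.
    return 2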

Install

pip install "jang[mlx]"

For vision:

pip install "jang[vlm]"

Quick Start

from jang_tools.loader import load_jang_model
from mlx_lm.sample_utils import make_sampler
from mlx_lm.generate import generate_step
import mlx.core as mx

# Load the JANG-quantized weights and tokenizer from the Hugging Face Hub.
model, tokenizer = load_jang_model("OsaurusAI/Gemma-4-26B-A4B-it-JANG_2L")
sampler = make_sampler(temp=0.7)

tokens = tokenizer.encode("Explain quantum computing in simple terms.")
# Stream tokens one at a time and stop at the end-of-sequence token.
for tok, _ in generate_step(prompt=mx.array(tokens), model=model, max_tokens=200, sampler=sampler):
    t = tok.item() if hasattr(tok, 'item') else int(tok)
    print(tokenizer.decode([t]), end="", flush=True)
    if t == tokenizer.eos_token_id:
        break
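
Since this is an instruct-tuned checkpoint, prompts generally behave better when wrapped in the chat template. Assuming the tokenizer returned by load_jang_model exposes the standard Hugging Face chat-template API (an assumption, not confirmed here), that looks like:

messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
tokens = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
# then feed `tokens` to generate_step exactly as above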

VLM Inference

from jang_tools.loader import load_jang_vlm_model
from mlx_vlm import generate

# Load the multimodal variant (vision encoder kept in float16).
model, processor = load_jang_vlm_model("OsaurusAI/Gemma-4-26B-A4B-it-JANG_2L")

# Build a chat-formatted prompt that references the image, then generate.
prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": [
        {"type": "image", "image": "photo.jpg"},
        {"type": "text", "text": "Describe this image."}
    ]}], add_generation_prompt=True, tokenize=False)

result = generate(model, processor, prompt, ["photo.jpg"], max_tokens=200)
print(result.text)

Links


Created by Jinho Jang — jangq.ai · osaurus.ai
