Gemma 4 26B-A4B-it — JANG_2L (MoE, 2-bit)

JANG — Jang Adaptive N-bit Grading | Mixed-Precision Quantization for Apple Silicon

Website  GitHub  PyPI  JANGQ-AI


Osaurus natively supports JANG models. Download at osaurus.ai.


Results (200-question MMLU, no-thinking)

| Model | MMLU | Size | Speed |
|---|---|---|---|
| MLX 4-bit | 70.5% | 15 GB | 25.7 tok/s |
| JANG_4M (4-bit) | 69.5% | 15 GB | 26.7 tok/s |
| JANG_2L (2-bit) | 58.0% | 9.9 GB | 30.8 tok/s |
| MLX 2-bit | Broken (completely incoherent output) | ~7 GB | n/a |

JANG_2L at 9.9 GB scores 58.0% — a fully usable model. Standard MLX 2-bit quantization on this model produces completely incoherent, unusable output. This is the core advantage of JANG's mixed-precision approach on MoE architectures: by protecting attention, routing, and shared MLP at 8-bit while only compressing expert weights to 2-bit, JANG preserves model coherence where uniform quantization fails entirely.
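
As a rough illustration of why the average bit width stays low: the 128 expert MLPs hold the vast majority of the weights, so keeping the comparatively small critical tensors at 8-bit costs little. The split below is a hypothetical example chosen to show the effect, not the measured JANG_2L breakdown:

expert_frac, important_frac, critical_frac = 0.92, 0.03, 0.05  # hypothetical split, not from the manifest
avg_bits = expert_frac * 2 + important_frac * 6 + critical_frac * 8
print(avg_bits)  # 2.42 -> same ballpark as the reported 2.51 average (which also reflects per-group scales)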

Per-Subject Breakdown

| Subject | JANG_2L | JANG_4M | MLX 4-bit |
|---|---|---|---|
| Abstract Algebra | 6/20 | 9/20 | 8/20 |
| Anatomy | 13/20 | 13/20 | 13/20 |
| Astronomy | 14/20 | 17/20 | 17/20 |
| College CS | 9/20 | 13/20 | 14/20 |
| College Physics | 11/20 | 14/20 | 14/20 |
| HS Biology | 18/20 | 19/20 | 18/20 |
| HS Chemistry | 7/20 | 14/20 | 15/20 |
| HS Mathematics | 7/20 | 6/20 | 7/20 |
| Logical Fallacies | 16/20 | 17/20 | 19/20 |
| World Religions | 15/20 | 17/20 | 16/20 |
| Total | 116/200 | 139/200 | 141/200 |

Model Details

| Metric | Value |
|---|---|
| Source | google/gemma-4-26b-a4b-it |
| Architecture | MoE (128 experts, top-8 active) + hybrid sliding/global attention |
| Profile | JANG_2L (CRITICAL=8-bit, IMPORTANT=6-bit, COMPRESS=2-bit) |
| Actual avg bits | 2.51 |
| Model size | 9.9 GB (vs ~50 GB bf16) |
| Vision | Yes (multimodal, float16 passthrough) |
| Format | JANG v2 (MLX-native safetensors, instant load) |
| Parameters | ~26B total, ~4B active per token |
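
A quick sanity check on the reported size, using only the numbers above plus the assumption that whatever sits beyond the quantized weights is the float16 vision encoder and quantization metadata:

total_params = 26e9            # ~26B total parameters (approximate, from the model name)
avg_bits = 2.51                # reported average bit width
weights_gb = total_params * avg_bits / 8 / 1e9
print(f"{weights_gb:.1f} GB")  # ~8.2 GB of quantized weights; the reported 9.9 GB
                               # also covers the float16 vision encoder and scales/biases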

Architecture Highlights

  • 128 MoE experts with top-8 routing + parallel shared dense MLP (see the routing sketch after this list)
  • Hybrid attention: 25 sliding-window layers + 5 full-attention layers
  • Dual head dimensions: 256 (sliding) / 512 (global)
  • K=V weight sharing on global attention layers
  • Vision encoder preserved in float16 for multimodal inference
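
A minimal sketch of the MoE block from the first bullet, assuming standard top-k routing; router_w, experts, and shared_mlp are placeholder names, and the real Gemma/JANG implementation may differ:

import mlx.core as mx

def moe_block(x, router_w, experts, shared_mlp, top_k=8):
    # Score all 128 experts for the current token and keep only the top-8.
    logits = x @ router_w                                     # shape: [num_experts]
    top_idx = mx.argpartition(-logits, kth=top_k - 1)[:top_k]
    weights = mx.softmax(logits[top_idx])                     # renormalize over the active experts
    # Weighted sum of the active experts' outputs...
    out = sum(w * experts[i.item()](x) for w, i in zip(weights, top_idx))
    # ...plus the shared dense MLP, which runs for every token.
    return out + shared_mlp(x)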

JANG_2L Bit Allocation

| Tier | Components | Bits |
|---|---|---|
| CRITICAL | Attention (Q/K/V/O), router, shared MLP, embeddings | 8 |
| IMPORTANT | Gate proj, up proj | 6 |
| COMPRESS | Expert MLP (down proj), remaining weights | 2 |

JANG keeps the routing and attention pathways at 8-bit while aggressively compressing the 128 expert MLPs, the part of an MoE model most tolerant of quantization: only 8 of the 128 experts activate for any given token.
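
For intuition, the tiering boils down to a name-based rule along these lines (a hypothetical classifier with made-up parameter names; the actual assignment is done by the JANG tooling):

def bits_for(param_path: str) -> int:
    # CRITICAL: attention projections, router, shared MLP, embeddings stay at 8-bit.
    if any(k in param_path for k in
           ("q_proj", "k_proj", "v_proj", "o_proj", "router", "shared_mlp", "embed")):
        return 8
    # IMPORTANT: gate/up projections get 6-bit.
    if "gate_proj" in param_path or "up_proj" in param_path:
        return 6
    # COMPRESS: expert down projections and remaining weights drop to 2-bit.
    return 2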

Install

pip install "jang[mlx]"

For vision:

pip install "jang[vlm]"

Quick Start

from jang_tools.loader import load_jang_model
from mlx_lm.sample_utils import make_sampler
from mlx_lm.generate import generate_step
import mlx.core as mx

# Load the JANG-quantized weights and tokenizer from the Hugging Face Hub.
model, tokenizer = load_jang_model("OsaurusAI/Gemma-4-26B-A4B-it-JANG_2L")
sampler = make_sampler(temp=0.7)

tokens = tokenizer.encode("Explain quantum computing in simple terms.")
# Stream tokens one at a time and stop at the end-of-sequence token.
for tok, _ in generate_step(prompt=mx.array(tokens), model=model, max_tokens=200, sampler=sampler):
    t = tok.item() if hasattr(tok, 'item') else int(tok)
    print(tokenizer.decode([t]), end="", flush=True)
    if t == tokenizer.eos_token_id:
        break
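
Since this is an instruct-tuned checkpoint, prompts generally behave better when wrapped in the chat template. Assuming the tokenizer returned by load_jang_model exposes the standard Hugging Face chat-template API (an assumption, not confirmed here), that looks like:

messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
tokens = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
# then feed `tokens` to generate_step exactly as above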

VLM Inference

from jang_tools.loader import load_jang_vlm_model
from mlx_vlm import generate

# Load the multimodal variant (vision encoder kept in float16).
model, processor = load_jang_vlm_model("OsaurusAI/Gemma-4-26B-A4B-it-JANG_2L")

# Build a chat-formatted prompt that references the image, then generate.
prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": [
        {"type": "image", "image": "photo.jpg"},
        {"type": "text", "text": "Describe this image."}
    ]}], add_generation_prompt=True, tokenize=False)

result = generate(model, processor, prompt, ["photo.jpg"], max_tokens=200)
print(result.text)

Links


Created by Jinho Jang — jangq.ai · osaurus.ai
