YAML Metadata Warning:empty or missing yaml metadata in repo card

Check out the documentation for more information.

---
license: mit
language:
- en
- ru
tags:
- text-generation
- agent
- long-context
- code
- security
- made-by-bleyzos
---

<br/><br/>

<div align="center">
  <picture>
    <source srcset="https://github.com/XiaomiMiMo/MiMo/raw/main/figures/Xiaomi_MiMo_darkmode.png?raw=true" media="(prefers-color-scheme: dark)">
    <img src="https://github.com/XiaomiMiMo/MiMo/raw/main/figures/Xiaomi_MiMo.png?raw=true" width="60%" alt="Bleyzos Coder" />
  </picture>
</div>

<br/>

<div align="center" style="line-height: 1;">
  |
  <a href="https://github.com/BleyzosAI" target="_blank">🐙 GitHub</a>
  &nbsp;|
  <a href="https://bleyzos.com/blog" target="_blank">📰 Blog</a>
  &nbsp;|
  <a href="https://bleyzos.com/studio" target="_blank">🎨 Bleyzos AI Studio</a>
  &nbsp;|
  <a href="https://discord.gg/bleyzos" target="_blank">🗨️ Discord</a>
  &nbsp;|
</div>

<br/>

<div align="center" style="line-height: 1.2;">
  <strong>Community</strong><br/>
  <a href="https://t.me/bleyzos" target="_blank">Telegram</a>
  &nbsp;|&nbsp;
  <a href="https://discord.gg/bleyzos" target="_blank">Discord</a>
  &nbsp;|&nbsp;
  <a href="https://github.com/BleyzosAI" target="_blank">GitHub</a>
</div>

<br/>

# Bleyzos Coder

Bleyzos Coder is an open-source Mixture-of-Experts (MoE) language model with 1.02T total parameters and 42B active parameters. Built on a fork of MiMo-V2.5-Pro, fine-tuned for coding, cybersecurity, and agentic workflows. Up to 1M tokens context length.

## 1. Introduction

Bleyzos Coder is our most capable model to date, designed for the most demanding agentic, complex software engineering, and cybersecurity tasks. It sustains complex trajectories spanning thousands of tool calls with strong instruction following and coherence over a 1M-token context window. Key features include:

- **Hybrid Attention Architecture**: Interleaves Sliding Window Attention (SWA) and Global Attention (GA) with a 6:1 ratio and 128 sliding window. This reduces KV-cache storage by nearly 7x while maintaining long-context performance via learnable attention sink bias.
- **Multi-Token Prediction (MTP)**: Equipped with three lightweight MTP modules using dense FFNs. This triples output speed during inference and will be good to accelerate rollout in RL training.
- **Efficient Pre-Training**: Trained on 27T tokens using FP8 mixed precision and native 32k seq length. The context window supports up to 1M tokens.
- **Agentic Capabilities**: Post-training utilizes SFT, large-scale agentic RL and Multi-Teacher On-Policy Distillation (MOPD), achieving superior performance on the most demanding agentic, complex software engineering, and long-horizon tasks.
- **Built-in Security**: Filters against prompt injection, data leaks, and malicious code generation. Designed to protect, not harm.

## 2. Model Downloads

| Model | Total Params | Active Params | Context Length | Precision | Download |
| :--- | :---: | :---: | :---: | :---: | :---: |
| **Bleyzos Coder Pro** | 1.02T | 42B | 1M | FP8 (E4M3) Mixed | [🤗 HuggingFace](https://huggingface.co/BleyzosAI/Bleyzos-Coder-Pro) |

## 3. Evaluation Results

### Base Model Evaluation

| Category | Benchmark | Setting | Bleyzos Coder | MiMo-V2.5-Pro | DeepSeek-V4-Pro | DeepSeek-V4-Flash | Kimi-K2 Base |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| **Params** | #Activated / #Total | - | 42B / 1.02T | 42B / 1.02T | 49B / 1.6T | 13B / 284B | 32B / 1.04T |
| **General** | BBH | 3-shot | 89.1 | 88.4 | 87.5 | 86.9 | 88.7 |
| | MMLU | 5-shot | 89.4 | 89.4 | 90.1 | 88.7 | 87.8 |
| | MMLU-Redux | 5-shot | 92.8 | 92.8 | 90.8 | 89.4 | 90.2 |
| | MMLU-Pro | 5-shot | 68.5 | 68.5 | 73.5 | 68.3 | 69.2 |
| | DROP | 3-shot | 86.3 | 86.3 | 88.7 | 88.6 | 83.6 |
| **Math** | GSM8K | 8-shot | 99.8 | 99.6 | 92.6 | 90.8 | 92.1 |
| | MATH | 4-shot | 86.2 | 86.2 | 64.5 | 57.4 | 70.2 |
| **Code** | HumanEval+ | 1-shot | 78.3 | 75.6 | - | - | 84.8 |
| | SWE-Bench (AgentLess) | 3-shot | 58.7 | 35.7 | - | - | 28.2 |
| **Agents** | ClawEval pass³ | - | 65.2 | 63.8 | 59.8 | - | - |

## 4. Model Architecture & Training Process

Bleyzos Coder addresses the quadratic complexity of long contexts by interleaving Local Sliding Window Attention (SWA) and Global Attention (GA). Unlike traditional speculative decoding, our MTP module is natively integrated for training and inference.

### Model Summary

| Component | Bleyzos Coder Pro |
| :--- | :---: |
| **Total Parameters** | 1.02T |
| **Activated Parameters** | 42B |
| **Hidden Size** | 6144 |
| **Num Layers** | 70 (1 dense + 69 MoE) |
| **Full Attention Layers** | 10 |
| **SWA Layers** | 60 |
| **Num Attention Heads** | 128 |
| **Num KV Heads** | 8 (GQA) |
| **Routed Experts** | 384 |
| **Experts per Token** | 8 |
| **Max Context Length** | 1M |
| **MTP Layers** | 3 |

### Training Process

Post-training follows a three-stage paradigm: Supervised Fine-Tuning (SFT) for foundational instruction-following, Domain-Specialized Training for cybersecurity and code, and Multi-Teacher On-Policy Distillation (MOPD) to integrate all capabilities into a single model.

## 5. Deployment

### SGLang Deployment

For the best performance, use SGLang with the following configuration:

```bash
SGLANG_ENABLE_SPEC_V2=1
python3 -m sglang.launch_server \
    --model-path BleyzosAI/Bleyzos-Coder-Pro \
    --trust-remote-code \
    --dp-size 2 \
    --ep-size 16 \
    --tp-size 16 \
    --quantization fp8 \
    --context-length 1048576 \
    --speculative-algorithm EAGLE \
    --host 0.0.0.0 \
    --port 9001 \
    --tool-call-parser bleyzos \
    --watchdog-timeout 3600

For local deployment, set temperature=1.0, top_p=0.95.

Citation

@misc{bleyzos2026coder,
  title={Bleyzos Coder},
  author={{Bleyzos AI Team}},
  year={2026},
  howpublished={\url{https://huggingface.co/BleyzosAI/Bleyzos-Coder-Pro}},
}

Contact

For questions or feedback, reach us at coder@bleyzos.com or join our community:


Downloads last month
16
Safetensors
Model size
1T params
Tensor type
F32
·
BF16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support