Ling-2.6-flash-base
Ling-2.6-flash-base is the base checkpoint behind the Ling-2.6-flash model. It is a flash-scale Mixture-of-Experts language model retrofitted from the Ling-2.0 base checkpoint with a hybrid linear attention design, continued pre-training, and long-context mid-training.
This release is intended for research, continued pre-training, distillation, and supervised or preference-based fine-tuning. It is not a chat-aligned assistant model. If you want an out-of-the-box instruction model, use the corresponding post-trained Ling-2.6-flash checkpoint instead.
1. Model Overview
Ling-2.6-flash-base targets low-latency, instant-response use and handles long contexts more efficiently than the previous GQA-based Ling-2.0 generation. The core upgrade is a hybrid attention retrofit that combines Lightning Attention layers with Multi-head Latent Attention (MLA) layers in a 7:1 ratio, applied through a smooth migration pipeline from the original architecture.
Ling-2.6 base models are trained on approximately 9.6T tokens across migration pre-training, continued pre-training, and mid-training, with staged context extension from 4K to 256K. Ling-2.6-flash-base serves as the base checkpoint for the post-trained Ling-2.6-flash instant model.
2. Key Features
- Hybrid linear attention architecture combining Lightning Attention and MLA in a 7:1 ratio
- Flash-scale MoE backbone optimized for efficient serving and high token efficiency
- Long-context training pipeline extended to 256K context during mid-training
- Continued pre-training mixture covering agentic data, long-context data, knowledge-rich web data, math, code, and multilingual corpora
- Strong base-model quality across knowledge, math, code, reasoning, and long-context understanding benchmarks
3. Model Summary
| Item | Value |
|---|---|
| Architecture | Fine-grained MoE with hybrid linear attention |
| Parameter scale | Total ~104B, activated ~7.4B |
| Transformer layers | 32 |
| Routed experts per MoE layer | 256 |
| Shared experts per MoE layer | 1 |
| Active routed experts per token | 8 |
| Attention heads | 32 |
| Dense FFN layers | 1 |
| Hidden size | 4096 |
| Dense intermediate size | 9216 |
| Expert intermediate size | 1024 |
| KV LoRA rank | 512 |
| Q LoRA rank | 1536 |
| Layer group size | 8 |
| Positional encoding | Partial RoPE |
| Attention design | Lightning Attention + MLA, 7:1 ratio |
| Training recipe | Migration pre-training + continued pre-training + mid-training |
| Total training tokens | ~9.6T |
| Context training schedule | 4K -> 32K -> 256K |
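
The table implies a few derived quantities that are easy to check by hand. The sketch below works them out, assuming the layer group size of 8 is the unit to which the 7:1 Lightning Attention to MLA ratio applies, and assuming (purely for illustration) that the MLA layer sits last in each group. The card does not state the position of the MLA layer, and the function name `layer_types` is hypothetical.

```python
# Values copied from the Model Summary table.
NUM_LAYERS = 32
LAYER_GROUP_SIZE = 8
ROUTED_EXPERTS = 256
ACTIVE_ROUTED_EXPERTS = 8
TOTAL_PARAMS_B = 104.0   # approximate, in billions
ACTIVE_PARAMS_B = 7.4    # approximate, in billions


def layer_types(num_layers: int = NUM_LAYERS, group_size: int = LAYER_GROUP_SIZE) -> list[str]:
    """Return the attention type of each layer.

    ASSUMPTION: the single MLA layer is the last layer of every group.
    The card only gives the 7:1 ratio and the group size.
    """
    return [
        "mla" if (i + 1) % group_size == 0 else "lightning"
        for i in range(num_layers)
    ]


layout = layer_types()
print(layout.count("lightning"), layout.count("mla"))  # 28 4
print(f"routed experts active per token: {ACTIVE_ROUTED_EXPERTS / ROUTED_EXPERTS:.1%}")  # 3.1%
print(f"parameters active per token: {ACTIVE_PARAMS_B / TOTAL_PARAMS_B:.1%}")  # 7.1%
```

Under these assumptions, 32 layers split into 4 groups, giving 28 Lightning Attention layers and 4 MLA layers.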
4. Training Highlights
Architecture Migration
The model is converted from the Ling-2.0 generation into the Ling-2.6-flash architecture through a multi-stage migration pipeline that includes:
- Lightning Attention conversion
- Linear warmup
- MLA conversion
- MLA warmup
- Full continued pre-training
This retrofit is designed to preserve pre-trained capability while reducing long-context compute cost, KV-cache pressure, and decode latency.
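
To make the KV-cache claim concrete, here is a rough back-of-the-envelope estimate, not a measurement. It compares a GQA cache, which stores keys and values in every layer, with a hybrid cache in which only the MLA layers store a per-token entry. The KV LoRA rank (512) and layer count (32) come from the Model Summary table. Everything else is an assumption made for illustration: 4 KV heads and a head dimension of 128 for the GQA baseline, a RoPE key dimension of 64 for MLA, 4 MLA layers (32 layers at 7:1), and bf16 storage. The function names are hypothetical.

```python
def gqa_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Per-sequence KV cache for GQA: keys and values in every layer."""
    return tokens * layers * 2 * kv_heads * head_dim * bytes_per_elem


def hybrid_cache_bytes(tokens, mla_layers, kv_lora_rank, rope_dim, bytes_per_elem=2):
    """Length-dependent cache for the hybrid design.

    Only MLA layers keep a per-token entry (compressed latent plus RoPE key part).
    Lightning Attention layers hold a fixed-size recurrent state, ignored here
    because it does not grow with sequence length.
    """
    return tokens * mla_layers * (kv_lora_rank + rope_dim) * bytes_per_elem


TOKENS = 256 * 1024  # 256K context
GIB = 1024 ** 3

# ASSUMED baseline shape (not from the card): 4 KV heads, head_dim 128.
gqa = gqa_cache_bytes(TOKENS, layers=32, kv_heads=4, head_dim=128)
# kv_lora_rank is from the card; mla_layers and rope_dim are assumptions.
hybrid = hybrid_cache_bytes(TOKENS, mla_layers=4, kv_lora_rank=512, rope_dim=64)

print(f"GQA baseline (assumed shape): {gqa / GIB:.2f} GiB per sequence")
print(f"Hybrid, MLA layers only:      {hybrid / GIB:.2f} GiB per sequence")
print(f"ratio: {gqa / hybrid:.1f}x")
```

The absolute numbers depend entirely on the assumed shapes. The point of the sketch is the structure of the saving: fewer layers carry a per-token cache, and each cached entry is a compressed latent instead of full keys and values.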
Data Mixture
The continued pre-training and mid-training stages include:
- Agentic corpus built from tool-use and coding environments
- Long-context corpus covering mathematics, web parsing, summarization, retrieval, and multi-hop reasoning
- General web knowledge data with targeted STEM and factual augmentation
- Math and code corpora
- Multilingual data spanning 21 languages
5. Base Model Evaluation
The following numbers are selected from the technical report and reflect base-model evaluation rather than chat-aligned or instruction-tuned performance.
| Benchmark | Ling-2.0-flash-base | Ling-2.6-flash-base |
|---|---|---|
| MMLU | 82.98 | 84.13 |
| MMLU-Pro | 60.73 | 61.36 |
| GPQA | 35.35 | 37.88 |
| SimpleQA | 10.01 | 18.33 |
| C-SimpleQA | 49.43 | 63.53 |
| MMMLU | 62.76 | 64.76 |
| GSM8K | 90.60 | 91.89 |
| OmniMath | 28.30 | 29.90 |
| HumanEval-Plus | 83.54 | 81.10 |
| LiveCodeBench | 30.40 | 33.48 |
| BIRD-SQL | 38.69 | 38.40 |
| BBH | 84.82 | 85.06 |
| AutoLogic | 61.10 | 62.82 |
| LEval | 73.41 | 77.86 |
| LongBenchv2 | 33.40 | 34.19 |
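Ling-2.6-flash-base improves on Ling-2.0-flash-base in 13 of the 15 benchmarks above. The largest gains are on factual knowledge (SimpleQA +8.32, C-SimpleQA +14.10) and long-context understanding (LEval +4.45), with smaller gains on reasoning benchmarks. HumanEval-Plus (-2.44) and BIRD-SQL (-0.29) regress slightly.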
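
The per-benchmark differences can be recomputed directly from the table above. This small script is only a convenience for readers who want to sort or filter the deltas. The scores are copied from the table, and the variable names are arbitrary.

```python
# (benchmark, Ling-2.0-flash-base, Ling-2.6-flash-base), copied from the table above.
SCORES = [
    ("MMLU", 82.98, 84.13),
    ("MMLU-Pro", 60.73, 61.36),
    ("GPQA", 35.35, 37.88),
    ("SimpleQA", 10.01, 18.33),
    ("C-SimpleQA", 49.43, 63.53),
    ("MMMLU", 62.76, 64.76),
    ("GSM8K", 90.60, 91.89),
    ("OmniMath", 28.30, 29.90),
    ("HumanEval-Plus", 83.54, 81.10),
    ("LiveCodeBench", 30.40, 33.48),
    ("BIRD-SQL", 38.69, 38.40),
    ("BBH", 84.82, 85.06),
    ("AutoLogic", 61.10, 62.82),
    ("LEval", 73.41, 77.86),
    ("LongBenchv2", 33.40, 34.19),
]

# Absolute point difference, rounded to the table's two decimals.
deltas = {name: round(new - old, 2) for name, old, new in SCORES}
improved = [name for name, d in deltas.items() if d > 0]
regressed = [name for name, d in deltas.items() if d < 0]

# Print benchmarks from largest gain to largest regression.
for name, d in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<15} {d:+.2f}")

print(f"{len(improved)} improved, {len(regressed)} regressed: {regressed}")
```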
6. Intended Use
Recommended use cases:
- Continued pre-training
- Supervised fine-tuning for domain adaptation
- Preference optimization and RL post-training
- Distillation research
- Long-context and MoE systems research
Not recommended as-is for:
- Direct end-user chat deployment
- Safety-critical applications without additional alignment and evaluation
- Production use without post-training and task-specific validation
7. Limitations
- This is a base model and is not instruction-aligned.
- Outputs may be inaccurate, biased, incomplete, or unsafe without additional post-training.
- Long-context quality depends on the serving stack, positional scaling configuration, and prompt format used at inference time.
- The training mixture includes web-scale and synthetic data, so the model may reproduce factual errors or undesirable artifacts.
- Benchmark results in the technical report were collected under controlled internal evaluation settings and are not a guarantee of downstream production behavior.
8. Relationship to Other Releases
- Ling-2.6-flash: the post-trained model derived from this base, optimized for instruction following and instant responses.
If your goal is interactive assistant use rather than research on base checkpoints, the post-trained Ling-2.6-flash model is usually the better starting point.
9. Usage
This is a base checkpoint. The example below illustrates the loading pattern only.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ling-2.6-flash-base"

# The architecture ships as remote code, so trust_remote_code is required.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",  # place weights across the available devices
)

# A base model continues the prompt as text; it is not tuned to follow instructions.
prompt = "Summarize the benefits of hybrid linear attention."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding for a deterministic continuation.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
For production inference, prefer serving stacks that support the released architecture and remote code path.
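
Because this checkpoint is not instruction-aligned, a bare instruction may simply be continued as text. Few-shot, completion-style prompts are a common way to steer a base model. The helper below is a minimal sketch of that pattern. The `Question:`/`Answer:` template and the function name `build_few_shot_prompt` are hypothetical choices for illustration, not a format prescribed by this release.

```python
def build_few_shot_prompt(examples, query):
    """Format (question, answer) pairs followed by an open-ended final question.

    The "Question:/Answer:" template is a hypothetical choice for illustration.
    """
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    # Leave the last answer open so the model completes it.
    blocks.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(blocks)


examples = [
    ("What does MoE stand for?", "Mixture of Experts."),
    ("What does a KV cache store?", "Keys and values of previous tokens."),
]
prompt = build_few_shot_prompt(examples, "What does RoPE stand for?")
print(prompt)
```

The resulting string can be passed as `prompt` in the loading example above. Keeping `max_new_tokens` small, or truncating the output at the next `Question:`, helps stop the model from inventing further question and answer pairs.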
10. License
This model is released under the MIT License.