---
license: mit
tags:
- moe
- mixture-of-experts
- hybrid-attention
- mla
- lightning-attention
- mxfp4
- osaurus
- mlx
- bailing
- ling
- apple-silicon
base_model: inclusionAI/Ling-2.6-flash
pipeline_tag: text-generation
library_name: mlx
---
<p align="center"><img src="osaurus-x-banner.png" width="100%"/></p>

# Ling-2.6-flash-MXFP4
**~103B-A8B hybrid MoE, 63 GB on disk** (down from the 200 GB bf16 source), with
**stock 4-bit affine** quantization on inclusionAI's Bailing-V2.5 hybrid
architecture. Loads via `mlx_lm.load()` with the `bailing_hybrid` model
class: no TurboQuant runtime, no sidecar required.
- **Source:** [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash)
  (Ant Group's Bailing-V2.5 hybrid: 32 layers MLA + Lightning-Linear-Attention,
  256 experts top-8, MTP head, 131K context)
- **Quantization:** MXFP4, with every weight (routed experts, attention,
  shared experts, dense MLP, embed, lm_head) at **4-bit affine
  group_size=32**. Norms, router gates, expert biases, and slopes stay
  fp16/fp32 passthrough.
- **Bundle size:** **63 GB on-disk** across 51 shards
- **Runs on:** M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio
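
The bundle size can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes (this is not stated in the card) that each group of 32 weights stores one 16-bit scale and one 16-bit bias, that all ~103B parameters are quantized, and that 1 GB = 10^9 bytes; `quantized_size_gb` is a hypothetical helper, not part of any library.

```python
def quantized_size_gb(n_params: float, bits: int, group_size: int,
                      scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Approximate size in GB (10^9 bytes) of affine-quantized weights."""
    # Assumption: every group of `group_size` weights shares one scale and one bias.
    overhead_bits_per_weight = (scale_bits + bias_bits) / group_size
    bits_per_weight = bits + overhead_bits_per_weight
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 103e9  # "~103B" from the card; approximate

bf16_gb = N_PARAMS * 16 / 8 / 1e9
q4_gb = quantized_size_gb(N_PARAMS, bits=4, group_size=32)

print(f"bf16: ~{bf16_gb:.0f} GB, 4-bit group_size=32: ~{q4_gb:.0f} GB")
```

Under these assumptions the effective cost is 5 bits per weight, which gives roughly 64 GB against roughly 206 GB for bf16, in line with the 63 GB and 200 GB figures quoted above. The passthrough tensors (norms, router gates, biases, slopes) are ignored here because they are a small fraction of the total.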
## Why two variants?

| | JANGTQ2 | MXFP4 |
|---|---|---|
| Routed experts | 2-bit MXTQ codebook (Hadamard + Lloyd-Max) | 4-bit affine |
| Attention / shared / dense | 8-bit affine | 4-bit affine |
| Bundle size | 30 GB | 63 GB |
| Quality | tighter (8-bit attention) | uniform 4-bit |
| Loader | `jang_tools.load_jangtq` (TurboQuant kernel) | stock `mlx_lm.load()` |
| Sidecar | required | not needed |
| Min RAM | 64 GB | 96 GB |

JANGTQ2 accepts a cheap-but-slow MXTQ codec on the routed experts in exchange
for a tighter overall bit budget. MXFP4 is the simpler "just-works-with-stock-MLX"
option for users who don't want the TurboQuant runtime in their stack.
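
As a quick illustration of how the table maps to a hardware decision, the sketch below lists which variants fit a given amount of unified memory. The minimum-RAM figures are copied from the table; the function name and the `stock_mlx_only` flag are hypothetical, and the check ignores memory needed by other applications or long contexts.

```python
# Minimum RAM per variant, taken from the comparison table.
MIN_RAM_GB = {"JANGTQ2": 64, "MXFP4": 96}

# Only MXFP4 loads through stock mlx_lm.load(); JANGTQ2 needs the TurboQuant loader.
STOCK_MLX = {"JANGTQ2": False, "MXFP4": True}


def variants_that_fit(ram_gb: int, stock_mlx_only: bool = False) -> list[str]:
    """Return the variants whose minimum RAM is satisfied by `ram_gb`."""
    return [
        name
        for name, min_ram in MIN_RAM_GB.items()
        if ram_gb >= min_ram and (STOCK_MLX[name] or not stock_mlx_only)
    ]


print(variants_that_fit(64))
print(variants_that_fit(128, stock_mlx_only=True))
```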
## Architecture (`bailing_hybrid`)

Hybrid attention: every 8th layer is full softmax MLA, and the other 28 of 32
are Lightning-Linear-Attention. The model also has a Multi-Token Prediction
(MTP) head.

| Layer block | Count | Attention | MLP |
|---|---|---|---|
| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
| Layers 1-6, 8-14, 16-22, 24-30 | 27 | **Linear (GLA)** | MoE (256+1) |
| Layers 7, 15, 23, 31 | 4 | **MLA** (full softmax) | MoE (256+1) |
| MTP head (32) | 1 | MLA | MoE (256+1) |

See the [JANGTQ variant card](https://huggingface.co/JANGQ-AI/Ling-2.6-flash-JANGTQ)
for the deeper architecture writeup.
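
The layer layout in the table can be reproduced with a few lines of Python. This is only an illustrative sketch of the pattern described above (MLA on every 8th layer, dense MLP on layer 0, MoE elsewhere); the constant and function names are made up here and are not how `bailing_hybrid.py` reads its config.

```python
NUM_LAYERS = 32           # decoder layers, excluding the MTP head
FULL_ATTENTION_EVERY = 8  # every 8th layer uses full softmax MLA


def layer_plan(num_layers: int = NUM_LAYERS,
               every: int = FULL_ATTENTION_EVERY) -> list[tuple[int, str, str]]:
    """Return (index, attention type, MLP type) for each decoder layer."""
    plan = []
    for i in range(num_layers):
        # Layers 7, 15, 23, 31 are the "every 8th" layers (0-based indices).
        attention = "MLA" if (i + 1) % every == 0 else "Linear (GLA)"
        # Only the first layer uses a dense MLP; the rest are MoE blocks.
        mlp = "Dense MLP" if i == 0 else "MoE (256+1)"
        plan.append((i, attention, mlp))
    return plan


plan = layer_plan()
mla_layers = [i for i, attention, _ in plan if attention == "MLA"]
linear_count = sum(1 for _, attention, _ in plan if attention == "Linear (GLA)")
print(mla_layers, linear_count)
```

The resulting counts match the table: 4 MLA layers and 28 linear-attention layers, of which one (layer 0) has a dense MLP.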
## Loading (Python)

```bash
pip install mlx-lm jang-tools
```

```python
from mlx_lm import load, generate
model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")
```

Stock `mlx_lm.load()` works once `mlx_lm/models/bailing_hybrid.py` is
present (shipped with `jang-tools >= TBD`). The bundle's
`configuration_bailing_moe_v2_5.py` and `modeling_bailing_moe_v2_5.py`
provide HF compatibility for tooling that goes through transformers.
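
Once the model loads, generation follows the usual `mlx-lm` pattern of applying the chat template and calling `generate`. The sketch below wraps that in two helpers whose names (`build_messages`, `generate_reply`) are illustrative only; it assumes an Apple Silicon machine with enough memory and that the `bailing_hybrid` model class is available as described above. `mlx_lm` is imported lazily so the message helper can be used without it.

```python
REPO_ID = "OsaurusAI/Ling-2.6-flash-MXFP4"


def build_messages(user_prompt: str) -> list[dict]:
    """Wrap a single user prompt in the chat-message format."""
    return [{"role": "user", "content": user_prompt}]


def generate_reply(user_prompt: str, repo_id: str = REPO_ID) -> str:
    """Load the model and generate one reply (downloads the bundle on first use)."""
    # Imported here so the module can be imported on machines without MLX.
    from mlx_lm import load, generate

    model, tokenizer = load(repo_id)
    prompt = tokenizer.apply_chat_template(
        build_messages(user_prompt), add_generation_prompt=True
    )
    return generate(model, tokenizer, prompt=prompt, verbose=True)
```

Calling `generate_reply("Write a story about Einstein")` streams the output to the terminal (because of `verbose=True`) and returns the generated text.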
## Reasoning + tools

Default is **`detailed thinking off`**. To enable:

```python
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "..."},
]
```

The model emits `<think>...</think>` reasoning blocks before answers when
thinking is on. Tool calls use a DeepSeek-style format.
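
For downstream code it is often convenient to toggle thinking per request and to separate the reasoning block from the final answer. The following is a minimal sketch under the assumption that the output contains at most one `<think>...</think>` block; `build_chat` and `split_thinking` are hypothetical helper names, not part of `mlx-lm`.

```python
import re

# Matches one <think>...</think> block, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def build_chat(user_prompt: str, thinking: bool = False) -> list[dict]:
    """Build chat messages, adding the system switch only when thinking is requested."""
    messages = []
    if thinking:
        messages.append({"role": "system", "content": "detailed thinking on"})
    messages.append({"role": "user", "content": user_prompt})
    return messages


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no think block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_thinking("<think>2 + 2 = 4</think>\nThe answer is 4.")
print(reasoning)
print(answer)
```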
## Credits

- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Base model:** [inclusionAI](https://huggingface.co/inclusionAI), Ant
  Group's Bailing team
- **Architecture references:** Lightning-Attention-2 (arXiv:2401.04658),
  DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3
- **Osaurus:** [osaurus.ai](https://osaurus.ai), Apple-Silicon-first
  inference for open-weight LLMs.