Instructions for using OsaurusAI/Ling-2.6-flash-MXFP4 with libraries and local apps.

Libraries

How to use OsaurusAI/Ling-2.6-flash-MXFP4 with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use OsaurusAI/Ling-2.6-flash-MXFP4 with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "OsaurusAI/Ling-2.6-flash-MXFP4"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "OsaurusAI/Ling-2.6-flash-MXFP4"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use OsaurusAI/Ling-2.6-flash-MXFP4 with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "OsaurusAI/Ling-2.6-flash-MXFP4"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default OsaurusAI/Ling-2.6-flash-MXFP4

Run Hermes

hermes

MLX LM

How to use OsaurusAI/Ling-2.6-flash-MXFP4 with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "OsaurusAI/Ling-2.6-flash-MXFP4"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "OsaurusAI/Ling-2.6-flash-MXFP4"
# Call the OpenAI-compatible server with curl (mlx_lm.server listens on port 8080 by default)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "OsaurusAI/Ling-2.6-flash-MXFP4",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'



	---
	license: mit
	tags:
	- moe
	- mixture-of-experts
	- hybrid-attention
	- mla
	- lightning-attention
	- mxfp4
	- osaurus
	- mlx
	- bailing
	- ling
	- apple-silicon
	base_model: inclusionAI/Ling-2.6-flash
	pipeline_tag: text-generation
	library_name: mlx
	---

	<p align="center"><img src="osaurus-x-banner.png" width="100%"/></p>

	# Ling-2.6-flash-MXFP4

	A ~103B-A8B hybrid MoE at 63 GB on disk (down from the 200 GB bf16 source),
	using stock 4-bit affine quantization of inclusionAI's Bailing-V2.5 hybrid
	architecture. It loads via `mlx_lm.load()` with the `bailing_hybrid` model
	class; no TurboQuant runtime or sidecar is required.

	- Source: [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash)
	(Ant Group's Bailing-V2.5 hybrid: 32 layers MLA + Lightning-Linear-Attention,
	256 experts top-8, MTP head, 131K context)
	- Quantization: MXFP4 — every weight (routed experts, attention,
	shared experts, dense MLP, embed, lm_head) at **4-bit affine
	group_size=32**. Norms, router gates, expert biases, and slopes stay
	fp16/fp32 passthrough.
	- Bundle size: 63 GB on-disk across 51 shards
	- Runs on: M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio
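
As a rough sanity check on the sizes quoted above, the arithmetic can be sketched as follows. This assumes that the 4-bit affine format stores one 16-bit scale and one 16-bit bias per group of 32 weights, and it treats all ~103B parameters as quantized, ignoring the small fp16/fp32 passthrough tensors. The helper names are illustrative and not part of any library.

```python
def effective_bits_per_weight(bits: int, group_size: int,
                              scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Bits per weight including per-group scale and bias overhead (assumed 16-bit each)."""
    return bits + (scale_bits + bias_bits) / group_size


def size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 103e9  # approximate, taken from the "~103B" figure on this card

bpw = effective_bits_per_weight(bits=4, group_size=32)
print(f"effective bits per weight: {bpw}")
print(f"estimated 4-bit size: {size_gb(NUM_PARAMS, bpw):.1f} GB")
print(f"estimated bf16 size: {size_gb(NUM_PARAMS, 16):.1f} GB")
```

Under these assumptions the estimates come out at roughly 64 GB and 206 GB, close to the quoted 63 GB and 200 GB. The remaining gap is expected because the parameter count is only approximate.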

	## Why two variants?

	| | JANGTQ2 | MXFP4 |
	|---|---|---|
	| Routed experts | 2-bit MXTQ codebook (Hadamard + Lloyd-Max) | 4-bit affine |
	| Attention / shared / dense | 8-bit affine | 4-bit affine |
	| Bundle size | 30 GB | 63 GB |
	| Quality | tighter (8-bit attention) | uniform 4-bit |
	| Loader | `jang_tools.load_jangtq` (TurboQuant kernel) | stock `mlx_lm.load()` |
	| Sidecar | required | not needed |
	| Min RAM | 64 GB | 96 GB |

	JANGTQ2 accepts a cheap-but-slow MXTQ codec on the routed experts in exchange
	for a tighter overall bit budget. MXFP4 is the simpler
	"just-works-with-stock-MLX" option for users who don't want the TurboQuant
	runtime in their stack.

	## Architecture (`bailing_hybrid`)

	Hybrid attention: every 8th layer (4 of 32) uses full softmax MLA, and the
	other 28 use Lightning-Linear-Attention. The model also has a Multi-Token
	Prediction (MTP) head.

	| Layer block | Count | Attention | MLP |
	|---|---|---|---|
	| Layer 0 | 1 | Linear (GLA) | Dense MLP (intermediate=9216) |
	| Layers 1–6, 8–14, 16–22, 24–30 | 27 | Linear (GLA) | MoE (256+1) |
	| Layers 7, 15, 23, 31 | 4 | MLA (full softmax) | MoE (256+1) |
	| MTP head (32) | 1 | MLA | MoE (256+1) |

	See the [JANGTQ variant card](https://huggingface.co/JANGQ-AI/Ling-2.6-flash-JANGTQ)
	for the deeper architecture writeup.
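
The layer layout in the table above can be reproduced with a few lines of Python. This is only a sketch of the pattern as described on this card (every 8th layer is MLA, layer 0 has a dense MLP, the rest are MoE). The function and constant names are hypothetical, and the actual model config may encode the layout differently.

```python
NUM_LAYERS = 32
MLA_EVERY = 8  # every 8th layer uses full softmax MLA


def layer_plan(num_layers: int = NUM_LAYERS, mla_every: int = MLA_EVERY):
    """Return (index, attention, mlp) for each decoder layer, excluding the MTP head."""
    plan = []
    for i in range(num_layers):
        attention = "mla" if (i + 1) % mla_every == 0 else "linear"
        mlp = "dense" if i == 0 else "moe"
        plan.append((i, attention, mlp))
    return plan


plan = layer_plan()
mla_layers = [i for i, attn, _ in plan if attn == "mla"]
linear_count = sum(1 for _, attn, _ in plan if attn == "linear")

print(mla_layers)    # [7, 15, 23, 31]
print(linear_count)  # 28
```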

	## Loading (Python)

	```bash
	pip install mlx-lm jang-tools
	```

	```python
	from mlx_lm import load, generate
	model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")
	```

	Stock `mlx_lm.load()` works once `mlx_lm/models/bailing_hybrid.py` is
	present (shipped with `jang-tools >= TBD`). The bundle's
	`configuration_bailing_moe_v2_5.py` and `modeling_bailing_moe_v2_5.py`
	provide HF compatibility for tooling that goes through transformers.
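
Before loading, it can be useful to check whether the `bailing_hybrid` model class is actually importable in the current environment. The sketch below assumes that the file path `mlx_lm/models/bailing_hybrid.py` corresponds to the import name `mlx_lm.models.bailing_hybrid`; the `has_module` helper is illustrative and uses only the standard library.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` can be located, without importing the module itself."""
    try:
        # For dotted names, find_spec imports the parent packages and raises
        # ModuleNotFoundError (an ImportError) if one of them is missing.
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


if __name__ == "__main__":
    # Assumed import name, derived from the file path mentioned above.
    if has_module("mlx_lm.models.bailing_hybrid"):
        print("bailing_hybrid is available; mlx_lm.load() should find the model class")
    else:
        print("bailing_hybrid not found; check the mlx-lm and jang-tools installation")
```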

	## Reasoning + tools

	Default is `detailed thinking off`. To enable:

	```python
	messages = [
	    {"role": "system", "content": "detailed thinking on"},
	    {"role": "user", "content": "..."},
	]
	```

	When thinking is on, the model emits `<think>...</think>` reasoning blocks
	before its answers. Tool calls use the DeepSeek-style format.
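
When thinking is enabled, downstream code usually wants the reasoning and the final answer separately. A minimal sketch of that post-processing is shown here, assuming the output contains well-formed `<think>...</think>` blocks as described above. `build_messages` and `split_thinking` are hypothetical helpers, not part of `mlx_lm`.

```python
import re

# Non-greedy match so multiple think blocks are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def build_messages(user_prompt: str, thinking: bool = False) -> list:
    """Build a chat message list, adding the system switch only when thinking is requested."""
    messages = []
    if thinking:
        messages.append({"role": "system", "content": "detailed thinking on"})
    messages.append({"role": "user", "content": user_prompt})
    return messages


def split_thinking(text: str) -> tuple:
    """Split generated text into (reasoning, answer)."""
    reasoning = "\n".join(block.strip() for block in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


sample = "<think>2 + 2 is 4.</think>\nThe answer is 4."
reasoning, answer = split_thinking(sample)
print(reasoning)  # 2 + 2 is 4.
print(answer)     # The answer is 4.
```

The message list from `build_messages` can be passed to `tokenizer.apply_chat_template` in the same way as in the loading example.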

	## Credits

	- Quantization + MLX runtime: Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
	- Base model: [inclusionAI](https://huggingface.co/inclusionAI) — Ant
	Group's Bailing team
	- Architecture references: Lightning-Attention-2 (arXiv:2401.04658),
	DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3
	- Osaurus: [osaurus.ai](https://osaurus.ai) — Apple-Silicon-first
	inference for open-weight LLMs.