Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit

The model Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit was converted to dynamic MLX format from Qwen/Qwen3-4B-Instruct-2507 using mlx-lm version 0.30.0.

Estimating sensitivities:  80%|████████████████████▉     | 45/56 [2:29:16<31:17, 170.70s/it]
Estimating sensitivities:  96%|█████████████████████████ | 54/56 [2:55:25<05:54, 177.01s/it]
Estimating sensitivities: 57it [3:03:25, 193.08s/it]
Original PPL: 8.204
[INFO] Quantized model with 5.005 bits per weight.
Quantized PPL: 8.506
Peak memory used: 47.648GB

What Are MLX Dynamic Quants?

MLX dynamic quantization is a compression technique that intelligently allocates different bit-widths to different layers based on their sensitivity to quantization errors. Unlike uniform quantization that applies the same precision to all layers, dynamic quantization analyzes each layer's impact on model output quality and assigns higher precision to more sensitive layers while using lower precision for robust layers.
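The idea can be sketched as a simple mapping from sensitivity scores to bit-widths. The thresholds and bit choices below are illustrative assumptions, not the actual policy used to produce this model:

```python
def bits_for_layer(sensitivity: float, low: float = 1e-4, high: float = 1e-3) -> int:
    """Map a layer's sensitivity score to a bit-width.

    Thresholds are illustrative only; the real quantizer chooses them
    to hit an overall bits-per-weight budget (here, ~5.005).
    """
    if sensitivity >= high:
        return 8   # very sensitive: keep high precision
    if sensitivity >= low:
        return 5   # moderately sensitive: use the target precision
    return 4       # robust layer: compress aggressively

# A robust layer versus a sensitive one:
bits_for_layer(9.8e-05)  # low-sensitivity layer -> 4 bits
bits_for_layer(2.5e-03)  # high-sensitivity layer -> 8 bits
```

The averaged result across all layers is what yields a fractional figure like "5.005 bits per weight" in the log above.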

You may notice that the repository also includes a sensitivites.json file. It contains per-layer sensitivity scores calculated during a comprehensive analysis phase. Each score indicates how sensitive a particular layer is to quantization: higher scores mark layers that suffer greater quality loss if quantized aggressively.

The sensitivity analysis is the time-consuming part of creating dynamic quants (as the multi-hour log above shows), because the system must test each layer individually against sample data to measure its impact on accuracy.

You can download just this file and create your own dynamic quants without redoing the lengthy sensitivity analysis.
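As a minimal sketch of reusing the saved scores, the snippet below builds a per-layer bit plan from a JSON file mapping layer names to sensitivity values. Both the file schema and the quartile-based policy are assumptions for illustration; the actual format and allocation strategy used by mlx-lm may differ:

```python
import json

def load_bit_plan(path: str, target_bits: int = 5) -> dict:
    """Build a per-layer bit plan from saved sensitivity scores.

    Assumed schema: {layer_name: sensitivity_score}. The policy here
    (top quarter -> 6 bits, bottom quarter -> 4 bits, rest -> target)
    is a hypothetical example, not the one used for this model.
    """
    with open(path) as f:
        scores = json.load(f)
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    plan = {}
    for i, name in enumerate(ranked):
        if i < n // 4:
            plan[name] = 6            # most sensitive layers
        elif i >= n - n // 4:
            plan[name] = 4            # most robust layers
        else:
            plan[name] = target_bits  # everything else
    return plan
```

Starting from precomputed scores, building such a plan takes seconds instead of the hours the analysis itself requires.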

Perplexity Results Explained

Perplexity (PPL) measures how well a language model predicts text, with lower scores indicating better performance. The evaluation results demonstrate the effectiveness of this approach:

Original Model PPL: 8.204

Quantized Model PPL: 8.506

Despite shrinking the model by roughly 2 GB, the dynamic quantized version achieves nearly identical perplexity to the original. The small increase (+0.302 PPL, about 3.7% relative) means virtually no loss in language understanding, which is especially impressive given the memory savings.
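For reference, perplexity is the exponential of the mean per-token negative log-likelihood over the evaluation text. A minimal sketch of the computation:

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """PPL = exp(mean negative log-likelihood over evaluated tokens)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Sanity check: a uniform per-token NLL of ln(8.204)
# reproduces the original model's reported PPL of 8.204.
perplexity([math.log(8.204)] * 100)
```

Because the score is exponentiated, small PPL gaps correspond to very small differences in average per-token log-likelihood.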

Why Accuracy Remains High

The secret lies in the sensitivity-guided precision allocation. Rather than uniformly degrading all parameters, the system protects the most critical neural pathways (e.g., model.layers.12.self_attn.v_proj) with higher precision while compressing less sensitive components more aggressively (e.g., model.layers.0.self_attn.o_proj, sensitivity 9.8e-05). This selective preservation strategy lets the model maintain near-original accuracy, making high-performance inference accessible on hardware with limited memory.

