Instructions to use Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit with MLX:
    # Make sure mlx-lm is installed
    # pip install --upgrade mlx-lm

    # Generate text with mlx-lm
    from mlx_lm import load, generate

    model, tokenizer = load("Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit")

    prompt = "Write a story about Einstein"
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

    text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit with Pi:
Start the MLX server
    # Install MLX LM:
    uv tool install mlx-lm

    # Start a local OpenAI-compatible server:
    mlx_lm.server --model "Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit"
Configure the model in Pi
    # Install Pi:
    npm install -g @mariozechner/pi-coding-agent

    # Add to ~/.pi/agent/models.json:
    {
      "providers": {
        "mlx-lm": {
          "baseUrl": "http://localhost:8080/v1",
          "api": "openai-completions",
          "apiKey": "none",
          "models": [
            { "id": "Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit" }
          ]
        }
      }
    }
Run Pi
    # Start Pi in your project directory:
    pi
- Hermes Agent
How to use Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit with Hermes Agent:
Start the MLX server
    # Install MLX LM:
    uv tool install mlx-lm

    # Start a local OpenAI-compatible server:
    mlx_lm.server --model "Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit"
Configure Hermes
    # Install Hermes:
    curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
    hermes setup

    # Point Hermes at the local server:
    hermes config set model.provider custom
    hermes config set model.base_url http://127.0.0.1:8080/v1
    hermes config set model.default Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit
Run Hermes
hermes
- MLX LM
How to use Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit with MLX LM:
Generate or start a chat session
    # Install MLX LM
    uv tool install mlx-lm

    # Interactive chat REPL
    mlx_lm.chat --model "Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit"
Run an OpenAI-compatible server
    # Install MLX LM
    uv tool install mlx-lm

    # Start the server (listens on port 8080 unless --port is given)
    mlx_lm.server --model "Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit"

    # Call the OpenAI-compatible server with curl
    curl -X POST "http://localhost:8080/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit",
        "messages": [
          {"role": "user", "content": "Hello"}
        ]
      }'
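The same request can be made from Python. The following is a minimal sketch using only the standard library. It assumes the server is reachable at http://localhost:8080/v1 (the address used in the Pi and Hermes configurations above) and that the response follows the usual OpenAI chat-completions shape. The helper names `build_chat_request` and `chat` are illustrative and not part of mlx-lm.

```python
import json
import urllib.request

# Assumption: the server started with mlx_lm.server listens here.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit"


def build_chat_request(user_message, model=MODEL_ID):
    """Build the JSON body for a /chat/completions request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }


def chat(user_message, base_url=BASE_URL):
    """POST one user message and return the assistant's reply text."""
    body = json.dumps(build_chat_request(user_message)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Assumed OpenAI-style layout: the reply is under choices[0].message.content.
    return payload["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. If the server was started on a different port, pass a matching `base_url`.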
Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit
The model Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit was converted to Dynamic MLX format from Qwen/Qwen3-4B-Instruct-2507 using mlx-lm version 0.30.0.
Estimating sensitivities: 80%|████████████████████▉ | 45/56 [2:29:16<31:17, 170.70s/it]
Estimating sensitivities: 96%|█████████████████████████ | 54/56 [2:55:25<05:54, 177.01s/it]
Estimating sensitivities: 57it [3:03:25, 193.08s/it]
Original PPL: 8.204
[INFO] Quantized model with 5.005 bits per weight.
Quantized PPL: 8.506
Peak memory used: 47.648GB
What are MLX Dynamic Quants
MLX dynamic quantization is a compression technique that intelligently allocates different bit-widths to different layers based on their sensitivity to quantization errors. Unlike uniform quantization that applies the same precision to all layers, dynamic quantization analyzes each layer's impact on model output quality and assigns higher precision to more sensitive layers while using lower precision for robust layers.
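To make the idea concrete, here is a minimal sketch of sensitivity-guided bit allocation. It is not mlx-lm's actual implementation: the greedy strategy, the function name `allocate_bits`, the 4-bit and 8-bit choices, the layer sizes, and all scores except the 9.8e-05 quoted later in this card are assumptions for illustration. It also ignores the extra bits that quantization scales and biases add.

```python
def allocate_bits(sensitivities, num_weights, low_bits=4, high_bits=8, target_bpw=5.0):
    """Greedy sketch: start every layer at low_bits, then upgrade the most
    sensitive layers to high_bits while the average stays within target_bpw."""
    total = sum(num_weights.values())
    bits = {name: low_bits for name in sensitivities}
    used = low_bits * total          # bits spent so far
    budget = target_bpw * total      # bits we are allowed to spend

    # Visit layers from most to least sensitive.
    for name in sorted(sensitivities, key=sensitivities.get, reverse=True):
        extra = (high_bits - low_bits) * num_weights[name]
        if used + extra <= budget:
            bits[name] = high_bits
            used += extra
    return bits, used / total


# Hypothetical scores and sizes (only the o_proj score comes from this card).
scores = {
    "layers.12.self_attn.v_proj": 3.0e-02,
    "layers.5.mlp.down_proj": 1.0e-03,
    "layers.7.self_attn.q_proj": 5.0e-04,
    "layers.0.self_attn.o_proj": 9.8e-05,
}
sizes = {name: 1000 for name in scores}

bits, bpw = allocate_bits(scores, sizes)
for name, b in bits.items():
    print(f"{name}: {b}-bit")
print(f"average bits per weight: {bpw:.3f}")
```

In this toy setup only the most sensitive layer fits into the budget for an upgrade, which is the behavior the paragraph above describes: precision goes where quantization error hurts most.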
You may notice that the repository files also include a sensitivites.json. This file contains layer-by-layer sensitivity scores calculated during the analysis phase. Each score indicates how sensitive a particular layer is to quantization: higher scores mark layers that suffer greater quality loss if quantized aggressively.
The sensitivity analysis is the time-consuming part of creating dynamic quants (about three hours in the log above) because the system must test each layer individually with sample data to measure its impact on accuracy.
You can download just this file and create your own dynamic quants without redoing the lengthy sensitivity analysis.
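Before reusing the file, it can help to look at which layers score highest. The sketch below parses the scores and ranks them. The file layout is an assumption: the code accepts either a JSON object mapping layer names to scores or a list of [name, score] pairs, and the helper names and sample values are hypothetical.

```python
import json


def load_sensitivities(text):
    """Parse sensitivity JSON given as {"layer": score} or [["layer", score], ...]."""
    data = json.loads(text)
    pairs = data.items() if isinstance(data, dict) else data
    return {name: float(score) for name, score in pairs}


def rank_layers(scores, top=5):
    """Return the `top` most sensitive layers as (name, score) pairs."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]


# Hypothetical sample in the list-of-pairs layout.
sample = '[["model.layers.0.self_attn.o_proj", 9.8e-05], ["model.layers.12.self_attn.v_proj", 0.03]]'
for name, score in rank_layers(load_sensitivities(sample)):
    print(f"{score:.2e}  {name}")
```

To run this on the downloaded file, read it with `open(...).read()` and pass the text to `load_sensitivities`.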
Perplexity Results Explained
Perplexity (PPL) measures how well a language model predicts text, with lower scores indicating better performance. The evaluation results demonstrate the effectiveness of this approach:
Original Model PPL: 8.204
Quantized Model PPL: 8.506
Despite shrinking the model by ~2 GB, the dynamically quantized version stays close to the original in perplexity. The increase of 0.302 PPL (about 3.7% relative) indicates only a small loss in predictive quality, which is notable given the memory savings.
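The gap between the two numbers can be worked out directly. The short calculation below uses the two PPL values reported above. It relies on the standard definition of perplexity as the exponential of the mean negative log-likelihood per token, and it assumes the evaluation used natural logarithms.

```python
import math

original_ppl = 8.204
quantized_ppl = 8.506

# Absolute and relative change in perplexity.
absolute_increase = quantized_ppl - original_ppl
relative_increase = absolute_increase / original_ppl

# PPL = exp(mean negative log-likelihood), so the difference of the logs
# is the extra loss per token in nats.
extra_nats_per_token = math.log(quantized_ppl) - math.log(original_ppl)

print(f"absolute increase: {absolute_increase:.3f} PPL")
print(f"relative increase: {relative_increase:.1%}")
print(f"extra loss per token: {extra_nats_per_token:.4f} nats")
```

Expressed as loss, the quantized model pays a few hundredths of a nat per token more than the original.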
Why Accuracy Remains High
The key is sensitivity-guided precision allocation. Rather than uniformly degrading all parameters, the system keeps the most sensitive components (e.g., model.layers.12.self_attn.v_proj) at higher precision while compressing less sensitive ones more aggressively (e.g., model.layers.0.self_attn.o_proj, with a score of 9.8e-05). This selective strategy lets the model stay close to its original accuracy, making inference practical on hardware with limited memory.
Model tree for Otilde/Qwen3-4B-Instruct-2507-MLX-Dynamic-5bit
Base model
Qwen/Qwen3-4B-Instruct-2507