# Youtu-LLM-2B 4-bit MLX

MLX-optimized 4-bit quantized version of `tencent/Youtu-LLM-2B` for Apple Silicon.
## Quick Start

```bash
pip install mlx-lm

mlx_lm.generate \
  --model mlx-community/Youtu-LLM-2B-4bit \
  --prompt "Hello, what can you do?" \
  --max-tokens 100
```
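Besides the CLI, mlx-lm also exposes a Python API. A minimal sketch using the same model id and prompt as above (runs on Apple Silicon; the weights are downloaded from the Hub on first use):

```python
# Python equivalent of the CLI call above, via the mlx-lm API.
from mlx_lm import load, generate

# Fetches and caches the quantized weights on first call.
model, tokenizer = load("mlx-community/Youtu-LLM-2B-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Hello, what can you do?",
    max_tokens=100,
)
print(text)
```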
## Model Details
- Base Model: tencent/Youtu-LLM-2B
- Parameters: 1.96B
- Quantization: 4-bit (4.5 bits/weight)
- Context: 128K tokens
- Architecture: Dense MLA (Multi-head Latent Attention)
- Framework: MLX (Apple Silicon optimized)
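The 4.5 bits/weight figure is the effective rate including the per-group scales that MLX group quantization stores alongside the raw 4-bit values. A quick sanity check of the resulting weight size (rough estimate; runtime memory adds activations and cache on top):

```python
# Rough 4-bit weight-size estimate: params x effective bits per weight.
params = 1.96e9          # Youtu-LLM-2B parameter count
bits_per_weight = 4.5    # effective rate, including quantization scales

size_gb = params * bits_per_weight / 8 / 1e9
print(f"{size_gb:.2f} GB")  # ~1.10 GB, consistent with the ~1.2GB download
```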
## Performance (M3 Ultra)
| Quant | Prompt | Generation | Memory |
|---|---|---|---|
| bf16 | 118 tok/s | 112 tok/s | 4.7GB |
| 4-bit | 202 tok/s | 205 tok/s | 1.3GB |
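The ratios implied by the table, computed directly from its numbers:

```python
# Speedups and memory savings implied by the M3 Ultra measurements above.
bf16 = {"prompt": 118, "gen": 112, "mem_gb": 4.7}
q4   = {"prompt": 202, "gen": 205, "mem_gb": 1.3}

print(f"prompt speedup: {q4['prompt'] / bf16['prompt']:.2f}x")  # 1.71x
print(f"gen speedup:    {q4['gen'] / bf16['gen']:.2f}x")        # 1.83x
print(f"memory ratio:   {bf16['mem_gb'] / q4['mem_gb']:.1f}x")  # 3.6x
```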
## Features

- Reasoning Mode: uses `<think>` tags for chain-of-thought output
- 128K Context: long-document understanding
- Agentic: strong on SWE-Bench and GAIA benchmarks
- Edge-friendly: runs on any Apple Silicon Mac
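In reasoning mode you typically want to separate the `<think>` block from the final answer. A hypothetical helper (the function name is ours, and it assumes the reasoning is wrapped in `<think>...</think>` as described above; the exact output format may vary):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a reasoning-mode completion into (chain-of-thought, answer).

    Assumes the model wraps its reasoning in <think>...</think> tags.
    Returns an empty thought if no tags are present.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    thought = match.group(1).strip()
    answer = text[match.end():].strip()
    return thought, answer

thought, answer = split_reasoning("<think>2+2 is 4</think>The answer is 4.")
print(answer)  # The answer is 4.
```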
## Benchmarks (vs Qwen3-4B)
| Benchmark | Youtu-LLM-2B | Qwen3-4B |
|---|---|---|
| HumanEval | 95.9% | 95.4% |
| SWE-Bench | 17.7% | 5.7% |
| GAIA | 33.9% | 25.5% |
## Other Quantizations
- Full precision (4.4GB)
- 4-bit (1.2GB)
## Technical Note

Converted using the `deepseek_v2` architecture mapping, whose MLA implementation is compatible with this model.
## License