Youtu-LLM-2B 4-bit MLX

MLX-optimized 4-bit quantized version of tencent/Youtu-LLM-2B for Apple Silicon.

Quick Start

pip install mlx-lm

mlx_lm.generate \
  --model mlx-community/Youtu-LLM-2B-4bit \
  --prompt "Hello, what can you do?" \
  --max-tokens 100

Model Details

  • Base Model: tencent/Youtu-LLM-2B
  • Parameters: 1.96B
  • Quantization: 4-bit (4.5 bits/weight)
  • Context: 128K tokens
  • Architecture: Dense MLA (Multi-head Latent Attention)
  • Framework: MLX (Apple Silicon optimized)
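
A back-of-envelope check on the quantized footprint, using only the figures listed above (1.96B parameters at an effective 4.5 bits per weight, where the extra 0.5 bits cover per-group scales and biases). Runtime memory is higher because the KV cache and activations come on top of the weights:

```python
# Estimate the on-disk size of the 4-bit quantized weights.
params = 1.96e9          # parameter count from the model details above
bits_per_weight = 4.5    # effective bits/weight for 4-bit MLX quantization
weight_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weight_gb:.2f} GB of quantized weights")  # KV cache/activations add the rest
```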

Performance (M3 Ultra)

Quant   Prompt speed   Generation speed   Memory
bf16    118 tok/s      112 tok/s          4.7 GB
4-bit   202 tok/s      205 tok/s          1.3 GB
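
The gains implied by the table can be computed directly; a quick sketch using the M3 Ultra figures above:

```python
# Generation speedup and memory reduction of the 4-bit build vs. bf16.
gen_bf16, gen_4bit = 112, 205   # generation tok/s from the table above
mem_bf16, mem_4bit = 4.7, 1.3   # memory in GB from the table above
speedup = gen_4bit / gen_bf16
mem_ratio = mem_bf16 / mem_4bit
print(f"{speedup:.2f}x faster generation")  # 1.83x
print(f"{mem_ratio:.1f}x less memory")      # 3.6x
```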

Features

  • Reasoning Mode: emits <think> tags for chain-of-thought traces
  • 128K Context: long-document understanding
  • Agentic: strong results on the SWE-Bench and GAIA benchmarks
  • Edge-friendly: runs on any Apple Silicon Mac
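
Since reasoning mode interleaves a <think> trace with the final answer, callers typically want to separate the two. A minimal sketch, assuming a single <think>...</think> block precedes the answer (the helper name is illustrative, not part of the model's API):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a reasoning-mode completion into (reasoning trace, final answer)."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        # No reasoning block: the whole completion is the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

thinking, answer = split_reasoning("<think>2+2 is 4</think>The answer is 4.")
print(answer)  # -> The answer is 4.
```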

Benchmarks (vs Qwen3-4B)

Benchmark    Youtu-LLM-2B   Qwen3-4B
HumanEval    95.9%          95.4%
SWE-Bench    17.7%           5.7%
GAIA         33.9%          25.5%

Technical Note

Converted using the deepseek_v2 architecture mapping, whose MLA implementation is compatible with this model.

License

See original model license.
