witcheer posted an update about 6 hours ago
updated my MoE offload bench dataset + collection.

>>> previous finding: Qwen3.6-35B-A3B via full expert offload on RTX 4060 Ti 8GB + 32GB RAM → 7.4 tok/sec. RAM-ceilinged, disk-bound.

>>> new finding: built llama.cpp from source inside WSL2 and swept -ncmoe values for partial expert offload.
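the sweep is just re-running the same generation at different -ncmoe values and reading the tok/sec line llama.cpp prints at the end of each run. a minimal sketch of that kind of sweep, assuming a source build under ./build/bin, a placeholder GGUF path and prompt, and the -ncmoe flag named above (long form --n-cpu-moe in recent builds):

```python
# minimal sweep sketch: re-run the same generation at several -ncmoe values and
# read the tok/sec timing llama.cpp prints at the end of each run.
# assumptions: source build under ./build/bin, placeholder GGUF path and prompt,
# and the -ncmoe flag mentioned in the post (long form --n-cpu-moe).
import subprocess

MODEL = "models/qwen-moe.gguf"   # placeholder path, not the real filename
CTX = 16384

for ncmoe in (32, 30, 28):
    cmd = [
        "./build/bin/llama-cli",
        "-m", MODEL,
        "-ngl", "99",            # offload all layers to GPU first...
        "-ncmoe", str(ncmoe),    # ...then keep this many MoE expert blocks on CPU
        "-c", str(CTX),          # context size
        "-n", "256",             # tokens to generate
        "-p", "benchmark prompt goes here",
    ]
    print(f"--- ncmoe={ncmoe}, ctx={CTX} ---", flush=True)
    subprocess.run(cmd, check=True)  # timings (tok/sec) land on stderr
```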


ncmoe 32, 16K ctx → 29.7 tok/sec
ncmoe 30, 16K ctx → 32.0 tok/sec
ncmoe 30, 32K ctx → 35.4 tok/sec
ncmoe 28, 16K ctx → 16.3 tok/sec (VRAM cliff)
ncmoe 30, 65K ctx → 17.4 tok/sec (VRAM cliff)


4.8x faster than full expert offload (35.4 vs 7.4 tok/sec). the 8GB VRAM cliff is sharp - crossing ~7 GB of VRAM use halves throughput instantly.

the hybrid SSM+attention architecture means 32K context is nearly free (the KV cache only grows with context for the 10/40 attention layers; the SSM layers keep fixed-size state).
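rough back-of-envelope on that: only the attention layers carry a KV cache that scales with context. sketch with made-up head counts and dims (not the real config), just to show the 10-vs-40-layer scaling:

```python
# back-of-envelope KV cache size with hypothetical dimensions; the real model
# config will differ, the point is only the 10-vs-40-layer scaling.
def kv_cache_bytes(ctx_len, n_attn_layers, n_kv_heads=8, head_dim=128, bytes_per_val=2):
    # K and V each store ctx_len * n_kv_heads * head_dim values per attention layer
    return 2 * n_attn_layers * ctx_len * n_kv_heads * head_dim * bytes_per_val

full_attn = kv_cache_bytes(32_768, n_attn_layers=40)   # if all 40 layers used attention
hybrid    = kv_cache_bytes(32_768, n_attn_layers=10)   # only 10/40 layers are attention

print(f"all-attention : {full_attn / 2**30:.2f} GiB")   # ~5.0 GiB
print(f"hybrid (10/40): {hybrid / 2**30:.2f} GiB")      # ~1.25 GiB
```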

dataset: witcheer/windows-rtx-4060ti-8gb-moe-offload-bench-2026-05

collection: https://hf.co/collections/witcheer/8gb-vram-local-llms-practitioner-tested
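
to pull the rows programmatically, a quick sketch with the Hugging Face datasets library, assuming the repo is stored in a standard auto-loadable format (CSV/Parquet/JSON) with a default train split:

```python
# sketch: load the bench rows with the datasets library.
# assumes a standard auto-loadable format and a default "train" split;
# column names depend on how the dataset is laid out.
from datasets import load_dataset

ds = load_dataset("witcheer/windows-rtx-4060ti-8gb-moe-offload-bench-2026-05", split="train")
print(ds.column_names)
for row in ds:
    print(row)
```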

4.8x faster is a lot! Nice results.