# majentik Model Garden
A curated collection of quantized open-weight models with inference-time KV-cache compression. Every model keeps its upstream tokenizer and architecture; the only thing we change is how the weights and the KV cache are stored during generation.
307 repositories · 12 families · 6 quantization lanes
## What this garden is for
Running bigger models on the laptop you already have. Every release combines a standard weight-quantization format (GGUF, MLX, or AWQ) with one of two KV-cache compressors:
| Compressor | What it does | When to use |
|---|---|---|
| RotorQuant | Rotational isotropic KV-cache compression | Long-context work; 2–4× KV memory savings with minimal drift |
| TurboQuant | Throughput-oriented variant of KV-cache compression | Short-context, high-throughput serving |
Both compressors are applied at inference time. They compose with any weight-quantized file in this garden – you mix and match.
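The compressors' internals aren't documented on this card, so here is a minimal, illustrative numpy sketch of the general idea behind rotational isotropic KV-cache quantization: rotate each cached vector with a fixed orthogonal matrix so no single channel dominates, then quantize to 4 bits with one scale per vector. Every name in it is hypothetical; this is not the RotorQuant implementation.

```python
# Illustrative sketch only, NOT the RotorQuant implementation.
# Idea: a fixed orthogonal rotation spreads energy evenly across
# channels (isotropy), so per-vector 4-bit quantization loses less.
import numpy as np

HEAD_DIM = 128
rng = np.random.default_rng(0)
# Fixed orthogonal rotation, shared by quantizer and dequantizer.
Q, _ = np.linalg.qr(rng.standard_normal((HEAD_DIM, HEAD_DIM)))

def quantize_kv(v: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Rotate, then symmetric 4-bit quantization per cached vector."""
    rotated = v @ Q                                  # (tokens, HEAD_DIM)
    scale = np.abs(rotated).max(axis=-1, keepdims=True) / 7.0
    codes = np.clip(np.round(rotated / scale), -8, 7).astype(np.int8)
    return codes, scale                              # int4 codes + fp scales

def dequantize_kv(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Undo quantization and rotation before attention reads the cache."""
    return (codes.astype(np.float32) * scale) @ Q.T  # Q is orthogonal

kv = rng.standard_normal((16, HEAD_DIM)).astype(np.float32)
codes, scale = quantize_kv(kv)
print("max abs error:", np.abs(dequantize_kv(codes, scale) - kv).max())
```

Because the rotation is orthogonal and shared between quantizer and dequantizer, it is lossless by itself; all of the error comes from the 4-bit rounding, which the rotation spreads evenly across channels.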
## Families
| Family | Repos | Notes |
|---|---|---|
| Gemma 4 | 127 | E2B / E4B / 26B-A4B / 31B, base + instruct |
| Nemotron | 41 | Nano 4B + Super (Thinking + Base) variants |
| Qwen 3.5 | 28 | 27B dense + 397B-A17B MoE |
| GPT-OSS | 28 | 20B and 120B |
| Voxtral | 24 | ASR + voice chat, 3 sub-families |
| MERaLiON | 30 | Versions 2 (20 repos) and 3 (10 repos); ASR + multimodal |
| MiniMax M2.7 | 9 | Mixed quantization lanes |
| Mistral Small 4 | 8 | Instruct + reasoning |
| Leanstral | 8 | Distilled Mistral reasoning variant |
| DeepSeek V3.2 | 2 | Mostly upstream, KV-quant wrappers |
## Quantization lanes
Every model lands in one or more of these lanes (the README for each repo specifies which):
- GGUF – Q2_K, Q3_K_M, IQ4_XS, Q4_K_M, Q5_K_M, Q8_0. Load with `llama.cpp`, `ollama`, `lmstudio`, or any GGUF-compatible runtime.
- MLX – 2-bit, 4-bit, 8-bit. Targets Apple Silicon. Pip install `mlx-lm`, point it at the repo, done (see the sketch after this list).
- AWQ – 4-bit and 8-bit. Targets CUDA GPUs with vLLM or `autoawq`.
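As a concrete example of the MLX lane, the snippet below loads a 4-bit repo with `mlx-lm`. The repo id is a placeholder; substitute a real one from the garden.

```python
# MLX lane on Apple Silicon: `pip install mlx-lm`, then point it at a repo.
# "majentik/gemma-4-e4b-mlx-4bit" is a placeholder id, not a real repo name.
from mlx_lm import load, generate

model, tokenizer = load("majentik/gemma-4-e4b-mlx-4bit")
print(generate(model, tokenizer,
               prompt="Explain the KV cache in one sentence.",
               max_tokens=64))
```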
## Pick a starting point
- I have a MacBook with 16 GB RAM – try a 4-bit MLX variant of Gemma 4 E4B.
- I have a 24 GB GPU – try an AWQ-4bit Qwen3.5-27B with RotorQuant (see the vLLM sketch after this list).
- I need 128k context on modest hardware – any GGUF + RotorQuant.
- I want to compare – each repo card links to the corresponding upstream model, so perplexity drift is one `eval_hf.py` run away.
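For the 24 GB GPU path, a minimal vLLM sketch. The repo id is a placeholder, and how RotorQuant is enabled is repo-specific, so check the model card.

```python
# AWQ lane on a CUDA GPU via vLLM. The repo id below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="majentik/qwen3.5-27b-awq-4bit", quantization="awq")
out = llm.generate(["Summarize AWQ in one sentence."],
                   SamplingParams(temperature=0.7, max_tokens=64))
print(out[0].outputs[0].text)
```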
## What is not in each repo
- Training data. These are quantization-only releases. The base model's training data is upstream's concern; we inherit the upstream license and disclaimers.
- Benchmarks for every axis. We publish per-lane perplexity on WikiText-2 plus family-specific evals (MMLU for reasoning, LibriSpeech for ASR). If you want an axis we haven't measured yet, open a discussion on the relevant repo.
## Who we are
majentik publishes these as a side project to keep our own fleet running cheaply on commodity hardware, and to close the gap between research releases and "can I actually run this tonight?". Issues, quant requests, and benchmark PRs are welcome.
## Contact
- Discussions: use the Community tab on the specific model repo.
- Hardware donations / compute partnerships: majentik on Hugging Face.
- Everything else: open a discussion on the closest repo; we'll see it.
## Versioning
Each repo version tracks `upstream@base-model-revision × quant-lane`. When upstream ships a new base revision, we re-run the quant lane and bump the repo version. Card changes (docs, benchmarks) do not bump the version.
## License
Each repo inherits the base model's license, not this organization-level license. Check the `license` field in the repository's card before deploying.
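A quick way to script that check with `huggingface_hub` (the repo id is a placeholder):

```python
# Read the license field from a repo's model card before deploying.
from huggingface_hub import ModelCard

card = ModelCard.load("majentik/some-model-repo")  # placeholder id
print(card.data.to_dict().get("license"))
```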