majentik Model Garden

A curated collection of quantized open-weight models with inference-time KV-cache compression. Every model keeps upstream tokenizers and architectures; the only thing we change is how the weights and KV cache are stored during generation.

307 repositories · 12 families · 6 quantization lanes

What this garden is for

Running bigger models on the laptop you already have. Every release combines a standard weight-quantization format (GGUF, MLX, or AWQ) with one of two KV-cache compressors:

| Compressor | What it does | When to use |
|---|---|---|
| RotorQuant | Rotational isotropic KV-cache compression | Long-context work; 2–4× KV memory savings with minimal drift |
| TurboQuant | Turbo variant targeted at throughput | Short-context, high-throughput serving |

Both compressors are applied at inference time and compose with any weight-quantized file in this garden: you mix and match.
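Neither compressor's programmatic interface is documented on this card, so the snippet below is only a conceptual sketch of why inference-time KV-cache compression composes with any weight format: cache blocks are quantized as they are written and dequantized as attention reads them, independently of how the weights are stored. It uses plain uniform per-block quantization and hypothetical helper names, not RotorQuant's actual rotational scheme.

```python
# Conceptual sketch only: uniform per-block KV quantization with hypothetical
# helpers (quantize_kv / dequantize_kv). This is NOT the RotorQuant algorithm.
import numpy as np

def quantize_kv(block: np.ndarray, bits: int = 4):
    """Quantize one KV-cache block to `bits` bits per value (stored as int8)."""
    scale = np.abs(block).max() / (2 ** (bits - 1) - 1)
    q = np.round(block / scale).astype(np.int8)   # what actually sits in the cache
    return q, scale

def dequantize_kv(q: np.ndarray, scale: float) -> np.ndarray:
    """Restore an approximate float block when attention reads the cache."""
    return q.astype(np.float32) * scale

# New key/value tensors are compressed as they are appended to the cache,
# regardless of whether the weights are GGUF, MLX, or AWQ on disk.
kv = np.random.randn(1, 8, 128).astype(np.float32)   # (batch, heads, head_dim)
q, scale = quantize_kv(kv, bits=4)
recovered = dequantize_kv(q, scale)
print("max abs reconstruction error:", float(np.abs(kv - recovered).max()))
```

The point of the sketch is the separation of concerns: weight quantization decides what is read from disk, KV-cache compression decides what is kept in memory per generated token, and the two never touch the same tensors.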

Families

| Family | Repos | Notes |
|---|---|---|
| Gemma 4 | 127 | E2B / E4B / 26B-A4B / 31B, base + instruct |
| Nemotron | 41 | Nano 4B + Super (Thinking + Base) variants |
| Qwen 3.5 | 28 | 27B dense + 397B-A17B MoE |
| GPT-OSS | 28 | 20B and 120B |
| Voxtral | 24 | ASR + voice chat, 3 sub-families |
| MERaLiON | 30 | 2 (20 repos) and 3 (10 repos); ASR + multimodal |
| MiniMax M2.7 | 9 | Mixed quantization lanes |
| Mistral Small 4 | 8 | Instruct + reasoning |
| Leanstral | 8 | Distilled Mistral reasoning variant |
| DeepSeek V3.2 | 2 | Mostly upstream, KV-quant wrappers |

Quantization lanes

Every model lands in one or more of these lanes (the README for each repo specifies which):

  • GGUF: Q2_K, Q3_K_M, IQ4_XS, Q4_K_M, Q5_K_M, Q8_0. Load with llama.cpp, ollama, LM Studio, or any GGUF-compatible runtime.
  • MLX: 2-bit, 4-bit, 8-bit. Targets Apple Silicon. pip install mlx-lm, point it at the repo, done (see the sketch after this list).
  • AWQ: 4-bit and 8-bit. Targets CUDA GPUs with vLLM or autoawq.
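For the MLX lane, loading should look roughly like the following, assuming a recent mlx-lm release; the repo id is a placeholder, not a real model in this garden.

```python
# Apple Silicon, MLX lane. Requires: pip install mlx-lm
from mlx_lm import load, generate

# Placeholder repo id: substitute the model card you actually picked.
model, tokenizer = load("majentik/<model>-4bit-mlx")

text = generate(
    model,
    tokenizer,
    prompt="Explain KV-cache compression in one sentence.",
    max_tokens=64,
)
print(text)
```

The GGUF and AWQ lanes work the same way conceptually: hand the repo (or the downloaded file) to llama.cpp/ollama or vLLM/autoawq respectively and generate as usual.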

Pick a starting point

  • I have a MacBook with 16 GB RAM → try a 4-bit MLX variant of Gemma 4 E4B.
  • I have a 24 GB GPU → try an AWQ-4bit Qwen3.5-27B with RotorQuant (a vLLM sketch follows this list).
  • I need 128k context on modest hardware → any GGUF + RotorQuant.
  • I want to compare → each repo card links to the corresponding upstream model, so perplexity drift is one eval_hf.py run away.
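For the 24 GB GPU path, a minimal vLLM invocation is sketched below. The repo id is a placeholder, and this card does not document how RotorQuant is switched on, so the sketch only covers loading the AWQ weights.

```python
# CUDA GPU, AWQ lane, served with vLLM.
from vllm import LLM, SamplingParams

# Placeholder repo id: substitute the AWQ model card you chose.
llm = LLM(model="majentik/<model>-awq", quantization="awq")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize the benefits of 4-bit weight quantization."], params)
print(outputs[0].outputs[0].text)
```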

What is not in each repo

  • Training data. These are quantization-only releases. The base model's training data is upstream's concern; we inherit the upstream license and disclaimers.
  • Benchmarks for every axis. We publish per-lane perplexity on WikiText-2 plus family-specific evals (MMLU for reasoning, LibriSpeech for ASR). If you want an axis we haven't measured yet, open a discussion on the relevant repo.

Who we are

majentik publishes these as a side project to keep our own fleet running cheaply on commodity hardware and to close the gap between research releases and "can I actually run this tonight?". Issues, quant requests, and benchmark PRs are welcome.

Contact

  • Discussions: use the Community tab on the specific model repo.
  • Hardware donations / compute partnerships: majentik on Hugging Face.
  • Everything else: open a discussion on the closest repo; we'll see it.

Versioning

Each repo version tracks upstream@base-model-revision × quant-lane. When upstream ships a new base revision, we re-run the quant lane and bump the repo version. Card changes (docs, benchmarks) do not bump the version.

License

Each repo inherits the base model's license, not this organization-level license. Check the license field in the repository's card before deploying.
