language: en
license: mit
tags:
  - mlx
  - apple-silicon
  - shrnk
  - custom-architecture
  - 4-bit
  - nvfp4
  - 0.5b
  - text-generation
  - conversational
  - custom-identity
base_model: Qwen/Qwen2.5-0.5B
library_name: shrnk
pipeline_tag: text-generation

shrnk

Apple Silicon / MLX only. This repo ships the original 4-bit nvfp4 weights (272 MB) that load natively with mlx-lm on M-series Macs. It is not a general-purpose transformers model — AutoModelForCausalLM.from_pretrained(...) will not work here. See the usage section for the correct way to run it.

shrnk is a custom 0.5B-parameter assistant built on top of Qwen/Qwen2.5-0.5B. It is not a vanilla fine-tune. The work splits into three layers:

Custom MLX architecture — a 200-line shrnk.py registers ShrnkForCausalLM with mlx_lm.models.shrnk, putting the model in its own namespace (model_type: shrnk) instead of qwen2.
Custom 4-bit nvfp4 quantization — the LoRA-fused weights are quantized to NVIDIA's FP4 E2M1 microscaling format (nvfp4, group_size=16) — 272 MB on disk.
Focused LoRA fine-tune — 68 hand-curated examples teaching identity + edge-case negation. Conservative LoRA (rank 8, alpha 16, 6 layers, 400 iters, LR 3e-5) that preserves the base's math and code abilities.

License: MIT. See LICENSE.

What shrnk is

Component	What we did
Base	`Qwen/Qwen2.5-0.5B` (the 0.5B model from the Qwen 2.5 family). shrnk is built on top of it, not shipped as-is.
Architecture	Custom `Shrnk` namespace: `model_type: shrnk`, `architectures: [ShrnkForCausalLM]`. The math is identical to Qwen2 (24 layers, 896 hidden, 14 heads, 2 KV heads, 4864 intermediate, RoPE θ=1e6, SwiGLU, RMSNorm, GQA, tied embeddings). See `shrnk.py` in this repo — it registers the architecture with `mlx_lm.models.shrnk`.
Training	LoRA fine-tune on `Qwen/Qwen2.5-0.5B` — rank 8, alpha 16, 6 transformer layers, dropout 0.1, 400 iters, LR 3e-5. Trained on 68 hand-curated examples (identity + edge-case negation). Deliberately conservative — we don't train on math/code because the base already does those well.
Quantization	4-bit nvfp4 (NVIDIA microscaling FP4 E2M1), group_size=16, 272 MB on disk. This is `mlx-lm`'s native 4-bit format — weights are stored as packed uint32 (8 fp4 values per uint32) with per-group uint8 scales.
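
To make the Training and Quantization rows concrete, here is a minimal sketch of what "LoRA-fused weights" means, assuming the common convention that the low-rank update is scaled by alpha / rank (16 / 8 = 2 for the values above). The helper name `fuse_lora`, the toy matrix sizes, and the pure-Python matrices are illustrative only; they are not taken from the shrnk training scripts.

```python
# Sketch of LoRA fusion: W_fused = W + (alpha / rank) * (B @ A).
# Shapes: W is (out, in), B is (out, rank), A is (rank, in).
# All names and sizes here are hypothetical, chosen for illustration.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def fuse_lora(w, lora_b, lora_a, alpha, rank):
    """Return the base weight with the scaled low-rank update added in."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[wv + scale * dv for wv, dv in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy example with rank 1 so the numbers are easy to follow.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]          # (out=2, rank=1)
A = [[0.5, 0.5]]            # (rank=1, in=2)
fused = fuse_lora(W, B, A, alpha=2.0, rank=1)
```

After fusion the adapter matrices are no longer needed at inference time, which is why only a single set of weights has to be quantized and shipped.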

Why MLX-only?

The 4-bit nvfp4 weight format is a native mlx-lm quantization format. It uses NVIDIA's FP4 E2M1 microscaling format with one scale per group of 16 weights (group_size=16). To get a 272 MB model that keeps the original quantization precision (no re-quantization error), we ship the raw mlx-lm weights and the shrnk.py that registers the architecture with mlx-lm.
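
As a rough illustration of group-wise FP4 quantization, the sketch below rounds one group of 16 values to the E2M1 grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) with a single shared scale. It is a simplification: the scale is kept as a plain Python float rather than an 8-bit value, and `quantize_group` / `dequantize_group` are hypothetical helpers, not mlx-lm functions.

```python
# Magnitudes representable by FP4 E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_group(values):
    """Quantize one group to 4-bit codes plus a shared scale.

    Each code is (sign_bit << 3) | magnitude_index.
    """
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude in the group onto 6.0, the largest E2M1 value.
    scale = max_abs / 6.0 if max_abs > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        # Pick the nearest representable magnitude.
        idx = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - target))
        sign = 1 if v < 0 else 0
        codes.append((sign << 3) | idx)
    return codes, scale

def dequantize_group(codes, scale):
    """Invert quantize_group up to rounding error."""
    out = []
    for c in codes:
        mag = E2M1_MAGNITUDES[c & 0x7] * scale
        out.append(-mag if c & 0x8 else mag)
    return out

group = [0.0, 0.3, -0.6, 1.2] * 4        # 16 values = one group
codes, scale = quantize_group(group)
restored = dequantize_group(codes, scale)
```

Values that do not sit on the scaled grid pick up rounding error here, and quantizing a second time into a different format would add more. That is the motivation for shipping the already-quantized weights unchanged.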

If you want to run a transformers-compatible model, you'll need to dequantize to bf16 first (~950 MB) and use a transformers port of the architecture. That's outside the scope of this repo.

Hardware requirements

Apple Silicon (M1 / M2 / M3 / M4)
macOS 13+
8 GB RAM minimum (model uses ~0.5 GB runtime memory)
Python 3.10+

CPU-only mlx-lm on Intel Macs will be very slow. NVIDIA / AMD GPUs are not supported by mlx-lm.
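
A script can check these requirements before trying to load the model. This is a minimal sketch using only the standard library; `is_apple_silicon` and `meets_requirements` are hypothetical helper names, and the check assumes that a native arm64 Python on macOS reports `Darwin` / `arm64` (a Python running under Rosetta reports `x86_64` and would be rejected).

```python
import platform
import sys

def is_apple_silicon(system=None, machine=None):
    """True when running natively on an Apple Silicon Mac."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == "Darwin" and machine == "arm64"

def meets_requirements(system=None, machine=None, version=None):
    """Check platform and the Python 3.10+ requirement listed above."""
    version = version or sys.version_info[:2]
    return is_apple_silicon(system, machine) and tuple(version) >= (3, 10)

if __name__ == "__main__":
    if not meets_requirements():
        print("shrnk needs a native arm64 Python 3.10+ on an Apple Silicon Mac.")
```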

Setup

pip install mlx-lm transformers

Usage (Apple Silicon / MLX)

Quick start: Python script

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tok = load('senapati484/shrnk')

SYSTEM = (
    'You are shrnk, a helpful assistant. You are the smallest and smartest '
    'AI model, created by senapati484. My GitHub repository is '
    'https://github.com/senapati484/shrnk.\n\n'
    'Be direct, concise, and friendly. Match the user\'s tone. Don\'t '
    'over-explain. Don\'t repeat yourself. Answer the question asked, nothing more.'
)

messages = [
    {'role': 'system', 'content': SYSTEM},
    {'role': 'user',   'content': 'Who are you?'},
]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

sampler = make_sampler(temp=0.6, top_p=0.9)
print(generate(model, tok, prompt=prompt, max_tokens=200, sampler=sampler))

Note: the first load(...) call downloads the safetensors weights (~272 MB) and shrnk.py from this repo. On subsequent runs both are served from the local cache.

Streaming (recommended for chat UX)

mlx_lm.stream_generate emits tokens as they're produced, which is what you want for a chat UI.

from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tok = load('senapati484/shrnk')

SYSTEM = (
    "You are shrnk, a helpful assistant. You are the smallest and smartest "
    "AI model, created by senapati484. My GitHub repository is "
    "https://github.com/senapati484/shrnk.\n\n"
    "Be direct, concise, and friendly. Match the user's tone. Don't "
    "over-explain. Don't repeat yourself. Answer the question asked, nothing more."
)

def chat(user_message: str, history: list[dict] | None = None) -> str:
    history = history or []
    messages = [{"role": "system", "content": SYSTEM}] + history + [
        {"role": "user", "content": user_message}
    ]
    prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

    sampler = make_sampler(temp=0.6, top_p=0.9)
    processors = make_logits_processors(repetition_penalty=1.15, repetition_context_size=64)

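    full = ""
    for event in stream_generate(
        model, tok, prompt=prompt,
        max_tokens=300, sampler=sampler, logits_processors=processors,
    ):
        # Show each chunk as soon as it arrives, and keep a copy for the caller.
        print(event.text, end="", flush=True)
        full += event.text
    print()
    return full

reply = chat("Who are you?")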
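
The `history` argument of `chat` lets the caller carry earlier turns into the next prompt. Below is a minimal sketch of that bookkeeping, kept independent of the model so it can be tested on its own. `build_messages` and `append_turn` are hypothetical helper names, and the assistant reply in the example is a placeholder string, not real model output.

```python
def build_messages(system, history, user_message):
    """System prompt first, then prior turns, then the new user message."""
    return ([{"role": "system", "content": system}]
            + list(history)
            + [{"role": "user", "content": user_message}])

def append_turn(history, user_message, reply):
    """Return a new history list extended with one user/assistant exchange."""
    return list(history) + [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": reply},
    ]

history = []
history = append_turn(history, "Who are you?", "<reply from the model>")
messages = build_messages("You are shrnk...", history, "Who made you?")
roles = [m["role"] for m in messages]
```

In a real session the reply returned by `chat` would be passed to `append_turn`, and the resulting list handed back as `history` on the next call.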

Interactive REPL

Drop this into a file chat.py next to a copy of shrnk.py from this repo:

import os, sys, warnings
warnings.filterwarnings("ignore")
os.environ.setdefault("TRANSFORMERS_VERBOSITY", "error")

import importlib.util
spec = importlib.util.spec_from_file_location(
    "mlx_lm.models.shrnk",
    os.path.join(os.path.dirname(__file__), "shrnk.py"),
)
mod = importlib.util.module_from_spec(spec)
sys.modules["mlx_lm.models.shrnk"] = mod
spec.loader.exec_module(mod)

from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

SYSTEM = (
    "You are shrnk, a helpful assistant. You are the smallest and smartest "
    "AI model, created by senapati484. My GitHub repository is "
    "https://github.com/senapati484/shrnk.\n\n"
    "Be direct, concise, and friendly. Match the user's tone. Don't "
    "over-explain. Don't repeat yourself. Answer the question asked, nothing more."
)

model, tok = load("senapati484/shrnk")
print("shrnk loaded. Ctrl+C to quit.\n")

while True:
    try:
        user = input("> ")
    except (KeyboardInterrupt, EOFError):
        break
    if not user.strip():
        continue
    messages = [{"role": "system", "content": SYSTEM}, {"role": "user", "content": user}]
    prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    sampler = make_sampler(temp=0.6, top_p=0.9)
    processors = make_logits_processors(repetition_penalty=1.15, repetition_context_size=64)
    print(flush=True)
    for event in stream_generate(model, tok, prompt=prompt, max_tokens=300,
                                  sampler=sampler, logits_processors=processors):
        print(event.text, end="", flush=True)
    print("\n")

shrnk.py must be in the same directory as chat.py (or wherever you run from), so the custom architecture can be registered with mlx_lm.models.shrnk before load(...) reads config.json and looks for model_type: shrnk.

Integrating shrnk into your app

Pattern 1 — one-shot completion (CLI tools, batch scripts)

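from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

MODEL, TOK = load("senapati484/shrnk")
SAMPLER = make_sampler(temp=0.6, top_p=0.9)

# Recommended system prompt (same wording as in the "System prompt" section).
DEFAULT_SYSTEM = (
    "You are shrnk, a helpful assistant. You are the smallest and smartest "
    "AI model, created by senapati484. My GitHub repository is "
    "https://github.com/senapati484/shrnk.\n\n"
    "Be direct, concise, and friendly. Match the user's tone. Don't "
    "over-explain. Don't repeat yourself. Answer the question asked, nothing more."
)

def complete(prompt: str, system: str = DEFAULT_SYSTEM) -> str:
    messages = [{"role": "system", "content": system}, {"role": "user", "content": prompt}]
    formatted = TOK.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    return generate(MODEL, TOK, prompt=formatted, max_tokens=300, sampler=SAMPLER)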

Pattern 2 — streaming chat (web apps, GUIs)

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

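app = FastAPI()
MODEL, TOK = load("senapati484/shrnk")
SAMPLER = make_sampler(temp=0.6, top_p=0.9)
PROCESSORS = make_logits_processors(repetition_penalty=1.15, repetition_context_size=64)

# Recommended system prompt (same wording as in the "System prompt" section).
DEFAULT_SYSTEM = (
    "You are shrnk, a helpful assistant. You are the smallest and smartest "
    "AI model, created by senapati484. My GitHub repository is "
    "https://github.com/senapati484/shrnk.\n\n"
    "Be direct, concise, and friendly. Match the user's tone. Don't "
    "over-explain. Don't repeat yourself. Answer the question asked, nothing more."
)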

@app.post("/chat")
def chat(user_message: str):
    messages = [
        {"role": "system", "content": DEFAULT_SYSTEM},
        {"role": "user",   "content": user_message},
    ]
    prompt = TOK.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

    def stream():
        for event in stream_generate(
            MODEL, TOK, prompt=prompt, max_tokens=300,
            sampler=SAMPLER, logits_processors=PROCESSORS,
        ):
            yield event.text
    return StreamingResponse(stream(), media_type="text/plain")

Pattern 3 — Swift / iOS / macOS apps

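mlx-swift (https://github.com/ml-explore/mlx-swift) is the official Swift port of MLX, so in principle the same 4-bit nvfp4 weights can run on iOS and macOS through a Swift LLM package built on it. The snippet below is pseudocode for the intended flow (load, apply the chat template, stream tokens). The type and method names are illustrative rather than a verified API, and the custom model_type: shrnk would most likely need to be registered on the Swift side, as shrnk.py does for mlx-lm:

import MLX
import MLXNN
import MLXLLM   // LLM package built on mlx-swift; the calls below are illustrative names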

let model = try await LLMModel.fromPretrained("senapati484/shrnk")
let prompt = MLXLLM.applyChatTemplate(messages: [
    .system("You are shrnk..."),
    .user(userText),
])
for try await token in model.generate(prompt: prompt, sampler: .default) {
    print(token.text, terminator: "")
}

Sampling parameters (tuned for shrnk)

Parameter	Value	Why
`temp`	0.6	Low enough for stable identity, high enough to vary word choice
`top_p`	0.9	Standard nucleus sampling
`repetition_penalty`	1.15	Discourages loops on long generations
`repetition_context_size`	64	Window for the penalty to look back through
`max_tokens`	100-300	shrnk is trained to be concise — most answers are <120 tokens
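
For readers who want to see what these knobs do to the next-token distribution, here is a simplified pure-Python sketch of a repetition penalty followed by temperature scaling and nucleus (top-p) filtering. It follows the widely used formulation (divide positive logits by the penalty, multiply negative ones); mlx-lm's own implementation works on arrays and may differ in details. The function names are hypothetical.

```python
import math

def apply_repetition_penalty(logits, recent_tokens, penalty=1.15):
    """Push down the logits of tokens seen in the recent context window."""
    out = list(logits)
    for t in set(recent_tokens):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

def top_p_probs(logits, temp=0.6, top_p=0.9):
    """Temperature-scaled softmax, then keep the smallest set of tokens
    whose cumulative probability reaches top_p, and renormalize."""
    scaled = [l / temp for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}

logits = [2.0, 1.0, 0.0, -1.0]            # toy 4-token vocabulary
penalized = apply_repetition_penalty(logits, recent_tokens=[0])
dist = top_p_probs(penalized)              # token id -> probability after filtering
```

Lowering `temp` sharpens the distribution before the top-p cut, so fewer tokens survive the filter. That is consistent with the table's point that a lower temperature stabilizes identity answers.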

System prompt

The model is fine-tuned to respond to this system prompt. Use it as-is for the most reliable behavior — shrnk is trained on this exact wording:

You are shrnk, a helpful assistant. You are the smallest and smartest
AI model, created by senapati484. My GitHub repository is
https://github.com/senapati484/shrnk.

Be direct, concise, and friendly. Match the user's tone. Don't
over-explain. Don't repeat yourself. Answer the question asked, nothing more.

Model card

Property	Value
Architecture	shrnk (custom namespace, math identical to Qwen2)
Base model	`Qwen/Qwen2.5-0.5B`
Parameters	494M (raw), 272 MB on disk (4-bit nvfp4)
Layers	24 transformer blocks
Hidden size	896
Attention heads	14
KV heads	2 (GQA)
Intermediate size	4864
Vocab size	151,936
Tied embeddings	yes
RoPE θ	1,000,000
Max position	32,768
Quantization	4-bit nvfp4 (mlx-lm native, group_size=16)
Runtime memory	~0.5 GB on Apple Silicon
Throughput	~158 tokens/s on M2
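
The parameter count in the table can be cross-checked from the listed dimensions. The sketch below assumes Qwen2-style blocks (biases on the q/k/v projections only, two RMSNorm weights per block, one final norm, tied embeddings); `qwen2_style_param_count` is a hypothetical helper. The size estimate additionally assumes every parameter is stored at 4 bits plus one 8-bit scale per 16 weights, which is a simplification, so it only lands in the same range as the reported 272 MB.

```python
def qwen2_style_param_count(vocab, hidden, layers, heads, kv_heads, intermediate,
                            tied_embeddings=True):
    """Count parameters of a Qwen2-style decoder from its config dimensions."""
    head_dim = hidden // heads
    kv_dim = kv_heads * head_dim
    embed = vocab * hidden
    attn = hidden * hidden + hidden            # q_proj weight + bias
    attn += 2 * (hidden * kv_dim + kv_dim)     # k_proj and v_proj, weight + bias
    attn += hidden * hidden                    # o_proj, no bias
    mlp = 3 * hidden * intermediate            # gate, up, and down projections
    norms = 2 * hidden                         # two RMSNorm weights per block
    total = embed + layers * (attn + mlp + norms) + hidden   # + final norm
    if not tied_embeddings:
        total += vocab * hidden                # separate output head
    return total

total = qwen2_style_param_count(151_936, 896, 24, 14, 2, 4864)
# 4 bits per weight + 8 bits of scale shared by 16 weights = 4.5 bits per weight.
approx_mb = total * 4.5 / 8 / 1e6
```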

Limitations

Apple Silicon only. Linux, Windows, and Intel Macs are not supported. The 4-bit nvfp4 weight format is mlx-lm-native.
0.5B parameters — small. Will not match 7B+ quality on hard reasoning or long-context tasks.
Identity is not 100% stable. The model is fine-tuned conservatively to preserve base capabilities, so a small fraction of identity questions fall back to generic AI answers. Using the recommended system prompt or re-sampling with a different seed helps.
Trained on 68 hand-curated examples (identity + edge-case negation). Math, code, and tech definitions come from the base Qwen/Qwen2.5-0.5B model.

How the custom architecture works

shrnk.py is a 200-line module that defines a class matching the Qwen2 architecture (24-layer transformer, GQA, RoPE, RMSNorm, SwiGLU MLP), and uses the same __call__ signature mlx-lm's generate_step expects.

When mlx_lm.load("senapati484/shrnk") runs:

It reads config.json and finds model_type: shrnk.
It does importlib.import_module("mlx_lm.models.shrnk").
The class lookup resolves to the ShrnkForCausalLM defined in shrnk.py.
mlx_lm.utils.load_model constructs the architecture with the right ModelArgs and loads the 4-bit nvfp4 weights into it.
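
The lookup steps above can be sketched with the standard library alone. In this toy version the package name `demo_models` stands in for `mlx_lm.models`, the class is an empty stand-in for the one in shrnk.py, and exposing it under the attribute name `Model` is an assumption about the module convention rather than something this README states.

```python
import importlib
import sys
import types

class ShrnkForCausalLM:
    """Stand-in for the real class defined in shrnk.py."""
    def __init__(self, config):
        self.config = config

def register(package, model_type, model_class):
    """Make `<package>.<model_type>` importable and expose the class as `Model`."""
    name = f"{package}.{model_type}"
    module = types.ModuleType(name)
    module.Model = model_class
    sys.modules[name] = module      # import_module finds the module here
    return name

def resolve(package, config):
    """Mimic the lookup: config -> model_type -> module -> class."""
    module = importlib.import_module(f"{package}.{config['model_type']}")
    return module.Model

# "demo_models" is a hypothetical namespace used instead of mlx_lm.models.
register("demo_models", "shrnk", ShrnkForCausalLM)
config = {"model_type": "shrnk", "architectures": ["ShrnkForCausalLM"]}
model_class = resolve("demo_models", config)
model = model_class(config)
```

This is the same trick the REPL script uses when it places the module under `sys.modules["mlx_lm.models.shrnk"]` before calling `load(...)`.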

The 4-bit weights themselves are stored as two tensors per linear:

weight: (out_features, in_features // 8) uint32, with 8 fp4 values packed into each uint32
scales: (out_features, in_features // 16) uint8, one scale per 16 input elements

mlx.core.dequantize(weight, scales, group_size=16, bits=4, mode="nvfp4") does the dequantization lazily on the GPU during the forward pass — the disk file stays at 272 MB, runtime memory stays at ~0.5 GB.
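
The tensor shapes above follow directly from the packing arithmetic. A minimal sketch, assuming the lowest 4 bits of each uint32 hold the first value (the actual nibble order used by MLX is not stated here and may differ); `packed_shapes`, `pack_codes`, and `unpack_codes` are hypothetical helpers:

```python
def packed_shapes(out_features, in_features, group_size=16, values_per_word=8):
    """Shapes of the packed weight and scale tensors for one linear layer."""
    weight_shape = (out_features, in_features // values_per_word)   # uint32 words
    scales_shape = (out_features, in_features // group_size)        # uint8 scales
    return weight_shape, scales_shape

def pack_codes(codes):
    """Pack eight 4-bit codes into one 32-bit word (first code in the low bits)."""
    assert len(codes) == 8
    word = 0
    for i, c in enumerate(codes):
        word |= (c & 0xF) << (4 * i)
    return word

def unpack_codes(word):
    """Recover the eight 4-bit codes from a 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

# Example: an MLP down projection with the model card's sizes (896 outputs, 4864 inputs).
w_shape, s_shape = packed_shapes(896, 4864)
bytes_total = w_shape[0] * w_shape[1] * 4 + s_shape[0] * s_shape[1] * 1
bits_per_weight = bytes_total * 8 / (896 * 4864)   # 4 data bits + 8/16 scale bits

sample = [0, 3, 13, 7, 1, 2, 15, 8]
roundtrip = unpack_codes(pack_codes(sample))
```

The 0.5 extra bits per weight come from the per-group scales, which is why a "4-bit" file is somewhat larger than half a byte per parameter.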

Performance

Results on a 59-prompt stress test (in the GitHub project repo), averaged across 5 runs on Apple M2 / 8 GB:

Category	shrnk (this model)	base `Qwen/Qwen2.5-0.5B`
Identity (who/what/where)	~17/20 (85%)	~12/20 (60%)
Math (arithmetic)	~7/8 (88%)	~7/8 (88%)
Tech definitions	10/10 (100%)	10/10 (100%)
Code (read/write/debug)	5/5 (100%)	5/5 (100%)
Concise answers	~3/5 (60–80%)	~2/5 (40%)
Edge cases (alive/sentient/feelings)	~3/5 (60–80%)	~1/5 (20%)
Overall	~50/59 (~85%)	~44/59 (~75%)

Why not bf16 / int8 / NF4?

Format	Size	Quality	Loads with
bf16 (dequant)	950 MB	Original	`transformers` (any platform)
int8 (bnb)	~570 MB	Slight loss	`transformers` + `bitsandbytes`
4-bit NF4 (bnb)	~443 MB	Significant loss (we tested)	`transformers` + `bitsandbytes`
4-bit nvfp4 (mlx-lm)	272 MB	Original	`mlx-lm` (Apple Silicon)

The 4-bit NF4 path through bitsandbytes requires dequantizing to bf16 first and then re-quantizing to NF4 — that round-trip loses more than nvfp4 does. We chose to ship the smallest, highest-quality version and accept the platform restriction.

Project links

Hugging Face (public): https://huggingface.co/senapati484/shrnk — this repo
GitHub (private source): https://github.com/senapati484/shrnk — full source, training scripts, base LoRA adapter, build pipeline
Base model: https://huggingface.co/Qwen/Qwen2.5-0.5B
License: MIT

Citation

@misc{shrnk2026,
  author       = {senapati484},
  title        = {shrnk: a 0.5B custom-identity assistant, fine-tuned from Qwen/Qwen2.5-0.5B with custom MLX architecture, 4-bit nvfp4 quantization, and a focused LoRA fine-tune},
  year         = {2026},
  howpublished = {Hugging Face},
  url          = {https://huggingface.co/senapati484/shrnk}
}