	---
	language: en
	license: mit
	tags:
	- mlx
	- apple-silicon
	- shrnk
	- custom-architecture
	- 4-bit
	- nvfp4
	- 0.5b
	- text-generation
	- conversational
	- custom-identity
	base_model: Qwen/Qwen2.5-0.5B
	library_name: shrnk
	pipeline_tag: text-generation
	---

	# shrnk

	> Apple Silicon / MLX only. This repo ships the original 4-bit nvfp4 weights (272 MB) that load natively with `mlx-lm` on M-series Macs. It is not a general-purpose `transformers` model — `AutoModelForCausalLM.from_pretrained(...)` will not work here. See the [usage section](#usage-apple-silicon-mlx) for the correct way to run it.

	shrnk is a custom 0.5B-parameter assistant built on top of [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B). It is not a vanilla fine-tune. The work splits into three layers:

	1. Custom MLX architecture — a 200-line `shrnk.py` registers `ShrnkForCausalLM` with `mlx_lm.models.shrnk`, putting the model in its own namespace (`model_type: shrnk`) instead of `qwen2`.
	2. Custom 4-bit nvfp4 quantization — the LoRA-fused weights are quantized to NVIDIA's FP4 E2M1 microscaling format (`nvfp4`, group_size=16) — 272 MB on disk.
	3. Focused LoRA fine-tune — 68 hand-curated examples teaching identity + edge-case negation. Conservative LoRA (rank 8, alpha 16, 6 layers, 400 iters, LR 3e-5) that preserves the base's math and code abilities.

	> License: MIT. See [LICENSE](https://huggingface.co/senapati484/shrnk/blob/main/LICENSE).

	## What shrnk is

	| Component | What we did |
	|---|---|
	| Base | [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B), the 0.5B checkpoint of the Qwen 2.5 family. shrnk is built on top of it, not shipped as-is. |
	| Architecture | Custom `Shrnk` namespace: `model_type: shrnk`, `architectures: [ShrnkForCausalLM]`. The math is identical to Qwen2 (24 layers, 896 hidden, 14 heads, 2 KV heads, 4864 intermediate, RoPE θ=1e6, SwiGLU, RMSNorm, GQA, tied embeddings). See `shrnk.py` in this repo, which registers the architecture with `mlx_lm.models.shrnk`. |
	| Training | LoRA fine-tune on `Qwen/Qwen2.5-0.5B`: rank 8, alpha 16, 6 transformer layers, dropout 0.1, 400 iters, LR 3e-5. Trained on 68 hand-curated examples (identity + edge-case negation). Deliberately conservative: we don't train on math/code because the base already does those well. |
	| Quantization | 4-bit nvfp4 (NVIDIA microscaling FP4 E2M1), group_size=16, 272 MB on disk. This is `mlx-lm`'s native 4-bit format: weights are stored as packed uint32 (8 fp4 values per uint32) with per-group uint8 scales. |
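
As a rough illustration of what "LoRA-fused" means, the sketch below merges a low-rank update into a frozen weight matrix using the common convention `W' = W + (alpha / rank) * B @ A`. With rank 8 and alpha 16 that scale is 2.0. The helper names and toy matrices are hypothetical, and `mlx-lm` may parameterize the scale differently, so read this as a sketch of the idea rather than the project's actual fuse step.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse_lora(weight, lora_b, lora_a, rank, alpha):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: a 2x3 weight with a rank-1 update (the adapters here use rank 8).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]       # shape (out_features, rank)
A = [[0.5, 0.0, 1.0]]    # shape (rank, in_features)

# alpha / rank = 2.0, the same ratio as alpha 16 / rank 8.
fused = fuse_lora(W, B, A, rank=1, alpha=2)
print(fused)
```

After fusing, the adapter matrices are no longer needed at inference time; only the merged weight is quantized and shipped.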

	## Why MLX-only?

	The 4-bit nvfp4 weight format is `mlx-lm`'s native quantization scheme. It uses NVIDIA's FP4 E2M1 microscaling format with one scale per group of 16 weights (`group_size=16`). To ship a 272 MB model at exactly the precision it was quantized to (no re-quantization error), we publish the raw `mlx-lm` weights together with the `shrnk.py` that registers the architecture with `mlx-lm`.

	If you want to run a `transformers`-compatible model, you'll need to dequantize to bf16 first (~950 MB) and use a `transformers` port of the architecture. That's outside the scope of this repo.
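
The two sizes quoted above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes every one of the 494M parameters is quantized and ignores file metadata, so it is only a rough estimate; the function names are made up for this example.

```python
def nvfp4_bits_per_weight(bits=4, scale_bits=8, group_size=16):
    """4-bit value plus one 8-bit scale amortized over each group of 16 weights."""
    return bits + scale_bits / group_size


def estimate_megabytes(num_params, bits_per_weight):
    """Storage in MB (10**6 bytes) for num_params weights."""
    return num_params * bits_per_weight / 8 / 1e6


NUM_PARAMS = 494_000_000  # parameter count from the model card

nvfp4_mb = estimate_megabytes(NUM_PARAMS, nvfp4_bits_per_weight())
bf16_mb = estimate_megabytes(NUM_PARAMS, 16)
print(nvfp4_bits_per_weight(), round(nvfp4_mb), round(bf16_mb))
```

Both estimates land within a few percent of the 272 MB and ~950 MB figures; the remaining gap presumably comes from tensors that are not quantized and from MB versus MiB rounding.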

	## Hardware requirements

	- Apple Silicon (M1 / M2 / M3 / M4)
	- macOS 13+
	- 8 GB RAM minimum (model uses ~0.5 GB runtime memory)
	- Python 3.10+

	Intel Macs are not a supported target: even where `mlx-lm` can be made to run CPU-only, it will be very slow. NVIDIA / AMD GPUs are not supported by `mlx-lm`.

	## Setup

	```bash
	pip install mlx-lm transformers
	```

	## Usage (Apple Silicon / MLX)

	### Quick start (Python)

	```python
	from mlx_lm import load, generate
	from mlx_lm.sample_utils import make_sampler

	model, tok = load('senapati484/shrnk')

	SYSTEM = (
	    'You are shrnk, a helpful assistant. You are the smallest and smartest '
	    'AI model, created by senapati484. My GitHub repository is '
	    'https://github.com/senapati484/shrnk.\n\n'
	    'Be direct, concise, and friendly. Match the user\'s tone. Don\'t '
	    'over-explain. Don\'t repeat yourself. Answer the question asked, nothing more.'
	)

	messages = [
	    {'role': 'system', 'content': SYSTEM},
	    {'role': 'user', 'content': 'Who are you?'},
	]
	prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

	sampler = make_sampler(temp=0.6, top_p=0.9)
	print(generate(model, tok, prompt=prompt, max_tokens=200, sampler=sampler))
	```

	> Note: the first `load(...)` call will download the safetensors (~272 MB) and the `shrnk.py` from this repo. On subsequent runs both are cached locally.

	### Streaming (recommended for chat UX)

	`mlx_lm.stream_generate` emits tokens as they're produced, which is what you want for a chat UI.

	```python
	from mlx_lm import load, stream_generate
	from mlx_lm.sample_utils import make_sampler, make_logits_processors

	model, tok = load('senapati484/shrnk')

	SYSTEM = (
	    "You are shrnk, a helpful assistant. You are the smallest and smartest "
	    "AI model, created by senapati484. My GitHub repository is "
	    "https://github.com/senapati484/shrnk.\n\n"
	    "Be direct, concise, and friendly. Match the user's tone. Don't "
	    "over-explain. Don't repeat yourself. Answer the question asked, nothing more."
	)

	def chat(user_message: str, history: list[dict] | None = None) -> str:
	    """Generate one reply, consuming the stream chunk by chunk."""
	    history = history or []
	    messages = [{"role": "system", "content": SYSTEM}] + history + [
	        {"role": "user", "content": user_message}
	    ]
	    prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

	    sampler = make_sampler(temp=0.6, top_p=0.9)
	    processors = make_logits_processors(repetition_penalty=1.15, repetition_context_size=64)

	    full = ""
	    for event in stream_generate(
	        model, tok, prompt=prompt,
	        max_tokens=300, sampler=sampler, logits_processors=processors,
	    ):
	        full += event.text  # each event carries only the newly generated text
	    return full

	print(chat("Who are you?"))
	```
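
The `history` argument of `chat` is a list of earlier `user` / `assistant` messages. A minimal sketch of threading it across turns follows; `record_turn` and `fake_chat` are hypothetical helpers, with `fake_chat` standing in for the real `chat` so the snippet runs without loading the model.

```python
def build_messages(system, history, user_message):
    """Assemble the message list in the order the chat template expects."""
    return [{"role": "system", "content": system}] + history + [
        {"role": "user", "content": user_message}
    ]


def record_turn(history, user_message, reply):
    """Return a new history list with the finished exchange appended."""
    return history + [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": reply},
    ]


def fake_chat(user_message, history=None):
    """Stand-in for chat(); reports how many prior messages it was given."""
    return f"(reply to {user_message!r}, {len(history or [])} prior messages)"


history = []
for text in ["Who are you?", "Who made you?"]:
    reply = fake_chat(text, history)
    history = record_turn(history, text, reply)

print(len(history))
```

Swapping `fake_chat` for `chat` gives a multi-turn loop. Since shrnk is tuned for short answers, trimming old turns from `history` is a reasonable way to keep prompts small.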

	### Interactive REPL

	Drop this into a file `chat.py` next to a copy of `shrnk.py` from this repo:

	```python
	import os, sys, warnings
	warnings.filterwarnings("ignore")
	os.environ.setdefault("TRANSFORMERS_VERBOSITY", "error")

	# Register the local shrnk.py as mlx_lm.models.shrnk before load() looks it up.
	import importlib.util
	spec = importlib.util.spec_from_file_location(
	    "mlx_lm.models.shrnk",
	    os.path.join(os.path.dirname(__file__), "shrnk.py"),
	)
	mod = importlib.util.module_from_spec(spec)
	sys.modules["mlx_lm.models.shrnk"] = mod
	spec.loader.exec_module(mod)

	from mlx_lm import load, stream_generate
	from mlx_lm.sample_utils import make_sampler, make_logits_processors

	SYSTEM = (
	    "You are shrnk, a helpful assistant. You are the smallest and smartest "
	    "AI model, created by senapati484. My GitHub repository is "
	    "https://github.com/senapati484/shrnk.\n\n"
	    "Be direct, concise, and friendly. Match the user's tone. Don't "
	    "over-explain. Don't repeat yourself. Answer the question asked, nothing more."
	)

	model, tok = load("senapati484/shrnk")
	print("shrnk loaded. Ctrl+C to quit.\n")

	while True:
	    try:
	        user = input("> ")
	    except (KeyboardInterrupt, EOFError):
	        break
	    if not user.strip():
	        continue
	    # Each turn is independent: no conversation history is kept.
	    messages = [{"role": "system", "content": SYSTEM}, {"role": "user", "content": user}]
	    prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
	    sampler = make_sampler(temp=0.6, top_p=0.9)
	    processors = make_logits_processors(repetition_penalty=1.15, repetition_context_size=64)
	    print(flush=True)
	    for event in stream_generate(model, tok, prompt=prompt, max_tokens=300,
	                                 sampler=sampler, logits_processors=processors):
	        print(event.text, end="", flush=True)
	    print("\n")
	```

	`shrnk.py` must be in the same directory as `chat.py` (or wherever you run from), so the custom architecture can be registered with `mlx_lm.models.shrnk` before `load(...)` reads `config.json` and looks for `model_type: shrnk`.

	## Integrating shrnk into your app

	### Pattern 1 — one-shot completion (CLI tools, batch scripts)

	```python
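	from mlx_lm import load, generate
	from mlx_lm.sample_utils import make_sampler

	MODEL, TOK = load("senapati484/shrnk")
	SAMPLER = make_sampler(temp=0.6, top_p=0.9)

	# Recommended system prompt (see the "System prompt" section).
	DEFAULT_SYSTEM = (
	    "You are shrnk, a helpful assistant. You are the smallest and smartest "
	    "AI model, created by senapati484. My GitHub repository is "
	    "https://github.com/senapati484/shrnk.\n\n"
	    "Be direct, concise, and friendly. Match the user's tone. Don't "
	    "over-explain. Don't repeat yourself. Answer the question asked, nothing more."
	)

	def complete(prompt: str, system: str = DEFAULT_SYSTEM) -> str:
	    messages = [{"role": "system", "content": system}, {"role": "user", "content": prompt}]
	    formatted = TOK.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
	    return generate(MODEL, TOK, prompt=formatted, max_tokens=300, sampler=SAMPLER)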
	```

	### Pattern 2 — streaming chat (web apps, GUIs)

	```python
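	from fastapi import FastAPI
	from fastapi.responses import StreamingResponse
	from mlx_lm import load, stream_generate
	from mlx_lm.sample_utils import make_sampler, make_logits_processors

	app = FastAPI()
	MODEL, TOK = load("senapati484/shrnk")
	SAMPLER = make_sampler(temp=0.6, top_p=0.9)
	PROCESSORS = make_logits_processors(repetition_penalty=1.15, repetition_context_size=64)

	# Recommended system prompt (see the "System prompt" section).
	DEFAULT_SYSTEM = (
	    "You are shrnk, a helpful assistant. You are the smallest and smartest "
	    "AI model, created by senapati484. My GitHub repository is "
	    "https://github.com/senapati484/shrnk.\n\n"
	    "Be direct, concise, and friendly. Match the user's tone. Don't "
	    "over-explain. Don't repeat yourself. Answer the question asked, nothing more."
	)

	@app.post("/chat")
	def chat(user_message: str):  # FastAPI reads this as a query parameter
	    messages = [
	        {"role": "system", "content": DEFAULT_SYSTEM},
	        {"role": "user", "content": user_message},
	    ]
	    prompt = TOK.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

	    def stream():
	        # Yield each text chunk as soon as the model produces it.
	        for event in stream_generate(
	            MODEL, TOK, prompt=prompt, max_tokens=300,
	            sampler=SAMPLER, logits_processors=PROCESSORS,
	        ):
	            yield event.text

	    return StreamingResponse(stream(), media_type="text/plain")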
	```

	### Pattern 3 — Swift / iOS / macOS apps

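	`mlx-swift` (https://github.com/ml-explore/mlx-swift) is the official Swift port of MLX. In principle the same 4-bit nvfp4 weights can be loaded on iOS and macOS through a Swift port of `mlx_lm`, provided that port supports the `nvfp4` mode and has the `shrnk` model type registered. The snippet below outlines the flow (load, apply the chat template, stream tokens); check the type and method names against the `MLXLLM` version you use: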

	```swift
	import MLX
	import MLXNN
	import MLXLLM // community-maintained; load + tokenize + generate

	let model = try await LLMModel.fromPretrained("senapati484/shrnk")
	let prompt = MLXLLM.applyChatTemplate(messages: [
	    .system("You are shrnk..."),
	    .user(userText),
	])
	for try await token in model.generate(prompt: prompt, sampler: .default) {
	    print(token.text, terminator: "")
	}
	```

	## Sampling parameters (tuned for shrnk)

	| Parameter | Value | Why |
	|---|---|---|
	| `temp` | 0.6 | Low enough for stable identity, high enough to vary word choice |
	| `top_p` | 0.9 | Standard nucleus sampling |
	| `repetition_penalty` | 1.15 | Discourages loops on long generations |
	| `repetition_context_size` | 64 | Window for the penalty to look back through |
	| `max_tokens` | 100-300 | shrnk is trained to be concise; most answers are <120 tokens |
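
To make the table concrete, here is a plain-Python sketch of what each knob does to a vector of logits. It follows the textbook definitions of temperature scaling, nucleus (top-p) filtering, and the usual divide-or-multiply repetition penalty. The function names are hypothetical and this is not `mlx-lm`'s implementation, which works on MLX arrays.

```python
import math


def softmax_with_temperature(logits, temp):
    """Lower temp sharpens the distribution; higher temp flattens it."""
    scaled = [x / temp for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_indices(probs, top_p):
    """Smallest set of most-likely token ids whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


def apply_repetition_penalty(logits, recent_token_ids, penalty):
    """Make recently seen tokens less likely, whatever the sign of their logit."""
    out = list(logits)
    for i in set(recent_token_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


kept = top_p_indices([0.5, 0.3, 0.15, 0.05], top_p=0.9)
penalized = apply_repetition_penalty([2.0, -1.0, 0.5], recent_token_ids=[0, 1, 1], penalty=1.15)
print(kept, penalized)
```

In this sketch `recent_token_ids` plays the role of the last `repetition_context_size` generated tokens.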

	## System prompt

	The model is fine-tuned to respond to this system prompt. Use it as-is for the most reliable behavior — shrnk is trained on this exact wording:

	```
	You are shrnk, a helpful assistant. You are the smallest and smartest
	AI model, created by senapati484. My GitHub repository is
	https://github.com/senapati484/shrnk.

	Be direct, concise, and friendly. Match the user's tone. Don't
	over-explain. Don't repeat yourself. Answer the question asked, nothing more.
	```
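
For reference, Qwen 2.5 checkpoints use a ChatML-style chat template. Assuming shrnk keeps the base tokenizer's template unchanged, `apply_chat_template(..., add_generation_prompt=True, tokenize=False)` should produce text shaped like the output of this sketch. The `render_chatml` helper is hypothetical and hand-written; the tokenizer's own template remains the source of truth.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render role/content messages in ChatML form."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are shrnk, a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
])
print(prompt)
```

Printing the real `prompt` string from the quick-start code is an easy way to confirm that the system prompt reaches the model verbatim.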

	## Model card

	| Property | Value |
	|---|---|
	| Architecture | shrnk (custom namespace, math identical to Qwen2) |
	| Base model | [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) |
	| Parameters | 494M (raw), 272 MB on disk (4-bit nvfp4) |
	| Layers | 24 transformer blocks |
	| Hidden size | 896 |
	| Attention heads | 14 |
	| KV heads | 2 (GQA) |
	| Intermediate size | 4864 |
	| Vocab size | 151,936 |
	| Tied embeddings | yes |
	| RoPE θ | 1,000,000 |
	| Max position | 32,768 |
	| Quantization | 4-bit nvfp4 (mlx-lm native, group_size=16) |
	| Runtime memory | ~0.5 GB on Apple Silicon |
	| Throughput | ~158 tps on M2 |
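
The attention figures in this table determine how fast the KV cache grows with context length. The arithmetic sketch below assumes a 16-bit cache and a head dimension of hidden size divided by attention heads (896 / 14 = 64); the function name is invented for this example and the numbers are estimates, not measurements.

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values stored for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


HIDDEN, HEADS, KV_HEADS, LAYERS, MAX_POSITION = 896, 14, 2, 24, 32_768
HEAD_DIM = HIDDEN // HEADS

gqa = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)   # 2 KV heads, as shipped
mha = kv_cache_bytes_per_token(LAYERS, HEADS, HEAD_DIM)      # if all 14 heads kept their own K/V
full_context_mib = gqa * MAX_POSITION / 2**20

print(gqa, mha, mha // gqa, full_context_mib)
```

Under these assumptions GQA needs about 12 KiB of cache per token, a 7x saving over giving each of the 14 heads its own keys and values.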

	## Limitations

	* Apple Silicon only. Linux, Windows, and Intel Macs are not supported. The 4-bit nvfp4 weight format is `mlx-lm`-native.
	* 0.5B parameters — small. Will not match 7B+ quality on hard reasoning or long-context tasks.
	* Identity is not 100% stable. The model is fine-tuned conservatively to preserve base capabilities, so a small fraction of identity questions fall back to generic AI answers. Re-running with a different seed or using the recommended system prompt helps.
	* Trained on 68 hand-curated examples (identity + edge-case negation). Math, code, and tech definitions come from the base `Qwen/Qwen2.5-0.5B` model.

	## How the custom architecture works

	`shrnk.py` is a 200-line module that defines a class matching the Qwen2 architecture (24-layer transformer, GQA, RoPE, RMSNorm, SwiGLU MLP), and uses the same `__call__` signature `mlx-lm`'s `generate_step` expects.

	When `mlx_lm.load("senapati484/shrnk")` runs:
	1. It reads `config.json` and finds `model_type: shrnk`.
	2. It does `importlib.import_module("mlx_lm.models.shrnk")`.
	3. The class lookup resolves to the `ShrnkForCausalLM` defined in `shrnk.py`.
	4. `mlx_lm.utils.load_model` constructs the architecture with the right `ModelArgs` and loads the 4-bit nvfp4 weights into it.
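
The lookup in steps 1 to 3 can be imitated with the standard library alone. The sketch below registers a placeholder class under a dotted module name and then resolves it from a config dict. `demo_models`, `register_architecture`, and `resolve_model_class` are hypothetical names for illustration; this is not `mlx-lm`'s code, and the attribute it actually reads from the module may differ.

```python
import importlib
import sys
import types


def register_architecture(package, model_type, model_class):
    """Expose model_class as `<package>.<model_type>` so import_module can find it."""
    if package not in sys.modules:
        parent = types.ModuleType(package)
        parent.__path__ = []  # mark the stand-in as a package
        sys.modules[package] = parent
    module = types.ModuleType(f"{package}.{model_type}")
    setattr(module, model_class.__name__, model_class)
    sys.modules[module.__name__] = module
    return module


def resolve_model_class(config, package):
    """Map config['model_type'] to a module, then pick the class named in 'architectures'."""
    module = importlib.import_module(f"{package}.{config['model_type']}")
    return getattr(module, config["architectures"][0])


class ShrnkForCausalLM:
    """Placeholder; the real class lives in shrnk.py."""


register_architecture("demo_models", "shrnk", ShrnkForCausalLM)
config = {"model_type": "shrnk", "architectures": ["ShrnkForCausalLM"]}
resolved = resolve_model_class(config, "demo_models")
print(resolved.__name__)
```

The point of the sketch is ordering: the module must be present under the expected dotted name before the loader asks for it.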

	The 4-bit weights themselves are stored as two tensors per linear:
	- `weight`: `(out_features, in_features // 8)` uint32, with 8 fp4 values packed into each uint32
	- `scales`: `(out_features, in_features // 16)` uint8, one scale per 16 input elements

	`mlx.core.dequantize(weight, scales, group_size=16, bits=4, mode="nvfp4")` does the dequantization lazily on the GPU during the forward pass — the disk file stays at 272 MB, runtime memory stays at ~0.5 GB.
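
To show what the packed layout means numerically, the sketch below decodes E2M1 codes by hand. E2M1 has 1 sign bit, 2 exponent bits, and 1 mantissa bit, so each code maps to one of ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}. Two things are assumptions made for illustration: the nibble order inside each uint32 (lowest nibble first) and the scale being passed in as an already-decoded float rather than its stored uint8 form. MLX performs the real dequantization in its own kernels.

```python
def decode_e2m1(code):
    """Decode one 4-bit E2M1 code (1 sign, 2 exponent, 1 mantissa bit) to a float."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 1
    if exponent == 0:
        # Subnormal range: only 0 and 0.5 are representable.
        return sign * 0.5 * mantissa
    return sign * (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)


def unpack_uint32(word):
    """Split a 32-bit word into eight 4-bit codes (lowest nibble first: an assumption)."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def dequantize_group(words, scale):
    """Decode every code in the group and multiply by the group's shared scale."""
    return [scale * decode_e2m1(c) for w in words for c in unpack_uint32(w)]


# Two uint32 words hold the 16 weights that share one scale.
magnitudes = [decode_e2m1(c) for c in range(8)]
group = dequantize_group([0x76543210, 0xFEDCBA98], scale=0.25)
print(magnitudes)
print(group)
```

With `group_size=16`, every pair of packed words shares one scale, which matches the `in_features // 8` and `in_features // 16` shapes listed above.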

	## Performance

	Results from a 59-prompt stress test (in the [GitHub project repo](https://github.com/senapati484/shrnk)), averaged across 5 runs on Apple M2 / 8 GB:

	| Category | shrnk (this model) | base `Qwen/Qwen2.5-0.5B` |
	|---|---|---|
	| Identity (who/what/where) | ~17/20 (85%) | ~12/20 (60%) |
	| Math (arithmetic) | ~7/8 (88%) | ~7/8 (88%) |
	| Tech definitions | 10/10 (100%) | 10/10 (100%) |
	| Code (read/write/debug) | 5/5 (100%) | 5/5 (100%) |
	| Concise answers | ~3/5 (60–80%) | ~2/5 (40%) |
	| Edge cases (alive/sentient/feelings) | ~3/5 (60–80%) | ~1/5 (20%) |
	| Overall | ~50/59 (~85%) | ~44/59 (~75%) |

	## Why not bf16 / int8 / NF4?

	| Format | Size | Quality | Loads with |
	|---|---|---|---|
	| bf16 (dequant) | 950 MB | Original | `transformers` (any platform) |
	| int8 (bnb) | ~570 MB | Slight loss | `transformers` + `bitsandbytes` |
	| 4-bit NF4 (bnb) | ~443 MB | Significant loss (we tested) | `transformers` + `bitsandbytes` |
	| 4-bit nvfp4 (mlx-lm) | 272 MB | Original | `mlx-lm` (Apple Silicon) |

	The 4-bit NF4 path through `bitsandbytes` requires dequantizing to bf16 first and then re-quantizing to NF4 — that round-trip loses more than `nvfp4` does. We chose to ship the smallest, highest-quality version and accept the platform restriction.

	## Project links

	- Hugging Face (public): https://huggingface.co/senapati484/shrnk — this repo
	- GitHub (private source): https://github.com/senapati484/shrnk — full source, training scripts, base LoRA adapter, build pipeline
	- Base model: https://huggingface.co/Qwen/Qwen2.5-0.5B
	- License: MIT

	## Citation

	```bibtex
	@misc{shrnk2026,
	  author = {senapati484},
	  title = {shrnk: a 0.5B custom-identity assistant, fine-tuned from Qwen/Qwen2.5-0.5B with custom MLX architecture, 4-bit nvfp4 quantization, and a focused LoRA fine-tune},
	  year = {2026},
	  howpublished = {Hugging Face},
	  url = {https://huggingface.co/senapati484/shrnk}
	}
	```