language: en
license: mit
tags:
- mlx
- apple-silicon
- shrnk
- custom-architecture
- 4-bit
- nvfp4
- 0.5b
- text-generation
- conversational
- custom-identity
base_model: Qwen/Qwen2.5-0.5B
library_name: shrnk
pipeline_tag: text-generation
shrnk
Apple Silicon / MLX only. This repo ships the original 4-bit nvfp4 weights (272 MB) that load natively with mlx-lm on M-series Macs. It is not a general-purpose transformers model: AutoModelForCausalLM.from_pretrained(...) will not work here. See the usage section for the correct way to run it.
shrnk is a custom 0.5B-parameter assistant built on top of Qwen/Qwen2.5-0.5B. It is not a vanilla fine-tune. The work splits into three layers:
- Custom MLX architecture: a 200-line shrnk.py registers ShrnkForCausalLM with mlx_lm.models.shrnk, putting the model in its own namespace (model_type: shrnk) instead of qwen2.
- Custom 4-bit nvfp4 quantization: the LoRA-fused weights are quantized to NVIDIA's FP4 E2M1 microscaling format (nvfp4, group_size=16), giving 272 MB on disk.
- Focused LoRA fine-tune: 68 hand-curated examples teaching identity + edge-case negation. Conservative LoRA (rank 8, alpha 16, 6 layers, 400 iters, LR 3e-5) that preserves the base's math and code abilities.
License: MIT. See LICENSE.
What shrnk is
| Component | What we did |
|---|---|
| Base | Qwen/Qwen2.5-0.5B (the 0.5B Qwen 2.5 base checkpoint). shrnk is built on top of it, not shipped as-is. |
| Architecture | Custom Shrnk namespace: model_type: shrnk, architectures: [ShrnkForCausalLM]. The math is identical to Qwen2 (24 layers, 896 hidden, 14 heads, 2 KV heads, 4864 intermediate, RoPE θ=1e6, SwiGLU, RMSNorm, GQA, tied embeddings). See shrnk.py in this repo: it registers the architecture with mlx_lm.models.shrnk. |
| Training | LoRA fine-tune on Qwen/Qwen2.5-0.5B: rank 8, alpha 16, 6 transformer layers, dropout 0.1, 400 iters, LR 3e-5. Trained on 68 hand-curated examples (identity + edge-case negation). Deliberately conservative: we don't train on math/code because the base already does those well. |
| Quantization | 4-bit nvfp4 (NVIDIA microscaling FP4 E2M1), group_size=16, 272 MB on disk. This is mlx-lm's native 4-bit format: weights are stored as packed uint32 (8 fp4 values per uint32) with per-group uint8 scales. |
Why MLX-only?
The 4-bit nvfp4 weight format is mlx-lm's native quantization scheme. It uses NVIDIA's FP4 E2M1 microscaling format with one scale per group of 16 weights (group_size=16). To keep the model at 272 MB without a second, lossy re-quantization pass, we ship the raw mlx-lm weights and the shrnk.py that registers the architecture with mlx-lm.
If you want to run a transformers-compatible model, you'll need to dequantize to bf16 first (~950 MB) and use a transformers port of the architecture. That's outside the scope of this repo.
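The two sizes quoted in this section can be sanity-checked with a little arithmetic. The sketch below is an estimate, not a measurement: it assumes every one of the roughly 494M parameters is stored in the named format, while a real checkpoint keeps a few small tensors (norms, biases) unquantized and tools differ on MB versus MiB. The function name estimated_size_mb is made up for this example.

```python
def estimated_size_mb(n_params: int, bits_per_weight: float) -> float:
    """Approximate storage in decimal megabytes for n_params weights."""
    return n_params * bits_per_weight / 8 / 1e6


N_PARAMS = 494_000_000  # "494M" from the model card table

# nvfp4: 4 bits per weight plus one 8-bit scale shared by each group of 16 weights.
NVFP4_BITS = 4 + 8 / 16  # 4.5 effective bits per weight
BF16_BITS = 16

nvfp4_mb = estimated_size_mb(N_PARAMS, NVFP4_BITS)
bf16_mb = estimated_size_mb(N_PARAMS, BF16_BITS)
print(f"nvfp4 ~ {nvfp4_mb:.0f} MB, bf16 ~ {bf16_mb:.0f} MB")
```

Both estimates land within a few percent of the 272 MB and ~950 MB figures above, which is about as close as a back-of-the-envelope count can get.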
Hardware requirements
- Apple Silicon (M1 / M2 / M3 / M4)
- macOS 13+
- 8 GB RAM minimum (model uses ~0.5 GB runtime memory)
- Python 3.10+
CPU-only mlx-lm on Intel Macs will be very slow. NVIDIA / AMD GPUs are not supported by mlx-lm.
Setup
pip install mlx-lm transformers
Usage (Apple Silicon / MLX)
Quick start: command line
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler
model, tok = load('senapati484/shrnk')
SYSTEM = (
'You are shrnk, a helpful assistant. You are the smallest and smartest '
'AI model, created by senapati484. My GitHub repository is '
'https://github.com/senapati484/shrnk.\n\n'
'Be direct, concise, and friendly. Match the user\'s tone. Don\'t '
'over-explain. Don\'t repeat yourself. Answer the question asked, nothing more.'
)
messages = [
{'role': 'system', 'content': SYSTEM},
{'role': 'user', 'content': 'Who are you?'},
]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
sampler = make_sampler(temp=0.6, top_p=0.9)
print(generate(model, tok, prompt=prompt, max_tokens=200, sampler=sampler))
Note: the first load(...) call will download the safetensors (~272 MB) and the shrnk.py from this repo. On subsequent runs both are cached locally.
Streaming (recommended for chat UX)
mlx_lm.stream_generate emits tokens as they're produced, which is what you want for a chat UI.
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tok = load('senapati484/shrnk')

SYSTEM = (
    "You are shrnk, a helpful assistant. You are the smallest and smartest "
    "AI model, created by senapati484. My GitHub repository is "
    "https://github.com/senapati484/shrnk.\n\n"
    "Be direct, concise, and friendly. Match the user's tone. Don't "
    "over-explain. Don't repeat yourself. Answer the question asked, nothing more."
)

def chat(user_message: str, history: list[dict] | None = None) -> str:
    history = history or []
    messages = [{"role": "system", "content": SYSTEM}] + history + [
        {"role": "user", "content": user_message}
    ]
    prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    sampler = make_sampler(temp=0.6, top_p=0.9)
    processors = make_logits_processors(repetition_penalty=1.15, repetition_context_size=64)
    full = ""
    for event in stream_generate(
        model, tok, prompt=prompt,
        max_tokens=300, sampler=sampler, logits_processors=processors,
    ):
        # event.text is the newly generated chunk; print it here to show tokens live.
        full += event.text
    return full

print(chat("Who are you?"))
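The chat function takes an optional history list, but the example call never passes one. Here is a minimal sketch of one way to carry context between calls. The helper names build_messages and record_turn and the max_turns cap are assumptions made for illustration; they are not part of mlx-lm or this repo.

```python
def build_messages(system: str, history: list[dict], user_message: str,
                   max_turns: int = 4) -> list[dict]:
    """Return the system prompt, the last `max_turns` exchanges, and the new user turn."""
    # Each exchange is two entries (user + assistant), so keep 2 * max_turns items.
    recent = history[-2 * max_turns:] if max_turns > 0 else []
    return [{"role": "system", "content": system}, *recent,
            {"role": "user", "content": user_message}]


def record_turn(history: list[dict], user_message: str, reply: str) -> None:
    """Append one finished exchange to the running history in place."""
    history.append({"role": "user", "content": user_message})
    history.append({"role": "assistant", "content": reply})


history: list[dict] = []
record_turn(history, "Who are you?", "I'm shrnk.")
messages = build_messages("You are shrnk...", history, "Who made you?")
print([m["role"] for m in messages])
```

After each reply, call record_turn and pass the same history list into the next chat call. Capping the number of past turns keeps the prompt short, which matters for a model trained to give brief answers.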
Interactive REPL
Drop this into a file chat.py next to a copy of shrnk.py from this repo:
import os, sys, warnings
warnings.filterwarnings("ignore")
os.environ.setdefault("TRANSFORMERS_VERBOSITY", "error")

# Register the local shrnk.py as mlx_lm.models.shrnk before load() looks it up.
import importlib.util
spec = importlib.util.spec_from_file_location(
    "mlx_lm.models.shrnk",
    os.path.join(os.path.dirname(__file__), "shrnk.py"),
)
mod = importlib.util.module_from_spec(spec)
sys.modules["mlx_lm.models.shrnk"] = mod
spec.loader.exec_module(mod)

from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

SYSTEM = (
    "You are shrnk, a helpful assistant. You are the smallest and smartest "
    "AI model, created by senapati484. My GitHub repository is "
    "https://github.com/senapati484/shrnk.\n\n"
    "Be direct, concise, and friendly. Match the user's tone. Don't "
    "over-explain. Don't repeat yourself. Answer the question asked, nothing more."
)

model, tok = load("senapati484/shrnk")
print("shrnk loaded. Ctrl+C to quit.\n")

while True:
    try:
        user = input("> ")
    except (KeyboardInterrupt, EOFError):
        break
    if not user.strip():
        continue
    messages = [{"role": "system", "content": SYSTEM}, {"role": "user", "content": user}]
    prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    sampler = make_sampler(temp=0.6, top_p=0.9)
    processors = make_logits_processors(repetition_penalty=1.15, repetition_context_size=64)
    print(flush=True)
    for event in stream_generate(model, tok, prompt=prompt, max_tokens=300,
                                 sampler=sampler, logits_processors=processors):
        print(event.text, end="", flush=True)
    print("\n")
shrnk.py must be in the same directory as chat.py (or wherever you run from), so the custom architecture can be registered with mlx_lm.models.shrnk before load(...) reads config.json and looks for model_type: shrnk.
Integrating shrnk into your app
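Pattern 1: one-shot completion (CLI tools, batch scripts)
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

MODEL, TOK = load("senapati484/shrnk")
SAMPLER = make_sampler(temp=0.6, top_p=0.9)
DEFAULT_SYSTEM = (
    "You are shrnk, a helpful assistant. You are the smallest and smartest AI model, "
    "created by senapati484. My GitHub repository is https://github.com/senapati484/shrnk.\n\n"
    "Be direct, concise, and friendly. Match the user's tone. Don't over-explain. "
    "Don't repeat yourself. Answer the question asked, nothing more."
)

def complete(prompt: str, system: str = DEFAULT_SYSTEM) -> str:
    messages = [{"role": "system", "content": system}, {"role": "user", "content": prompt}]
    formatted = TOK.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    return generate(MODEL, TOK, prompt=formatted, max_tokens=300, sampler=SAMPLER)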
Pattern 2: streaming chat (web apps, GUIs)
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

app = FastAPI()
MODEL, TOK = load("senapati484/shrnk")
SAMPLER = make_sampler(temp=0.6, top_p=0.9)
PROCESSORS = make_logits_processors(repetition_penalty=1.15, repetition_context_size=64)
DEFAULT_SYSTEM = (
    "You are shrnk, a helpful assistant. You are the smallest and smartest AI model, "
    "created by senapati484. My GitHub repository is https://github.com/senapati484/shrnk.\n\n"
    "Be direct, concise, and friendly. Match the user's tone. Don't over-explain. "
    "Don't repeat yourself. Answer the question asked, nothing more."
)

@app.post("/chat")
def chat(user_message: str):
    # user_message is read from the query string, e.g. POST /chat?user_message=Hello
    messages = [
        {"role": "system", "content": DEFAULT_SYSTEM},
        {"role": "user", "content": user_message},
    ]
    prompt = TOK.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

    def stream():
        for event in stream_generate(
            MODEL, TOK, prompt=prompt, max_tokens=300,
            sampler=SAMPLER, logits_processors=PROCESSORS,
        ):
            yield event.text

    return StreamingResponse(stream(), media_type="text/plain")
Pattern 3: Swift / iOS / macOS apps
mlx-swift (https://github.com/ml-explore/mlx-swift) is the official Swift port of MLX. The same 4-bit nvfp4 weights load on iOS and macOS with a Swift port of mlx_lm:
import MLX
import MLXNN
import MLXLLM // community-maintained; load + tokenize + generate
// Sketch: the type and method names below are illustrative; check them against the MLXLLM version you install.
let model = try await LLMModel.fromPretrained("senapati484/shrnk")
let prompt = MLXLLM.applyChatTemplate(messages: [
    .system("You are shrnk..."),
    .user(userText),
])
for try await token in model.generate(prompt: prompt, sampler: .default) {
    print(token.text, terminator: "")
}
Sampling parameters (tuned for shrnk)
| Parameter | Value | Why |
|---|---|---|
| temp | 0.6 | Low enough for stable identity, high enough to vary word choice |
| top_p | 0.9 | Standard nucleus sampling |
| repetition_penalty | 1.15 | Discourages loops on long generations |
| repetition_context_size | 64 | Window for the penalty to look back through |
| max_tokens | 100-300 | shrnk is trained to be concise; most answers are <120 tokens |
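To make the two repetition settings concrete, here is a minimal pure-Python sketch of the common repetition-penalty formulation: the logit of any token seen in the recent window is divided by the penalty when positive and multiplied by it when negative. The function name is hypothetical, and the logits processor inside mlx-lm may differ in detail.

```python
def apply_repetition_penalty(logits: list[float], recent_tokens: list[int],
                             penalty: float = 1.15,
                             context_size: int = 64) -> list[float]:
    """Return a copy of `logits` with recently generated token ids made less likely."""
    out = list(logits)
    # Only the last `context_size` generated tokens are considered.
    for tok in set(recent_tokens[-context_size:]):
        # Positive logits shrink, negative logits grow more negative.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


logits = [2.0, -1.0, 0.5, 3.0]
penalized = apply_repetition_penalty(logits, recent_tokens=[0, 1], penalty=2.0)
print(penalized)
```

With the recommended 1.15 the effect per step is mild, which is usually enough to break a loop without distorting short answers.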
System prompt
The model is fine-tuned to respond to this system prompt. Use it as-is for the most reliable behavior, since shrnk is trained on this exact wording:
You are shrnk, a helpful assistant. You are the smallest and smartest
AI model, created by senapati484. My GitHub repository is
https://github.com/senapati484/shrnk.
Be direct, concise, and friendly. Match the user's tone. Don't
over-explain. Don't repeat yourself. Answer the question asked, nothing more.
Model card
| Property | Value |
|---|---|
| Architecture | shrnk (custom namespace, math identical to Qwen2) |
| Base model | Qwen/Qwen2.5-0.5B |
| Parameters | 494M (raw), 272 MB on disk (4-bit nvfp4) |
| Layers | 24 transformer blocks |
| Hidden size | 896 |
| Attention heads | 14 |
| KV heads | 2 (GQA) |
| Intermediate size | 4864 |
| Vocab size | 151,936 |
| Tied embeddings | yes |
| RoPE θ | 1,000,000 |
| Max position | 32,768 |
| Quantization | 4-bit nvfp4 (mlx-lm native, group_size=16) |
| Runtime memory | ~0.5 GB on Apple Silicon |
| Throughput | ~158 tokens/s on M2 |
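The 494M figure can be reproduced from the other rows of the table. The sketch below assumes the usual Qwen2 layout: biases on the query, key, and value projections only, no bias on the output projection or MLP, two RMSNorm weights per block plus a final norm, and an output head that reuses the embedding matrix. The Config class and count_params function are illustrative names, not part of shrnk.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    vocab_size: int = 151_936
    hidden: int = 896
    layers: int = 24
    heads: int = 14
    kv_heads: int = 2
    intermediate: int = 4864


def count_params(c: Config) -> int:
    head_dim = c.hidden // c.heads      # 64
    kv_dim = c.kv_heads * head_dim      # 128; GQA shares K/V across query heads
    embed = c.vocab_size * c.hidden     # tied: also serves as the output head
    attn = (
        c.hidden * c.hidden + c.hidden       # q_proj weight + bias (assumed)
        + 2 * (c.hidden * kv_dim + kv_dim)   # k_proj and v_proj weight + bias (assumed)
        + c.hidden * c.hidden                # o_proj, no bias (assumed)
    )
    mlp = 3 * c.hidden * c.intermediate      # gate, up, and down projections
    norms = 2 * c.hidden                     # two RMSNorm weights per block
    return embed + c.layers * (attn + mlp + norms) + c.hidden  # + final norm


total = count_params(Config())
print(f"{total:,} parameters")
```

Under those assumptions the count comes out at roughly 494 million, with the embedding table alone accounting for more than a quarter of it.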
Limitations
- Apple Silicon only. Linux, Windows, and Intel Macs are not supported. The 4-bit nvfp4 weight format is mlx-lm-native.
- 0.5B parameters: small. Will not match 7B+ quality on hard reasoning or long-context tasks.
- Identity is not 100% stable: the model is fine-tuned conservatively to preserve base capabilities, so a small fraction of identity questions fall back to generic AI answers. Re-running with a different seed or using the recommended system prompt helps.
- Trained on 68 hand-curated examples (identity + edge-case negation). Math, code, and tech definitions come from the base Qwen/Qwen2.5-0.5B model.
How the custom architecture works
shrnk.py is a 200-line module that defines a class matching the Qwen2 architecture (24-layer transformer, GQA, RoPE, RMSNorm, SwiGLU MLP), and uses the same __call__ signature mlx-lm's generate_step expects.
When mlx_lm.load("senapati484/shrnk") runs:
- It reads config.json and finds model_type: shrnk.
- It does importlib.import_module("mlx_lm.models.shrnk").
- The class lookup resolves to the ShrnkForCausalLM defined in shrnk.py.
- mlx_lm.utils.load_model constructs the architecture with the right ModelArgs and loads the 4-bit nvfp4 weights into it.
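The lookup in these steps works because importlib.import_module returns whatever is already registered in sys.modules under the requested name. The sketch below shows that mechanism in isolation, assuming the loader does a plain import_module call as described. The package name demo_models and the stand-in class are hypothetical, so the example runs without mlx-lm installed.

```python
import importlib
import sys
import types

PACKAGE = "demo_models"  # hypothetical stand-in for mlx_lm.models


class ShrnkForCausalLM:
    """Stand-in for the architecture class defined in shrnk.py."""
    model_type = "shrnk"


# Register a parent package and the leaf module by hand, before any lookup.
pkg = types.ModuleType(PACKAGE)
pkg.__path__ = []  # marks the module object as a package
leaf = types.ModuleType(f"{PACKAGE}.shrnk")
leaf.ShrnkForCausalLM = ShrnkForCausalLM
sys.modules[PACKAGE] = pkg
sys.modules[f"{PACKAGE}.shrnk"] = leaf


def resolve_architecture(model_type: str, class_name: str = "ShrnkForCausalLM"):
    """Mimic the loader: import the module named after model_type, then fetch the class."""
    module = importlib.import_module(f"{PACKAGE}.{model_type}")
    return getattr(module, class_name)


print(resolve_architecture("shrnk").model_type)
```

This is why the registration has to happen before load(...) runs: once the name is in sys.modules, the import succeeds without touching the filesystem.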
The 4-bit weights themselves are stored as two tensors per linear:
- weight: (out_features, in_features // 8) uint32, with 8 fp4 values packed into each uint32
- scales: (out_features, in_features // 16) uint8, one scale per 16 input elements
mlx.core.dequantize(weight, scales, group_size=16, bits=4, mode="nvfp4") does the dequantization lazily on the GPU during the forward pass, so the disk file stays at 272 MB and runtime memory stays at ~0.5 GB.
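A pure-Python sketch of the storage layout may help when inspecting the tensors by hand. The E2M1 value table (1 sign bit, 2 exponent bits, 1 mantissa bit) is standard. Two things here are assumptions: that nibbles are packed lowest bits first, and that the group scale has already been decoded from its uint8 storage into a float. All function names are illustrative; real inference should go through mlx.core.

```python
# Magnitudes representable by the 3 non-sign bits of FP4 E2M1.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit code: the top bit is the sign, the low 3 bits index the table."""
    sign = -1.0 if nibble & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0b0111]


def unpack_uint32(word: int) -> list[int]:
    """Split a 32-bit word into 8 nibbles, lowest 4 bits first (assumed order)."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def dequantize_group(words: list[int], scale: float) -> list[float]:
    """Decode two packed words (16 fp4 values) that share one already-decoded scale."""
    return [scale * decode_e2m1(n) for w in words for n in unpack_uint32(w)]


def packed_shapes(out_features: int, in_features: int):
    """Shapes of the weight and scales tensors for one linear layer."""
    return (out_features, in_features // 8), (out_features, in_features // 16)


print(packed_shapes(896, 4864))
print(dequantize_group([0x76543210, 0xFEDCBA98], scale=2.0))
```

For an MLP projection with 896 outputs and 4864 inputs, each row shrinks to 608 packed words and 304 scales, which is where the 4.5 bits per weight comes from.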
Performance
Final diagnosis on a 59-prompt stress test (in the GitHub project repo), averaged across 5 runs on Apple M2 / 8 GB:
| Category | shrnk (this model) | base Qwen/Qwen2.5-0.5B |
|---|---|---|
| Identity (who/what/where) | ~17/20 (85%) | ~12/20 (60%) |
| Math (arithmetic) | ~7/8 (88%) | ~7/8 (88%) |
| Tech definitions | 10/10 (100%) | 10/10 (100%) |
| Code (read/write/debug) | 5/5 (100%) | 5/5 (100%) |
| Concise answers | ~3/5 (60-80%) | ~2/5 (40%) |
| Edge cases (alive/sentient/feelings) | ~3/5 (60-80%) | ~1/5 (20%) |
| Overall |
Why not bf16 / int8 / NF4?
| Format | Size | Quality | Loads with |
|---|---|---|---|
| bf16 (dequant) | 950 MB | Original | transformers (any platform) |
| int8 (bnb) | ~570 MB | Slight loss | transformers + bitsandbytes |
| 4-bit NF4 (bnb) | ~443 MB | Significant loss (we tested) | transformers + bitsandbytes |
| 4-bit nvfp4 (mlx-lm) | 272 MB | Original | mlx-lm (Apple Silicon) |
The 4-bit NF4 path through bitsandbytes requires dequantizing to bf16 first and then re-quantizing to NF4, and that round-trip loses more than nvfp4 does. We chose to ship the smallest, highest-quality version and accept the platform restriction.
Project links
- Hugging Face (public): https://huggingface.co/senapati484/shrnk (this repo)
- GitHub (private source): https://github.com/senapati484/shrnk (full source, training scripts, base LoRA adapter, build pipeline)
- Base model: https://huggingface.co/Qwen/Qwen2.5-0.5B
- License: MIT
Citation
@misc{shrnk2026,
author = {senapati484},
title = {shrnk: a 0.5B custom-identity assistant, fine-tuned from Qwen/Qwen2.5-0.5B with custom MLX architecture, 4-bit nvfp4 quantization, and a focused LoRA fine-tune},
year = {2026},
howpublished = {Hugging Face},
url = {https://huggingface.co/senapati484/shrnk}
}