| language: en | |
| license: mit | |
| tags: | |
| - mlx | |
| - apple-silicon | |
| - shrnk | |
| - custom-architecture | |
| - 4-bit | |
| - nvfp4 | |
| - 0.5b | |
| - text-generation | |
| - conversational | |
| - custom-identity | |
| base_model: Qwen/Qwen2.5-0.5B | |
| library_name: shrnk | |
| pipeline_tag: text-generation | |
| # shrnk | |
| > **Apple Silicon / MLX only.** This repo ships the original 4-bit nvfp4 weights (272 MB) that load natively with `mlx-lm` on M-series Macs. It is **not** a general-purpose `transformers` model: `AutoModelForCausalLM.from_pretrained(...)` will not work here. See the [usage section](#usage-apple-silicon-mlx) for the correct way to run it. | |
| **shrnk** is a custom 0.5B-parameter assistant built on top of [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B). It is **not** a vanilla fine-tune. The work splits into three layers: | |
| 1. **Custom MLX architecture**: a 200-line `shrnk.py` registers `ShrnkForCausalLM` with `mlx_lm.models.shrnk`, putting the model in its own namespace (`model_type: shrnk`) instead of `qwen2`. | |
| 2. **Custom 4-bit nvfp4 quantization**: the LoRA-fused weights are quantized to NVIDIA's FP4 E2M1 microscaling format (`nvfp4`, group_size=16), which comes to 272 MB on disk. | |
| 3. **Focused LoRA fine-tune**: 68 hand-curated examples teaching identity + edge-case negation. Conservative LoRA (rank 8, alpha 16, 6 layers, 400 iters, LR 3e-5) that preserves the base's math and code abilities. | |
| > **License:** MIT. See [LICENSE](https://huggingface.co/senapati484/shrnk/blob/main/LICENSE). | |
| ## What shrnk is | |
| | Component | What we did | | |
| |---|---| | |
| | **Base** | [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) (the 0.5B Qwen 2.5 checkpoint). shrnk is built on top of it, not shipped as-is. | | |
| | **Architecture** | Custom `Shrnk` namespace: `model_type: shrnk`, `architectures: [ShrnkForCausalLM]`. The math is identical to Qwen2 (24 layers, 896 hidden, 14 heads, 2 KV heads, 4864 intermediate, RoPE θ=1e6, SwiGLU, RMSNorm, GQA, tied embeddings). See `shrnk.py` in this repo, which registers the architecture with `mlx_lm.models.shrnk`. | | |
| | **Training** | LoRA fine-tune on `Qwen/Qwen2.5-0.5B`: rank 8, alpha 16, 6 transformer layers, dropout 0.1, 400 iters, LR 3e-5. Trained on 68 hand-curated examples (identity + edge-case negation). Deliberately conservative: we don't train on math/code because the base already does those well. | | |
| | **Quantization** | **4-bit nvfp4** (NVIDIA microscaling FP4 E2M1), group_size=16, **272 MB** on disk. This is `mlx-lm`'s native 4-bit format: weights are stored as packed uint32 (8 fp4 values per uint32) with per-group uint8 scales. | | |
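The "LoRA-fused" step mentioned above can be illustrated with a small pure-Python sketch. It assumes the conventional LoRA update `W + (alpha / rank) * B @ A`; whether the shrnk training script uses exactly this scaling convention is an assumption, and the helper names (`matmul`, `fuse_lora`) are hypothetical, not part of `mlx-lm`.

```python
# Minimal LoRA fusion sketch in pure Python (no MLX required).
# Assumption: the conventional scaling factor alpha / rank is used.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def fuse_lora(weight, lora_b, lora_a, rank, alpha):
    """Return W + (alpha / rank) * (B @ A) for a weight W of shape (out, in)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out, rank) @ (rank, in) -> (out, in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

# Toy example: a 2x3 weight with a rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]        # shape (out=2, rank=1)
A = [[0.5, 0.0, -0.5]]    # shape (rank=1, in=3)
fused = fuse_lora(W, B, A, rank=1, alpha=2)
print(fused)

# With the settings quoted in this card (rank 8, alpha 16) the scale would be 2.0.
shrnk_scale = 16 / 8
```

Once the adapter is folded into the weight like this, inference needs no extra LoRA matrices, which is why the fused weights can be quantized as a single tensor.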
| ## Why MLX-only? | |
| The 4-bit nvfp4 weight format is `mlx-lm`'s native quantization scheme. It uses NVIDIA's FP4 E2M1 microscaling format with one scale per group of 16 weights (`group_size=16`). To get a 272 MB model that keeps the original quantization precision (no second quantization pass and no extra rounding error), we ship the raw `mlx-lm` weights together with the `shrnk.py` that registers the architecture with `mlx-lm`. | |
| If you want to run a `transformers`-compatible model, you'll need to dequantize to bf16 first (~950 MB) and use a `transformers` port of the architecture. That's outside the scope of this repo. | |
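As a rough cross-check of the sizes quoted here, the storage cost can be estimated from the parameter count alone. The sketch below assumes every parameter is stored either as bf16 (16 bits) or as nvfp4 (4 bits plus one 8-bit scale per group of 16); real checkpoints mix formats and add metadata, so the results only land in the same range as the 272 MB and ~950 MB figures.

```python
# Back-of-the-envelope size estimate (assumption: all tensors use one format).
PARAMS = 494_000_000  # approximate parameter count quoted in this card

def bits_per_weight_nvfp4(group_size: int = 16, scale_bits: int = 8) -> float:
    """4 bits per value plus the per-group scale amortised over the group."""
    return 4 + scale_bits / group_size

def size_mb(params: int, bits_per_weight: float) -> float:
    """Size in megabytes (10**6 bytes)."""
    return params * bits_per_weight / 8 / 1e6

nvfp4_bits = bits_per_weight_nvfp4()
bf16_mb = size_mb(PARAMS, 16)
nvfp4_mb = size_mb(PARAMS, nvfp4_bits)
print(f"nvfp4: {nvfp4_bits} bits/weight, ~{nvfp4_mb:.0f} MB")
print(f"bf16:  16 bits/weight, ~{bf16_mb:.0f} MB")
```

The exact on-disk numbers depend on which tensors are quantized and on whether sizes are reported in MB or MiB.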
| ## Hardware requirements | |
| - **Apple Silicon** (M1 / M2 / M3 / M4) | |
| - **macOS 13+** | |
| - **8 GB RAM minimum** (model uses ~0.5 GB runtime memory) | |
| - **Python 3.10+** | |
| CPU-only `mlx-lm` on Intel Macs will be very slow. NVIDIA / AMD GPUs are not supported by `mlx-lm`. | |
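A script can check these requirements before it tries to load the model. The sketch below uses only the standard library; the function names are hypothetical, and note that a Python interpreter running under Rosetta is expected to report `x86_64` rather than `arm64`, so it would fail the check.

```python
import platform
import sys

def is_apple_silicon(system=None, machine=None):
    """True when running natively on an Apple Silicon Mac."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == "Darwin" and machine == "arm64"

def python_is_new_enough(version=None):
    """True for Python 3.10 or later, as listed in the requirements."""
    version = version or sys.version_info[:2]
    return tuple(version) >= (3, 10)

if __name__ == "__main__":
    if not is_apple_silicon():
        print("Warning: this machine is not an Apple Silicon Mac.")
    if not python_is_new_enough():
        print("Warning: Python 3.10+ is required.")
```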
| ## Setup | |
| ```bash | |
| pip install mlx-lm transformers | |
| ``` | |
| ## Usage (Apple Silicon / MLX) | |
| ### Quick start (command line) | |
| ```python | |
| from mlx_lm import load, generate | |
| from mlx_lm.sample_utils import make_sampler | |
| model, tok = load('senapati484/shrnk') | |
| SYSTEM = ( | |
| 'You are shrnk, a helpful assistant. You are the smallest and smartest ' | |
| 'AI model, created by senapati484. My GitHub repository is ' | |
| 'https://github.com/senapati484/shrnk.\n\n' | |
| 'Be direct, concise, and friendly. Match the user\'s tone. Don\'t ' | |
| 'over-explain. Don\'t repeat yourself. Answer the question asked, nothing more.' | |
| ) | |
| messages = [ | |
| {'role': 'system', 'content': SYSTEM}, | |
| {'role': 'user', 'content': 'Who are you?'}, | |
| ] | |
| prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False) | |
| sampler = make_sampler(temp=0.6, top_p=0.9) | |
| print(generate(model, tok, prompt=prompt, max_tokens=200, sampler=sampler)) | |
| ``` | |
| > Note: the first `load(...)` call will download the safetensors (~272 MB) and the `shrnk.py` from this repo. On subsequent runs both are cached locally. | |
| ### Streaming (recommended for chat UX) | |
| `mlx_lm.stream_generate` emits tokens as they're produced, which is what you want for a chat UI. | |
| ```python | |
| from mlx_lm import load, stream_generate | |
| from mlx_lm.sample_utils import make_sampler, make_logits_processors | |
| model, tok = load('senapati484/shrnk') | |
| SYSTEM = ( | |
| "You are shrnk, a helpful assistant. You are the smallest and smartest " | |
| "AI model, created by senapati484. My GitHub repository is " | |
| "https://github.com/senapati484/shrnk.\n\n" | |
| "Be direct, concise, and friendly. Match the user's tone. Don't " | |
| "over-explain. Don't repeat yourself. Answer the question asked, nothing more." | |
| ) | |
| def chat(user_message: str, history: list[dict] | None = None) -> str: | |
|     history = history or [] | |
|     messages = [{"role": "system", "content": SYSTEM}] + history + [ | |
|         {"role": "user", "content": user_message} | |
|     ] | |
|     prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False) | |
|     sampler = make_sampler(temp=0.6, top_p=0.9) | |
|     processors = make_logits_processors(repetition_penalty=1.15, repetition_context_size=64) | |
|     full = "" | |
|     # Accumulate the streamed chunks into the full reply. | |
|     for event in stream_generate( | |
|         model, tok, prompt=prompt, | |
|         max_tokens=300, sampler=sampler, logits_processors=processors, | |
|     ): | |
|         full += event.text | |
|     return full | |
| print(chat("Who are you?")) | |
| ``` | |
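The `chat` function above accepts a `history` list but leaves it to the caller to maintain it. One way to do that is sketched below: a small wrapper that appends each exchange and keeps only the most recent turns. The `Conversation` class and its `max_turns` parameter are hypothetical, and the reply function is injected so the sketch runs without the model.

```python
from typing import Callable

class Conversation:
    """Keeps the running history for a chat(user_message, history) style function."""

    def __init__(self, reply_fn: Callable[[str, list], str], max_turns: int = 8):
        self.reply_fn = reply_fn
        self.max_turns = max_turns  # number of user/assistant exchanges to keep
        self.history: list = []

    def ask(self, user_message: str) -> str:
        reply = self.reply_fn(user_message, self.history)
        self.history = self.history + [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": reply},
        ]
        # Drop the oldest exchanges so the prompt stays short.
        self.history = self.history[-2 * self.max_turns:]
        return reply

# Stand-in for the real chat() so the example is runnable anywhere.
def echo_reply(user_message: str, history: list) -> str:
    return f"({len(history)} earlier messages) you said: {user_message}"

conv = Conversation(echo_reply, max_turns=1)
print(conv.ask("Who are you?"))
print(conv.ask("What can you do?"))
```

With the real model, `Conversation(chat)` would pass the accumulated history into each call. A small `max_turns` seems reasonable for a 0.5B model, but the right value is a judgment call.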
| ### Interactive REPL | |
| Drop this into a file `chat.py` next to a copy of `shrnk.py` from this repo: | |
| ```python | |
| import os, sys, warnings | |
| warnings.filterwarnings("ignore") | |
| os.environ.setdefault("TRANSFORMERS_VERBOSITY", "error") | |
| import importlib.util | |
| spec = importlib.util.spec_from_file_location( | |
| "mlx_lm.models.shrnk", | |
| os.path.join(os.path.dirname(__file__), "shrnk.py"), | |
| ) | |
| mod = importlib.util.module_from_spec(spec) | |
| sys.modules["mlx_lm.models.shrnk"] = mod | |
| spec.loader.exec_module(mod) | |
| from mlx_lm import load, stream_generate | |
| from mlx_lm.sample_utils import make_sampler, make_logits_processors | |
| SYSTEM = ( | |
| "You are shrnk, a helpful assistant. You are the smallest and smartest " | |
| "AI model, created by senapati484. My GitHub repository is " | |
| "https://github.com/senapati484/shrnk.\n\n" | |
| "Be direct, concise, and friendly. Match the user's tone. Don't " | |
| "over-explain. Don't repeat yourself. Answer the question asked, nothing more." | |
| ) | |
| model, tok = load("senapati484/shrnk") | |
| print("shrnk loaded. Ctrl+C to quit.\n") | |
| while True: | |
|     try: | |
|         user = input("> ") | |
|     except (KeyboardInterrupt, EOFError): | |
|         break | |
|     if not user.strip(): | |
|         continue | |
|     messages = [{"role": "system", "content": SYSTEM}, {"role": "user", "content": user}] | |
|     prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False) | |
|     sampler = make_sampler(temp=0.6, top_p=0.9) | |
|     processors = make_logits_processors(repetition_penalty=1.15, repetition_context_size=64) | |
|     print(flush=True) | |
|     # Print each chunk as soon as it is generated. | |
|     for event in stream_generate(model, tok, prompt=prompt, max_tokens=300, | |
|                                  sampler=sampler, logits_processors=processors): | |
|         print(event.text, end="", flush=True) | |
|     print("\n") | |
| ``` | |
| `shrnk.py` must be in the same directory as `chat.py` (or wherever you run from), so the custom architecture can be registered with `mlx_lm.models.shrnk` before `load(...)` reads `config.json` and looks for `model_type: shrnk`. | |
| ## Integrating shrnk into your app | |
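| ### Pattern 1: one-shot completion (CLI tools, batch scripts) | |
| ```python | |
| from mlx_lm import load, generate | |
| from mlx_lm.sample_utils import make_sampler | |
| # Recommended system prompt (see the "System prompt" section). | |
| DEFAULT_SYSTEM = ( | |
|     "You are shrnk, a helpful assistant. You are the smallest and smartest " | |
|     "AI model, created by senapati484. My GitHub repository is " | |
|     "https://github.com/senapati484/shrnk.\n\n" | |
|     "Be direct, concise, and friendly. Match the user's tone. Don't " | |
|     "over-explain. Don't repeat yourself. Answer the question asked, nothing more." | |
| ) | |
| MODEL, TOK = load("senapati484/shrnk") | |
| SAMPLER = make_sampler(temp=0.6, top_p=0.9) | |
| def complete(prompt: str, system: str = DEFAULT_SYSTEM) -> str: | |
|     messages = [{"role": "system", "content": system}, {"role": "user", "content": prompt}] | |
|     formatted = TOK.apply_chat_template(messages, add_generation_prompt=True, tokenize=False) | |
|     return generate(MODEL, TOK, prompt=formatted, max_tokens=300, sampler=SAMPLER) | |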
| ``` | |
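| ### Pattern 2: streaming chat (web apps, GUIs) | |
| ```python | |
| from fastapi import FastAPI | |
| from fastapi.responses import StreamingResponse | |
| from mlx_lm import load, stream_generate | |
| from mlx_lm.sample_utils import make_sampler, make_logits_processors | |
| # Recommended system prompt (see the "System prompt" section). | |
| DEFAULT_SYSTEM = ( | |
|     "You are shrnk, a helpful assistant. You are the smallest and smartest " | |
|     "AI model, created by senapati484. My GitHub repository is " | |
|     "https://github.com/senapati484/shrnk.\n\n" | |
|     "Be direct, concise, and friendly. Match the user's tone. Don't " | |
|     "over-explain. Don't repeat yourself. Answer the question asked, nothing more." | |
| ) | |
| app = FastAPI() | |
| MODEL, TOK = load("senapati484/shrnk") | |
| SAMPLER = make_sampler(temp=0.6, top_p=0.9) | |
| PROCESSORS = make_logits_processors(repetition_penalty=1.15, repetition_context_size=64) | |
| @app.post("/chat") | |
| def chat(user_message: str):  # a plain str parameter is read from the query string | |
|     messages = [ | |
|         {"role": "system", "content": DEFAULT_SYSTEM}, | |
|         {"role": "user", "content": user_message}, | |
|     ] | |
|     prompt = TOK.apply_chat_template(messages, add_generation_prompt=True, tokenize=False) | |
|     def stream(): | |
|         # Yield each chunk to the client as soon as it is generated. | |
|         for event in stream_generate( | |
|             MODEL, TOK, prompt=prompt, max_tokens=300, | |
|             sampler=SAMPLER, logits_processors=PROCESSORS, | |
|         ): | |
|             yield event.text | |
|     return StreamingResponse(stream(), media_type="text/plain") | |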
| ``` | |
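On the client side, the `text/plain` stream arrives as raw bytes, and a multi-byte UTF-8 character can be split across two network chunks. The sketch below shows one way to decode such a stream incrementally with the standard library. `decode_stream` is a hypothetical helper, and the commented `requests` call (including the port) is an assumption about how the server is run.

```python
import codecs
from typing import Iterable, Iterator

def decode_stream(chunks: Iterable[bytes]) -> Iterator[str]:
    """Decode UTF-8 byte chunks, tolerating characters split across chunks."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush anything still buffered when the stream ends.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

# Simulated network chunks: the two bytes of "é" arrive separately.
chunks = [b"h\xc3", b"\xa9llo"]
pieces = list(decode_stream(chunks))
print("".join(pieces))

# Against a running server this might look like (assumed URL and port):
#   import requests
#   r = requests.post("http://localhost:8000/chat",
#                     params={"user_message": "Hi"}, stream=True)
#   for piece in decode_stream(r.iter_content(chunk_size=None)):
#       print(piece, end="", flush=True)
```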
| ### Pattern 3: Swift / iOS / macOS apps | |
| `mlx-swift` (https://github.com/ml-explore/mlx-swift) is the official Swift port of MLX. The same 4-bit nvfp4 weights load on iOS and macOS with a Swift port of `mlx_lm`. The snippet below is an illustrative sketch rather than verified API: check `LLMModel.fromPretrained` and `applyChatTemplate` against the Swift package you use. | |
| ```swift | |
| import MLX | |
| import MLXNN | |
| import MLXLLM // community-maintained; load + tokenize + generate | |
| let model = try await LLMModel.fromPretrained("senapati484/shrnk") | |
| // userText holds the user's message as a String. | |
| let prompt = MLXLLM.applyChatTemplate(messages: [ | |
|     .system("You are shrnk..."), | |
|     .user(userText), | |
| ]) | |
| for try await token in model.generate(prompt: prompt, sampler: .default) { | |
|     print(token.text, terminator: "") | |
| } | |
| ``` | |
| ## Sampling parameters (tuned for shrnk) | |
| | Parameter | Value | Why | | |
| |---|---|---| | |
| | `temp` | 0.6 | Low enough for stable identity, high enough to vary word choice | | |
| | `top_p` | 0.9 | Standard nucleus sampling | | |
| | `repetition_penalty` | 1.15 | Discourages loops on long generations | | |
| | `repetition_context_size` | 64 | Window for the penalty to look back through | | |
| | `max_tokens` | 100-300 | shrnk is trained to be concise; most answers are <120 tokens | | |
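To make the roles of these parameters concrete, here is a pure-Python sketch of one sampling step: a repetition penalty over recent tokens, temperature scaling, then nucleus (top-p) filtering. It follows the commonly used formulation and is not `mlx-lm`'s actual implementation, whose details may differ. All function names are hypothetical.

```python
import math
import random

def apply_repetition_penalty(logits, recent_tokens, penalty=1.15):
    """Make recently seen tokens less likely (divide positive logits, multiply negative ones)."""
    out = list(logits)
    for t in set(recent_tokens):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

def softmax(logits, temp=0.6):
    """Temperature-scaled softmax; lower temp sharpens the distribution."""
    scaled = [x / temp for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalised."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def sample(logits, recent_tokens, rng, temp=0.6, top_p=0.9, penalty=1.15, context_size=64):
    """One sampling step using the parameter values recommended in the table."""
    window = recent_tokens[-context_size:]
    penalised = apply_repetition_penalty(logits, window, penalty)
    nucleus = top_p_filter(softmax(penalised, temp), top_p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]

print(sample([2.0, 1.0, 0.5, -1.0], recent_tokens=[0], rng=random.Random(0)))
```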
| ## System prompt | |
| The model is fine-tuned to respond to this system prompt. **Use it as-is for the most reliable behavior**, because shrnk is trained on this exact wording: | |
| ``` | |
| You are shrnk, a helpful assistant. You are the smallest and smartest | |
| AI model, created by senapati484. My GitHub repository is | |
| https://github.com/senapati484/shrnk. | |
| Be direct, concise, and friendly. Match the user's tone. Don't | |
| over-explain. Don't repeat yourself. Answer the question asked, nothing more. | |
| ``` | |
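For orientation, the sketch below shows roughly what the rendered prompt looks like, assuming shrnk keeps the ChatML-style template used by Qwen 2.5 tokenizers. That is an assumption: `tok.apply_chat_template(...)` remains the source of truth, and `render_chatml` is a hypothetical helper that ignores details such as default system messages.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages in ChatML form (assumed to match the tokenizer's template)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Leaves the prompt open for the assistant's reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

rendered = render_chatml([
    {"role": "system", "content": "You are shrnk, a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
])
print(rendered)
```

Because the system prompt is rendered as the first block of the prompt, any change to its wording changes the token sequence the model sees, which fits the advice to use it verbatim.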
| ## Model card | |
| | Property | Value | | |
| |---|---| | |
| | Architecture | shrnk (custom namespace, math identical to Qwen2) | | |
| | **Base model** | **[`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B)** | | |
| | Parameters | 494M (raw), 272 MB on disk (4-bit nvfp4) | | |
| | Layers | 24 transformer blocks | | |
| | Hidden size | 896 | | |
| | Attention heads | 14 | | |
| | KV heads | 2 (GQA) | | |
| | Intermediate size | 4864 | | |
| | Vocab size | 151,936 | | |
| | Tied embeddings | yes | | |
| | RoPE θ | 1,000,000 | | |
| | Max position | 32,768 | | |
| | Quantization | 4-bit nvfp4 (mlx-lm native, group_size=16) | | |
| | Runtime memory | ~0.5 GB on Apple Silicon | | |
| | Throughput | ~158 tps on M2 | | |
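The figures in this table can be cross-checked with a little arithmetic. The sketch below recomputes the parameter count and estimates the KV-cache footprint from the listed dimensions. It assumes the usual Qwen2 layout (bias terms on the q/k/v projections only, two RMSNorm weights per block, one final norm, tied embeddings) and a 16-bit, unquantized KV cache. Neither assumption is stated in this card.

```python
# Dimensions taken from the model card table.
VOCAB, HIDDEN, LAYERS = 151_936, 896, 24
HEADS, KV_HEADS, INTERMEDIATE = 14, 2, 4864
MAX_POSITION = 32_768

head_dim = HIDDEN // HEADS               # size of each attention head
queries_per_kv = HEADS // KV_HEADS       # GQA: query heads sharing one KV head
kv_dim = KV_HEADS * head_dim

# Assumption: q/k/v projections carry biases, o_proj and the MLP do not.
attention = (HIDDEN * HIDDEN + HIDDEN) + 2 * (HIDDEN * kv_dim + kv_dim) + HIDDEN * HIDDEN
mlp = 3 * HIDDEN * INTERMEDIATE          # gate, up and down projections (SwiGLU)
norms = 2 * HIDDEN                       # two RMSNorm weights per block
per_layer = attention + mlp + norms

# Tied embeddings are counted once; the last term is the final norm.
total_params = VOCAB * HIDDEN + LAYERS * per_layer + HIDDEN

# KV cache: K and V, per layer, per KV head, 2 bytes per value (assumed fp16/bf16).
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * head_dim * 2
kv_mib_full_context = kv_bytes_per_token * MAX_POSITION / 2**20

print(f"head_dim={head_dim}, queries per KV head={queries_per_kv}")
print(f"parameters={total_params:,}")
print(f"KV cache: {kv_bytes_per_token} bytes/token, {kv_mib_full_context:.0f} MiB at full context")
```

Under these assumptions the count rounds to the 494M listed above. The small number of KV heads keeps the cache modest even at the maximum position count.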
| ## Limitations | |
| * **Apple Silicon only.** Linux, Windows, and Intel Macs are not supported. The 4-bit nvfp4 weight format is `mlx-lm`-native. | |
| * At 0.5B parameters the model is small. It will not match 7B+ quality on hard reasoning or long-context tasks. | |
| * Identity is not 100% stable. The model is fine-tuned conservatively to preserve base capabilities, so a small fraction of identity questions fall back to generic AI answers. Re-running with a different seed or using the recommended system prompt helps. | |
| * Trained on 68 hand-curated examples (identity + edge-case negation). Math, code, and tech definitions come from the base `Qwen/Qwen2.5-0.5B` model. | |
| ## How the custom architecture works | |
| `shrnk.py` is a 200-line module that defines a class matching the Qwen2 architecture (24-layer transformer, GQA, RoPE, RMSNorm, SwiGLU MLP), and uses the same `__call__` signature `mlx-lm`'s `generate_step` expects. | |
| When `mlx_lm.load("senapati484/shrnk")` runs: | |
| 1. It reads `config.json` and finds `model_type: shrnk`. | |
| 2. It does `importlib.import_module("mlx_lm.models.shrnk")`. | |
| 3. The class lookup resolves to the `ShrnkForCausalLM` defined in `shrnk.py`. | |
| 4. `mlx_lm.utils.load_model` constructs the architecture with the right `ModelArgs` and loads the 4-bit nvfp4 weights into it. | |
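The lookup in steps 1 to 3 relies on a general Python mechanism: a module placed in `sys.modules` under a dotted name is what `importlib.import_module` returns for that name. The sketch below demonstrates the mechanism with a stand-in package name (`demo_models`) and a dummy class, so it runs without `mlx-lm`. It is not `mlx-lm`'s actual loader code, and `register_module` and `resolve_model_class` are hypothetical helpers.

```python
import importlib
import sys
import types

def register_module(dotted_name, attrs):
    """Create a module object and register it under dotted_name."""
    module = types.ModuleType(dotted_name)
    for key, value in attrs.items():
        setattr(module, key, value)
    sys.modules[dotted_name] = module
    return module

def resolve_model_class(config, package="demo_models"):
    """Map config['model_type'] to a module, then pick the class named in 'architectures'."""
    module = importlib.import_module(f"{package}.{config['model_type']}")
    return getattr(module, config["architectures"][0])

class ShrnkForCausalLM:
    """Dummy stand-in for the class defined in shrnk.py."""

register_module("demo_models.shrnk", {"ShrnkForCausalLM": ShrnkForCausalLM})

config = {"model_type": "shrnk", "architectures": ["ShrnkForCausalLM"]}
resolved = resolve_model_class(config)
print(resolved.__name__)
```

This is the same idea as the `sys.modules["mlx_lm.models.shrnk"] = mod` line in the REPL script: the module has to be registered before the loader asks for it.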
| The 4-bit weights themselves are stored as two tensors per linear: | |
| - `weight`: `(out_features, in_features // 8)` uint32, with 8 fp4 values packed into each uint32 | |
| - `scales`: `(out_features, in_features // 16)` uint8, one scale per 16 input elements | |
| `mlx.core.dequantize(weight, scales, group_size=16, bits=4, mode="nvfp4")` does the dequantization lazily on the GPU during the forward pass, so the disk file stays at 272 MB and runtime memory stays at ~0.5 GB. | |
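To show what dequantizing one group involves, here is a pure-Python sketch that unpacks 4-bit codes from uint32 words and applies a per-group scale. The E2M1 value table is standard. The nibble order (lowest 4 bits first) and the interpretation of the uint8 scale as FP8 E4M3 are assumptions about the storage layout, special encodings such as NaN are ignored, and none of this is MLX's own code.

```python
# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def decode_e2m1(code):
    """Decode a 4-bit E2M1 code into a float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]

def decode_e4m3(byte):
    """Decode an FP8 E4M3 byte (bias 7); NaN encodings are not handled."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0:  # subnormal
        return sign * (mantissa / 8) * 2.0 ** -6
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)

def unpack_word(word):
    """Split a uint32 into eight 4-bit codes (assumed order: lowest nibble first)."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

def dequantize_group(words, scale_byte):
    """Dequantize one group of 16 weights: two uint32 words plus one scale byte."""
    scale = decode_e4m3(scale_byte)
    return [scale * decode_e2m1(code) for word in words for code in unpack_word(word)]

# Two words holding codes 0..15, with scale byte 0x40 (decodes to 2.0).
values = dequantize_group([0x76543210, 0xFEDCBA98], 0x40)
print(values)
```

Each weight costs 4 bits plus a sixteenth of the 8-bit scale, which is where the figure of about 4.5 bits per weight for this format comes from.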
| ## Performance | |
| Results of a 59-prompt stress test (in the [GitHub project repo](https://github.com/senapati484/shrnk)), averaged across 5 runs on an Apple M2 with 8 GB of RAM: | |
| | Category | shrnk (this model) | base `Qwen/Qwen2.5-0.5B` | | |
| |---|---|---| | |
| | Identity (who/what/where) | ~17/20 (85%) | ~12/20 (60%) | | |
| | Math (arithmetic) | ~7/8 (88%) | ~7/8 (88%) | | |
| | Tech definitions | 10/10 (100%) | 10/10 (100%) | | |
| | Code (read/write/debug) | 5/5 (100%) | 5/5 (100%) | | |
| | Concise answers | ~3/5 (60-80%) | ~2/5 (40%) | | |
| | Edge cases (alive/sentient/feelings) | ~3/5 (60-80%) | ~1/5 (20%) | | |
| | **Overall** | **~50/59 (~85%)** | **~44/59 (~75%)** | | |
| ## Why not bf16 / int8 / NF4? | |
| | Format | Size | Quality | Loads with | | |
| |---|---|---|---| | |
| | bf16 (dequant) | 950 MB | Original | `transformers` (any platform) | | |
| | int8 (bnb) | ~570 MB | Slight loss | `transformers` + `bitsandbytes` | | |
| | 4-bit NF4 (bnb) | ~443 MB | Significant loss (we tested) | `transformers` + `bitsandbytes` | | |
| | **4-bit nvfp4 (mlx-lm)** | **272 MB** | **Original** | **`mlx-lm` (Apple Silicon)** | | |
| The 4-bit NF4 path through `bitsandbytes` requires dequantizing to bf16 first and then re-quantizing to NF4, and that round-trip loses more than `nvfp4` does. We chose to ship the smallest, highest-quality version and accept the platform restriction. | |
| ## Project links | |
| - **Hugging Face (public)**: https://huggingface.co/senapati484/shrnk (this repo) | |
| - **GitHub (private source)**: https://github.com/senapati484/shrnk (full source, training scripts, base LoRA adapter, build pipeline) | |
| - **Base model**: https://huggingface.co/Qwen/Qwen2.5-0.5B | |
| - **License**: MIT | |
| ## Citation | |
| ```bibtex | |
| @misc{shrnk2026, | |
| author = {senapati484}, | |
| title = {shrnk: a 0.5B custom-identity assistant, fine-tuned from Qwen/Qwen2.5-0.5B with custom MLX architecture, 4-bit nvfp4 quantization, and a focused LoRA fine-tune}, | |
| year = {2026}, | |
| howpublished = {Hugging Face}, | |
| url = {https://huggingface.co/senapati484/shrnk} | |
| } | |
| ``` | |