# Porting **MobileLLM-R1-950M** to MLX and mlx-lm: Architectural Challenges and Solutions

I spent some time pairing with Gemini 2.5 Pro and later OpenAI Codex to drag the brand-new facebook/MobileLLM-R1-950M weights onto Apple Silicon.

This write-up is the “why it wasn’t copy-paste” story, plus the gotchas that bit us until the model finally spoke clean English and quantized without drama.

### Goal

Enable **facebook/MobileLLM-R1-950M** to run natively on Apple Silicon using MLX, then create quantized versions compatible with the mlx-lm ecosystem.

---
## 1. Why a Direct "Llama-4 Drop-In" Failed

Although the Hugging Face repo presents MobileLLM-R1-950M as a Llama-4-style dense model, its **config and weights don't align cleanly** with a stock Llama block. The deviations aren't quirks of MLX; they reflect this model's specific architecture:

* **MLP ambiguity**
  The config advertises both `intermediate_size` and `intermediate_size_mlp`, suggesting a dual-branch feed-forward, but the actual weights contain only a single SwiGLU branch (`gate_proj`, `up_proj`, `down_proj`).
  → Solution: **auto-detect the MLP variant from weight names** at load time.
* **Grouped-Query Attention (GQA)**
  `num_attention_heads=24`, `num_key_value_heads=6`.
  K/V tensors must be **repeated to the full head count** for attention shapes to align correctly.
* **QK-norm and scaling**
  The config includes `use_qk_norm=True` and `attn_scale=0.1`.
  We add the **RMSNorm on Q/K** as specified, but drop the extra `0.1` multiplier; applying it on top of MLX's `scaled_dot_product_attention` collapses the logits into gibberish.
* **RoPE gating**
  The config lists all layers under `no_rope_layers`.
  Disabling RoPE everywhere would eliminate positional encoding entirely.
  → Treat "all layers disabled" as a config artifact and **apply RoPE everywhere**.
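The load-time detection and the attention shape handling above can be sketched in plain numpy. This is a minimal sketch, not the actual port's code: `head_dim`, the function names, and the random inputs are all illustrative; only the 24/6 head counts come from the config.

```python
import numpy as np

def detect_mlp_variant(weight_names):
    # Infer the feed-forward variant from checkpoint weight names instead of
    # trusting the ambiguous intermediate_size / intermediate_size_mlp keys.
    return "swiglu" if any("gate_proj" in n for n in weight_names) else "dual_branch"

def rms_norm(x, eps=1e-5):
    # QK-norm: RMSNorm applied per head to Q and K (use_qk_norm=True).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def repeat_kv(x, n_rep):
    # GQA: tile each of the 6 KV heads so shapes match the 24 query heads.
    b, h, s, d = x.shape
    return np.broadcast_to(x[:, :, None], (b, h, n_rep, s, d)).reshape(b, h * n_rep, s, d)

n_heads, n_kv_heads, head_dim, seq = 24, 6, 64, 5   # head_dim is illustrative
q = rms_norm(np.random.randn(1, n_heads, seq, head_dim))
k = rms_norm(np.random.randn(1, n_kv_heads, seq, head_dim))
k = repeat_kv(k, n_heads // n_kv_heads)             # (1, 6, ...) -> (1, 24, ...)
# Standard 1/sqrt(head_dim) scaling only; the extra attn_scale=0.1 is dropped.
scores = (q @ k.transpose(0, 1, 3, 2)) / np.sqrt(head_dim)
```

With the KV heads repeated, `scores` has the full `(1, 24, seq, seq)` shape that softmax attention expects.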
---
## 2. Prompt-Level Deviations

Even after the weights loaded correctly, default inference was disrupted by tokenizer settings:

* **Chat template**
  Default system prompt: *"Please reason step-by-step and put your final answer within \boxed{}."*
  Without overrides, the model produces verbose "reasoning" outputs.
  → Added CLI controls: `--system`, `--disable-chat-template`, `--final-only`.
* **Double BOS**
  Both the tokenizer and the chat template inserted a BOS token.
  → Fixed with `add_special_tokens=False`.
* **Premature EOS**
  Template headers (`<|eot_id|>`) were treated as stop tokens, truncating output early.
  → Limited the stopping criteria to the true EOS token only.
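The BOS/EOS fixes boil down to two rules: don't let `encode` add a second BOS on top of the one the chat template already emits, and stop generation only on the true EOS. A toy sketch of that logic; `StubTokenizer` is a hypothetical stand-in for the real Hugging Face tokenizer, with made-up token ids:

```python
class StubTokenizer:
    # Hypothetical stand-in for the HF tokenizer, just to exercise the logic.
    bos_token_id, eos_token_id, eot_token_id = 1, 2, 3

    def apply_chat_template(self, messages):
        # The chat template already includes a BOS token of its own.
        return "<BOS>" + "".join(m["content"] for m in messages)

    def encode(self, text, add_special_tokens=True):
        # add_special_tokens=True prepends BOS regardless of the text.
        ids = [self.bos_token_id] if add_special_tokens else []
        if text.startswith("<BOS>"):
            ids.append(self.bos_token_id)
            text = text[len("<BOS>"):]
        return ids + [10 + (ord(c) % 100) for c in text]  # crude toy ids

tok = StubTokenizer()
prompt = tok.apply_chat_template([{"role": "user", "content": "hi"}])

naive = tok.encode(prompt)                            # BOS appears twice
fixed = tok.encode(prompt, add_special_tokens=False)  # single BOS from template

# Stop only on the true EOS; <|eot_id|>-style header tokens keep generating.
stop_ids = {tok.eos_token_id}
```

The same two-line fix (`add_special_tokens=False` plus a one-element stop set) is what the real scripts apply.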
---
## 3. Sampling Stability

Sampling issues stemmed from API mismatches rather than model problems:

* Applying **top-p to probabilities** and then feeding them to `mx.random.categorical`, which expects logits, produced repetition loops.
* **Solution:** apply penalties → scale logits by temperature → top-p mask (with `float('-inf')`) → `categorical(logits)`.
* Added controls for **temperature, repetition penalty, frequency penalty**.
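The fixed order of operations can be sketched in numpy. The Gumbel-max trick below stands in for `mx.random.categorical` (which likewise samples directly from logits); penalty handling is omitted for brevity, and the example logits are made up.

```python
import numpy as np

def top_p_filter(logits, top_p=0.9):
    # Keep the smallest set of tokens whose cumulative probability reaches
    # top_p; everything else is masked to -inf (sampled with probability 0).
    order = np.argsort(logits)[::-1]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    cum = np.cumsum(probs[order])
    keep = cum - probs[order] < top_p        # always keeps the top token
    filtered = np.full_like(logits, -np.inf)
    filtered[order[keep]] = logits[order[keep]]
    return filtered

def sample(logits, temperature=0.8, top_p=0.9, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    scaled = logits / temperature            # 1) scale logits
    masked = top_p_filter(scaled, top_p)     # 2) top-p mask with -inf
    gumbel = rng.gumbel(size=masked.shape)   # 3) categorical over *logits*
    return int(np.argmax(masked + gumbel))   #    via the Gumbel-max trick

logits = np.array([10.0, 9.5, 0.0, -5.0])
token = sample(logits, top_p=0.5)  # nucleus collapses to the single top token
```

Masking in logit space with `-inf` is what keeps this compatible with a logit-based `categorical`; renormalizing probabilities and sampling from those is where the original repetition loops came from.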
---
## 4. Quantization in mlx-lm: Why Custom Metadata Was Required

mlx-lm provides quantization hooks, but MobileLLM's architecture exposed several challenges:

1. **Frozen gradients during sensitivity analysis** → empty sensitivity lists.
   → Avoid freezing weights during gradient computation.
2. **Re-quantizing quantized layers** → type errors on a second pass.
   → Skip layers that are already `QuantizedLinear`.
3. **Embedding/norm dtype crashes**
   Standard quantization re-quantized everything, but embeddings and norms must remain float.
   → Introduced a **metadata-driven approach**: config.json records *per-layer bit-widths*, and only the listed layers are instantiated as `QuantizedLinear`.

This metadata contract allows **4-bit mixed-precision MobileLLM** to be loaded cleanly by our **metadata-aware `custom_loader.py`**, making it compatible with the mlx-lm ecosystem.
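The contract can be illustrated with a toy version of the per-layer decision the loader makes. The metadata keys, schema, and `layer_plan` helper here are illustrative assumptions, not the exact config.json format used by `custom_loader.py`.

```python
# Hypothetical per-layer quantization metadata, as recorded in config.json.
quant_metadata = {
    "model.layers.0.self_attn.q_proj": {"bits": 4, "group_size": 64},
    "model.layers.0.mlp.gate_proj": {"bits": 4, "group_size": 64},
    # embed_tokens and the norms are deliberately absent -> they stay float
}

def layer_plan(path, kind, metadata):
    """Decide how a metadata-aware loader instantiates one module."""
    if kind != "linear":
        return "float"        # embeddings/norms are never quantized
    spec = metadata.get(path)
    if spec is None:
        return "float"        # unlisted linears also stay float
    return f"QuantizedLinear({spec['bits']}-bit, group_size={spec['group_size']})"
```

Because the plan is derived from the metadata rather than from module types, a second load pass sees the same answers and never tries to re-quantize an already-quantized layer.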
---
## 5. End State

* **MLX path:**
  Structural fixes (GQA, MLP detection), numerical fixes (QK-norm, RoPE, attn_scale), and prompt controls together yield fluent, stable inference.
* **mlx-lm path:**
  The custom quantization pipeline produces FP16 and 4-bit models. These can be loaded with our **metadata-aware `custom_loader.py`** and used for inference with our provided scripts.

Performance: a measurable speedup and a reduced memory footprint on Apple Silicon (unified memory, so there is no discrete VRAM), with minimal quality degradation.

---

### Takeaway

The MobileLLM-R1-950M port required systematically addressing architectural mismatches (MLP variant detection, GQA handling, QK-norm implementation, RoPE configuration) and developing a metadata-driven quantization approach. Once these were resolved, the model became fully functional in MLX with both float and quantized inference paths.