# Porting **MobileLLM-R1-950M** to MLX and mlx-lm: Architectural Challenges and Solutions

I spent some time pairing with Gemini 2.5 Pro and later OpenAI Codex to drag the brand-new facebook/MobileLLM-R1-950M weights onto Apple Silicon.
This write-up is the “why it wasn’t copy-paste” story, plus the gotchas that bit us until the model finally spoke clean English and quantized without drama.

### Goal

Enable **facebook/MobileLLM-R1-950M** to run natively on Apple Silicon using MLX, then create quantized versions compatible with the mlx-lm ecosystem.

---

## 1. Why a Direct "Llama-4 Drop-In" Failed

Although the Hugging Face repo presents MobileLLM-R1-950M as a Llama-4-style dense model, its **config and weights don't align cleanly** with a stock Llama block. The deviations aren't quirks of MLX—they reflect this model's specific architecture:

* **MLP ambiguity**  
  Config advertises both `intermediate_size` and `intermediate_size_mlp`, suggesting a dual-branch feed-forward.  
  Actual weights contain only a SwiGLU branch (`gate_proj`, `up_proj`, `down_proj`).  
  → Solution: **auto-detect the MLP variant from weight names** at load time (see the sketch after this list).

* **Grouped-Query Attention (GQA)**  
  `num_attention_heads=24`, `num_key_value_heads=6`.  
  K/V tensors must be **repeated to full head count** for attention shapes to align correctly.

* **QK-norm and scaling**  
  Config includes `use_qk_norm=True` and `attn_scale=0.1`.  
  We add the **RMSNorm on Q/K** as specified, but drop the extra `0.1` multiplier—applying it in MLX's `scaled_dot_product_attention` collapses logits into gibberish.

* **RoPE gating**  
  Config lists all layers under `no_rope_layers`.  
  Disabling RoPE everywhere would eliminate positional encoding entirely.  
  → Treat "all layers disabled" as a config artifact and **apply RoPE everywhere**.

---

## 2. Prompt-Level Deviations

Even after weights loaded correctly, default inference was disrupted by tokenizer settings:

* **Chat template**  
  Default system prompt: *"Please reason step-by-step and put your final answer within \boxed{}."*  
  Without overrides, the model produces verbose "reasoning" outputs.  
  → Added CLI controls: `--system`, `--disable-chat-template`, `--final-only`.

* **Double BOS**  
  Both tokenizer and template inserted BOS tokens.  
  → Fixed with `add_special_tokens=False`.

* **Premature EOS**  
  Template headers (`<|eot_id|>`) were treated as stop tokens.  
  → Limited stopping criteria to the true EOS token only; a sketch of these tokenizer fixes follows.
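
A hedged sketch of the prompt-side fixes using the standard Hugging Face tokenizer API; the system prompt string and variable names are illustrative:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/MobileLLM-R1-950M")

messages = [
    # Equivalent of the --system override: replace the default \boxed{} prompt.
    {"role": "system", "content": "Answer concisely."},
    {"role": "user", "content": "What is 17 * 23?"},
]

# The chat template already emits a BOS token, so the tokenizer must not add
# another one (the "double BOS" problem above).
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
input_ids = tok(prompt, add_special_tokens=False).input_ids

# Stop only on the true EOS token, not on template headers like <|eot_id|>.
stop_ids = {tok.eos_token_id}
```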

---

## 3. Sampling Stability

Sampling issues stemmed from API mismatches rather than model problems:

* **Top-p applied to probabilities**, then fed to `mx.random.categorical` (which expects logits, not probabilities), produced repetition loops.  
* **Solution:** Apply penalties → scale logits → top-p mask (with `float('-inf')`) → `categorical(logits)` (see the sketch below).  
* Added controls for **temperature, repetition penalty, frequency penalty**.
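
A sketch of that pipeline in MLX, assuming the repetition/frequency penalties have already been applied to `logits`; the function name and default values are illustrative:

```python
import mlx.core as mx


def sample_top_p(logits: mx.array, temperature: float = 0.8, top_p: float = 0.95) -> mx.array:
    # Scale the logits (not probabilities) by temperature.
    logits = logits / max(temperature, 1e-6)

    # Sort descending and compute the cumulative probability mass.
    order = mx.argsort(-logits, axis=-1)
    sorted_logits = mx.take_along_axis(logits, order, axis=-1)
    probs = mx.softmax(sorted_logits, axis=-1)
    cum = mx.cumsum(probs, axis=-1)

    # Keep the smallest prefix whose mass reaches top_p (the top token is
    # always kept); mask everything else with -inf rather than zeroing probs.
    keep = (cum - probs) < top_p
    masked = mx.where(keep, sorted_logits, mx.array(float("-inf")))

    # mx.random.categorical expects unnormalized logits, so sample directly
    # from the masked logits in sorted space, then map back to vocab indices.
    choice = mx.random.categorical(masked, axis=-1)
    return mx.squeeze(mx.take_along_axis(order, mx.expand_dims(choice, -1), axis=-1), axis=-1)
```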

---

## 4. Quantization in mlx-lm: Why Custom Metadata Was Required

mlx-lm provides quantization hooks, but MobileLLM's architecture exposed several challenges:

1. **Frozen gradients during sensitivity analysis** → empty sensitivity lists.  
   → Avoid freezing weights during gradient computation.

2. **Re-quantizing quantized layers** → type errors on second pass.  
   → Skip `QuantizedLinear` layers if already quantized.

3. **Embedding/norm dtype crashes**  
   The standard path tried to quantize every layer, but embeddings (and norms) must remain float.  
   → Introduced **metadata-driven approach**: config.json records *per-layer bit-widths*. Only specified layers are instantiated as `QuantizedLinear`.

This metadata contract allows **4-bit mixed-precision MobileLLM** to be loaded cleanly by our **metadata-aware `custom_loader.py`**, making it compatible with the mlx-lm ecosystem; the sketch below illustrates the idea.
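
A sketch of how that contract can be consumed at load time; the `quantization_per_layer` key and its schema are assumptions for illustration, not necessarily the exact format written by the pipeline:

```python
import json

import mlx.nn as nn


def quantize_from_metadata(model: nn.Module, config_path: str) -> None:
    with open(config_path) as f:
        config = json.load(f)

    # Hypothetical schema, e.g.
    # {"model.layers.0.self_attn.q_proj": {"bits": 4, "group_size": 64}, ...}
    per_layer = config.get("quantization_per_layer", {})

    # Group the listed layer paths by requested bit-width.
    by_bits: dict[int, set[str]] = {}
    for path, spec in per_layer.items():
        by_bits.setdefault(spec["bits"], set()).add(path)

    for bits, paths in by_bits.items():
        def predicate(path, module, paths=paths):
            # Only layers named in the metadata become QuantizedLinear;
            # embeddings, norms, and already-quantized layers stay untouched.
            return (
                path in paths
                and isinstance(module, nn.Linear)
                and not isinstance(module, nn.QuantizedLinear)
            )

        nn.quantize(model, group_size=64, bits=bits, class_predicate=predicate)
```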

---

## 5. End State

* **MLX path:**  
  Structural fixes (GQA, MLP detection), numerical fixes (QK-norm, RoPE, attn_scale), and prompt controls together yield fluent, stable inference.

* **mlx-lm path:**  
  Custom quantization pipeline produces FP16 and 4-bit models. These can be loaded with our **metadata-aware `custom_loader.py`** and used for inference with our provided scripts.  
  Performance: measurable speedup and reduced VRAM usage on Apple Silicon, with minimal quality degradation.

---

### Takeaway

The MobileLLM-R1-950M port required systematically addressing architectural mismatches (MLP variant detection, GQA handling, QK-norm implementation, RoPE configuration) and developing a metadata-driven quantization approach. Once these were resolved, the model became fully functional in MLX with both float and quantized inference paths.