Instructions to use robbiemu/MobileLLM-R1-950M-MLX with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use robbiemu/MobileLLM-R1-950M-MLX with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("robbiemu/MobileLLM-R1-950M-MLX")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
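
The README changes further down mention math helpers such as `--extract-boxed` in the custom runtime. When calling `generate` from Python, a similar post-processing step might look like the sketch below. It assumes the model wraps its final answer in `\boxed{...}`; `extract_boxed` is a hypothetical helper written for illustration, not part of mlx-lm.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, or None.

    Hypothetical helper: assumes the final answer is wrapped in \\boxed{...}.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None

    # Walk forward from the opening brace, tracking nesting depth so that
    # inner braces (e.g. \frac{1}{2}) do not end the match early.
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)

    return None  # braces never balanced, e.g. generation was cut off
```

With the example above, `answer = extract_boxed(text)` would hold the boxed answer, or `None` if the output contains no complete `\boxed{...}`.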

Notebooks
Google Colab
Kaggle
Local Apps
LM Studio

Pi

How to use robbiemu/MobileLLM-R1-950M-MLX with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "robbiemu/MobileLLM-R1-950M-MLX"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "robbiemu/MobileLLM-R1-950M-MLX"
        }
      ]
    }
  }
}
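
If `~/.pi/agent/models.json` already lists other providers, pasting the snippet above over the file would drop them. A minimal sketch of merging the entry instead, assuming the file has the top-level `providers` object shown above; `merge_provider` and `update_models_file` are hypothetical helper names, not part of Pi:

```python
import json
from pathlib import Path


def merge_provider(config: dict, name: str, base_url: str, model_id: str) -> dict:
    """Return a copy of `config` with one provider entry added or replaced."""
    providers = dict(config.get("providers", {}))
    providers[name] = {
        "baseUrl": base_url,
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": model_id}],
    }
    return {**config, "providers": providers}


def update_models_file(path: Path) -> None:
    """Read the JSON file if it exists, merge the mlx-lm provider, write it back."""
    config = json.loads(path.read_text()) if path.exists() else {}
    config = merge_provider(
        config,
        "mlx-lm",
        "http://localhost:8080/v1",
        "robbiemu/MobileLLM-R1-950M-MLX",
    )
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n")


# Example call (writes to your home directory, so it is left commented out):
# update_models_file(Path.home() / ".pi" / "agent" / "models.json")
```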

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use robbiemu/MobileLLM-R1-950M-MLX with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "robbiemu/MobileLLM-R1-950M-MLX"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default robbiemu/MobileLLM-R1-950M-MLX

Run Hermes

hermes

MLX LM

How to use robbiemu/MobileLLM-R1-950M-MLX with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "robbiemu/MobileLLM-R1-950M-MLX"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "robbiemu/MobileLLM-R1-950M-MLX"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "robbiemu/MobileLLM-R1-950M-MLX",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
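
The same request can be sent from Python using only the standard library. This is a sketch that assumes the server is listening on `localhost:8080`, as in the Pi and Hermes configurations above; `build_chat_request` and `chat` are illustrative helper names, and the response is assumed to follow the OpenAI chat-completions shape.

```python
import json
import urllib.request

# Assumption: mlx_lm.server was started with its default host and port.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(model: str, user_message: str, max_tokens: int = 256) -> dict:
    """Build the JSON payload for a single-turn chat completion."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }


def chat(payload: dict, base_url: str = BASE_URL) -> str:
    """POST the payload to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# payload = build_chat_request("robbiemu/MobileLLM-R1-950M-MLX", "Hello")
# print(chat(payload))
```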

robbiemu committed on Sep 15, 2025

Commit

c45463c

1 Parent(s): 1f30715

fixed some formatting and added mlx-lm examples

README.md +13 -6

@@ -319,7 +319,6 @@ Details
 - The loader maps HF weight names to MLX module names and detects the MLP variant from weight keys to ensure correct layer wiring.
 - Attention uses standard `1/sqrt(d)` scaling for best generation quality.
-```markdown
 ## Installation
 This project uses `uv` for dependency management.
@@ -335,7 +334,6 @@ uv sync
 # 3. (Optional) Add the torch group if you plan to customize/train models
 uv sync --extra torch
-```
 ### Without uv
 If you prefer pip/venv, a `requirements.txt` is provided:
@@ -346,7 +344,6 @@ pip install -r requirements.txt
 ```
 > The `torch` extra is only required if you intend to fine-tune or swap model back-ends; the default installation already supports inference.
-```
 ## MLX Inference Examples (safetensors)
@@ -377,7 +374,7 @@ This runtime mirrors the functional details of the released weights so they load
   - Map HF names to MLX names during load: `model.embed_tokens`→`tok_embeddings`, layer/attn/norm renames, `mlp.`→`feed_forward.`, `model.norm`→`norm`.
 - Template and decoding
-  - Provide a Jinja chat template for parity with HF chat usage, but allow `--disable-chat-template` for raw prompting. Multiple EOS IDs are supported.
   - Sampling: temperature, top‑p, and greedy; optional repetition/frequency penalties; math helpers `--final-only/--stop-at-boxed/--extract-boxed` to keep answers concise.
 # Model Details
@@ -436,7 +433,7 @@ Compared to existing fully open-source models, MobileLLM-R1 950M model achieves
 # How to use
 To load the pretrained model for further finetuning or evaluation:
-```bash
 from transformers import AutoModelForCausalLM, AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("facebook/MobileLLM-R1-950M")
 model = AutoModelForCausalLM.from_pretrained("facebook/MobileLLM-R1-950M")
@@ -467,7 +464,17 @@ Flags in `inference.py`
 See also: the “MLX Runtime (Apple silicon) — Added Files & Usage” section above for more examples and notes.
-Transformers
 ```py
 from transformers import pipeline

 - The loader maps HF weight names to MLX module names and detects the MLP variant from weight keys to ensure correct layer wiring.
 - Attention uses standard `1/sqrt(d)` scaling for best generation quality.
 ## Installation
 This project uses `uv` for dependency management.
 # 3. (Optional) Add the torch group if you plan to customize/train models
 uv sync --extra torch
 ### Without uv
 If you prefer pip/venv, a `requirements.txt` is provided:
 ```
 > The `torch` extra is only required if you intend to fine-tune or swap model back-ends; the default installation already supports inference.
 ## MLX Inference Examples (safetensors)
   - Map HF names to MLX names during load: `model.embed_tokens`→`tok_embeddings`, layer/attn/norm renames, `mlp.`→`feed_forward.`, `model.norm`→`norm`.
 - Template and decoding
+  - The provided Jinja chat template is supported for parity with HF chat usage, and `--disable-chat-template` allows raw prompting. Multiple EOS IDs are supported.
   - Sampling: temperature, top‑p, and greedy; optional repetition/frequency penalties; math helpers `--final-only/--stop-at-boxed/--extract-boxed` to keep answers concise.
 # Model Details
 # How to use
 To load the pretrained model for further finetuning or evaluation:
+```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("facebook/MobileLLM-R1-950M")
 model = AutoModelForCausalLM.from_pretrained("facebook/MobileLLM-R1-950M")
 See also: the “MLX Runtime (Apple silicon) — Added Files & Usage” section above for more examples and notes.
+## Inference (MLX-LM)
+Two mlx-lm models are also provided: a conversion and a dynamic 4-bit quantization. Code to reproduce them and a handy inference runtime are provided in `custom_mlx_lm/`. After installation the following examples should work (from memory, you may first need to copy the model into `mlx_lm/` as `llama4_text.py`):
+```bash
+mobilellm-infer --model-path MobileLLM-R1-950M-mixed-4bit-mlx --prompt "What is the nearest prime to 9^2?"
+mobilellm-infer --model-path MobileLLM-R1-950M-mlx/ --prompt "What is the nearest prime to 9^2?"
+```
+## Transformers
 ```py
 from transformers import pipeline