Tags: Text Generation, MLX, Safetensors, PyTorch, English, llama4_text, facebook, meta, mobilellm, mlx - apple-mlx - runtime, conversational
Instructions to use robbiemu/MobileLLM-R1-950M-MLX with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use robbiemu/MobileLLM-R1-950M-MLX with MLX:
```python
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("robbiemu/MobileLLM-R1-950M-MLX")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use robbiemu/MobileLLM-R1-950M-MLX with Pi:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "robbiemu/MobileLLM-R1-950M-MLX"
```
Configure the model in Pi
```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```

Add to `~/.pi/agent/models.json`:

```json
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "robbiemu/MobileLLM-R1-950M-MLX" }
      ]
    }
  }
}
```

Run Pi
```bash
# Start Pi in your project directory:
pi
```
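If `~/.pi/agent/models.json` already lists other providers, pasting the snippet over it would drop them. Here is a minimal Python sketch that merges the `mlx-lm` entry into the existing file instead. It assumes the file holds a single JSON object with a top-level `providers` map, shaped like the snippet above. `add_mlx_provider` is a hypothetical helper and is not part of Pi.

```python
import json
from pathlib import Path

# The provider entry from the models.json snippet above.
MLX_PROVIDER = {
    "baseUrl": "http://localhost:8080/v1",
    "api": "openai-completions",
    "apiKey": "none",
    "models": [{"id": "robbiemu/MobileLLM-R1-950M-MLX"}],
}


def add_mlx_provider(config_path: Path, provider: dict = MLX_PROVIDER) -> dict:
    """Hypothetical helper: merge an "mlx-lm" provider into a Pi models.json.

    Providers already in the file are kept; an existing "mlx-lm" entry is replaced.
    """
    # Start from the existing config if there is one, else from an empty object.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config.setdefault("providers", {})["mlx-lm"] = dict(provider)
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Call it as `add_mlx_provider(Path.home() / ".pi" / "agent" / "models.json")` before starting Pi.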
- Hermes Agent
How to use robbiemu/MobileLLM-R1-950M-MLX with Hermes Agent:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "robbiemu/MobileLLM-R1-950M-MLX"
```
Configure Hermes
```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default robbiemu/MobileLLM-R1-950M-MLX
```
Run Hermes
```bash
hermes
```
- MLX LM
How to use robbiemu/MobileLLM-R1-950M-MLX with MLX LM:
Generate or start a chat session
```bash
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "robbiemu/MobileLLM-R1-950M-MLX"
```
Run an OpenAI-compatible server
```bash
# Install MLX LM
uv tool install mlx-lm

# Start the server (listens on port 8080 by default)
mlx_lm.server --model "robbiemu/MobileLLM-R1-950M-MLX"

# In another shell, call the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "robbiemu/MobileLLM-R1-950M-MLX",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
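The same chat request can be sent from Python using only the standard library. This is a minimal sketch. It assumes `mlx_lm.server` is running locally on its default port 8080, the address used in the Pi and Hermes configs above. It also assumes the reply follows the OpenAI response shape (`choices[0].message.content`). The helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

# Assumed server address: mlx_lm.server's default port on localhost.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "robbiemu/MobileLLM-R1-950M-MLX"


def build_chat_request(content: str, base_url: str = BASE_URL,
                       model: str = MODEL_ID) -> urllib.request.Request:
    """Build the POST /chat/completions request the curl example sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(content: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(content), timeout=timeout) as resp:
        body = json.load(resp)
    # OpenAI-style responses carry the reply in choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` mirrors the curl call.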
fixed some formatting and added mlx-lm examples
README.md (changed)
````diff
@@ -319,7 +319,6 @@ Details
 - The loader maps HF weight names to MLX module names and detects the MLP variant from weight keys to ensure correct layer wiring.
 - Attention uses standard `1/sqrt(d)` scaling for best generation quality.
 
-```markdown
 ## Installation
 
 This project uses `uv` for dependency management.
````
````diff
@@ -335,7 +334,6 @@ uv sync
 
 # 3. (Optional) Add the torch group if you plan to customize/train models
 uv sync --extra torch
-```
 
 ### Without uv
 If you prefer pip/venv, a `requirements.txt` is provided:
````
````diff
@@ -346,7 +344,6 @@ pip install -r requirements.txt
 ```
 
 > The `torch` extra is only required if you intend to fine-tune or swap model back-ends; the default installation already supports inference.
-```
 
 ## MLX Inference Examples (safetensors)
 
````
````diff
@@ -377,7 +374,7 @@ This runtime mirrors the functional details of the released weights so they load
 - Map HF names to MLX names during load: `model.embed_tokens`→`tok_embeddings`, layer/attn/norm renames, `mlp.`→`feed_forward.`, `model.norm`→`norm`.
 
 - Template and decoding
--
 - Sampling: temperature, top‑p, and greedy; optional repetition/frequency penalties; math helpers `--final-only/--stop-at-boxed/--extract-boxed` to keep answers concise.
 
 # Model Details
````
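The Sampling bullet in this hunk lists temperature, top-p, greedy decoding, and repetition/frequency penalties. Below is a minimal pure-Python sketch of how those pieces commonly combine for one decoding step, taking raw logits as a list of floats. It is not the runtime's code, and the default values are illustrative only.

Two formulas are assumptions about what `inference.py` actually does:

- The repetition penalty follows the widely used convention of dividing positive logits and multiplying negative ones for already-generated tokens.
- The frequency penalty subtracts a multiple of each token's count.

```python
import math
import random
from collections import Counter
from typing import List, Optional, Sequence


def apply_penalties(logits: Sequence[float], generated: Sequence[int],
                    repetition_penalty: float = 1.0,
                    frequency_penalty: float = 0.0) -> List[float]:
    """Return a copy of logits with already-generated tokens penalized."""
    out = list(logits)
    for tok, count in Counter(generated).items():
        if repetition_penalty != 1.0:
            # Assumed convention: shrink positive logits and push negative ones
            # further down, so a repeated token always loses probability.
            if out[tok] > 0:
                out[tok] /= repetition_penalty
            else:
                out[tok] *= repetition_penalty
        # Frequency penalty scales with how often the token was produced.
        out[tok] -= frequency_penalty * count
    return out


def sample_next(logits: Sequence[float], temperature: float = 0.7,
                top_p: float = 0.9, rng: Optional[random.Random] = None) -> int:
    """Greedy pick when temperature <= 0, otherwise temperature + top-p sampling."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus: the smallest set of most likely tokens whose mass reaches top_p.
    kept, mass = [], 0.0
    for i in sorted(range(len(probs)), key=lambda i: probs[i], reverse=True):
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Draw from the nucleus, renormalized by its total mass.
    r = rng.random() * mass
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r < acc:
            return i
    return kept[-1]  # guard against floating-point round-off
```

In a decode loop you would pass each step's logits and the ids generated so far through `apply_penalties`, then call `sample_next`, and stop once the returned id is an EOS id.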
````diff
@@ -436,7 +433,7 @@ Compared to existing fully open-source models, MobileLLM-R1 950M model achieves
 # How to use
 
 To load the pretrained model for further finetuning or evaluation:
-```
 from transformers import AutoModelForCausalLM, AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("facebook/MobileLLM-R1-950M")
 model = AutoModelForCausalLM.from_pretrained("facebook/MobileLLM-R1-950M")
````
````diff
@@ -467,7 +464,17 @@ Flags in `inference.py`
 
 See also: the “MLX Runtime (Apple silicon) — Added Files & Usage” section above for more examples and notes.
 
-
 
 ```py
 from transformers import pipeline
````
````diff
@@ -319,7 +319,6 @@ Details
 - The loader maps HF weight names to MLX module names and detects the MLP variant from weight keys to ensure correct layer wiring.
 - Attention uses standard `1/sqrt(d)` scaling for best generation quality.
 
 ## Installation
 
 This project uses `uv` for dependency management.
````
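The `1/sqrt(d)` bullet in this hunk refers to standard scaled dot-product attention, softmax(q·kᵀ / √d)·V, where d is the query/key dimension. Here is a minimal pure-Python sketch for a single query vector, with plain lists standing in for MLX arrays. It only illustrates the scaling and is not the runtime's attention module.

```python
import math
from typing import List, Sequence


def attention_weights(q: Sequence[float], keys: Sequence[Sequence[float]]) -> List[float]:
    """Softmax over q·k / sqrt(d) for every key vector k."""
    d = len(q)
    scale = 1.0 / math.sqrt(d)  # the standard 1/sqrt(d) factor
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q: Sequence[float], keys: Sequence[Sequence[float]],
           values: Sequence[Sequence[float]]) -> List[float]:
    """Weighted sum of value vectors using the scaled attention weights."""
    weights = attention_weights(q, keys)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
```

As a concrete case, take d = 4, q = [2, 0, 0, 0] and keys [1, 0, 0, 0] and [0, 0, 0, 0]:

- The raw dot products are 2 and 0; after scaling they become 1 and 0.
- The first weight is therefore e/(e+1) ≈ 0.73. Without the factor it would be about 0.88.
- Dividing by √d keeps dot products from growing with head dimension and saturating the softmax.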
````diff
@@ -335,7 +334,6 @@ uv sync
 
 # 3. (Optional) Add the torch group if you plan to customize/train models
 uv sync --extra torch
 
 ### Without uv
 If you prefer pip/venv, a `requirements.txt` is provided:
````
````diff
@@ -346,7 +344,6 @@ pip install -r requirements.txt
 ```
 
 > The `torch` extra is only required if you intend to fine-tune or swap model back-ends; the default installation already supports inference.
 
 ## MLX Inference Examples (safetensors)
 
````
````diff
@@ -377,7 +374,7 @@ This runtime mirrors the functional details of the released weights so they load
 - Map HF names to MLX names during load: `model.embed_tokens`→`tok_embeddings`, layer/attn/norm renames, `mlp.`→`feed_forward.`, `model.norm`→`norm`.
 
 - Template and decoding
+- The provided Jinja chat template is supported for parity with HF chat usage, and `--disable-chat-template` allows raw prompting. Multiple EOS IDs are supported.
 - Sampling: temperature, top‑p, and greedy; optional repetition/frequency penalties; math helpers `--final-only/--stop-at-boxed/--extract-boxed` to keep answers concise.
 
 # Model Details
````
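The mapping bullet in this hunk lists renames the loader applies to Hugging Face weight names. Here is a minimal sketch of those rules as a pure function over key strings. It assumes keys shaped like `model.embed_tokens.weight` and `model.layers.0.mlp.down_proj.weight`.

- Only the three renames spelled out in the bullet are implemented.
- The "layer/attn/norm renames" are not detailed there, so such keys pass through unchanged.
- The `mlp.` rule is matched as `.mlp.` so it cannot hit a longer component name.
- `map_hf_name` and `remap_weights` are hypothetical names.

```python
from typing import Dict

# Renames spelled out in the README bullet; other keys pass through unchanged.
PREFIX_RENAMES = {
    "model.embed_tokens.": "tok_embeddings.",
    "model.norm.": "norm.",
}
# Matched with surrounding dots so only a whole ".mlp." path component changes.
SUBSTRING_RENAMES = {
    ".mlp.": ".feed_forward.",
}


def map_hf_name(name: str) -> str:
    """Map one HF weight name to its MLX module name (partial sketch)."""
    for old, new in PREFIX_RENAMES.items():
        if name.startswith(old):
            return new + name[len(old):]
    for old, new in SUBSTRING_RENAMES.items():
        name = name.replace(old, new)
    return name


def remap_weights(weights: Dict[str, object]) -> Dict[str, object]:
    """Rename every key of a loaded state dict; tensor values are untouched."""
    mapped: Dict[str, object] = {}
    for name, value in weights.items():
        new_name = map_hf_name(name)
        if new_name in mapped:
            # Two source keys collapsing onto one target means the rules are wrong.
            raise ValueError(f"two HF weights map to {new_name!r}")
        mapped[new_name] = value
    return mapped
```

Raising on collisions is a deliberate choice here: a silent overwrite would leave one layer with the wrong weights and only show up later as degraded generations.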
````diff
@@ -436,7 +433,7 @@ Compared to existing fully open-source models, MobileLLM-R1 950M model achieves
 # How to use
 
 To load the pretrained model for further finetuning or evaluation:
+```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("facebook/MobileLLM-R1-950M")
 model = AutoModelForCausalLM.from_pretrained("facebook/MobileLLM-R1-950M")
````
````diff
@@ -467,7 +464,17 @@ Flags in `inference.py`
 
 See also: the “MLX Runtime (Apple silicon) — Added Files & Usage” section above for more examples and notes.
 
+## Inference (MLX-LM)
+
+Two mlx-lm models are also provided: a conversion and a dynamic 4-bit quantization. Code to reproduce them and a handy inference runtime are provided in `custom_mlx_lm/`. After installation, the following examples should work (if they do not, you may first need to copy the model into `mlx_lm/` as `llama4_text.py`).
+
+```bash
+mobilellm-infer --model-path MobileLLM-R1-950M-mixed-4bit-mlx --prompt "What is the nearest prime to 9^2?"
+
+mobilellm-infer --model-path MobileLLM-R1-950M-mlx/ --prompt "What is the nearest prime to 9^2?"
+```
+
+## Transformers
 
 ```py
 from transformers import pipeline
````
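The example prompts above are math questions. The README's math helpers (`--final-only/--stop-at-boxed/--extract-boxed`) exist to keep such answers concise. Here is a minimal sketch of what boxed-answer extraction can look like, assuming the model wraps its final answer in LaTeX `\boxed{...}`. Braces are matched so that nested groups such as `\boxed{\frac{1}{2}}` come through intact. `extract_boxed` is a hypothetical helper, not the runtime's implementation of those flags.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are just inside the opening brace of \boxed{
    for j in range(i, len(text)):
        if text[j] == "{":
            depth += 1
        elif text[j] == "}":
            depth -= 1
            if depth == 0:
                return text[i:j]
    return None  # unbalanced braces: the answer was cut off
```

Taking the last box rather than the first favors the final answer when a reasoning trace boxes intermediate results along the way.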