MobileLLM-350M-EdgeRazor-4bit
Model Overview
- Base Model: facebook/MobileLLM-ParetoQ-350M-BF16
- Training: zhangsq-nju/EdgeRazor
- Quantization: 4-bit for all embedding, decoder, and lm_head layers
Model Bit-Widths
| Mixed-Precision Recipe | Bit-Width | This Repo |
|---|---|---|
| 100% 4-bit + 0% 1.58-bit | 4 | ✔️ |
| 50% 4-bit + 50% 1.58-bit | 2.79 | |
| 12.5% 4-bit + 87.5% 1.58-bit | 1.88 | |
| 0% 4-bit + 100% 1.58-bit | 1.58 | |
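The Bit-Width column is consistent with a fraction-weighted average of the two precisions. The sketch below reproduces the four values under that assumption; the helper name `average_bit_width` is hypothetical, and the table itself does not state how the figures were derived.

```python
def average_bit_width(frac_4bit: float, high_bits: float = 4.0, low_bits: float = 1.58) -> float:
    """Weighted mean bit-width for a mix of 4-bit and 1.58-bit weights.

    Assumption: the percentages in the recipe are shares of the weights,
    so the effective bit-width is a simple weighted average.
    """
    if not 0.0 <= frac_4bit <= 1.0:
        raise ValueError("frac_4bit must lie in [0, 1]")
    return frac_4bit * high_bits + (1.0 - frac_4bit) * low_bits


# Fractions of 4-bit weights for the four recipes in the table.
for frac in (1.0, 0.5, 0.125, 0.0):
    print(f"{frac:.1%} 4-bit -> {average_bit_width(frac):.2f} bits")
```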
Model Performance
| Model | W-A-KV (bits) | ARC-e | ARC-c | HellaSwag | BoolQ | PIQA | WinoGrande | SIQA | OBQA | Tr.QA2 | Ethics | MMLU | GSM8K | HumanEval | Average (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MobileLLM-350M | 16-16-16 | 64.94 | 35.49 | 52.87 | 58.96 | 70.84 | 56.35 | 40.79 | 40.20 | 37.44 | 53.98 | 23.52 | 0.00 | 0.00 | 41.18 |
| EdgeRazor | 4-16-16 | 69.19 | 36.26 | 51.91 | 62.26 | 70.40 | 56.20 | 40.74 | 37.40 | 37.96 | 57.41 | 25.00 | 0.53 | 0.00 | 41.94 |
| EdgeRazor | 2.79-16-16 | 65.87 | 32.68 | 45.98 | 61.71 | 68.82 | 56.27 | 40.02 | 35.00 | 38.97 | 56.53 | 24.27 | 0.76 | 0.00 | 40.53 |
| EdgeRazor | 1.88-16-16 | 61.20 | 28.75 | 40.76 | 58.23 | 66.59 | 55.01 | 39.51 | 33.00 | 40.98 | 56.22 | 25.03 | 0.53 | 0.00 | 38.91 |
| EdgeRazor | 1.58-16-16 | 58.63 | 26.19 | 38.95 | 58.07 | 65.29 | 53.04 | 39.30 | 32.20 | 41.97 | 56.26 | 24.12 | 0.53 | 0.00 | 38.04 |
| EdgeRazor | 4-8-8 | 69.11 | 35.84 | 51.82 | 62.60 | 70.35 | 56.20 | 40.58 | 37.40 | 37.90 | 57.21 | 24.66 | 0.45 | 0.00 | 41.86 |
| EdgeRazor | 2.79-8-8 | 65.99 | 32.68 | 45.99 | 62.11 | 68.55 | 56.51 | 40.07 | 35.20 | 39.05 | 56.51 | 24.41 | 0.99 | 0.00 | 40.62 |
| EdgeRazor | 1.88-8-8 | 61.36 | 29.18 | 40.86 | 58.23 | 66.92 | 55.49 | 39.56 | 33.20 | 40.95 | 56.13 | 24.97 | 0.38 | 0.00 | 39.02 |
| EdgeRazor | 1.58-8-8 | 58.67 | 26.19 | 38.92 | 58.04 | 65.23 | 53.83 | 39.25 | 32.00 | 42.03 | 56.33 | 24.19 | 0.83 | 0.00 | 38.12 |
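The Average column matches the unweighted mean of the 13 benchmark scores in each row. A short check on the first two rows, with the values copied from the table above (the dictionary keys are only labels chosen for this sketch):

```python
from statistics import mean

# Benchmark scores in table order: ARC-e ... HumanEval (13 values per row).
scores = {
    "MobileLLM-350M (16-16-16)": [
        64.94, 35.49, 52.87, 58.96, 70.84, 56.35, 40.79,
        40.20, 37.44, 53.98, 23.52, 0.00, 0.00,
    ],
    "EdgeRazor (4-16-16)": [
        69.19, 36.26, 51.91, 62.26, 70.40, 56.20, 40.74,
        37.40, 37.96, 57.41, 25.00, 0.53, 0.00,
    ],
}

for name, row in scores.items():
    # Unweighted mean over all benchmarks, rounded as in the table.
    print(f"{name}: {mean(row):.2f}")
```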
Quickstart
Installing EdgeRazor in advance is recommended for weight-activation quantization. The provided weights are already quantized (`quantized_weights*scaling_bf16`); to enable activation and KV cache quantization, pass `trust_remote_code=True` when loading the model.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the slow (non-fast) tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
    "zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit",
    use_fast=False,
)

# trust_remote_code=True lets the repository's custom model code run,
# which enables activation and KV cache quantization.
model = AutoModelForCausalLM.from_pretrained(
    "zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit",
    trust_remote_code=True,
)
```
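The expression `quantized_weights*scaling_bf16` in the note above can be read as integer weight codes multiplied by a scale. As a rough illustration of that product, the sketch below applies generic symmetric 4-bit quantization to a few values and then rescales the codes. The function names, the single per-tensor scale, and the signed range [-8, 7] are assumptions made for this example; EdgeRazor's actual quantizer may differ.

```python
def quantize_4bit(weights):
    """Map floats to signed 4-bit integer codes with one shared scale (illustrative only)."""
    qmin, qmax = -8, 7
    max_abs = max(abs(w) for w in weights)
    # Avoid division by zero for an all-zero tensor.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights as code * scale."""
    return [c * scale for c in codes]


weights = [0.70, -0.30, 0.12, 0.0]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

print(codes)     # integer codes in [-8, 7]
print(restored)  # approximations of the original weights
```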
Note that the default tokenizer does not contain special tokens. For example, you can add them as follows:
```python
tokenizer.add_special_tokens(
    {
        "eos_token": "</s>",
        "bos_token": "<s>",
        "unk_token": "<unk>",
    }
)
```
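Putting the two Quickstart steps together, a minimal generation sketch might look like the following. The helper `generate`, the plain-text prompt, and `max_new_tokens=40` are choices made for this example, and the repository ID is taken from this page's model name; running it requires `transformers`, PyTorch, and network access to download the weights.

```python
REPO_ID = "zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit"  # assumed from this page's model name
SPECIAL_TOKENS = {"eos_token": "</s>", "bos_token": "<s>", "unk_token": "<unk>"}


def generate(prompt: str, max_new_tokens: int = 40) -> str:
    """Load the model and tokenizer, then return the continuation of `prompt`."""
    # Imported here so that defining the helper does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(REPO_ID, use_fast=False)
    tokenizer.add_special_tokens(SPECIAL_TOKENS)
    model = AutoModelForCausalLM.from_pretrained(REPO_ID, trust_remote_code=True)

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens and decode only the newly generated ones.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Example call (downloads the model on first use):
# print(generate("The capital of France is"))
```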
Citation
If you find our project useful in your research, please consider citing our paper ✏️:
```bibtex
@article{zhangsh-edgerazor,
  title={{EdgeRazor}: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation},
  author={Shu-Hao Zhang and Le-Tong Huang and Xiang-Sheng Deng and Xin-Yi Zou and Chen Wu and Nan Li and Shao-Qun Zhang},
  year={2026},
  journal={arXiv preprint arXiv:2605.04062}
}
```