Instructions to use zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
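
The curl call above can also be made from Python. The following is a minimal sketch using only the standard library; it assumes the server exposes the same OpenAI-compatible `/v1/chat/completions` endpoint shown in the curl example, and the helper names `build_chat_request` and `chat` are illustrative, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit"


def build_chat_request(prompt: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the same POST request as the curl example, without sending it."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, base_url: str = "http://localhost:8000", timeout: float = 60.0) -> str:
    """Send the request to a running server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url), timeout=timeout) as resp:
        body = json.load(resp)
    # OpenAI-style chat responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With a server already running, `chat("What is the capital of France?")` should return the reply text. Passing `base_url="http://localhost:30000"` targets the port used in the SGLang commands instead.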

Use Docker

docker model run hf.co/zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit

SGLang

How to use zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit with Docker Model Runner:
```
docker model run hf.co/zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit
```

EdgeRazor for Lightweight LLMs

MobileLLM-350M-EdgeRazor-4bit

Contents
Model Overview
Model Bit-Widths
Model Performance
Quickstart
Citation

Model Overview

Base Model: facebook/MobileLLM-ParetoQ-350M-BF16
Training: zhangsq-nju/EdgeRazor
Quantization: 4-bit for all embedding, decoder, and lm_head layers

Model Bit-Widths

Mixed-Precision Recipe	Bit-Width	This Repo
100% 4-bit + 0% 1.58-bit	4	✔️
50% 4-bit + 50% 1.58-bit	2.79
12.5% 4-bit + 87.5% 1.58-bit	1.88
0% 4-bit + 100% 1.58-bit	1.58
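
The bit-widths listed above are consistent with a simple weighted average of the two precisions. A minimal sketch of that arithmetic, assuming the percentages are fractions of the quantized weights (the helper name `average_bit_width` is illustrative and not part of EdgeRazor):

```python
def average_bit_width(frac_4bit: float, high_bits: float = 4.0, low_bits: float = 1.58) -> float:
    """Weighted average bit-width for a mix of 4-bit and 1.58-bit weights."""
    if not 0.0 <= frac_4bit <= 1.0:
        raise ValueError("frac_4bit must be between 0 and 1")
    return frac_4bit * high_bits + (1.0 - frac_4bit) * low_bits


# The four recipes from the table, expressed as the fraction kept at 4-bit.
for frac in (1.0, 0.5, 0.125, 0.0):
    print(f"{frac:.1%} 4-bit -> {average_bit_width(frac):.2f} bits")
```

For example, the 50/50 recipe gives 0.5 × 4 + 0.5 × 1.58 = 2.79 bits, matching the table.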

Model Performance

Models	W-A-KV	ARC-e	ARC-c	HellaS.	BoolQ	PIQA	WinoG.	SIQA	OBQA	Tr.QA2	Ethics	MMLU	GSM8K	Average (↑)
MobileLLM-350M	16-16-16	64.94	35.49	52.87	58.96	70.84	56.35	40.79	40.20	37.44	53.98	23.52	0.00	41.18
EdgeRazor	4-16-16	69.19	36.26	51.91	62.26	70.40	56.20	40.74	37.40	37.96	57.41	25.00	0.53	41.94
EdgeRazor	2.79-16-16	65.87	32.68	45.98	61.71	68.82	56.27	40.02	35.00	38.97	56.53	24.27	0.76	40.53
EdgeRazor	1.88-16-16	61.20	28.75	40.76	58.23	66.59	55.01	39.51	33.00	40.98	56.22	25.03	0.53	38.91
EdgeRazor	1.58-16-16	58.63	26.19	38.95	58.07	65.29	53.04	39.30	32.20	41.97	56.26	24.12	0.53	38.04
EdgeRazor	4-8-8	69.11	35.84	51.82	62.60	70.35	56.20	40.58	37.40	37.90	57.21	24.66	0.45	41.86
EdgeRazor	2.79-8-8	65.99	32.68	45.99	62.11	68.55	56.51	40.07	35.20	39.05	56.51	24.41	0.99	40.62
EdgeRazor	1.88-8-8	61.36	29.18	40.86	58.23	66.92	55.49	39.56	33.20	40.95	56.13	24.97	0.38	39.02
EdgeRazor	1.58-8-8	58.67	26.19	38.92	58.04	65.23	53.83	39.25	32.00	42.03	56.33	24.19	0.83	38.12
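
To read the table at a glance, one can compute each configuration's change in the Average column relative to the 16-16-16 baseline. A small sketch using only the averages listed above (the helper name `delta_vs_baseline` is illustrative):

```python
# Average scores copied from the table above, keyed by W-A-KV setting.
BASELINE = 41.18  # MobileLLM-350M, 16-16-16
EDGERAZOR_AVERAGES = {
    "4-16-16": 41.94,
    "2.79-16-16": 40.53,
    "1.88-16-16": 38.91,
    "1.58-16-16": 38.04,
    "4-8-8": 41.86,
    "2.79-8-8": 40.62,
    "1.88-8-8": 39.02,
    "1.58-8-8": 38.12,
}


def delta_vs_baseline(average: float, baseline: float = BASELINE) -> float:
    """Difference in average score, in points, rounded to two decimals."""
    return round(average - baseline, 2)


for config, avg in EDGERAZOR_AVERAGES.items():
    print(f"{config:>11}: {delta_vs_baseline(avg):+.2f}")
```

On these numbers, the 4-bit variants sit slightly above the baseline average, while the 1.58-bit variants are roughly three points below it.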

Quickstart

Install EdgeRazor beforehand if you need weight-activation quantization. The provided weights are already quantized (quantized_weights*scaling_bf16); to enable activation and KV cache quantization, pass trust_remote_code=True when loading the model, as in the snippet below.

from transformers import AutoModelForCausalLM, AutoTokenizer

# use_fast=False requests the slow (Python) tokenizer implementation.
tokenizer = AutoTokenizer.from_pretrained(
    "zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit",
    use_fast=False
)
# trust_remote_code=True allows Transformers to run the custom code shipped in the repo.
model = AutoModelForCausalLM.from_pretrained(
    "zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit",
    trust_remote_code=True
)

Note that the default tokenizer does not define special tokens. You can add them, for example:

tokenizer.add_special_tokens(
    {
        "eos_token": "</s>",
        "bos_token": "<s>",
        "unk_token": "<unk>",
    }
)
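
The Quickstart describes the released weights as quantized_weights*scaling_bf16. As a rough illustration of that idea, here is a minimal pure-Python sketch of symmetric per-tensor 4-bit quantization followed by rescaling. The scheme (signed integer range, per-tensor scale, round-to-nearest) and the function name are assumptions for illustration only; EdgeRazor's actual quantizer may differ, and BF16 rounding of the scale is not modeled.

```python
def quantize_dequantize_4bit(weights: list[float]) -> tuple[list[int], float, list[float]]:
    """Symmetric per-tensor 4-bit quantization (assumed scheme, for illustration)."""
    qmin, qmax = -8, 7  # signed 4-bit integer range
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to qmax; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    # Stored values are integer levels multiplied by the scale.
    dequantized = [q * scale for q in quantized]
    return quantized, scale, dequantized


levels, scale, restored = quantize_dequantize_4bit([1.4, -0.6, 0.2, 0.0])
print(levels, scale, restored)
```

Each restored value differs from the original by at most half a quantization step (scale / 2) in this scheme.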

Citation

If you find our project useful in your research, please consider citing our paper ✏️:

@article{zhangsh-edgerazor,
  title={{EdgeRazor}: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation},
  author={Shu-Hao Zhang and Le-Tong Huang and Xiang-Sheng Deng and Xin-Yi Zou and Chen Wu and Nan Li and Shao-Qun Zhang},
  year={2026},
  journal={arXiv preprint arXiv:2605.04062}
}

Safetensors
Model size: 0.4B params
Tensor type: BF16

Model tree for zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit

Base model

facebook/MobileLLM-ParetoQ-350M-BF16

Collection including zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit

EdgeRazor-Nbit

Paper for zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

Paper • 2605.04062 • Published Apr 10 • 30