MiniCPM5-1B-SFT — LiteRT-LM

This repository contains openbmb/MiniCPM5-1B-SFT exported to LiteRT-LM .litertlm format for on-device inference.

MiniCPM5-1B is a dense 1B Transformer built for on-device, local deployment, and resource-constrained scenarios. It uses standard LlamaForCausalLM architecture with 24 layers, 16 Q / 2 KV attention heads, and 131072 context length.

This is the SFT-only checkpoint (before RL/OPD fine-tuning). For the final release model, see openbmb/MiniCPM5-1B.

Variants

File	Mode	Description
`MiniCPM5-1B-SFT.litertlm`	No-Think (default)	Fast responses without reasoning. Template adds empty `<think>\n\n</think>\n\n` block to suppress thinking.
`MiniCPM5-1B-SFT-think.litertlm`	Think	Hybrid reasoning with `<think>` blocks. Let the model deliberate before answering.

When to use which

No-Think: Quick chat, simple Q&A, tool calls. Use when latency matters.
Think: Complex reasoning, math, code, multi-turn conversations. Better quality at the cost of longer generation.

Usage

Install

pip install litert-lm

Run (No-Think)

litert-lm run \
  --from-huggingface-repo=lyafence/MiniCPM5-1B-SFT-litertlm \
  MiniCPM5-1B-SFT.litertlm \
  --temperature 0.7 --top-k 40

Run (Think)

litert-lm run \
  --from-huggingface-repo=lyafence/MiniCPM5-1B-SFT-litertlm \
  MiniCPM5-1B-SFT-think.litertlm \
  --temperature 0.9 --top-k 40

Recommended parameters

Mode	Temperature	Top-K	Top-P
No-Think	0.7	40	0.95
Think	0.9	40	0.95

--top-k 40 improves output diversity and reduces repetitive patterns. The model defaults (temperature=1.0, top-p=0.95) work but may cause repetition on multi-turn conversations.

Interactive mode

Run without --prompt to enter interactive chat:

litert-lm run \
  --from-huggingface-repo=lyafence/MiniCPM5-1B-SFT-litertlm \
  MiniCPM5-1B-SFT-think.litertlm \
  --temperature 0.9 --top-k 40

Single prompt

litert-lm run \
  --from-huggingface-repo=lyafence/MiniCPM5-1B-SFT-litertlm \
  MiniCPM5-1B-SFT.litertlm \
  --temperature 0.7 --top-k 40 \
  --prompt "What is the capital of France?"

Model details

Property	Value
Architecture	LlamaForCausalLM
Parameters	1,080,632,832 (1B)
Non-embedding params	679,552,512
Layers	24
Attention heads (Q)	16 (GQA)
Attention heads (KV)	2
Hidden size	1536
Context length	131,072
Precision	BF16 (int8 dynamic quantization via LiteRT)
Vocabulary size	130,560
Added tokens	488 (im_start, im_end, thought tags, tool tokens, unused)
Tokenizer	HF_Tokenizer (SentencePiece with HuggingFace Fast tokenizer extras)
File size	~1047 MB
License	Apache 2.0

Re-converting

The conversion script is included in scripts/convert_to_litert.py:

uv run scripts/convert_to_litert.py openbmb/MiniCPM5-1B-SFT

Flags:

--allow-thinking — keep the model's native thinking template
--force-no-think — suppress thinking even if the model supports it
--force-hf-tokenizer — force HuggingFace Fast tokenizer
--quantization-recipe — set quantization (default: dynamic_wi8_afp32)

Limitations

SFT-only checkpoint: This is the pre-RL version. Quality may be lower than the final MiniCPM5-1B release.
No-think multi-turn: Without thinking, the model may produce repetitive responses on follow-up turns. Use the Think variant for better multi-turn quality.
Tool calling: The Think variant supports XML-style tool calls. Use SGLang backend for production tool-use workflows (see HF model card).

References

Downloads last month: -

Model tree for lyafence/MiniCPM5-1B-SFT-litertlm

Base model

openbmb/MiniCPM5-1B-SFT

Finetuned

(1)

this model