MiniCPM5-1B-SFT β€” LiteRT-LM

This repository contains openbmb/MiniCPM5-1B-SFT exported to LiteRT-LM .litertlm format for on-device inference.

MiniCPM5-1B is a dense 1B Transformer built for on-device, local deployment, and resource-constrained scenarios. It uses standard LlamaForCausalLM architecture with 24 layers, 16 Q / 2 KV attention heads, and 131072 context length.

This is the SFT-only checkpoint (before RL/OPD fine-tuning). For the final release model, see openbmb/MiniCPM5-1B.

Variants

File Mode Description
MiniCPM5-1B-SFT.litertlm No-Think (default) Fast responses without reasoning. Template adds empty <think>\n\n</think>\n\n block to suppress thinking.
MiniCPM5-1B-SFT-think.litertlm Think Hybrid reasoning with <think> blocks. Let the model deliberate before answering.

When to use which

  • No-Think: Quick chat, simple Q&A, tool calls. Use when latency matters.
  • Think: Complex reasoning, math, code, multi-turn conversations. Better quality at the cost of longer generation.

Usage

Install

pip install litert-lm

Run (No-Think)

litert-lm run \
  --from-huggingface-repo=lyafence/MiniCPM5-1B-SFT-litertlm \
  MiniCPM5-1B-SFT.litertlm \
  --temperature 0.7 --top-k 40

Run (Think)

litert-lm run \
  --from-huggingface-repo=lyafence/MiniCPM5-1B-SFT-litertlm \
  MiniCPM5-1B-SFT-think.litertlm \
  --temperature 0.9 --top-k 40

Recommended parameters

Mode Temperature Top-K Top-P
No-Think 0.7 40 0.95
Think 0.9 40 0.95

--top-k 40 improves output diversity and reduces repetitive patterns. The model defaults (temperature=1.0, top-p=0.95) work but may cause repetition on multi-turn conversations.

Interactive mode

Run without --prompt to enter interactive chat:

litert-lm run \
  --from-huggingface-repo=lyafence/MiniCPM5-1B-SFT-litertlm \
  MiniCPM5-1B-SFT-think.litertlm \
  --temperature 0.9 --top-k 40

Single prompt

litert-lm run \
  --from-huggingface-repo=lyafence/MiniCPM5-1B-SFT-litertlm \
  MiniCPM5-1B-SFT.litertlm \
  --temperature 0.7 --top-k 40 \
  --prompt "What is the capital of France?"

Model details

Property Value
Architecture LlamaForCausalLM
Parameters 1,080,632,832 (1B)
Non-embedding params 679,552,512
Layers 24
Attention heads (Q) 16 (GQA)
Attention heads (KV) 2
Hidden size 1536
Context length 131,072
Precision BF16 (int8 dynamic quantization via LiteRT)
Vocabulary size 130,560
Added tokens 488 (im_start, im_end, thought tags, tool tokens, unused)
Tokenizer HF_Tokenizer (SentencePiece with HuggingFace Fast tokenizer extras)
File size ~1047 MB
License Apache 2.0

Re-converting

The conversion script is included in scripts/convert_to_litert.py:

uv run scripts/convert_to_litert.py openbmb/MiniCPM5-1B-SFT

Flags:

  • --allow-thinking β€” keep the model's native thinking template
  • --force-no-think β€” suppress thinking even if the model supports it
  • --force-hf-tokenizer β€” force HuggingFace Fast tokenizer
  • --quantization-recipe β€” set quantization (default: dynamic_wi8_afp32)

Limitations

  • SFT-only checkpoint: This is the pre-RL version. Quality may be lower than the final MiniCPM5-1B release.
  • No-think multi-turn: Without thinking, the model may produce repetitive responses on follow-up turns. Use the Think variant for better multi-turn quality.
  • Tool calling: The Think variant supports XML-style tool calls. Use SGLang backend for production tool-use workflows (see HF model card).

References

Downloads last month
-
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for lyafence/MiniCPM5-1B-SFT-litertlm

Finetuned
(1)
this model