Instructions to use lyafence/MiniCPM5-1B-SFT-litertlm with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- LiteRT-LM
How to use lyafence/MiniCPM5-1B-SFT-litertlm with LiteRT-LM:
# No code snippets available yet for this library. # To use this model, check the repository files and the library's documentation. # Want to help? PRs adding snippets are welcome at: # https://github.com/huggingface/huggingface.js
- Notebooks
- Google Colab
- Kaggle
MiniCPM5-1B-SFT β LiteRT-LM
This repository contains openbmb/MiniCPM5-1B-SFT exported to LiteRT-LM .litertlm format for on-device inference.
MiniCPM5-1B is a dense 1B Transformer built for on-device, local deployment, and resource-constrained scenarios. It uses standard LlamaForCausalLM architecture with 24 layers, 16 Q / 2 KV attention heads, and 131072 context length.
This is the SFT-only checkpoint (before RL/OPD fine-tuning). For the final release model, see
openbmb/MiniCPM5-1B.
Variants
| File | Mode | Description |
|---|---|---|
MiniCPM5-1B-SFT.litertlm |
No-Think (default) | Fast responses without reasoning. Template adds empty <think>\n\n</think>\n\n block to suppress thinking. |
MiniCPM5-1B-SFT-think.litertlm |
Think | Hybrid reasoning with <think> blocks. Let the model deliberate before answering. |
When to use which
- No-Think: Quick chat, simple Q&A, tool calls. Use when latency matters.
- Think: Complex reasoning, math, code, multi-turn conversations. Better quality at the cost of longer generation.
Usage
Install
pip install litert-lm
Run (No-Think)
litert-lm run \
--from-huggingface-repo=lyafence/MiniCPM5-1B-SFT-litertlm \
MiniCPM5-1B-SFT.litertlm \
--temperature 0.7 --top-k 40
Run (Think)
litert-lm run \
--from-huggingface-repo=lyafence/MiniCPM5-1B-SFT-litertlm \
MiniCPM5-1B-SFT-think.litertlm \
--temperature 0.9 --top-k 40
Recommended parameters
| Mode | Temperature | Top-K | Top-P |
|---|---|---|---|
| No-Think | 0.7 | 40 | 0.95 |
| Think | 0.9 | 40 | 0.95 |
--top-k 40improves output diversity and reduces repetitive patterns. The model defaults (temperature=1.0, top-p=0.95) work but may cause repetition on multi-turn conversations.
Interactive mode
Run without --prompt to enter interactive chat:
litert-lm run \
--from-huggingface-repo=lyafence/MiniCPM5-1B-SFT-litertlm \
MiniCPM5-1B-SFT-think.litertlm \
--temperature 0.9 --top-k 40
Single prompt
litert-lm run \
--from-huggingface-repo=lyafence/MiniCPM5-1B-SFT-litertlm \
MiniCPM5-1B-SFT.litertlm \
--temperature 0.7 --top-k 40 \
--prompt "What is the capital of France?"
Model details
| Property | Value |
|---|---|
| Architecture | LlamaForCausalLM |
| Parameters | 1,080,632,832 (1B) |
| Non-embedding params | 679,552,512 |
| Layers | 24 |
| Attention heads (Q) | 16 (GQA) |
| Attention heads (KV) | 2 |
| Hidden size | 1536 |
| Context length | 131,072 |
| Precision | BF16 (int8 dynamic quantization via LiteRT) |
| Vocabulary size | 130,560 |
| Added tokens | 488 (im_start, im_end, thought tags, tool tokens, unused) |
| Tokenizer | HF_Tokenizer (SentencePiece with HuggingFace Fast tokenizer extras) |
| File size | ~1047 MB |
| License | Apache 2.0 |
Re-converting
The conversion script is included in scripts/convert_to_litert.py:
uv run scripts/convert_to_litert.py openbmb/MiniCPM5-1B-SFT
Flags:
--allow-thinkingβ keep the model's native thinking template--force-no-thinkβ suppress thinking even if the model supports it--force-hf-tokenizerβ force HuggingFace Fast tokenizer--quantization-recipeβ set quantization (default:dynamic_wi8_afp32)
Limitations
- SFT-only checkpoint: This is the pre-RL version. Quality may be lower than the final
MiniCPM5-1Brelease. - No-think multi-turn: Without thinking, the model may produce repetitive responses on follow-up turns. Use the Think variant for better multi-turn quality.
- Tool calling: The Think variant supports XML-style tool calls. Use SGLang backend for production tool-use workflows (see HF model card).
References
- Downloads last month
- -
Model tree for lyafence/MiniCPM5-1B-SFT-litertlm
Base model
openbmb/MiniCPM5-1B-SFT