Instructions to use Tdamre/MiniCPM5-1B-litert-lm with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- LiteRT-LM
How to use Tdamre/MiniCPM5-1B-litert-lm with LiteRT-LM:
# No code snippets available yet for this library. # To use this model, check the repository files and the library's documentation. # Want to help? PRs adding snippets are welcome at: # https://github.com/huggingface/huggingface.js
- Notebooks
- Google Colab
- Kaggle
MiniCPM5-1B LiteRT-LM
This repository contains LiteRT-LM conversions of openbmb/MiniCPM5-1B for local on-device inference.
The source model is a Llama-architecture dense 1B checkpoint intended for
on-device and resource-constrained use. These artifacts were exported locally
from the Hugging Face safetensors checkpoint with LiteRT Torch and packaged as
.litertlm files for the LiteRT-LM runtime.
Files
| File | Context cache | Quantization | Backend target | Status |
|---|---|---|---|---|
MiniCPM5-1B.litertlm |
4096 | dynamic_wi8_afp32 |
CPU/GPU | Host CPU smoke-tested. |
MiniCPM5-1B-web.litertlm |
2048 | dynamic_wi8_afp32 |
CPU/GPU | Host CPU, Android CPU/GPU, and Android emulator seven-turn chat smoke-tested. |
MiniCPM5-1B-qualcomm-sm8750.litertlm |
2048 | dynamic_wi8_afp32 |
Qualcomm SM8750 NPU AOT | Generated and loaded by the NPU runtime, but not inference-passing on the stock Redmagic shell because QNN cannot reserve the required PD memory. |
MiniCPM5-1B-qualcomm-sm8750-c1024.litertlm |
1024 | dynamic_wi8_afp32 |
Qualcomm SM8750 NPU AOT | Reduced-cache AOT attempt. Also loads, but still fails QNN context creation from stock shell. |
chat_template.jinja |
n/a | n/a | n/a | Mobile-safe ChatML template. Replaces the source template's Python string methods that fail in Android LiteRT-LM template evaluation. |
conversion_manifest.json |
n/a | n/a | n/a | Toolchain versions, hashes, and conversion details. |
The CPU/GPU .litertlm files include a compressed Hugging Face tokenizer, LLM
metadata, a quantized prefill/decode TFLite model, and a quantized external
embedder. The Qualcomm files replace the prefill/decode model with an SM8750
AOT-compiled NPU model and keep the same external embedder packaging.
The May 31, 2026 refresh rebuilds all .litertlm containers with a minimal
mobile-compatible ChatML template embedded in LLM metadata. This fixes Android
template failures such as unknown method: string has no method named strip.
Run With LiteRT-LM
Install the LiteRT-LM CLI:
uv tool install litert-lm
Run the generic artifact:
litert-lm run \
--from-huggingface-repo Tdamre/MiniCPM5-1B-litert-lm \
MiniCPM5-1B.litertlm \
--backend=cpu \
--prompt="What is 2+2?"
Run the lower-cache artifact:
litert-lm run \
--from-huggingface-repo Tdamre/MiniCPM5-1B-litert-lm \
MiniCPM5-1B-web.litertlm \
--backend=cpu \
--prompt="What is 2+2?"
Android Validation
Test device:
Nubia Redmagic 10 Pro / NX789J
Android 16, API 36
SoC: Qualcomm SM8750
GPU: Adreno 830
ABI: arm64-v8a
Android runner:
LiteRT-LM v0.12.0 source build
Bazel 7.6.1 via Bazelisk
Android NDK r28b
Target: //runtime/engine:litert_lm_main --config=android_arm64
Smoke prompt:
What is 2+2? Answer with just the number.
Results:
| Artifact | Backend | Result | Notes |
|---|---|---|---|
MiniCPM5-1B-web.litertlm |
Android CPU/XNNPACK | Returned 4. |
Init 1877.13 ms, TTFT 0.38 s, prefill 61.13 tok/s, decode 50.77 tok/s. |
MiniCPM5-1B-web.litertlm |
Android GPU/OpenCL | Returned 4. |
LiteRT GPU loaded, all model subgraphs delegated to LITERT_CL. Init 8485.10 ms, TTFT 0.18 s, prefill 143.54 tok/s, decode 44.03 tok/s. |
MiniCPM5-1B-web.litertlm |
Android emulator API 35 CPU/XNNPACK | Returned final 6 for reggin. |
x86_64 LiteRT-LM Android runner, Pixel-style emulator, repaired embedded template. |
MiniCPM5-1B-web.litertlm |
Android emulator API 35 CPU/XNNPACK | Seven-turn chat completed. | Same Conversation instance; final turn confirmed BLUE-IRIS and 17, and the reggin turn returned 6. |
MiniCPM5-1B-qualcomm-sm8750.litertlm |
Qualcomm NPU/Dispatch/QNN | Not inference-passing. | Dispatch and QNN load, V79 HTP skel is selected, context binaries match QNN SDK 2.44.0, then QNN fails context creation with context-size estimate 9260401152 bytes. |
MiniCPM5-1B-qualcomm-sm8750-c1024.litertlm |
Qualcomm NPU/Dispatch/QNN | Not inference-passing. | Reduced-cache variant lowers the context-size estimate to 5909219840 bytes, but QNN still reports no available PD for context creation. |
The Qualcomm AOT artifacts are included for reproducibility and follow-up runtime work. They should not be treated as validated NPU inference artifacts for this stock Redmagic shell environment yet.
Conversion Summary
Source model revision:
openbmb/MiniCPM5-1B@4e9de7a0778dc1c362e983e6858f0e77542cbdca
CPU/GPU conversion command pattern:
litert-torch export_hf \
model-cache/openbmb-MiniCPM5-1B \
<output_dir> \
--keep_temporary_files=True \
--prefill_lengths=128,1024 \
--cache_length=<2048-or-4096> \
--externalize_embedder=True \
--quantization_recipe=dynamic_wi8_afp32
Qualcomm AOT conversion used ai_edge_litert.aot targeting
SocManufacturer.QUALCOMM and SocModel.SM8750, then repackaged the compiled
TFLite model into the .litertlm container with backend_constraint=npu.
Toolchain:
Python 3.12.12
litert-torch 0.9.1
litert-lm 0.12.0
ai-edge-litert 2.1.5
ai-edge-litert-sdk-qualcomm 2.1.5
ai-edge-quantizer 0.7.0
torch 2.12.0+cu130
transformers 5.9.0
QNN SDK/runtime 2.44.0 from the Qualcomm LiteRT SDK wheel
Not Included
MediaPipe .task was not uploaded. The current MediaPipe GenAI bundler
documents SentencePiece tokenizer.model support, while MiniCPM5-1B ships an HF
tokenizer.json and no tokenizer.model; uploading an unverified .task would
be misleading.
Intel NPU AOT was not produced in this local run because this request targeted the connected Qualcomm SM8750/Adreno phone.
- Downloads last month
- 30
Model tree for Tdamre/MiniCPM5-1B-litert-lm
Base model
openbmb/MiniCPM5-1B