MiniCPM5-1B LiteRT-LM

This repository contains LiteRT-LM conversions of openbmb/MiniCPM5-1B for local on-device inference.

The source model is a Llama-architecture dense 1B checkpoint intended for on-device and resource-constrained use. These artifacts were exported locally from the Hugging Face safetensors checkpoint with LiteRT Torch and packaged as .litertlm files for the LiteRT-LM runtime.

Files

File	Context cache	Quantization	Backend target	Status
`MiniCPM5-1B.litertlm`	4096	`dynamic_wi8_afp32`	CPU/GPU	Host CPU smoke-tested.
`MiniCPM5-1B-web.litertlm`	2048	`dynamic_wi8_afp32`	CPU/GPU	Host CPU, Android CPU/GPU, and Android emulator seven-turn chat smoke-tested.
`MiniCPM5-1B-qualcomm-sm8750.litertlm`	2048	`dynamic_wi8_afp32`	Qualcomm SM8750 NPU AOT	Generated and loaded by the NPU runtime, but not inference-passing on the stock Redmagic shell because QNN cannot reserve the required PD memory.
`MiniCPM5-1B-qualcomm-sm8750-c1024.litertlm`	1024	`dynamic_wi8_afp32`	Qualcomm SM8750 NPU AOT	Reduced-cache AOT attempt. It also loads, but QNN context creation still fails from the stock shell.
`chat_template.jinja`	n/a	n/a	n/a	Mobile-safe ChatML template. Replaces the source template's Python string methods that fail in Android LiteRT-LM template evaluation.
`conversion_manifest.json`	n/a	n/a	n/a	Toolchain versions, hashes, and conversion details.
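
The hashes recorded in conversion_manifest.json can be compared against a downloaded artifact. The sketch below is a minimal example that assumes the manifest stores SHA-256 hex digests; the algorithm and the manifest's field names are not stated here, so the expected digest is passed in by hand and both helper names are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Stream the file so multi-gigabyte .litertlm files are not
        # loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with an expected hex string.

    ASSUMPTION: expected_hex is a SHA-256 digest copied by hand from
    conversion_manifest.json; the manifest schema is not parsed here.
    """
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_expected("MiniCPM5-1B-web.litertlm", digest_from_manifest)` returns True only when the local file is byte-identical to the hashed artifact, provided the manifest really uses SHA-256.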

The CPU/GPU .litertlm files include a compressed Hugging Face tokenizer, LLM metadata, a quantized prefill/decode TFLite model, and a quantized external embedder. The Qualcomm files replace the prefill/decode model with an SM8750 AOT-compiled NPU model and keep the same external embedder packaging.

The May 31, 2026 refresh rebuilds all .litertlm containers with a minimal mobile-compatible ChatML template embedded in LLM metadata. This fixes Android template failures such as `unknown method: string has no method named strip`.
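
One way to catch this class of failure before packaging is to scan a chat template for Python-style string method calls. The sketch below is a rough lint, not the check used for this repository: the method list is an illustrative assumption, the two template strings are made-up examples rather than the actual MiniCPM5 templates, and `find_python_string_methods` is a hypothetical helper. The second template uses Jinja's built-in `trim` filter, which is the usual filter-based equivalent of `.strip()`.

```python
import re

# ASSUMPTION: an illustrative list of Python string methods that chat
# templates sometimes call directly. Only "strip" is named in the error above.
PYTHON_STRING_METHODS = (
    "strip", "lstrip", "rstrip", "startswith", "endswith", "split",
)

_METHOD_CALL = re.compile(r"\.(" + "|".join(PYTHON_STRING_METHODS) + r")\s*\(")


def find_python_string_methods(template_text):
    """Return (line_number, method_name) for each Python-style method call."""
    hits = []
    for line_number, line in enumerate(template_text.splitlines(), start=1):
        for match in _METHOD_CALL.finditer(line):
            hits.append((line_number, match.group(1)))
    return hits


# Made-up templates for illustration only.
method_style = "{% for m in messages %}\n{{ m['content'].strip() }}\n{% endfor %}"
filter_style = "{% for m in messages %}\n{{ m['content'] | trim }}\n{% endfor %}"

print(find_python_string_methods(method_style))  # [(2, 'strip')]
print(find_python_string_methods(filter_style))  # []
```

A regex scan like this can miss or over-report cases, so it complements rather than replaces running the template on the target runtime.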

Run With LiteRT-LM

Install the LiteRT-LM CLI:

uv tool install litert-lm

Run the generic artifact:

litert-lm run \
  --from-huggingface-repo Tdamre/MiniCPM5-1B-litert-lm \
  MiniCPM5-1B.litertlm \
  --backend=cpu \
  --prompt="What is 2+2?"

Run the lower-cache artifact:

litert-lm run \
  --from-huggingface-repo Tdamre/MiniCPM5-1B-litert-lm \
  MiniCPM5-1B-web.litertlm \
  --backend=cpu \
  --prompt="What is 2+2?"

Android Validation

Test device:

Nubia Redmagic 10 Pro / NX789J
Android 16, API 36
SoC: Qualcomm SM8750
GPU: Adreno 830
ABI: arm64-v8a

Android runner:

LiteRT-LM v0.12.0 source build
Bazel 7.6.1 via Bazelisk
Android NDK r28b
Target: //runtime/engine:litert_lm_main --config=android_arm64

Smoke prompt:

What is 2+2? Answer with just the number.

Results:

Artifact	Backend	Result	Notes
`MiniCPM5-1B-web.litertlm`	Android CPU/XNNPACK	Returned `4`.	Init 1877.13 ms, TTFT 0.38 s, prefill 61.13 tok/s, decode 50.77 tok/s.
`MiniCPM5-1B-web.litertlm`	Android GPU/OpenCL	Returned `4`.	LiteRT GPU loaded, all model subgraphs delegated to `LITERT_CL`. Init 8485.10 ms, TTFT 0.18 s, prefill 143.54 tok/s, decode 44.03 tok/s.
`MiniCPM5-1B-web.litertlm`	Android emulator API 35 CPU/XNNPACK	Returned final `6` for `reggin`.	x86_64 LiteRT-LM Android runner, Pixel-style emulator, repaired embedded template.
`MiniCPM5-1B-web.litertlm`	Android emulator API 35 CPU/XNNPACK	Seven-turn chat completed.	Same `Conversation` instance; final turn confirmed `BLUE-IRIS` and `17`, and the `reggin` turn returned `6`.
`MiniCPM5-1B-qualcomm-sm8750.litertlm`	Qualcomm NPU/Dispatch/QNN	Not inference-passing.	Dispatch and QNN load, V79 HTP skel is selected, context binaries match QNN SDK 2.44.0, then QNN fails context creation with context-size estimate 9260401152 bytes.
`MiniCPM5-1B-qualcomm-sm8750-c1024.litertlm`	Qualcomm NPU/Dispatch/QNN	Not inference-passing.	Reduced-cache variant lowers the context-size estimate to 5909219840 bytes, but QNN still reports no available PD for context creation.
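
To put the CPU and GPU rows in perspective, the measured rates can be plugged into a simple latency estimate. This is a back-of-the-envelope sketch: it assumes the prefill and decode rates from the short smoke prompt stay constant for longer inputs, which was not measured, and the 512-token prompt and 128-token reply are hypothetical workload sizes.

```python
# Rates and init times copied from the CPU/XNNPACK and GPU/OpenCL rows above.
MEASURED = {
    "cpu_xnnpack": {"init_ms": 1877.13, "prefill_tok_s": 61.13, "decode_tok_s": 50.77},
    "gpu_opencl": {"init_ms": 8485.10, "prefill_tok_s": 143.54, "decode_tok_s": 44.03},
}


def estimate_seconds(prompt_tokens, output_tokens, prefill_tok_s, decode_tok_s, init_ms=0.0):
    """Estimate wall time as init + prefill + decode.

    ASSUMPTION: throughput is constant regardless of sequence length.
    """
    return init_ms / 1000.0 + prompt_tokens / prefill_tok_s + output_tokens / decode_tok_s


# Hypothetical workload: 512 prompt tokens, 128 generated tokens.
for name, m in MEASURED.items():
    warm = estimate_seconds(512, 128, m["prefill_tok_s"], m["decode_tok_s"])
    cold = estimate_seconds(512, 128, m["prefill_tok_s"], m["decode_tok_s"], m["init_ms"])
    print(f"{name}: warm {warm:.2f} s, cold {cold:.2f} s")
```

Under these assumptions the GPU path is faster once the engine is initialized, because prefill dominates, while the CPU path finishes a single cold request sooner because its init time is much shorter.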

The Qualcomm AOT artifacts are included for reproducibility and follow-up runtime work. They should not be treated as validated NPU inference artifacts for this stock Redmagic shell environment yet.
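
The two QNN context-size estimates in the results table are easier to compare in GiB. The snippet below only converts the reported byte counts and computes the relative reduction; the helper names are hypothetical, and it says nothing about how much memory the stock shell can actually reserve.

```python
GIB = 2 ** 30

# Context-size estimates reported by QNN for the two AOT artifacts.
CONTEXT_ESTIMATE_BYTES = {
    "cache_2048": 9260401152,
    "cache_1024": 5909219840,
}


def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / GIB


def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return 1.0 - after / before


for name, size in CONTEXT_ESTIMATE_BYTES.items():
    print(f"{name}: {bytes_to_gib(size):.2f} GiB")

saved = relative_reduction(
    CONTEXT_ESTIMATE_BYTES["cache_2048"], CONTEXT_ESTIMATE_BYTES["cache_1024"]
)
print(f"reduction: {saved * 100:.1f}%")
```

Halving the cache length lowers the estimate from about 8.62 GiB to about 5.50 GiB, roughly a 36% reduction rather than 50%, which suggests a sizable share of the estimate does not scale with cache length.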

Conversion Summary

Source model revision:

openbmb/MiniCPM5-1B@4e9de7a0778dc1c362e983e6858f0e77542cbdca

CPU/GPU conversion command pattern:

litert-torch export_hf \
  model-cache/openbmb-MiniCPM5-1B \
  <output_dir> \
  --keep_temporary_files=True \
  --prefill_lengths=128,1024 \
  --cache_length=<2048-or-4096> \
  --externalize_embedder=True \
  --quantization_recipe=dynamic_wi8_afp32
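
The command pattern above can be generated for both cache lengths from one helper. This is a dry-run sketch that only prints the commands: the flags are copied verbatim from the pattern, while `build_export_command` and the `out/cache-<N>` output directories are hypothetical names chosen for illustration.

```python
import shlex


def build_export_command(checkpoint_dir, output_dir, cache_length):
    """Build the litert-torch export_hf argument list for one cache length."""
    return [
        "litert-torch", "export_hf",
        checkpoint_dir,
        output_dir,
        "--keep_temporary_files=True",
        "--prefill_lengths=128,1024",
        f"--cache_length={cache_length}",
        "--externalize_embedder=True",
        "--quantization_recipe=dynamic_wi8_afp32",
    ]


# Print one command per CPU/GPU artifact; nothing is executed here.
for cache_length in (4096, 2048):
    command = build_export_command(
        "model-cache/openbmb-MiniCPM5-1B",
        f"out/cache-{cache_length}",  # hypothetical output directory
        cache_length,
    )
    print(shlex.join(command))
```

Passing the resulting list to `subprocess.run(command, check=True)` would execute the export, assuming litert-torch is installed and on the PATH.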

Qualcomm AOT conversion used ai_edge_litert.aot targeting SocManufacturer.QUALCOMM and SocModel.SM8750, then repackaged the compiled TFLite model into the .litertlm container with backend_constraint=npu.

Toolchain:

Python 3.12.12
litert-torch 0.9.1
litert-lm 0.12.0
ai-edge-litert 2.1.5
ai-edge-litert-sdk-qualcomm 2.1.5
ai-edge-quantizer 0.7.0
torch 2.12.0+cu130
transformers 5.9.0
QNN SDK/runtime 2.44.0 from the Qualcomm LiteRT SDK wheel

Not Included

A MediaPipe .task bundle was not uploaded. The current MediaPipe GenAI bundler documents support for a SentencePiece tokenizer.model, while MiniCPM5-1B ships an HF tokenizer.json and no tokenizer.model, so uploading an unverified .task would be misleading.

Intel NPU AOT was not produced in this local run because the run targeted only the connected Qualcomm SM8750/Adreno phone.
