MiniCPM5-1B LiteRT-LM

This repository contains LiteRT-LM conversions of openbmb/MiniCPM5-1B for local on-device inference.

The source model is a Llama-architecture dense 1B checkpoint intended for on-device and resource-constrained use. These artifacts were exported locally from the Hugging Face safetensors checkpoint with LiteRT Torch and packaged as .litertlm files for the LiteRT-LM runtime.

Files

File Context cache Quantization Backend target Status
MiniCPM5-1B.litertlm 4096 dynamic_wi8_afp32 CPU/GPU Host CPU smoke-tested.
MiniCPM5-1B-web.litertlm 2048 dynamic_wi8_afp32 CPU/GPU Host CPU, Android CPU/GPU, and Android emulator seven-turn chat smoke-tested.
MiniCPM5-1B-qualcomm-sm8750.litertlm 2048 dynamic_wi8_afp32 Qualcomm SM8750 NPU AOT Generated and loaded by the NPU runtime, but not inference-passing on the stock Redmagic shell because QNN cannot reserve the required PD memory.
MiniCPM5-1B-qualcomm-sm8750-c1024.litertlm 1024 dynamic_wi8_afp32 Qualcomm SM8750 NPU AOT Reduced-cache AOT attempt. Also loads, but still fails QNN context creation from stock shell.
chat_template.jinja n/a n/a n/a Mobile-safe ChatML template. Replaces the source template's Python string methods that fail in Android LiteRT-LM template evaluation.
conversion_manifest.json n/a n/a n/a Toolchain versions, hashes, and conversion details.

The CPU/GPU .litertlm files include a compressed Hugging Face tokenizer, LLM metadata, a quantized prefill/decode TFLite model, and a quantized external embedder. The Qualcomm files replace the prefill/decode model with an SM8750 AOT-compiled NPU model and keep the same external embedder packaging.

The May 31, 2026 refresh rebuilds all .litertlm containers with a minimal mobile-compatible ChatML template embedded in LLM metadata. This fixes Android template failures such as unknown method: string has no method named strip.

Run With LiteRT-LM

Install the LiteRT-LM CLI:

uv tool install litert-lm

Run the generic artifact:

litert-lm run \
  --from-huggingface-repo Tdamre/MiniCPM5-1B-litert-lm \
  MiniCPM5-1B.litertlm \
  --backend=cpu \
  --prompt="What is 2+2?"

Run the lower-cache artifact:

litert-lm run \
  --from-huggingface-repo Tdamre/MiniCPM5-1B-litert-lm \
  MiniCPM5-1B-web.litertlm \
  --backend=cpu \
  --prompt="What is 2+2?"

Android Validation

Test device:

Nubia Redmagic 10 Pro / NX789J
Android 16, API 36
SoC: Qualcomm SM8750
GPU: Adreno 830
ABI: arm64-v8a

Android runner:

LiteRT-LM v0.12.0 source build
Bazel 7.6.1 via Bazelisk
Android NDK r28b
Target: //runtime/engine:litert_lm_main --config=android_arm64

Smoke prompt:

What is 2+2? Answer with just the number.

Results:

Artifact Backend Result Notes
MiniCPM5-1B-web.litertlm Android CPU/XNNPACK Returned 4. Init 1877.13 ms, TTFT 0.38 s, prefill 61.13 tok/s, decode 50.77 tok/s.
MiniCPM5-1B-web.litertlm Android GPU/OpenCL Returned 4. LiteRT GPU loaded, all model subgraphs delegated to LITERT_CL. Init 8485.10 ms, TTFT 0.18 s, prefill 143.54 tok/s, decode 44.03 tok/s.
MiniCPM5-1B-web.litertlm Android emulator API 35 CPU/XNNPACK Returned final 6 for reggin. x86_64 LiteRT-LM Android runner, Pixel-style emulator, repaired embedded template.
MiniCPM5-1B-web.litertlm Android emulator API 35 CPU/XNNPACK Seven-turn chat completed. Same Conversation instance; final turn confirmed BLUE-IRIS and 17, and the reggin turn returned 6.
MiniCPM5-1B-qualcomm-sm8750.litertlm Qualcomm NPU/Dispatch/QNN Not inference-passing. Dispatch and QNN load, V79 HTP skel is selected, context binaries match QNN SDK 2.44.0, then QNN fails context creation with context-size estimate 9260401152 bytes.
MiniCPM5-1B-qualcomm-sm8750-c1024.litertlm Qualcomm NPU/Dispatch/QNN Not inference-passing. Reduced-cache variant lowers the context-size estimate to 5909219840 bytes, but QNN still reports no available PD for context creation.

The Qualcomm AOT artifacts are included for reproducibility and follow-up runtime work. They should not be treated as validated NPU inference artifacts for this stock Redmagic shell environment yet.

Conversion Summary

Source model revision:

openbmb/MiniCPM5-1B@4e9de7a0778dc1c362e983e6858f0e77542cbdca

CPU/GPU conversion command pattern:

litert-torch export_hf \
  model-cache/openbmb-MiniCPM5-1B \
  <output_dir> \
  --keep_temporary_files=True \
  --prefill_lengths=128,1024 \
  --cache_length=<2048-or-4096> \
  --externalize_embedder=True \
  --quantization_recipe=dynamic_wi8_afp32

Qualcomm AOT conversion used ai_edge_litert.aot targeting SocManufacturer.QUALCOMM and SocModel.SM8750, then repackaged the compiled TFLite model into the .litertlm container with backend_constraint=npu.

Toolchain:

Python 3.12.12
litert-torch 0.9.1
litert-lm 0.12.0
ai-edge-litert 2.1.5
ai-edge-litert-sdk-qualcomm 2.1.5
ai-edge-quantizer 0.7.0
torch 2.12.0+cu130
transformers 5.9.0
QNN SDK/runtime 2.44.0 from the Qualcomm LiteRT SDK wheel

Not Included

MediaPipe .task was not uploaded. The current MediaPipe GenAI bundler documents SentencePiece tokenizer.model support, while MiniCPM5-1B ships an HF tokenizer.json and no tokenizer.model; uploading an unverified .task would be misleading.

Intel NPU AOT was not produced in this local run because this request targeted the connected Qualcomm SM8750/Adreno phone.

Downloads last month
30
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Tdamre/MiniCPM5-1B-litert-lm

Finetuned
(11)
this model