How to use mlboydaisuke/Llama-3.2-3B-Instruct-LiteRT with LiteRT-LM:
# LiteRT-LM runs on various platforms (Android, iOS, Windows, Linux, macOS, IoT, Web/WASM)
# and supports many APIs (C++, Python, Kotlin, Swift, JavaScript, Flutter).
# For platform-specific integration guides, please refer to the official developer website:
# https://ai.google.dev/edge/litert-lm
# To try LiteRT-LM, the easiest way is to use our CLI tool.
# 1. Install the LiteRT-LM CLI tool:
pip install litert-lm
# 2. Download and run this model locally:
# See: https://ai.google.dev/edge/litert-lm/cli
litert-lm run \
  --from-huggingface-repo=mlboydaisuke/Llama-3.2-3B-Instruct-LiteRT \
  model.litertlm \
  --prompt="Write me a poem"
Llama-3.2-3B-Instruct: LiteRT-LM (blockwise int4)
Built with Llama. meta-llama/Llama-3.2-3B-Instruct
converted to the LiteRT-LM (.litertlm) format for on-device inference with
Google's LiteRT-LM runtime (the
engine behind the official litert-community/* models).
| Property | Value |
|---|---|
| File | model.litertlm (~2.1 GB) |
| Quantization | int4 weights, blockwise (block 32), symmetric; embeddings INT8 (externalized section) |
| Compute | integer |
| Context (KV cache) | 4096 |
| Base model | meta-llama/Llama-3.2-3B-Instruct |
| Decode speed | ~18.5 tok/s (iPhone 17 Pro, Metal GPU, ttft 0.64 s) · ~87 tok/s (Mac M4 Max, greedy) |
Usage
Run with the LiteRT-LM runtime:
# build litert-lm from https://github.com/google-ai-edge/litert-lm, then:
litert_lm_main \
  --model_path model.litertlm \
  --backend gpu \
  --input_prompt "Explain on-device AI in one sentence."
The .litertlm bundle carries the tokenizer and the prompt template (Llama-3's
native <|start_header_id|>role<|end_header_id|> format, start token
<|begin_of_text|>, stop tokens <|eot_id|> / <|end_of_text|>), so no separate
tokenizer files are needed.
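For readers who want to see what that template produces, here is a minimal sketch of the Llama-3 header layout using the tokens named above. The function name `format_llama3_prompt` is hypothetical, and the blank line after each header follows Meta's published Llama-3 format; the template stored in the bundle may differ in details (for example, a default system preamble), so treat this as an illustration rather than the runtime's exact output.

```python
from typing import Dict, List

BOS = "<|begin_of_text|>"
EOT = "<|eot_id|>"


def format_llama3_prompt(messages: List[Dict[str, str]]) -> str:
    """Render chat messages in the Llama-3 header format and open an assistant turn."""
    parts = [BOS]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}{EOT}"
        )
    # Leave the assistant header open so the model generates the reply
    # and stops when it emits <|eot_id|>.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = format_llama3_prompt(
    [{"role": "user", "content": "Explain on-device AI in one sentence."}]
)
print(prompt)
```

With the `.litertlm` bundle none of this has to be done by hand; the sketch only shows what the runtime assembles around the text passed via `--input_prompt`.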
Run on Android
The easiest way to try this model on a phone is the official
Google AI Edge Gallery app, which
runs .litertlm models fully on-device and can import your own:
- Install a recent Gallery (package com.google.ai.edge.gallery, APK from the repo's releases; 1.0.15+ supports .litertlm). Older 1.0.x builds (package com.google.aiedge.gallery) only accept the legacy MediaPipe .task format and reject .litertlm.
- Download model.litertlm from this repo and push it to the device: adb push model.litertlm /sdcard/Download/
- In the app, tap the + button (bottom-right), pick the file, and choose the GPU backend (CPU also works).
- Chat. Nothing else to configure: the .litertlm bundle already carries the tokenizer and prompt template, so the model uses its native Llama-3 chat format automatically.
See the Gallery
Importing Local Models
guide for details. To embed the model in your own Android app instead, use the
LiteRT-LM Kotlin API (Gradle artifact com.google.ai.edge.litertlm:litertlm-android;
see its getting started guide).
Quality: GSM8K parity
Measured on GSM8K (n=100, greedy, 0-shot chain-of-thought asking for #### <n>,
identical prompt and answer-extraction for every row). The 4-bit MLX build is the
known-good 4-bit control:
| Configuration | GSM8K |
|---|---|
| bf16 (reference) | 78.0% |
| MLX 4-bit (control) | 73.3%¹ |
| This model (LiteRT int4) | 73.0% |
LiteRT int4 is at parity: −5 pt vs bf16 and equal to the MLX 4-bit control
(73.3% vs 73.3% on the common subset¹). The model also passes the local quality gate
8/8 (no degeneracy). bf16's 78.0% matches Llama's published 8-shot GSM8K (~77.7%),
confirming the harness is calibrated. This is a direct-answering instruct model (no
<think> block); it terminates cleanly at <|eot_id|>.
¹ The MLX control hit a reproducible Metal out-of-memory abort at one question on the test machine, so bf16 / LiteRT-int4 / MLX are compared on the common 45-question subset (77.8 / 73.3 / 73.3); the LiteRT-int4 headline (73.0%) is the full n=100.
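The scoring described above (extract the number after `####`, compare it to the reference, and report percent correct) can be sketched as follows. This is not the harness used for the table: the regular expression, the function names, and the sample completions are all assumptions made for illustration.

```python
import re
from typing import List, Optional

# Capture the number that follows "####", allowing a sign, thousands
# separators, and a decimal part.
ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_answer(completion: str) -> Optional[str]:
    """Return the last '#### <n>' value in the completion, without commas."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def is_correct(completion: str, reference: str) -> bool:
    """Compare the extracted answer to the reference numerically."""
    predicted = extract_answer(completion)
    if predicted is None:
        return False
    return float(predicted) == float(reference.replace(",", ""))


def accuracy(completions: List[str], references: List[str]) -> float:
    """Percentage of completions whose extracted answer matches the reference."""
    hits = sum(is_correct(c, r) for c, r in zip(completions, references))
    return 100.0 * hits / len(references)


# Made-up completions, only to exercise the functions.
sample_completions = [
    "16 - 3 - 4 = 9 eggs, 9 * 2 = 18 dollars. #### 18",
    "I think the answer is 5.",  # no '####' marker, counted as wrong
    "The total is #### 1,200",
]
sample_references = ["18", "5", "1200"]
print(f"{accuracy(sample_completions, sample_references):.1f}%")
```

Applying the same extraction to every row keeps formatting differences from being scored as reasoning errors. On a 45-question subset, the reported 77.8% and 73.3% would correspond to 35 and 33 correct answers respectively; those counts are inferred from the percentages, not stated in the card.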
Conversion
Converted with the official litert-torch
converter (upstream main), no custom graph code. Llama-3.2-3B is a standard
LlamaForCausalLM architecture, so it rides the existing converter and runtime
directly. The recipe is blockwise int4 (INT4 weights, block size 32, symmetric)
with embeddings kept at INT8 and KV cache 4096. Blockwise (not the tool's default
channelwise) int4 is what preserves reasoning accuracy.
from litert_torch.generative.export_hf.export import export

export(
    model="meta-llama/Llama-3.2-3B-Instruct",
    output_dir="out",
    quantization_recipe="llama_int4_block32.json",  # blockwise-32 int4, int8 embeddings
    cache_length=4096,
    externalize_embedder=True,  # embedding → its own section (see iOS note)
)
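To illustrate why the blockwise recipe matters, the toy sketch below quantizes one row of weights to symmetric int4 twice: once with a single scale for the whole row (channelwise) and once with one scale per 32 weights (blockwise). It is plain Python, not the converter's implementation; the code range [-7, 7], the helper names, and the synthetic weights with a single outlier are assumptions chosen to make the effect visible.

```python
from typing import List

QMAX = 7  # assumed symmetric int4 range [-7, 7]; some schemes use [-8, 7]


def quantize_dequantize(weights: List[float], block_size: int) -> List[float]:
    """Quantize each block with its own scale, then map the codes back to floats."""
    restored: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        scale = max_abs / QMAX if max_abs > 0 else 1.0
        for w in block:
            code = max(-QMAX, min(QMAX, round(w / scale)))
            restored.append(code * scale)
    return restored


def mean_abs_error(original: List[float], restored: List[float]) -> float:
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# One output channel: 128 small weights plus a single large outlier.
row = [0.01 * ((i % 7) - 3) for i in range(128)]
row[5] = 1.0

channelwise = quantize_dequantize(row, block_size=len(row))  # one scale per row
blockwise = quantize_dequantize(row, block_size=32)          # one scale per 32 weights

print("channelwise error:", mean_abs_error(row, channelwise))
print("blockwise error:  ", mean_abs_error(row, blockwise))
```

With one scale per row, the outlier stretches the scale so far that every small weight rounds to zero. With block size 32, only the block containing the outlier pays that cost and the other three blocks keep their resolution, at the price of storing more scales.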
externalize_embedder=True (required for iOS). Without it, this 28-layer 3B model's
weights form a single ~2.4 GB TFLite section, which exceeds the ~2 GiB single-section
mmap limit on iOS, so engine creation fails with "Failed to map section: Cannot allocate memory".
Externalizing the (tied) embedding into its own section drops the main weights section
below 2 GiB (and dedups the tied matrix, ~2.4 GB → 2.1 GB total), so the model loads on
iPhone. Verified on iPhone 17 Pro (loads in 8.8 s, ~18.5 tok/s, coherent). This is
the generic equivalent of Gemma's per-layer-embedding mmap. (Mac/desktop load >2 GB
sections fine, so this only matters for iOS.)
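The size arithmetic behind this note mixes decimal GB and binary GiB, so a small check may help. The sketch below only restates the figures quoted above; the vocabulary size (128256) and hidden size (3072) are taken from the base model's published configuration and should be treated as assumptions here, and the embedding estimate ignores quantization scales and other metadata.

```python
GB = 10**9   # decimal gigabyte, as used for the file sizes above
GIB = 2**30  # binary gibibyte, as used for the mmap limit

MMAP_LIMIT_BYTES = 2 * GIB  # approximate single-section limit quoted above


def fits_in_one_section(size_bytes: float) -> bool:
    """True if a section of this size stays under the ~2 GiB mapping limit."""
    return size_bytes < MMAP_LIMIT_BYTES


# Assumed Llama-3.2-3B dimensions (from the base model's config).
VOCAB_SIZE = 128_256
HIDDEN_SIZE = 3_072
embedding_bytes_int8 = VOCAB_SIZE * HIDDEN_SIZE  # one byte per INT8 weight

single_section_bytes = 2.4 * GB  # everything in one section
total_after_bytes = 2.1 * GB     # whole file after externalizing the embedding

print(fits_in_one_section(single_section_bytes))  # False
print(round(single_section_bytes / GIB, 2))       # 2.24
print(round(total_after_bytes / GIB, 2))          # 1.96
print(round(embedding_bytes_int8 / GB, 2))        # 0.39
```

Because the whole externalized file is roughly 1.96 GiB, no individual section in it can reach the 2 GiB limit, which is consistent with the model loading on iPhone.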
A block-128 variant is also available (slightly smaller, ~+5% decode on Apple GPU, quality gate 7/8) for latency-sensitive deployments.
License
Llama 3.2 Community License, inherited from the base model meta-llama/Llama-3.2-3B-Instruct. Built with Llama. See https://www.llama.com/llama3_2/license/