---
license: other
license_name: falcon-llm-license
license_link: https://falconllm.tii.ae/falcon-terms-and-conditions.html
base_model: tiiuae/Falcon3-3B-Instruct
tags:
- litert
- litert-lm
- litertlm
- on-device
- edge
- falcon3
pipeline_tag: text-generation
library_name: litert-lm
---
# Falcon3-3B-Instruct: LiteRT-LM (blockwise int4)
[tiiuae/Falcon3-3B-Instruct](https://huggingface.co/tiiuae/Falcon3-3B-Instruct)
converted to the **LiteRT-LM** (`.litertlm`) format for on-device inference with
Google's [LiteRT-LM](https://github.com/google-ai-edge/litert-lm) runtime (the
engine behind the official `litert-community/*` models).
Text-only conversion (the Falcon3 decoder; no vision/audio towers).
| | |
|---|---|
| **File** | `model.litertlm` (~1.74 GB) |
| **Quantization** | int4 weights, **blockwise (block size 128)**, symmetric; embeddings INT8 |
| **Compute** | integer |
| **Context (KV cache)** | 2048 |
| **Base model** | tiiuae/Falcon3-3B-Instruct |
| **Decode speed** | ~27 tok/s (iPhone 17 Pro, Metal GPU) · ~89 tok/s (Mac M4 Max, LiteRT-LM, greedy) |
## Usage
Run with the LiteRT-LM runtime:
```bash
# build litert-lm from https://github.com/google-ai-edge/litert-lm, then:
litert_lm_main \
  --model_path model.litertlm \
  --backend gpu \
  --input_prompt "Explain on-device AI in one sentence."
```
The `.litertlm` bundle carries the tokenizer and the prompt template (Falcon3's
native `<|user|>` / `<|assistant|>` format, stop token `<|endoftext|>`), so no
separate tokenizer files are needed.
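
For scripting, the same invocation can be wrapped from Python. The following is a minimal sketch that only reuses the binary name and the three flags shown above; the helper names `build_command` and `run_prompt` are hypothetical, and it assumes `litert_lm_main` is on your `PATH`.

```python
import subprocess
from typing import List


def build_command(prompt: str,
                  model_path: str = "model.litertlm",
                  backend: str = "gpu",
                  binary: str = "litert_lm_main") -> List[str]:
    """Assemble the argv list for the litert_lm_main call shown above."""
    return [
        binary,
        "--model_path", model_path,
        "--backend", backend,
        "--input_prompt", prompt,
    ]


def run_prompt(prompt: str, **kwargs) -> str:
    """Run the binary once and return whatever it wrote to stdout."""
    result = subprocess.run(
        build_command(prompt, **kwargs),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when the prompt contains spaces or quotes.
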
## Run on desktop (LiteRT-LM CLI)
The same `.litertlm` bundle runs on macOS / Linux / Windows with the official
[LiteRT-LM CLI](https://github.com/google-ai-edge/LiteRT-LM), which can also expose
it as a local **OpenAI-compatible API server**:
```bash
pip install litert-lm
litert-lm import --from-huggingface-repo mlboydaisuke/Falcon3-3B-Instruct-LiteRT model.litertlm falcon3-3b-instruct-litert
litert-lm run falcon3-3b-instruct-litert # interactive chat in the terminal
litert-lm serve # local OpenAI-compatible API server
```
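
Once `litert-lm serve` is running, any OpenAI-style client should be able to talk to it. Below is a hedged sketch using only the Python standard library. The host and port in `BASE_URL` are assumptions (use whatever address `litert-lm serve` reports), as is the use of the standard OpenAI `/v1/chat/completions` route and response shape; the model name is the alias passed to `litert-lm import` above.

```python
import json
import urllib.request

# ASSUMPTION: host/port are placeholders, not taken from the LiteRT-LM docs.
BASE_URL = "http://localhost:8000"
# Alias chosen in the `litert-lm import` command above.
MODEL_NAME = "falcon3-3b-instruct-litert"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_NAME):
    """Build a POST request in the OpenAI chat-completions format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # greedy-style decoding, if the server honours it
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # ASSUMPTION: the response follows the OpenAI chat-completions schema.
    return body["choices"][0]["message"]["content"]
```

With the server up, `chat("Explain on-device AI in one sentence.")` would return the generated text; nothing is sent until `chat` is called.
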
## Quality: GSM8K parity
Measured on GSM8K (n=100, greedy, 0-shot chain-of-thought asking for `#### <n>`,
identical prompt and answer-extraction for every row). The 4-bit MLX build is the
known-good 4-bit control:
| Configuration | GSM8K |
|---|---|
| bf16 (reference) | 75% |
| MLX 4-bit (control) | 76% |
| **This model β€” LiteRT int4** | **77%** |
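
The exact scoring script is not part of this card. As an illustration of what extracting a `#### <n>` answer can look like, here is a minimal sketch; the regular expression and the helper names `extract_answer` and `is_correct` are assumptions, not the code used for the table above.

```python
import re
from typing import Optional

# Matches the number after the last '####' marker, e.g. '#### 1,234' or '#### -3.5'.
_ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_answer(text: str) -> Optional[str]:
    """Return the final '#### <n>' number as a string without thousands commas."""
    matches = _ANSWER_RE.findall(text)
    if not matches:
        return None  # the model never produced the requested marker
    return matches[-1].replace(",", "")


def is_correct(generated: str, reference: str) -> bool:
    """Compare the extracted numbers numerically, so '72' equals '72.0'."""
    pred, gold = extract_answer(generated), extract_answer(reference)
    return pred is not None and gold is not None and float(pred) == float(gold)
```

Applying one extractor to every configuration is what makes the rows comparable: a formatting quirk then costs all of them equally.
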
LiteRT int4 is at parity: its score matches or slightly exceeds both the 4-bit
control and bf16, and the 2-point spread across the three rows is within sampling
noise at n=100. This is a direct-answering instruct model (no `<think>` block) and
terminates cleanly at `<|endoftext|>`.
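
To put a number on the sampling noise, the sketch below computes the binomial standard error for each score in the table at n=100. It treats each row as an independent sample of 100 questions, which is a simplifying assumption; if all rows share the same questions, a paired comparison would be tighter.

```python
import math


def standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


N = 100
scores = {"bf16": 0.75, "MLX 4-bit": 0.76, "LiteRT int4": 0.77}

for name, acc in scores.items():
    se = standard_error(acc, N)
    # 1.96 * SE is the half-width of a 95% interval (normal approximation).
    print(f"{name}: {acc:.0%}, SE {se:.1%}, 95% half-width {1.96 * se:.1%}")

spread = max(scores.values()) - min(scores.values())
smallest_se = min(standard_error(acc, N) for acc in scores.values())
print(f"spread {spread:.1%} vs smallest SE {smallest_se:.1%}")
```

Each standard error is a little over 4 percentage points, so a 2-point gap between rows cannot be distinguished from chance at this sample size.
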
## Conversion
Converted with [`litert-torch`](https://github.com/google-ai-edge/litert) using a
**blockwise int4** recipe (INT4 weights, block size 128, symmetric) with embeddings
kept at INT8, KV cache 2048, and Falcon3's native chat template. Falcon3-3B is a
standard `LlamaForCausalLM` architecture, so it rides the existing converter and
runtime with no custom code. Blockwise int4 keeps one scale per 128-weight block
rather than one per output channel, and that finer granularity is what preserves
reasoning accuracy.
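
The difference between the two schemes can be shown on a toy weight row. This is an illustrative sketch of symmetric int4 quantization in pure Python, not the quantizer used by `litert-torch`; for simplicity it clamps integer codes to [-7, 7], and the weight values are made up.

```python
from typing import List

QMAX = 7  # simplified symmetric int4: integer codes in [-7, 7]


def quantize_dequantize(values: List[float], block_size: int) -> List[float]:
    """Symmetric int4 round trip with one scale per `block_size` values."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        scale = max_abs / QMAX if max_abs > 0 else 1.0
        for v in block:
            q = max(-QMAX, min(QMAX, round(v / scale)))  # integer code
            out.append(q * scale)                        # back to float
    return out


def mean_abs_error(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Toy row of 256 weights: small values everywhere, one outlier in the second block.
small = [0.01 * ((i % 7) - 3) for i in range(128)]
row = small + [1.0] + small[1:]

channelwise = quantize_dequantize(row, block_size=len(row))  # one scale per row
blockwise = quantize_dequantize(row, block_size=128)         # one scale per block

err_channel = mean_abs_error(row, channelwise)
err_block = mean_abs_error(row, blockwise)
print(f"channelwise error: {err_channel:.5f}")
print(f"blockwise error:   {err_block:.5f}")
```

With a single scale for the whole row, the outlier stretches the quantization grid so far that every small weight rounds to zero. With per-block scales, only the block containing the outlier pays that cost.
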
## Reproduce (official tools only)
Built with **stock `litert-torch`**: no custom code, no graph patches. The only
non-default choice is the int4 recipe. The tool's default named int4 recipe is
*channelwise* (which degrades small models), so this build uses **blockwise-128**
(the scheme the official models ship), passed as a recipe file to the standard export:
```python
from litert_torch.generative.export_hf.export import export
export(
    model="tiiuae/Falcon3-3B-Instruct",
    output_dir="out",
    quantization_recipe="falcon_int4_block128.json",  # included in this repo
    cache_length=2048,
    trust_remote_code=True,
)
```
`falcon_int4_block128.json` is included in this repo. If the export fails because
the `ai_edge_quantizer/recipes/` directory is missing, create it as an empty
directory. This is a packaging gap in some releases that trips the `.json`-recipe path.
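
The workaround can be scripted. The sketch below assumes the missing directory belongs inside the installed `ai_edge_quantizer` package (check the path in your error message first); `find_package_dir` and `ensure_recipes_dir` are hypothetical helper names.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_package_dir(package: str = "ai_edge_quantizer") -> Optional[Path]:
    """Return the installed package directory, or None if it is not installed."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


def ensure_recipes_dir(package_dir: Path) -> Path:
    """Create an empty `recipes/` directory under package_dir if it is absent."""
    recipes = package_dir / "recipes"
    recipes.mkdir(parents=True, exist_ok=True)  # no-op when it already exists
    return recipes
```

A typical call is `ensure_recipes_dir(find_package_dir())` after confirming that `find_package_dir()` did not return `None`. The call is idempotent, so running it before every export is harmless.
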
## License
Falcon LLM License (TII), inherited from the base model
[tiiuae/Falcon3-3B-Instruct](https://huggingface.co/tiiuae/Falcon3-3B-Instruct).
See https://falconllm.tii.ae/falcon-terms-and-conditions.html