Add desktop LiteRT-LM CLI section (import/run/serve)

13c1bb0 verified 3 days ago

4.2 kB

	---
	license: other
	license_name: falcon-llm-license
	license_link: https://falconllm.tii.ae/falcon-terms-and-conditions.html
	base_model: tiiuae/Falcon3-3B-Instruct
	tags:
	- litert
	- litert-lm
	- litertlm
	- on-device
	- edge
	- falcon3
	pipeline_tag: text-generation
	library_name: litert-lm
	---

	# Falcon3-3B-Instruct — LiteRT-LM (blockwise int4)

	[tiiuae/Falcon3-3B-Instruct](https://huggingface.co/tiiuae/Falcon3-3B-Instruct)
	converted to the LiteRT-LM (`.litertlm`) format for on-device inference with
	Google's [LiteRT-LM](https://github.com/google-ai-edge/litert-lm) runtime (the
	engine behind the official `litert-community/*` models).

	Text-only conversion (the Falcon3 decoder; no vision/audio towers).

	\| \| \|
	\|---\|---\|
	\| File \| `model.litertlm` (~1.74 GB) \|
	\| Quantization \| int4 weights — blockwise (block 128), symmetric; embeddings INT8 \|
	\| Compute \| integer \|
	\| Context (KV cache) \| 2048 \|
	\| Base model \| tiiuae/Falcon3-3B-Instruct \|
	\| Decode speed \| ~27 tok/s (iPhone 17 Pro, Metal GPU) · ~89 tok/s (Mac M4 Max, LiteRT-LM, greedy) \|

	## Usage

	Run with the LiteRT-LM runtime:

	```bash
	# build litert-lm from https://github.com/google-ai-edge/litert-lm, then:
	litert_lm_main \
	--model_path model.litertlm \
	--backend gpu \
	--input_prompt "Explain on-device AI in one sentence."
	```

	The `.litertlm` bundle carries the tokenizer and the prompt template (Falcon3's
	native `<\|user\|>` / `<\|assistant\|>` format, stop token `<\|endoftext\|>`), so no
	separate tokenizer files are needed.

	## Run on desktop (LiteRT-LM CLI)

	The same `.litertlm` bundle runs on macOS / Linux / Windows with the official
	[LiteRT-LM CLI](https://github.com/google-ai-edge/LiteRT-LM) — including as a
	local OpenAI-compatible API server:

	```bash
	pip install litert-lm
	litert-lm import --from-huggingface-repo mlboydaisuke/Falcon3-3B-Instruct-LiteRT model.litertlm falcon3-3b-instruct-litert
	litert-lm run falcon3-3b-instruct-litert # interactive chat in the terminal
	litert-lm serve # local OpenAI-compatible API server
	```

	## Quality — GSM8K parity

	Measured on GSM8K (n=100, greedy, 0-shot chain-of-thought asking for `#### <n>`,
	identical prompt and answer-extraction for every row). The 4-bit MLX build is the
	known-good 4-bit control:

	\| Configuration \| GSM8K \|
	\|---\|---\|
	\| bf16 (reference) \| 75% \|
	\| MLX 4-bit (control) \| 76% \|
	\| This model — LiteRT int4 \| 77% \|

	LiteRT int4 is fully at parity — it matches or slightly exceeds both the 4-bit
	control and bf16 here (the small spread is sampling noise at n=100). This is a
	direct-answering instruct model (no `<think>` block) and terminates cleanly at
	`<\|endoftext\|>`.

	## Conversion

	Converted with [`litert-torch`](https://github.com/google-ai-edge/litert) using a
	blockwise int4 recipe (INT4 weights, block size 128, symmetric) with embeddings
	kept at INT8, KV cache 2048, and Falcon3's native chat template. Falcon3-3B is a
	standard `LlamaForCausalLM` architecture, so it rides the existing converter and
	runtime with no custom code. Blockwise (not channelwise) int4 is what preserves
	reasoning accuracy.

	## Reproduce (official tools only)

	Built with stock `litert-torch` — no custom code, no graph patches. The only
	non-default choice is the int4 recipe: the tool's default named int4 is
	channelwise (which degrades small models), so this uses blockwise-128 (the
	scheme the official models ship), passed as a recipe file to the standard export:

	```python
	from litert_torch.generative.export_hf.export import export
	export(
	model="tiiuae/Falcon3-3B-Instruct",
	output_dir="out",
	quantization_recipe="falcon_int4_block128.json", # included in this repo
	cache_length=2048,
	trust_remote_code=True,
	)
	```

	`falcon_int4_block128.json` is included in this repo. (If the export errors with a
	missing `ai_edge_quantizer/recipes/` directory, create it empty — a packaging gap
	in some releases that trips the `.json`-recipe path.)

	## License

	Falcon LLM License (TII), inherited from the base model
	[tiiuae/Falcon3-3B-Instruct](https://huggingface.co/tiiuae/Falcon3-3B-Instruct).
	See https://falconllm.tii.ae/falcon-terms-and-conditions.html