---
license: other
license_name: falcon-llm-license
license_link: https://falconllm.tii.ae/falcon-terms-and-conditions.html
base_model: tiiuae/Falcon3-3B-Instruct
tags:
- litert
- litert-lm
- litertlm
- on-device
- edge
- falcon3
pipeline_tag: text-generation
library_name: litert-lm
---

# Falcon3-3B-Instruct: LiteRT-LM (blockwise int4)

[tiiuae/Falcon3-3B-Instruct](https://huggingface.co/tiiuae/Falcon3-3B-Instruct) converted to the **LiteRT-LM** (`.litertlm`) format for on-device inference with Google's [LiteRT-LM](https://github.com/google-ai-edge/litert-lm) runtime (the engine behind the official `litert-community/*` models). Text-only conversion (the Falcon3 decoder; no vision/audio towers).

| | |
|---|---|
| **File** | `model.litertlm` (~1.74 GB) |
| **Quantization** | int4 weights, **blockwise (block 128)**, symmetric; embeddings INT8 |
| **Compute** | integer |
| **Context (KV cache)** | 2048 |
| **Base model** | tiiuae/Falcon3-3B-Instruct |
| **Decode speed** | ~27 tok/s (iPhone 17 Pro, Metal GPU) · ~89 tok/s (Mac M4 Max, LiteRT-LM, greedy) |

## Usage

Run with the LiteRT-LM runtime:

```bash
# build litert-lm from https://github.com/google-ai-edge/litert-lm, then:
litert_lm_main \
  --model_path model.litertlm \
  --backend gpu \
  --input_prompt "Explain on-device AI in one sentence."
```

The `.litertlm` bundle carries the tokenizer and the prompt template (Falcon3's native `<|user|>` / `<|assistant|>` format, stop token `<|endoftext|>`), so no separate tokenizer files are needed.
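To make the bundled prompt format concrete, here is a minimal sketch of what a `<|user|>` / `<|assistant|>` prompt could look like once rendered. The helper names `build_falcon3_prompt` and `strip_stop_token` are hypothetical, and the exact newline placement is an assumption rather than something read out of the bundle. The runtime applies the real template itself, so this is only for orientation.

```python
# Illustrative only: the .litertlm bundle applies its own template at runtime.
# Helper names are hypothetical; the newline layout is an assumption.

STOP_TOKEN = "<|endoftext|>"


def build_falcon3_prompt(messages):
    """Render chat messages as role-tagged turns, then open an assistant turn."""
    parts = []
    for msg in messages:
        # Each turn is a role tag on its own line followed by the content.
        parts.append(f"<|{msg['role']}|>\n{msg['content']}\n")
    # Trailing assistant tag asks the model to generate the reply.
    parts.append("<|assistant|>\n")
    return "".join(parts)


def strip_stop_token(text, stop=STOP_TOKEN):
    """Cut generated text at the first stop token, if one is present."""
    return text.split(stop, 1)[0]


prompt = build_falcon3_prompt(
    [{"role": "user", "content": "Explain on-device AI in one sentence."}]
)
print(prompt)
```

In normal use you pass plain text to the runtime and never build this string by hand.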
## Run on desktop (LiteRT-LM CLI)

The same `.litertlm` bundle runs on macOS / Linux / Windows with the official [LiteRT-LM CLI](https://github.com/google-ai-edge/LiteRT-LM), including as a local **OpenAI-compatible API server**:

```bash
pip install litert-lm
litert-lm import --from-huggingface-repo mlboydaisuke/Falcon3-3B-Instruct-LiteRT model.litertlm falcon3-3b-instruct-litert
litert-lm run falcon3-3b-instruct-litert   # interactive chat in the terminal
litert-lm serve                            # local OpenAI-compatible API server
```

## Quality: GSM8K parity

Measured on GSM8K (n=100, greedy, 0-shot chain-of-thought asking for the final answer after `####`, with identical prompt and answer extraction for every row). The MLX 4-bit build serves as the known-good 4-bit control:

| Configuration | GSM8K |
|---|---|
| bf16 (reference) | 75% |
| MLX 4-bit (control) | 76% |
| **This model (LiteRT int4)** | **77%** |

LiteRT int4 is at parity: it scores slightly above both the 4-bit control and bf16 here, but the 1 to 2 point spread is within sampling noise at n=100. This is a direct-answering instruct model (no separate reasoning block) and terminates cleanly at `<|endoftext|>`.

## Conversion

Converted with [`litert-torch`](https://github.com/google-ai-edge/litert) using a **blockwise int4** recipe (INT4 weights, block size 128, symmetric) with embeddings kept at INT8, KV cache 2048, and Falcon3's native chat template. Falcon3-3B is a standard `LlamaForCausalLM` architecture, so it works with the existing converter and runtime with no custom code. Blockwise (not channelwise) int4 is what preserves reasoning accuracy.

## Reproduce (official tools only)

Built with **stock `litert-torch`**: no custom code, no graph patches.
The only non-default choice is the int4 recipe. The tool's default named int4 recipe is *channelwise*, which degrades small models, so this build uses **blockwise-128** (the scheme the official models ship), passed as a recipe file to the standard export:

```python
from litert_torch.generative.export_hf.export import export

export(
    model="tiiuae/Falcon3-3B-Instruct",
    output_dir="out",
    quantization_recipe="falcon_int4_block128.json",  # included in this repo
    cache_length=2048,
    trust_remote_code=True,
)
```

`falcon_int4_block128.json` is included in this repo. If the export fails because the `ai_edge_quantizer/recipes/` directory is missing, create it as an empty directory; some releases omit it from the package, which breaks the `.json`-recipe path.

## License

Falcon LLM License (TII), inherited from the base model [tiiuae/Falcon3-3B-Instruct](https://huggingface.co/tiiuae/Falcon3-3B-Instruct). See https://falconllm.tii.ae/falcon-terms-and-conditions.html