---
license: other
license_name: falcon-llm-license
license_link: https://falconllm.tii.ae/falcon-terms-and-conditions.html
base_model: tiiuae/Falcon3-3B-Instruct
tags:
- litert
- litert-lm
- litertlm
- on-device
- edge
- falcon3
pipeline_tag: text-generation
library_name: litert-lm
---
# Falcon3-3B-Instruct: LiteRT-LM (blockwise int4)

[tiiuae/Falcon3-3B-Instruct](https://huggingface.co/tiiuae/Falcon3-3B-Instruct)
converted to the **LiteRT-LM** (`.litertlm`) format for on-device inference with
Google's [LiteRT-LM](https://github.com/google-ai-edge/litert-lm) runtime (the
engine behind the official `litert-community/*` models).

Text-only conversion (the Falcon3 decoder; no vision/audio towers).

| Property | Value |
|---|---|
| **File** | `model.litertlm` (~1.74 GB) |
| **Quantization** | int4 weights, **blockwise (block 128)**, symmetric; embeddings INT8 |
| **Compute** | integer |
| **Context (KV cache)** | 2048 tokens |
| **Base model** | tiiuae/Falcon3-3B-Instruct |
| **Decode speed** | ~27 tok/s (iPhone 17 Pro, Metal GPU) · ~89 tok/s (Mac M4 Max, LiteRT-LM, greedy) |

## Usage

Run with the LiteRT-LM runtime:

```bash
# Build LiteRT-LM from https://github.com/google-ai-edge/litert-lm, then run:
litert_lm_main \
  --model_path model.litertlm \
  --backend gpu \
  --input_prompt "Explain on-device AI in one sentence."
```

The `.litertlm` bundle carries the tokenizer and the prompt template (Falcon3's
native `<|user|>` / `<|assistant|>` format, stop token `<|endoftext|>`), so no
separate tokenizer files are needed.

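
For orientation, a single-turn prompt in this format can be sketched as follows. The runtime applies the bundled template itself, so this is illustrative only. The helper name `format_falcon3_prompt` is hypothetical, and the exact newline placement is an assumption rather than something read from the bundle.

```python
STOP_TOKEN = "<|endoftext|>"  # generation ends when the model emits this token


def format_falcon3_prompt(user_message: str) -> str:
    """Hypothetical helper: lay out one user turn in Falcon3's chat format."""
    # Assumed layout: role tag, newline, content, newline, then the assistant
    # tag that cues the model to start its reply.
    return f"<|user|>\n{user_message}\n<|assistant|>\n"


prompt = format_falcon3_prompt("Explain on-device AI in one sentence.")
print(prompt)
```
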
## Run on desktop (LiteRT-LM CLI)

The same `.litertlm` bundle runs on macOS / Linux / Windows with the official
[LiteRT-LM CLI](https://github.com/google-ai-edge/LiteRT-LM), including as a
local **OpenAI-compatible API server**:

```bash
pip install litert-lm
litert-lm import --from-huggingface-repo mlboydaisuke/Falcon3-3B-Instruct-LiteRT \
  model.litertlm falcon3-3b-instruct-litert
litert-lm run falcon3-3b-instruct-litert   # interactive chat in the terminal
litert-lm serve                            # local OpenAI-compatible API server
```

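
Once the server is running, any OpenAI-style client should be able to talk to it. The sketch below uses only the Python standard library and the usual chat-completions request shape. The address in `BASE_URL` is an assumption (use whatever address `litert-lm serve` reports on startup), and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: replace with the address that `litert-lm serve` prints.
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "falcon3-3b-instruct-litert"  # the name passed to `litert-lm import`


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request in the OpenAI chat-completions shape."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the first reply (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Explain on-device AI in one sentence.")` returns the reply text, provided the server is listening at `BASE_URL`.
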
## Quality: GSM8K parity

Measured on GSM8K (n=100, greedy decoding, 0-shot chain-of-thought prompting that
asks for a final `#### <n>` line, with an identical prompt and answer extraction
for every row). The 4-bit MLX build serves as the known-good 4-bit control:

| Configuration | GSM8K |
|---|---|
| bf16 (reference) | 75% |
| MLX 4-bit (control) | 76% |
| **This model (LiteRT int4)** | **77%** |

LiteRT int4 is at parity with both the 4-bit control and bf16. The 1 to 2 point
spread is well within sampling noise: at n=100 and roughly 75% accuracy, the
standard error is about 4 percentage points. This is a direct-answering instruct
model (no `<think>` block), and it terminates cleanly at `<|endoftext|>`.

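
A minimal sketch of this kind of scoring is shown below: it extracts the final `#### <n>` answer and computes the binomial standard error of each accuracy. The regular expression and function names are assumptions for illustration, not the extraction code used to produce the table.

```python
import math
import re
from typing import Optional, Tuple

# Assumed convention: the final answer follows "####", as in GSM8K references.
ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_answer(text: str) -> Optional[str]:
    """Return the last '#### <n>' number in text with commas removed, or None."""
    matches = ANSWER_RE.findall(text)
    return matches[-1].replace(",", "") if matches else None


def accuracy_stderr(correct: int, total: int) -> Tuple[float, float]:
    """Return accuracy and its binomial standard error sqrt(p * (1 - p) / n)."""
    p = correct / total
    return p, math.sqrt(p * (1 - p) / total)


# Counts taken from the table above (n=100 per configuration).
for name, correct in [("bf16", 75), ("MLX 4-bit", 76), ("LiteRT int4", 77)]:
    acc, se = accuracy_stderr(correct, 100)
    print(f"{name}: {acc:.0%} +/- {se:.1%}")
```

Each row comes out with a standard error a little above 4 percentage points, so the three scores cannot be told apart at this sample size.
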
## Conversion

Converted with [`litert-torch`](https://github.com/google-ai-edge/litert) using a
**blockwise int4** recipe (INT4 weights, block size 128, symmetric) with embeddings
kept at INT8, a KV cache of 2048 tokens, and Falcon3's native chat template.
Falcon3-3B is a standard `LlamaForCausalLM` architecture, so it works with the
existing converter and runtime without custom code. Blockwise quantization stores
one scale per block of 128 weights rather than one per output channel, which
limits how far a single outlier weight can degrade the precision of its neighbors.
Using blockwise rather than channelwise int4 is what preserves reasoning accuracy.

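
The difference between the two schemes can be illustrated with a toy round-trip in plain Python. This is a sketch under simplifying assumptions (a symmetric grid of integers in [-7, 7], synthetic weights with one outlier), not the quantizer that `litert-torch` actually runs.

```python
import random


def quantize_symmetric_int4(values):
    """Round-trip values through symmetric int4 using one shared scale."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / 7.0  # assumed symmetric grid: integers in [-7, 7]
    return [max(-7, min(7, round(v / scale))) * scale for v in values]


def quantize_row(row, block_size=None):
    """Channelwise (one scale per row) if block_size is None, else per block."""
    if block_size is None:
        return quantize_symmetric_int4(row)
    out = []
    for start in range(0, len(row), block_size):
        out.extend(quantize_symmetric_int4(row[start:start + block_size]))
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
# One output channel: 1024 small synthetic weights plus a single large outlier.
row = [rng.gauss(0.0, 0.02) for _ in range(1024)]
row[5] = 1.0

channelwise_mse = mean_squared_error(row, quantize_row(row))
blockwise_mse = mean_squared_error(row, quantize_row(row, block_size=128))
print(f"channelwise MSE:   {channelwise_mse:.2e}")
print(f"blockwise-128 MSE: {blockwise_mse:.2e}")
```

With a single scale for the whole row, the outlier stretches the grid so far that most small weights round to zero. With blocks of 128, only the block containing the outlier pays that cost.
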
## Reproduce (official tools only)

Built with **stock `litert-torch`**, with no custom code and no graph patches. The
only non-default choice is the int4 recipe: the tool's default named int4 recipe is
*channelwise* (which degrades small models), so this build uses **blockwise-128**
(the scheme the official models ship with), passed as a recipe file to the standard
export:

```python
from litert_torch.generative.export_hf.export import export

export(
    model="tiiuae/Falcon3-3B-Instruct",
    output_dir="out",
    quantization_recipe="falcon_int4_block128.json",  # included in this repo
    cache_length=2048,  # KV cache length in tokens
    trust_remote_code=True,
)
```

`falcon_int4_block128.json` is included in this repo. (If the export fails because
the `ai_edge_quantizer/recipes/` directory is missing, create it as an empty
directory; this is a packaging gap in some releases that trips the `.json`-recipe
path.)

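
One way to create the missing directory is sketched below. The helper name `ensure_recipes_dir` is hypothetical, and it assumes the directory belongs directly under the installed `ai_edge_quantizer` package folder.

```python
from pathlib import Path


def ensure_recipes_dir(package_dir: Path) -> Path:
    """Create an empty `recipes/` directory under package_dir if it is missing."""
    recipes = package_dir / "recipes"
    recipes.mkdir(parents=True, exist_ok=True)
    return recipes


# Usage against the installed package (left commented out because it writes
# into site-packages):
# import ai_edge_quantizer
# ensure_recipes_dir(Path(ai_edge_quantizer.__file__).parent)
```
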
## License

Falcon LLM License (TII), inherited from the base model
[tiiuae/Falcon3-3B-Instruct](https://huggingface.co/tiiuae/Falcon3-3B-Instruct).
See <https://falconllm.tii.ae/falcon-terms-and-conditions.html>.