Falcon3-3B-Instruct: LiteRT-LM (blockwise int4)

tiiuae/Falcon3-3B-Instruct converted to the LiteRT-LM (.litertlm) format for on-device inference with Google's LiteRT-LM runtime (the engine behind the official litert-community/* models).

Text-only conversion (the Falcon3 decoder; no vision/audio towers).

File                  model.litertlm (~1.74 GB)
Quantization          int4 weights, blockwise (block size 128), symmetric; embeddings INT8
Compute               integer
Context (KV cache)    2048
Base model            tiiuae/Falcon3-3B-Instruct
Decode speed          ~27 tok/s (iPhone 17 Pro, Metal GPU); ~89 tok/s (Mac M4 Max, LiteRT-LM, greedy)

Usage

Run with the LiteRT-LM runtime:

# build litert-lm from https://github.com/google-ai-edge/litert-lm, then:
litert_lm_main \
  --model_path model.litertlm \
  --backend gpu \
  --input_prompt "Explain on-device AI in one sentence."
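
For scripted use, the same invocation can be wrapped from Python. This is a minimal sketch that only reuses the flags shown above; the helper names `build_command` and `run_prompt` are illustrative, and it assumes the binary is on `PATH` and writes the generated text to stdout (runtime logs may be mixed in).

```python
import shlex
import subprocess

def build_command(prompt, model_path="model.litertlm", backend="gpu",
                  binary="litert_lm_main"):
    """Build the argument list for the CLI invocation shown above."""
    return [
        binary,
        "--model_path", model_path,
        "--backend", backend,
        "--input_prompt", prompt,
    ]

def run_prompt(prompt, **kwargs):
    """Run the CLI once and return whatever it printed to stdout."""
    result = subprocess.run(build_command(prompt, **kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    cmd = build_command("Explain on-device AI in one sentence.")
    print(shlex.join(cmd))  # inspect the command before actually running it
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains quotes or newlines.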

The .litertlm bundle carries the tokenizer and the prompt template (Falcon3's native <|user|> / <|assistant|> format, stop token <|endoftext|>), so no separate tokenizer files are needed.

Quality: GSM8K parity

Measured on GSM8K (n=100, greedy, 0-shot chain-of-thought asking for #### <n>, identical prompt and answer-extraction for every row). The 4-bit MLX build is the known-good 4-bit control:

Configuration               GSM8K accuracy
bf16 (reference)            75%
MLX 4-bit (control)         76%
This model (LiteRT int4)    77%

LiteRT int4 is at parity: it matches or slightly exceeds both the 4-bit control and bf16 here. The 1 to 2 point spread is within sampling noise at n=100, where the standard error of a score near 76% is roughly 4 percentage points. This is a direct-answering instruct model (no <think> block), and it terminates cleanly at <|endoftext|>.
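
To put the sampling-noise remark in numbers, the sketch below computes the binomial standard error and a normal-approximation 95% interval for each reported score. It uses only the three accuracies listed above; the function name is illustrative and the normal approximation is a rough but common choice at this sample size.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Return (standard_error, low, high) for a binomial proportion.

    Uses the normal approximation to the binomial distribution.
    """
    se = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return se, accuracy - z * se, accuracy + z * se

# Scores reported above (n=100 GSM8K problems each).
scores = {"bf16": 0.75, "MLX 4-bit": 0.76, "LiteRT int4": 0.77}

for name, acc in scores.items():
    se, low, high = accuracy_interval(acc, 100)
    print(f"{name}: {acc:.0%} +/- {se:.1%} (95% CI {low:.1%} to {high:.1%})")
```

Each standard error comes out a little above 4 percentage points, so a 2-point gap between rows does not distinguish the configurations.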

Conversion

Converted with litert-torch using a blockwise int4 recipe (INT4 weights, block size 128, symmetric) with embeddings kept at INT8, KV cache 2048, and Falcon3's native chat template. Falcon3-3B is a standard LlamaForCausalLM architecture, so it rides the existing converter and runtime with no custom code. Blockwise (not channelwise) int4 is what preserves reasoning accuracy.
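
The difference between blockwise and channelwise scaling can be illustrated with a toy example. The sketch below is not the converter's implementation: it quantizes one synthetic weight row to symmetric int4 with a max-abs/7 scale (a common convention, assumed here), once with a single scale for the whole row and once with one scale per 128-value block, then compares reconstruction error.

```python
import random

def quantize_symmetric_int4(values):
    """Quantize floats to int4 with one shared scale, then dequantize."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Symmetric int4 codes live in [-8, 7]; clamp after rounding.
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return [c * scale for c in codes]

def quantize_row(row, block_size=None):
    """One scale for the whole row if block_size is None, else one per block."""
    if block_size is None:
        return quantize_symmetric_int4(row)
    out = []
    for start in range(0, len(row), block_size):
        out.extend(quantize_symmetric_int4(row[start:start + block_size]))
    return out

def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Synthetic weight row: mostly small values plus a single large outlier.
rng = random.Random(0)
row = [rng.gauss(0.0, 0.02) for _ in range(1024)]
row[5] = 0.5  # the outlier inflates the scale of whatever group contains it

channelwise_mse = mean_squared_error(row, quantize_row(row))
blockwise_mse = mean_squared_error(row, quantize_row(row, block_size=128))
print(f"channelwise MSE: {channelwise_mse:.3e}")
print(f"blockwise   MSE: {blockwise_mse:.3e}")
```

With a single scale per row, one outlier stretches the step size for all 1024 values; with blocks of 128, only the block holding the outlier pays that cost, which is the intuition behind preferring blockwise int4 for small models.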

Reproduce (official tools only)

Built with stock litert-torch: no custom code, no graph patches. The only non-default choice is the int4 recipe. The tool's default named int4 recipe is channelwise (which degrades small models), so this build uses blockwise-128 (the scheme the official models ship), passed as a recipe file to the standard export:

from litert_torch.generative.export_hf.export import export
export(
    model="tiiuae/Falcon3-3B-Instruct",
    output_dir="out",
    quantization_recipe="falcon_int4_block128.json",  # included in this repo
    cache_length=2048,
    trust_remote_code=True,
)

falcon_int4_block128.json is included in this repo. (If the export fails because the ai_edge_quantizer/recipes/ directory is missing, create it as an empty directory. This is a packaging gap in some releases that trips the .json-recipe path.)
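
One way to apply the workaround above without hunting for the install path by hand is sketched below. It assumes the missing directory is expected inside the installed `ai_edge_quantizer` package; the helper names are illustrative.

```python
import importlib.util
from pathlib import Path

def ensure_recipes_dir(package_dir):
    """Create an empty recipes/ directory inside package_dir if it is missing."""
    recipes = Path(package_dir) / "recipes"
    recipes.mkdir(parents=True, exist_ok=True)
    return recipes

def find_package_dir(package_name="ai_edge_quantizer"):
    """Return the installed package's directory, or None if it is not installed."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])

if __name__ == "__main__":
    package_dir = find_package_dir()
    if package_dir is None:
        print("ai_edge_quantizer is not installed in this environment")
    else:
        print(f"recipes directory ready at {ensure_recipes_dir(package_dir)}")
```

Run it once in the same environment used for the export; it does nothing if the directory already exists.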

License

Falcon LLM License (TII), inherited from the base model tiiuae/Falcon3-3B-Instruct. See https://falconllm.tii.ae/falcon-terms-and-conditions.html
