OLMo-2-1B-Instruct — LiteRT-LM (blockwise int4)

allenai/OLMo-2-0425-1B-Instruct converted to the LiteRT-LM (.litertlm) format for on-device inference with Google's LiteRT-LM runtime (the engine behind the official litert-community/* models).

OLMo-2 is AllenAI's fully-open model family (Apache-2.0; open weights, data, and training code). This 1B variant is small enough to run on a phone — verified on iPhone 17 Pro. Converted with the official upstream litert-torch — no fork.


| Property | Value |
|---|---|
| File | `model.litertlm` (~0.93 GB) |
| Quantization | int4 weights: blockwise (block size 32), symmetric, with OCTAV optimal clipping; embedding in INT8 |
| Compute | integer |
| Context (KV cache) | 4096 tokens |
| Base model | allenai/OLMo-2-0425-1B-Instruct |
| Decode speed | ~24 tok/s (iPhone 17 Pro; loads in 5.2 s, ~1.2 GB footprint); ~138 tok/s (Mac M-series, Metal GPU) |

Usage

Run with the LiteRT-LM runtime:

```
litert_lm_main \
  --model_path model.litertlm \
  --backend gpu \
  --input_prompt "Explain on-device AI in one sentence."
```
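
To run the same command over several prompts from a script, a minimal Python sketch can assemble the identical argument list. It assumes `litert_lm_main` is on `PATH` and uses only the three flags shown above. The helper names `build_command` and `run_prompt` are hypothetical and not part of LiteRT-LM.

```python
import shutil
import subprocess


def build_command(prompt, model_path="model.litertlm", backend="gpu"):
    """Return the argv list for the litert_lm_main call shown above."""
    return [
        "litert_lm_main",
        "--model_path", model_path,
        "--backend", backend,
        "--input_prompt", prompt,
    ]


def run_prompt(prompt, **kwargs):
    """Run one prompt and return the runner's stdout (hypothetical helper)."""
    # Assumption: the binary has been built or installed and is on PATH.
    if shutil.which("litert_lm_main") is None:
        raise FileNotFoundError("litert_lm_main not found on PATH")
    result = subprocess.run(
        build_command(prompt, **kwargs),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when a prompt contains quotes or spaces.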

Run on Android

The easiest way to try this model on a phone is the official Google AI Edge Gallery app:

1. Install a recent Gallery build (package com.google.ai.edge.gallery; APK from the repo's releases). Version 1.0.15 or later supports .litertlm.
2. Download model.litertlm and push it to the device:
   ```
   adb push model.litertlm /sdcard/Download/
   ```
3. In the app, tap + (bottom-right), pick the file, and choose CPU or GPU. At ~0.93 GB, this 1B model fits comfortably on an 8 GB phone.
4. Chat. The bundle already carries the tokenizer and the OLMo-2 prompt template.

See the Gallery Importing Local Models guide for details. To embed it in your own Android app, use the LiteRT-LM Kotlin API (com.google.ai.edge.litertlm:litertlm-android).

Quality — GSM8K

Measured on GSM8K (n=100, greedy, 0-shot chain-of-thought, identical prompt and answer-extraction for every row).

| Configuration | GSM8K |
|---|---|
| bf16 (reference) | 72.0% |
| This model: LiteRT int4 (BOCTAV4) | 63.0% |

63% is a strong, coherent, non-degenerate score for a 1B model (the \boxed{}-style answers terminate cleanly at <|endoftext|>). At 1B, 4-bit quantization costs ~9 pt relative to bf16: a small model has less redundancy to absorb int4 rounding than a 3B+ model, where the same recipe reaches parity. An int8 build recovers only ~2 pt (65%) at +60% size, so int4 is shipped as the best size/quality trade-off for on-device use.
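
As a rough guide to reading these numbers, the sketch below recomputes the gaps between builds and attaches a binomial standard error to each score. It uses only the figures quoted above; the normal-approximation standard error is an assumption added here for illustration and is not part of the original evaluation. At n=100 one standard error is about 4 to 5 points, so a difference of a couple of points is hard to distinguish from sampling noise.

```python
import math


def standard_error_pt(accuracy_pct, n):
    """Binomial standard error of an accuracy, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


N = 100  # GSM8K problems evaluated, as stated above
scores = {"bf16": 72.0, "int4": 63.0, "int8": 65.0}  # accuracies quoted above

for name, acc in scores.items():
    print(f"{name}: {acc:.1f}% +/- {standard_error_pt(acc, N):.1f} pt (1 SE)")

print(f"int4 vs bf16: {scores['int4'] - scores['bf16']:+.1f} pt")
print(f"int8 vs int4: {scores['int8'] - scores['int4']:+.1f} pt")
```

All rows were scored on the same problems, so a paired comparison would be tighter than these independent-sample error bars suggest.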

Conversion

Converted with the official upstream litert-torch export_hf (clean git worktree at upstream/main, dev-fork patches excluded). Olmo2ForCausalLM rides the stock converter with no custom code: QK-norm and OLMo-2's reordered post-norm lower to generic ops. The int4 recipe is blockwise (block 32) + OCTAV with the embedding at INT8.
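
To illustrate what "blockwise (block 32), symmetric" means, here is a minimal pure-Python sketch of per-block int4 quantization. It is not the litert-torch implementation: it sets each block's scale from the plain max-abs value, whereas OCTAV selects a clipping threshold, and that search is left out. The function names are hypothetical.

```python
BLOCK = 32           # weights per quantization block, as in this model's recipe
QMIN, QMAX = -8, 7   # signed int4 range


def quantize_blockwise(weights, block=BLOCK):
    """Symmetric per-block int4: one scale per block, no zero point.

    Uses a plain max-abs scale. OCTAV would instead choose a clipping
    threshold; that step is NOT implemented here.
    """
    codes, scales = [], []
    for start in range(0, len(weights), block):
        chunk = weights[start:start + block]
        max_abs = max(abs(w) for w in chunk)
        scale = max_abs / QMAX if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer code and clamp to the int4 range.
        codes.extend(max(QMIN, min(QMAX, round(w / scale))) for w in chunk)
    return codes, scales


def dequantize_blockwise(codes, scales, block=BLOCK):
    """Map int4 codes back to floats using each block's scale."""
    return [c * scales[i // block] for i, c in enumerate(codes)]


if __name__ == "__main__":
    import random

    random.seed(0)
    w = [random.gauss(0.0, 0.02) for _ in range(128)]  # toy weights
    q, s = quantize_blockwise(w)
    w_hat = dequantize_blockwise(q, s)
    worst = max(abs(a - b) for a, b in zip(w, w_hat))
    print(f"{len(s)} blocks, worst abs error {worst:.5f}")
```

With a max-abs scale, the rounding error of every weight is bounded by half of its block's scale, which is why smaller blocks track outliers more closely, at the cost of storing more scales.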

License

Apache-2.0, inherited from the base model allenai/OLMo-2-0425-1B-Instruct.
