DictaLM-3.0-1.7B – Android / Google AI Edge Bundles

On-device Hebrew LLM bundles for Android using Google AI Edge. These convert DictaLM-3.0-1.7B-Thinking and DictaLM-3.0-1.7B-Instruct (Hebrew LLMs fine-tuned from Qwen3-1.7B by Dicta) into the .litertlm format, which runs locally on Android with no internet connection or cloud API required.


Files

File                                                 Size    Variant   Quant         Template
dictalm-3.0-1.7b-thinking-float16-thinking.litertlm  3.3 GB  Thinking  float16       thinking enabled
dictalm-3.0-1.7b-thinking-dynamic_int8.litertlm      1.7 GB  Thinking  dynamic INT8  no-think (fast)
dictalm-3.0-1.7b-thinking-int4.litertlm              852 MB  Thinking  INT4          no-think (fast)
dictalm-3.0-1.7b-thinking-float16.litertlm           3.3 GB  Thinking  float16       no-think (fast)
dictalm-3.0-1.7b-instruct-float16.litertlm           3.3 GB  Instruct  float16       –
dictalm-3.0-1.7b-instruct-dynamic_int8.litertlm      1.7 GB  Instruct  dynamic INT8  –
dictalm-3.0-1.7b-instruct-int4.litertlm              852 MB  Instruct  INT4          –

Which to use?

  • Best quality + visible reasoning: thinking-float16-thinking – the model shows its thought process before answering
  • Best quality, no reasoning overhead: thinking-float16 or instruct-float16
  • Balanced (recommended): thinking-dynamic_int8 – good Hebrew quality, fits comfortably in RAM
  • Fastest / smallest: thinking-int4 or instruct-int4
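The choice above can be reduced to a simple rule based on free RAM. A minimal sketch (the helper `pick_variant` and its thresholds, taken from the Device Requirements section, are illustrative, not part of any shipped API):

```python
def pick_variant(free_ram_gb: float, want_reasoning: bool = False) -> str:
    """Suggest a .litertlm bundle for the given free RAM (GB)."""
    if free_ram_gb >= 8:
        # float16 gives the best Hebrew quality; only the -thinking
        # template additionally makes the model emit its <think> block.
        return ("dictalm-3.0-1.7b-thinking-float16-thinking.litertlm"
                if want_reasoning
                else "dictalm-3.0-1.7b-thinking-float16.litertlm")
    if free_ram_gb >= 5:
        return "dictalm-3.0-1.7b-thinking-dynamic_int8.litertlm"
    if free_ram_gb >= 3:
        return "dictalm-3.0-1.7b-thinking-int4.litertlm"
    raise ValueError("Under 3 GB free RAM even the INT4 bundle may not fit")
```

On a device like the tested Pixel 10 (12 GB RAM), any variant fits; the thresholds only matter on lower-end phones.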

Quick Start β€” Google AI Edge Gallery (Android)

  1. Download a .litertlm file from this repo
  2. Install Google AI Edge Gallery on your Android device
  3. Open the app → Add Model → select your .litertlm file
  4. Start chatting in Hebrew

Device Requirements

Spec     Minimum
RAM      3 GB free (INT4) / 5 GB free (INT8) / 8 GB free (float16)
Storage  1 GB (INT4) / 2 GB (INT8) / 4 GB (float16)
OS       Android 10+
Runtime  Google AI Edge Gallery

Tested on Pixel 10 (12 GB RAM).


Thinking Mode

The *-thinking-float16-thinking.litertlm bundle embeds a Jinja chat template that starts generation with <|im_start|>assistant\n<think>\n, which prompts the model to reason before answering. The <think>...</think> block will appear as raw text in the app output.

The other Thinking-variant bundles embed <think>\n\n</think> in the generation prompt to suppress thinking for faster responses.

The Instruct variant has no thinking capability.
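The two Thinking-variant template styles differ only in how the generation prompt ends. A minimal plain-Python sketch of that tail (the real bundles render this via the embedded Jinja template and also prepend the bundled system prompt; the exact trailing whitespace in no-think mode is an assumption):

```python
def generation_prompt(user_msg: str, thinking: bool) -> str:
    # ChatML-style Qwen3 turn, illustration only.
    prompt = f"<|im_start|>user\n{user_msg}<|im_end|>\n<|im_start|>assistant\n"
    if thinking:
        prompt += "<think>\n"            # open block: model writes reasoning
    else:
        prompt += "<think>\n\n</think>\n"  # pre-closed block: reasoning skipped
    return prompt
```

In thinking mode the model continues inside the open <think> block and emits </think> itself before the final answer, which is why the raw tags show up in the app output.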


Hebrew Language

All bundles embed a bilingual (English + Hebrew) system prompt:

You are a helpful Hebrew AI assistant named DictaLM. Respond in Hebrew unless explicitly asked otherwise.
אתה עוזר AI שימושי בעברית בשם DictaLM. השב בעברית אלא אם ביקשו ממך אחרת.

How These Were Built

Architecture

DictaLM-3.0-1.7B is a fine-tune of Qwen3-1.7B:

Property           Value
Architecture       Qwen3ForCausalLM
Layers             28
Hidden size        2048
Heads (Q/KV)       16 / 8 (GQA)
Head dim           128
Intermediate size  6144
Vocab size         151,936
Tokenizer          HuggingFace tiktoken
Tie embeddings     True
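The config above can be sanity-checked against the "1.7B" name with a quick back-of-the-envelope parameter count (GQA attention, gated MLP, tied embeddings; norm weights omitted as negligible):

```python
# Rough parameter count from the architecture table above.
layers, hidden, inter = 28, 2048, 6144
q_heads, kv_heads, head_dim = 16, 8, 128
vocab = 151_936

embed = vocab * hidden                        # shared with lm_head (tied)
attn = hidden * (q_heads * head_dim) * 2      # q_proj + o_proj
attn += hidden * (kv_heads * head_dim) * 2    # k_proj + v_proj (GQA)
mlp = hidden * inter * 3                      # gate_proj, up_proj, down_proj
total = embed + layers * (attn + mlp)
print(f"~{total / 1e9:.2f}B parameters")      # ~1.72B
```

This lands at roughly 1.72B parameters, consistent with the Qwen3-1.7B base.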

litert-torch 0.8.0 already includes qwen3.build_1_7b_model(), which exactly matches this config, so no custom model builder was needed.

Conversion Pipeline

HuggingFace safetensors
        ↓
litert-torch (Strategy 1: native Qwen3 builder)
        ↓
TFLite with prefill/decode KV-cache signatures
        ↓
bundle_litertlm.py (LlmMetadata proto + HF tokenizer + Jinja template)
        ↓
.litertlm (Google AI Edge runtime format)

Key detail: lm_head.weight is absent from the checkpoint (tie_word_embeddings=True), so a custom weight loader copies it from model.embed_tokens.weight before conversion.
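The tied-embedding fix amounts to filling the missing key in the state dict before handing it to the converter. A sketch with a toy matrix standing in for the real 151,936 × 2048 tensor (whether the loader copies or aliases the tensor is an implementation detail; copying is shown here):

```python
# Checkpoint as loaded from safetensors: no lm_head.weight present
# because tie_word_embeddings=True. Toy 4x3 matrix for illustration.
state_dict = {
    "model.embed_tokens.weight": [[0.1, 0.2, 0.3],
                                  [0.4, 0.5, 0.6],
                                  [0.7, 0.8, 0.9],
                                  [1.0, 1.1, 1.2]],
}

if "lm_head.weight" not in state_dict:
    # Copy (rather than alias) so downstream per-tensor quantization
    # can treat the two tensors independently.
    state_dict["lm_head.weight"] = [row[:] for row in state_dict["model.embed_tokens.weight"]]
```

After this step the converter sees a complete Qwen3-shaped checkpoint.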

Reproduce a Build

Requirements: ~32 GB RAM, Python 3.12, litert-torch==0.8.0, LiteRT-LM builder

# Install dependencies
pip install litert-torch==0.8.0 mediapipe transformers safetensors

# Clone LiteRT-LM (needed by bundle_litertlm.py)
git clone --depth=1 https://github.com/google-ai-edge/LiteRT-LM /tmp/litert-lm
# Build flatbuffer + proto bindings (see LiteRT-LM docs)

# Download model
huggingface-cli download dicta-il/DictaLM-3.0-1.7B-Thinking \
  --local-dir ./dictalm-3.0-1.7b-thinking

# Convert to TFLite (Strategy 1: KV-cache prefill/decode)
python scripts/convert_dictalm_android.py \
  --model-dir ./dictalm-3.0-1.7b-thinking \
  --tflite-dir ./tflite_output/thinking_fp16 \
  --quantize float16 \
  --prefill-seq-len 1024 --kv-cache-max-len 1024 \
  --skip-task-bundle

# Bundle as .litertlm with thinking enabled
python scripts/bundle_litertlm.py \
  --tflite ./tflite_output/thinking_fp16/*.tflite \
  --tokenizer ./dictalm-3.0-1.7b-thinking/tokenizer.json \
  --tokenizer-type hf \
  --model-type qwen3 \
  --thinking \
  --output ./dictalm-3.0-1.7b-thinking-float16-thinking.litertlm \
  --quant float16

# Or without thinking (no-think mode, faster):
python scripts/bundle_litertlm.py \
  --tflite ./tflite_output/thinking_fp16/*.tflite \
  --tokenizer ./dictalm-3.0-1.7b-thinking/tokenizer.json \
  --tokenizer-type hf \
  --model-type qwen3 \
  --output ./dictalm-3.0-1.7b-thinking-float16.litertlm \
  --quant float16

Quantization Options

--quantize    Converter method                      Approx size
float16       fp16                                  3.3 GB
dynamic_int8  dynamic_int8 (weights + activations)  1.7 GB
int8          weight_only_int8                      1.7 GB
int4          dynamic_int4_block128                 852 MB
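The sizes in this table follow almost directly from the ~1.72B parameter count times bytes per weight. A rough estimate (ignoring block-quant scale overhead, tokenizer, and bundle metadata, which is why the figures differ from the table by a few percent):

```python
params = 1.72e9  # approximate DictaLM-3.0-1.7B parameter count

# Bytes per weight for each converter method; block-wise INT4 packs
# two weights per byte.
bytes_per_weight = {"float16": 2.0, "dynamic_int8": 1.0, "int8": 1.0, "int4": 0.5}

for quant, bpw in bytes_per_weight.items():
    print(f"{quant:>12}: ~{params * bpw / 1e9:.2f} GB")
```

This predicts roughly 3.4 GB / 1.7 GB / 0.86 GB, in line with the 3.3 GB, 1.7 GB, and 852 MB files shipped here.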

Scripts

See scripts/ folder:

Script                        Purpose
convert_dictalm_android.py    Convert DictaLM safetensors → TFLite with KV cache
bundle_litertlm.py            Bundle TFLite + HF tokenizer + LlmMetadata → .litertlm
bundle_and_upload_dictalm.sh  Automation: monitor conversions, bundle, upload to HF

License

Model weights: Qwen License / DictaLM terms
Conversion scripts: Apache 2.0

Model tree for barakplasma/dictalm-3.0-1.7b-thinking-android
