DictaLM-3.0-1.7B – Android / Google AI Edge Bundles

On-device Hebrew LLM bundles for Android using Google AI Edge. These convert DictaLM-3.0-1.7B-Thinking and DictaLM-3.0-1.7B-Instruct (Hebrew LLMs fine-tuned from Qwen3-1.7B by Dicta) into the .litertlm format, which runs locally on Android with no internet connection or cloud API required.


Files

File                                                 Size    Variant   Quant         Template
dictalm-3.0-1.7b-thinking-float16-thinking.litertlm  3.3 GB  Thinking  float16       thinking enabled
dictalm-3.0-1.7b-thinking-dynamic_int8.litertlm      1.7 GB  Thinking  dynamic INT8  no-think (fast)
dictalm-3.0-1.7b-thinking-int4.litertlm              852 MB  Thinking  INT4          no-think (fast)
dictalm-3.0-1.7b-thinking-float16.litertlm           3.3 GB  Thinking  float16       no-think (fast)
dictalm-3.0-1.7b-instruct-float16.litertlm           3.3 GB  Instruct  float16       –
dictalm-3.0-1.7b-instruct-dynamic_int8.litertlm      1.7 GB  Instruct  dynamic INT8  –
dictalm-3.0-1.7b-instruct-int4.litertlm              852 MB  Instruct  INT4          –

Which to use?

  • Best quality + visible reasoning: thinking-float16-thinking – the model shows its thought process before answering
  • Best quality, no reasoning overhead: thinking-float16 or instruct-float16
  • Balanced (recommended): thinking-dynamic_int8 – good Hebrew quality, fits comfortably in RAM
  • Fastest / smallest: thinking-int4 or instruct-int4
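The choice above can be reduced to a simple rule based on free RAM. A minimal sketch (the helper `pick_variant` and its thresholds, taken from the Device Requirements section, are illustrative, not part of any shipped API):

```python
def pick_variant(free_ram_gb: float, want_reasoning: bool = False) -> str:
    """Suggest a .litertlm bundle for the given free RAM (GB)."""
    if free_ram_gb >= 8:
        # float16 gives the best Hebrew quality; only the -thinking
        # template additionally makes the model emit its <think> block.
        return ("dictalm-3.0-1.7b-thinking-float16-thinking.litertlm"
                if want_reasoning
                else "dictalm-3.0-1.7b-thinking-float16.litertlm")
    if free_ram_gb >= 5:
        return "dictalm-3.0-1.7b-thinking-dynamic_int8.litertlm"
    if free_ram_gb >= 3:
        return "dictalm-3.0-1.7b-thinking-int4.litertlm"
    raise ValueError("Under 3 GB free RAM even the INT4 bundle may not fit")
```

On a device like the tested Pixel 10 (12 GB RAM), any variant fits; the thresholds only matter on lower-end phones.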

Quick Start β€” Google AI Edge Gallery (Android)

  1. Download a .litertlm file from this repo
  2. Install Google AI Edge Gallery on your Android device
  3. Open the app → Add Model → select your .litertlm file
  4. Start chatting in Hebrew

Device Requirements

Spec     Minimum
RAM      3 GB free (INT4) / 5 GB free (INT8) / 8 GB free (float16)
Storage  1 GB (INT4) / 2 GB (INT8) / 4 GB (float16)
OS       Android 10+
Runtime  Google AI Edge Gallery

Tested on Pixel 10 (12 GB RAM).


Thinking Mode

The *-thinking-float16-thinking.litertlm bundle embeds a Jinja chat template that starts generation with <|im_start|>assistant\n<think>\n, which prompts the model to reason before answering. The <think>...</think> block will appear as raw text in the app output.

The other Thinking-variant bundles embed <think>\n\n</think> in the generation prompt to suppress thinking for faster responses.

The Instruct variant has no thinking capability.
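The two Thinking-variant template styles differ only in how the generation prompt ends. A minimal plain-Python sketch of that tail (the real bundles render this via the embedded Jinja template and also prepend the bundled system prompt; the exact trailing whitespace in no-think mode is an assumption):

```python
def generation_prompt(user_msg: str, thinking: bool) -> str:
    # ChatML-style Qwen3 turn, illustration only.
    prompt = f"<|im_start|>user\n{user_msg}<|im_end|>\n<|im_start|>assistant\n"
    if thinking:
        prompt += "<think>\n"            # open block: model writes reasoning
    else:
        prompt += "<think>\n\n</think>\n"  # pre-closed block: reasoning skipped
    return prompt
```

In thinking mode the model continues inside the open <think> block and emits </think> itself before the final answer, which is why the raw tags show up in the app output.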


Hebrew Language

All bundles embed a bilingual (English + Hebrew) system prompt:

You are a helpful Hebrew AI assistant named DictaLM. Respond in Hebrew unless explicitly asked otherwise.
אתה עוזר AI שימושי בעברית בשם DictaLM. השב בעברית אלא אם ביקשו ממך אחרת.

How These Were Built

Architecture

DictaLM-3.0-1.7B is a fine-tune of Qwen3-1.7B:

Property           Value
Architecture       Qwen3ForCausalLM
Layers             28
Hidden size        2048
Heads (Q/KV)       16 / 8 (GQA)
Head dim           128
Intermediate size  6144
Vocab size         151,936
Tokenizer          HuggingFace tiktoken
Tie embeddings     True
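The config above can be sanity-checked against the "1.7B" name with a quick back-of-the-envelope parameter count (GQA attention, gated MLP, tied embeddings; norm weights omitted as negligible):

```python
# Rough parameter count from the architecture table above.
layers, hidden, inter = 28, 2048, 6144
q_heads, kv_heads, head_dim = 16, 8, 128
vocab = 151_936

embed = vocab * hidden                        # shared with lm_head (tied)
attn = hidden * (q_heads * head_dim) * 2      # q_proj + o_proj
attn += hidden * (kv_heads * head_dim) * 2    # k_proj + v_proj (GQA)
mlp = hidden * inter * 3                      # gate_proj, up_proj, down_proj
total = embed + layers * (attn + mlp)
print(f"~{total / 1e9:.2f}B parameters")      # ~1.72B
```

This lands at roughly 1.72B parameters, consistent with the Qwen3-1.7B base.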

litert-torch 0.8.0 already includes qwen3.build_1_7b_model(), which exactly matches this config, so no custom model builder was needed.

Conversion Pipeline

HuggingFace safetensors
        ↓
litert-torch (Strategy 1: native Qwen3 builder)
        ↓
TFLite with prefill/decode KV-cache signatures
        ↓
bundle_litertlm.py (LlmMetadata proto + HF tokenizer + Jinja template)
        ↓
.litertlm (Google AI Edge runtime format)

Key detail: lm_head.weight is absent from the checkpoint (tie_word_embeddings=True), so a custom weight loader copies it from model.embed_tokens.weight before conversion.
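The tied-embedding fix amounts to filling the missing key in the state dict before handing it to the converter. A sketch with a toy matrix standing in for the real 151,936 × 2048 tensor (whether the loader copies or aliases the tensor is an implementation detail; copying is shown here):

```python
# Checkpoint as loaded from safetensors: no lm_head.weight present
# because tie_word_embeddings=True. Toy 4x3 matrix for illustration.
state_dict = {
    "model.embed_tokens.weight": [[0.1, 0.2, 0.3],
                                  [0.4, 0.5, 0.6],
                                  [0.7, 0.8, 0.9],
                                  [1.0, 1.1, 1.2]],
}

if "lm_head.weight" not in state_dict:
    # Copy (rather than alias) so downstream per-tensor quantization
    # can treat the two tensors independently.
    state_dict["lm_head.weight"] = [row[:] for row in state_dict["model.embed_tokens.weight"]]
```

After this step the converter sees a complete Qwen3-shaped checkpoint.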

Reproduce a Build

Requirements: ~32 GB RAM, Python 3.12, litert-torch==0.8.0, LiteRT-LM builder

# Install dependencies
pip install litert-torch==0.8.0 mediapipe transformers safetensors

# Clone LiteRT-LM (needed by bundle_litertlm.py)
git clone --depth=1 https://github.com/google-ai-edge/LiteRT-LM /tmp/litert-lm
# Build flatbuffer + proto bindings (see LiteRT-LM docs)

# Download model
huggingface-cli download dicta-il/DictaLM-3.0-1.7B-Thinking \
  --local-dir ./dictalm-3.0-1.7b-thinking

# Convert to TFLite (Strategy 1: KV-cache prefill/decode)
python scripts/convert_dictalm_android.py \
  --model-dir ./dictalm-3.0-1.7b-thinking \
  --tflite-dir ./tflite_output/thinking_fp16 \
  --quantize float16 \
  --prefill-seq-len 1024 --kv-cache-max-len 1024 \
  --skip-task-bundle

# Bundle as .litertlm with thinking enabled
python scripts/bundle_litertlm.py \
  --tflite ./tflite_output/thinking_fp16/*.tflite \
  --tokenizer ./dictalm-3.0-1.7b-thinking/tokenizer.json \
  --tokenizer-type hf \
  --model-type qwen3 \
  --thinking \
  --output ./dictalm-3.0-1.7b-thinking-float16-thinking.litertlm \
  --quant float16

# Or without thinking (no-think mode, faster):
python scripts/bundle_litertlm.py \
  --tflite ./tflite_output/thinking_fp16/*.tflite \
  --tokenizer ./dictalm-3.0-1.7b-thinking/tokenizer.json \
  --tokenizer-type hf \
  --model-type qwen3 \
  --output ./dictalm-3.0-1.7b-thinking-float16.litertlm \
  --quant float16

Quantization Options

--quantize    Converter method                      Approx size
float16       fp16                                  3.3 GB
dynamic_int8  dynamic_int8 (weights + activations)  1.7 GB
int8          weight_only_int8                      1.7 GB
int4          dynamic_int4_block128                 852 MB
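The sizes in this table follow almost directly from the ~1.72B parameter count times bytes per weight. A rough estimate (ignoring block-quant scale overhead, tokenizer, and bundle metadata, which is why the figures differ from the table by a few percent):

```python
params = 1.72e9  # approximate DictaLM-3.0-1.7B parameter count

# Bytes per weight for each converter method; block-wise INT4 packs
# two weights per byte.
bytes_per_weight = {"float16": 2.0, "dynamic_int8": 1.0, "int8": 1.0, "int4": 0.5}

for quant, bpw in bytes_per_weight.items():
    print(f"{quant:>12}: ~{params * bpw / 1e9:.2f} GB")
```

This predicts roughly 3.4 GB / 1.7 GB / 0.86 GB, in line with the 3.3 GB, 1.7 GB, and 852 MB files shipped here.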

Scripts

See scripts/ folder:

Script                        Purpose
convert_dictalm_android.py    Convert DictaLM safetensors → TFLite with KV cache
bundle_litertlm.py            Bundle TFLite + HF tokenizer + LlmMetadata → .litertlm
bundle_and_upload_dictalm.sh  Automation: monitor conversions, bundle, upload to HF

License

Model weights: Qwen License / DictaLM terms
Conversion scripts: Apache 2.0

Model tree for barakplasma/dictalm-3.0-1.7b-thinking-android
