barakplasma committed on
Commit cd2eec1 · verified · 1 Parent(s): 9d6b376

Upload README.md with huggingface_hub

Files changed (1): README.md +40 -36

README.md CHANGED
@@ -19,20 +19,18 @@ On-device translation model for Android using [Google AI Edge](https://ai.google
  Converts [google/translategemma-4b-it](https://huggingface.co/google/translategemma-4b-it) (55 languages, 4B params)
  into formats that run locally on Android without internet or cloud APIs.
 
- Google only publishes WebGPU-only TFLite files. This repo bridges that gap with CPU/XNNPACK-compatible bundles
- in both `.litertlm` (LiteRT-LM, recommended) and `.task` (MediaPipe, legacy) formats.
 
  ---
 
  ## Files
 
- | File | Format | Size | Notes |
- |------|--------|------|-------|
- | `artifacts/int4/translategemma-4b-it-native-int4.litertlm` | LiteRT-LM | ~2 GB | INT4 weight-only, KV-cache, Jinja template embedded |
- | `artifacts/dynamic_int8/translategemma-4b-it-native-dynamic_int8.litertlm` | LiteRT-LM | ~4 GB | Dynamic INT8 *(uploading)* |
- | `artifacts/int4/translategemma-4b-it-native-int4.task` | MediaPipe | ~2 GB | INT4, KV-cache |
 
- **Start with `dynamic_int8`** — better translation quality than INT4. Use INT4 if RAM is tight.
 
  ---
@@ -41,27 +39,34 @@ in both `.litertlm` (LiteRT-LM, recommended) and `.task` (MediaPipe, legacy) for
  1. Download a `.litertlm` file above
  2. Open [Google AI Edge Gallery](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery)
  3. Import the model → select your `.litertlm` file
- 4. Use **Prompt Lab** mode for best results (see below)
 
- ### Prompt Lab mode (recommended)
 
- Set this as your **System Prompt**, then type text to translate in the input box:
 
  ```
- <start_of_turn>user
- You are a professional English (en) to Spanish (es) translator. Your goal is to accurately convey the meaning and nuances of the original English text while adhering to Spanish grammar, vocabulary, and cultural sensitivities.
- Produce only the Spanish translation, without any additional explanations or commentary. Please translate the following English text into Spanish:
 
- {{input}}<end_of_turn>
- <start_of_turn>model
  ```
 
- For other language pairs, replace `English (en)` / `Spanish (es)` with your source and target language.
-
- ### AI Chat mode
 
- The `.litertlm` bundles have an embedded chat template. Just type your text — the model will attempt to translate it. Quality may vary since the app doesn't know source/target languages without explicit instructions.
 
  ---
@@ -69,24 +74,23 @@ The `.litertlm` bundles have an embedded chat template. Just type your text —
 
  | Spec | Minimum |
  |------|---------|
- | RAM | 6 GB free (INT4) / 8 GB free (INT8) |
- | Storage | 2 GB (INT4) / 4 GB (INT8) |
  | OS | Android 10+ |
  | Runtime | Google AI Edge Gallery or LiteRT-LM SDK |
 
- Tested on Pixel 10 (12 GB RAM). Both INT4 and INT8 load without "No KV cache" errors.
-
  ---
 
  ## What's Different From Google's Official Files
 
  Google's official TranslateGemma TFLite files target **WebGPU only** — they don't work with MediaPipe LLM inference on Android CPU.
 
- This repo's files use **Strategy 1** native conversion via `litert-torch` with a custom `build_translategemma_4b()` builder that:
- - Produces proper **prefill + decode signatures** with KV cache (required by MediaPipe / LiteRT-LM)
  - Uses the correct architecture: 34 layers, 2560 dim, 8 heads, 4 KV heads, sliding-window + global every 6th layer
  - Handles the `language_model.` weight prefix in TranslateGemma's multimodal safetensors
- - Quantizes weights natively during TFLite export (not post-hoc)
 
  ---
@@ -96,9 +100,9 @@ The `scripts/` folder contains the full conversion pipeline:
 
  | Script | Purpose |
  |--------|---------|
- | `scripts/convert_translategemma_android.py` | Single-quant conversion: Strategy 1 (litert-torch native) → Strategy 2 (generic fallback) |
- | `scripts/multi_quant_build_upload.py` | Batch conversion + upload for multiple quant levels |
- | `scripts/bundle_litertlm.py` | Bundle a TFLite + SentencePiece tokenizer into `.litertlm` with LlmMetadata |
 
  ### Reproduce a build
@@ -108,25 +112,25 @@ Requirements: ~128 GB RAM, Python 3.12, `litert-torch==0.8.0`
  # Clone LiteRT-LM builder (needed by bundle_litertlm.py)
  git clone --depth=1 https://github.com/google-ai-edge/LiteRT-LM /tmp/litert-lm
 
- pip install litert-torch==0.8.0 mediapipe transformers huggingface-hub flatc
 
  # Download model
  huggingface-cli download google/translategemma-4b-it --local-dir ./translategemma-4b-it
 
- # Convert to TFLite with KV cache (~10 min, needs ~128 GB RAM)
  python scripts/convert_translategemma_android.py \
    --model-dir ./translategemma-4b-it \
    --tflite-dir ./tflite_output/dynamic_int8 \
    --output-dir ./output \
- --task-file ./output/translategemma-4b-it-native-dynamic_int8.task \
    --quantize dynamic_int8 \
- --prefill-seq-len 1024 --kv-cache-max-len 1024
 
  # Bundle as .litertlm
  python scripts/bundle_litertlm.py \
    --tflite ./tflite_output/dynamic_int8/*.tflite \
    --tokenizer ./translategemma-4b-it/tokenizer.model \
- --output ./output/translategemma-4b-it-native-dynamic_int8.litertlm \
    --quant dynamic_int8
  ```
@@ -134,7 +138,7 @@ python scripts/bundle_litertlm.py \
 
  ## Supported Languages
 
- TranslateGemma supports 55 languages including Arabic, Chinese, French, German, Hindi, Japanese, Korean, Portuguese, Russian, Spanish, and more. See [google/translategemma-4b-it](https://huggingface.co/google/translategemma-4b-it) for the full list.
 
  ---
 
  Converts [google/translategemma-4b-it](https://huggingface.co/google/translategemma-4b-it) (55 languages, 4B params)
  into formats that run locally on Android without internet or cloud APIs.
 
+ Google only publishes WebGPU-only TFLite files. This repo bridges that gap with CPU/XNNPACK-compatible `.litertlm` bundles (LiteRT-LM format) that embed a chat template.
 
  ---
 
  ## Files
 
+ | File | Size | Notes |
+ |------|------|-------|
+ | `artifacts/int4-generic/translategemma-4b-it-int4-generic.litertlm` | ~2 GB | INT4 blockwise quant — faster, lower RAM |
+ | `artifacts/dynamic_int8-generic/translategemma-4b-it-dynamic_int8-generic.litertlm` | ~4 GB | Dynamic INT8 — better quality |
 
+ **Start with INT4** if you're unsure — it loads faster and uses less RAM. Use dynamic_int8 for better translation quality.
 
  ---
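The INT4 vs. dynamic_int8 recommendation above reduces to a simple rule of thumb. A minimal sketch, assuming the free-RAM thresholds from the System Requirements section (the `pick_quant` helper is hypothetical, not part of this repo):

```python
def pick_quant(free_ram_gb: float) -> str:
    """Suggest a quant level from available RAM: dynamic_int8 wants
    ~8 GB free (better quality); INT4 runs in ~6 GB (faster, smaller)."""
    return "dynamic_int8" if free_ram_gb >= 8 else "int4"

print(pick_quant(12))   # → dynamic_int8
print(pick_quant(6.5))  # → int4
```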
 
 
  1. Download a `.litertlm` file above
  2. Open [Google AI Edge Gallery](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery)
  3. Import the model → select your `.litertlm` file
+ 4. Use **AI Chat** mode
 
+ ### Input format
 
+ The embedded template supports structured input for any language pair:
 
  ```
+ <src>LANG</src><dst>LANG</dst><text>YOUR TEXT HERE</text>
+ ```
 
+ **Examples:**
 
+ ```
+ <src>he</src><dst>en</dst><text>שלום עולם</text>
+ ```
+ ```
+ <src>en</src><dst>he</dst><text>good morning</text>
+ ```
+ ```
+ <src>en</src><dst>fr</dst><text>hello world</text>
+ ```
+ ```
+ <src>ja</src><dst>en</dst><text>ありがとうございます</text>
  ```
 
+ Use standard ISO 639-1 language codes: `en`, `he`, `fr`, `es`, `de`, `ar`, `zh`, `ja`, `ko`, `ru`, `pt`, etc.
 
+ Plain text (no tags) is also accepted — the model will attempt translation based on context.
 
  ---
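App code that drives the model directly (for example through the LiteRT-LM SDK) can assemble the structured input above programmatically. An illustrative sketch — `format_prompt` is a hypothetical helper, not part of this repo:

```python
def format_prompt(src: str, dst: str, text: str) -> str:
    """Wrap text in the <src>/<dst>/<text> tags the embedded template expects.
    src/dst are ISO 639-1 codes, e.g. "en", "he", "fr"."""
    return f"<src>{src}</src><dst>{dst}</dst><text>{text}</text>"

print(format_prompt("en", "fr", "hello world"))
# → <src>en</src><dst>fr</dst><text>hello world</text>
```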
 
 
 
  | Spec | Minimum |
  |------|---------|
+ | RAM | 6 GB free (INT4) / 8 GB free (dynamic_int8) |
+ | Storage | 2 GB (INT4) / 4 GB (dynamic_int8) |
  | OS | Android 10+ |
  | Runtime | Google AI Edge Gallery or LiteRT-LM SDK |
 
  ---
 
  ## What's Different From Google's Official Files
 
  Google's official TranslateGemma TFLite files target **WebGPU only** — they don't work with MediaPipe LLM inference on Android CPU.
 
+ This repo's files use native conversion via `litert-torch` with a custom `build_translategemma_4b()` builder that:
+ - Produces proper **prefill + decode signatures** with KV cache (required by LiteRT-LM)
  - Uses the correct architecture: 34 layers, 2560 dim, 8 heads, 4 KV heads, sliding-window + global every 6th layer
+ - Sets `qkv_fused_interleaved=False` (critical — the wrong default caused garbage output in all early builds)
  - Handles the `language_model.` weight prefix in TranslateGemma's multimodal safetensors
+ - Embeds a generic Jinja chat template for any language pair via `<src>`/`<dst>`/`<text>` tags
 
  ---
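The "sliding-window + global every 6th layer" layout in the bullets above can be made concrete. The sketch below shows one reading of that pattern (1-based layer counting is an assumption, not taken from the converter code):

```python
NUM_LAYERS = 34  # TranslateGemma 4B decoder layers

# Sliding-window attention by default, global attention on every 6th layer.
layer_types = [
    "global" if layer % 6 == 0 else "sliding_window"
    for layer in range(1, NUM_LAYERS + 1)
]

print(layer_types.count("global"))          # → 5 (layers 6, 12, 18, 24, 30)
print(layer_types.count("sliding_window"))  # → 29
```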
 
 
 
  | Script | Purpose |
  |--------|---------|
+ | `scripts/convert_translategemma_android.py` | Single-quant conversion via the litert-torch native strategy |
+ | `scripts/bundle_litertlm.py` | Bundle a TFLite + SentencePiece tokenizer into `.litertlm` with an embedded Jinja template |
+ | `scripts/multi_quant_build_upload.py` | Batch conversion + Hugging Face upload |
 
  ### Reproduce a build
 
  # Clone LiteRT-LM builder (needed by bundle_litertlm.py)
  git clone --depth=1 https://github.com/google-ai-edge/LiteRT-LM /tmp/litert-lm
 
+ pip install litert-torch==0.8.0 mediapipe transformers huggingface-hub
 
  # Download model
  huggingface-cli download google/translategemma-4b-it --local-dir ./translategemma-4b-it
 
+ # Convert to TFLite with KV cache (~30-60 min, needs ~128 GB RAM)
  python scripts/convert_translategemma_android.py \
    --model-dir ./translategemma-4b-it \
    --tflite-dir ./tflite_output/dynamic_int8 \
    --output-dir ./output \
+ --task-file ./output/translategemma-4b-it-dynamic_int8.task \
    --quantize dynamic_int8 \
+ --prefill-seq-len 1024 --kv-cache-max-len 1024 --allow-no-token
 
  # Bundle as .litertlm
  python scripts/bundle_litertlm.py \
    --tflite ./tflite_output/dynamic_int8/*.tflite \
    --tokenizer ./translategemma-4b-it/tokenizer.model \
+ --output ./output/translategemma-4b-it-dynamic_int8-generic.litertlm \
    --quant dynamic_int8
  ```
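To build both quant levels, the convert and bundle commands above can be generated per level — roughly what `scripts/multi_quant_build_upload.py` automates. A minimal sketch under that assumption: it only assembles the command lines (the concrete TFLite filename stands in for the shell glob above) and does not execute them:

```python
def build_commands(quant: str) -> list[list[str]]:
    """Assemble convert + bundle command lines for one quant level."""
    convert = [
        "python", "scripts/convert_translategemma_android.py",
        "--model-dir", "./translategemma-4b-it",
        "--tflite-dir", f"./tflite_output/{quant}",
        "--output-dir", "./output",
        "--quantize", quant,
        "--prefill-seq-len", "1024", "--kv-cache-max-len", "1024",
    ]
    bundle = [
        "python", "scripts/bundle_litertlm.py",
        "--tflite", f"./tflite_output/{quant}/model.tflite",
        "--tokenizer", "./translategemma-4b-it/tokenizer.model",
        "--output", f"./output/translategemma-4b-it-{quant}-generic.litertlm",
        "--quant", quant,
    ]
    return [convert, bundle]

# Each conversion needs ~128 GB RAM, so run the levels sequentially,
# e.g. with subprocess.run(cmd, check=True).
for quant in ("int4", "dynamic_int8"):
    for cmd in build_commands(quant):
        print(" ".join(cmd))
```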
 
 
 
  ## Supported Languages
 
+ TranslateGemma supports 55 languages including Arabic, Chinese, French, German, Hebrew, Hindi, Japanese, Korean, Portuguese, Russian, Spanish, and more. See [google/translategemma-4b-it](https://huggingface.co/google/translategemma-4b-it) for the full list.
 
  ---