# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Python toolkit that converts Google's TranslateGemma 4B IT model (HuggingFace) into on-device inference bundles for Android. Google's official TFLite files only support WebGPU; this project produces CPU/XNNPACK-compatible `.litertlm` (LiteRT-LM) and `.task` (MediaPipe) files with proper KV-cache prefill/decode signatures.

## Common Commands

### Single quantization conversion (produces `.task`)

```bash
source /home/ubuntu/conv-venv/bin/activate

python convert_translategemma_android.py \
  --model-dir ./translategemma-4b-it \
  --tflite-dir ./tflite_output/dynamic_int8 \
  --output-dir ./output \
  --task-file ./output/translategemma-4b-it-native-dynamic_int8.task \
  --quantize dynamic_int8 \
  --prefill-seq-len 1024 --kv-cache-max-len 1024 --allow-no-token
```

Valid `--quantize` values: `none`, `dynamic_int8`, `int8`, `int4`, `float16`
Aliases accepted: `fp16`, `f16`, `i8`, `q8`, `i4`, `q4`, `fp32`
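
A minimal sketch of how such aliases could resolve to canonical values. `normalize_quant` and the exact mapping (in particular, `fp32` resolving to `none`) are assumptions for illustration, not the converter's actual code:

```python
# Hypothetical alias resolution; not taken from convert_translategemma_android.py.
ALIASES = {
    "fp16": "float16", "f16": "float16",
    "i8": "int8", "q8": "int8",
    "i4": "int4", "q4": "int4",
    "fp32": "none",  # assumption: full precision maps to "no quantization"
}
CANONICAL = {"none", "dynamic_int8", "int8", "int4", "float16"}


def normalize_quant(value: str) -> str:
    """Lower-case the flag value, resolve aliases, reject unknown names."""
    value = value.strip().lower()
    value = ALIASES.get(value, value)
    if value not in CANONICAL:
        raise ValueError(f"unsupported --quantize value: {value}")
    return value
```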

### Bundle a TFLite into `.litertlm` (recommended for Google AI Edge Gallery)

```bash
# Requires /tmp/litert-lm-pkg; see the setup block in bundle_litertlm.py if missing
python bundle_litertlm.py \
  --tflite ./tflite_output/dynamic_int8/*.tflite \
  --tokenizer ./translategemma-4b-it/tokenizer.model \
  --output ./output/translategemma-4b-it-native-dynamic_int8.litertlm \
  --quant dynamic_int8
```

### Bundle-only from existing TFLite (skip conversion)

```bash
python convert_translategemma_android.py \
  --bundle-only \
  --existing-tflite ./tflite_output/none/translategemma-4b-it-generic-none.tflite \
  --quantize int8 \
  --tflite-dir ./tflite_output/int8 \
  --output-dir ./output \
  --task-file ./output/translategemma-4b-it-int8.task
```

### Batch multi-quant build + HF upload

```bash
python multi_quant_build_upload.py \
  --model-dir ./translategemma-4b-it \
  --quants "int4,int8,dynamic_int8" \
  --repo-id barakplasma/translategemma-4b-it-android-task-quantized \
  --no-upload   # remove this flag to actually upload
```

## Architecture

### Three-script design

**`convert_translategemma_android.py`** runs a single conversion, trying three strategies in sequence:

1. **Strategy 1** (`strategy1_litert_native`), the preferred path: uses `litert-torch` with `build_translategemma_4b()`, a custom builder for the 4B architecture. Produces a proper KV-cache TFLite with prefill/decode signatures. Quantization is applied natively by the converter via `QUANT_MAP`:
   - `"int4"` → `"dynamic_int4_block128"` (blockwise INT4, ~2 GB)
   - `"int8"` → `"weight_only_int8"` (~4 GB)
   - `"dynamic_int8"` → `"dynamic_int8"` (~4 GB)
   - `"float16"` → `"fp16"` (~8 GB)

2. **Strategy 2** (`strategy2_generic`), the fallback: wraps the HF model in `LogitsOnlyWrapper` and exports via `ai_edge_torch.convert()`. Always exports float32. **Important**: this path outputs a flat `input_ids → logits` TFLite with NO KV cache, so it is NOT compatible with MediaPipe LLM inference.

3. **Strategy 3** (`strategy3_post_tflite_quantize`): runs only when Strategy 2 was used (never after Strategy 1); applies post-hoc weight quantization to the TFLite flatbuffer via `ai_edge_quantizer`. Does NOT add a KV cache but reduces file size.

**`bundle_litertlm.py`** takes a Strategy 1 TFLite plus a SentencePiece tokenizer and packages them into `.litertlm` format with an `LlmMetadata` proto (Gemma3 model type, embedded Jinja chat template, BOS/EOS stop tokens, 2K max tokens). Requires `/tmp/litert-lm-pkg/` with compiled FlatBuffers and proto Python bindings (see the script header for setup).

**`multi_quant_build_upload.py`** is the orchestrator: it invokes `convert_translategemma_android.py` as a subprocess per quant level, handles timeouts/signals, writes `output/quantization_summary.json` and `output/README.md`, and uploads artifacts to HuggingFace.

### Key function: `build_translategemma_4b()`

Critical: without this, `litert-torch 0.8.0` falls back to wrong-architecture builders (1B/270m). It hardcodes the correct config:
- 34 layers, embedding_dim=2560, 8 heads, 4 KV heads, head_dim=256, intermediate=10240
- Sliding window 1024; global attention at layers where `(idx + 1) % 6 == 0` (indices 5, 11, 17, 23, 29)
- RMS norm with `zero_centered=True`, per-head QK normalization (`q_norm`, `k_norm`)
- The custom loader strips the `language_model.` prefix from TranslateGemma's multimodal safetensors keys (standard Gemma3 safetensors don't have this prefix)
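
The two fiddliest rules above can be sketched in a few lines. These helper names are hypothetical, not the script's actual functions:

```python
# Illustration of the config rules described above (hypothetical helpers).
NUM_LAYERS = 34


def global_attention_layers(num_layers: int = NUM_LAYERS) -> list[int]:
    """Every 6th layer uses global attention; the rest use the 1024 sliding window."""
    return [idx for idx in range(num_layers) if (idx + 1) % 6 == 0]


def strip_language_model_prefix(key: str) -> str:
    """Map TranslateGemma's multimodal checkpoint keys onto plain Gemma3 keys."""
    prefix = "language_model."
    return key[len(prefix):] if key.startswith(prefix) else key
```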

### Output formats

| Format | Runtime | Notes |
|--------|---------|-------|
| `.litertlm` | LiteRT-LM / Google AI Edge Gallery | Recommended; embeds Jinja prompt template and LlmMetadata |
| `.task` | MediaPipe GenAI | Legacy; no embedded template, so the user must manually add `<start_of_turn>` tokens |

### Prompt format for on-device inference

TranslateGemma requires this exact format (it was trained with it):

```
<bos><start_of_turn>user
You are a professional English (en) to Spanish (es) translator...
Produce only the Spanish translation...Please translate:


{text}<end_of_turn>
<start_of_turn>model
```

In Google AI Edge Gallery **Prompt Lab** mode, paste this as the System Prompt with `{{input}}` as the placeholder. `.litertlm` files embed a simplified Jinja template for AI Chat mode.
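
For programmatic use, the template can be assembled with a small helper. `build_prompt` is illustrative only (not part of the toolkit), and the instruction text is taken as a parameter because its full wording is elided (`...`) above:

```python
# Hypothetical helper that fills in the prompt template shown above.
def build_prompt(instruction: str, text: str) -> str:
    """Wrap an instruction and source text in TranslateGemma's turn markers."""
    return (
        "<bos><start_of_turn>user\n"
        f"{instruction}\n"
        "\n\n"  # two blank lines before the source text, as in the template
        f"{text}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )
```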

### Runtime notes

- `conv-venv/`: virtualenv with all deps (`litert-torch==0.8.0`, `mediapipe`, `ai_edge_torch`, `transformers`)
- `/tmp/litert-lm-pkg/`: manually assembled package from the cloned LiteRT-LM repo, with compiled FlatBuffer (`flatc -p --gen-onefile`) and proto (`protoc`) Python bindings; required by `bundle_litertlm.py` at runtime; NOT persistent across reboots
- `/tmp/litert-lm/`: cloned `google-ai-edge/LiteRT-LM` repo (schema source for rebuilding the package)
- Conversion requires ~128 GB RAM; the 4B model loads ~46 GB
- `translategemma-4b-it/tokenizer.model` is the SentencePiece binary used by both the `.task` and `.litertlm` bundlers; `ensure_tokenizer_model()` auto-converts it from `tokenizer.json` if missing
- HuggingFace repo: `barakplasma/translategemma-4b-it-android-task-quantized`; upload token in the `HF_TOKEN` env var