barakplasma committed (verified) · Commit f1f7e94 · Parent: d94e0c9

Upload CLAUDE.md with huggingface_hub

Files changed (1): CLAUDE.md (+114 −0)
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Python toolkit that converts Google's TranslateGemma 4B IT model (HuggingFace) into on-device inference bundles for Android. Google's official TFLite files only support WebGPU; this project produces CPU/XNNPACK-compatible `.litertlm` (LiteRT-LM) and `.task` (MediaPipe) files with proper KV-cache prefill/decode signatures.

## Common Commands

### Single quantization conversion (produces `.task`)
```bash
source /home/ubuntu/conv-venv/bin/activate

python convert_translategemma_android.py \
  --model-dir ./translategemma-4b-it \
  --tflite-dir ./tflite_output/dynamic_int8 \
  --output-dir ./output \
  --task-file ./output/translategemma-4b-it-native-dynamic_int8.task \
  --quantize dynamic_int8 \
  --prefill-seq-len 1024 --kv-cache-max-len 1024 --allow-no-token
```

Valid `--quantize` values: `none`, `dynamic_int8`, `int8`, `int4`, `float16`.
Aliases accepted: `fp16`, `f16`, `i8`, `q8`, `i4`, `q4`, `fp32`.
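The alias handling could be sketched as a small normalizer. This is illustrative only: the exact alias-to-canonical mapping (in particular `fp32` → `none`) is an assumption based on the names, and the real logic lives in `convert_translategemma_android.py`.

```python
# Hypothetical sketch of normalizing the documented --quantize aliases to
# canonical values; the actual mapping in the script may differ.
CANONICAL = {"none", "dynamic_int8", "int8", "int4", "float16"}

ALIASES = {
    "fp16": "float16", "f16": "float16",
    "i8": "int8", "q8": "int8",
    "i4": "int4", "q4": "int4",
    "fp32": "none",  # assumption: fp32 means "no quantization"
}

def normalize_quant(value: str) -> str:
    """Map a user-supplied --quantize value to its canonical name."""
    value = ALIASES.get(value.strip().lower(), value.strip().lower())
    if value not in CANONICAL:
        raise ValueError(f"unsupported --quantize value: {value!r}")
    return value
```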

### Bundle a TFLite into `.litertlm` (recommended for Google AI Edge Gallery)
```bash
# Requires /tmp/litert-lm-pkg - see bundle_litertlm.py setup block if missing
python bundle_litertlm.py \
  --tflite ./tflite_output/dynamic_int8/*.tflite \
  --tokenizer ./translategemma-4b-it/tokenizer.model \
  --output ./output/translategemma-4b-it-native-dynamic_int8.litertlm \
  --quant dynamic_int8
```

### Bundle-only from existing TFLite (skip conversion)
```bash
python convert_translategemma_android.py \
  --bundle-only \
  --existing-tflite ./tflite_output/none/translategemma-4b-it-generic-none.tflite \
  --quantize int8 \
  --tflite-dir ./tflite_output/int8 \
  --output-dir ./output \
  --task-file ./output/translategemma-4b-it-int8.task
```

### Batch multi-quant build + HF upload
```bash
python multi_quant_build_upload.py \
  --model-dir ./translategemma-4b-it \
  --quants "int4,int8,dynamic_int8" \
  --repo-id barakplasma/translategemma-4b-it-android-task-quantized \
  --no-upload  # remove to upload
```

## Architecture

### Three-script design

**`convert_translategemma_android.py`** – single conversion run, three strategies in sequence:

1. **Strategy 1** (`strategy1_litert_native`) – preferred; uses `litert-torch` with `build_translategemma_4b()`, a custom builder for the 4B architecture. Produces proper KV-cache TFLite with prefill/decode signatures. Quantization is applied natively by the converter via `QUANT_MAP`:
   - `"int4"` → `"dynamic_int4_block128"` (blockwise INT4, ~2 GB)
   - `"int8"` → `"weight_only_int8"` (~4 GB)
   - `"dynamic_int8"` → `"dynamic_int8"` (~4 GB)
   - `"float16"` → `"fp16"` (~8 GB)

2. **Strategy 2** (`strategy2_generic`) – fallback; wraps the HF model in `LogitsOnlyWrapper` and exports via `ai_edge_torch.convert()`. Always exports float32. **Important**: outputs a flat `input_ids → logits` TFLite with NO KV cache – NOT compatible with MediaPipe LLM inference.

3. **Strategy 3** (`strategy3_post_tflite_quantize`) – runs only when Strategy 2 was used (never after Strategy 1); applies post-hoc weight quantization to the TFLite flatbuffer via `ai_edge_quantizer`. Does NOT add KV cache but reduces file size.
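The three-strategy sequence amounts to a simple fallthrough, sketched below. The strategy functions are trivial stand-ins so the control flow is runnable; the real implementations in `convert_translategemma_android.py` do the actual `litert-torch` / `ai_edge_torch` work.

```python
# Sketch of the documented strategy fallthrough. The three strategy functions
# here are stand-ins; only the control flow mirrors the description above.
def strategy1_litert_native(quantize: str) -> str:
    raise RuntimeError("pretend litert-torch export is unavailable")  # stand-in

def strategy2_generic() -> str:
    return "model-generic-float32.tflite"  # stand-in: always float32, no KV cache

def strategy3_post_tflite_quantize(path: str, quantize: str) -> str:
    return path.replace("float32", quantize)  # stand-in for ai_edge_quantizer

def convert(quantize: str) -> str:
    try:
        # Strategy 1: preferred; native KV-cache export with built-in quantization.
        return strategy1_litert_native(quantize)
    except Exception as err:
        print(f"Strategy 1 failed ({err}); falling back to generic export")
    # Strategy 2: flat input_ids -> logits export, float32 only.
    path = strategy2_generic()
    if quantize != "none":
        # Strategy 3: post-hoc weight quantization, only ever after Strategy 2.
        path = strategy3_post_tflite_quantize(path, quantize)
    return path
```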

**`bundle_litertlm.py`** – takes a Strategy 1 TFLite + SentencePiece tokenizer and packages them into `.litertlm` format with an `LlmMetadata` proto (Gemma3 model type, embedded Jinja chat template, BOS/EOS stop tokens, 2K max tokens). Requires `/tmp/litert-lm-pkg/` with compiled FlatBuffers and proto Python bindings (see the script header for setup).

**`multi_quant_build_upload.py`** – orchestrator; invokes `convert_translategemma_android.py` as a subprocess per quant level, handles timeouts/signals, writes `output/quantization_summary.json` and `output/README.md`, and uploads artifacts to HuggingFace.
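The per-quant orchestration might look like the following sketch. Flag names are copied from the commands earlier in this file; the timeout value and result bookkeeping are assumptions, not the script's actual behavior.

```python
import subprocess

# Sketch of driving convert_translategemma_android.py once per quant level,
# as multi_quant_build_upload.py does. Timeout and bookkeeping are assumed.
def build_converter_cmd(quant: str, model_dir: str = "./translategemma-4b-it") -> list[str]:
    """Assemble the subprocess argv for one quantization level."""
    return [
        "python", "convert_translategemma_android.py",
        "--model-dir", model_dir,
        "--tflite-dir", f"./tflite_output/{quant}",
        "--output-dir", "./output",
        "--task-file", f"./output/translategemma-4b-it-native-{quant}.task",
        "--quantize", quant,
    ]

def run_all(quants: list[str]) -> dict[str, bool]:
    """Run each conversion; record success/failure instead of aborting the batch."""
    results = {}
    for quant in quants:
        try:
            proc = subprocess.run(build_converter_cmd(quant), timeout=4 * 3600)
            results[quant] = proc.returncode == 0
        except subprocess.TimeoutExpired:
            results[quant] = False
    return results
```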

### Key function: `build_translategemma_4b()`

Critical – without this, `litert-torch 0.8.0` falls back to wrong-architecture builders (1B/270m). Hardcodes the correct config:
- 34 layers, embedding_dim=2560, 8 attention heads, 4 KV heads, head_dim=256, intermediate=10240
- Sliding window of 1024; global attention at layers where `(idx + 1) % 6 == 0` (indices 5, 11, 17, 23, 29)
- RMS norm with `zero_centered=True`, per-head QK normalization (`q_norm`, `k_norm`)
- Custom loader strips the `language_model.` prefix from TranslateGemma's multimodal safetensors keys (standard Gemma3 safetensors don't have this prefix)
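The attention layout above can be reproduced in a few lines. The `"global"`/`"sliding"` labels are illustrative, not the enum names the real builder uses.

```python
# Sketch of the documented global/sliding attention layout for the 34 layers.
NUM_LAYERS = 34

def attention_pattern(num_layers: int = NUM_LAYERS) -> list[str]:
    """Every 6th layer (1-based) uses global attention; the rest use a
    1024-token sliding window."""
    return [
        "global" if (idx + 1) % 6 == 0 else "sliding"
        for idx in range(num_layers)
    ]

global_indices = [i for i, kind in enumerate(attention_pattern()) if kind == "global"]
# global_indices == [5, 11, 17, 23, 29]
```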

### Output formats

| Format | Runtime | Notes |
|--------|---------|-------|
| `.litertlm` | LiteRT-LM / Google AI Edge Gallery | Recommended; embeds Jinja prompt template and LlmMetadata |
| `.task` | MediaPipe GenAI | Legacy; no embedded template – the user must manually add `<start_of_turn>` tokens |

### Prompt format for on-device inference

TranslateGemma requires this exact format (it was trained with it):
```
<bos><start_of_turn>user
You are a professional English (en) to Spanish (es) translator...
Produce only the Spanish translation...Please translate:


{text}<end_of_turn>
<start_of_turn>model
```
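A helper that assembles this template might look like the sketch below. The function name and language-pair parameters are illustrative, and the instruction sentences are left elided (`...`) exactly as in the template above.

```python
# Sketch of a prompt builder for the documented TranslateGemma template.
# Parameters are illustrative; "..." marks text elided in the source template.
def build_prompt(text: str, src: str = "English (en)", dst: str = "Spanish (es)") -> str:
    dst_name = dst.split(" ")[0]  # e.g. "Spanish"
    return (
        "<bos><start_of_turn>user\n"
        f"You are a professional {src} to {dst} translator...\n"
        f"Produce only the {dst_name} translation...Please translate:\n"
        "\n\n"
        f"{text}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )
```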

In Google AI Edge Gallery **Prompt Lab** mode, paste this as the System Prompt with `{{input}}` as the placeholder. `.litertlm` files embed a simplified Jinja template for AI Chat mode.

### Runtime notes

- `conv-venv/` – virtualenv with all deps (`litert-torch==0.8.0`, `mediapipe`, `ai_edge_torch`, `transformers`)
- `/tmp/litert-lm-pkg/` – manually assembled package from the cloned LiteRT-LM repo, with compiled FlatBuffer (`flatc -p --gen-onefile`) and proto (`protoc`) Python bindings; required by `bundle_litertlm.py` at runtime; NOT persistent across reboots
- `/tmp/litert-lm/` – cloned `google-ai-edge/LiteRT-LM` repo (schema source for rebuilding the package)
- Conversion requires ~128 GB RAM; the 4B model alone loads ~46 GB
- `translategemma-4b-it/tokenizer.model` is the SentencePiece binary used by both the `.task` and `.litertlm` bundlers; `ensure_tokenizer_model()` auto-converts from `tokenizer.json` if missing
- HuggingFace repo: `barakplasma/translategemma-4b-it-android-task-quantized`; upload token in the `HF_TOKEN` env var
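These requirements lend themselves to a preflight check, sketched below. The paths and the ~128 GB figure come from the notes above; the helper itself is illustrative and not part of the repo's scripts.

```python
import os

# Illustrative preflight check for the runtime requirements listed above.
REQUIRED_RAM_GB = 128

def preflight(model_dir: str = "./translategemma-4b-it") -> list[str]:
    """Return a list of human-readable problems; an empty list means go."""
    problems = []
    if not os.path.isdir("/tmp/litert-lm-pkg"):
        problems.append("/tmp/litert-lm-pkg missing (rebuild it; /tmp is wiped on reboot)")
    if not os.path.isfile(os.path.join(model_dir, "tokenizer.model")):
        problems.append("tokenizer.model missing (ensure_tokenizer_model() can convert tokenizer.json)")
    try:
        total_gb = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 1024**3
        if total_gb < REQUIRED_RAM_GB:
            problems.append(f"only {total_gb:.0f} GB RAM; conversion needs ~{REQUIRED_RAM_GB} GB")
    except (ValueError, OSError):
        pass  # os.sysconf not available on this platform
    return problems
```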