macmacmacmac committed · Commit 4f6ad08 · verified · 1 Parent(s): 77426ed

Upload README.md with huggingface_hub

Files changed (1): README.md (+73 −340)
README.md CHANGED
@@ -1,391 +1,124 @@
  ---
- license: gemma
  language:
  - en
- pipeline_tag: text-generation
  tags:
- - litert
- - litert-lm
- - gemma
- - agent
- - tool-calling
  - function-calling
- - multimodal
  - on-device
- library_name: litert-lm
  ---

- # Agent Gemma 3n E2B - Tool Calling Edition
-
- A specialized version of **Gemma 3n E2B** optimized for **on-device tool/function calling** with LiteRT-LM. While Google's standard LiteRT-LM models focus on general text generation, this model is specifically designed for agentic workflows with advanced tool calling capabilities.
-
- ## Why This Model?
-
- Google's official LiteRT-LM releases provide excellent on-device inference but don't include built-in tool calling support. This model bridges that gap by:
-
- - ✅ **Native tool/function calling** via Jinja templates
- - ✅ **Multimodal support** (text, vision, audio)
- - ✅ **On-device optimized** - No cloud API required
- - ✅ **INT4 quantized** - Efficient memory usage
- - ✅ **Production ready** - Tested and validated
-
- Perfect for building AI agents that need to interact with external tools, APIs, or functions while running completely on-device.
-
- ## Model Details
-
- - **Base Model**: Gemma 3n E2B
- - **Format**: LiteRT-LM v1.4.0
- - **Quantization**: INT4
- - **Size**: ~3.2GB
- - **Tokenizer**: SentencePiece
- - **Capabilities**:
-   - Advanced tool/function calling
-   - Multi-turn conversations with tool interactions
-   - Vision processing (images)
-   - Audio processing
-   - Streaming responses
-
- ## Tool Calling Example
-
- The model uses a sophisticated Jinja template that supports OpenAI-style function calling:
-
- ```python
- from litert_lm import Engine, Conversation
-
- # Load the model
- engine = Engine.create("gemma-3n-E2B-it-agent-fixed.litertlm", backend="cpu")
- conversation = Conversation.create(engine)
-
- # Define tools the model can use
- tools = [
-     {
-         "name": "get_weather",
-         "description": "Get current weather for a location",
-         "parameters": {
-             "type": "object",
-             "properties": {
-                 "location": {"type": "string", "description": "City name"},
-                 "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
-             },
-             "required": ["location"]
-         }
-     },
-     {
-         "name": "search_web",
-         "description": "Search the internet for information",
-         "parameters": {
-             "type": "object",
-             "properties": {
-                 "query": {"type": "string", "description": "Search query"}
-             },
-             "required": ["query"]
-         }
-     }
- ]
-
- # Have a conversation with tool calling
- message = {
-     "role": "user",
-     "content": "What's the weather in San Francisco and latest news about AI?"
- }
-
- response = conversation.send_message(message, tools=tools)
- print(response)
  ```
-
- ### Example Output
-
- The model will generate structured tool calls:
-
- ```
- <start_function_call>call:get_weather{location:San Francisco,unit:celsius}<end_function_call>
- <start_function_call>call:search_web{query:latest AI news}<end_function_call>
- <start_function_response>
  ```

- You then execute the functions and send back results:
-
- ```python
- # Execute tools (your implementation)
- weather = get_weather("San Francisco", "celsius")
- news = search_web("latest AI news")
-
- # Send tool responses back
- tool_response = {
-     "role": "tool",
-     "content": [
-         {
-             "name": "get_weather",
-             "response": {"temperature": 18, "condition": "partly cloudy"}
-         },
-         {
-             "name": "search_web",
-             "response": {"results": ["OpenAI releases GPT-5...", "..."]}
-         }
-     ]
- }
-
- final_response = conversation.send_message(tool_response)
- print(final_response)
- # "The weather in San Francisco is 18°C and partly cloudy.
- #  In AI news, OpenAI has released GPT-5..."
- ```
-
- ## Advanced Features
-
- ### Multi-Modal Tool Calling
-
- Combine vision, audio, and tool calling:
-
- ```python
- message = {
-     "role": "user",
-     "content": [
-         {"type": "image", "data": image_bytes},
-         {"type": "text", "text": "What's in this image? Search for more info about it."}
-     ]
- }
-
- response = conversation.send_message(message, tools=[search_tool])
- # Model can see the image AND call search functions
  ```
-
- ### Streaming Tool Calls
-
- Get tool calls as they're generated:
-
- ```python
- def on_token(token):
-     if "<start_function_call>" in token:
-         print("Tool being called...")
-     print(token, end="", flush=True)
-
- conversation.send_message_async(message, tools=tools, callback=on_token)
  ```
-
- ### Nested Tool Execution
-
- The model can chain tool calls:
-
- ```python
- # User: "Book me a flight to Tokyo and reserve a hotel"
- # Model: calls check_flights() → calls book_hotel() → confirms both
  ```

-
- ## Performance
-
- Benchmarked on CPU (no GPU acceleration):
-
- - **Prefill Speed**: 21.20 tokens/sec
- - **Decode Speed**: 11.44 tokens/sec
- - **Time to First Token**: ~1.6s
- - **Cold Start**: ~4.7s
- - **Tool Call Latency**: ~100-200ms additional
-
- GPU acceleration provides 3-5x speedup on supported hardware.
-
- ## Installation & Usage
-
- ### Requirements
-
- 1. **LiteRT-LM Runtime** - Build from source:
-    ```bash
-    git clone https://github.com/google-ai-edge/LiteRT.git
-    cd LiteRT/LiteRT-LM
-    bazel build -c opt //runtime/engine:litert_lm_main
-    ```
-
- 2. **Supported Platforms**: Linux (clang), macOS, Android
-
- ### Quick Start
-
- ```bash
- # Download model
- wget https://huggingface.co/kontextdev/agent-gemma/resolve/main/gemma-3n-E2B-it-agent-fixed.litertlm
-
- # Run with simple prompt
- ./bazel-bin/runtime/engine/litert_lm_main \
-     --model_path=gemma-3n-E2B-it-agent-fixed.litertlm \
-     --backend=cpu \
-     --input_prompt="Hello, I need help with some tasks"
-
- # Run with GPU (if available)
- ./bazel-bin/runtime/engine/litert_lm_main \
-     --model_path=gemma-3n-E2B-it-agent-fixed.litertlm \
-     --backend=gpu \
-     --input_prompt="What can you help me with?"
  ```

- ### Python API (Recommended)
-
- ```python
- from litert_lm import Engine, Conversation, SessionConfig
-
- # Initialize
- engine = Engine.create("gemma-3n-E2B-it-agent-fixed.litertlm", backend="gpu")
-
- # Configure session
- config = SessionConfig(
-     max_tokens=2048,
-     temperature=0.7,
-     top_p=0.9
- )
-
- # Start conversation
- conversation = Conversation.create(engine, config)
-
- # Define your tools
- tools = [...]  # Your function definitions
-
- # Chat with tool calling
- while True:
-     user_input = input("You: ")
-     response = conversation.send_message(
-         {"role": "user", "content": user_input},
-         tools=tools
-     )
-
-     # Handle tool calls if present
-     if has_tool_calls(response):
-         results = execute_tools(extract_calls(response))
-         response = conversation.send_message({
-             "role": "tool",
-             "content": results
-         })
-
-     print(f"Agent: {response['content']}")
- ```
-
- ## Tool Call Format
-
- The model uses this format for tool interactions:
-
- **Function Declaration** (system/developer role):
- ```
- <start_of_turn>developer
- <start_function_declaration>
- {
-     "name": "function_name",
-     "description": "What it does",
-     "parameters": {...}
- }
- <end_function_declaration>
- <end_of_turn>
- ```
-
- **Function Call** (assistant):
- ```
- <start_function_call>call:function_name{arg1:value1,arg2:value2}<end_function_call>
  ```
-
- **Function Response** (tool role):
- ```
- <start_function_response>response:function_name{result:value}<end_function_response>
- ```

- ## Use Cases
-
- ### Personal AI Assistant
- - Calendar management
- - Email sending
- - Web searching
- - File operations
-
- ### IoT & Smart Home
- - Device control
- - Sensor monitoring
- - Automation workflows
- - Voice commands
-
- ### Development Tools
- - Code generation with API calls
- - Database queries
- - Deployment automation
- - Testing & debugging
-
- ### Business Applications
- - CRM integration
- - Data analysis
- - Report generation
- - Customer support
-
- ## Model Architecture
-
- Built on Gemma 3n E2B with 9 optimized components:
-
- ```
- Section 0: LlmMetadata (Agent Jinja template)
- Section 1: SentencePiece Tokenizer
- Section 2: TFLite Embedder
- Section 3: TFLite Per-Layer Embedder
- Section 4: TFLite Audio Encoder (HW accelerated)
- Section 5: TFLite End-of-Audio Detector
- Section 6: TFLite Vision Adapter
- Section 7: TFLite Vision Encoder
- Section 8: TFLite Prefill/Decode (INT4)
  ```
-
- All components are optimized for on-device inference with hardware acceleration support.
-
- ## Comparison
-
- | Feature | Standard Gemma LiteRT-LM | This Model |
- |---------|--------------------------|------------|
- | Text Generation | ✅ | ✅ |
- | Tool Calling | ❌ | ✅ |
- | Multimodal | ✅ | ✅ |
- | Streaming | ✅ | ✅ |
- | On-Device | ✅ | ✅ |
- | Jinja Templates | Basic | Advanced Agent Template |
- | INT4 Quantization | ✅ | ✅ |
-
- ## Limitations
-
- - **Tool Execution**: The model generates tool calls but doesn't execute them - you need to implement the actual functions
- - **Context Window**: Limited to 4096 tokens (configurable)
- - **Streaming Tool Calls**: Partial tool calls may need buffering
- - **Hardware Requirements**: Minimum 4GB RAM recommended
- - **No Native GPU on CPU-only systems**: Falls back to CPU inference
-
- ## Tips for Best Results
-
- 1. **Clear Tool Descriptions**: Provide detailed function descriptions
- 2. **Schema Validation**: Validate tool call arguments before execution
- 3. **Error Handling**: Handle malformed tool calls gracefully
- 4. **Context Management**: Keep conversation history concise
- 5. **Temperature**: Use 0.7-0.9 for creative tasks, 0.3-0.5 for precise tool calls
- 6. **Batching**: Process multiple tool calls in parallel when possible
-
  ## License

  This model inherits the [Gemma license](https://ai.google.dev/gemma/terms) from the base model.
-
- ## Citation
-
- ```bibtex
- @misc{agent-gemma-litertlm,
-     title={Agent Gemma 3n E2B - Tool Calling Edition},
-     author={kontextdev},
-     year={2025},
-     publisher={HuggingFace},
-     howpublished={\url{https://huggingface.co/kontextdev/agent-gemma}}
- }
- ```
-
- ## Links
-
- - [LiteRT-LM GitHub](https://github.com/google-ai-edge/LiteRT/tree/main/LiteRT-LM)
- - [Gemma Model Family](https://ai.google.dev/gemma)
- - [LiteRT Documentation](https://ai.google.dev/edge/litert)
- - [Tool Calling Guide](https://ai.google.dev/gemma/docs/function-calling)
-
- ## Support
-
- For issues or questions:
- - Open an issue on [GitHub](https://github.com/google-ai-edge/LiteRT/issues)
- - Check the [LiteRT-LM docs](https://ai.google.dev/edge/litert/inference)
- - Community forum: [Google AI Edge](https://discuss.ai.google.dev/)
-
- ---
-
- Built with ❤️ for the on-device AI community

  ---
  language:
  - en
+ license: gemma
+ library_name: transformers
+ base_model: google/gemma-3n-E2B-it
  tags:
  - function-calling
+ - tool-use
  - on-device
+ - mobile
+ - gemma
+ - litertlm
  ---

+ # Agent Gemma — Gemma 3n E2B Fine-Tuned for Function Calling

+ A fine-tuned version of [google/gemma-3n-E2B-it](https://huggingface.co/google/gemma-3n-E2B-it) trained for on-device function calling using Google's [FunctionGemma](https://ai.google.dev/gemma/docs/functiongemma/function-calling-with-hf) technique.

+ ## What's Different from Stock Gemma 3n

+ ### Fixed: `format_function_declaration` Template Error

+ The stock Gemma 3n chat template uses `format_function_declaration()`, a custom Jinja2 function available in Google's Python tokenizer but **not supported by LiteRT-LM's on-device template engine**. This causes:

  ```
+ Failed to apply template: unknown function: format_function_declaration is unknown (in template:21)
  ```

+ This model replaces the stock template with a **LiteRT-LM compatible** template that uses only standard Jinja2 features (the `tojson` filter plus `<start_function_declaration>` / `<end_function_declaration>` markers). The template is embedded in both `tokenizer_config.json` and `chat_template.jinja`.
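The declaration step of such a template can be approximated in plain Python. This is an editor's sketch, not the shipped template: standard Jinja2's `tojson` filter serializes each schema the same way `json.dumps` does here, with the markers taken from the format described in this card.

```python
import json

def format_declarations(tools):
    """Sketch: serialize tool schemas the way a tojson-based template would,
    wrapping each one in FunctionGemma declaration markers."""
    parts = []
    for tool in tools:
        fn = tool.get("function", tool)  # accept OpenAI-style or bare schemas
        parts.append("<start_function_declaration>"
                     + json.dumps(fn)
                     + "<end_function_declaration>")
    return "\n".join(parts)

tools = [{"function": {"name": "get_weather",
                       "parameters": {"type": "object",
                                      "properties": {"location": {"type": "string"}}}}}]
print(format_declarations(tools))
```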
 
+ ### Function Calling Format

+ The model uses the FunctionGemma markup format:

  ```
+ <start_function_call>call:function_name{param:<escape>value<escape>}<end_function_call>
  ```

+ Tool declarations are formatted as:

  ```
+ <start_function_declaration>{"name": "get_weather", "parameters": {...}}<end_function_declaration>
  ```
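On the application side you still need to recover these calls from the model's raw output. A minimal, hypothetical parser for this markup (regex-based; `parse_calls` is an illustration, not an API from this repo or LiteRT-LM):

```python
import re

# Match whole call spans, then individual <escape>-wrapped arguments.
CALL_RE = re.compile(r"<start_function_call>call:(\w+)\{(.*?)\}<end_function_call>", re.S)
ARG_RE = re.compile(r"(\w+):<escape>(.*?)<escape>")

def parse_calls(text):
    """Return a list of (function_name, {param: value}) pairs found in text."""
    calls = []
    for name, body in CALL_RE.findall(text):
        args = {key: value for key, value in ARG_RE.findall(body)}
        calls.append((name, args))
    return calls

sample = "<start_function_call>call:get_weather{location:<escape>Tokyo<escape>}<end_function_call>"
print(parse_calls(sample))  # [('get_weather', {'location': 'Tokyo'})]
```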
 
+ ## Training Details

+ - **Base model:** google/gemma-3n-E2B-it (5.4B parameters)
+ - **Method:** QLoRA (rank=16, alpha=32) — 22.9M trainable parameters (0.42%)
+ - **Dataset:** [google/mobile-actions](https://huggingface.co/datasets/google/mobile-actions) (8,693 training samples)
+ - **Training:** 500 steps, batch_size=1, max_seq_length=512, learning_rate=2e-4
+ - **Precision:** bfloat16
53
+ ## Usage
 
 
 
 
 
54
 
55
+ ### With LiteRT-LM on Android (Kotlin)
 
56
 
57
+ ```kotlin
58
+ // After converting to .litertlm format
59
+ val engine = Engine(EngineConfig(modelPath = "agent-gemma.litertlm"))
60
+ engine.initialize()
61
 
62
+ val conversation = engine.createConversation(
63
+ ConversationConfig(
64
+ systemMessage = Message.of("You are a helpful assistant."),
65
+ tools = listOf(MyToolSet()) // @Tool annotated class
 
 
66
  )
67
+ )
68
 
69
+ // No format_function_declaration error!
70
+ conversation.sendMessageAsync(Message.of("What's the weather?"))
71
+ .collect { print(it) }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
72
  ```
73
 
+ ### With Transformers (Python)

+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model = AutoModelForCausalLM.from_pretrained("kontextdev/agent-gemma")
+ tokenizer = AutoTokenizer.from_pretrained("kontextdev/agent-gemma")
+
+ messages = [
+     {"role": "developer", "content": "You are a helpful assistant."},
+     {"role": "user", "content": "What's the weather in Tokyo?"}
+ ]
+
+ tools = [{"function": {"name": "get_weather", "parameters": {"type": "object", "properties": {"location": {"type": "string"}}}}}]
+
+ text = tokenizer.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(text, return_tensors="pt")
+ output = model.generate(**inputs, max_new_tokens=256)
+ print(tokenizer.decode(output[0]))
  ```
 
+ ## Chat Template

+ The custom chat template (in `tokenizer_config.json` and `chat_template.jinja`) supports these roles:
+ - `developer` / `system` — system instructions + tool declarations
+ - `user` — user messages
+ - `model` / `assistant` — model responses, including `tool_calls`
+ - `tool` — tool execution results
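For illustration, a conversation exercising each of these roles might be structured as follows (the `tool_calls` field layout follows the common Transformers convention and is an assumption here, not taken from this repo's template):

```python
# One turn per role: developer instructions, user question,
# a model turn that calls a tool, and the tool's result.
messages = [
    {"role": "developer", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Tokyo?"},
    {"role": "model", "tool_calls": [
        {"function": {"name": "get_weather", "arguments": {"location": "Tokyo"}}}
    ]},
    {"role": "tool", "content": '{"temperature": 21, "condition": "clear"}'},
]

roles = [m["role"] for m in messages]
print(roles)  # ['developer', 'user', 'model', 'tool']
```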
 
+ ## Converting to .litertlm

+ Use the [LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM) conversion tools to package the model for on-device deployment:

+ ```bash
+ # The chat_template.jinja is included in this repo
+ python scripts/convert-to-litertlm.py \
+     --model_dir kontextdev/agent-gemma \
+     --output agent-gemma.litertlm
+ ```

+ ## Files

+ - `model-*.safetensors` — Merged model weights (bfloat16)
+ - `tokenizer_config.json` — Tokenizer config with embedded chat template
+ - `chat_template.jinja` — Standalone chat template file
+ - `config.json` — Model architecture config
+ - `checkpoint-*` — Training checkpoints (LoRA)

  ## License

  This model inherits the [Gemma license](https://ai.google.dev/gemma/terms) from the base model.