macmacmacmac committed · Commit 4f6ad08 · verified · 1 Parent(s): 77426ed

Upload README.md with huggingface_hub

Files changed (1): README.md (+73 −340)
README.md CHANGED
@@ -1,391 +1,124 @@
  ---
- license: gemma
  language:
  - en
- pipeline_tag: text-generation
  tags:
- - litert
- - litert-lm
- - gemma
- - agent
- - tool-calling
  - function-calling
- - multimodal
  - on-device
- library_name: litert-lm
  ---

- # Agent Gemma 3n E2B - Tool Calling Edition
-
- A specialized version of **Gemma 3n E2B** optimized for **on-device tool/function calling** with LiteRT-LM. While Google's standard LiteRT-LM models focus on general text generation, this model is specifically designed for agentic workflows with advanced tool calling capabilities.
-
- ## Why This Model?
-
- Google's official LiteRT-LM releases provide excellent on-device inference but don't include built-in tool calling support. This model bridges that gap by:
-
- - ✅ **Native tool/function calling** via Jinja templates
- - ✅ **Multimodal support** (text, vision, audio)
- - ✅ **On-device optimized** - No cloud API required
- - ✅ **INT4 quantized** - Efficient memory usage
- - ✅ **Production ready** - Tested and validated
-
- Perfect for building AI agents that need to interact with external tools, APIs, or functions while running completely on-device.
-
- ## Model Details
-
- - **Base Model**: Gemma 3n E2B
- - **Format**: LiteRT-LM v1.4.0
- - **Quantization**: INT4
- - **Size**: ~3.2GB
- - **Tokenizer**: SentencePiece
- - **Capabilities**:
-   - Advanced tool/function calling
-   - Multi-turn conversations with tool interactions
-   - Vision processing (images)
-   - Audio processing
-   - Streaming responses
-
- ## Tool Calling Example
-
- The model uses a sophisticated Jinja template that supports OpenAI-style function calling:
-
- ```python
- from litert_lm import Engine, Conversation
-
- # Load the model
- engine = Engine.create("gemma-3n-E2B-it-agent-fixed.litertlm", backend="cpu")
- conversation = Conversation.create(engine)
-
- # Define tools the model can use
- tools = [
-     {
-         "name": "get_weather",
-         "description": "Get current weather for a location",
-         "parameters": {
-             "type": "object",
-             "properties": {
-                 "location": {"type": "string", "description": "City name"},
-                 "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
-             },
-             "required": ["location"]
-         }
-     },
-     {
-         "name": "search_web",
-         "description": "Search the internet for information",
-         "parameters": {
-             "type": "object",
-             "properties": {
-                 "query": {"type": "string", "description": "Search query"}
-             },
-             "required": ["query"]
-         }
-     }
- ]
-
- # Have a conversation with tool calling
- message = {
-     "role": "user",
-     "content": "What's the weather in San Francisco and latest news about AI?"
- }
-
- response = conversation.send_message(message, tools=tools)
- print(response)
  ```
-
- ### Example Output
-
- The model will generate structured tool calls:
-
- ```
- <start_function_call>call:get_weather{location:San Francisco,unit:celsius}<end_function_call>
- <start_function_call>call:search_web{query:latest AI news}<end_function_call>
- <start_function_response>
  ```

- You then execute the functions and send back results:
-
- ```python
- # Execute tools (your implementation)
- weather = get_weather("San Francisco", "celsius")
- news = search_web("latest AI news")
-
- # Send tool responses back
- tool_response = {
-     "role": "tool",
-     "content": [
-         {
-             "name": "get_weather",
-             "response": {"temperature": 18, "condition": "partly cloudy"}
-         },
-         {
-             "name": "search_web",
-             "response": {"results": ["OpenAI releases GPT-5...", "..."]}
-         }
-     ]
- }
-
- final_response = conversation.send_message(tool_response)
- print(final_response)
- # "The weather in San Francisco is 18°C and partly cloudy.
- #  In AI news, OpenAI has released GPT-5..."
- ```
-
- ## Advanced Features
-
- ### Multi-Modal Tool Calling
-
- Combine vision, audio, and tool calling:
-
- ```python
- message = {
-     "role": "user",
-     "content": [
-         {"type": "image", "data": image_bytes},
-         {"type": "text", "text": "What's in this image? Search for more info about it."}
-     ]
- }
-
- response = conversation.send_message(message, tools=[search_tool])
- # Model can see the image AND call search functions
  ```
-
- ### Streaming Tool Calls
-
- Get tool calls as they're generated:
-
- ```python
- def on_token(token):
-     if "<start_function_call>" in token:
-         print("Tool being called...")
-     print(token, end="", flush=True)
-
- conversation.send_message_async(message, tools=tools, callback=on_token)
  ```
-
- ### Nested Tool Execution
-
- The model can chain tool calls:
-
- ```python
- # User: "Book me a flight to Tokyo and reserve a hotel"
- # Model: calls check_flights() → calls book_hotel() → confirms both
  ```

-
- ## Performance
-
- Benchmarked on CPU (no GPU acceleration):
-
- - **Prefill Speed**: 21.20 tokens/sec
- - **Decode Speed**: 11.44 tokens/sec
- - **Time to First Token**: ~1.6s
- - **Cold Start**: ~4.7s
- - **Tool Call Latency**: ~100-200ms additional
-
- GPU acceleration provides 3-5x speedup on supported hardware.
-
- ## Installation & Usage
-
- ### Requirements
-
- 1. **LiteRT-LM Runtime** - Build from source:
-    ```bash
-    git clone https://github.com/google-ai-edge/LiteRT.git
-    cd LiteRT/LiteRT-LM
-    bazel build -c opt //runtime/engine:litert_lm_main
-    ```
-
- 2. **Supported Platforms**: Linux (clang), macOS, Android
-
- ### Quick Start
-
- ```bash
- # Download model
- wget https://huggingface.co/kontextdev/agent-gemma/resolve/main/gemma-3n-E2B-it-agent-fixed.litertlm
-
- # Run with simple prompt
- ./bazel-bin/runtime/engine/litert_lm_main \
-     --model_path=gemma-3n-E2B-it-agent-fixed.litertlm \
-     --backend=cpu \
-     --input_prompt="Hello, I need help with some tasks"
-
- # Run with GPU (if available)
- ./bazel-bin/runtime/engine/litert_lm_main \
-     --model_path=gemma-3n-E2B-it-agent-fixed.litertlm \
-     --backend=gpu \
-     --input_prompt="What can you help me with?"
  ```

- ### Python API (Recommended)
-
- ```python
- from litert_lm import Engine, Conversation, SessionConfig
-
- # Initialize
- engine = Engine.create("gemma-3n-E2B-it-agent-fixed.litertlm", backend="gpu")
-
- # Configure session
- config = SessionConfig(
-     max_tokens=2048,
-     temperature=0.7,
-     top_p=0.9
- )
-
- # Start conversation
- conversation = Conversation.create(engine, config)
-
- # Define your tools
- tools = [...]  # Your function definitions
-
- # Chat with tool calling
- while True:
-     user_input = input("You: ")
-     response = conversation.send_message(
-         {"role": "user", "content": user_input},
-         tools=tools
-     )
-
-     # Handle tool calls if present
-     if has_tool_calls(response):
-         results = execute_tools(extract_calls(response))
-         response = conversation.send_message({
-             "role": "tool",
-             "content": results
-         })
-
-     print(f"Agent: {response['content']}")
- ```
-
- ## Tool Call Format
-
- The model uses this format for tool interactions:
-
- **Function Declaration** (system/developer role):
- ```
- <start_of_turn>developer
- <start_function_declaration>
- {
-     "name": "function_name",
-     "description": "What it does",
-     "parameters": {...}
- }
- <end_function_declaration>
- <end_of_turn>
- ```
-
- **Function Call** (assistant):
- ```
- <start_function_call>call:function_name{arg1:value1,arg2:value2}<end_function_call>
  ```
-
- **Function Response** (tool role):
- ```
- <start_function_response>response:function_name{result:value}<end_function_response>
- ```

- ## Use Cases
-
- ### Personal AI Assistant
- - Calendar management
- - Email sending
- - Web searching
- - File operations
-
- ### IoT & Smart Home
- - Device control
- - Sensor monitoring
- - Automation workflows
- - Voice commands
-
- ### Development Tools
- - Code generation with API calls
- - Database queries
- - Deployment automation
- - Testing & debugging
-
- ### Business Applications
- - CRM integration
- - Data analysis
- - Report generation
- - Customer support
-
- ## Model Architecture
-
- Built on Gemma 3n E2B with 9 optimized components:
-
- ```
- Section 0: LlmMetadata (Agent Jinja template)
- Section 1: SentencePiece Tokenizer
- Section 2: TFLite Embedder
- Section 3: TFLite Per-Layer Embedder
- Section 4: TFLite Audio Encoder (HW accelerated)
- Section 5: TFLite End-of-Audio Detector
- Section 6: TFLite Vision Adapter
- Section 7: TFLite Vision Encoder
- Section 8: TFLite Prefill/Decode (INT4)
  ```
-
- All components are optimized for on-device inference with hardware acceleration support.
-
- ## Comparison
-
- | Feature | Standard Gemma LiteRT-LM | This Model |
- |---------|--------------------------|------------|
- | Text Generation | ✅ | ✅ |
- | Tool Calling | ❌ | ✅ |
- | Multimodal | ✅ | ✅ |
- | Streaming | ✅ | ✅ |
- | On-Device | ✅ | ✅ |
- | Jinja Templates | Basic | Advanced Agent Template |
- | INT4 Quantization | ✅ | ✅ |
-
- ## Limitations
-
- - **Tool Execution**: The model generates tool calls but doesn't execute them - you need to implement the actual functions
- - **Context Window**: Limited to 4096 tokens (configurable)
- - **Streaming Tool Calls**: Partial tool calls may need buffering
- - **Hardware Requirements**: Minimum 4GB RAM recommended
- - **No Native GPU on CPU-only systems**: Falls back to CPU inference
-
- ## Tips for Best Results
-
- 1. **Clear Tool Descriptions**: Provide detailed function descriptions
- 2. **Schema Validation**: Validate tool call arguments before execution
- 3. **Error Handling**: Handle malformed tool calls gracefully
- 4. **Context Management**: Keep conversation history concise
- 5. **Temperature**: Use 0.7-0.9 for creative tasks, 0.3-0.5 for precise tool calls
- 6. **Batching**: Process multiple tool calls in parallel when possible
-
  ## License

  This model inherits the [Gemma license](https://ai.google.dev/gemma/terms) from the base model.
-
- ## Citation
-
- ```bibtex
- @misc{agent-gemma-litertlm,
-     title={Agent Gemma 3n E2B - Tool Calling Edition},
-     author={kontextdev},
-     year={2025},
-     publisher={HuggingFace},
-     howpublished={\url{https://huggingface.co/kontextdev/agent-gemma}}
- }
- ```
-
- ## Links
-
- - [LiteRT-LM GitHub](https://github.com/google-ai-edge/LiteRT/tree/main/LiteRT-LM)
- - [Gemma Model Family](https://ai.google.dev/gemma)
- - [LiteRT Documentation](https://ai.google.dev/edge/litert)
- - [Tool Calling Guide](https://ai.google.dev/gemma/docs/function-calling)
-
- ## Support
-
- For issues or questions:
- - Open an issue on [GitHub](https://github.com/google-ai-edge/LiteRT/issues)
- - Check the [LiteRT-LM docs](https://ai.google.dev/edge/litert/inference)
- - Community forum: [Google AI Edge](https://discuss.ai.google.dev/)
-
- ---
-
- Built with ❤️ for the on-device AI community

  ---
  language:
  - en
+ license: gemma
+ library_name: transformers
+ base_model: google/gemma-3n-E2B-it
  tags:
  - function-calling
+ - tool-use
  - on-device
+ - mobile
+ - gemma
+ - litertlm
  ---

+ # Agent Gemma — Gemma 3n E2B Fine-Tuned for Function Calling

+ A fine-tuned version of [google/gemma-3n-E2B-it](https://huggingface.co/google/gemma-3n-E2B-it) trained for on-device function calling using Google's [FunctionGemma](https://ai.google.dev/gemma/docs/functiongemma/function-calling-with-hf) technique.

+ ## What's Different from Stock Gemma 3n

+ ### Fixed: `format_function_declaration` Template Error

+ The stock Gemma 3n chat template uses `format_function_declaration()`, a custom Jinja2 function available in Google's Python tokenizer but **not supported by LiteRT-LM's on-device template engine**. This causes:

  ```
+ Failed to apply template: unknown function: format_function_declaration is unknown (in template:21)
  ```

+ This model replaces the stock template with a **LiteRT-LM compatible** template that uses only standard Jinja2 features (the `tojson` filter plus `<start_function_declaration>` / `<end_function_declaration>` markers). The template is embedded in both `tokenizer_config.json` and `chat_template.jinja`.
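The declaration step of such a template can be approximated in plain Python. This is an editor's sketch, not the shipped template: standard Jinja2's `tojson` filter serializes each schema the same way `json.dumps` does here, with the markers taken from the format described in this card.

```python
import json

def format_declarations(tools):
    """Sketch: serialize tool schemas the way a tojson-based template would,
    wrapping each one in FunctionGemma declaration markers."""
    parts = []
    for tool in tools:
        fn = tool.get("function", tool)  # accept OpenAI-style or bare schemas
        parts.append("<start_function_declaration>"
                     + json.dumps(fn)
                     + "<end_function_declaration>")
    return "\n".join(parts)

tools = [{"function": {"name": "get_weather",
                       "parameters": {"type": "object",
                                      "properties": {"location": {"type": "string"}}}}}]
print(format_declarations(tools))
```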
 
+ ### Function Calling Format

+ The model uses the FunctionGemma markup format:

  ```
+ <start_function_call>call:function_name{param:<escape>value<escape>}<end_function_call>
  ```

+ Tool declarations are formatted as:

  ```
+ <start_function_declaration>{"name": "get_weather", "parameters": {...}}<end_function_declaration>
  ```
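On the application side you still need to recover these calls from the model's raw output. A minimal, hypothetical parser for this markup (regex-based; `parse_calls` is an illustration, not an API from this repo or LiteRT-LM):

```python
import re

# Match whole call spans, then individual <escape>-wrapped arguments.
CALL_RE = re.compile(r"<start_function_call>call:(\w+)\{(.*?)\}<end_function_call>", re.S)
ARG_RE = re.compile(r"(\w+):<escape>(.*?)<escape>")

def parse_calls(text):
    """Return a list of (function_name, {param: value}) pairs found in text."""
    calls = []
    for name, body in CALL_RE.findall(text):
        args = {key: value for key, value in ARG_RE.findall(body)}
        calls.append((name, args))
    return calls

sample = "<start_function_call>call:get_weather{location:<escape>Tokyo<escape>}<end_function_call>"
print(parse_calls(sample))  # [('get_weather', {'location': 'Tokyo'})]
```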
 
+ ## Training Details

+ - **Base model:** google/gemma-3n-E2B-it (5.4B parameters)
+ - **Method:** QLoRA (rank=16, alpha=32) — 22.9M trainable parameters (0.42%)
+ - **Dataset:** [google/mobile-actions](https://huggingface.co/datasets/google/mobile-actions) (8,693 training samples)
+ - **Training:** 500 steps, batch_size=1, max_seq_length=512, learning_rate=2e-4
+ - **Precision:** bfloat16
53
+ ## Usage
 
 
 
 
 
54
 
55
+ ### With LiteRT-LM on Android (Kotlin)
 
56
 
57
+ ```kotlin
58
+ // After converting to .litertlm format
59
+ val engine = Engine(EngineConfig(modelPath = "agent-gemma.litertlm"))
60
+ engine.initialize()
61
 
62
+ val conversation = engine.createConversation(
63
+ ConversationConfig(
64
+ systemMessage = Message.of("You are a helpful assistant."),
65
+ tools = listOf(MyToolSet()) // @Tool annotated class
 
 
66
  )
67
+ )
68
 
69
+ // No format_function_declaration error!
70
+ conversation.sendMessageAsync(Message.of("What's the weather?"))
71
+ .collect { print(it) }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
72
  ```
73
 
+ ### With Transformers (Python)

+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model = AutoModelForCausalLM.from_pretrained("kontextdev/agent-gemma")
+ tokenizer = AutoTokenizer.from_pretrained("kontextdev/agent-gemma")
+
+ messages = [
+     {"role": "developer", "content": "You are a helpful assistant."},
+     {"role": "user", "content": "What's the weather in Tokyo?"}
+ ]
+
+ tools = [{"function": {"name": "get_weather", "parameters": {"type": "object", "properties": {"location": {"type": "string"}}}}}]
+
+ text = tokenizer.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(text, return_tensors="pt")
+ output = model.generate(**inputs, max_new_tokens=256)
+ print(tokenizer.decode(output[0]))
  ```
 
+ ## Chat Template

+ The custom chat template (in `tokenizer_config.json` and `chat_template.jinja`) supports these roles:
+ - `developer` / `system` — system instructions + tool declarations
+ - `user` — user messages
+ - `model` / `assistant` — model responses, including `tool_calls`
+ - `tool` — tool execution results
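For illustration, a conversation exercising each of these roles might be structured as follows (the `tool_calls` field layout follows the common Transformers convention and is an assumption here, not taken from this repo's template):

```python
# One turn per role: developer instructions, user question,
# a model turn that calls a tool, and the tool's result.
messages = [
    {"role": "developer", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Tokyo?"},
    {"role": "model", "tool_calls": [
        {"function": {"name": "get_weather", "arguments": {"location": "Tokyo"}}}
    ]},
    {"role": "tool", "content": '{"temperature": 21, "condition": "clear"}'},
]

roles = [m["role"] for m in messages]
print(roles)  # ['developer', 'user', 'model', 'tool']
```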
 
+ ## Converting to .litertlm

+ Use the [LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM) conversion tools to package the model for on-device deployment:

+ ```bash
+ # The chat_template.jinja is included in this repo
+ python scripts/convert-to-litertlm.py \
+     --model_dir kontextdev/agent-gemma \
+     --output agent-gemma.litertlm
+ ```

+ ## Files

+ - `model-*.safetensors` — Merged model weights (bfloat16)
+ - `tokenizer_config.json` — Tokenizer config with embedded chat template
+ - `chat_template.jinja` — Standalone chat template file
+ - `config.json` — Model architecture config
+ - `checkpoint-*` — Training checkpoints (LoRA)

  ## License

  This model inherits the [Gemma license](https://ai.google.dev/gemma/terms) from the base model.