FunctionGemma 270M — Physical AI
A fine-tune of google/functiongemma-270m-it for voice-controlled physical-AI / household-IoT actions on a Synaptics SL2619 "Coral" edge board (Google IO 2026 demo).
| Revision | File | Tool count | Notes |
|---|---|---|---|
| v7 (current) | functiongemma-physical-ai-v7-Q5_K_M.gguf | 10 | list_alarms removed; alarm-query prompts route via respond(). 250-row eval: 86.8% overall, 92.8% single-tool, 75.0% multi-tool exact-match, 0.0% parse failure. |
| v6 (previous) | functiongemma-physical-ai-v6-Q5_K_M.gguf | 11 | Camera + vision dropped. Single-tool routing 95.5%, multi-tool exact-match 23.9%. |
| v4c (legacy) | functiongemma-physical-ai-Q4_K_M.gguf | 13 | Earlier checkpoint, includes camera/scene tools. |
Schema ships as tools.json (10 tools, current). Token-to-tool
mapping is in token_map.json.
Output format — function tokens
Tool calls emit as function tokens: each tool name compiles to a single
special-vocabulary token (<tool_0> … <tool_9> for v7) and a single
<end> terminator. A complete call decodes in roughly 8–15 output tokens,
vs ~30–80 for native FunctionGemma's
<start_function_call>call:NAME{...}<end_function_call> syntax. On a
2-core Cortex-A55 this is the difference between sub-second and 2–5 s
voice-UX latency.
Sample output: <tool_3>(3,"red")<end> for blink_lights(count=3, color="red").
<tool_0> → turn_on_lights, <tool_3> → blink_lights,
<tool_8> → get_system_status, <tool_9> → respond (v7 numbering;
v6 used <tool_9> and <tool_10> for those — bumped down by one when
list_alarms was removed). Full mapping in token_map.json.
⚠️ Inference servers MUST stop generation on <end_of_turn> (or <eos>), NOT on <end>. Multi-tool sequences emit <tool_A>(args)<end><tool_B>(args)<end>, so stopping at the first <end> truncates legitimate multi-tool output.
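For dispatchers that consume such multi-tool output, a minimal splitting sketch (assuming only the <tool_N>(args)<end> format described above and the "reverse" mapping from token_map.json; the helper name is illustrative):

```python
import re

# Hedged sketch: split a raw multi-tool completion such as
#   <tool_0>()<end><tool_3>(3,"red")<end>
# into (tool_name, raw_args) pairs. Assumes generation was stopped on
# <end_of_turn>/<eos> (per the warning above) so every <end> is still present.
CALL_RE = re.compile(r"(<tool_\d+>)\((.*?)\)<end>")

def split_calls(raw: str, reverse_token_map: dict[str, str]) -> list[tuple[str, str]]:
    """reverse_token_map is the "reverse" dict from token_map.json."""
    return [(reverse_token_map.get(tok, tok), args)
            for tok, args in CALL_RE.findall(raw)]
```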
Quick start (Ollama)
hf download BrinqAI/functiongemma-270m-physical-ai \
functiongemma-physical-ai-Q4_K_M.gguf Modelfile tools.json token_map.json \
--local-dir ./fg-physical-ai
cd fg-physical-ai
ollama create functiongemma-physical-ai -f Modelfile
ollama create -f Modelfile is the documented install path because the
shipped Modelfile bakes in the stop tokens (<end>, <end_of_turn>,
<eos>) and decode parameters (temperature=0, num_ctx=1024,
num_predict=80). Direct ollama pull hf.co/... does not apply these,
and the function-token output will run past <end> until it hits
num_predict. Only use the direct-pull path if your client injects stops
at request time (the Python snippet below does this via options.stop).
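For reference, a Modelfile along these lines reproduces those settings (an illustrative sketch, not the shipped file verbatim: the FROM path is a placeholder, and the shipped Modelfile may also set a TEMPLATE and other fields):

```
# Illustrative sketch only; the shipped Modelfile is the source of truth.
FROM ./functiongemma-physical-ai-Q4_K_M.gguf

PARAMETER temperature 0
PARAMETER num_ctx 1024
PARAMETER num_predict 80
PARAMETER stop "<end>"
PARAMETER stop "<end_of_turn>"
PARAMETER stop "<eos>"
```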
Calling the model
The model expects prompts built via the FunctionGemma chat template
(developer role + user role, with the tools list passed in via
tokenizer.apply_chat_template(..., tools=tools)). Send to Ollama with
raw=true so it forwards the prompt verbatim. Plain ollama run from the
CLI does not pass tools and will degenerate to chat-style refusals.
Standalone client (depends on transformers for the chat template, plus
the tools.json and token_map.json files in the same directory):
import json
import re
import urllib.request

from transformers import AutoTokenizer

OLLAMA_URL = "http://localhost:11434"
MODEL = "functiongemma-physical-ai"

# tools.json / token_map.json ship alongside the GGUF (see Files below).
tools = json.load(open("tools.json"))["tools"]
reverse_token_map = json.load(open("token_map.json"))["reverse"]
tokenizer = AutoTokenizer.from_pretrained("google/functiongemma-270m-it")


def build_prompt(user_text: str) -> str:
    msgs = [
        {
            "role": "developer",
            "content": (
                "You are a model that can do function calling with the "
                "following functions\n"
            ),
            "tool_calls": None,
        },
        {"role": "user", "content": user_text, "tool_calls": None},
    ]
    return tokenizer.apply_chat_template(
        msgs, tools=tools, tokenize=False, add_generation_prompt=True
    )


def call_model(user_text: str) -> str:
    body = json.dumps({
        "model": MODEL,
        "prompt": build_prompt(user_text),
        "raw": True,  # forward the prompt verbatim, no Ollama templating
        "stream": False,
        "options": {
            "temperature": 0.0,
            "top_p": 1.0,
            "num_predict": 80,
            "stop": ["<end>", "<end_of_turn>", "<eos>"],
        },
    }).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["response"]


def parse_call(raw: str) -> tuple[str | None, str]:
    """Return (tool_name, raw_args_string). tool_name is None on parse fail."""
    # <end> is optional here: when it is used as a stop sequence (as above),
    # Ollama trims it from the returned text.
    m = re.match(r"\s*(<tool_\d+>)\((.*?)\)(?:<end>)?", raw)
    if not m:
        return None, ""
    tok, args = m.group(1), m.group(2)
    return reverse_token_map.get(tok), args


raw = call_model("Turn on the lights")
print(raw)              # e.g. <tool_0>()<end>
print(parse_call(raw))  # ('turn_on_lights', '')
Training data
v7 (current)
- Size: 2,000 train / 250 eval (coral_v7_compact.jsonl).
- Schema change: list_alarms removed. Out-of-scope alarm-query prompts ("what alarms do I have?") are deliberately routed through respond() rather than answered by a query tool. Compact token map shifted accordingly: get_system_status is now <tool_8> (was <tool_9>), respond is <tool_9> (was <tool_10>).
- Multi-tool: 84 of 250 eval rows (33.6%) are multi-tool sequences, matching the Google mobile-actions distribution.
- GGUF eval (Q5_K_M, greedy): overall 86.8% (217/250), single-tool 92.8% (154/166), multi-tool exact-match 75.0% (63/84), parse failure 0.0% (0/250). Per-tool F1 ranges from 0.74 (respond) to 1.00 (cancel_alarm).
- Known weak spots (informal on-device REPL): "tell me a joke" / "what alarms do I have" tend to misroute to play_buzzer instead of respond — more respond() negatives sharing keywords with physical-action tools would help in v8.
v6 (previous)
- Size: 1,400 train / 150 eval (v5/v6 dataset lineage, coral_v5_compact.jsonl).
- Tool count: 11. Cameras / vision tools dropped from earlier checkpoints; alarm-list tool kept.
v4 (legacy)
- Size: 367 train / 100 eval.
- Multi-tool: 13% (vs Google mobile-actions 33.4%).
- Buzzer schema: pattern-only (binary GPIO on the reference HAT — no PWM). Old frequency_hz / duration_seconds prompts are routed through respond() as out-of-scope negatives.
Methodology
This model uses the functional-token approach introduced by Octopus v2 (Chen and Li, 2024): special vocabulary tokens are added for each callable function so a tool call decodes in a single output token rather than a multi-token JSON string. On-device this collapses ~30–80-token native FunctionGemma calls down to ~8–15 tokens, enabling sub-second decode on a 2-core Cortex-A55.
The training recipe is a direct port of Brinq's SmartPanel v14 trainer (full bf16, mean-init for new tokens, completion-only loss mask), adapted for a smaller dataset:
- Full bf16 fine-tune (no LoRA).
- Mean-init for new <tool_0>..<tool_9> and <end> special tokens (init = mean of existing input embeddings; random init under-converges for tiny models on small datasets). See the sketch after this list.
- Completion-only loss mask: hand-rolled, masking everything before <start_of_turn>model\n. TRL 0.25's completion_only_loss=True is a no-op on flat-text data, and FunctionGemma's chat template lacks the {% generation %} markers required for assistant_only_loss.
- 8 epochs, lr 3e-5, cosine schedule, 0.1 warmup. (2,000 examples in v7 — fewer epochs than v4's 15 because dataset size grew 5×.)
- Tool-token loss weight 4.0 to keep the new function tokens learning faster than the rest of the vocabulary (Gemma3's 262k vocab dilutes the signal otherwise).
- Effective batch 16 = per_device_train_batch_size=2 × gradient_accumulation_steps=8 (kept this way to avoid the 8 GiB cross-entropy logit allocation OOM that bites Gemma3's 262k vocab).
- max_length=1024 to fit the full 13-tool schema in the prompt.
- bf16, gradient checkpointing, adamw_torch_fused, weight_decay=0.01.
- Trained inside the unsloth/unsloth:latest Docker container with GPU passthrough.
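A minimal sketch of the token-addition and mean-init step (illustrative, not the SmartPanel trainer itself; it assumes the v7 token set and that the base model ties input and output embeddings, as Gemma does):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hedged sketch: add the function tokens and mean-initialize their embeddings.
tok = AutoTokenizer.from_pretrained("google/functiongemma-270m-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/functiongemma-270m-it", torch_dtype=torch.bfloat16
)

new_tokens = [f"<tool_{i}>" for i in range(10)] + ["<end>"]  # v7 token set
tok.add_special_tokens({"additional_special_tokens": new_tokens})
model.resize_token_embeddings(len(tok))

with torch.no_grad():
    emb = model.get_input_embeddings().weight           # (vocab_size, hidden)
    mean_vec = emb[: -len(new_tokens)].mean(dim=0)       # mean of pre-existing rows
    emb[-len(new_tokens):] = mean_vec                    # mean-init the new rows
    # With tied embeddings the LM head shares these rows; if untied,
    # mean-init model.get_output_embeddings().weight the same way.
```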
Citation
@article{chen2024octopusv2,
title = {Octopus v2: On-device language model for super agent},
author = {Chen, Wei and Li, Zhiyuan},
journal = {arXiv preprint arXiv:2404.01744},
year = {2024},
url = {https://arxiv.org/abs/2404.01744}
}
Eval results
v7 checkpoint (2,000 train / 250 eval), Q5_K_M GGUF, greedy decode:
| Metric | Result |
|---|---|
| Overall accuracy | 217 / 250 = 86.8% |
| Single-tool accuracy | 154 / 166 = 92.8% |
| Multi-tool exact-match | 63 / 84 = 75.0% |
| Parse failure rate | 0 / 250 = 0.0% |
Per-tool F1: cancel_alarm 1.00, get_system_status 0.96, set_alarm 0.93,
set_neopixel_pattern 0.92, turn_on_lights 0.90, blink_lights 0.89,
turn_off_lights 0.89, set_led_color 0.88, play_buzzer 0.83,
respond 0.74. (respond is the lowest because the model occasionally
chooses a physical-action tool with a hallucinated text argument when the
prompt shares keywords with one — an issue the dispatcher's enum validation
catches at runtime.)
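To illustrate that runtime guard, a dispatcher-side check might look like the sketch below; it assumes tools.json exposes JSON-schema-style parameters with optional enum lists (the actual dispatcher and schema layout may differ):

```python
import json

# Hedged sketch of an enum check applied before executing a parsed call.
# Assumes each tool entry has a JSON-schema-style "parameters" block with
# "properties" and optional "enum" lists; adjust to the real tools.json layout.
TOOLS = {t["name"]: t for t in json.load(open("tools.json"))["tools"]}

def args_valid(tool_name: str, args: dict) -> bool:
    spec = TOOLS.get(tool_name)
    if spec is None:
        return False
    props = spec.get("parameters", {}).get("properties", {})
    for key, value in args.items():
        allowed = props.get(key, {}).get("enum")
        if allowed is not None and value not in allowed:
            return False  # e.g. a hallucinated color or free-text argument
    return True
```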
On-device latency (SL2619 / 2× Cortex-A55 @ 2 GHz, performance governor):
~42 s cold prefill (one-time), ~1.6 s / turn warm — measured across a 33-prompt
exhaustive REPL run on the actual Coralboard.
Latency
- ~1.1 – 1.3 s per call on a laptop CPU (Ollama / standalone client above).
- ~1.6 s / turn warm, ~42 s cold prefill on SL2619 (2× Cortex-A55 @ 2 GHz) with the CPU governor pinned to performance. Measured 2026-05-05 on the Grinn Coralboard with the v7 GGUF + the Function_calling/ demo from BrinqAI/sl2610-examples.
ONNX exports (for compiler toolchains)
For compiler-targeted backends (ONNX Runtime, IREE/MLIR, OpenVINO, TensorRT,
Synaptics Torq), the model is also published as ONNX with KV-cache support
(text-generation-with-past). Both exports are derived from the same
coral-functiongemma-v4c-compact checkpoint as the GGUF above.
| Path | Precision | Weight init dtype | Size | ORT runnable |
|---|---|---|---|---|
| onnx/compact-fp32/model.onnx | fp32 | 237 / 237 FLOAT | 1.7 GB | yes |
| onnx/compact-fp16/model.onnx | fp16 | 237 / 237 FLOAT16 | 833 MB | no — see note |
Both files are structurally valid (onnx.checker.check_model(..., full_check=True)
passes). Each export ships with the matching tokenizer and config.json so it
can be loaded directly:
from transformers import AutoTokenizer
import onnxruntime as ort
import numpy as np, json

MODEL = "onnx/compact-fp32"  # or downloaded local path
tok = AutoTokenizer.from_pretrained(MODEL)
sess = ort.InferenceSession(f"{MODEL}/model.onnx", providers=["CPUExecutionProvider"])
tools = json.load(open("tools.json"))["tools"]

prompt = tok.apply_chat_template(
    [{"role": "developer",
      "content": "You are a model that can do function calling with the following functions\n",
      "tool_calls": None},
     {"role": "user", "content": "Turn on the lights", "tool_calls": None}],
    tools=tools, tokenize=False, add_generation_prompt=True,
)

# Then feed input_ids + empty past_key_values.* (shape (1, num_kv_heads, 0, head_dim)),
# greedy-decode in a loop, stop on <end>. See repo for full snippet.
Smoke decode of "Turn on the lights" against the fp32 ONNX returns
<tool_0>()<end> (= turn_on_lights()), matching the GGUF output.
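The full decode loop lives in the repo; for orientation, here is a minimal greedy-decode sketch against the fp32 export. It assumes the optimum-style input/output naming (input_ids, attention_mask, position_ids, past_key_values.N.key/.value in; logits and present.N.key/.value out) and that <end> is a single token; check sess.get_inputs() / sess.get_outputs() for the real names and shapes.

```python
import numpy as np

# Hedged sketch of a KV-cache greedy decode; not the repo snippet.
# Reuses `sess`, `tok`, `prompt` from the block above.
def greedy_decode(sess, tok, prompt, max_new_tokens=80):
    end_id = tok.convert_tokens_to_ids("<end>")
    input_names = [i.name for i in sess.get_inputs()]
    output_names = [o.name for o in sess.get_outputs()]

    # Assumes the chat template already carries any special tokens (e.g. <bos>);
    # drop add_special_tokens=False if it does not.
    ids = tok(prompt, return_tensors="np", add_special_tokens=False)["input_ids"].astype(np.int64)
    past, generated, pos = {}, [], 0

    for _ in range(max_new_tokens):
        total_len = pos + ids.shape[1]
        feeds = {"input_ids": ids}
        if "attention_mask" in input_names:
            feeds["attention_mask"] = np.ones((1, total_len), dtype=np.int64)
        if "position_ids" in input_names:
            feeds["position_ids"] = np.arange(pos, total_len, dtype=np.int64)[None, :]
        for inp in sess.get_inputs():
            if inp.name.startswith("past_key_values"):
                if inp.name in past:
                    feeds[inp.name] = past[inp.name]
                else:
                    # empty cache on the first step: (1, num_kv_heads, 0, head_dim)
                    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
                    shape[0], shape[2] = 1, 0
                    feeds[inp.name] = np.zeros(shape, dtype=np.float32)
        outs = dict(zip(output_names, sess.run(None, feeds)))

        next_id = int(outs["logits"][0, -1].argmax())
        generated.append(next_id)
        if next_id == end_id:  # single-call demo; multi-tool output would stop on <end_of_turn>
            break
        past = {n.replace("present", "past_key_values"): outs[n]
                for n in output_names if n.startswith("present")}
        pos, ids = total_len, np.array([[next_id]], dtype=np.int64)

    return tok.decode(generated)

# print(greedy_decode(sess, tok, prompt))   # expected: <tool_0>()<end>
```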
fp16 + ONNX Runtime caveat
The fp16 ONNX file is structurally valid but does not currently load in
ONNX Runtime ≥ 1.20 for this model: ORT's SimplifiedLayerNormFusion pass
chokes on the InsertedPrecisionFreeCast_* nodes that the fp16 conversion
inserts around Gemma3's RMSNorm layers. The error is graph-optimizer-internal
and reproduces with ORT_DISABLE_ALL. This is an ORT bug, not an ONNX-spec
issue — the file passes onnx.checker and the graph is well-formed.
For compiler frontends that consume ONNX directly (IREE / MLIR, TensorRT,
OpenVINO, Synaptics Torq), the fp16 file should ingest fine. For runtime
inference via onnxruntime itself, use the fp32 export and let your compiler
or runtime do its own dtype conversion / quantization downstream.
Files
functiongemma-physical-ai-v7-Q5_K_M.gguf # 248 MB, GGUF Q5_K_M, 10-tool v7 schema (current)
functiongemma-physical-ai-v6-Q5_K_M.gguf # 248 MB, GGUF Q5_K_M, 11-tool v6 schema (previous)
functiongemma-physical-ai-Q4_K_M.gguf # 253 MB, GGUF Q4_K_M, v4c (legacy)
Modelfile # Ollama Modelfile (function-token format)
tools.json # 10-tool schema (mobile-actions format, current)
token_map.json # function-token <-> tool-name map
onnx/compact-fp32/ # ONNX export, fp32, with KV cache (1.7 GB)
onnx/compact-fp16/ # ONNX export, fp16, with KV cache (833 MB) — see ORT caveat above
README.md # this file
License
Released under the Gemma Terms of Use.
By using this model you agree to those terms. Base model:
google/functiongemma-270m-it.
Links
- Base model: https://huggingface.co/google/functiongemma-270m-it
- Octopus v2 paper: https://arxiv.org/abs/2404.01744
- Hardware demo (Coralboard, Google IO 2026 — full physical setup, WLED-over-USB-CDC, Grinn HAT, end-to-end voice + text REPL): https://github.com/BrinqAI/sl2610-examples/tree/coralboard/functiongemma/Function_calling (BrinqAI fork of the upstream Synaptics demo repo, synaptics-astra-demos/sl2610-examples).