FunctionGemma 270M — Physical AI (v10, Octopus v2)

A fine-tune of google/functiongemma-270m-it for voice-controlled physical-AI / household-IoT actions on a Synaptics SL2619 "Coral" edge board (Google I/O 2026 demo).

Current revision: functiongemma-physical-ai-v10-Q5_K_M.gguf — 6 tools, ~248 MB Q5_K_M, ~0.48 s cold prefill on the 2-core Cortex-A55, 97.9 % mean token accuracy on eval.

Schema ships as tools.json. Token-to-tool mapping is in token_map.json.

Tool surface (6 tools)

| Token | Name | Args | Purpose |
| --- | --- | --- | --- |
| <tool_0> | set_lights | color?, effect?, state? | Drive whatever lights are connected: HAT 3-LED indicators or a WLED-driven addressable strip / ring. All three args are optional; the model emits only what the user implied. |
| <tool_1> | play_buzzer | pattern | Named pattern on the piezo buzzer: beep, double_beep, chirp, siren, alarm, success, error. |
| <tool_2> | set_alarm | duration or time, label? | Schedule an alarm. Fires the buzzer plus a visible flash. |
| <tool_3> | cancel_alarm | label? | Cancel one alarm by label, or all alarms if no label is given. |
| <tool_4> | get_system_status | metric | cpu, memory, temperature, npu, or all. |
| <tool_5> | respond | message | Natural-language reply when no physical-action tool fits, or when the request is ambiguous and the model needs to ask for clarification. |

The model is hardware-agnostic for lighting: it parses user intent into semantic args (color, effect, state) and leaves the dispatcher to map those onto whatever LED hardware is detected at launch — the HAT's three indicator LEDs, a WLED-driven strip, or a Neopixel ring. The user vocabulary is hardware-agnostic too: "lights", "LEDs", "strip", "indicators" all refer to whatever is wired up.
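As a rough illustration of how a dispatcher might check a parsed call against the tool surface above, here is a minimal sketch. The `TOOL_ARGS` table and the `validate_call` helper are hypothetical: they are transcribed by hand from the tool list, not loaded from tools.json, whose actual layout is not shown in this README. Reading "duration or time" as "at least one of the two" is also an assumption.

```python
# Hypothetical validation table, transcribed by hand from the tool list.
# The real contract is tools.json; its layout may differ from this sketch.
BUZZER_PATTERNS = {"beep", "double_beep", "chirp", "siren", "alarm", "success", "error"}
STATUS_METRICS = {"cpu", "memory", "temperature", "npu", "all"}

TOOL_ARGS = {
    "set_lights": {"allowed": {"color", "effect", "state"}, "required": set()},
    "play_buzzer": {"allowed": {"pattern"}, "required": {"pattern"}},
    "set_alarm": {"allowed": {"duration", "time", "label"}, "required": set()},
    "cancel_alarm": {"allowed": {"label"}, "required": set()},
    "get_system_status": {"allowed": {"metric"}, "required": {"metric"}},
    "respond": {"allowed": {"message"}, "required": {"message"}},
}


def validate_call(tool: str, kwargs: dict) -> list:
    """Return a list of problems; an empty list means the call looks valid."""
    spec = TOOL_ARGS.get(tool)
    if spec is None:
        return [f"unknown tool: {tool}"]
    given = set(kwargs)
    problems = [f"unexpected arg: {name}" for name in sorted(given - spec["allowed"])]
    problems += [f"missing arg: {name}" for name in sorted(spec["required"] - given)]
    # Assumption: "duration or time" means at least one of the two is present.
    if tool == "set_alarm" and not ({"duration", "time"} & given):
        problems.append("set_alarm needs duration or time")
    # Enumerated values come from the Purpose column of the tool list.
    if tool == "play_buzzer" and "pattern" in kwargs and kwargs["pattern"] not in BUZZER_PATTERNS:
        problems.append(f"unknown pattern: {kwargs['pattern']}")
    if tool == "get_system_status" and "metric" in kwargs and kwargs["metric"] not in STATUS_METRICS:
        problems.append(f"unknown metric: {kwargs['metric']}")
    return problems
```

With this sketch, `validate_call("set_lights", {"color": "red"})` returns an empty list, while `validate_call("play_buzzer", {"pattern": "honk"})` reports the unknown pattern. Mapping valid `set_lights` args onto a specific LED backend is left out, since that depends on the hardware detected at launch.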

Prompt format

The v10 model is trained Octopus v2 style: no schema, no tools list, just a bare user turn.

<start_of_turn>user
{user_text}<end_of_turn>
<start_of_turn>model

Tool semantics live in the model weights (via the special functional tokens <tool_0> through <tool_5> plus <end>), not in the prompt. The tools.json schema in this repo is the dispatcher's arg-validation contract and is embedded in the GGUF metadata for schema-drift checks, but it is not loaded into the inference prompt. Typical prompts are ~13 tokens.

Output format — functional tokens, named args

Tool calls are emitted as functional tokens with named arguments, per the Mercedes-Benz Octopus v2 convention (arXiv 2501.02342). Each tool name compiles to a single special-vocabulary token (<tool_0> through <tool_5>); arguments are written as name="value" pairs; a single <end> token terminates the call. The model emits only the args the user implied; absent args are simply not present.

Examples:

| User says | Model emits | Resolves to |
| --- | --- | --- |
| turn the lights red | <tool_0>(color="red")<end> | set_lights(color="red") |
| rainbow on the strip | <tool_0>(effect="rainbow")<end> | set_lights(effect="rainbow") |
| lights off | <tool_0>(state="off")<end> | set_lights(state="off") |
| red sparkle | <tool_0>(color="red", effect="sparkle")<end> | set_lights(color="red", effect="sparkle") |
| set an alarm in 5 minutes | <tool_2>(duration="5 minutes")<end> | set_alarm(duration="5 minutes") |
| cancel all alarms | <tool_3>()<end> | cancel_alarm() |
| what's the cpu | <tool_4>(metric="cpu")<end> | get_system_status(metric="cpu") |
| good morning | <tool_5>(message="Good morning. ...")<end> | respond(message="...") |

A complete call decodes in roughly 8–20 output tokens, well inside the sub-second voice-UX budget on a 2-core Cortex-A55.

⚠️ Inference servers MUST stop generation on <end_of_turn> (or <eos>), NOT on <end>. The model can emit multi-tool sequences <tool_A>(args)<end><tool_B>(args)<end>, so stopping at the first <end> truncates legitimate multi-tool output.
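Because generation runs until <end_of_turn>, the raw output may contain more than one call. The following is a minimal sketch of splitting such output into an ordered list of calls. The `TOKEN_TO_TOOL` dictionary is a hypothetical stand-in transcribed from the tool table (the repo ships the real mapping in token_map.json), and `parse_calls` is an illustrative helper, not part of the shipped code.

```python
import re

# Hypothetical stand-in for token_map.json, transcribed from the tool table.
TOKEN_TO_TOOL = {
    "<tool_0>": "set_lights",
    "<tool_1>": "play_buzzer",
    "<tool_2>": "set_alarm",
    "<tool_3>": "cancel_alarm",
    "<tool_4>": "get_system_status",
    "<tool_5>": "respond",
}

# One call: a functional token, a parenthesised arg list, then <end>.
CALL_RE = re.compile(r"(<tool_\d+>)\((.*?)\)<end>", re.DOTALL)
# One named argument: name="value", allowing backslash-escaped characters.
ARG_RE = re.compile(r'(\w+)\s*=\s*"((?:[^"\\]|\\.)*)"')


def parse_calls(raw: str) -> list:
    """Return [(tool_name, kwargs), ...] for every call found, in order."""
    calls = []
    for match in CALL_RE.finditer(raw):
        name = TOKEN_TO_TOOL.get(match.group(1))
        if name is None:
            continue  # unknown functional token: skip rather than guess
        calls.append((name, dict(ARG_RE.findall(match.group(2)))))
    return calls


print(parse_calls('<tool_0>(color="red")<end><tool_1>(pattern="beep")<end>'))
# [('set_lights', {'color': 'red'}), ('play_buzzer', {'pattern': 'beep'})]
```

A dispatcher could then execute the returned calls in order. Output with no recognizable call yields an empty list, which the caller can treat as a parse failure.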

Quick start (Ollama)

hf download BrinqAI/functiongemma-270m-physical-ai \
  functiongemma-physical-ai-v10-Q5_K_M.gguf Modelfile tools.json token_map.json \
  --local-dir ./fg-physical-ai

cd fg-physical-ai
ollama create functiongemma-physical-ai -f Modelfile

The shipped Modelfile bakes in the stop tokens (<end_of_turn>, <eos>) and decode parameters (temperature=0, num_ctx=1024, num_predict=80).

Calling the model

Send a bare user turn — no schema, no tools list. With Ollama, use raw=true:

import json
import re
import urllib.request

OLLAMA_URL = "http://localhost:11434"
MODEL = "functiongemma-physical-ai"

# Maps functional tokens (e.g. "<tool_0>") back to tool names.
with open("token_map.json", encoding="utf-8") as f:
    reverse_token_map = json.load(f)["reverse"]

NAMED_ARG_RE = re.compile(r'(\w+)\s*=\s*"((?:[^"\\]|\\.)*)"')


def build_prompt(user_text: str) -> str:
    return (
        f"<start_of_turn>user\n{user_text}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )


def call_model(user_text: str) -> str:
    body = json.dumps({
        "model": MODEL,
        "prompt": build_prompt(user_text),
        "raw": True,
        "stream": False,
        "options": {
            "temperature": 0.0,
            "top_p": 1.0,
            "num_predict": 80,
            "stop": ["<end_of_turn>", "<eos>"],
        },
    }).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["response"]


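def parse_call(raw: str) -> tuple[str | None, dict[str, str]]:
    """Return (tool_name, kwargs) for the first call in raw.

    tool_name is None on parse failure. Any further calls in a
    multi-tool sequence are ignored by this helper.
    """
    # DOTALL lets a respond() message span more than one line.
    m = re.match(r"\s*(<tool_\d+>)\((.*?)\)<end>", raw, re.DOTALL)
    if not m:
        return None, {}
    tok, body = m.group(1), m.group(2)
    # Values are returned as written; backslash escapes are not decoded.
    kwargs = dict(NAMED_ARG_RE.findall(body))
    return reverse_token_map.get(tok), kwargs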


raw = call_model("turn the lights red")
print(raw)               # e.g. '<tool_0>(color="red")<end>'
print(parse_call(raw))   # ('set_lights', {'color': 'red'})

For llama-cpp-python directly, use detokenize(..., special=True) so the <tool_N> and <end> tokens render in the output instead of being stripped.

Training data

Training data was generated from Haiku-authored phrasing templates crossed with deterministic entity pools, then lightly augmented with Moonshine-flavored ASR noise (dropped function words, lowercased traces, filler-word prepends). Each record is a flat {input, output} pair — no tools / messages array, no chat template.

| Property | Value |
| --- | --- |
| Train rows | 5,222 |
| Eval rows | 920 |
| Tools | 6 |
| Per-template entity expansion | color × effect × state pools for set_lights; pattern pool for play_buzzer; duration / time pools for set_alarm; metric pool for get_system_status |
| ASR-style augmentation | Moonshine-sim noise on a fraction of records (dropped articles, lowercased traces, filler prepends) |
| Multi-tool fraction | None; single-tool emphasis, with multi-tool routines composed at dispatch time |

The set_lights tool also gets explicit failure-mode rows that route bare ambiguous prompts to respond() — e.g. "rainbow" alone ("Did you mean the lights? Try 'rainbow on the lights'."), "siren" alone (prompts the user toward play_buzzer), and bare "on" / "off" (asks what the user wants to act on).
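The template-times-pool generation described in this section can be sketched roughly as follows. Everything concrete here is a hypothetical placeholder: the templates, the color pool, the filler words, and the 50 % filler rate are invented for illustration, and the real generator is not part of this README. Only the flat {input, output} record shape and the three noise types come from the description above.

```python
import itertools
import random

# Hypothetical templates and pools, for illustration only.
TEMPLATES = ["turn the lights {color}", "make the strip {color} please"]
COLORS = ["red", "blue"]
FILLERS = ["um", "uh"]
ARTICLES = {"the", "a", "an"}


def expand(templates, colors):
    """Cross every template with every color and attach the target call."""
    for template, color in itertools.product(templates, colors):
        yield {
            "input": template.format(color=color),
            "output": f'<tool_0>(color="{color}")<end>',
        }


def add_asr_noise(text: str, rng: random.Random) -> str:
    """Lowercase the text, drop articles, and sometimes prepend a filler."""
    words = [w for w in text.lower().split() if w not in ARTICLES]
    if rng.random() < 0.5:  # assumed rate; the real fraction is not stated
        words.insert(0, rng.choice(FILLERS))
    return " ".join(words)


records = list(expand(TEMPLATES, COLORS))
rng = random.Random(0)  # fixed seed keeps the augmentation reproducible
noisy = [dict(r, input=add_asr_noise(r["input"], rng)) for r in records]
```

Two templates crossed with two colors give four clean records. The noisy copies keep the same output string, so the model sees the same target call for both the clean and the degraded phrasing.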

Methodology

  • Full bf16 fine-tune (no LoRA).
  • Functional tokens: <tool_0> through <tool_5> plus <end>, added as additional_special_tokens; new embeddings are mean-initialized from the existing input-embedding matrix (random init under-converges on small datasets at this scale).
  • Completion-only loss mask: hand-rolled — labels before <start_of_turn>model\n are masked to -100. The model learns only from the assistant turn, not the user prompt.
  • 5 epochs, lr 3e-5, cosine schedule, 0.1 warmup, weight decay 0.01.
  • Effective batch = 16 (per_device_train_batch_size=8 × gradient_accumulation_steps=2).
  • max_length=256 — the trained prompt format is ~13 tokens and the assistant turn fits comfortably under 64 tokens, including respond() messages.
  • bf16, gradient checkpointing, adamw_torch_fused, metric_for_best_model="eval_loss" + load_best_model_at_end=True.
  • Training wallclock: 5 min on a single H100 (~15–20 min on a 4090).
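The completion-only loss mask from the list above can be illustrated on plain lists of token ids. This is a sketch under stated assumptions, not the training script: `mask_prompt_labels` and `find_subsequence` are hypothetical helpers, the token ids are toy values, and masking the marker tokens themselves (not only what precedes them) is one reasonable reading of the description.

```python
IGNORE_INDEX = -100  # label value that the loss skips, as stated in the list above


def find_subsequence(haystack: list, needle: list) -> int:
    """Return the index of the first occurrence of needle, or -1 if absent."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


def mask_prompt_labels(input_ids: list, marker_ids: list) -> list:
    """Build labels from input_ids, masking the prompt up to and including the marker."""
    start = find_subsequence(input_ids, marker_ids)
    if start < 0:
        # No model turn found: mask the whole row so it contributes no loss.
        return [IGNORE_INDEX] * len(input_ids)
    cut = start + len(marker_ids)
    return [IGNORE_INDEX] * cut + input_ids[cut:]


# Toy ids: 1 = <start_of_turn>, 2 = "user", 3 = "model", 4 = "\n", 5 = <end_of_turn>
ids = [1, 2, 4, 50, 51, 5, 4, 1, 3, 4, 90, 91, 92, 5]
labels = mask_prompt_labels(ids, marker_ids=[1, 3, 4])
print(labels)
# [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 90, 91, 92, 5]
```

Only the four assistant-turn tokens keep real labels, so the gradient comes from the tool call and its terminator rather than from the user prompt.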

Citation

@article{chen2024octopusv2,
  title   = {Octopus v2: On-device language model for super agent},
  author  = {Chen, Wei and Li, Zhiyuan},
  journal = {arXiv preprint arXiv:2404.01744},
  year    = {2024},
  url     = {https://arxiv.org/abs/2404.01744}
}

@article{merc2025octopusv2,
  title   = {Octopus v2 named-arg function calling},
  journal = {arXiv preprint arXiv:2501.02342},
  year    = {2025},
  url     = {https://arxiv.org/abs/2501.02342}
}

Results

Training metrics (final epoch)

| Metric | Value |
| --- | --- |
| Final train loss | 0.493 |
| Final eval loss | 0.046 |
| Mean token accuracy (eval) | 97.9 % |

Held-out smoke test (post-train, 36 prompts spanning all 6 tools)

| Metric | Value |
| --- | --- |
| Smoke-test routing accuracy | 35 / 36 (97.2 %) |

The 36-prompt suite covers single-tool happy paths for every tool plus failure modes the model is expected to deflect: ambiguous color words without a target ("make it red"), effect names without a target ("rainbow"), unsupported features ("play a tone at 2000 hz"), and out-of-scope appliances. Failure-mode prompts all route to respond() with a helpful clarification message.

On-device benchmark (Coralboard, 2-core Cortex-A55 @ 2 GHz, Q5_K_M GGUF)

Measured with llama-cpp-python 0.3.16, n_ctx=1024, n_threads=2, CPU governor performance, 8 representative prompts spanning all 6 tools.

| Metric | Value |
| --- | --- |
| Model load | 2.23 s |
| Prompt tokens | 11–16 (mean ~13) |
| Cold prefill (turn 1) | 0.48 s |
| Warm prefill (turn 2+, avg) | 0.47 s |
| Decode rate | ~9.7 tok/s |
| Decode time, typical tool call (3–8 output tokens) | 0.3–0.8 s |
| Decode time, respond() (~25 output tokens) | ~2.6 s |
| End-to-end first turn (model load + prefill + decode) | ~3.4 s |
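The decode and end-to-end figures follow from the load time, prefill time, and decode rate in the same benchmark. A small sketch of that arithmetic, using only the measured numbers above (the function names are illustrative, and the model treats decode time as strictly linear in output tokens, which is a simplification):

```python
# Measured values taken from the benchmark above.
MODEL_LOAD_S = 2.23
COLD_PREFILL_S = 0.48
DECODE_TOK_PER_S = 9.7


def decode_seconds(n_tokens: int) -> float:
    """Estimated decode time, assuming a constant decode rate."""
    return n_tokens / DECODE_TOK_PER_S


def turn_seconds(n_tokens: int, include_load: bool = False) -> float:
    """Estimated prefill plus decode time, optionally including model load."""
    total = COLD_PREFILL_S + decode_seconds(n_tokens)
    return total + MODEL_LOAD_S if include_load else total


for n in (3, 8, 25):
    print(n, round(decode_seconds(n), 2), round(turn_seconds(n), 2))
```

Three and eight output tokens give about 0.31 s and 0.82 s of decode, and 25 tokens about 2.58 s, consistent with the benchmark rows. A first turn with model load and a 7-token call comes to about 3.43 s, in line with the ~3.4 s end-to-end figure.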

Files

functiongemma-physical-ai-v10-Q5_K_M.gguf  # ~248 MB, Q5_K_M weights (Ollama / llama.cpp)
Modelfile                                  # Ollama Modelfile (functional-token format)
tools.json                                 # 6-tool schema, canonical mobile-actions format
token_map.json                             # functional-token <-> tool-name map
README.md                                  # this file

Earlier checkpoint GGUFs from the project's development history (functiongemma-physical-ai-v9-Q5_K_M.gguf, functiongemma-physical-ai-v7-Q5_K_M.gguf, functiongemma-physical-ai-v6-Q5_K_M.gguf, functiongemma-physical-ai-Q4_K_M.gguf) remain in the repo for reproducibility. They use different tool surfaces and (for v7 and earlier) a different inference-prompt format; new deployments should use the v10 file above.

License

Released under the Gemma Terms of Use. By using this model you agree to those terms. Base model: google/functiongemma-270m-it.
