WELCOME TO FUNCTIONGEMMA

Welcome to the tutorial on FunctionGemma. This interactive experience will teach you everything about function calling, tokenization, and prompt engineering through hands-on experimentation.

WHAT YOU'LL LEARN

  • How tokenization works in language models
  • Why zero-shot function calling fails
  • How few-shot examples solve the problem
  • Token-level analysis and debugging
  • Best practices for prompt engineering
  • ONNX model optimization and deployment

ABOUT FUNCTIONGEMMA-270M-IT-ONNX

Model: onnx-community/functiongemma-270m-it-ONNX

Size: 270 million parameters

Purpose: Specialized for function calling tasks

Format: ONNX quantized (q4 for WebGPU, q8 for WASM)

Key Finding: Requires few-shot examples to generate correct function calls!

Architecture: Based on Google's Gemma 3 270M, fine-tuned for function calling

MODEL LOADED SUCCESSFULLY

The model has been automatically loaded and is ready to use. You can now proceed with the lessons!

TOKENIZATION BASICS

Tokenization is the process of converting text into tokens (numbers) that the model can understand. Let's explore this interactively!

WHAT IS TOKENIZATION?

Language models don't understand words directly. They work with tokens - numeric IDs that represent pieces of text. A token can be a word, part of a word, or even a single character.

TRY IT YOURSELF

// How tokenization works in code:
const text = "call:get_current_temperature";
// Tokenize the text
const tokens = await tokenizer.encode(text);
// Result: [6639, 236787, 828, 236779, 4002, 236779, 27495]
// Each number represents a token ID

// Decode tokens back to text
const decoded = await tokenizer.decode(tokens);
// Result: "call:get_current_temperature"

KEY INSIGHT

Special tokens like <start_function_call> have specific token IDs (e.g., token 48). The model uses these to understand structure.
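The special-token IDs mentioned in this tutorial can be kept in a small lookup table. This is a minimal sketch; the IDs (48, 49, 52) are the values reported in this tutorial, so verify them against your own tokenizer before relying on them:

```javascript
// Special-token IDs used by FunctionGemma's chat format
// (values as reported in this tutorial; verify against your tokenizer).
const SPECIAL_TOKENS = {
  "<start_function_call>": 48,
  "<end_function_call>": 49,
  "<escape>": 52,
};

// Reverse lookup: token ID -> token text
const ID_TO_TOKEN = Object.fromEntries(
  Object.entries(SPECIAL_TOKENS).map(([text, id]) => [id, text])
);

console.log(ID_TO_TOKEN[48]); // "<start_function_call>"
```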

ZERO-SHOT FUNCTION CALLING (WHY IT FAILS)

Zero-shot means asking the model to do something without showing it an example. Let's see what happens!

THE PROBLEM

Without examples, FunctionGemma generates "error:" instead of "call:" immediately after <start_function_call>.

TEST ZERO-SHOT APPROACH

// Zero-shot approach - NO examples provided
const messages = [
  {
    role: "developer",
    content: "You are a model that can do function calling..."
  },
  {
    role: "user",
    content: "What's the temperature in London?"
  }
  // โŒ No example shown to the model!
];

// Result: Model generates "error:" instead of "call:"
// Token 1899 ("error") is chosen instead of token 6639 ("call")

TOKEN ANALYSIS

After <start_function_call> (token 48), the model's probability distribution favors token 1899 ("error") over token 6639 ("call") when no example is provided.
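The probability shift described above can be illustrated with a softmax over two competing logits. The logit values below are invented for demonstration only; real values come from the model's output distribution at that position:

```javascript
// Illustrative sketch of the decision point after <start_function_call>.
// The logit values are INVENTED for demonstration; real values come
// from the model's logits at that generation step.
function softmax(logits) {
  const max = Math.max(...Object.values(logits));
  const exps = Object.fromEntries(
    Object.entries(logits).map(([tok, l]) => [tok, Math.exp(l - max)])
  );
  const sum = Object.values(exps).reduce((a, b) => a + b, 0);
  return Object.fromEntries(
    Object.entries(exps).map(([tok, e]) => [tok, e / sum])
  );
}

// Hypothetical logits for token 1899 ("error") vs token 6639 ("call")
const zeroShot = softmax({ error: 3.1, call: 1.4 }); // "error" dominates
const fewShot  = softmax({ error: 1.2, call: 4.0 }); // "call" dominates

console.log(zeroShot.error > zeroShot.call); // true
console.log(fewShot.call > fewShot.error);   // true
```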

ONE-SHOT FUNCTION CALLING (PARTIAL SUCCESS)

One-shot means showing the model ONE example. Let's see if this helps!

TEST ONE-SHOT APPROACH

// One-shot approach - ONE example provided
const messages = [
  {
    role: "developer",
    content: "You are a model that can do function calling..."
  },
  {
    role: "user",
    content: "What's the temperature in Paris?"
  },
  {
    role: "assistant",
    // ✅ ONE example showing correct format
    content: "<start_function_call>call:get_current_temperature{location:<escape>Paris<escape>}<end_function_call>"
  },
  {
    role: "user",
    content: "What's the temperature in Tokyo?"
  }
];

RESULTS MAY VARY

One-shot can work sometimes, but it's not as reliable as few-shot. The model needs more context to consistently generate correct function calls.
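One convenient way to compare the three approaches is a single helper that builds the message array with a configurable number of examples. This is a hypothetical helper (not part of any library); `examples` is assumed to be a list of { user, assistant } pairs:

```javascript
// Hypothetical helper: build the message array for zero-, one-, or
// few-shot prompting. `shots` limits how many examples are included
// (0 = zero-shot, 1 = one-shot, 2+ = few-shot).
function buildMessages(systemPrompt, examples, query, shots) {
  const messages = [{ role: "developer", content: systemPrompt }];
  for (const ex of examples.slice(0, shots)) {
    messages.push({ role: "user", content: ex.user });
    messages.push({ role: "assistant", content: ex.assistant });
  }
  messages.push({ role: "user", content: query });
  return messages;
}

const examples = [{
  user: "What's the temperature in Paris?",
  assistant: "<start_function_call>call:get_current_temperature{location:<escape>Paris<escape>}<end_function_call>",
}];

const oneShot = buildMessages(
  "You are a model that can do function calling...",
  examples,
  "What's the temperature in Tokyo?",
  1
);
console.log(oneShot.length); // 4 (system + example pair + query)
```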

FEW-SHOT FUNCTION CALLING (THE SOLUTION!)

Few-shot means showing the model multiple examples. This is the proven solution!

THE SOLUTION

By providing a few-shot example, we shift the model's token probabilities. Token 6639 ("call") becomes more likely than token 1899 ("error").

TEST FEW-SHOT APPROACH

// ✅ FEW-SHOT APPROACH (PROVEN TO WORK):
// Add example conversation showing correct format
const messages = [
  {
    role: "developer",
    content: "You are a model that can do function calling with the following functions"
  },
  {
    role: "user",
    content: "What's the temperature in Paris?"
  },
  {
    role: "assistant",
    // ✅ Example showing the EXACT format we want
    content: "<start_function_call>call:get_current_temperature{location:<escape>Paris<escape>}<end_function_call>"
  },
  {
    role: "user",
    content: query  // Your actual query
  }
];

// Apply chat template with tools
const inputs = await tokenizer.apply_chat_template(messages, {
  tools: [weatherFunction],
  tokenize: true,
  add_generation_prompt: true,
  return_dict: true
});

// Generate response (greedy decoding; with do_sample: false,
// temperature has no effect, so it is omitted)
const output = await model.generate({
  ...inputs,
  max_new_tokens: 512,
  do_sample: false
});

// ✅ Result: a correct function call is generated. For example, with
// query = "What's the temperature in New York?":
// <start_function_call>call:get_current_temperature{location:<escape>New York<escape>}<end_function_call>

WHY FEW-SHOT WORKS

  • Shows the model the exact format we expect
  • Shifts token probabilities in favor of "call:" instead of "error:"
  • Provides context about the task structure
  • Works consistently with the quantized ONNX model
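Once the model emits a function call in the format shown above, the application still has to parse it. Here is a minimal regex-based sketch for the single-call, string-argument format used throughout this tutorial (not a full grammar):

```javascript
// Sketch: parse the function-call format used in this tutorial:
// <start_function_call>call:NAME{key:<escape>value<escape>, ...}<end_function_call>
// Handles the flat string-argument case shown above, nothing more.
function parseFunctionCall(text) {
  const m = text.match(
    /<start_function_call>call:(\w+)\{(.*)\}<end_function_call>/
  );
  if (!m) return null; // zero-shot "error:" output lands here
  const [, name, body] = m;
  const args = {};
  for (const pair of body.matchAll(/(\w+):<escape>(.*?)<escape>/g)) {
    args[pair[1]] = pair[2];
  }
  return { name, args };
}

const call = parseFunctionCall(
  "<start_function_call>call:get_current_temperature{location:<escape>London<escape>}<end_function_call>"
);
console.log(call.name);          // "get_current_temperature"
console.log(call.args.location); // "London"
```

A `null` return cleanly distinguishes the zero-shot failure mode (the model writing `error:` instead of `call:`) from a parseable call.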

TOKEN-LEVEL DEEP DIVE

Let's examine what happens at the token level when the model generates function calls.

TOKEN-LEVEL ANALYSIS

TOKEN ID   TOKEN TEXT               CONTEXT                    PROBABILITY SHIFT
48         <start_function_call>    Always correct             N/A
1899       "error"                  Zero-shot (no example)     ❌ High probability
6639       "call"                   Few-shot (with example)    ✅ High probability
236787     ":"                      Always correct             N/A
// Token-level analysis of generated output
// The generated tokens, in order:

// Token 48: "<start_function_call>" ✅
// Token 6639: "call" ✅ (with few-shot) or Token 1899: "error" ❌ (zero-shot)
// Token 236787: ":" ✅
// Token 828: "get" ✅
// Token 236779: "_" ✅
// Token 4002: "current" ✅
// Token 236779: "_" ✅
// Token 27495: "temperature" ✅
// Token 236782: "{" ✅
// Token 7125: "location" ✅
// Token 236787: ":" ✅
// Token 52: "<escape>" ✅
// Token 27822: "London" ✅
// Token 52: "<escape>" ✅
// Token 236783: "}" ✅
// Token 49: "<end_function_call>" ✅

// The critical decision point is after token 48:
// - Without example: Token 1899 ("error") is more likely
// - With example: Token 6639 ("call") is more likely

HYPOTHESIS

The model was trained on function calling data that included error handling examples. Without context, it defaults to the error generation pattern. Few-shot examples provide the necessary context to trigger the correct generation path.

INTERACTIVE PLAYGROUND

Now it's your turn! Experiment with different queries and see how the model responds. Add your own examples to test zero-shot, one-shot, and few-shot approaches.

FUNCTION SCHEMA

SYSTEM MESSAGE

YOUR QUERY

CUSTOM EXAMPLES

Add example conversations to use in one-shot and few-shot modes. Each example should show a user query and the expected assistant response containing the function call.

EXAMPLE FORMAT

User: "What's the temperature in Paris?"
Assistant: "<start_function_call>call:get_current_temperature{location:<escape>Paris<escape>}<end_function_call>"

For few-shot, add multiple examples. For one-shot, only the first example will be used. For zero-shot, no examples are used.
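Before feeding a custom example into the playground, it helps to check that the assistant turn actually uses the markers the model expects. This is a hypothetical validator, written against the example format shown above:

```javascript
// Hypothetical validator for custom playground examples: checks that
// the assistant turn uses the function-call markers from this tutorial.
function isValidExample(example) {
  return (
    typeof example.user === "string" &&
    typeof example.assistant === "string" &&
    example.assistant.startsWith("<start_function_call>call:") &&
    example.assistant.endsWith("<end_function_call>")
  );
}

console.log(isValidExample({
  user: "What's the temperature in Paris?",
  assistant: "<start_function_call>call:get_current_temperature{location:<escape>Paris<escape>}<end_function_call>",
})); // true

console.log(isValidExample({
  user: "What's the temperature in Paris?",
  assistant: "The temperature in Paris is 20°C.", // plain text, no markers
})); // false
```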

OUTPUT

TIPS FOR EXPERIMENTATION

  • Try different cities and locations
  • Compare zero-shot vs few-shot results
  • Modify the function schema and see what happens
  • Add custom examples to test different scenarios
  • Watch the token visualization to understand the generation process
  • Experiment with different max_tokens values

RESOURCES & LINKS

Explore these resources to deepen your understanding of FunctionGemma, ONNX, and function calling.

KEY RESOURCES SUMMARY

FunctionGemma is a specialized 270M parameter model fine-tuned from Google's Gemma 3 for function calling tasks. It's optimized for edge deployment and requires few-shot examples for reliable function call generation.

ONNX (Open Neural Network Exchange) is an open format for representing machine learning models, enabling interoperability between different frameworks and optimized inference across platforms.

The model is available in quantized formats (q4 for WebGPU, q8 for WASM) to enable efficient browser-based inference.
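Selecting between the two quantized variants can be reduced to a small helper. The { device, dtype } option names mirror Transformers.js model-loading options, but treat this as a sketch and verify them against the library version you use; the q4/q8 split is the one described above:

```javascript
// Sketch: pick the quantization variant based on backend availability,
// following the q4-for-WebGPU / q8-for-WASM split described above.
// The { device, dtype } shape mirrors Transformers.js loading options;
// verify against your library version.
function chooseBackend(hasWebGPU) {
  return hasWebGPU
    ? { device: "webgpu", dtype: "q4" }
    : { device: "wasm", dtype: "q8" };
}

// In a browser, WebGPU support can be detected via navigator.gpu:
const hasWebGPU = typeof navigator !== "undefined" && !!navigator.gpu;
console.log(chooseBackend(hasWebGPU));
```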