
Nexus-TinyFunction-1.2B-v2.0

A fast, tiny function-calling model fine-tuned from LiquidAI/LFM2.5-1.2B-Instruct. Built on LFM2.5's hybrid recurrent-attention architecture for significantly faster inference than transformer-only models of similar or even smaller size — fast enough to run on the CPU of a mobile phone.

No thinking trace or chain-of-thought required. As an instruct-tuned model, it produces accurate tool calls directly without verbose reasoning overhead, keeping latency low and token usage minimal.

Highlights

  • Blazing fast inference — hybrid recurrent-attention architecture runs faster than similarly-sized (and even smaller) pure transformer models on both GPU and CPU
  • No thinking trace needed — direct tool calls without chain-of-thought overhead, unlike reasoning-based models
  • Runs anywhere — Q4_K_M quantization fits in ~700 MB, fast enough for Android phones, Raspberry Pi, edge servers
  • Strong irrelevance detection (80.42%) — reliably refuses to call tools when no tool matches the query, avoiding hallucinated function calls
  • 94.25% simple function calling — accurate single-tool selection and argument extraction
  • 99.3% JSON syntax reliability — near-perfect structured output
  • Parallel & Multiple tool calling — handles complex multi-tool scenarios

Benchmark Results

BFCL V4 Benchmark: All Models (Q8_0 GGUF)

The following charts compare models we tested locally in Q8_0 GGUF quantization on the same hardware under identical conditions.

[Chart: BFCL Benchmark Comparison]

Average BFCL Score Ranking

[Chart: Average BFCL Score Ranking]

Inference Speed Comparison

[Chart: Inference Speed Comparison]

Head-to-Head: vs LFM2.5 Nova (Same Base Model)

Direct comparison with NovachronoAI/LFM2.5-1.2B-Nova-Function-Calling, the other LFM2.5-based function-calling fine-tune. Both models share the same base (LiquidAI/LFM2.5-1.2B-Instruct, BFCL V4 non-live avg: 24.8%).

[Chart: Head-to-Head vs Nova]

All scores from BFCL V4, Q8_0 GGUF quantization via llama-server on a single NVIDIA RTX 5090.

JSON Syntax Reliability


| Model | JSON Validity | Invalid | Tool Calls* |
|---|---|---|---|
| Nexus-TinyFunction-1.2B-v2.0 | 99.3% | 18 | 2,458 |
| xLAM-2 3B | 99.8% | 4 | 2,485 |
| xLAM-2 1B | 99.6% | 10 | 2,480 |
| Qwen3.5 4B | 99.1% | 21 | 2,423 |
| Qwen3.5 2B | 99.0% | 24 | 2,296 |
| Qwen3.5 0.8B | 98.9% | 25 | 2,220 |
| LFM2.5 Nova 1.2B | 98.3% | 23 | 1,334 |
| LFM2.5 Base 1.2B | 96.8% | 13 | 407 |

*Tool Calls = samples where the model attempted a tool call (out of 2,501 total per model). Models with fewer tool calls responded with plain text more often — the base model only attempted 407/2,501 calls.
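
The validity column can be recomputed from the two count columns; for example, for this model:

```python
# Recompute the JSON-validity figure for Nexus-TinyFunction from the table:
# validity = share of attempted tool calls whose JSON parsed successfully.
invalid, attempted = 18, 2458
validity = 100 * (1 - invalid / attempted)
print(f"{validity:.1f}%")  # 99.3%
```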

BFCL V4 Official Leaderboard Comparison

How does a 1.2B model compare to frontier API models? We evaluated on 5 of 8 BFCL V4 non-live categories (Python Simple, Multiple, Parallel, Parallel Multiple, Irrelevance Detection). Java Simple, JavaScript Simple, and the combined Simple AST average are excluded — we did not train on or evaluate these categories.

Transparency: Official leaderboard scores use API inference at full precision. Our scores are from Q8_0 GGUF quantization via llama-server. The # rank shown is each model's official rank across all 8 non-live categories.

[Chart: BFCL V4 Leaderboard Ranking]

[Chart: BFCL V4 Per-Category Comparison]

Why LFM2.5?

We chose LiquidAI/LFM2.5-1.2B-Instruct as the base model for several reasons:

  • Faster than transformers at any size — LFM2.5's hybrid recurrent-attention architecture achieves faster inference than pure transformer models of similar size, and even outpaces many smaller transformer models. The sub-quadratic scaling on sequence length is especially beneficial for function-calling workloads where tool definitions consume significant context.
  • Built for edge and mobile — At 1.2B parameters, the model runs on consumer hardware, Android phones, Raspberry Pi, and edge servers. The Q4_K_M quantization fits in under 700 MB of RAM.
  • Instruct-tuned, not reasoning-dependent — The base model is already instruct-tuned, so function calls are produced directly without chain-of-thought or thinking traces. This keeps latency low and avoids wasting tokens on reasoning overhead.
  • Massive improvement over base — The base LFM2.5-1.2B-Instruct averages just 24.8% on BFCL V4 non-live categories. Our fine-tune brings that to 85.2%, a gain of roughly 60 percentage points.

Model Details

| Field | Value |
|---|---|
| Developed by | Nexus-Syntegra |
| Base model | LiquidAI/LFM2.5-1.2B-Instruct |
| Architecture | Hybrid recurrent-attention (Lfm2ForCausalLM), 1.2B parameters |
| Context length | 32,768 tokens |
| License | Apache 2.0 (fine-tune); base model weights subject to LFM Open License v1.0 |
| Language | English |
| Fine-tune method | QLoRA SFT + 3-stage curriculum learning + DPO |
| Hardware | Single NVIDIA RTX 5090 (32 GB) — training, quantization, and all benchmarks |
| Format | ChatML with `<tools>` / `<tool_call>` XML tags |

Prompt Format

This model uses ChatML format with XML-tagged tool definitions and tool calls.

Important: Do not use apply_chat_template(tools=...) — the base model's chat template formats tools differently than our fine-tune expects. Instead, include the tools directly in the system message as shown below.

System Prompt with Tools

<|im_start|>system
You are a function calling AI assistant. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.

<tools>
[{"name": "get_weather", "description": "Get current weather for a location", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "City name"}}, "required": ["city"]}}]
</tools><|im_end|>

Single Tool Call

<|im_start|>user
What's the weather in Tokyo?<|im_end|>
<|im_start|>assistant
<tool_call>
{"name": "get_weather", "arguments": {"city": "Tokyo"}}
</tool_call><|im_end|>

Parallel Tool Calls

When multiple tools should be called, the model outputs them as a JSON array:

<|im_start|>user
What's the weather in Tokyo and London?<|im_end|>
<|im_start|>assistant
<tool_call>
[{"name": "get_weather", "arguments": {"city": "Tokyo"}}, {"name": "get_weather", "arguments": {"city": "London"}}]
</tool_call><|im_end|>

Irrelevance (No Tool Match)

When no tool matches the user's query, the model responds in plain text without any tool call tags:

<|im_start|>user
Tell me a joke<|im_end|>
<|im_start|>assistant
Sure! Why do programmers prefer dark mode? Because light attracts bugs!<|im_end|>
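
A client consuming this output needs to handle all three cases above: a single JSON object, a JSON array of parallel calls, and a plain-text reply with no tags. A minimal sketch of such a parser (`parse_tool_calls` is a hypothetical helper, not shipped with the model):

```python
import json
import re

def parse_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts, or [] for plain-text replies."""
    match = re.search(r"<tool_call>\s*(.*?)\s*</tool_call>", text, re.DOTALL)
    if match is None:
        return []  # irrelevance case: the model answered in plain text
    payload = json.loads(match.group(1))
    # Single calls are a JSON object; parallel calls are a JSON array.
    return payload if isinstance(payload, list) else [payload]

single = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Tokyo"}}\n</tool_call><|im_end|>'
print(parse_tool_calls(single))       # one call
print(parse_tool_calls("Sure! Light attracts bugs!"))  # [] -> plain-text reply
```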

How to Use

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import json

model_id = "nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto", trust_remote_code=True
)

tools = [{"name": "get_weather", "description": "Get weather for a city",
          "parameters": {"type": "object", "properties": {"city": {"type": "string"}},
                         "required": ["city"]}}]

system_prompt = (
    "You are a function calling AI assistant. You are provided with function "
    "signatures within <tools></tools> XML tags. You may call one or more functions "
    "to assist with the user query. Don't make assumptions about what values to "
    "plug into functions.\n\n<tools>\n" + json.dumps(tools) + "\n</tools>"
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "What's the weather in Tokyo and London?"},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
input_ids = input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=False))

With llama.cpp

GGUF quantizations are available at nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0-GGUF.

# Download the Q4_K_M quantization
huggingface-cli download nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0-GGUF \
  Nexus-TinyFunction-1.2B-v2.0-q4_k_m.gguf --local-dir .

# Run server with function calling support
./llama-server \
  --model Nexus-TinyFunction-1.2B-v2.0-q4_k_m.gguf \
  --jinja \
  --ctx-size 4096 \
  --port 8080
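
Once running, llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint. A minimal sketch of a request that embeds the tools in the system message, as this model expects (the URL and port match the server command above; sending is commented out, so nothing below requires a live server):

```python
import json
from urllib import request

# Tools go inline in the system prompt, exactly as in the Transformers example;
# do not pass them via the API's "tools" field.
tools = [{"name": "get_weather", "description": "Get weather for a city",
          "parameters": {"type": "object",
                         "properties": {"city": {"type": "string"}},
                         "required": ["city"]}}]
system_prompt = (
    "You are a function calling AI assistant. You are provided with function "
    "signatures within <tools></tools> XML tags. You may call one or more "
    "functions to assist with the user query. Don't make assumptions about "
    "what values to plug into functions.\n\n<tools>\n"
    + json.dumps(tools) + "\n</tools>"
)
payload = {
    "model": "Nexus-TinyFunction-1.2B-v2.0",
    "temperature": 0,
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What's the weather in Tokyo?"},
    ],
}
req = request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = request.urlopen(req)  # requires the server started above to be running
# print(json.load(resp)["choices"][0]["message"]["content"])
```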

With Ollama

# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./Nexus-TinyFunction-1.2B-v2.0-q4_k_m.gguf
PARAMETER temperature 0
PARAMETER num_ctx 4096
EOF

ollama create nexus-tinyfunction-1.2b-v2.0 -f Modelfile
ollama run nexus-tinyfunction-1.2b-v2.0

Training Details

Method

QLoRA Supervised Fine-Tuning (SFT) with 3-stage curriculum learning, followed by Direct Preference Optimization (DPO). Trained using Unsloth + TRL SFTTrainer.

Training Data

~38,500 curated examples from public datasets and synthetic augmentation:

| Dataset | Examples | Purpose |
|---|---|---|
| Public function-calling datasets | ~16,500 | General function calling and irrelevance detection |
| Synthetic (BFCL-derived + augmented) | ~22,000 | Edge cases, curriculum labels |
| Total | ~38,500 | |

Hyperparameters

| Parameter | Value |
|---|---|
| LoRA rank (r) | 128 |
| LoRA alpha | 128 |
| Target modules | q_proj, k_proj, v_proj, out_proj, in_proj, w1, w2, w3 |
| Effective batch size | 32 (2 × 16 gradient accumulation) |
| Learning rate (SFT) | 2e-5 (cosine schedule, 10% warmup) |
| Curriculum stages | 3 (foundation / disambiguation / adversarial) |
| DPO beta | 0.1 |
| DPO learning rate | 1e-6 |
| Precision | bf16 |
| Packing | Enabled |
| Hardware | Single NVIDIA RTX 5090 (32 GB) — all training, quantization, and benchmarks |
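
The DPO stage optimizes the standard DPO objective with beta = 0.1. As a minimal illustration of the per-example loss (the log-probabilities below are made-up values, not outputs of this model):

```python
import math

# Per-example DPO loss: -log sigmoid(beta * margin), where the margin is the
# policy-vs-reference log-prob gap on the chosen completion minus the same gap
# on the rejected one. beta = 0.1 is the value from the table above.
def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

print(dpo_loss(-12.0, -20.0, -14.0, -18.0))  # loss shrinks as the margin grows
```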

GGUF Quantizations

Quantized versions for llama.cpp, Ollama, and LM Studio are available at nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0-GGUF:

| Quantization | Size | Description |
|---|---|---|
| Q8_0 | ~1.2 GB | Near-lossless, highest quality |
| Q4_K_M | ~698 MB | Good quality/size balance, recommended for constrained hardware |
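
The file sizes can be sanity-checked from the parameter count and the approximate bits-per-weight of each scheme (the bpw values below are rough figures, and real GGUF files add metadata and mixed-precision tensors, so treat these as estimates only):

```python
# Rough size estimate: params * bits-per-weight / 8 bytes.
# bpw values are approximate effective figures, not exact llama.cpp constants.
params = 1.2e9
for name, bpw in [("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    mb = params * bpw / 8 / 1e6
    print(f"{name}: ~{mb:.0f} MB")
```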

Limitations

  • Parallel function calling is the weakest dimension — the model sometimes drops or merges parallel calls
  • Argument extraction for complex nested objects and optional parameters can be imprecise
  • English only — trained exclusively on English data
  • Context length — quality may degrade with very long tool lists near the 32K limit
  • Not suitable for safety-critical, medical, legal, or financial applications
  • Fine-tuning contributions licensed under Apache 2.0; base model weights remain subject to the LFM Open License v1.0 ($10M annual revenue commercial use threshold)

Acknowledgements

  • Liquid AI for the LFM2.5-1.2B-Instruct base model
  • Unsloth for efficient LoRA training

Citation

@misc{Nexus_TinyFunction_1_2B_v2_0,
  title = {Nexus-TinyFunction-1.2B-v2.0: Function Calling Fine-Tune of LFM2.5-1.2B},
  author = {Nexus-Syntegra},
  year = {2026},
  url = {https://huggingface.co/nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0},
  note = {Fine-tuned from LiquidAI/LFM2.5-1.2B-Instruct for function calling}
}