Nexus-TinyFunction-1.2B-v2.0
A fast, tiny function-calling model fine-tuned from LiquidAI/LFM2.5-1.2B-Instruct. Built on LFM2.5's hybrid recurrent-attention architecture for significantly faster inference than transformer-only models of similar or even smaller size — fast enough to run on the CPU of a mobile phone.
No thinking trace or chain-of-thought required. As an instruct-tuned model, it produces accurate tool calls directly without verbose reasoning overhead, keeping latency low and token usage minimal.
Highlights
- Blazing fast inference — hybrid recurrent-attention architecture runs faster than similarly-sized (and even smaller) pure transformer models on both GPU and CPU
- No thinking trace needed — direct tool calls without chain-of-thought overhead, unlike reasoning-based models
- Runs anywhere — Q4_K_M quantization fits in ~700 MB, fast enough for Android phones, Raspberry Pi, edge servers
- Strong irrelevance detection (80.42%) — reliably refuses to call tools when no tool matches the query, avoiding hallucinated function calls
- 94.25% simple function calling — accurate single-tool selection and argument extraction
- JSON Syntax Reliability: 99.3% — near-perfect structured output
- Parallel & Multiple tool calling — handles complex multi-tool scenarios
Benchmark Results
BFCL V4 Benchmark: All Models (Q8_0 GGUF)
The following charts compare models we tested locally in Q8_0 GGUF quantization on the same hardware under identical conditions.
Average BFCL Score Ranking
Inference Speed Comparison
Head-to-Head: vs LFM2.5 Nova (Same Base Model)
Direct comparison with NovachronoAI/LFM2.5-1.2B-Nova-Function-Calling, the other LFM2.5-based function-calling fine-tune. Both models share the same base (LiquidAI/LFM2.5-1.2B-Instruct, BFCL V4 non-live avg: 24.8%).
All scores from BFCL V4, Q8_0 GGUF quantization via llama-server on a single NVIDIA RTX 5090.
JSON Syntax Reliability
| Model | JSON Validity | Invalid | Tool Calls* |
|---|---|---|---|
| Nexus-TinyFunction-1.2B-v2.0 | 99.3% | 18 | 2458 |
| xLAM-2 3B | 99.8% | 4 | 2485 |
| xLAM-2 1B | 99.6% | 10 | 2480 |
| Qwen3.5 4B | 99.1% | 21 | 2423 |
| Qwen3.5 2B | 99.0% | 24 | 2296 |
| Qwen3.5 0.8B | 98.9% | 25 | 2220 |
| LFM2.5 Nova 1.2B | 98.3% | 23 | 1334 |
| LFM2.5 Base 1.2B | 96.8% | 13 | 407 |
*Tool Calls = samples where the model attempted a tool call (out of 2,501 total per model). Models with fewer tool calls responded with plain text more often — the base model only attempted 407/2,501 calls.
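The validity check behind this table can be sketched in a few lines. This is a hypothetical helper (not part of the release code); it assumes each attempted tool call arrives as a raw JSON string, either a single object or an array of objects for parallel calls:

```python
import json

def is_valid_tool_call(payload: str) -> bool:
    """Return True if payload parses as a tool call: a dict (single call)
    or a list of dicts (parallel calls), each with "name" and "arguments"."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return False
    calls = data if isinstance(data, list) else [data]
    return all(
        isinstance(c, dict) and "name" in c and "arguments" in c for c in calls
    )

print(is_valid_tool_call('{"name": "get_weather", "arguments": {"city": "Tokyo"}}'))  # True
print(is_valid_tool_call('{"name": "get_weather", "arguments": '))                    # False (truncated JSON)
```

Counting valid calls this way over all attempted calls reproduces the percentages above, e.g. 2440 valid out of 2458 attempts rounds to 99.3%.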
BFCL V4 Official Leaderboard Comparison
How does a 1.2B model compare to frontier API models? We evaluated on 5 of 8 BFCL V4 non-live categories (Python Simple, Multiple, Parallel, Parallel Multiple, Irrelevance Detection). Java Simple, JavaScript Simple, and the combined Simple AST average are excluded — we did not train on or evaluate these categories.
Transparency: Official leaderboard scores use API inference at full precision. Our scores are from Q8_0 GGUF quantization via llama-server. The #rank shown is each model's official rank across all 8 non-live categories.
Why LFM2.5?
We chose LiquidAI/LFM2.5-1.2B-Instruct as the base model for several reasons:
- Faster than transformers at any size — LFM2.5's hybrid recurrent-attention architecture achieves faster inference than pure transformer models of similar size, and even outpaces many smaller transformer models. The sub-quadratic scaling on sequence length is especially beneficial for function-calling workloads where tool definitions consume significant context.
- Built for edge and mobile — At 1.2B parameters, the model runs on consumer hardware, Android phones, Raspberry Pi, and edge servers. The Q4_K_M quantization fits in under 700 MB of RAM.
- Instruct-tuned, not reasoning-dependent — The base model is already instruct-tuned, so function calls are produced directly without chain-of-thought or thinking traces. This keeps latency low and avoids wasting tokens on reasoning overhead.
- Massive improvement over base — The base LFM2.5-1.2B-Instruct averages just 24.8% on BFCL V4 non-live categories. Our fine-tune brings that to 85.2%, a gain of roughly 60 percentage points.
Model Details
| Field | Value |
|---|---|
| Developed by | Nexus-Syntegra |
| Base model | LiquidAI/LFM2.5-1.2B-Instruct |
| Architecture | Hybrid recurrent-attention (Lfm2ForCausalLM), 1.2B parameters |
| Context length | 32,768 tokens |
| License | Apache 2.0 (fine-tune); base model weights subject to LFM Open License v1.0 |
| Language | English |
| Fine-tune method | QLoRA SFT + 3-stage curriculum learning + DPO |
| Hardware | Single NVIDIA RTX 5090 (32 GB) — training, quantization, and all benchmarks |
| Format | ChatML with `<tools>` / `<tool_call>` XML tags |
Prompt Format
This model uses ChatML format with XML-tagged tool definitions and tool calls.
Important: Do not use `apply_chat_template(tools=...)` — the base model's chat template formats tools differently than our fine-tune expects. Instead, include the tools directly in the system message as shown below.
System Prompt with Tools
<|im_start|>system
You are a function calling AI assistant. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.
<tools>
[{"name": "get_weather", "description": "Get current weather for a location", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "City name"}}, "required": ["city"]}}]
</tools><|im_end|>
Single Tool Call
<|im_start|>user
What's the weather in Tokyo?<|im_end|>
<|im_start|>assistant
<tool_call>
{"name": "get_weather", "arguments": {"city": "Tokyo"}}
</tool_call><|im_end|>
Parallel Tool Calls
When multiple tools should be called, the model outputs them as a JSON array:
<|im_start|>user
What's the weather in Tokyo and London?<|im_end|>
<|im_start|>assistant
<tool_call>
[{"name": "get_weather", "arguments": {"city": "Tokyo"}}, {"name": "get_weather", "arguments": {"city": "London"}}]
</tool_call><|im_end|>
Irrelevance (No Tool Match)
When no tool matches the user's query, the model responds in plain text without any tool call tags:
<|im_start|>user
Tell me a joke<|im_end|>
<|im_start|>assistant
Sure! Why do programmers prefer dark mode? Because light attracts bugs!<|im_end|>
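The three output shapes above (single call, parallel array, plain-text irrelevance response) can be handled by one small parser. This is a sketch, not part of the model release; it assumes the `<tool_call>` tags appear exactly as in the examples:

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def parse_response(text: str):
    """Return a list of {"name", "arguments"} dicts; an empty list means
    the model answered in plain text (no tool matched the query)."""
    m = TOOL_CALL_RE.search(text)
    if m is None:
        return []  # irrelevance case: plain-text reply, no tags
    data = json.loads(m.group(1))
    # Normalize: a single call is a dict, parallel calls are a JSON array
    return data if isinstance(data, list) else [data]

single = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Tokyo"}}\n</tool_call>'
parallel = ('<tool_call>\n[{"name": "get_weather", "arguments": {"city": "Tokyo"}}, '
            '{"name": "get_weather", "arguments": {"city": "London"}}]\n</tool_call>')
print(parse_response(single))    # one call
print(parse_response(parallel))  # two calls
print(parse_response("Sure! Why do programmers prefer dark mode?"))  # []
```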
How to Use
With Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import json
model_id = "nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_id, torch_dtype="bfloat16", device_map="auto", trust_remote_code=True
)
tools = [{"name": "get_weather", "description": "Get weather for a city",
"parameters": {"type": "object", "properties": {"city": {"type": "string"}},
"required": ["city"]}}]
system_prompt = (
"You are a function calling AI assistant. You are provided with function "
"signatures within <tools></tools> XML tags. You may call one or more functions "
"to assist with the user query. Don't make assumptions about what values to "
"plug into functions.\n\n<tools>\n" + json.dumps(tools) + "\n</tools>"
)
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": "What's the weather in Tokyo and London?"},
]
input_ids = tokenizer.apply_chat_template(
messages, add_generation_prompt=True, return_tensors="pt"
)
input_ids = input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=False))
With llama.cpp
GGUF quantizations are available at nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0-GGUF.
# Download the Q4_K_M quantization
huggingface-cli download nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0-GGUF \
Nexus-TinyFunction-1.2B-v2.0-q4_k_m.gguf --local-dir .
# Run server with function calling support
./llama-server \
--model Nexus-TinyFunction-1.2B-v2.0-q4_k_m.gguf \
--jinja \
--ctx-size 4096 \
--port 8080
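llama-server exposes an OpenAI-compatible `/v1/chat/completions` endpoint. Since this fine-tune expects tools embedded in the system message (not passed via a `tools` field), the request can be built as below. This is a sketch using only the standard library; the host, port, and model name are assumptions matching the server command above:

```python
import json
import urllib.request

tools = [{"name": "get_weather", "description": "Get weather for a city",
          "parameters": {"type": "object",
                         "properties": {"city": {"type": "string"}},
                         "required": ["city"]}}]

# Same system prompt as the Transformers example: tools go inside <tools> tags
system_prompt = (
    "You are a function calling AI assistant. You are provided with function "
    "signatures within <tools></tools> XML tags. You may call one or more functions "
    "to assist with the user query. Don't make assumptions about what values to "
    "plug into functions.\n\n<tools>\n" + json.dumps(tools) + "\n</tools>"
)

payload = {
    "model": "Nexus-TinyFunction-1.2B-v2.0",
    "temperature": 0,
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What's the weather in Tokyo?"},
    ],
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # uncomment with llama-server running
# print(json.load(resp)["choices"][0]["message"]["content"])
```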
With Ollama
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./Nexus-TinyFunction-1.2B-v2.0-q4_k_m.gguf
PARAMETER temperature 0
PARAMETER num_ctx 4096
EOF
ollama create nexus-tinyfunction-1.2b-v2.0 -f Modelfile
ollama run nexus-tinyfunction-1.2b-v2.0
Training Details
Method
QLoRA Supervised Fine-Tuning (SFT) with 3-stage curriculum learning, followed by Direct Preference Optimization (DPO). Trained using Unsloth + TRL SFTTrainer.
Training Data
~38,500 curated examples from public datasets and synthetic augmentation:
| Dataset | Examples | Purpose |
|---|---|---|
| Public function-calling datasets | ~16,500 | General function calling and irrelevance detection |
| Synthetic (BFCL-derived + augmented) | ~22,000 | Edge cases, curriculum labels |
| Total | ~38,500 | |
Hyperparameters
| Parameter | Value |
|---|---|
| LoRA rank (r) | 128 |
| LoRA alpha | 128 |
| Target modules | q_proj, k_proj, v_proj, out_proj, in_proj, w1, w2, w3 |
| Effective batch size | 32 (2 x 16 gradient accumulation) |
| Learning rate (SFT) | 2e-5 (cosine with 10% warmup) |
| Curriculum stages | 3 (foundation / disambiguation / adversarial) |
| DPO beta | 0.1 |
| DPO learning rate | 1e-6 |
| Precision | bf16 |
| Packing | Enabled |
| Hardware | Single NVIDIA RTX 5090 (32 GB) — all training, quantization, and benchmarks |
GGUF Quantizations
Quantized versions for llama.cpp, Ollama, and LM Studio are available at nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0-GGUF:
| Quantization | Size | Description |
|---|---|---|
| Q8_0 | ~1.2 GB | Near-lossless, highest quality |
| Q4_K_M | ~698 MB | Good quality/size balance, recommended for constrained hardware |
Limitations
- Parallel function calling is the weakest dimension — the model sometimes drops or merges parallel calls
- Argument extraction for complex nested objects and optional parameters can be imprecise
- English only — trained exclusively on English data
- Context length — quality may degrade with very long tool lists near the 32K limit
- Not suitable for safety-critical, medical, legal, or financial applications
- Fine-tuning contributions licensed under Apache 2.0; base model weights remain subject to the LFM Open License v1.0 ($10M annual revenue commercial use threshold)
Citation
@misc{Nexus_TinyFunction_1_2B_v2_0,
title = {Nexus-TinyFunction-1.2B-v2.0: Function Calling Fine-Tune of LFM2.5-1.2B},
author = {Nexus-Syntegra},
year = {2026},
url = {https://huggingface.co/nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0},
note = {Fine-tuned from LiquidAI/LFM2.5-1.2B-Instruct for function calling}
}
Evaluation results (BFCL V4, self-reported)
| Category | Score |
|---|---|
| Simple Function Calling | 94.25 |
| Multiple Function Calling | 91.50 |
| Parallel Function Calling | 81.50 |
| Parallel Multiple | 78.50 |
| Irrelevance Detection | 80.42 |