Nexus-TinyFunction-1.2B-v2.0
A fast, tiny function-calling model fine-tuned from LiquidAI/LFM2.5-1.2B-Instruct. Built on LFM2.5's hybrid recurrent-attention architecture for significantly faster inference than transformer-only models of similar or even smaller size — fast enough to run on the CPU of a mobile phone.
No thinking trace or chain-of-thought required. As an instruct-tuned model, it produces accurate tool calls directly without verbose reasoning overhead, keeping latency low and token usage minimal.
Highlights
- Blazing fast inference — hybrid recurrent-attention architecture runs faster than similarly-sized (and even smaller) pure transformer models on both GPU and CPU
- No thinking trace needed — direct tool calls without chain-of-thought overhead, unlike reasoning-based models
- Runs anywhere — Q4_K_M quantization fits in ~700 MB, fast enough for Android phones, Raspberry Pi, edge servers
- Strong irrelevance detection (80.42%) — reliably refuses to call tools when no tool matches the query, avoiding hallucinated function calls
- 94.25% simple function calling — accurate single-tool selection and argument extraction
- JSON Syntax Reliability: 99.3% — near-perfect structured output
- Parallel & Multiple tool calling — handles complex multi-tool scenarios
Benchmark Results
BFCL V4 Benchmark: All Models (Q8_0 GGUF)
The following charts compare models we tested locally in Q8_0 GGUF quantization on the same hardware under identical conditions.
Average BFCL Score Ranking
Inference Speed Comparison
Head-to-Head: vs LFM2.5 Nova (Same Base Model)
Direct comparison with NovachronoAI/LFM2.5-1.2B-Nova-Function-Calling, the other LFM2.5-based function-calling fine-tune. Both models share the same base (LiquidAI/LFM2.5-1.2B-Instruct, BFCL V4 non-live avg: 24.8%).
All scores from BFCL V4, Q8_0 GGUF quantization via llama-server on a single NVIDIA RTX 5090.
JSON Syntax Reliability
| Model | JSON Validity | Invalid | Tool Calls* |
|---|---|---|---|
| Nexus-TinyFunction-1.2B-v2.0 | 99.3% | 18 | 2458 |
| xLAM-2 3B | 99.8% | 4 | 2485 |
| xLAM-2 1B | 99.6% | 10 | 2480 |
| Qwen3.5 4B | 99.1% | 21 | 2423 |
| Qwen3.5 2B | 99.0% | 24 | 2296 |
| Qwen3.5 0.8B | 98.9% | 25 | 2220 |
| LFM2.5 Nova 1.2B | 98.3% | 23 | 1334 |
| LFM2.5 Base 1.2B | 96.8% | 13 | 407 |
*Tool Calls = samples where the model attempted a tool call (out of 2,501 total per model). Models with fewer tool calls responded with plain text more often — the base model only attempted 407/2,501 calls.
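The validity check behind this table can be sketched in a few lines. This is a hypothetical helper (not part of the release code); it assumes each attempted tool call arrives as a raw JSON string, either a single object or an array of objects for parallel calls:

```python
import json

def is_valid_tool_call(payload: str) -> bool:
    """Return True if payload parses as a tool call: a dict (single call)
    or a list of dicts (parallel calls), each with "name" and "arguments"."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return False
    calls = data if isinstance(data, list) else [data]
    return all(
        isinstance(c, dict) and "name" in c and "arguments" in c for c in calls
    )

print(is_valid_tool_call('{"name": "get_weather", "arguments": {"city": "Tokyo"}}'))  # True
print(is_valid_tool_call('{"name": "get_weather", "arguments": '))                    # False (truncated JSON)
```

Counting valid calls this way over all attempted calls reproduces the percentages above, e.g. 2440 valid out of 2458 attempts rounds to 99.3%.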
BFCL V4 Official Leaderboard Comparison
How does a 1.2B model compare to frontier API models? We evaluated on 5 of 8 BFCL V4 non-live categories (Python Simple, Multiple, Parallel, Parallel Multiple, Irrelevance Detection). Java Simple, JavaScript Simple, and the combined Simple AST average are excluded — we did not train on or evaluate these categories.
Transparency: Official leaderboard scores use API inference at full precision. Our scores are from Q8_0 GGUF quantization via llama-server. The #rank shown is each model's official rank across all 8 non-live categories.
Why LFM2.5?
We chose LiquidAI/LFM2.5-1.2B-Instruct as the base model for several reasons:
- Faster than transformers at any size — LFM2.5's hybrid recurrent-attention architecture achieves faster inference than pure transformer models of similar size, and even outpaces many smaller transformer models. The sub-quadratic scaling on sequence length is especially beneficial for function-calling workloads where tool definitions consume significant context.
- Built for edge and mobile — At 1.2B parameters, the model runs on consumer hardware, Android phones, Raspberry Pi, and edge servers. The Q4_K_M quantization fits in under 700 MB of RAM.
- Instruct-tuned, not reasoning-dependent — The base model is already instruct-tuned, so function calls are produced directly without chain-of-thought or thinking traces. This keeps latency low and avoids wasting tokens on reasoning overhead.
- Massive improvement over base — The base LFM2.5-1.2B-Instruct averages just 24.8% on BFCL V4 non-live categories. Our fine-tune brings that to 85.2%, a gain of roughly 60 percentage points.
Model Details
| Field | Value |
|---|---|
| Developed by | Nexus-Syntegra |
| Base model | LiquidAI/LFM2.5-1.2B-Instruct |
| Architecture | Hybrid recurrent-attention (Lfm2ForCausalLM), 1.2B parameters |
| Context length | 32,768 tokens |
| License | Apache 2.0 (fine-tune); base model weights subject to LFM Open License v1.0 |
| Language | English |
| Fine-tune method | QLoRA SFT + 3-stage curriculum learning + DPO |
| Hardware | Single NVIDIA RTX 5090 (32 GB) — training, quantization, and all benchmarks |
| Format | ChatML with `<tools>` / `<tool_call>` XML tags |
Prompt Format
This model uses ChatML format with XML-tagged tool definitions and tool calls.
Important: Do not use `apply_chat_template(tools=...)` — the base model's chat template formats tools differently than our fine-tune expects. Instead, include the tools directly in the system message as shown below.
System Prompt with Tools
<|im_start|>system
You are a function calling AI assistant. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.
<tools>
[{"name": "get_weather", "description": "Get current weather for a location", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "City name"}}, "required": ["city"]}}]
</tools><|im_end|>
Single Tool Call
<|im_start|>user
What's the weather in Tokyo?<|im_end|>
<|im_start|>assistant
<tool_call>
{"name": "get_weather", "arguments": {"city": "Tokyo"}}
</tool_call><|im_end|>
Parallel Tool Calls
When multiple tools should be called, the model outputs them as a JSON array:
<|im_start|>user
What's the weather in Tokyo and London?<|im_end|>
<|im_start|>assistant
<tool_call>
[{"name": "get_weather", "arguments": {"city": "Tokyo"}}, {"name": "get_weather", "arguments": {"city": "London"}}]
</tool_call><|im_end|>
Irrelevance (No Tool Match)
When no tool matches the user's query, the model responds in plain text without any tool call tags:
<|im_start|>user
Tell me a joke<|im_end|>
<|im_start|>assistant
Sure! Why do programmers prefer dark mode? Because light attracts bugs!<|im_end|>
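The three output shapes above (single call, parallel array, plain-text irrelevance response) can be handled by one small parser. This is a sketch, not part of the model release; it assumes the `<tool_call>` tags appear exactly as in the examples:

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def parse_response(text: str):
    """Return a list of {"name", "arguments"} dicts; an empty list means
    the model answered in plain text (no tool matched the query)."""
    m = TOOL_CALL_RE.search(text)
    if m is None:
        return []  # irrelevance case: plain-text reply, no tags
    data = json.loads(m.group(1))
    # Normalize: a single call is a dict, parallel calls are a JSON array
    return data if isinstance(data, list) else [data]

single = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Tokyo"}}\n</tool_call>'
parallel = ('<tool_call>\n[{"name": "get_weather", "arguments": {"city": "Tokyo"}}, '
            '{"name": "get_weather", "arguments": {"city": "London"}}]\n</tool_call>')
print(parse_response(single))    # one call
print(parse_response(parallel))  # two calls
print(parse_response("Sure! Why do programmers prefer dark mode?"))  # []
```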
How to Use
With Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import json
model_id = "nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_id, torch_dtype="bfloat16", device_map="auto", trust_remote_code=True
)
tools = [{"name": "get_weather", "description": "Get weather for a city",
"parameters": {"type": "object", "properties": {"city": {"type": "string"}},
"required": ["city"]}}]
system_prompt = (
"You are a function calling AI assistant. You are provided with function "
"signatures within <tools></tools> XML tags. You may call one or more functions "
"to assist with the user query. Don't make assumptions about what values to "
"plug into functions.\n\n<tools>\n" + json.dumps(tools) + "\n</tools>"
)
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": "What's the weather in Tokyo and London?"},
]
input_ids = tokenizer.apply_chat_template(
messages, add_generation_prompt=True, return_tensors="pt"
)
input_ids = input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=False))
With llama.cpp
GGUF quantizations are available at nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0-GGUF.
# Download the Q4_K_M quantization
huggingface-cli download nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0-GGUF \
Nexus-TinyFunction-1.2B-v2.0-q4_k_m.gguf --local-dir .
# Run server with function calling support
./llama-server \
--model Nexus-TinyFunction-1.2B-v2.0-q4_k_m.gguf \
--jinja \
--ctx-size 4096 \
--port 8080
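llama-server exposes an OpenAI-compatible `/v1/chat/completions` endpoint. Since this fine-tune expects tools embedded in the system message (not passed via a `tools` field), the request can be built as below. This is a sketch using only the standard library; the host, port, and model name are assumptions matching the server command above:

```python
import json
import urllib.request

tools = [{"name": "get_weather", "description": "Get weather for a city",
          "parameters": {"type": "object",
                         "properties": {"city": {"type": "string"}},
                         "required": ["city"]}}]

# Same system prompt as the Transformers example: tools go inside <tools> tags
system_prompt = (
    "You are a function calling AI assistant. You are provided with function "
    "signatures within <tools></tools> XML tags. You may call one or more functions "
    "to assist with the user query. Don't make assumptions about what values to "
    "plug into functions.\n\n<tools>\n" + json.dumps(tools) + "\n</tools>"
)

payload = {
    "model": "Nexus-TinyFunction-1.2B-v2.0",
    "temperature": 0,
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What's the weather in Tokyo?"},
    ],
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # uncomment with llama-server running
# print(json.load(resp)["choices"][0]["message"]["content"])
```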
With Ollama
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./Nexus-TinyFunction-1.2B-v2.0-q4_k_m.gguf
PARAMETER temperature 0
PARAMETER num_ctx 4096
EOF
ollama create nexus-tinyfunction-1.2b-v2.0 -f Modelfile
ollama run nexus-tinyfunction-1.2b-v2.0
Training Details
Method
QLoRA Supervised Fine-Tuning (SFT) with 3-stage curriculum learning, followed by Direct Preference Optimization (DPO). Trained using Unsloth + TRL SFTTrainer.
Training Data
~38,500 curated examples from public datasets and synthetic augmentation:
| Dataset | Examples | Purpose |
|---|---|---|
| Public function-calling datasets | ~16,500 | General function calling and irrelevance detection |
| Synthetic (BFCL-derived + augmented) | ~22,000 | Edge cases, curriculum labels |
| Total | ~38,500 | |
Hyperparameters
| Parameter | Value |
|---|---|
| LoRA rank (r) | 128 |
| LoRA alpha | 128 |
| Target modules | q_proj, k_proj, v_proj, out_proj, in_proj, w1, w2, w3 |
| Effective batch size | 32 (2 x 16 gradient accumulation) |
| Learning rate (SFT) | 2e-5 (cosine with 10% warmup) |
| Curriculum stages | 3 (foundation / disambiguation / adversarial) |
| DPO beta | 0.1 |
| DPO learning rate | 1e-6 |
| Precision | bf16 |
| Packing | Enabled |
| Hardware | Single NVIDIA RTX 5090 (32 GB) — all training, quantization, and benchmarks |
GGUF Quantizations
Quantized versions for llama.cpp, Ollama, and LM Studio are available at nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0-GGUF:
| Quantization | Size | Description |
|---|---|---|
| Q8_0 | ~1.2 GB | Near-lossless, highest quality |
| Q4_K_M | ~698 MB | Good quality/size balance, recommended for constrained hardware |
Limitations
- Parallel function calling is the weakest dimension — the model sometimes drops or merges parallel calls
- Argument extraction for complex nested objects and optional parameters can be imprecise
- English only — trained exclusively on English data
- Context length — quality may degrade with very long tool lists near the 32K limit
- Not suitable for safety-critical, medical, legal, or financial applications
- Fine-tuning contributions licensed under Apache 2.0; base model weights remain subject to the LFM Open License v1.0 ($10M annual revenue commercial use threshold)
Citation
@misc{Nexus_TinyFunction_1_2B_v2_0,
title = {Nexus-TinyFunction-1.2B-v2.0: Function Calling Fine-Tune of LFM2.5-1.2B},
author = {Nexus-Syntegra},
year = {2026},
url = {https://huggingface.co/nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0},
note = {Fine-tuned from LiquidAI/LFM2.5-1.2B-Instruct for function calling}
}
Evaluation results (BFCL V4, self-reported)
| Category | Score |
|---|---|
| Simple Function Calling | 94.25 |
| Multiple Function Calling | 91.50 |
| Parallel Function Calling | 81.50 |
| Parallel Multiple | 78.50 |
| Irrelevance Detection | 80.42 |