🔧 LFM2-350M-ToolLLaMA



A 354M parameter ReAct agent for Tool Calling, fine-tuned on ToolBench


This model is a full SFT of Liquid AI's LFM2-350M on the ToolBench dataset (187,494 training examples), trained to perform Tool Calling in the ReAct (Thought/Action/Action Input) format.

It reproduces the experiment from "Small Language Models for Efficient Agentic Tool Calling" (arXiv:2512.15943v1) using the LFM2 architecture (LNN-based).


Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "gyung/LFM2-350M-ToolLLaMA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

system_prompt = """You are AutoGPT, you can use many tools(functions) to do the following task.
At each step, you need to give your thought to analyze the status now and what to do next, with a function call to actually execute your step.
Your output should follow this format:
Thought:
Action:
Action Input:"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "What is the current weather in Seoul, South Korea?"},
]

inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.1)

result = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
print(result)

Actual Output Example:

Thought: To find the current weather in Seoul, South Korea, I need to use the "weather" API.
I will call the "weather" function with the argument "location" set to "Seoul, South Korea"
to retrieve the weather information for that location.
Action: weather
Action Input: {"location": "Seoul, South Korea"}
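A ReAct completion like the one above can be parsed into a structured tool call with a few regular expressions. This is a sketch only: parse_react is an illustrative helper, not part of the model repo.

```python
import json
import re

def parse_react(text: str) -> dict:
    """Split a ReAct completion into thought, action, and parsed Action Input."""
    # Thought: everything between "Thought:" and the first "Action:" line
    thought = re.search(r"Thought:\s*(.*?)\s*Action:", text, re.DOTALL)
    action = re.search(r"Action:\s*([^\n]+)", text)
    action_input = re.search(r"Action Input:\s*(\{.*\})", text, re.DOTALL)
    return {
        "thought": thought.group(1).strip() if thought else None,
        "action": action.group(1).strip() if action else None,
        "action_input": json.loads(action_input.group(1)) if action_input else None,
    }

output = (
    'Thought: To find the current weather in Seoul, South Korea, I need to use the "weather" API.\n'
    "Action: weather\n"
    'Action Input: {"location": "Seoul, South Korea"}'
)
call = parse_react(output)
# call["action"] is "weather"; call["action_input"]["location"] is "Seoul, South Korea"
```

In a full agent loop, the parsed action name and arguments would then be dispatched to the matching tool.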

Model Details

| Item | Details |
|---|---|
| Base Model | LiquidAI/LFM2-350M (LNN-based) |
| Parameters | 354M |
| Training Data | gyung/toolbench-lfm-chatml (187,494 examples) |
| Output Format | ReAct — Thought: / Action: / Action Input: |
| Training Method | Full SFT (TRL SFTTrainer) |
| Precision | BF16 |
| Training Hardware | NVIDIA H100 80GB × 1 (VESSL AI) |

Training Configuration

Hyperparameters reproduced from arXiv:2512.15943v1.

SFTConfig(
    output_dir="./sft-output",

    # === Training Parameters (Paper-based) ===
    num_train_epochs=1,
    per_device_train_batch_size=8,       # H100 80GB: 350M BF16 fits 16 easily
    gradient_accumulation_steps=4,       # effective batch size = 8 × 4 = 32
    gradient_checkpointing=True,

    # === Optimization (Paper Settings) ===
    learning_rate=5e-5,                  # Paper: 5×10⁻⁵
    lr_scheduler_type="cosine",
    warmup_steps=100,                    # Paper: 100 warmup steps
    max_grad_norm=0.3,                   # Paper: aggressive clipping
    weight_decay=0.01,                   # Paper: AdamW 0.01
    optim="adamw_torch",

    # === Precision (H100: native BF16) ===
    bf16=True,
    fp16=False,

    # === Sequence Settings ===
    max_length=8192,                     # Paper: 8192 tokens

    # === Checkpointing + Hub Upload ===
    # Total ~5,800 steps → save every 500 → ~12 checkpoints
    logging_steps=50,
    save_strategy="steps",
    save_steps=500,                      # Frequent saves (fault tolerance + perf comparison)
    save_total_limit=5,

    # === HuggingFace Hub Upload ===
    push_to_hub=True,
    hub_model_id="gyung/LFM2-350M-ToolLLaMA",
    hub_strategy="every_save",           # Upload to HF at every checkpoint

    # === Dataset Settings ===
    dataset_text_field="messages",
    packing=False,
    report_to="none",
)
  • Total Training Steps: 5,891 (one epoch over 187,494 examples at effective batch size 32)
  • Training Time: 4h 4m 14s (H100 80GB × 1)
  • Final Loss: 0.1668
  • Checkpoints: Auto-uploaded to HuggingFace every 500 steps

Training Loss Curve

| Step | Loss | Step | Loss | Step | Loss |
|---|---|---|---|---|---|
| 50 | 1.5385 | 500 | 0.3496 | 2000 | 0.1959 |
| 100 | 0.6595 | 1000 | 0.2441 | 3000 | 0.1717 |
| 200 | 0.4875 | 1500 | 0.2223 | 4000 | 0.1746 |
| 300 | 0.4343 | 1800 | 0.2081 | 5000 | 0.1743 |
|  |  |  |  | 5850 | 0.1668 |

Rapid loss drop in early training, stabilizing after ~1000 steps. Final loss: 0.1668.

Training Statistics

| Item | Value |
|---|---|
| Peak VRAM | 53.73 GB / 80 GB |
| Avg. Throughput | 12.86 samples/s, 0.40 steps/s |
| Avg. Train Loss | 0.2236 |
| Mean Token Accuracy | 95.64% |
| Total Tokens Trained | ~500M tokens |
| Total FLOPs | 1.42 × 10¹⁸ |

Framework Versions

  • TRL: 0.20.0
  • Transformers: 4.54.0
  • PyTorch: 2.9.1+cu128
  • Datasets: 4.6.0
  • Tokenizers: 0.21.4

Final Model Test

1) Free generation (no tool list)

| Query | Thought | Action | Action Input | Rating |
|---|---|---|---|---|
| "What is the current weather in Seoul?" | ✅ Logical | weather | {"location": "Seoul, South Korea"} | ⭐⭐⭐⭐⭐ |
| "Search for the latest news about AI" | ✅ Logical | news_api | Appropriate params | ⭐⭐⭐⭐ |
| "Find info about movie 'Inception'" | ✅ Logical | search | {"query": "Inception"} | ⭐⭐⭐⭐ |

Generic tool names are expected since no tool list was given. Format and reasoning are perfect.

2) With tool list in prompt — 4/4 (100%)

When available tools are specified in the system prompt (ToolBench style):

| Query | Expected Action | Predicted Action | Match |
|---|---|---|---|
| Current weather in Seoul | get_current_weather_for_weatherapi | get_current_weather_for_weatherapi | ✅ |
| Latest news about AI | search_news_for_newsapi | search_news_for_newsapi | ✅ |
| Movie info for 'Inception' | search_movie_for_imdb | search_movie_for_imdb | ✅ |
| Ultimate Oscillator BTCUSDT | ultimateoscillator_for_qvantana | ultimateoscillator_for_qvantana | ✅ |
📝 Actual response example:

Query: "Get the Ultimate Oscillator data for BTCUSDT on Binance with daily interval."

Thought: I need to call the 'ultimateoscillator_for_qvantana' function to get the
Ultimate Oscillator data for BTCUSDT on Binance with a daily interval. The current
arguments for the function call are: exchange - "Binance", market - "BTCUSDT",
and interval - "1d". This will allow me to retrieve the necessary data for analysis.
Action: ultimateoscillator_for_qvantana
Action Input: {
  "exchange": "Binance",
  "market": "BTCUSDT",
  "interval": "1d"
}

💡 100% accuracy with 2–3 clear tool choices. For complex scenarios (10+ tools), ToolBench eval shows ~55% Action Match Rate.


Evaluation Results

Evaluation Setup

  • Hardware: NVIDIA A100-SXM4-80GB
  • Eval Data: gyung/toolbench-lfm-chatml (eval split, 762 samples)
  • Generation Settings: max_new_tokens=512, temperature=0.1

Metrics

| Metric | Description |
|---|---|
| Completion Rate | Share of responses with meaningful content (at least 10 characters) |
| Format Accuracy | Share of outputs containing the ReAct format (Thought: + Action:) |
| Action Match Rate | Share of outputs whose Action name exactly matches the ground truth |
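Under these definitions, each generation can be scored with a few lines of string handling. The sketch below mirrors the metric descriptions above; score_sample and its return fields are illustrative names, not taken from the evaluation scripts.

```python
import re

def score_sample(prediction: str, gold_action: str) -> dict:
    """Score one generation against the three reported metrics."""
    completed = len(prediction.strip()) >= 10  # Completion Rate criterion
    formatted = ("Thought:" in prediction) and ("Action:" in prediction)  # Format Accuracy
    match = re.search(r"Action:\s*([^\n]+)", prediction)
    predicted_action = match.group(1).strip() if match else ""
    action_match = predicted_action == gold_action  # strict exact string match
    return {"completed": completed, "formatted": formatted, "action_match": action_match}

sample = (
    "Thought: use the weather tool.\n"
    "Action: get_current_weather_for_weatherapi\n"
    "Action Input: {}"
)
print(score_sample(sample, "get_current_weather_for_weatherapi"))
# → {'completed': True, 'formatted': True, 'action_match': True}
```

Averaging each field over the 762 eval samples yields the percentages reported in the comparison table.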

Results Comparison

| Model | Params | Completion | Format Acc. | Action Match |
|---|---|---|---|---|
| LFM2-350M (Base) | 354M | 100% | 62.7% | 0.0% |
| LFM2-350M-ToolLLaMA (1K step) | 354M | 100% | 100.0% | 55.1% |
| LFM2-350M-ToolLLaMA (2.5K step) | 354M | 100% | 100.0% | 54.1% |
| LFM2-350M-ToolLLaMA (5.9K step, Final) | 354M | 100% | 99.9% | 56.3% |
| LFM2-1.2B-Tool (Liquid AI Official) | 1.2B | 100% | 97.9% | 0.0% |
| GPT-5-Nano (OpenAI, no ToolBench training) | — | 100%* | 78.6%* | 8.5%* |
| ToolLLaMA-2-7b-v2 (Paper Original) | 6.7B | 100% | 99.7% | 75.3% |

Key Insights

  1. SFT Effect: Base 0.0% → Final 56.3% Action Match Rate, Format Accuracy 99.9%
  2. Best checkpoint: 1K(55.1%) → 2.5K(54.1%) → 5.9K(56.3%) — final model achieves the highest score
  3. vs LFM2-1.2B-Tool: Outperforms 3.5× larger official model (0.0%) — importance of training data format alignment
  4. vs ToolLLaMA-7B: 56.3% vs 75.3% with a 20× smaller model (74.8% of 7B performance)

Difference from Paper's Evaluation

The paper's reported OPT-350M Pass Rate of 77.55% and this model's Action Match Rate are not directly comparable.

| Aspect | Paper (ToolEval) | This Project |
|---|---|---|
| Method | ChatGPT-based judge (4 rounds + majority vote) | Exact string match |
| Criteria | Solution-path adequacy (Pass/Fail) | Exact Action name match |
| Test Set | 1,100 queries (G1–G3, 6 categories) | 762 samples (eval split) |
| Execution | Multi-step reasoning (up to 10 iterations) | Single-turn generation |

💡 The paper's ToolEval evaluates the entire multi-turn tool-calling process via ChatGPT judge. This project's Action Match Rate only measures exact first-turn tool name match, making it a more conservative (stricter) metric.

LLM-as-Judge Evaluation (ToolEval Style)

For evaluation closer to the paper's ToolEval, a GPT-based judge is implemented in eval_judge.py:

# Single round (quick test)
python eval_judge.py --input results/eval_results.json --rounds 1

# Paper reproduction (4-round majority voting)
python eval_judge.py --input results/eval_results.json --rounds 4

This judge evaluates logical appropriateness of tool selection rather than exact string matching.
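The 4-round majority vote reduces to a small aggregation helper. The sketch below shows only the voting step, assuming each round's judge verdict has already been normalized to "pass"/"fail"; the actual judge prompt and API call in eval_judge.py are omitted.

```python
from collections import Counter

def majority_vote(verdicts: list[str]) -> str:
    """Aggregate per-round judge verdicts ('pass'/'fail') into a final decision.

    Ties are broken conservatively in favor of 'fail'.
    """
    counts = Counter(verdicts)
    return "pass" if counts["pass"] > counts["fail"] else "fail"

print(majority_vote(["pass", "fail", "pass", "pass"]))  # → pass
print(majority_vote(["pass", "fail", "pass", "fail"]))  # → fail (tie, conservative)
```

With --rounds 4, an even number of rounds makes ties possible, so the tie-breaking rule matters for reproducibility.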


Data Format

The model was trained on ChatML-formatted conversation data.

{
  "messages": [
    {"role": "system", "content": "You are AutoGPT, you can use many tools(functions) to do the following task..."},
    {"role": "user", "content": "User query..."},
    {"role": "assistant", "content": "Thought: ...\nAction: tool_name\nAction Input: {\"param\": \"value\"}"},
    {"role": "tool", "content": "{\"response\": ...}"}
  ]
}

The original ToolBench "function" role was converted to "tool" for LFM2 ChatML compatibility.
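That conversion amounts to a single pass over each conversation. A minimal sketch (convert_roles is an illustrative name; the dataset's actual preprocessing script may differ):

```python
def convert_roles(example: dict) -> dict:
    """Rename ToolBench's 'function' role to 'tool' for LFM2's ChatML template."""
    for message in example["messages"]:
        if message["role"] == "function":
            message["role"] = "tool"
    return example

example = {"messages": [
    {"role": "user", "content": "What's the weather?"},
    {"role": "function", "content": '{"response": "sunny"}'},
]}
print(convert_roles(example)["messages"][1]["role"])  # → tool
```

With 🤗 Datasets, such a function would typically be applied per example via dataset.map(convert_roles).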


Practical Deployment Strategy

To maximize the 354M model's strengths (fast inference, low cost, edge deployment), a router + selective tool injection pipeline is most effective.

User Query → [Router: Intent Classification] → Inject 2-3 relevant tools → LFM2-350M → ~100% accuracy
| Comparison | All tools injected (ToolBench style) | Router + selective injection |
|---|---|---|
| Tools in prompt | 10–50 | 2–3 |
| Action Match | ~56% | ~100% |
| Prompt length | Thousands of tokens | Hundreds of tokens |
| Inference speed | Slow | Fast |
# Stage 1: Router (an encoder model or embedding similarity in production)
TOOL_CATEGORIES = {
    "weather": ["get_current_weather_for_weatherapi", "get_forecast_for_weatherapi"],
    "finance": ["ultimateoscillator_for_qvantana", "typicalprice_for_qvantana"],
    "news":    ["search_news_for_newsapi", "get_top_headlines_for_newsapi"],
}

def classify_query(query: str) -> str:
    # Minimal keyword router for illustration; swap in BERT or embedding cosine similarity
    keywords = {"weather": ("weather", "forecast"), "finance": ("btc", "price", "oscillator")}
    q = query.lower()
    for category, words in keywords.items():
        if any(w in q for w in words):
            return category
    return "news"  # fallback bucket

def build_prompt(tools: list[str], query: str) -> str:
    return "Available tools:\n" + "\n".join(f"- {t}" for t in tools) + f"\n\nQuery: {query}"

# Stage 2: Inject only the selected tools → LFM2-350M → near-100% Action Match
category = classify_query(user_query)   # e.g. "weather"
tools = TOOL_CATEGORIES[category]       # only 2-3 tools
prompt = build_prompt(tools, user_query)
response = model.generate(prompt)       # illustrative; use the Quick Start code for real generation

This is the practical meaning of "Small Language Models for Efficient Agentic Tool Calling": small model + smart routing = faster, cheaper, and more accurate than a single large model.


Limitations

  • 📊 Evaluation metric limitation — Action Match Rate uses strict string matching; semantically equivalent tools predicted under different names count as failures
  • 🧠 Single-turn evaluation — Multi-turn agent scenarios (receiving tool results and reasoning again) not yet evaluated
  • 🏗️ Architecture difference — Paper uses OPT-350M (Transformer), this model uses LFM2-350M (LNN-based); direct architecture comparison not possible

Potential Improvements: Length-Bucketed Training

The current training uses a uniform max_length=8192 for all samples. However, most ToolBench data falls in the 1K–4K token range, meaning significant compute is wasted on padding. Bucketing by sequence length with different batch sizes could substantially improve training speed.

| Bucket | Context Length | Batch Size | Effect |
|---|---|---|---|
| Short data | ~1024 | 64–128 | Minimal padding waste |
| Medium data | ~2048–4096 | 16–32 | Covers the majority of data |
| Long data | ~8192 | 8 | Same as current |
  • Speed: Could reduce total training time by 50–60% (4h → ~1.5–2h)
  • Performance: Reduced padding leads to more stable gradient estimation — expect equal or slightly better performance
  • Simple alternative: TRL's packing=True — concatenates multiple short sequences into one, eliminating padding waste
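The bucketing rule in the table above can be sketched as a simple threshold function (thresholds and batch sizes are taken from the table; assign_bucket itself is illustrative, not from the training code):

```python
def assign_bucket(num_tokens: int) -> tuple[int, int]:
    """Return (context_length, batch_size) for a sample's token count."""
    if num_tokens <= 1024:
        return 1024, 128   # short: large batches, minimal padding
    if num_tokens <= 4096:
        return 4096, 32    # medium: covers most ToolBench samples
    return 8192, 8         # long: same as the current uniform setting

lengths = [512, 3000, 7000]
print([assign_bucket(n) for n in lengths])
# → [(1024, 128), (4096, 32), (8192, 8)]
```

Grouping samples by the returned bucket before batching keeps each batch's padding overhead close to zero.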

Reproduction

Prerequisites

  1. HuggingFace Token (Write permission) — Get token
  2. GPU Environment — NVIDIA H100/A100 80GB recommended

Training

# Run on VESSL AI
vessl run create -f vessl_notebook.yaml
# → Upload notebooks/01_train_trl.ipynb in Jupyter and run cells sequentially

Evaluation

# On the same VESSL AI instance
# → Upload notebooks/03_evaluate.ipynb and run cells sequentially


Citation

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
@article{qin2023toolllm,
    title   = {ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs},
    author  = {Qin, Yujia and others},
    journal = {arXiv preprint arXiv:2307.16789},
    year    = {2023}
}
@article{liquidai2025lfm2,
    title   = {LFM2 Technical Report},
    author  = {Liquid AI},
    journal = {arXiv preprint arXiv:2511.23404},
    year    = {2025}
}