🔧 LFM2-350M-ToolLLaMA



A 354M parameter ReAct agent for Tool Calling, fine-tuned on ToolBench


This model is a full SFT of Liquid AI's LFM2-350M on the ToolBench dataset (187,494 training examples), trained to perform Tool Calling in the ReAct (Thought/Action/Action Input) format.

It reproduces the experiment from "Small Language Models for Efficient Agentic Tool Calling" (arXiv:2512.15943v1) using the LFM2 architecture (LNN-based).


Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "gyung/LFM2-350M-ToolLLaMA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

system_prompt = """You are AutoGPT, you can use many tools(functions) to do the following task.
At each step, you need to give your thought to analyze the status now and what to do next, with a function call to actually execute your step.
Your output should follow this format:
Thought:
Action:
Action Input:"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "What is the current weather in Seoul, South Korea?"},
]

inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.1)

result = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
print(result)

Actual Output Example:

Thought: To find the current weather in Seoul, South Korea, I need to use the "weather" API.
I will call the "weather" function with the argument "location" set to "Seoul, South Korea"
to retrieve the weather information for that location.
Action: weather
Action Input: {"location": "Seoul, South Korea"}
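A ReAct completion like the one above can be parsed into a structured tool call with a few regular expressions. This is a sketch only: parse_react is an illustrative helper, not part of the model repo.

```python
import json
import re

def parse_react(text: str) -> dict:
    """Split a ReAct completion into thought, action, and parsed Action Input."""
    # Thought: everything between "Thought:" and the first "Action:" line
    thought = re.search(r"Thought:\s*(.*?)\s*Action:", text, re.DOTALL)
    action = re.search(r"Action:\s*([^\n]+)", text)
    action_input = re.search(r"Action Input:\s*(\{.*\})", text, re.DOTALL)
    return {
        "thought": thought.group(1).strip() if thought else None,
        "action": action.group(1).strip() if action else None,
        "action_input": json.loads(action_input.group(1)) if action_input else None,
    }

output = (
    'Thought: To find the current weather in Seoul, South Korea, I need to use the "weather" API.\n'
    "Action: weather\n"
    'Action Input: {"location": "Seoul, South Korea"}'
)
call = parse_react(output)
# call["action"] is "weather"; call["action_input"]["location"] is "Seoul, South Korea"
```

In a full agent loop, the parsed action name and arguments would then be dispatched to the matching tool.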

Model Details

| Item | Details |
|---|---|
| Base Model | LiquidAI/LFM2-350M (LNN-based) |
| Parameters | 354M |
| Training Data | gyung/toolbench-lfm-chatml (187,494 examples) |
| Output Format | ReAct — Thought: / Action: / Action Input: |
| Training Method | Full SFT (TRL SFTTrainer) |
| Precision | BF16 |
| Training Hardware | NVIDIA H100 80GB × 1 (VESSL AI) |

Training Configuration

Hyperparameters reproduced from arXiv:2512.15943v1.

SFTConfig(
    output_dir="./sft-output",

    # === Training Parameters (Paper-based) ===
    num_train_epochs=1,
    per_device_train_batch_size=8,       # H100 80GB: 350M BF16 fits 16 easily
    gradient_accumulation_steps=4,       # effective batch size = 8 × 4 = 32
    gradient_checkpointing=True,

    # === Optimization (Paper Settings) ===
    learning_rate=5e-5,                  # Paper: 5×10⁻⁵
    lr_scheduler_type="cosine",
    warmup_steps=100,                    # Paper: 100 warmup steps
    max_grad_norm=0.3,                   # Paper: aggressive clipping
    weight_decay=0.01,                   # Paper: AdamW 0.01
    optim="adamw_torch",

    # === Precision (H100: native BF16) ===
    bf16=True,
    fp16=False,

    # === Sequence Settings ===
    max_length=8192,                     # Paper: 8192 tokens

    # === Checkpointing + Hub Upload ===
    # Total ~5,800 steps → save every 500 → ~12 checkpoints
    logging_steps=50,
    save_strategy="steps",
    save_steps=500,                      # Frequent saves (fault tolerance + perf comparison)
    save_total_limit=5,

    # === HuggingFace Hub Upload ===
    push_to_hub=True,
    hub_model_id="gyung/LFM2-350M-ToolLLaMA",
    hub_strategy="every_save",           # Upload to HF at every checkpoint

    # === Dataset Settings ===
    dataset_text_field="messages",
    packing=False,
    report_to="none",
)
  • Total Training Steps: 5,891 (one epoch over 187,494 examples at effective batch size 32)
  • Training Time: 4h 4m 14s (H100 80GB × 1)
  • Final Loss: 0.1668
  • Checkpoints: Auto-uploaded to HuggingFace every 500 steps

Training Loss Curve

| Step | Loss | Step | Loss | Step | Loss |
|---|---|---|---|---|---|
| 50 | 1.5385 | 500 | 0.3496 | 2000 | 0.1959 |
| 100 | 0.6595 | 1000 | 0.2441 | 3000 | 0.1717 |
| 200 | 0.4875 | 1500 | 0.2223 | 4000 | 0.1746 |
| 300 | 0.4343 | 1800 | 0.2081 | 5000 | 0.1743 |
|  |  |  |  | 5850 | 0.1668 |

Rapid loss drop in early training, stabilizing after ~1000 steps. Final loss: 0.1668.

Training Statistics

| Item | Value |
|---|---|
| Peak VRAM | 53.73 GB / 80 GB |
| Avg. Throughput | 12.86 samples/s, 0.40 steps/s |
| Avg. Train Loss | 0.2236 |
| Mean Token Accuracy | 95.64% |
| Total Tokens Trained | ~500M tokens |
| Total FLOPs | 1.42 × 10¹⁸ |

Framework Versions

  • TRL: 0.20.0
  • Transformers: 4.54.0
  • PyTorch: 2.9.1+cu128
  • Datasets: 4.6.0
  • Tokenizers: 0.21.4

Final Model Test

1) Free generation (no tool list)

| Query | Thought | Action | Action Input | Rating |
|---|---|---|---|---|
| "What is the current weather in Seoul?" | ✅ Logical | weather | {"location": "Seoul, South Korea"} | ⭐⭐⭐⭐⭐ |
| "Search for the latest news about AI" | ✅ Logical | news_api | Appropriate params | ⭐⭐⭐⭐ |
| "Find info about movie 'Inception'" | ✅ Logical | search | {"query": "Inception"} | ⭐⭐⭐⭐ |

Generic tool names are expected since no tool list was given. Format and reasoning are perfect.

2) With tool list in prompt — 4/4 (100%)

When available tools are specified in the system prompt (ToolBench style):

| Query | Expected Action | Predicted Action | Match |
|---|---|---|---|
| Current weather in Seoul | get_current_weather_for_weatherapi | get_current_weather_for_weatherapi | ✅ |
| Latest news about AI | search_news_for_newsapi | search_news_for_newsapi | ✅ |
| Movie info for 'Inception' | search_movie_for_imdb | search_movie_for_imdb | ✅ |
| Ultimate Oscillator BTCUSDT | ultimateoscillator_for_qvantana | ultimateoscillator_for_qvantana | ✅ |
📝 Actual response example:

Query: "Get the Ultimate Oscillator data for BTCUSDT on Binance with daily interval."

Thought: I need to call the 'ultimateoscillator_for_qvantana' function to get the
Ultimate Oscillator data for BTCUSDT on Binance with a daily interval. The current
arguments for the function call are: exchange - "Binance", market - "BTCUSDT",
and interval - "1d". This will allow me to retrieve the necessary data for analysis.
Action: ultimateoscillator_for_qvantana
Action Input: {
  "exchange": "Binance",
  "market": "BTCUSDT",
  "interval": "1d"
}

💡 100% accuracy with 2–3 clear tool choices. For complex scenarios (10+ tools), ToolBench eval shows ~55% Action Match Rate.


Evaluation Results

Evaluation Setup

  • Hardware: NVIDIA A100-SXM4-80GB
  • Eval Data: gyung/toolbench-lfm-chatml (eval split, 762 samples)
  • Generation Settings: max_new_tokens=512, temperature=0.1

Metrics

| Metric | Description |
|---|---|
| Completion Rate | Share of responses with meaningful content (at least 10 characters) |
| Format Accuracy | Share of outputs containing the ReAct format (Thought: + Action:) |
| Action Match Rate | Share of outputs whose Action name exactly matches the ground truth |
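Under these definitions, each generation can be scored with a few lines of string handling. The sketch below mirrors the metric descriptions above; score_sample and its return fields are illustrative names, not taken from the evaluation scripts.

```python
import re

def score_sample(prediction: str, gold_action: str) -> dict:
    """Score one generation against the three reported metrics."""
    completed = len(prediction.strip()) >= 10  # Completion Rate criterion
    formatted = ("Thought:" in prediction) and ("Action:" in prediction)  # Format Accuracy
    match = re.search(r"Action:\s*([^\n]+)", prediction)
    predicted_action = match.group(1).strip() if match else ""
    action_match = predicted_action == gold_action  # strict exact string match
    return {"completed": completed, "formatted": formatted, "action_match": action_match}

sample = (
    "Thought: use the weather tool.\n"
    "Action: get_current_weather_for_weatherapi\n"
    "Action Input: {}"
)
print(score_sample(sample, "get_current_weather_for_weatherapi"))
# → {'completed': True, 'formatted': True, 'action_match': True}
```

Averaging each field over the 762 eval samples yields the percentages reported in the comparison table.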

Results Comparison

| Model | Params | Completion | Format Acc. | Action Match |
|---|---|---|---|---|
| LFM2-350M (Base) | 354M | 100% | 62.7% | 0.0% |
| LFM2-350M-ToolLLaMA (1K step) | 354M | 100% | 100.0% | 55.1% |
| LFM2-350M-ToolLLaMA (2.5K step) | 354M | 100% | 100.0% | 54.1% |
| LFM2-350M-ToolLLaMA (5.9K step, Final) | 354M | 100% | 99.9% | 56.3% |
| LFM2-1.2B-Tool (Liquid AI Official) | 1.2B | 100% | 97.9% | 0.0% |
| GPT-5-Nano (OpenAI, no ToolBench training) | — | 100%* | 78.6%* | 8.5%* |
| ToolLLaMA-2-7b-v2 (Paper Original) | 6.7B | 100% | 99.7% | 75.3% |

Key Insights

  1. SFT Effect: Base 0.0% → Final 56.3% Action Match Rate, Format Accuracy 99.9%
  2. Best checkpoint: 1K(55.1%) → 2.5K(54.1%) → 5.9K(56.3%) — final model achieves the highest score
  3. vs LFM2-1.2B-Tool: Outperforms 3.5× larger official model (0.0%) — importance of training data format alignment
  4. vs ToolLLaMA-7B: 56.3% vs 75.3% with a 20× smaller model (74.8% of 7B performance)

Difference from Paper's Evaluation

The paper's reported OPT-350M Pass Rate of 77.55% and this model's Action Match Rate are not directly comparable.

| Aspect | Paper (ToolEval) | This Project |
|---|---|---|
| Method | ChatGPT-based judge (4 rounds + majority vote) | Exact string match |
| Criteria | Solution-path adequacy (Pass/Fail) | Exact Action name match |
| Test Set | 1,100 queries (G1–G3, 6 categories) | 762 samples (eval split) |
| Execution | Multi-step reasoning (up to 10 iterations) | Single-turn generation |

💡 The paper's ToolEval evaluates the entire multi-turn tool-calling process via ChatGPT judge. This project's Action Match Rate only measures exact first-turn tool name match, making it a more conservative (stricter) metric.

LLM-as-Judge Evaluation (ToolEval Style)

For evaluation closer to the paper's ToolEval, a GPT-based judge is implemented in eval_judge.py:

# Single round (quick test)
python eval_judge.py --input results/eval_results.json --rounds 1

# Paper reproduction (4-round majority voting)
python eval_judge.py --input results/eval_results.json --rounds 4

This judge evaluates logical appropriateness of tool selection rather than exact string matching.
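The 4-round majority vote reduces to a small aggregation helper. The sketch below shows only the voting step, assuming each round's judge verdict has already been normalized to "pass"/"fail"; the actual judge prompt and API call in eval_judge.py are omitted.

```python
from collections import Counter

def majority_vote(verdicts: list[str]) -> str:
    """Aggregate per-round judge verdicts ('pass'/'fail') into a final decision.

    Ties are broken conservatively in favor of 'fail'.
    """
    counts = Counter(verdicts)
    return "pass" if counts["pass"] > counts["fail"] else "fail"

print(majority_vote(["pass", "fail", "pass", "pass"]))  # → pass
print(majority_vote(["pass", "fail", "pass", "fail"]))  # → fail (tie, conservative)
```

With --rounds 4, an even number of rounds makes ties possible, so the tie-breaking rule matters for reproducibility.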


Data Format

The model was trained on ChatML-formatted conversation data.

{
  "messages": [
    {"role": "system", "content": "You are AutoGPT, you can use many tools(functions) to do the following task..."},
    {"role": "user", "content": "User query..."},
    {"role": "assistant", "content": "Thought: ...\nAction: tool_name\nAction Input: {\"param\": \"value\"}"},
    {"role": "tool", "content": "{\"response\": ...}"}
  ]
}

The original ToolBench "function" role was converted to "tool" for LFM2 ChatML compatibility.
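That conversion amounts to a single pass over each conversation. A minimal sketch (convert_roles is an illustrative name; the dataset's actual preprocessing script may differ):

```python
def convert_roles(example: dict) -> dict:
    """Rename ToolBench's 'function' role to 'tool' for LFM2's ChatML template."""
    for message in example["messages"]:
        if message["role"] == "function":
            message["role"] = "tool"
    return example

example = {"messages": [
    {"role": "user", "content": "What's the weather?"},
    {"role": "function", "content": '{"response": "sunny"}'},
]}
print(convert_roles(example)["messages"][1]["role"])  # → tool
```

With 🤗 Datasets, such a function would typically be applied per example via dataset.map(convert_roles).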


Practical Deployment Strategy

To maximize the 354M model's strengths (fast inference, low cost, edge deployment), a router + selective tool injection pipeline is most effective.

User Query → [Router: Intent Classification] → Inject 2-3 relevant tools → LFM2-350M → ~100% accuracy
| Comparison | All tools injected (ToolBench style) | Router + selective injection |
|---|---|---|
| Tools in prompt | 10–50 | 2–3 |
| Action Match | ~56% | ~100% |
| Prompt length | Thousands of tokens | Hundreds of tokens |
| Inference speed | Slow | Fast |
# Stage 1: Router (an encoder model or embedding similarity in production)
TOOL_CATEGORIES = {
    "weather": ["get_current_weather_for_weatherapi", "get_forecast_for_weatherapi"],
    "finance": ["ultimateoscillator_for_qvantana", "typicalprice_for_qvantana"],
    "news":    ["search_news_for_newsapi", "get_top_headlines_for_newsapi"],
}

def classify_query(query: str) -> str:
    # Minimal keyword router for illustration; swap in BERT or embedding cosine similarity
    keywords = {"weather": ("weather", "forecast"), "finance": ("btc", "price", "oscillator")}
    q = query.lower()
    for category, words in keywords.items():
        if any(w in q for w in words):
            return category
    return "news"  # fallback bucket

def build_prompt(tools: list[str], query: str) -> str:
    return "Available tools:\n" + "\n".join(f"- {t}" for t in tools) + f"\n\nQuery: {query}"

# Stage 2: Inject only the selected tools → LFM2-350M → near-100% Action Match
category = classify_query(user_query)   # e.g. "weather"
tools = TOOL_CATEGORIES[category]       # only 2-3 tools
prompt = build_prompt(tools, user_query)
response = model.generate(prompt)       # illustrative; use the Quick Start code for real generation

This is the practical meaning of "Small Language Models for Efficient Agentic Tool Calling": small model + smart routing = faster, cheaper, and more accurate than a single large model.


Limitations

  • 📊 Evaluation metric limitation — Action Match Rate uses strict string matching; semantically equivalent tools predicted under different names count as failures
  • 🧠 Single-turn evaluation — Multi-turn agent scenarios (receiving tool results and reasoning again) not yet evaluated
  • 🏗️ Architecture difference — Paper uses OPT-350M (Transformer), this model uses LFM2-350M (LNN-based); direct architecture comparison not possible

Potential Improvements: Length-Bucketed Training

The current training uses a uniform max_length=8192 for all samples. However, most ToolBench data falls in the 1K–4K token range, meaning significant compute is wasted on padding. Bucketing by sequence length with different batch sizes could substantially improve training speed.

| Bucket | Context Length | Batch Size | Effect |
|---|---|---|---|
| Short data | ~1024 | 64–128 | Minimal padding waste |
| Medium data | ~2048–4096 | 16–32 | Covers the majority of data |
| Long data | ~8192 | 8 | Same as current |
  • Speed: Could reduce total training time by 50–60% (4h → ~1.5–2h)
  • Performance: Reduced padding leads to more stable gradient estimation — expect equal or slightly better performance
  • Simple alternative: TRL's packing=True — concatenates multiple short sequences into one, eliminating padding waste
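The bucketing rule in the table above can be sketched as a simple threshold function (thresholds and batch sizes are taken from the table; assign_bucket itself is illustrative, not from the training code):

```python
def assign_bucket(num_tokens: int) -> tuple[int, int]:
    """Return (context_length, batch_size) for a sample's token count."""
    if num_tokens <= 1024:
        return 1024, 128   # short: large batches, minimal padding
    if num_tokens <= 4096:
        return 4096, 32    # medium: covers most ToolBench samples
    return 8192, 8         # long: same as the current uniform setting

lengths = [512, 3000, 7000]
print([assign_bucket(n) for n in lengths])
# → [(1024, 128), (4096, 32), (8192, 8)]
```

Grouping samples by the returned bucket before batching keeps each batch's padding overhead close to zero.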

Reproduction

Prerequisites

  1. HuggingFace Token (Write permission) — Get token
  2. GPU Environment — NVIDIA H100/A100 80GB recommended

Training

# Run on VESSL AI
vessl run create -f vessl_notebook.yaml
# → Upload notebooks/01_train_trl.ipynb in Jupyter and run cells sequentially

Evaluation

# On the same VESSL AI instance
# → Upload notebooks/03_evaluate.ipynb and run cells sequentially


Citation

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
@article{qin2023toolllm,
    title   = {ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs},
    author  = {Qin, Yujia and others},
    journal = {arXiv preprint arXiv:2307.16789},
    year    = {2023}
}
@article{liquidai2025lfm2,
    title   = {LFM2 Technical Report},
    author  = {Liquid AI},
    journal = {arXiv preprint arXiv:2511.23404},
    year    = {2025}
}