LFM2.5-Audio-1.5B — Tool-Aware Fine-Tune (v2)

A full fine-tune of LiquidAI/LFM2.5-Audio-1.5B that handles both turns of a tool-augmented voice flow:

| Turn | Trigger | Behavior |
|---|---|---|
| 1 (acknowledge) | user audio + "Tools available: …" system prompt | Short ack ("setting your alarm now.") then stop |
| 2 (narrate) | same audio + "Known facts you must use…" block injected via set_context() | Speaks the result naturally ("your alarm is set for 7am.") |
| Other classes | (any) | Refusal on missing tool, normal answer on general knowledge, chitchat reply |

v2 is the successor to matbee/lfm2.5-audio-tool-aware-v1, which mastered turn 1 but regressed to always-ack on turn 2 (0/20 narration on injected context). v2 adds the tool_result_speak class to the training mix and lifts narration from 0/20 → 20/20 while improving ack accuracy from 93.3% → 100%.

How turn 2 works

The speech-to-speech (s2s) dispatcher in your voice-assistant pipeline does the following:

# turn 1 — model emits "let me check the weather for you" and stops
# coordinator runs the weather tool, gets "Weather in Tokyo: 72°F, sunny."
await ctrl.audio_node.set_context("Weather in Tokyo: 72°F, sunny.")
# (no reset_history — same session continues)
# turn 2 — model says "it's 72 and sunny in tokyo"

Under the hood, set_context() appends a "Known facts you must use when relevant:\n{result}" block to the system prompt before turn-2 generation. v2 is trained to read that block and produce a natural narration without regenerating an ack.
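The append step can be pictured with a small stand-in class. This is a sketch only: AudioNodeStub and its attributes are hypothetical and are not the pipeline's actual audio node. The only detail taken from the text above is the wording of the appended block; the blank line placed before it is an assumption.

```python
import asyncio
from typing import Optional, Tuple

FACTS_HEADER = "Known facts you must use when relevant:"


class AudioNodeStub:
    """Hypothetical stand-in for the pipeline's audio node (not the real class)."""

    def __init__(self, base_prompt: str) -> None:
        self.base_prompt = base_prompt
        self.context: Optional[str] = None

    async def set_context(self, result: str) -> None:
        # Store the tool result; it is rendered into the prompt on the next turn.
        self.context = result

    def system_prompt(self) -> str:
        # Turn 1: no context has been set, so the prompt is unchanged.
        if self.context is None:
            return self.base_prompt
        # Turn 2: append the facts block after the base prompt.
        return f"{self.base_prompt}\n\n{FACTS_HEADER}\n{self.context}"


async def demo() -> Tuple[str, str]:
    node = AudioNodeStub("Respond with interleaved text and audio.")
    turn1 = node.system_prompt()
    # Same node, no history reset: only the context changes between turns.
    await node.set_context("Weather in Tokyo: 72°F, sunny.")
    turn2 = node.system_prompt()
    return turn1, turn2


turn1_prompt, turn2_prompt = asyncio.run(demo())
print(turn2_prompt)
```

The stub keeps the base prompt and the tool result separate, so the turn-1 prompt stays recoverable after a result has been injected.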

System-prompt format

Respond with interleaved text and audio.

Tools available:
- weather: get current weather and forecasts for a location
- alarm: set or cancel alarms
- music: play, pause, or skip music
…

If a request needs one of these tools, acknowledge briefly and stop.
If known facts are provided below, use them to answer the user directly
without acknowledging again. Otherwise answer normally.

Known facts you must use when relevant:
Weather in Tokyo: 72°F, sunny.

The Known facts block is optional: include it for turn-2 narration and omit it for turn-1 ack.
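One way to assemble this prompt programmatically is sketched below. The helper name build_system_prompt and the dict-of-tools input are assumptions made for illustration; only the prompt wording comes from the format above.

```python
from typing import Mapping, Optional

INSTRUCTIONS = (
    "If a request needs one of these tools, acknowledge briefly and stop.\n"
    "If known facts are provided below, use them to answer the user directly\n"
    "without acknowledging again. Otherwise answer normally."
)


def build_system_prompt(
    tools: Mapping[str, str], known_facts: Optional[str] = None
) -> str:
    """Render the system prompt; the facts block is added only when given."""
    tool_lines = "\n".join(f"- {name}: {desc}" for name, desc in tools.items())
    sections = [
        "Respond with interleaved text and audio.",
        f"Tools available:\n{tool_lines}",
        INSTRUCTIONS,
    ]
    if known_facts:
        # Turn 2 only: append the tool result as the final section.
        sections.append(f"Known facts you must use when relevant:\n{known_facts}")
    return "\n\n".join(sections)


tools = {
    "weather": "get current weather and forecasts for a location",
    "alarm": "set or cancel alarms",
}
ack_prompt = build_system_prompt(tools)  # turn 1
narrate_prompt = build_system_prompt(tools, "Weather in Tokyo: 72°F, sunny.")  # turn 2
print(narrate_prompt)
```

Because the facts block is always the last section, the turn-2 prompt is the turn-1 prompt plus a suffix.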

Eval results

On the matbee/lfm2-tool-aware-dataset-v2 eval split (149 parsed rows across 5 classes):

| Class | v1 | v2 | Notes |
|---|---|---|---|
| tool_match (ack) | 93.3% | 100.0% | v1 state-query failures fixed (e.g., "what's the thermostat set to") |
| tool_result_speak (narrate) | n/a (0/20 strict) | 89.7% | New behavior unlocked. All failures are call-scenario fact-vs-query overrides |
| tool_miss (refuse) | 93.1% | 96.7% | |
| general (answer normally) | 100.0% | 100.0% | No baseline regression |
| chitchat (chat) | 100.0% | 100.0% | After correcting scorer false positives on "doing well, thanks" |

Overall: ~97% across the 5 classes, with the new turn-2 narration class at 89.7%.
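The card does not say how the overall figure is aggregated. One reading consistent with ~97% is the unweighted (macro) mean of the five v2 per-class accuracies, sketched below. A row-weighted mean would need the per-class row counts, which are not listed here.

```python
# Per-class v2 accuracy (%) copied from the eval table above.
v2_accuracy = {
    "tool_match": 100.0,
    "tool_result_speak": 89.7,
    "tool_miss": 96.7,
    "general": 100.0,
    "chitchat": 100.0,
}

# Unweighted (macro) mean: every class counts equally, regardless of row count.
macro_mean = sum(v2_accuracy.values()) / len(v2_accuracy)
print(f"{macro_mean:.1f}%")  # → 97.3%
```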

Known failure modes

  • call scenario fact-vs-query override: "call mom" + injected fact "Calling your office" → narration says "connecting you to your office". The model trusts the injected fact verbatim. That is the correct behavior in general, but in the call scenario the contact name from the query should not be overridden. In production the dispatcher provides query-aligned facts, so this only appears when data synthesis pairs a query with a mismatched fact.
  • Search-adjacent generalization (carryover from v1): "recipe for miso soup" with search listed but not recipe → "searching for miso soup recipe". Arguably correct, since search can find recipes.

Usage

from pathlib import Path
import torch
import torchaudio
from liquid_audio import LFM2AudioModel, LFM2AudioProcessor, ChatState

# Pass a Path (not str) so liquid-audio takes the local-checkpoint branch
local = Path("./lfm2.5-audio-tool-aware-v2")
processor = LFM2AudioProcessor.from_pretrained(local, device="cuda").eval()
model = LFM2AudioModel.from_pretrained(
    local, device="cuda", dtype=torch.bfloat16
).eval()

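def respond(system_prompt: str, wav_path: str) -> str:
    """Run one turn and return only the decoded text of the reply."""
    chat = ChatState(processor)

    # System turn: tool list, instructions, and (for turn 2) the facts block.
    chat.new_turn("system")
    chat.add_text(system_prompt)
    chat.end_turn()

    # User turn: load the audio and downmix multi-channel input to mono.
    wav, sr = torchaudio.load(wav_path)
    if wav.shape[0] > 1:
        wav = wav.mean(0, keepdim=True)
    chat.new_turn("user")
    chat.add_audio(wav, sr)
    chat.end_turn()

    # Assistant turn: keep single-element (text) tokens, skip audio frames.
    chat.new_turn("assistant")
    pieces = []
    for token in model.generate_interleaved(
        **chat, max_new_tokens=120, audio_temperature=1.0, audio_top_k=4
    ):
        if token.numel() == 1:
            pieces.append(processor.text.decode(token))
    return "".join(pieces).strip()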

# Turn 1 — ack
TOOLS = (
    "Respond with interleaved text and audio.\n\n"
    "Tools available:\n- weather: get current weather...\n\n"
    "If a request needs one of these tools, acknowledge briefly and stop. "
    "If known facts are provided below, use them to answer the user "
    "directly without acknowledging again. Otherwise answer normally."
)
print(respond(TOOLS, "user_says_whats_the_weather.wav"))
# → "let me check the weather in tokyo."

# Turn 2 — narrate (after dispatcher returned the result)
WITH_FACTS = TOOLS + "\n\nKnown facts you must use when relevant:\n" \
                     "Weather in Tokyo: 72°F, sunny."
print(respond(WITH_FACTS, "user_says_whats_the_weather.wav"))
# → "it's 72 and sunny in tokyo."

Training

  • Base: LiquidAI/LFM2.5-Audio-1.5B (1.45B params, bf16)
  • Data: 3000 train + 400 eval examples (see matbee/lfm2-tool-aware-dataset-v2). v2 mix: 28% tool_match / 29% tool_result_speak / 14% tool_miss / 18% general / 11% chitchat
  • Hardware: 2× RTX 4090 (DDP, CUDA_LAUNCH_BLOCKING=1, NCCL_P2P_DISABLE=1 for the no-NVLink 4090 pair)
  • Trainer: upstream liquid_audio.trainer.Trainer — full fine-tune, bf16 mixed precision
  • Hyperparams: AdamW, lr 5e-5, cosine schedule with 50-step warmup, batch_size 4 per GPU (effective 8), 700 steps (~3.5 epochs), context_length 320
  • Wall clock: ~21 minutes
  • Loss: train 2.0 → 0.25-0.55 (jagged from the 5-class mix), val 0.85 → 0.50 at step 700
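The learning-rate schedule named in the hyperparameters can be sketched as follows. The peak rate, warmup length, and step count come from the list above; the linear warmup shape and the decay all the way to zero are assumptions, since the exact behavior of the upstream trainer is not stated here.

```python
import math

PEAK_LR = 5e-5
WARMUP_STEPS = 50
TOTAL_STEPS = 700


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        # Assumption: linear ramp from 0 up to the peak during warmup.
        return PEAK_LR * step / WARMUP_STEPS
    # Assumption: cosine decay from the peak down to 0 at the final step.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 25, 50, 375, 700):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the rate peaks at step 50, falls to half the peak at step 375 (the midpoint of the decay phase), and reaches zero at step 700.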

The v2 val_loss (0.50) is higher than v1 (0.33) because v2 includes the harder tool_result_speak class. Per-class accuracy is the fairer comparison, and it shows v2 matching or beating v1 in every class.

Limitations

Same as v1, plus:

  • call scenario fact-vs-query alignment (see Known failure modes above).
  • Synthesis-side mismatches in tool_result_speak training data: synthetic facts are randomly drawn per scenario, not query-aligned. This trains the model to trust the injected fact over the user's query when they conflict. That is correct in production, where the dispatcher provides aligned facts, but the conflict occasionally surfaces during eval.

Inherits from v1:

  • All training audio synthetic (Kokoro). Real human speech untested.
  • Assistant audio voice is am_adam (Kokoro male, American English).
  • 20-scenario SLURP-inspired tool taxonomy. Unseen tool names untested.
  • English only.
  • Single-turn behaviour per call (the two-turn flow is composed by the coordinator across two process() calls).

License

LFM Open License v1.0, inherited from the base model. See LICENSE.

Citation

@misc{liquidai2025lfm25audio,
  title={LFM2.5-Audio: Speech-to-Speech Foundation Model},
  author={Liquid AI},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B}
}

The dataset, training recipe, and two-turn flow design are described in matbee/lfm2-tool-aware-dataset-v2.
