---
library_name: transformers
license: other
license_name: lfm1.0
license_link: LICENSE
language:
  - en
  - ar
  - zh
  - fr
  - de
  - ja
  - ko
  - es
  - pt
pipeline_tag: text-generation
tags:
  - liquid
  - lfm2
  - edge
---
Liquid AI
Try LFM • Documentation • LEAP

# LFM2-24B-A2B

LFM2 is a family of hybrid models designed for on-device deployment. LFM2-24B-A2B is the largest model in the family, scaling the architecture to 24 billion parameters while keeping inference efficient.

  • Best-in-class efficiency: A 24B MoE model with only 2B active parameters per token, fitting in 32 GB of RAM for deployment on consumer laptops and desktops.
  • Fast edge inference: 112 tok/s decode on AMD CPU, 293 tok/s on H100. Fits in 32B GB of RAM with day-one support llama.cpp, vLLM, and SGLang.
  • Predictable scaling: Quality improves log-linearly from 350M to 24B total parameters, confirming the LFM2 hybrid architecture scales reliably across nearly two orders of magnitude.


Find more information about LFM2-24B-A2B in our blog post.

πŸ—’οΈ Model Details

LFM2-24B-A2B is a general-purpose instruct model (without reasoning traces) with the following features:

| Property | LFM2-8B-A1B | LFM2-24B-A2B |
|---|---|---|
| Total parameters | 8.3B | 24B |
| Active parameters | 1.5B | 2.3B |
| Layers | 24 (18 conv + 6 attn) | 40 (30 conv + 10 attn) |
| Context length | 32,768 tokens | 32,768 tokens |
| Vocabulary size | 65,536 | 65,536 |
| Training precision | Mixed BF16/FP8 | Mixed BF16/FP8 |
| Training budget | 12 trillion tokens | 17 trillion tokens |
| License | LFM Open License v1.0 | LFM Open License v1.0 |

Supported languages: English, Arabic, Chinese, French, German, Japanese, Korean, Spanish, Portuguese

Generation parameters:

  • temperature: 0.1
  • top_k: 50
  • repetition_penalty: 1.05

We recommend the following use cases:

  • Agentic tool use: Native function calling, web search, structured outputs. Ideal as the fast inner-loop model in multi-step agent pipelines.
  • Offline document summarization and Q&A: Run entirely on consumer hardware for privacy-sensitive workflows (legal, medical, corporate).
  • Privacy-preserving customer support agent: Deployed on-premise at a company, handles multi-turn support conversations with tool access (database lookups, ticket creation) without data leaving the network.
  • Local RAG pipelines: Serve as the generation backbone in retrieval-augmented setups on a single machine without GPU servers.

### Chat Template

LFM2-24B-A2B uses a ChatML-like format. See the Chat Template documentation for details. Example:

```
<|startoftext|><|im_start|>system
You are a helpful assistant trained by Liquid AI.<|im_end|>
<|im_start|>user
What is C. elegans?<|im_end|>
<|im_start|>assistant
```

You can use tokenizer.apply_chat_template() to format your messages automatically.
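
For example, a minimal sketch that builds the prompt above with the tokenizer (the exact rendered string is determined by the model's chat template):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-24B-A2B")

messages = [
    {"role": "system", "content": "You are a helpful assistant trained by Liquid AI."},
    {"role": "user", "content": "What is C. elegans?"},
]

# Render the ChatML-like prompt as a string; set tokenize=True to get input IDs instead.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(prompt)
```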

### Tool Use

LFM2-24B-A2B supports function calling as follows:

  1. Function definition: We recommend providing the list of tools as a JSON object in the system prompt. You can also use the tokenizer.apply_chat_template() function with tools (see the sketch after the example below).
  2. Function call: By default, LFM2-24B-A2B writes Pythonic function calls (a Python list between <|tool_call_start|> and <|tool_call_end|> special tokens) as the assistant answer. You can override this behavior by asking the model to output JSON function calls in the system prompt.
  3. Function execution: The function call is executed, and the result is returned as a "tool" role.
  4. Final answer: LFM2-24B-A2B interprets the outcome of the function call to address the original user prompt in plain text.

See the Tool Use documentation for the full guide. Example:

```
<|startoftext|><|im_start|>system
List of tools: [{"name": "get_candidate_status", "description": "Retrieves the current status of a candidate in the recruitment process", "parameters": {"type": "object", "properties": {"candidate_id": {"type": "string", "description": "Unique identifier for the candidate"}}, "required": ["candidate_id"]}}]<|im_end|>
<|im_start|>user
What is the current status of candidate ID 12345?<|im_end|>
<|im_start|>assistant
<|tool_call_start|>[get_candidate_status(candidate_id="12345")]<|tool_call_end|>Checking the current status of candidate ID 12345.<|im_end|>
<|im_start|>tool
[{"candidate_id": "12345", "status": "Interview Scheduled", "position": "Clinical Research Associate", "date": "2023-11-20"}]<|im_end|>
<|im_start|>assistant
The candidate with ID 12345 is currently in the "Interview Scheduled" stage for the position of Clinical Research Associate, with an interview date set for 2023-11-20.<|im_end|>
```
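
As mentioned in step 1, the tool list can also be passed through the chat template instead of being pasted into the system prompt by hand. A hedged sketch using the same tool as the example above (how the tools are rendered into the prompt is determined by the model's chat template):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-24B-A2B")

# Same tool definition as in the example above, as a JSON-schema-style dict.
tools = [{
    "name": "get_candidate_status",
    "description": "Retrieves the current status of a candidate in the recruitment process",
    "parameters": {
        "type": "object",
        "properties": {
            "candidate_id": {"type": "string", "description": "Unique identifier for the candidate"}
        },
        "required": ["candidate_id"],
    },
}]

messages = [{"role": "user", "content": "What is the current status of candidate ID 12345?"}]

# The chat template injects the tool definitions into the system prompt.
prompt = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False)
print(prompt)
```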

πŸƒ Inference

LFM2-24B-A2B is supported by many inference frameworks. See the Inference documentation for the full list.

| Name | Description | Docs | Notebook |
|---|---|---|---|
| Transformers | Simple inference with direct access to model internals. | Link | Colab link |
| vLLM | High-throughput production deployments with GPU. | Link | Colab link |
| llama.cpp | Cross-platform inference with CPU offloading. | Link | Colab link |
| MLX | Apple's machine learning framework optimized for Apple Silicon. | Link | — |
| LM Studio | Desktop application for running LLMs locally. | Link | — |

Here's a quick start example with Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "LiquidAI/LFM2-24B-A2B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
#   attn_implementation="flash_attention_2" <- uncomment on compatible GPU
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "What is C. elegans?"

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
).to(model.device)

output = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_new_tokens=512,
    streamer=streamer,
)
```
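
For comparison, a hedged sketch of offline inference with vLLM for the same prompt (assuming a vLLM build with LFM2 support and sufficient GPU memory; see the vLLM docs linked in the table above for serving deployments):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="LiquidAI/LFM2-24B-A2B")

# Recommended generation parameters from this model card.
sampling_params = SamplingParams(
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_tokens=512,  # illustrative limit
)

# llm.chat() applies the model's chat template before generating.
outputs = llm.chat(
    [{"role": "user", "content": "What is C. elegans?"}],
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```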

## 🔧 Fine-Tuning

| Name | Description | Docs | Notebook |
|---|---|---|---|
| CPT (Unsloth) | Continued Pre-Training using Unsloth for text completion. | Link | Colab link |
| CPT (Unsloth) | Continued Pre-Training using Unsloth for translation. | Link | Colab link |
| SFT (Unsloth) | Supervised Fine-Tuning with LoRA using Unsloth. | Link | Colab link |
| SFT (TRL) | Supervised Fine-Tuning with LoRA using TRL. | Link | Colab link |
| DPO (TRL) | Direct Preference Optimization with LoRA using TRL. | Link | Colab link |
| GRPO (Unsloth) | GRPO with LoRA using Unsloth. | Link | Colab link |
| GRPO (TRL) | GRPO with LoRA using TRL. | Link | Colab link |
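
As an illustration of the SFT (TRL) row, a minimal LoRA fine-tuning sketch with TRL and PEFT. The dataset, LoRA settings, and batch size below are illustrative assumptions, not Liquid AI recommendations; see the linked notebooks for tested recipes.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Any chat-formatted dataset with a "messages" column works here; this one is illustrative.
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:1%]")

trainer = SFTTrainer(
    model="LiquidAI/LFM2-24B-A2B",
    train_dataset=dataset,
    # Illustrative LoRA hyperparameters; tune rank and target modules for your setup.
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
    args=SFTConfig(output_dir="lfm2-24b-a2b-sft", per_device_train_batch_size=1),
)
trainer.train()
```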

## 📊 Performance

### CPU Inference

We compared LFM2-24B-A2B against two popular MoE models of similar size: Qwen3-30B-A3B-Instruct-2507 (30.5B total, 3.3B active parameters) and gpt-oss-20b (21B total, 3.6B active parameters). We measured both prefill and decode throughput with Q4_K_M versions of these models using llama.cpp on an AMD Ryzen AI Max+ 395.

*(Figures: prefill and decode throughput on the AMD Ryzen AI Max+ 395.)*

### GPU Inference

We also report throughput (total tokens / wall time) achieved with vLLM on a single H100 SXM5 GPU.

*(Figure: throughput with vLLM on a single H100 SXM5 GPU.)*

## Contact

For enterprise solutions and edge deployment, contact sales@liquid.ai.

## Citation

```bibtex
@article{liquidAI202624B,
  author  = {Liquid AI},
  title   = {LFM2.5-24B-A2B: Scaling Up the LFM2 Architecture},
  journal = {Liquid AI Blog},
  year    = {2026},
  note    = {www.liquid.ai/blog/},
}

@article{liquidai2025lfm2,
  author  = {Liquid AI},
  title   = {LFM2 Technical Report},
  journal = {arXiv preprint arXiv:2511.23404},
  year    = {2025},
}
```