Instructions to use LLM-OS-Models/Fabliq-8B-Agent with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use LLM-OS-Models/Fabliq-8B-Agent with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="LLM-OS-Models/Fabliq-8B-Agent")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("LLM-OS-Models/Fabliq-8B-Agent")
model = AutoModelForCausalLM.from_pretrained("LLM-OS-Models/Fabliq-8B-Agent")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```


vLLM

How to use LLM-OS-Models/Fabliq-8B-Agent with vLLM:

Install from pip and serve model

```shell
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "LLM-OS-Models/Fabliq-8B-Agent"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "LLM-OS-Models/Fabliq-8B-Agent",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker

```shell
docker model run hf.co/LLM-OS-Models/Fabliq-8B-Agent
```

SGLang

How to use LLM-OS-Models/Fabliq-8B-Agent with SGLang:

Install from pip and serve model

```shell
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "LLM-OS-Models/Fabliq-8B-Agent" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "LLM-OS-Models/Fabliq-8B-Agent",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images

```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "LLM-OS-Models/Fabliq-8B-Agent" \
        --host 0.0.0.0 \
        --port 30000
```

Docker Model Runner
How to use LLM-OS-Models/Fabliq-8B-Agent with Docker Model Runner:
```
docker model run hf.co/LLM-OS-Models/Fabliq-8B-Agent
```

Fabliq-8B-Agent 🌊

Fab·liq = Fable + Liquid. A compact, fast agentic terminal coding model, fine-tuned from LLM-OS-Models/LFM2.5-8B-A1B-Terminal-ToolBench-Full-SFT-1Epoch on real Claude Code sessions from the Fable-5-traces dataset. The underlying LiquidAI LFM2.5-8B-A1B is an 8B-parameter MoE with ~1B parameters active per token, so Fabliq combines the low per-token compute of MoE inference with the agentic distillation.

Fabliq thinks before it acts: it reads the conversation, reasons inside `<think>...</think>`, then either calls a tool in LFM's native tool-call format or replies with text. It was trained on 4,047 real multi-turn terminal trajectories (Bash, Edit, Read, Write, Glob, Grep, WebSearch), the kind of read → reason → act → verify loop a real coding agent runs.

✨ Why Fabliq?

🐠 Small active footprint, agent-class behavior. LFM2.5-8B-A1B is a Mixture-of-Experts model: only ~1B of its ~8B parameters activate per token. That keeps per-token compute low and inference fast (the full set of weights still has to be loaded), and the agentic distillation still takes.
🛠 Native tool calling. No wrapper needed — Fabliq emits <|tool_call_start|>[ToolName(arg=value)]<|tool_call_end|> per LFM's official format. Plug it into a harness that parses and executes those calls and you have a working terminal agent.
🧠 Reasoning-first. Every assistant turn opens with a <think> block — the chain-of-thought from the original Claude traces, preserved verbatim. The model self-explains before each action.
🔗 Clean lineage. This is a direct fine-tune of LLM-OS-Models/LFM2.5-8B-A1B-Terminal-ToolBench-Full-SFT-1Epoch, which is itself a fine-tune of LiquidAI/LFM2.5-8B-A1B. Fabliq adds 3 epochs of Fable-5 agentic distillation on top of the ToolBench foundation.

Sibling models:

Fabliq-8B-Agent-Reasoning — adds general/deep reasoning (WithinUs + Helio) on top of this base.

🧪 Model details


| Property | Value |
|---|---|
| Architecture | `Lfm2MoeForCausalLM` (24 layers, 32 experts, 4 experts/token) |
| Parameters | ~8B total / ~1B active (MoE) |
| Context | 8,192 trained · 128K native (`rope_theta=5e6`) |
| Precision | bfloat16 |
| Fine-tune type | Full-parameter SFT (FSDP `full_shard` + activation checkpointing) |
| License | Apache 2.0 |
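
As a rough sanity check on the figures above, weight memory can be estimated from the parameter count and the precision. This is a back-of-the-envelope sketch only: the ~8B and ~1B values are the rounded numbers from the table, and the estimate ignores KV cache, activations, and framework overhead.

```python
# Back-of-the-envelope weight memory, using the rounded figures from the table.
TOTAL_PARAMS = 8e9     # ~8B total parameters (approximate)
ACTIVE_PARAMS = 1e9    # ~1B parameters active per token (approximate)
BYTES_PER_PARAM = 2    # bfloat16 stores each parameter in 2 bytes

weight_bytes = TOTAL_PARAMS * BYTES_PER_PARAM
weight_gib = weight_bytes / 2**30
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"weights: ~{weight_gib:.1f} GiB in bfloat16")                # ~14.9 GiB
print(f"active per token: ~{active_fraction:.1%} of parameters")    # ~12.5%
```

All experts have to be resident for routing to work, so the memory requirement follows the total parameter count, while per-token compute follows the active count.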

📚 Training data

| Source | Rows | Description |
|---|---|---|
| Glint-Research/Fable-5-traces | 4,047 | Real Claude Code terminal sessions: multi-turn tool-use trajectories |

Preprocessing pipeline (`build_fable5_to_lfm_sft.py`):

Parse each trace's context into multi-turn USER / ASSISTANT (message) messages
Strip slash-command metadata blocks (`<local-command-caveat>`, `<command-*>`)
Convert `{tool, input}` structured output to LFM native tool-call syntax: `<|tool_call_start|>[ToolName(arg='value')]<|tool_call_end|>`
Wrap chain-of-thought in `<think>...</think>` so the assistant's reasoning is part of the training target
Drop 618 rows with fewer than 3 messages after parsing
Cap sequences at 8,192 tokens (98.6% of sequences fit without truncation)
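
The conversion and wrapping steps above can be sketched as follows. This is an illustration, not the contents of `build_fable5_to_lfm_sft.py`: the function names `to_lfm_tool_call` and `build_assistant_turn` are hypothetical, and rendering arguments with Python's `repr()` is an assumption based on the `arg='value'` form shown in the list.

```python
def to_lfm_tool_call(tool: str, tool_input: dict) -> str:
    """Render one {tool, input} record as an LFM-style tool-call string."""
    # repr() yields Python-literal arguments, e.g. command='ls /tmp' (assumed format).
    args = ", ".join(f"{name}={value!r}" for name, value in tool_input.items())
    return f"<|tool_call_start|>[{tool}({args})]<|tool_call_end|>"


def build_assistant_turn(reasoning: str, tool: str, tool_input: dict) -> str:
    """Prefix the tool call with the chain-of-thought wrapped in <think> tags."""
    return f"<think>\n{reasoning}\n</think>\n\n{to_lfm_tool_call(tool, tool_input)}"


# Hypothetical record in the {tool, input} shape described above.
record = {"tool": "Bash", "input": {"command": "ls /tmp"}}
print(build_assistant_turn("List the directory first.", record["tool"], record["input"]))
```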

Full lineage, data composition, and per-dataset metadata: see the Fable Distillation docs.

🔧 Training procedure

| Hyperparameter | Value |
|---|---|
| Schedule | 3 epochs, constant LR |
| Max sequence length | 8,192 |
| Per-device batch size | 2 |
| Gradient accumulation | 4 |
| GPUs | 8× H200 (effective batch 64) |
| Learning rate | 5e-7 (AdamW) |
| Precision | bf16 |
| FSDP | `full_shard`, activation checkpointing, `Lfm2MoeDecoderLayer` auto-wrap |
| Loss | NLL, chat-template masked (assistant-only effective) |
| Final train_loss | 1.277 |
| Train runtime | 831 seconds (~14 min) |
| Global steps | 192 |
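
The effective batch size and step count in the table follow from the other values. The short check below reproduces them, under two assumptions that the card does not state explicitly: the 4,047 rows are the final training set, and the last partial batch of each epoch still counts as an optimizer step.

```python
import math

rows = 4047            # training rows (from the training data table)
per_device_batch = 2
grad_accum = 4
gpus = 8
epochs = 3
runtime_seconds = 831

effective_batch = per_device_batch * grad_accum * gpus      # 2 * 4 * 8 = 64
steps_per_epoch = math.ceil(rows / effective_batch)         # assumes partial batch is kept
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)        # 64 64 192
print(f"{runtime_seconds / total_steps:.2f} s per optimizer step")
```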

💬 System prompt

```
You are an agentic coding assistant. Read the conversation history and tool results,
think step by step inside <think>...</think>, then either call a tool using
<|tool_call_start|>[ToolName(arg=value)]<|tool_call_end|> or respond with text.
Use available tools (Bash, Edit, Read, Write, Glob, Grep, WebSearch, WebFetch, etc.)
to accomplish the user's task. Be concise but thorough.
```

🚀 How to use

transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LLM-OS-Models/Fabliq-8B-Agent"
tok = AutoTokenizer.from_pretrained(model_id)
# Load in bfloat16 and let Accelerate place the weights on the available device(s).
# (Older transformers releases name this argument torch_dtype.)
model = AutoModelForCausalLM.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

SYSTEM = (
    "You are an agentic coding assistant. Read the conversation history and tool results, "
    "think step by step inside <think>...</think>, then either call a tool using "
    "<|tool_call_start|>[ToolName(arg=value)]<|tool_call_end|> or respond with text. "
    "Use available tools (Bash, Edit, Read, Write, Glob, Grep, WebSearch, WebFetch, etc.) "
    "to accomplish the user's task. Be concise but thorough."
)

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "List the Python files in /tmp and report the line count of the largest one."},
]
# Apply the chat template and tokenize in one call, so special tokens the
# template already inserts are not added a second time by the tokenizer.
inputs = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding with a mild repetition penalty.
out = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=False,
    repetition_penalty=1.05,
)
# Decode only the new tokens; keep special tokens so the tool-call markers stay visible.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
```

Example output (the model reasons, then emits a tool call):

```
<think>
I need to find Python files in /tmp. I'll use Bash with ls piped into wc -l.
</think>

<|tool_call_start|>[Bash(command='ls -1 /tmp/*.py 2>/dev/null | xargs wc -l 2>/dev/null | sort -n | tail -1')]<|tool_call_end|>
```
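
A harness has to pull the tool name and arguments back out of output like the sample above. The sketch below is one possible way to do that, not an official parser: `parse_tool_calls` is a hypothetical helper, and it assumes the payload between the markers is a Python-style list of calls with literal keyword arguments, as in the sample. It uses `ast` so the payload is parsed but never executed.

```python
import ast
import re

TOOL_CALL_RE = re.compile(r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.DOTALL)


def parse_tool_calls(text: str) -> list[tuple[str, dict]]:
    """Return (tool_name, keyword_args) pairs found in a model response."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            tree = ast.parse(payload.strip(), mode="eval")
        except SyntaxError:
            continue  # malformed payload: skip rather than guess
        if not isinstance(tree.body, ast.List):
            continue  # assumed shape is [ToolName(arg=value), ...]
        for node in tree.body.elts:
            if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
                continue
            try:
                # literal_eval accepts only plain literals, so nothing is executed.
                kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
            except (ValueError, TypeError, SyntaxError):
                continue
            calls.append((node.func.id, kwargs))
    return calls


response = (
    "<think>\nI need to find Python files in /tmp.\n</think>\n\n"
    "<|tool_call_start|>[Bash(command='ls -1 /tmp/*.py')]<|tool_call_end|>"
)
print(parse_tool_calls(response))
```

A response with no tool-call markers yields an empty list, which a harness can treat as a plain text reply.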

vLLM

```shell
vllm serve LLM-OS-Models/Fabliq-8B-Agent \
  --max-model-len 8192 --dtype bfloat16 --gpu-memory-utilization 0.9
```
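
Once the server is running, it can be called through the OpenAI-compatible chat endpoint. The sketch below uses only the standard library and is written under a few assumptions: the server listens on vLLM's default port 8000, `build_payload` and `chat` are hypothetical helper names, and `skip_special_tokens` is passed as a vLLM-specific extra field so that the tool-call markers are not stripped from the reply (check this against your vLLM version).

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"  # assumes vLLM's default port
MODEL_ID = "LLM-OS-Models/Fabliq-8B-Agent"


def build_payload(system_prompt: str, user_prompt: str, max_tokens: int = 1024) -> dict:
    """Build a chat-completions request body for the served model."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.0,
        # Assumption: vLLM extra parameter that keeps special tokens in the output.
        "skip_special_tokens": False,
    }


def chat(payload: dict, url: str = BASE_URL) -> str:
    """POST the payload and return the assistant message text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# reply = chat(build_payload("<system prompt from this card>", "List the files in /tmp."))
# print(reply)
```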

🎯 Intended use

Local coding agent on top of an MoE-efficient backbone (~1B active params — runs comfortably on a single consumer GPU)
Terminal / file-system agentic loops (read, edit, run, verify)
Research on distillation from frontier closed models into open-weight MoE backbones

⚠️ Limitations

Agentic specialization. Focused fine-tune for terminal/coding work. General-knowledge benchmarks may sit slightly below the LFM2.5-8B-A1B base — that's the expected trade-off for a focused agentic distill.
No safety alignment. Trained on raw tool-use traces; add your own guardrails for production.
Tool-format lock-in. Emits LFM-native tool-call syntax. A harness that parses <|tool_call_start|>...<|tool_call_end|> and actually executes the call is required for the agentic loop to work.
Max seq 8,192 at training. Behavior beyond 8K context is unverified for this checkpoint.
English-centric.
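
Because the model only emits tool calls and carries no safety alignment, execution and guardrails are the harness's job. The sketch below shows one minimal shape for that layer, assuming an explicit allowlist of tools. Everything in it is hypothetical: the `TOOLS` registry, the handler names, and the `file_path` argument name are illustrative choices, and a production harness would need sandboxing beyond a timeout.

```python
import subprocess
from pathlib import Path


def run_bash(command: str, timeout: int = 30) -> str:
    """Run a shell command and return combined stdout and stderr."""
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return proc.stdout + proc.stderr


def read_file(file_path: str) -> str:
    """Return the contents of a text file."""
    return Path(file_path).read_text()


# Allowlist: only tools registered here can be executed (illustrative subset).
TOOLS = {"Bash": run_bash, "Read": read_file}


def execute_tool(name: str, args: dict) -> str:
    """Dispatch one parsed tool call; errors are returned as text for the model."""
    handler = TOOLS.get(name)
    if handler is None:
        return f"error: tool {name!r} is not enabled"
    try:
        return handler(**args)
    except Exception as exc:  # report the failure back instead of crashing the loop
        return f"error: {type(exc).__name__}: {exc}"
```

Returning errors as strings lets the harness feed the failure back to the model as a tool result, so the verify step of the loop can react to it.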

📜 License

Apache 2.0, inherited from the LiquidAI LFM2.5-8B-A1B base.

🌳 Model tree

This is a fine-tune (not a merge or adapter). Direct parent: LLM-OS-Models/LFM2.5-8B-A1B-Terminal-ToolBench-Full-SFT-1Epoch.

```
LiquidAI/LFM2.5-8B-A1B                                          (LiquidAI base)
  └─ LLM-OS-Models/LFM2.5-8B-A1B-Terminal-ToolBench-Full-SFT-1Epoch  (ToolBench foundation)
      └─ LLM-OS-Models/Fabliq-8B-Agent                           ← this model (Fable-5 agentic SFT, 3 epochs)
          └─ LLM-OS-Models/Fabliq-8B-Agent-Reasoning             ← sibling (+ WithinUs + Helio reasoning)
```