# SmolLM3-3B-LAM: A Large Action Model for AI Agent Creation
A fine-tuned SmolLM3-3B model that creates complete AI agent definitions from natural language requests. Given a description like "Build me a bot that monitors S3 for sensitive data exposure", it outputs a structured JSON agent specification with tools, skills, constraints, and architectural reasoning.
## What It Does
This is not a general-purpose chatbot or a function-calling model. It does one thing: design AI agents from scratch.
**Input:**

```
Build an agent that monitors my S3 buckets for sensitive data exposure and auto-remediates
```

**Output:**

```json
{
  "reasoning": "This is a casual, vague request. The user wants a simple, focused agent that watches S3 buckets for sensitive data and automatically fixes issues...",
  "agent": {
    "name": "s3-data-guardian",
    "description": "Monitors S3 buckets for sensitive data exposure and auto-remediates",
    "role": "cloud security monitor",
    "tools": [
      {
        "name": "scan_s3_objects",
        "description": "Scans S3 objects for sensitive data patterns",
        "parameters": [{ "name": "bucket_name", "type": "string", "description": "S3 bucket to scan", "required": true }],
        "returns": "List of objects containing sensitive data with classification"
      }
    ],
    "skills": [
      {
        "name": "detect-and-remediate",
        "description": "Scans buckets and auto-remediates exposure",
        "trigger": "Scheduled scan or S3 event notification",
        "inputs": [{ "name": "bucket_name", "type": "string", "description": "Target bucket", "required": true }],
        "steps": [
          { "action": "Scan all objects in bucket for sensitive data", "tool": "scan_s3_objects" },
          { "action": "Restrict public access on flagged objects", "tool": "update_bucket_policy" },
          { "action": "Send alert with remediation summary", "tool": "send_alert" }
        ],
        "output": "Remediation report with actions taken"
      }
    ],
    "constraints": [
      "Never delete S3 objects, only restrict access",
      "Log all remediation actions to audit trail"
    ]
  }
}
```
## Benchmark Results

### Held-Out Validation Set (20 samples, never seen during training)
| Model | Avg Score | Min | Max | Valid JSON % |
|---|---|---|---|---|
| SmolLM3-3B-LAM (this model) | 96.3 | 70 | 100 | 100% |
| SmolLM3-3B (base) | 79.3 | 60 | 90 | 100% |
| xLAM-1B-fc-r (Salesforce) | 27.5 | 20 | 40 | 100% |
### 3-Way Comparison (Hand-Crafted Prompts)
| Model | Params | T1 | T2 | T3 | Avg |
|---|---|---|---|---|---|
| SmolLM3-3B-LAM | 3B | 95 | 100 | 100 | 98.3 |
| SmolLM3-3B (base) | 3B | 90 | 70 | 85 | 81.7 |
| xLAM-1B-fc-r | 1B | 20 | 40 | 40 | 33.3 |
**Key findings:**

- +21.4% relative improvement over the base SmolLM3-3B model
- +250% relative improvement over Salesforce's xLAM-1B-fc-r (a purpose-built Large Action Model)
- 100/100 on 13 of 20 held-out validation examples
- Goes straight to clean structured JSON without `<think>` wrapper tags
- Learned to characterize user tone and adjust agent complexity accordingly
### Scoring Methodology

Each output is scored 0-100: valid JSON (20 pts) plus points for the presence of key schema fields: reasoning (10), agent (10), tools (10), skills (10), constraints (10), steps (10), trigger (5), parameters (5), on_failure (5), description (5).
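The rubric above can be sketched as a small structural checker. This is an illustrative reconstruction, not the actual evaluation harness; the recursive key walk is an assumption about how field presence was detected:

```python
import json

# Points per rubric item: 20 for valid JSON, the rest for field presence.
FIELD_WEIGHTS = {
    "reasoning": 10, "agent": 10, "tools": 10, "skills": 10,
    "constraints": 10, "steps": 10, "trigger": 5, "parameters": 5,
    "on_failure": 5, "description": 5,
}

def collect_keys(obj, found):
    """Recursively record every dict key that appears anywhere in the output."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            found.add(key)
            collect_keys(value, found)
    elif isinstance(obj, list):
        for item in obj:
            collect_keys(item, found)

def score_output(text):
    """Score a model output 0-100: 20 pts for valid JSON plus field-presence points."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return 0
    score = 20
    present = set()
    collect_keys(parsed, present)
    for field, points in FIELD_WEIGHTS.items():
        if field in present:
            score += points
    return score
```

A fully populated agent definition (like the example above) scores 100; malformed JSON scores 0.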
## Training Details
| Parameter | Value |
|---|---|
| Base model | HuggingFaceTB/SmolLM3-3B |
| Method | QLoRA (8-bit quantized base + full-precision adapters) |
| Framework | Apple MLX |
| Trainable parameters | 6.7M / 3,075M (0.218%) |
| Training iterations | 500 |
| Batch size | 2 |
| Learning rate | 1e-5 |
| LoRA layers | 16 |
| Max sequence length | 8,192 |
| Peak memory | 59.5 GB (Apple Silicon unified memory) |
| Training time | ~40 minutes on M-series Mac |
| Best val loss | 0.559 (iter 250) |
| Final val loss | 0.625 (iter 500) |
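Using the standard MLX LoRA CLI, a run with these hyperparameters would look roughly like the invocation below. This is a sketch, not the author's actual command: the flag set follows `mlx_lm.lora`'s documented options, and the `--data` path is a placeholder.

```shell
# Illustrative QLoRA run matching the table above; the data directory
# is a placeholder and would hold train/valid JSONL splits.
python -m mlx_lm.lora \
  --model HuggingFaceTB/SmolLM3-3B \
  --train \
  --data ./data \
  --iters 500 \
  --batch-size 2 \
  --learning-rate 1e-5 \
  --num-layers 16 \
  --max-seq-length 8192
```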
### Training Loss Curve
| Iter | Train Loss | Val Loss |
|---|---|---|
| 1 | — | 1.069 |
| 100 | 0.503 | 0.688 |
| 250 | 0.590 | 0.559 |
| 500 | 0.482 | 0.625 |
## Training Data
~2,000 examples total:
| Source | Count | Purpose |
|---|---|---|
| Synthetic agent-creation pairs | 992 | Core task: natural language to agent definition |
| ToolACE (ICLR 2025) | 500 | Structured JSON tool-calling patterns |
| Alpaca-Cleaned | 500 | General instruction following (prevents catastrophic forgetting) |
The synthetic data was generated with Claude Sonnet 4.6 via the Anthropic Batch API using an instruction-repetition technique that improved output quality by 7.1% in A/B testing. The gain was concentrated in reasoning quality: the repeated-instruction variant was the only one that produced reasoning explaining *why* an architecture fits, not just *what* it is.
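To illustrate the instruction-repetition idea: the key instruction is stated both before and after the task. The exact prompt wording used for generation is not published, so the template below is a hypothetical reconstruction:

```python
# Hypothetical sketch of instruction repetition for synthetic data generation:
# the core instruction brackets the task, which in the A/B test produced
# noticeably better "why this architecture fits" reasoning.
INSTRUCTION = (
    "Respond with a JSON agent definition whose 'reasoning' field explains "
    "WHY the chosen architecture fits the request, not just what it is."
)

def build_generation_prompt(user_request):
    """Build a prompt that repeats the instruction before and after the task."""
    return (
        f"{INSTRUCTION}\n\n"
        f"User request: {user_request}\n\n"
        f"Reminder: {INSTRUCTION}"
    )
```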
## Usage

### With MLX (Apple Silicon)
```python
from mlx_lm import load, generate

model, tokenizer = load("chendren/smollm3-3b-lam")

prompt = """You are a Large Action Model that creates AI agents and skills from user requests.

When given a request, you:
1. Reason about what agent architecture best serves the need
2. Define the tools the agent requires
3. Define skills as composable, multi-step workflows
4. Set constraints to keep the agent safe and focused

Respond with a JSON object containing:
- reasoning: your thought process for the design
- agent: the complete agent definition with name, description, role, tools, skills, and constraints

User request: Create an agent that reviews PRs for security vulnerabilities"""

response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
print(response)
```
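The model is trained to emit pure JSON, but a small defensive parse is still useful downstream. A sketch (the fence-stripping is a precaution, not documented model behavior):

```python
import json

def parse_agent_spec(response_text):
    """Parse a model response into an agent-spec dict, tolerating code fences."""
    text = response_text.strip()
    # Strip a Markdown code fence if the model emitted one anyway.
    if text.startswith("```"):
        text = text.split("\n", 1)[1]      # drop the opening ```json line
        text = text.rsplit("```", 1)[0]    # drop the closing fence
    spec = json.loads(text)
    if "agent" not in spec:
        raise ValueError("response is valid JSON but missing the 'agent' field")
    return spec
```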
## Output Schema

The model generates JSON conforming to this structure:

```jsonc
{
  reasoning: string,        // WHY this architecture fits
  agent: {
    name: string,           // kebab-case agent name
    description: string,    // what the agent does
    role: string,           // primary role in one phrase
    tools: [{               // tools the agent needs
      name: string,         // snake_case tool name
      description: string,
      parameters: [{ name, type, description, required }],
      returns: string
    }],
    skills: [{              // composable multi-step workflows
      name: string,         // kebab-case skill name
      description: string,
      trigger: string,      // when the skill activates
      inputs: [{ name, type, description, required }],
      steps: [{ action, tool?, input?, on_failure? }],
      output: string
    }],
    constraints: string[]   // behavioral guardrails
  }
}
```
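A minimal conformance check against the required fields of this schema can be written as a plain dict walk. This is a sketch covering the top-level and skill-level required fields only, not a full JSON Schema validation:

```python
def validate_agent_spec(spec):
    """Return a list of missing required fields; an empty list means conformant."""
    missing = []
    if "reasoning" not in spec:
        missing.append("reasoning")
    agent = spec.get("agent", {})
    if not agent:
        missing.append("agent")
    for field in ("name", "description", "role", "tools", "skills", "constraints"):
        if field not in agent:
            missing.append(f"agent.{field}")
    for i, skill in enumerate(agent.get("skills", [])):
        for field in ("name", "trigger", "steps", "output"):
            if field not in skill:
                missing.append(f"agent.skills[{i}].{field}")
    return missing
```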
## Limitations
- Trained on 20 categories of agent types -- may produce lower quality output for highly specialized domains not represented in training
- Generates agent definitions, not executable code -- the output is a specification that needs a runtime to execute
- Best validation loss was at iteration 250; the final model (iteration 500) shows slight overfitting -- the iteration 250 checkpoint may perform marginally better
- Scoring is structural (checks for field presence), not semantic -- a high score does not guarantee the agent design is good, only that it is complete
## Citation

```bibtex
@misc{smollm3-3b-lam-2026,
  title={SmolLM3-3B-LAM: Fine-Tuning a 3B Model as a Large Action Model for AI Agent Creation},
  author={Chad Hendren},
  year={2026},
  url={https://huggingface.co/chendren/smollm3-3b-lam}
}
```
## Acknowledgments
- HuggingFace SmolLM3 -- base model
- Apple MLX -- training framework
- Salesforce xLAM -- Large Action Model research
- ToolACE -- tool-calling training data
- Anthropic Claude -- synthetic data generation