SLM Workflow Planner 7B v2: Contrastive Alignment LoRA Adapter

Model Description

LoRA adapter for Qwen/Qwen2.5-7B-Instruct, fine-tuned as a workflow execution planner. This is the v2 alignment model, trained in two stages:

  1. Stage A: Base policy training on 554K samples from 89 diverse workflow graphs (iter 800)
  2. Stage B: Contrastive alignment on 20K curated samples with clean decision boundaries (iter 100)

The model makes real-time decisions about workflow transitions by analyzing state signals, eligible nodes, and topology information.

Decision Types

| Decision | Description |
|----------|-------------|
| NEXT | Proceed to the next sequential step |
| RETRY | Retry the current step (within budget) |
| FORK | Launch parallel execution branches |
| JOIN | Synchronize parallel branches |
| META | Escalate: anomaly detected, human intervention needed |
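In client code, the planner's free-text completion has to be mapped back onto one of these five labels. A minimal sketch (the `Decision` enum and `parse_decision` helper are illustrative, not part of the shipped adapter):

```python
from enum import Enum

class Decision(Enum):
    """The five decision types the planner can emit."""
    NEXT = "NEXT"
    RETRY = "RETRY"
    FORK = "FORK"
    JOIN = "JOIN"
    META = "META"

def parse_decision(raw: str) -> Decision:
    """Normalize a raw completion to one of the five labels.

    The model is prompted to answer with exactly one label, but a
    completion may carry whitespace, punctuation, or extra tokens,
    so we scan for the first recognizable label.
    """
    for token in raw.strip().upper().split():
        token = token.strip(".,:;!")
        if token in Decision.__members__:
            return Decision[token]
    raise ValueError(f"no decision label found in: {raw!r}")
```

A strict parser like this also gives the orchestrator a natural failure mode: an unparseable completion can itself be escalated rather than silently defaulted to NEXT.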

Performance (76-scenario evaluation suite)

| Category | v2 SLM | GPT-4.1 | GPT-4o-mini | Base SLM |
|----------|--------|---------|-------------|----------|
| NEXT | 10/22 (45%) | 6/22 (27%) | 2/22 (9%) | 16/22 (73%) |
| RETRY | 7/12 (58%) | 11/12 (92%) | 12/12 (100%) | 3/12 (25%) |
| FORK | 13/14 (93%) | 14/14 (100%) | 14/14 (100%) | 1/14 (7%) |
| JOIN | 10/15 (67%) | 10/15 (67%) | 12/15 (80%) | 0/15 (0%) |
| META | 2/13 (15%) | 0/13 (0%) | 0/13 (0%) | 8/13 (62%) |
| TOTAL | 42/76 (55.3%) | 41/76 (53.9%) | 40/76 (52.6%) | 28/76 (36.8%) |

Key Results

  • πŸ† Outperforms GPT-4.1 (55.3% vs 53.9%) on structured workflow planning
  • πŸ† Only model that handles META β€” GPT-4.1 and GPT-4o-mini score 0% on anomaly detection
  • πŸ”₯ FORK: 93% β€” near-perfect parallel execution decisions
  • πŸ”₯ JOIN: 67% β€” first model to reliably synchronize parallel branches
  • ⚑ 4x faster inference than base model, runs locally on Apple Silicon

Training Details

Two-Stage Training

Stage A: Base Policy (iter 800)

  • Dataset: 554K instruction pairs from 89 workflow graphs
  • 8 structural families (linear, retry, fork-join, escalation, etc.)
  • Balanced decision distribution: NEXT 36%, JOIN 27%, META 13%, FORK 12%, RETRY 12%

Stage B: Contrastive Alignment (iter 100)

  • Dataset: 20K curated samples with clean decision boundaries
  • Contrastive pairs: FORK positives plus hard FORK negatives (forkable topology where the state blocks parallelism, so the correct label is NEXT)
  • JOIN positives plus hard JOIN negatives (join_ready topology with no parallel branches active, so the correct label is NEXT)
  • Clean RETRY and META samples
  • Proportional representation across all decision types
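The hard-negative construction above can be sketched as a small data-generation helper. This is an illustration of the pairing scheme, not the actual pipeline; the field names, the 0.1/0.9 pressure values, and `make_fork_pair` itself are assumptions:

```python
def make_fork_pair(scenario: dict) -> tuple[dict, dict]:
    """Build one contrastive pair from a forkable scenario (sketch).

    positive:       same topology, low resource pressure  -> label FORK
    hard negative:  same topology, high resource pressure -> label NEXT
                    (forkable but blocked by state)
    Pressure values are illustrative, not the training pipeline's.
    """
    positive = dict(scenario, resource_pressure=0.1, label="FORK")
    negative = dict(scenario, resource_pressure=0.9, label="NEXT")
    return positive, negative

pos, neg = make_fork_pair({"forkable_sets": [["VERIFY_POLICY", "FRAUD_SCREENING"]],
                           "parallel_active": 0})
```

Because both samples share identical topology and differ only in state, the loss gradient concentrates on the state signal rather than on surface features of the graph.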

LoRA Configuration

| Parameter | Value |
|-----------|-------|
| Rank | 16 |
| Alpha (scale) | 32 (2.0x) |
| Dropout | 0.02 |
| Target layers | Last 28 of 32 |
| Target modules | q_proj, k_proj, v_proj, o_proj |
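As a rough picture of how the table maps onto an MLX adapter configuration (field names here are illustrative; the shipped adapter_config.json is the source of truth for the exact schema):

```python
# Approximate LoRA settings implied by the table above.
# Keys are illustrative, not the exact adapter_config.json schema.
lora_config = {
    "rank": 16,
    "alpha": 32,        # effective scale = alpha / rank = 2.0
    "dropout": 0.02,
    "num_layers": 28,   # adapt only the last 28 of 32 decoder layers
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

# Sanity check: alpha/rank gives the 2.0x scale listed in the table.
assert lora_config["alpha"] / lora_config["rank"] == 2.0
```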

Training Configuration

| Parameter | Value |
|-----------|-------|
| Framework | MLX (Apple Silicon native) |
| Hardware | Apple M4 Pro, 48GB unified memory |
| Stage A iters | 800 |
| Stage B iters | 100 |
| Batch size | 4 |
| Learning rate | 3e-5 (alignment stage) |
| Sequence length | 512 |
| Prompt masking | Yes (loss only on assistant tokens) |

Training Curve (Alignment Stage)

| Iteration | Val Loss | Train Loss |
|-----------|----------|------------|
| 1 (start) | 14.536 | — |
| 50 | 0.134 | 0.273 |
| 100 (final) | 0.099 | 0.135 |

Usage

With MLX (Apple Silicon)

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load the base model and apply this LoRA adapter on top.
model, tokenizer = load(
    "Qwen/Qwen2.5-7B-Instruct",
    adapter_path="ssaraf1/slm-workflow-planner-7b-v2"
)

messages = [
    {"role": "system", "content": "You are a workflow planner. Given the current workflow state, eligible nodes, and topology information, classify the decision type. Respond with exactly one of: NEXT, RETRY, FORK, JOIN, META"},
    {"role": "user", "content": "Current node: TRIAGE_AND_ASSIGN (AGENT)\nOutcome: assigned\n\nState:\n  goal_progress=0.15\n  parallel_active=0\n  resource_pressure=0.1\n\nEligible nodes:\n  1. VERIFY_POLICY (SYSTEM) → produces: policy_status\n  2. FRAUD_SCREENING (SYSTEM) → produces: fraud_score\n  3. DAMAGE_ASSESSMENT (AGENT) → produces: damage_report\n\nForkable sets: [{VERIFY_POLICY, FRAUD_SCREENING, DAMAGE_ASSESSMENT}]\nJoin-ready: []\n\nWhat decision type?"}
]

# Greedy decoding (temperature 0) keeps the single-label output deterministic.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
sampler = make_sampler(temp=0.0)
response = generate(model, tokenizer, prompt=prompt, max_tokens=10, sampler=sampler)
print(response)  # Expected: FORK
```
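In an orchestrator, the user message above would typically be rendered from a structured snapshot rather than written by hand. A hedged sketch of such a renderer (the function and its parameters are illustrative; only the output format mirrors the example above):

```python
def build_state_prompt(node, outcome, state, eligible, forkable_sets, join_ready):
    """Render a workflow snapshot into the planner's prompt format.

    Mirrors the hand-written example above; exact field names and
    list formatting are illustrative assumptions.
    """
    lines = [f"Current node: {node}", f"Outcome: {outcome}", "", "State:"]
    lines += [f"  {key}={value}" for key, value in state.items()]
    lines += ["", "Eligible nodes:"]
    lines += [
        f"  {i}. {name} ({kind}) → produces: {output}"
        for i, (name, kind, output) in enumerate(eligible, 1)
    ]
    lines += ["", f"Forkable sets: {forkable_sets}",
              f"Join-ready: {join_ready}", "", "What decision type?"]
    return "\n".join(lines)

prompt_body = build_state_prompt(
    "TRIAGE_AND_ASSIGN (AGENT)", "assigned",
    {"goal_progress": 0.15, "parallel_active": 0, "resource_pressure": 0.1},
    [("VERIFY_POLICY", "SYSTEM", "policy_status"),
     ("FRAUD_SCREENING", "SYSTEM", "fraud_score")],
    [["VERIFY_POLICY", "FRAUD_SCREENING"]], [],
)
```

Keeping the rendering in one place makes it easy to guarantee that inference-time prompts match the format the adapter was trained on.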

What Makes This Model Special

Contrastive Alignment

Unlike naive fine-tuning, this model was trained with contrastive pairs that teach policy boundaries, not just pattern matching:

| Scenario | Topology says | State says | Model learns |
|----------|---------------|------------|--------------|
| Forkable + low pressure | FORK | Go parallel | FORK ✅ |
| Forkable + high pressure | FORK | Don't parallelize | NEXT ✅ |
| Join-ready + parallel active | JOIN | Merge now | JOIN ✅ |
| Join-ready + no parallel | JOIN | Not ready | NEXT ✅ |

Policy Learning, Not Path Memorization

The model learns decision = f(state signals, topology, actors), not domain-specific workflow paths. This enables generalization to unseen workflow structures.
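The target policy can be summarized as a small decision function over state signals. This is a reference sketch of the behavior the alignment stage trains toward; the thresholds (e.g. the 0.7 pressure cutoff) and field names are assumptions for illustration, not values extracted from the adapter:

```python
def reference_policy(state: dict) -> str:
    """Illustrative decision rule matching the scenarios above.

    Thresholds and key names are assumptions, not trained values.
    """
    if state.get("anomaly"):
        return "META"   # escalate anomalies to a human
    if state.get("last_outcome") == "failed" and state.get("retries_left", 0) > 0:
        return "RETRY"  # retry only while budget remains
    if state.get("join_ready") and state.get("parallel_active", 0) > 0:
        return "JOIN"   # merge only if branches are actually running
    if state.get("forkable_sets") and state.get("resource_pressure", 0.0) < 0.7:
        return "FORK"   # go parallel only under acceptable pressure
    return "NEXT"
```

Note the order of the checks: join-readiness without active parallel branches, or forkability under high pressure, both fall through to NEXT, which is exactly the boundary the hard negatives in Stage B enforce.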

Files

  • adapters.safetensors — LoRA adapter weights (base iter 800 + alignment iter 100)
  • adapter_config.json — LoRA configuration for MLX

Citation

Part of the Agentic Factory project: building autonomous workflow orchestration with SLM-powered planning on Apple Silicon.
