---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- workflow-planner
- slm
- lora
- mlx
- context-contract-planning
- tier-2-eligibility
language:
- en
pipeline_tag: text-generation
---

# SLM Workflow Planner v8: Context-Contract Planning (MLX LoRA)

## Overview

**v8** is a Stage 2 enhancement of the SLM Workflow Planner. It extends the v3-best checkpoint with **context-contract planning**: the ability to make routing decisions based on the `required_context` and `produces_context` of ALL nodes in a workflow graph, not just directly connected edges.

This enables three new capabilities:
- **Recovery Routing (Backjump):** On failure, jump backward to an earlier context-satisfiable node
- **Stage Skipping:** Skip unnecessary stages when the required context is already available (e.g., walk-in customers)
- **Non-Adjacent Parallelism:** Fork two independent context-satisfiable nodes that aren't connected by fork-edges
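
The eligibility scan behind all three moves can be sketched in a few lines. `Node` and `classify_candidates` are illustrative names, not part of this repo, and the real planner additionally checks actor independence before proposing a fork:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    node_id: str
    stage: int
    required_context: frozenset
    produces_context: frozenset

def classify_candidates(current_stage, produced, nodes):
    """A node is invocable iff required_context <= produced (set inclusion).
    Its stage relative to the current node decides which Tier-2 move applies."""
    moves = {"skip": [], "backjump": [], "fork": []}
    for n in nodes:
        if not n.required_context <= produced:
            continue  # context contract not satisfied yet
        if n.stage > current_stage:
            moves["skip"].append(n.node_id)      # forward stage skip
        elif n.stage < current_stage:
            moves["backjump"].append(n.node_id)  # recovery backjump target
        else:
            moves["fork"].append(n.node_id)      # non-adjacent fork candidate
    return moves

produced = frozenset({"ctx_start", "intake_data", "assessment_score"})
nodes = [
    Node("NODE_X", 5, frozenset({"intake_data"}), frozenset({"validation"})),
    Node("NODE_R", 1, frozenset({"ctx_start"}), frozenset({"intake_data"})),
    Node("NODE_Q", 7, frozenset({"approval"}), frozenset({"receipt"})),
]
print(classify_candidates(3, produced, nodes))
# {'skip': ['NODE_X'], 'backjump': ['NODE_R'], 'fork': []}
```

`NODE_Q` is filtered out because its required `approval` key has not been produced yet, which is exactly the contract the model is trained to reason about.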

## Model Details

| Property | Value |
|---|---|
| **Base Model** | Qwen/Qwen2.5-7B-Instruct |
| **Fine-tune Type** | LoRA (MLX format) |
| **LoRA Rank** | 16 |
| **LoRA Scale** | 2.0 |
| **LoRA Dropout** | 0.02 |
| **Tuned Layers** | 28/32 |
| **Trainable Parameters** | 40.37M (0.53%) |
| **Framework** | MLX (Apple Silicon) |

## Training

| Property | Value |
|---|---|
| **Lineage** | base(8000) → v2(100) → v3(200) → v3-cont → v3-best → **v8(1000)** |
| **Resume Checkpoint** | v3-best (59.2% on the 76-scenario suite) |
| **Training Iterations** | 1000 (stopped early; val loss converged) |
| **Learning Rate** | 2e-5 (cosine decay to 1e-6, 100-step warmup) |
| **Batch Size** | 4 (effective 8 with gradient accumulation) |
| **Max Sequence Length** | 768 tokens |
| **Dataset** | 696K samples from 150 workflows |
| **Val Loss** | 0.032 (down from 0.272 at the start) |

### Training Data Distribution

| Category | Count | % | Description |
|---|---|---|---|
| META | 187K | 26.9% | Dead-end escalation |
| NEGATIVE | 187K | 26.9% | Tier-2 visible but edge chosen ("satisfiable ≠ sensible") |
| NEXT_EDGE | 116K | 16.7% | Normal edge progression |
| NEXT_SKIP 🟡 | 55K | 8.0% | Forward dead-end recovery (Tier-2) |
| RETRY | 36K | 5.2% | Edge retry on failure |
| JOIN | 30K | 4.3% | Parallel branch merge |
| NEXT_BACKJUMP 🟡 | 28K | 4.0% | Failure recovery to earlier node (Tier-2) |
| FORK_EDGE | 28K | 4.0% | Edge-adjacent fork |
| FORK_NONADJ 🟡 | 28K | 4.0% | Non-adjacent parallel fork (Tier-2) |

🟡 = Protected from downsampling during balancing
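
Protected balancing can be sketched as a capped random subsample that never touches the Tier-2 categories. The cap value and function name below are illustrative assumptions, not the actual pipeline:

```python
import random

# The three Tier-2 categories marked 🟡 in the table above.
PROTECTED = {"NEXT_SKIP", "NEXT_BACKJUMP", "FORK_NONADJ"}

def balance(samples_by_category, cap, seed=0):
    """Cap over-represented categories; keep protected ones intact."""
    rng = random.Random(seed)
    balanced = {}
    for category, samples in samples_by_category.items():
        if category in PROTECTED or len(samples) <= cap:
            balanced[category] = list(samples)             # keep everything
        else:
            balanced[category] = rng.sample(samples, cap)  # random subset
    return balanced

raw = {"META": list(range(187)), "NEXT_SKIP": list(range(55))}
sizes = {k: len(v) for k, v in balance(raw, cap=100).items()}
print(sizes)  # {'META': 100, 'NEXT_SKIP': 55}
```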

## Prompt Format

The model uses a **tiered prompt** with two candidate sections:

```
Current node: NODE_A (SYSTEM, stage 3)
Outcome: success
Failure type: none

State:
goal_progress=0.40
retry_count=0
...

Produced context: {ctx_start, intake_data, assessment_score}

Edge candidates (normal path):
1. NODE_B (AGENT) [processor] → requires: {assessment_score} → produces: {approval}

Context-eligible (off-path, invocable now):
1. NODE_X (SYSTEM, stage 5, gap=+2) [validator] → requires: {intake_data} → produces: {validation}

Forkable sets: []
Join-ready: []

What is the best action?
```

**Output format:** `DECISION_TYPE NODE_ID`
- `NEXT NODE_B` → advance to NODE_B
- `FORK NODE_A, NODE_B` → parallel fork
- `RETRY NODE_A` → retry current
- `JOIN NODE_A` → merge parallel branches
- `META` → escalate to human
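
A strict parser for this output format might look like the following sketch. `parse_decision` and `DECISION_TYPES` are assumptions for illustration, not part of the released tooling:

```python
import re

DECISION_TYPES = {"NEXT", "FORK", "RETRY", "JOIN", "META"}

def parse_decision(text):
    """Split 'DECISION_TYPE NODE_ID[, NODE_ID ...]' into (type, node_ids)."""
    match = re.match(r"^\s*([A-Z_]+)\s*(.*?)\s*$", text)
    if not match or match.group(1) not in DECISION_TYPES:
        raise ValueError(f"unrecognized decision: {text!r}")
    dtype = match.group(1)
    nodes = [n for n in re.split(r"[,\s]+", match.group(2)) if n]
    if dtype == "META" and nodes:
        raise ValueError("META escalation takes no node argument")
    if dtype != "META" and not nodes:
        raise ValueError(f"{dtype} requires at least one node id")
    return dtype, nodes

print(parse_decision("NEXT NODE_B"))          # ('NEXT', ['NODE_B'])
print(parse_decision("FORK NODE_A, NODE_B"))  # ('FORK', ['NODE_A', 'NODE_B'])
```

Raising on malformed output, rather than guessing, lets a downstream guardrail or arbiter take over.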

## Evaluation Results

### Section A: Stratified Test (100 held-out samples)

| Category | Exact Accuracy | Type Accuracy |
|---|---|---|
| META | 20/20 (100%) | 20/20 (100%) |
| NEGATIVE (Tier-2 visible, edge chosen) | 5/5 (100%) | 5/5 (100%) |
| SKIP_FORWARD | 7/7 (100%) | 7/7 (100%) |
| RETRY | 18/20 (90%) | 18/20 (90%) |
| JOIN | 16/20 (80%) | 16/20 (80%) |
| FORK (non-adjacent) | 12/18 (67%) | 14/18 (78%) |
| NEXT (edge) | 5/8 (63%) | 8/8 (100%) |
| **TOTAL** | **83/100 (83%)** | **88/100 (88%)** |

### Section B: Tier-2 Specific (90 held-out samples)

| Category | Exact Accuracy | Type Accuracy |
|---|---|---|
| Non-Adjacent Fork | 15/15 (100%) | 15/15 (100%) |
| META with Context | 15/15 (100%) | 15/15 (100%) |
| Negative Contrast | 14/15 (93%) | 14/15 (93%) |
| RETRY with Context | 14/15 (93%) | 14/15 (93%) |
| Skip Forward | 13/15 (87%) | 14/15 (93%) |
| JOIN with Context | 10/15 (67%) | 10/15 (67%) |
| **TOTAL** | **81/90 (90%)** | **82/90 (91%)** |

## Key Capabilities

1. **Context-Contract Reasoning:** Evaluates `required_context ⊆ produced_keys` to identify all invocable nodes
2. **Recovery Routing:** Backjumps on process/resource failure when no edge retry exists
3. **Stage Skipping:** Advances to forward context-eligible nodes at dead-ends
4. **Non-Adjacent Parallelism:** Forks independent context-eligible nodes with different actors
5. **Negative Contrast:** Learned "satisfiable ≠ sensible"; it doesn't take Tier-2 when the edge path is correct
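
Capability 5 reduces to a small precedence rule. This is a deliberately minimal sketch with hypothetical names, ignoring the nuances the model actually learned:

```python
def choose(edge_candidates, tier2_candidates, outcome):
    """'Satisfiable ≠ sensible': take the normal edge when it can proceed,
    fall back to Tier-2 only at dead ends or failures, else escalate."""
    if outcome == "success" and edge_candidates:
        return "NEXT", edge_candidates[0]   # edge path is correct: no Tier-2
    if tier2_candidates:
        return "NEXT", tier2_candidates[0]  # recover via context-eligible node
    return "META", None                     # no viable node: human escalation

print(choose(["NODE_B"], ["NODE_X"], "success"))  # ('NEXT', 'NODE_B')
print(choose([], ["NODE_X"], "failure"))          # ('NEXT', 'NODE_X')
```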

## Usage (MLX)

```python
from mlx_lm import load, generate

model, tokenizer = load(
    "Qwen/Qwen2.5-7B-Instruct",
    adapter_path="sameer-saraf-quant-ai/slm-workflow-planner-v8-mlx",
)

messages = [
    {"role": "system", "content": "You are a workflow planner..."},
    {"role": "user", "content": "<tiered prompt>"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=30)
print(response)  # "NEXT ESTIMATION_AND_APPROVAL"
```

## Ensemble Recommendation

For production use, combine with a GPT-4.1 arbiter for the ~10% edge cases (mainly JOIN confusion):
- v8 handles 90%+ of decisions autonomously
- GPT validates uncertain decisions (estimated 5-10% of traffic)
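
The split can be implemented as a confidence-gated router. The confidence signal (e.g. mean token log-probability) and the 0.90 threshold are assumptions for illustration, not values from this project:

```python
def route(decision, confidence, arbiter, threshold=0.90):
    """Accept the SLM decision when confident; defer edge cases to the arbiter."""
    if confidence >= threshold:
        return decision, "slm"
    return arbiter(decision), "arbiter"   # e.g. a GPT-4.1 validation call

# Stub arbiter that simply confirms the decision:
verdict = route("JOIN NODE_J", confidence=0.62, arbiter=lambda d: d)
print(verdict)  # ('JOIN NODE_J', 'arbiter')
```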
## Architecture Context

This adapter is part of the **Agentic OS** system:
- **Temporal** handles durable execution and state management
- **Neo4j** stores workflow graph definitions
- **SLM (this model)** makes real-time routing decisions
- **Guardrails** validate SLM output before execution
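
At a minimum, the guardrail step can check the structural validity of a decision against the node set loaded from the graph store. This is a hedged sketch with illustrative names, not the actual guardrail implementation:

```python
def guard(decision_type, target_nodes, graph_node_ids):
    """Reject malformed output or references to nodes absent from the graph."""
    if decision_type == "META":
        return not target_nodes            # escalation carries no target node
    if decision_type not in {"NEXT", "FORK", "RETRY", "JOIN"}:
        return False                       # unknown decision type
    if not target_nodes:
        return False                       # these decisions need a target
    return all(n in graph_node_ids for n in target_nodes)

print(guard("NEXT", ["NODE_B"], {"NODE_A", "NODE_B"}))  # True
print(guard("NEXT", ["NODE_Z"], {"NODE_A", "NODE_B"}))  # False
```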