---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- workflow-planner
- slm
- lora
- mlx
- context-contract-planning
- tier-2-eligibility
language:
- en
pipeline_tag: text-generation
---

# SLM Workflow Planner v8: Context-Contract Planning (MLX LoRA)

## Overview

**v8** is a Stage 2 enhancement of the SLM Workflow Planner. It extends the v3-best checkpoint with **context-contract planning**: the ability to make routing decisions based on the `required_context` and `produces_context` of ALL nodes in a workflow graph, not just directly connected edges.

This enables three new capabilities:
- **Recovery Routing (Backjump):** On failure, jump backward to an earlier context-satisfiable node
- **Stage Skipping:** Skip unnecessary stages when the required context is already available (e.g., walk-in customers)
- **Non-Adjacent Parallelism:** Fork two independent context-satisfiable nodes that aren't connected by fork-edges
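
The eligibility scan behind all three moves can be sketched in a few lines. `Node` and `classify_candidates` are illustrative names, not part of this repo, and the real planner additionally checks actor independence before proposing a fork:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    node_id: str
    stage: int
    required_context: frozenset
    produces_context: frozenset

def classify_candidates(current_stage, produced, nodes):
    """A node is invocable iff required_context <= produced (set inclusion).
    Its stage relative to the current node decides which Tier-2 move applies."""
    moves = {"skip": [], "backjump": [], "fork": []}
    for n in nodes:
        if not n.required_context <= produced:
            continue  # context contract not satisfied yet
        if n.stage > current_stage:
            moves["skip"].append(n.node_id)      # forward stage skip
        elif n.stage < current_stage:
            moves["backjump"].append(n.node_id)  # recovery backjump target
        else:
            moves["fork"].append(n.node_id)      # non-adjacent fork candidate
    return moves

produced = frozenset({"ctx_start", "intake_data", "assessment_score"})
nodes = [
    Node("NODE_X", 5, frozenset({"intake_data"}), frozenset({"validation"})),
    Node("NODE_R", 1, frozenset({"ctx_start"}), frozenset({"intake_data"})),
    Node("NODE_Q", 7, frozenset({"approval"}), frozenset({"receipt"})),
]
print(classify_candidates(3, produced, nodes))
# {'skip': ['NODE_X'], 'backjump': ['NODE_R'], 'fork': []}
```

`NODE_Q` is filtered out because its required `approval` key has not been produced yet, which is exactly the contract the model is trained to reason about.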

## Model Details

| Property | Value |
|---|---|
| **Base Model** | Qwen/Qwen2.5-7B-Instruct |
| **Fine-tune Type** | LoRA (MLX format) |
| **LoRA Rank** | 16 |
| **LoRA Scale** | 2.0 |
| **LoRA Dropout** | 0.02 |
| **Tuned Layers** | 28/32 |
| **Trainable Parameters** | 40.37M (0.53%) |
| **Framework** | MLX (Apple Silicon) |

## Training

| Property | Value |
|---|---|
| **Lineage** | base(8000) → v2(100) → v3(200) → v3-cont → v3-best → **v8(1000)** |
| **Resume Checkpoint** | v3-best (59.2% on the 76-scenario suite) |
| **Training Iterations** | 1000 (stopped early; val loss converged) |
| **Learning Rate** | 2e-5 (cosine decay to 1e-6, 100-step warmup) |
| **Batch Size** | 4 (effective 8 with gradient accumulation) |
| **Max Sequence Length** | 768 tokens |
| **Dataset** | 696K samples from 150 workflows |
| **Val Loss** | 0.032 (down from 0.272 at the start) |

### Training Data Distribution

| Category | Count | % | Description |
|---|---|---|---|
| META | 187K | 26.9% | Dead-end escalation |
| NEGATIVE | 187K | 26.9% | Tier-2 visible but edge chosen ("satisfiable ≠ sensible") |
| NEXT_EDGE | 116K | 16.7% | Normal edge progression |
| NEXT_SKIP 🟡 | 55K | 8.0% | Forward dead-end recovery (Tier-2) |
| RETRY | 36K | 5.2% | Edge retry on failure |
| JOIN | 30K | 4.3% | Parallel branch merge |
| NEXT_BACKJUMP 🟡 | 28K | 4.0% | Failure recovery to earlier node (Tier-2) |
| FORK_EDGE | 28K | 4.0% | Edge-adjacent fork |
| FORK_NONADJ 🟡 | 28K | 4.0% | Non-adjacent parallel fork (Tier-2) |

🟡 = Protected from downsampling during balancing
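
Protected balancing can be sketched as a capped random subsample that never touches the Tier-2 categories. The cap value and function name below are illustrative assumptions, not the actual pipeline:

```python
import random

# The three Tier-2 categories marked 🟡 in the table above.
PROTECTED = {"NEXT_SKIP", "NEXT_BACKJUMP", "FORK_NONADJ"}

def balance(samples_by_category, cap, seed=0):
    """Cap over-represented categories; keep protected ones intact."""
    rng = random.Random(seed)
    balanced = {}
    for category, samples in samples_by_category.items():
        if category in PROTECTED or len(samples) <= cap:
            balanced[category] = list(samples)             # keep everything
        else:
            balanced[category] = rng.sample(samples, cap)  # random subset
    return balanced

raw = {"META": list(range(187)), "NEXT_SKIP": list(range(55))}
sizes = {k: len(v) for k, v in balance(raw, cap=100).items()}
print(sizes)  # {'META': 100, 'NEXT_SKIP': 55}
```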

## Prompt Format

The model uses a **tiered prompt** with two candidate sections:

```
Current node: NODE_A (SYSTEM, stage 3)
Outcome: success
Failure type: none

State:
goal_progress=0.40
retry_count=0
...

Produced context: {ctx_start, intake_data, assessment_score}

Edge candidates (normal path):
1. NODE_B (AGENT) [processor] → requires: {assessment_score} → produces: {approval}

Context-eligible (off-path, invocable now):
1. NODE_X (SYSTEM, stage 5, gap=+2) [validator] → requires: {intake_data} → produces: {validation}

Forkable sets: []
Join-ready: []

What is the best action?
```

**Output format:** `DECISION_TYPE NODE_ID`
- `NEXT NODE_B` → advance to NODE_B
- `FORK NODE_A, NODE_B` → parallel fork
- `RETRY NODE_A` → retry current
- `JOIN NODE_A` → merge parallel branches
- `META` → escalate to human
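
A strict parser for this output format might look like the following sketch. `parse_decision` and `DECISION_TYPES` are assumptions for illustration, not part of the released tooling:

```python
import re

DECISION_TYPES = {"NEXT", "FORK", "RETRY", "JOIN", "META"}

def parse_decision(text):
    """Split 'DECISION_TYPE NODE_ID[, NODE_ID ...]' into (type, node_ids)."""
    match = re.match(r"^\s*([A-Z_]+)\s*(.*?)\s*$", text)
    if not match or match.group(1) not in DECISION_TYPES:
        raise ValueError(f"unrecognized decision: {text!r}")
    dtype = match.group(1)
    nodes = [n for n in re.split(r"[,\s]+", match.group(2)) if n]
    if dtype == "META" and nodes:
        raise ValueError("META escalation takes no node argument")
    if dtype != "META" and not nodes:
        raise ValueError(f"{dtype} requires at least one node id")
    return dtype, nodes

print(parse_decision("NEXT NODE_B"))          # ('NEXT', ['NODE_B'])
print(parse_decision("FORK NODE_A, NODE_B"))  # ('FORK', ['NODE_A', 'NODE_B'])
```

Raising on malformed output, rather than guessing, lets a downstream guardrail or arbiter take over.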

## Evaluation Results

### Section A: Stratified Test (100 held-out samples)

| Category | Exact Accuracy | Type Accuracy |
|---|---|---|
| META | 20/20 (100%) | 20/20 (100%) |
| NEGATIVE (Tier-2 visible, edge chosen) | 5/5 (100%) | 5/5 (100%) |
| SKIP_FORWARD | 7/7 (100%) | 7/7 (100%) |
| RETRY | 18/20 (90%) | 18/20 (90%) |
| JOIN | 16/20 (80%) | 16/20 (80%) |
| FORK (non-adjacent) | 12/18 (67%) | 14/18 (78%) |
| NEXT (edge) | 5/8 (63%) | 8/8 (100%) |
| **TOTAL** | **83/100 (83%)** | **88/100 (88%)** |

### Section B: Tier-2 Specific (90 held-out samples)

| Category | Exact Accuracy | Type Accuracy |
|---|---|---|
| Non-Adjacent Fork | 15/15 (100%) | 15/15 (100%) |
| META with Context | 15/15 (100%) | 15/15 (100%) |
| Negative Contrast | 14/15 (93%) | 14/15 (93%) |
| RETRY with Context | 14/15 (93%) | 14/15 (93%) |
| Skip Forward | 13/15 (87%) | 14/15 (93%) |
| JOIN with Context | 10/15 (67%) | 10/15 (67%) |
| **TOTAL** | **81/90 (90%)** | **82/90 (91%)** |

## Key Capabilities

1. **Context-Contract Reasoning:** Evaluates `required_context ⊆ produced_keys` to identify all invocable nodes
2. **Recovery Routing:** Backjumps on process/resource failure when no edge retry exists
3. **Stage Skipping:** Advances to forward context-eligible nodes at dead-ends
4. **Non-Adjacent Parallelism:** Forks independent context-eligible nodes with different actors
5. **Negative Contrast:** Learned "satisfiable ≠ sensible"; it doesn't take Tier-2 when the edge path is correct
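
Capability 5 reduces to a small precedence rule. This is a deliberately minimal sketch with hypothetical names, ignoring the nuances the model actually learned:

```python
def choose(edge_candidates, tier2_candidates, outcome):
    """'Satisfiable ≠ sensible': take the normal edge when it can proceed,
    fall back to Tier-2 only at dead ends or failures, else escalate."""
    if outcome == "success" and edge_candidates:
        return "NEXT", edge_candidates[0]   # edge path is correct: no Tier-2
    if tier2_candidates:
        return "NEXT", tier2_candidates[0]  # recover via context-eligible node
    return "META", None                     # no viable node: human escalation

print(choose(["NODE_B"], ["NODE_X"], "success"))  # ('NEXT', 'NODE_B')
print(choose([], ["NODE_X"], "failure"))          # ('NEXT', 'NODE_X')
```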

## Usage (MLX)

```python
from mlx_lm import load, generate

model, tokenizer = load(
    "Qwen/Qwen2.5-7B-Instruct",
    adapter_path="sameer-saraf-quant-ai/slm-workflow-planner-v8-mlx",
)

messages = [
    {"role": "system", "content": "You are a workflow planner..."},
    {"role": "user", "content": "<tiered prompt>"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=30)
print(response)  # "NEXT ESTIMATION_AND_APPROVAL"
```

## Ensemble Recommendation

For production use, combine with a GPT-4.1 arbiter for the ~10% edge cases (mainly JOIN confusion):
- v8 handles 90%+ of decisions autonomously
- GPT validates uncertain decisions (estimated 5-10% of traffic)
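
The split can be implemented as a confidence-gated router. The confidence signal (e.g. mean token log-probability) and the 0.90 threshold are assumptions for illustration, not values from this project:

```python
def route(decision, confidence, arbiter, threshold=0.90):
    """Accept the SLM decision when confident; defer edge cases to the arbiter."""
    if confidence >= threshold:
        return decision, "slm"
    return arbiter(decision), "arbiter"   # e.g. a GPT-4.1 validation call

# Stub arbiter that simply confirms the decision:
verdict = route("JOIN NODE_J", confidence=0.62, arbiter=lambda d: d)
print(verdict)  # ('JOIN NODE_J', 'arbiter')
```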
## Architecture Context

This adapter is part of the **Agentic OS** system:
- **Temporal** handles durable execution and state management
- **Neo4j** stores workflow graph definitions
- **SLM (this model)** makes real-time routing decisions
- **Guardrails** validate SLM output before execution
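
At a minimum, the guardrail step can check the structural validity of a decision against the node set loaded from the graph store. This is a hedged sketch with illustrative names, not the actual guardrail implementation:

```python
def guard(decision_type, target_nodes, graph_node_ids):
    """Reject malformed output or references to nodes absent from the graph."""
    if decision_type == "META":
        return not target_nodes            # escalation carries no target node
    if decision_type not in {"NEXT", "FORK", "RETRY", "JOIN"}:
        return False                       # unknown decision type
    if not target_nodes:
        return False                       # these decisions need a target
    return all(n in graph_node_ids for n in target_nodes)

print(guard("NEXT", ["NODE_B"], {"NODE_A", "NODE_B"}))  # True
print(guard("NEXT", ["NODE_Z"], {"NODE_A", "NODE_B"}))  # False
```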