01 / 12

ForgeAI

An Autonomous Multi-Agent Framework that Transforms
Natural-Language Specifications into Fully Tested Software

Phase 1 — Proposed Design Document  |  Itanta Hackathon 2026

7 Specialized Agents  |  16 FSM States  |  5 Complexity Tiers  |  0 Manual Code Lines
🧠 Systems Design ✅ TDD-First 🤖 Gemini 2.5 Flash 🔄 Auto-Recovery
02 / 12 PROBLEM STATEMENT

The Gap: From Reactive Assistants
to Proactive Engineers

❌ Current AI Coding Tools

  • Developers manually decompose projects
  • Hand-feed each task to the AI one by one
  • Evaluate, debug, and integrate each output
  • Handle failures and retries manually
  • No persistent state across tasks
  • Does not scale to complex, multi-system projects

✅ ForgeAI's Approach

  • Automated decomposition into atomic tasks
  • Specialized agents work autonomously
  • TDD-first verification after every task
  • Intelligent failure recovery with cascading strategies
  • Full state tracking with Pydantic models
  • Scales across all 5 complexity tiers
Pipeline: 💬 NL Spec (User Input) → 🎯 Intake (Clarify & Refine) → 🏗️ Architect (Design) → 📋 Plan (Decompose) → QA (Tests First) → 💻 Coder (Generate) → 🔒 Security (Audit) → 📊 Report (Summary)
03 / 12 SECTION 1 — AGENT ARCHITECTURE

Layered System Architecture

UI LAYER
  • 🖥️ Rich CLI Interface
  • 🌐 FastAPI Web Dashboard

ORCHESTRATION LAYER
  • 🧠 Orchestrator: 16-State Finite State Machine
  • ⚙️ Config Manager: YAML Guardrails (NFR-03)
  • 📝 Activity Logger: Append-Only (NFR-02)
  • 📦 Workflow State: Pydantic-Typed

AGENT LAYER: 7 SPECIALIZED AGENTS
  • 🎯 Intake
  • 🏗️ Architect
  • 📋 Planner
  • QA (TDD)
  • 💻 Coder
  • 🔒 Security
  • 🔄 Recovery

TOOL LAYER
  • 🤖 LLM Gateway (Gemini)
  • 📂 File Manager
  • 🧪 Test Runner (pytest)
  • 🐳 Docker Builder

Agents only communicate through AgentContext → AgentResult contracts.
All filesystem I/O goes through the File Manager (sandboxed).
04 / 12 SECTION 1 — AGENT ARCHITECTURE

7 Agents, Clear Responsibilities

🎯
Intake Agent
  • Parse natural-language spec
  • Detect ambiguities & gaps
  • Generate clarifying questions
  • Produce StructuredSpecification
FR-01, FR-02
🏗️
Architect Agent
  • Design project structure
  • Define directory layout
  • Data models & API contracts
  • Technology decisions
FR-04
📋
Planner Agent
  • Decompose into atomic tasks
  • Build dependency graph
  • Assign risk levels
  • Set checkpoint flags
FR-05, FR-06
QA Agent (TDD)
  • Write failing tests FIRST
  • pytest format with assertions
  • Edge case coverage
  • One test file per task
FR-11
💻
Coder Agent
  • Generate production code
  • Pass all failing tests
  • Follow architecture design
  • Present diffs for review
FR-08, FR-09
🔒
Security Agent
  • Scan for injection vulns
  • Auth & RBAC flaws
  • Hardcoded secrets check
  • Generate audit report
FR-14 (Extended)
🔄
Recovery Agent
  • Diagnose root cause
  • Classify error type
  • Provide fix instructions
  • Decide: retry/skip/escalate
FR-15, FR-17
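
The Planner's dependency graph lends itself to a topological ordering. The sketch below is one possible way to derive an execution order with the standard-library `graphlib` module; the task IDs and the `execution_order` helper are hypothetical and not taken from the ForgeAI design.

```python
from graphlib import TopologicalSorter

# Hypothetical task IDs, each mapped to the set of task IDs it depends on.
dependencies = {
    "models": set(),
    "auth": {"models"},
    "api": {"models", "auth"},
    "docs": {"api"},
}


def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """Return task IDs so that every task comes after its dependencies."""
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

A cyclic plan raises `graphlib.CycleError`, which an orchestrator could treat as a planning failure rather than starting execution.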

🔑 Uniform Contract: Every agent extends BaseAgent and implements exactly 3 methods: build_system_prompt(), build_user_prompt(), parse_response(). All accept AgentContext → return AgentResult. No shared mutable state.
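
A minimal sketch of what this contract might look like. It uses plain dataclasses with a trimmed set of fields instead of the Pydantic models the design calls for, and the `run()` method, the `EchoIntakeAgent` subclass, and the `fake_llm` callable are all illustrative assumptions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentContext:  # trimmed stand-in for the full input model
    user_input: str = ""
    retry_count: int = 0


@dataclass
class AgentResult:  # trimmed stand-in for the full output model
    success: bool
    clarifying_questions: list[str] = field(default_factory=list)
    api_calls_made: int = 0
    error: str = ""


class BaseAgent(ABC):
    def __init__(self, llm: Callable[[str, str], str]) -> None:
        # Hypothetical gateway: (system prompt, user prompt) -> raw text.
        self._llm = llm

    @abstractmethod
    def build_system_prompt(self, ctx: AgentContext) -> str: ...

    @abstractmethod
    def build_user_prompt(self, ctx: AgentContext) -> str: ...

    @abstractmethod
    def parse_response(self, raw: str, ctx: AgentContext) -> AgentResult: ...

    def run(self, ctx: AgentContext) -> AgentResult:
        """Assumed template method: prompt, call the LLM once, parse."""
        raw = self._llm(self.build_system_prompt(ctx), self.build_user_prompt(ctx))
        result = self.parse_response(raw, ctx)
        result.api_calls_made += 1
        return result


class EchoIntakeAgent(BaseAgent):
    """Toy agent that treats each non-empty response line as a question."""

    def build_system_prompt(self, ctx: AgentContext) -> str:
        return "List open questions about the spec, one per line."

    def build_user_prompt(self, ctx: AgentContext) -> str:
        return ctx.user_input

    def parse_response(self, raw: str, ctx: AgentContext) -> AgentResult:
        questions = [line.strip() for line in raw.splitlines() if line.strip()]
        return AgentResult(success=True, clarifying_questions=questions)


def fake_llm(system: str, user: str) -> str:
    return "Which database?\nIs auth required?"


result = EchoIntakeAgent(fake_llm).run(AgentContext(user_input="Build a todo API"))
print(result.clarifying_questions)
```

Because the agent only receives a context and returns a result, it can be exercised in tests with a stub LLM and no filesystem.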

05 / 12 SECTION 1 — AGENT ARCHITECTURE

Agent Data Flow & Contracts

📥 AgentContext (Input)
  • role: AgentRole
  • specification: StructuredSpecification
  • architecture: dict
  • current_task: AtomicTask
  • existing_files: {path → content}
  • error_message: str
  • error_traceback: str
  • user_input: str
  • clarification_responses: dict
  • retry_count: int

BaseAgent: build_prompt() → LLM.generate() → parse_response()

📤 AgentResult (Output)
  • success: bool
  • specification: StructuredSpecification
  • architecture: dict
  • implementation_plan: ImplementationPlan
  • generated_files: {path → content}
  • test_results: dict
  • security_report: dict
  • clarifying_questions: list[str]
  • requires_human_input: bool
  • api_calls_made: int
  • error: str

🔐 Isolation Guarantee: Agents have ZERO filesystem access — they return file dicts {path → content}, and the Orchestrator writes through the sandboxed FileManager. No agent can mutate another agent's output.
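
One way the sandboxed write could work is sketched below: every path in the returned file dict is resolved against the project root and rejected if it escapes. The function name `write_generated_files` and the `SandboxViolation` exception are assumptions for illustration, not the actual FileManager API.

```python
import tempfile
from pathlib import Path


class SandboxViolation(Exception):
    """Raised when a generated path would land outside the project root."""


def write_generated_files(root: Path, files: dict[str, str]) -> list[Path]:
    """Write {relative path -> content} under root, refusing any escape."""
    root = root.resolve()
    targets: list[tuple[Path, str]] = []
    # First pass: validate every path before touching the disk.
    for rel_path, content in files.items():
        target = (root / rel_path).resolve()
        if not target.is_relative_to(root):
            raise SandboxViolation(f"{rel_path!r} escapes {root}")
        targets.append((target, content))
    # Second pass: create parent directories and write.
    for target, content in targets:
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
    return [target for target, _ in targets]


with tempfile.TemporaryDirectory() as tmp:
    written = write_generated_files(Path(tmp), {"src/app.py": "x = 1\n"})
    demo_ok = written[0].read_text(encoding="utf-8") == "x = 1\n"
    try:
        write_generated_files(Path(tmp), {"../escape.py": "boom"})
        demo_blocked = False
    except SandboxViolation:
        demo_blocked = True

print(demo_ok, demo_blocked)
```

Validating all paths before writing any of them means a single bad path leaves the project directory untouched.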

06 / 12 SECTION 2 — WORKFLOW DESIGN

16-State Finite State Machine

[State diagram] States, in diagram order:
  • IDLE, INTAKE, CLARIFICATION (ambiguities found), SPECIFICATION, ARCHITECTURE, PLANNING, PLAN_REVIEW (🔒 CHECKPOINT)
  • TDD EXECUTION LOOP (per task): EXECUTION, TASK_QA, TASK_CODE, TASK_TEST (pass ✅ / fail ❌), TASK_RECOVERY (retry)
  • SECURITY_AUDIT, SUMMARY, DONE ✓ (edge label: all tasks complete)
  • ERROR

LEGEND: Requirements, Design, Checkpoint, TDD Loop, Recovery

⚠️ All transitions are validated; invalid transitions are rejected and logged.
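
Transition validation could be sketched as a lookup table keyed by the current state. The 16 state names come from the diagram, but the edges in `ALLOWED` are assumptions inferred from it (the real table may differ), and the `rejected` list stands in for the activity log.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    INTAKE = auto()
    CLARIFICATION = auto()
    SPECIFICATION = auto()
    ARCHITECTURE = auto()
    PLANNING = auto()
    PLAN_REVIEW = auto()
    EXECUTION = auto()
    TASK_QA = auto()
    TASK_CODE = auto()
    TASK_TEST = auto()
    TASK_RECOVERY = auto()
    SECURITY_AUDIT = auto()
    SUMMARY = auto()
    DONE = auto()
    ERROR = auto()


S = State
# Assumed edges, inferred from the diagram; ERROR is reachable from any
# non-terminal state and is handled separately below.
ALLOWED: dict[State, set[State]] = {
    S.IDLE: {S.INTAKE},
    S.INTAKE: {S.CLARIFICATION, S.SPECIFICATION},
    S.CLARIFICATION: {S.SPECIFICATION},
    S.SPECIFICATION: {S.ARCHITECTURE},
    S.ARCHITECTURE: {S.PLANNING},
    S.PLANNING: {S.PLAN_REVIEW},
    S.PLAN_REVIEW: {S.EXECUTION},
    S.EXECUTION: {S.TASK_QA, S.SECURITY_AUDIT},
    S.TASK_QA: {S.TASK_CODE},
    S.TASK_CODE: {S.TASK_TEST},
    S.TASK_TEST: {S.EXECUTION, S.TASK_RECOVERY},
    S.TASK_RECOVERY: {S.TASK_CODE, S.EXECUTION},
    S.SECURITY_AUDIT: {S.SUMMARY},
    S.SUMMARY: {S.DONE},
}
TERMINAL = {S.DONE, S.ERROR}


class InvalidTransition(Exception):
    """Raised when a requested transition is not in the table."""


class Workflow:
    def __init__(self) -> None:
        self.state = State.IDLE
        self.rejected: list[tuple[State, State]] = []  # stand-in for the log

    def transition(self, target: State) -> None:
        allowed = self.state not in TERMINAL and (
            target is State.ERROR or target in ALLOWED.get(self.state, set())
        )
        if not allowed:
            # Reject and record; the current state is left unchanged.
            self.rejected.append((self.state, target))
            raise InvalidTransition(f"{self.state.name} -> {target.name}")
        self.state = target


wf = Workflow()
wf.transition(State.INTAKE)
print(wf.state.name)
```

Keeping the table as data makes it easy to test every edge and to render the diagram from the same source.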
07 / 12 SECTION 2 — WORKFLOW DESIGN

Agent Handoffs & Human Checkpoints

How Agents Hand Off Work

The Orchestrator is the sole coordinator. Agents never invoke each other directly. Each phase produces artifacts consumed by the next:

Phase | Agent     | Produces           | Consumed By
1     | Intake    | StructuredSpec     | Architect, all
2     | Architect | Architecture dict  | Planner, QA, Coder
3     | Planner   | ImplementationPlan | Orchestrator loop
4     | QA        | Test files         | Coder, Test Runner
5     | Coder     | Production files   | Test Runner, Security
6     | Recovery  | Fix instructions   | Coder (retry)
7     | Security  | Audit report       | Summary

🔒 Human-in-the-Loop Checkpoints

ForgeAI pauses at configurable checkpoints for human review. All of them are controlled through a single YAML file:

Checkpoint 1: After Specification
User reviews structured spec: requirements, models, API contracts. Can request changes before architecture begins.
Checkpoint 2: After Architecture
User reviews directory layout, tech decisions. Approves before planning starts.
Checkpoint 3: After Plan (FR-06)
User reviews ordered task list with risk levels, dependencies. Must approve before code generation.
Per-Diff Review (FR-09)
Each code change is shown as a diff summary before filesystem write.
# default_config.yaml — Single file to control all behavior (NFR-03)
workflow:
  checkpoints:
    - "after_specification"
    - "after_architecture"
    - "after_plan"
  auto_approve_checkpoints: false   # Set true for zero-touch demo mode
  max_retries: 3                 # Configurable retry limit per task
  retry_delay_seconds: 2
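
A sketch of how the orchestrator might consult this config before pausing. The dict below mirrors what PyYAML's `yaml.safe_load` would return for the file above; the helper name `needs_human_approval` is a hypothetical choice.

```python
# Mirrors the parsed structure of default_config.yaml shown above.
config = {
    "workflow": {
        "checkpoints": [
            "after_specification",
            "after_architecture",
            "after_plan",
        ],
        "auto_approve_checkpoints": False,
        "max_retries": 3,
        "retry_delay_seconds": 2,
    }
}


def needs_human_approval(cfg: dict, checkpoint: str) -> bool:
    """True if the pipeline should pause at this checkpoint."""
    workflow = cfg.get("workflow", {})
    if workflow.get("auto_approve_checkpoints", False):
        return False  # zero-touch mode: never pause
    return checkpoint in workflow.get("checkpoints", [])


print(needs_human_approval(config, "after_plan"))
```

Removing an entry from `checkpoints` disables that single pause, while `auto_approve_checkpoints: true` disables all of them at once.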
08 / 12 SECTION 2 — WORKFLOW DESIGN

TDD-First Execution Pipeline

STEP 1: TDD (QA Agent)
  • Writes failing tests BEFORE any code → tests/test_*.py

STEP 2: CODE (💻 Coder Agent)
  • Generates code to pass failing tests → src/*.py

STEP 3: TEST (🧪 Test Runner)
  • Executes pytest suite
  • Pass/Fail verdict → {passed, failed, output}
  • PASS → ✅ TASK PASSED, move to next task
  • FAIL → STEP 4

STEP 4: RECOVERY (🔄 Recovery Agent)
  • Diagnose → Fix → Strategy
  • RETRY | MODIFY | SKIP | ESCALATE
  • Retry with error context

📊 Error Context Accumulation (per retry)
  • Attempt 1: spec + arch + task + tests
  • Attempt 2: + error msg + traceback
  • Attempt 3: + all prior errors + fix instructions
  • Attempt 4: exhausted → SKIP or ESCALATE
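
The per-task loop with accumulating error context might look like the sketch below. The callables `generate_code`, `run_tests`, and `diagnose` are hypothetical stand-ins for the Coder Agent, Test Runner, and Recovery Agent, and the return values are illustrative.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    output: str = ""


@dataclass
class AttemptContext:
    task_id: str
    errors: list[str] = field(default_factory=list)            # grows per failure
    fix_instructions: list[str] = field(default_factory=list)  # from recovery


def run_task(
    task_id: str,
    generate_code: Callable[[AttemptContext], str],
    run_tests: Callable[[str], Verdict],
    diagnose: Callable[[AttemptContext], str],
    max_retries: int = 3,
) -> tuple[str, int]:
    """Return (outcome, attempts used); outcome is PASSED or EXHAUSTED."""
    ctx = AttemptContext(task_id)
    for attempt in range(1, max_retries + 1):
        code = generate_code(ctx)  # sees all prior errors and fixes
        verdict = run_tests(code)
        if verdict.passed:
            return "PASSED", attempt
        ctx.errors.append(verdict.output)
        ctx.fix_instructions.append(diagnose(ctx))
    # Caller decides between SKIP and ESCALATE once attempts are exhausted.
    return "EXHAUSTED", max_retries


# Toy stand-ins: the "coder" only succeeds after seeing two prior errors.
def toy_coder(ctx: AttemptContext) -> str:
    return "good" if len(ctx.errors) >= 2 else "bad"


def toy_tests(code: str) -> Verdict:
    return Verdict(code == "good", "" if code == "good" else "AssertionError")


outcome = run_task("task-1", toy_coder, toy_tests, lambda ctx: "check return value")
print(outcome)
```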
09 / 12 SECTION 3 — FAILURE STRATEGY

4-Tier Cascading Recovery System

❌ Test Failure Detected → 🔍 Recovery Agent Diagnoses → Error Classification

🔧 RETRY_WITH_FIX (Confidence: HIGH)
  • Specific fix instructions
  • Error + traceback → Coder
  • Modify tests if bug in test
  • Triggers: syntax, import

🔄 MODIFY_APPROACH (Confidence: MEDIUM)
  • Different algorithm/pattern
  • Revised test expectations
  • Fresh code generation
  • Triggers: logic, approach

⏭️ SKIP_TASK (Confidence: N/A)
  • Task is non-blocking
  • No downstream deps
  • Log and proceed
  • Triggers: retries exhausted

🚨 ESCALATE (Confidence: LOW)
  • Pause pipeline
  • Show diagnostics to user
  • User decides next step
  • Triggers: human judgment

🧠 Diagnostic Output Example:

{
  "diagnosis": {
    "root_cause": "ImportError: no 'APIRoute'",
    "error_type": "import",
    "error_in": "production_code"
  },
  "strategy": "RETRY_WITH_FIX",
  "fix_instructions": "Use APIRouter",
  "confidence": 0.95
}
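
A sketch of how the orchestrator might turn this diagnostic JSON into a recovery strategy. The four strategy names come from the cascade above; the `parse_diagnosis` helper and the choice to fall back to ESCALATE on malformed output are assumptions.

```python
import json
from enum import Enum


class Strategy(Enum):
    RETRY_WITH_FIX = "RETRY_WITH_FIX"
    MODIFY_APPROACH = "MODIFY_APPROACH"
    SKIP_TASK = "SKIP_TASK"
    ESCALATE = "ESCALATE"


def parse_diagnosis(raw: str) -> tuple[Strategy, str]:
    """Return (strategy, fix instructions) from the Recovery Agent's JSON."""
    try:
        data = json.loads(raw)
        strategy = Strategy(data["strategy"])
    except (ValueError, KeyError, TypeError):
        # Assumed fallback: malformed or unknown output goes to a human.
        return Strategy.ESCALATE, "unparseable diagnosis"
    return strategy, data.get("fix_instructions", "")


raw_diagnosis = """
{
  "diagnosis": {
    "root_cause": "ImportError: no 'APIRoute'",
    "error_type": "import",
    "error_in": "production_code"
  },
  "strategy": "RETRY_WITH_FIX",
  "fix_instructions": "Use APIRouter",
  "confidence": 0.95
}
"""

parsed = parse_diagnosis(raw_diagnosis)
print(parsed)
```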

Guardrail Safety (Configurable via YAML)

guardrails:
  max_files_per_task: 8
  max_lines_per_file: 600
  blocked_commands:
    - "rm -rf /"
    - "del /s /q C:\\"
  require_approval_for:
    - "database_schema_changes"
    - "security_sensitive_patterns"
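
The size limits above could be enforced on an agent's file dict before anything is written. This is a minimal sketch under that assumption; `guardrail_violations` is a hypothetical helper, and only the two numeric limits are covered.

```python
# Mirrors the numeric limits from the guardrails block above.
limits = {"max_files_per_task": 8, "max_lines_per_file": 600}


def guardrail_violations(files: dict[str, str], cfg: dict[str, int]) -> list[str]:
    """Return human-readable violations; an empty list means the output is OK."""
    violations: list[str] = []
    if len(files) > cfg["max_files_per_task"]:
        violations.append(
            f"{len(files)} files exceeds max_files_per_task={cfg['max_files_per_task']}"
        )
    for path, content in files.items():
        line_count = len(content.splitlines())
        if line_count > cfg["max_lines_per_file"]:
            violations.append(
                f"{path}: {line_count} lines exceeds "
                f"max_lines_per_file={cfg['max_lines_per_file']}"
            )
    return violations


small_output = {"src/app.py": "x = 1\ny = 2\n"}
print(guardrail_violations(small_output, limits))
```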
10 / 12 SECTION 4 — TECH STACK JUSTIFICATION

Every Choice is Deliberate

🤖 LLM: Google Gemini 2.5 Flash

Criterion    | Gemini 2.5 | GPT-4o    | Claude 3.5
Speed        | ⚡ Fastest  | 🔵 Fast   | 🟡 Medium
Context      | 1M tokens  | 128K      | 200K
Cost         | Free tier  | Paid      | Paid
JSON Mode    | ✅ Native  |           |
Code Quality | Excellent  | Excellent | Excellent

💡 Why 1M context? Our agents pass full project state — spec, architecture, existing files, error history — to the LLM. A large context window is critical for maintaining coherence across complex multi-file projects.

🔧 Framework & Libraries

Python 3.11+
Hackathon requirement. Rich ecosystem, type hints.
Pydantic v2
Strict contracts at every agent boundary. Auto-validation.
pytest
Industry standard. Rich assertions. Coverage plugin.
Rich
Premium terminal UX. Live tables, spinners, syntax highlighting.
FastAPI
Real-time web dashboard. Async. Auto-docs.
PyYAML
Human-readable config. NFR-03 compliance.

⚠️ Why NOT LangChain / CrewAI / AutoGen?

Abstraction overhead. Custom FSM gives us full control over state transitions and recovery logic.

Debugging difficulty. With LangChain, debugging agent failures through deep abstraction layers is painful.

Evaluation criteria. The hackathon evaluates our systems design, not a framework we imported.

11 / 12 SECTION 5 — RISK ASSESSMENT

Honest Risk Assessment & Mitigation

[Figure: Risk Impact vs Probability Matrix. Axes: IMPACT and PROBABILITY, LOW to HIGH. Quadrants: MONITOR, ⚠️ MITIGATE NOW, ACCEPT, PLAN CONTINGENCY. Plotted risks: R1 LLM Hallucination, R2 Complex Tier Failure, R3 Test Flakiness, R4 API Rate Limits, R5 Context Overflow.]
#  | Risk                 | Impact  | Probability | Mitigation Strategy
R1 | LLM Hallucination    | 🔴 High | 🔴 High     | TDD-first flow checks each task's code against tests written before it. Recovery Agent re-diagnoses with accumulated error context across retries.
R2 | Complex Tier Failure | 🔴 High | 🟡 Med      | Planner assigns a risk level to each task. High-risk tasks get extra retries. Non-critical tasks are skipped gracefully.
R3 | Test Flakiness       | 🟡 Med  | 🟡 Med      | Recovery Agent can modify test code if the bug is in the test. Test design is double-validated against the spec.
R4 | API Rate Limiting    | 🟡 Med  | 🟡 Med      | LLM Gateway uses exponential backoff (2s→4s→8s). Max 3 retries per API call. Proactive token tracking.
R5 | Context Overflow     | 🟡 Med  | 🟢 Low      | Gemini's 1M-token context makes this unlikely. Files are truncated to 1500 chars when injected into prompts.
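
The R4 mitigation (2s→4s→8s, at most 3 retries per call) can be sketched as follows. `RateLimitError` and `call_with_backoff` are hypothetical names, not part of any Gemini client library, and the `sleep` parameter is injected so the demo does not actually wait.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error the LLM Gateway raises on a rate-limit response."""


def call_with_backoff(
    call: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on RateLimitError wait 2s, 4s, 8s, then give up."""
    delay = base_delay
    for attempt in range(max_retries + 1):  # one initial try plus the retries
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            sleep(delay)
            delay *= 2
    raise AssertionError("unreachable")


# Demo with a recorded sleep: fails three times, then succeeds.
recorded_delays: list[float] = []
calls = {"count": 0}


def flaky_llm_call() -> str:
    calls["count"] += 1
    if calls["count"] <= 3:
        raise RateLimitError("rate limited")
    return "ok"


response = call_with_backoff(flaky_llm_call, sleep=recorded_delays.append)
print(response, recorded_delays)
```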
12 / 12 SUMMARY

ForgeAI: Ready to Build

📊 Evaluation Alignment

Criterion             | Pts | Our Strategy                                 | Status
Agentic Autonomy      | 30  | Zero-touch pipeline. Auto-approve mode.      |
TDD & Verification    | 25  | QA writes tests BEFORE Coder writes code.    |
Complex Logic & State | 20  | 16-state FSM. Dependency-aware scheduling.   |
Failure Recovery      | 10  | 4-tier cascade: Retry→Modify→Skip→Escalate.  |
Code Quality          | 10  | Modular, typed, documented, config-driven.   |
Extended Features     | 5   | Security audit, Docker, Web dashboard.       |

📁 Output Artifacts Per Run

📄 structured_specification.yaml
Machine + human-readable refined spec (FR-01, FR-02)
🏗️ architecture.json
Project structure, models, API contracts (FR-04)
📋 implementation_plan.json
Ordered tasks with risks & dependencies (FR-05)
📝 forgeai_activity.log
Timestamped log of every action (NFR-02)
📦 generated_project/ + tests/
Complete runnable project with TDD test suite
📊 workflow_summary.json + security_report.json
Summary: tasks, files, tests, API calls (NFR-06)

🚀 ForgeAI is architecturally complete and ready for Phase 2 implementation.
7 agents. 16-state FSM. TDD-first. Intelligent recovery. Built from scratch — no LangChain, no shortcuts, pure systems engineering.