01 / 12 ⚡

ForgeAI

An Autonomous Multi-Agent Framework that Transforms
Natural-Language Specifications into Fully Tested Software

Phase 1 — Proposed Design Document | Itanta Hackathon 2026

7 Specialized Agents
16 FSM States
5 Complexity Tiers
0 Manual Code Lines

🧠 Systems Design ✅ TDD-First 🤖 Gemini 2.5 Flash 🔄 Auto-Recovery

02 / 12 PROBLEM STATEMENT

The Gap: From Reactive Assistants
to Proactive Engineers

❌ Current AI Coding Tools

Developers manually decompose projects
Hand-feed each task to the AI one by one
Evaluate, debug, and integrate each output
Handle failures and retries manually
No persistent state across tasks
Does not scale to complex, multi-system projects

✅ ForgeAI's Approach

Automated decomposition into atomic tasks
Specialized agents work autonomously
TDD-first verification after every task
Intelligent failure recovery with cascading strategies
Full state tracking with Pydantic models
Scales across all 5 complexity tiers

💬 NL Spec User Input

🎯 Intake Clarify & Refine

🏗️ Architect Design

📋 Plan Decompose

✅ QA Tests First

💻 Coder Generate

🔒 Security Audit

📊 Report Summary

03 / 12 SECTION 1 — AGENT ARCHITECTURE

Layered System Architecture

04 / 12 SECTION 1 — AGENT ARCHITECTURE

7 Agents, Clear Responsibilities

🎯

Intake Agent

Parse natural-language spec
Detect ambiguities & gaps
Generate clarifying questions
Produce StructuredSpecification

FR-01, FR-02

🏗️

Architect Agent

Design project structure
Define directory layout
Data models & API contracts
Technology decisions

FR-04

📋

Planner Agent

Decompose into atomic tasks
Build dependency graph
Assign risk levels
Set checkpoint flags

FR-05, FR-06

✅

QA Agent (TDD)

Write failing tests FIRST
pytest format with assertions
Edge case coverage
One test file per task

FR-11

💻

Coder Agent

Generate production code
Pass all failing tests
Follow architecture design
Present diffs for review

FR-08, FR-09

🔒

Security Agent

Scan for injection vulns
Auth & RBAC flaws
Hardcoded secrets check
Generate audit report

FR-14 (Extended)

🔄

Recovery Agent

Diagnose root cause
Classify error type
Provide fix instructions
Decide: retry / modify / skip / escalate

FR-15, FR-17

🔑 Uniform Contract: Every agent extends BaseAgent and implements exactly 3 methods: build_system_prompt(), build_user_prompt(), parse_response(). All accept AgentContext → return AgentResult. No shared mutable state.
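The uniform contract can be sketched roughly as follows. This is a minimal illustration, not the actual ForgeAI code: the fields on `AgentContext` and `AgentResult`, the `run()` template method, the `llm` callable, and `EchoAgent` are all assumptions made for the example. Standard-library dataclasses stand in for the Pydantic models the design calls for.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class AgentContext:
    """Read-only input handed to an agent (field names are assumptions)."""
    task_id: str
    payload: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class AgentResult:
    """Output returned by an agent (field names are assumptions)."""
    success: bool
    data: dict[str, Any] = field(default_factory=dict)


class BaseAgent(ABC):
    """Three abstract hooks; AgentContext in, AgentResult out."""

    def __init__(self, llm: Callable[[str, str], str]) -> None:
        # Hypothetical LLM callable: (system_prompt, user_prompt) -> raw text.
        self._llm = llm

    @abstractmethod
    def build_system_prompt(self, ctx: AgentContext) -> str: ...

    @abstractmethod
    def build_user_prompt(self, ctx: AgentContext) -> str: ...

    @abstractmethod
    def parse_response(self, raw: str, ctx: AgentContext) -> AgentResult: ...

    def run(self, ctx: AgentContext) -> AgentResult:
        # Assumed template method: the base class owns the call sequence,
        # so subclasses only supply prompts and parsing.
        raw = self._llm(self.build_system_prompt(ctx), self.build_user_prompt(ctx))
        return self.parse_response(raw, ctx)


class EchoAgent(BaseAgent):
    """Toy agent used only to exercise the contract."""

    def build_system_prompt(self, ctx: AgentContext) -> str:
        return "You are an echo agent."

    def build_user_prompt(self, ctx: AgentContext) -> str:
        return ctx.payload.get("text", "")

    def parse_response(self, raw: str, ctx: AgentContext) -> AgentResult:
        return AgentResult(success=True, data={"echo": raw})


# A fake LLM that upper-cases the user prompt keeps the example offline.
result = EchoAgent(lambda system, user: user.upper()).run(
    AgentContext(task_id="T1", payload={"text": "hello"})
)
```

Because the context and result objects are frozen, an agent cannot modify what it was given. This is one way to back the "no shared mutable state" claim.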

05 / 12 SECTION 1 — AGENT ARCHITECTURE

Agent Data Flow & Contracts

🔐 Isolation Guarantee: Agents have ZERO filesystem access — they return file dicts {path → content}, and the Orchestrator writes through the sandboxed FileManager. No agent can mutate another agent's output.
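A sandboxed writer along these lines would enforce the isolation guarantee. This is a sketch under assumptions: the `write_all` method, the `SandboxViolation` exception, and the path-resolution rule are illustrative and not taken from the ForgeAI codebase.

```python
import tempfile
from pathlib import Path


class SandboxViolation(Exception):
    """Raised when a path would land outside the sandbox root."""


class FileManager:
    """Writes agent-produced {path: content} dicts under a single root."""

    def __init__(self, root: Path) -> None:
        self.root = root.resolve()

    def write_all(self, files: dict[str, str]) -> list[Path]:
        # Resolve every path first so one bad entry leaves nothing half-written.
        targets = [(self._resolve(rel), content) for rel, content in files.items()]
        written = []
        for target, content in targets:
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content, encoding="utf-8")
            written.append(target)
        return written

    def _resolve(self, relative: str) -> Path:
        # resolve() collapses ".." segments; absolute inputs replace the root
        # entirely, so both cases fail the containment check below.
        target = (self.root / relative).resolve()
        if not target.is_relative_to(self.root):
            raise SandboxViolation(f"{relative!r} escapes the sandbox")
        return target


with tempfile.TemporaryDirectory() as tmp:
    manager = FileManager(Path(tmp))
    written_paths = manager.write_all({"app/main.py": "print('hi')\n"})
    written_text = written_paths[0].read_text(encoding="utf-8")
```

Only the Orchestrator would hold a `FileManager` instance. Agents never receive one, which is what keeps them away from the filesystem.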

06 / 12 SECTION 2 — WORKFLOW DESIGN

16-State Finite State Machine

07 / 12 SECTION 2 — WORKFLOW DESIGN

Agent Handoffs & Human Checkpoints

How Agents Hand Off Work

The Orchestrator is the sole coordinator. Agents never invoke each other directly. Each phase produces artifacts consumed by the next:

Phase	Agent	Produces	Consumed By
1	Intake	StructuredSpec	Architect, all
2	Architect	Architecture dict	Planner, QA, Coder
3	Planner	ImplementationPlan	Orchestrator loop
4	QA	Test files	Coder, Test Runner
5	Coder	Production files	Test Runner, Security
6	Recovery	Fix instructions	Coder (retry)
7	Security	Audit report	Summary
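The Orchestrator loop that consumes the ImplementationPlan has to respect task dependencies. One way to derive a safe execution order is a topological sort using the standard-library `graphlib` module. The `Task` fields and task ids below are hypothetical, and the source does not say which algorithm the Planner or Orchestrator will use.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter


@dataclass
class Task:
    task_id: str                                          # hypothetical field name
    depends_on: list[str] = field(default_factory=list)   # hypothetical field name


def execution_order(tasks: list[Task]) -> list[str]:
    """Return task ids so that every task comes after its dependencies."""
    graph = {task.task_id: set(task.depends_on) for task in tasks}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes forming the detected cycle.
        raise ValueError(f"circular task dependency: {exc.args[1]}") from exc


plan = [
    Task("T3_routes", depends_on=["T1_models", "T2_db"]),
    Task("T1_models"),
    Task("T2_db", depends_on=["T1_models"]),
]
order = execution_order(plan)
```

Rejecting cyclic plans at this point means a malformed plan fails before any code is generated, rather than stalling midway through the run.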

🔒 Human-in-the-Loop Checkpoints

ForgeAI pauses at configurable checkpoints for human review. All controllable via a single YAML file:

Checkpoint 1: After Specification

User reviews structured spec: requirements, models, API contracts. Can request changes before architecture begins.

Checkpoint 2: After Architecture

User reviews directory layout, tech decisions. Approves before planning starts.

Checkpoint 3: After Plan (FR-06)

User reviews ordered task list with risk levels, dependencies. Must approve before code generation.

Per-Diff Review (FR-09)

Each code change is shown as a diff summary before filesystem write.
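A diff summary of this kind can be produced with the standard-library `difflib` module. This is a minimal sketch: the function names and the `a/` and `b/` path prefixes are illustrative choices, not part of the ForgeAI design.

```python
import difflib


def diff_summary(path: str, old: str, new: str) -> str:
    """Return a unified diff between the current and proposed file content."""
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)


def change_counts(diff: str) -> tuple[int, int]:
    """Count (added, removed) lines, skipping the two file-header lines."""
    body = diff.splitlines()[2:]
    added = sum(1 for line in body if line.startswith("+"))
    removed = sum(1 for line in body if line.startswith("-"))
    return added, removed


diff = diff_summary("app/main.py", "a\nb\n", "a\nc\n")
added, removed = change_counts(diff)
```

A new file can be handled by passing an empty string as `old`, so every line shows up as an addition.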

      # default_config.yaml: single file to control all behavior (NFR-03)
      workflow:
        checkpoints:
          - "after_specification"
          - "after_architecture"
          - "after_plan"
        auto_approve_checkpoints: false  # set true for zero-touch demo mode
        max_retries: 3                   # configurable retry limit per task
        retry_delay_seconds: 2
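The Orchestrator could consult this configuration at a checkpoint as sketched below. The `checkpoint_passes` function and the `ask_user` callback are hypothetical. The dict mirrors what `yaml.safe_load` would return for the file above and is inlined so the example has no third-party dependency.

```python
from typing import Callable

# Parsed form of default_config.yaml (inlined for the example).
CONFIG = {
    "workflow": {
        "checkpoints": ["after_specification", "after_architecture", "after_plan"],
        "auto_approve_checkpoints": False,
        "max_retries": 3,
        "retry_delay_seconds": 2,
    }
}


def checkpoint_passes(
    name: str, config: dict, ask_user: Callable[[str], bool]
) -> bool:
    """Return True if the workflow may continue past checkpoint `name`."""
    workflow = config["workflow"]
    if name not in workflow["checkpoints"]:
        return True  # checkpoint not configured: nothing to review
    if workflow["auto_approve_checkpoints"]:
        return True  # zero-touch mode: skip the human prompt
    return ask_user(name)  # hypothetical callback that shows the artifact


approved = checkpoint_passes("after_plan", CONFIG, ask_user=lambda name: True)
```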

08 / 12 SECTION 2 — WORKFLOW DESIGN

TDD-First Execution Pipeline

09 / 12 SECTION 3 — FAILURE STRATEGY

4-Tier Cascading Recovery System

🧠 Diagnostic Output Example:

          {
            "diagnosis": {
              "root_cause": "ImportError: no 'APIRoute'",
              "error_type": "import",
              "error_in": "production_code"
            },
            "strategy": "RETRY_WITH_FIX",
            "fix_instructions": "Use APIRouter",
            "confidence": 0.95
          }
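The Orchestrator might turn such a diagnostic into a decision as sketched below. Only `RETRY_WITH_FIX` appears in the source. The other strategy names, the confidence threshold, and the `critical` flag are assumptions used to illustrate the retry, modify, skip, escalate cascade.

```python
import json
from enum import Enum


class Strategy(str, Enum):
    RETRY_WITH_FIX = "RETRY_WITH_FIX"  # name taken from the example output
    MODIFY_TEST = "MODIFY_TEST"        # hypothetical name
    SKIP_TASK = "SKIP_TASK"            # hypothetical name
    ESCALATE = "ESCALATE"              # hypothetical name


def next_action(
    raw: str,
    attempt: int,
    max_retries: int = 3,
    min_confidence: float = 0.5,  # assumed threshold, not from the source
    critical: bool = True,
) -> Strategy:
    """Pick the next recovery step from the Recovery Agent's JSON output."""
    try:
        report = json.loads(raw)
        strategy = Strategy(report["strategy"])
        confidence = float(report["confidence"])
    except (KeyError, ValueError, TypeError):
        # Malformed or unknown diagnostics go straight to a human.
        return Strategy.ESCALATE
    if attempt >= max_retries:
        # Retry budget spent: escalate critical tasks, skip the rest.
        return Strategy.ESCALATE if critical else Strategy.SKIP_TASK
    if confidence < min_confidence:
        return Strategy.ESCALATE
    return strategy


EXAMPLE = json.dumps({
    "diagnosis": {
        "root_cause": "ImportError: no 'APIRoute'",
        "error_type": "import",
        "error_in": "production_code",
    },
    "strategy": "RETRY_WITH_FIX",
    "fix_instructions": "Use APIRouter",
    "confidence": 0.95,
})
decision = next_action(EXAMPLE, attempt=1)
```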

Guardrail Safety (Configurable via YAML)

          guardrails:
            max_files_per_task: 8
            max_lines_per_file: 600
            blocked_commands:
              - "rm -rf /"
              - "del /s /q C:\\"
            require_approval_for:
              - "database_schema_changes"
              - "security_sensitive_patterns"
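These limits could be checked before the Orchestrator writes an agent's output, as in the sketch below. The `guardrail_violations` function and its substring-matching rule are assumptions. The `require_approval_for` categories are left out because the source does not say how they are detected.

```python
# Parsed form of the guardrails section above (inlined for the example).
GUARDRAILS = {
    "max_files_per_task": 8,
    "max_lines_per_file": 600,
    "blocked_commands": ["rm -rf /", "del /s /q C:\\"],
}


def guardrail_violations(
    files: dict[str, str], commands: list[str], limits: dict
) -> list[str]:
    """Return human-readable violations; an empty list means the task passes."""
    problems = []
    if len(files) > limits["max_files_per_task"]:
        problems.append(
            f"{len(files)} files exceeds limit of {limits['max_files_per_task']}"
        )
    for path, content in files.items():
        line_count = len(content.splitlines())
        if line_count > limits["max_lines_per_file"]:
            problems.append(f"{path}: {line_count} lines exceeds limit")
    for command in commands:
        for blocked in limits["blocked_commands"]:
            # Substring match is deliberately conservative: it also flags
            # commands such as "rm -rf /tmp/x".
            if blocked in command:
                problems.append(f"blocked command: {command!r}")
    return problems


clean = guardrail_violations({"app/main.py": "x = 1\n"}, ["pytest -q"], GUARDRAILS)
```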

10 / 12 SECTION 4 — TECH STACK JUSTIFICATION

Every Choice is Deliberate

🤖 LLM: Google Gemini 2.5 Flash

Criterion	Gemini 2.5	GPT-4o	Claude 3.5
Speed	⚡ Fastest	🔵 Fast	🟡 Medium
Context	1M tokens	128K	200K
Cost	Free tier	Paid	Paid
JSON Mode	✅ Native	✅	❌
Code Quality	Excellent	Excellent	Excellent

💡 Why 1M context? Our agents pass full project state — spec, architecture, existing files, error history — to the LLM. A large context window is critical for maintaining coherence across complex multi-file projects.

🔧 Framework & Libraries

Python 3.11+

Hackathon requirement. Rich ecosystem, type hints.

Pydantic v2

Strict contracts at every agent boundary. Auto-validation.

pytest

Industry standard. Rich assertions. Coverage plugin.

Rich

Premium terminal UX. Live tables, spinners, syntax highlighting.

FastAPI

Real-time web dashboard. Async. Auto-docs.

PyYAML

Human-readable config. NFR-03 compliance.

⚠️ Why NOT LangChain / CrewAI / AutoGen?

Abstraction overhead. Custom FSM gives us full control over state transitions and recovery logic.

Debugging difficulty. Agent failures in LangChain must be traced through several abstraction layers, which slows diagnosis.

Evaluation criteria. The hackathon evaluates our systems design, not a framework we imported.

11 / 12 SECTION 5 — RISK ASSESSMENT

Honest Risk Assessment & Mitigation

#	Risk	Impact	Probability	Mitigation Strategy
R1	LLM Hallucination	🔴 High	🔴 High	TDD-first verification runs the tests after every task, so hallucinated code surfaces as a failing test. The Recovery Agent re-diagnoses with accumulated error context across retries.
R2	Complex Tier Failure	🔴 High	🟡 Med	Planner assigns risk levels per task. High-risk tasks get extra retries. Gracefully skip non-critical tasks.
R3	Test Flakiness	🟡 Med	🟡 Med	Recovery Agent can modify test code if bug is in test. Double-validate test design against spec.
R4	API Rate Limiting	🟡 Med	🟡 Med	LLM Gateway has exponential backoff (2s→4s→8s). Max 3 retries per API call. Proactive token tracking.
R5	Context Overflow	🟡 Med	🟢 Low	Gemini's 1M-token context makes overflow unlikely. Files are truncated to 1500 characters when injected into prompts.
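The backoff schedule in R4 (2s, 4s, 8s, at most 3 retries) can be sketched as a small wrapper. `RateLimitError` and `call_with_backoff` are hypothetical names. A real gateway would catch whatever exception the LLM client raises on rate limiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the provider's rate-limit exception."""


def call_with_backoff(
    call: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on RateLimitError with doubling delays."""
    for attempt in range(max_retries + 1):  # one initial try plus the retries
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the error
            sleep(base_delay * 2 ** attempt)  # 2s, then 4s, then 8s
    raise AssertionError("unreachable")


# Demo with a fake API that is rate-limited twice, then succeeds.
delays: list[float] = []
responses = iter([RateLimitError(), RateLimitError(), "ok"])


def fake_api() -> str:
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


outcome = call_with_backoff(fake_api, sleep=delays.append)
```

Injecting `sleep` keeps the demo instant and makes the delay schedule easy to assert in tests.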

12 / 12 SUMMARY

ForgeAI: Ready to Build

📊 Evaluation Alignment

Criterion	Pts	Our Strategy	Status
Agentic Autonomy	30	Zero-touch pipeline. Auto-approve mode.	✅
TDD & Verification	25	QA writes tests BEFORE Coder writes code.	✅
Complex Logic & State	20	16-state FSM. Dependency-aware scheduling.	✅
Failure Recovery	10	4-tier cascade: Retry→Modify→Skip→Escalate.	✅
Code Quality	10	Modular, typed, documented, config-driven.	✅
Extended Features	5	Security audit, Docker, Web dashboard.	✅

📁 Output Artifacts Per Run

📄 structured_specification.yaml

Machine + human-readable refined spec (FR-01, FR-02)

🏗️ architecture.json

Project structure, models, API contracts (FR-04)

📋 implementation_plan.json

Ordered tasks with risks & dependencies (FR-05)

📝 forgeai_activity.log

Timestamped log of every action (NFR-02)

📦 generated_project/ + tests/

Complete runnable project with TDD test suite

📊 workflow_summary.json + security_report.json

Summary: tasks, files, tests, API calls (NFR-06)

🚀 ForgeAI is architecturally complete and ready for Phase 2 implementation.
7 agents. 16-state FSM. TDD-first. Intelligent recovery. Built from scratch — no LangChain, no shortcuts, pure systems engineering.