
# Technical Specs

## 🔄 Multi-Turn Agent Communication

### Feedback Loop Implementation

1. **Initial Request**: Brown sends a structured prompt to Bayko
2. **Content Generation**: Bayko processes it via Modal + sponsor APIs
3. **Quality Validation**: Brown evaluates the output against the original intent
4. **Iterative Refinement**: up to 3 feedback cycles with specific improvement requests
5. **Final Assembly**: Brown compiles the approved content into the deliverable format
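
The loop above can be sketched as a small driver function. This is a minimal illustration, not the production implementation: `generate` and `validate` are hypothetical stand-ins for Bayko's generation call and Brown's quality check.

```python
def run_feedback_loop(prompt, generate, validate, max_iterations=3):
    """Drive the Brown <-> Bayko refinement cycle.

    `generate(prompt, feedback)` and `validate(content)` are stand-ins
    for Bayko's generation call and Brown's approval check, which is
    expected to return (approved, feedback).
    """
    feedback = None
    for iteration in range(1, max_iterations + 1):
        content = generate(prompt, feedback)    # Bayko generates a draft
        approved, feedback = validate(content)  # Brown validates it
        if approved:
            return {"content": content, "iterations": iteration, "approved": True}
    # Out of iterations: hand back the last attempt, unapproved
    return {"content": content, "iterations": max_iterations, "approved": False}
```

The cap of 3 iterations matches the `MAX_ITERATIONS` application setting.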

### Agent Message Schema

```json
{
  "message_id": "msg_001",
  "timestamp": "2025-01-15T10:30:00Z",
  "sender": "agent_brown",
  "recipient": "agent_bayko",
  "message_type": "generation_request",
  "payload": {
    "prompt": "A moody K-pop idol finds a puppy",
    "style_tags": ["studio_ghibli", "whisper_soft_lighting"],
    "panels": 4,
    "language": "korean",
    "extras": ["narration", "subtitles"]
  },
  "context": {
    "conversation_id": "conv_001",
    "iteration": 1,
    "previous_feedback": null
  }
}
```
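
A small sketch of how messages following this schema could be built and checked before dispatch. The helper names (`make_message`, `validate_message`) are illustrative, not part of the codebase:

```python
import json
from datetime import datetime, timezone

REQUIRED_FIELDS = {"message_id", "timestamp", "sender", "recipient",
                   "message_type", "payload", "context"}

def make_message(sender, recipient, message_type, payload,
                 conversation_id, iteration=1, previous_feedback=None,
                 message_id="msg_001"):
    """Build an inter-agent message following the schema above."""
    return {
        "message_id": message_id,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "sender": sender,
        "recipient": recipient,
        "message_type": message_type,
        "payload": payload,
        "context": {
            "conversation_id": conversation_id,
            "iteration": iteration,
            "previous_feedback": previous_feedback,
        },
    }

def validate_message(raw: str) -> dict:
    """Parse a JSON message and check the required top-level fields."""
    msg = json.loads(raw)
    missing = REQUIRED_FIELDS - msg.keys()
    if missing:
        raise ValueError(f"message missing fields: {sorted(missing)}")
    return msg
```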

πŸ“ File Organization & Data Standards

### Output Directory Structure

```
/storyboard/
├── session_001/
│   ├── agents/
│   │   ├── brown_state.json      # Agent Brown memory/state
│   │   ├── bayko_state.json      # Agent Bayko memory/state
│   │   └── conversation_log.json # Inter-agent messages
│   ├── content/
│   │   ├── panel_1.png           # Generated images
│   │   ├── panel_1_audio.mp3     # TTS narration
│   │   ├── panel_1_subs.vtt      # Subtitle files
│   │   └── metadata.json         # Content metadata
│   ├── iterations/
│   │   ├── v1_feedback.json      # Validation feedback
│   │   ├── v2_refinement.json    # Refinement requests
│   │   └── final_approval.json   # Final validation
│   └── output/
│       ├── final_comic.png       # Assembled comic
│       ├── manifest.json         # Complete session data
│       └── performance_log.json  # Timing/cost metrics
```

### Metadata Standards

```json
{
  "session_id": "session_001",
  "created_at": "2025-01-15T10:30:00Z",
  "user_prompt": "Original user input",
  "processing_stats": {
    "total_iterations": 2,
    "processing_time_ms": 45000,
    "api_calls": {
      "openai": 3,
      "mistral": 2,
      "modal": 8
    },
    "cost_breakdown": {
      "compute": "$0.15",
      "api_calls": "$0.08"
    }
  },
  "quality_metrics": {
    "brown_approval_score": 0.92,
    "style_consistency": 0.88,
    "prompt_adherence": 0.95
  }
}
```
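
Since costs are stored as dollar strings, aggregating them takes a small parsing step. A hedged sketch of how session totals could be derived from this metadata (helper names are illustrative):

```python
def total_cost(metadata: dict) -> float:
    """Sum the dollar-string entries in processing_stats.cost_breakdown."""
    breakdown = metadata["processing_stats"]["cost_breakdown"]
    return round(sum(float(v.lstrip("$")) for v in breakdown.values()), 2)

def total_api_calls(metadata: dict) -> int:
    """Sum the per-service call counts in processing_stats.api_calls."""
    return sum(metadata["processing_stats"]["api_calls"].values())
```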

βš™οΈ Tool Orchestration & API Integration

### Modal Compute Layer

```python
import modal

app = modal.App("comic-generator")  # app name illustrative

# Modal function for SDXL image generation
@app.function(
    image=modal.Image.debian_slim().pip_install("diffusers", "torch"),
    gpu="A10G",
    timeout=300,
)
def generate_comic_panel(prompt: str, style: str) -> bytes:
    # SDXL pipeline with HuggingFace integration (generation code elided);
    # returns the rendered panel as PNG bytes
    ...
```

### Sponsor API Integration

| Service      | Primary Use                    | Secondary Use       |
|--------------|--------------------------------|---------------------|
| OpenAI GPT-4 | Agent reasoning & tool calling | Dialogue generation |
| Mistral      | Code generation & execution    | Style adaptation    |
| HuggingFace  | SDXL model hosting             | Model inference     |
| Modal        | Serverless GPU compute         | Sandbox execution   |

> **Note:** We investigated Mistral's experimental `client.beta.agents` framework for dynamic task routing, but deferred it due to limited stability within the hackathon timeframe.

### LlamaIndex Agent Memory

```python
from llama_index.core.agent import ReActAgent
from llama_index.core.memory import ChatMemoryBuffer

# Agent Brown with persistent memory (tools defined elsewhere)
brown_agent = ReActAgent.from_tools(
    tools=[validation_tool, feedback_tool, assembly_tool],
    memory=ChatMemoryBuffer.from_defaults(token_limit=4000),
    verbose=True,
)
```

## 🌐 Gradio-FastAPI Integration

### Frontend Architecture

```python
import gradio as gr
from fastapi import FastAPI

app = FastAPI()

# Gradio interface with real-time updates
def create_comic_interface():
    with gr.Blocks(theme=gr.themes.Soft()) as demo:
        # Input components
        prompt_input = gr.Textbox(label="Story Prompt")
        style_dropdown = gr.Dropdown(["Studio Ghibli", "Manga", "Western"],
                                     label="Art Style")

        # Real-time status display (note: gr.Progress is passed to event
        # handlers rather than laid out here as a component)
        status_display = gr.Markdown("Ready to generate...")

        # Agent thinking display
        agent_logs = gr.JSON(label="Agent Decision Log", visible=True)

        # Output gallery
        comic_output = gr.Gallery(label="Generated Comic Panels")

        # Hook for real-time updates (handler defined elsewhere)
        demo.load(setup_websocket_connection)

    return demo
```

### Real-Time Agent Status Updates

- **Agent Thinking Display**: live JSON feed of agent decision-making
- **Progress Tracking**: visual progress bar with stage indicators
- **Error Handling**: graceful failure recovery with user feedback
- **Performance Metrics**: real-time cost and timing information
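
One way to wire these updates, sketched with an `asyncio.Queue`: the pipeline pushes stage events, the UI drains them. The function names and stage labels are illustrative, not the actual handlers:

```python
import asyncio

async def publish_stages(queue, stages):
    # Producer: the pipeline emits one status event per stage
    for i, stage in enumerate(stages, start=1):
        await queue.put({"stage": stage, "progress": i / len(stages)})
    await queue.put(None)  # sentinel: generation finished

async def collect_updates(queue):
    # Consumer: the UI drains events and renders them
    events = []
    while (event := await queue.get()) is not None:
        events.append(event)
    return events

async def demo_status_feed():
    queue = asyncio.Queue()
    stages = ["validating prompt", "generating panels", "assembling comic"]
    _, events = await asyncio.gather(
        publish_stages(queue, stages), collect_updates(queue))
    return events
```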

## 🚀 Deployment Configuration

### Multi-Service Architecture

| Component     | Platform           | Configuration                   |
|---------------|--------------------|---------------------------------|
| Frontend      | HuggingFace Spaces | Gradio 4.0.0, real-time UI      |
| Backend       | Modal Functions    | GPU compute, persistent storage |
| Orchestration | LlamaIndex         | Agent coordination & memory     |

### Environment Variables

```bash
# Required API keys for sponsor integrations
OPENAI_API_KEY=your_openai_key
MISTRAL_API_KEY=your_mistral_key
HF_TOKEN=your_huggingface_token
MODAL_TOKEN_ID=your_modal_id
MODAL_TOKEN_SECRET=your_modal_secret

# Application settings
MAX_ITERATIONS=3
TIMEOUT_SECONDS=300
DEBUG_MODE=false
```
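
The application settings can be read with the documented defaults. A minimal sketch (the `load_settings` helper is illustrative):

```python
import os

def load_settings(env=os.environ):
    """Read application settings, falling back to the documented defaults."""
    return {
        "max_iterations": int(env.get("MAX_ITERATIONS", "3")),
        "timeout_seconds": int(env.get("TIMEOUT_SECONDS", "300")),
        "debug_mode": env.get("DEBUG_MODE", "false").lower() == "true",
    }
```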

## 🔧 Extensibility Framework

### Plugin Architecture

```python
# plugins/base.py
from abc import ABC, abstractmethod

class ContentPlugin(ABC):
    @abstractmethod
    async def generate(self, prompt: str, context: dict) -> dict:
        ...

    @abstractmethod
    def validate(self, content: dict) -> bool:
        ...

# plugins/tts_plugin.py
class TTSPlugin(ContentPlugin):
    async def generate(self, prompt: str, context: dict) -> dict:
        # TTS implementation using sponsor APIs; voice settings arrive
        # via context, audio bytes are returned inside the result dict
        ...

    def validate(self, content: dict) -> bool:
        ...
```

### Agent Extension Points

- **Custom Tools**: easy integration of new AI services
- **Memory Backends**: swappable persistence layers (Redis, PostgreSQL)
- **Validation Rules**: configurable content quality checks
- **Output Formats**: support for video, interactive comics, AR content
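
The memory-backend extension point could look like the sketch below: one interface, with Redis or PostgreSQL variants implementing the same two methods. The class names are illustrative; only the in-memory reference implementation is shown:

```python
from abc import ABC, abstractmethod

class MemoryBackend(ABC):
    """Interface any persistence layer (in-memory, Redis, PostgreSQL) implements."""

    @abstractmethod
    def save(self, session_id: str, state: dict) -> None:
        ...

    @abstractmethod
    def load(self, session_id: str):
        ...

class InMemoryBackend(MemoryBackend):
    # Reference implementation backed by a plain dict
    def __init__(self):
        self._store = {}

    def save(self, session_id, state):
        self._store[session_id] = dict(state)

    def load(self, session_id):
        # Returns None for unknown sessions
        return self._store.get(session_id)
```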

### API Abstraction Layer

```python
# services/ai_service.py
class AIServiceRouter:
    def __init__(self):
        self.providers = {
            "dialogue": OpenAIService(),
            "style": MistralService(),
            "image": HuggingFaceService(),
            "compute": ModalService(),
        }

    async def route_request(self, service_type: str, payload: dict):
        if service_type not in self.providers:
            raise ValueError(f"unknown service type: {service_type}")
        return await self.providers[service_type].process(payload)
```
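
To see the routing pattern in isolation, here is a self-contained sketch with stub providers standing in for the real service classes (`EchoService` and `StubRouter` are illustrative names, not part of the codebase):

```python
import asyncio

class EchoService:
    # Stand-in for OpenAIService / MistralService / etc.
    def __init__(self, name):
        self.name = name

    async def process(self, payload):
        return {"provider": self.name, "result": payload}

class StubRouter:
    def __init__(self, providers):
        self.providers = providers

    async def route_request(self, service_type, payload):
        if service_type not in self.providers:
            raise ValueError(f"unknown service type: {service_type}")
        return await self.providers[service_type].process(payload)

async def demo_routing():
    router = StubRouter({"dialogue": EchoService("openai"),
                         "image": EchoService("huggingface")})
    return await router.route_request("dialogue", {"prompt": "panel 1 dialogue"})
```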

## 📊 Performance & Monitoring

### Metrics Collection

- **Agent Performance**: decision time, iteration counts, success rates
- **API Usage**: cost tracking, rate limiting, error rates
- **User Experience**: generation time, satisfaction scores
- **System Health**: resource utilization, error logs

### Cost Optimization

- **Smart Caching**: reuse similar generations across sessions
- **Batch Processing**: group API calls for efficiency
- **Fallback Strategies**: graceful degradation when services are unavailable
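
A minimal sketch of the caching idea: key each generation by a hash of its request so identical prompts across sessions skip the expensive call. The class and method names are illustrative, not the production cache:

```python
import hashlib
import json

class GenerationCache:
    """Cache generations keyed by a hash of the canonicalized request."""

    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(request: dict) -> str:
        # Canonical JSON so key order in the request dict doesn't matter
        canonical = json.dumps(request, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def get_or_generate(self, request, generate):
        key = self._key(request)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        result = generate(request)  # expensive call, e.g. Modal/SDXL
        self._cache[key] = result
        return result
```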