
Architecture Overview

System Design Philosophy

Cora is built on three core principles:

  1. Graceful Degradation: Never fail completely; always serve a visual result
  2. RAG over Fine-Tuning: Use museum archives to provide context without costly training
  3. Hybrid Intelligence: Combine AI generation with curated historical data

Component Architecture

Layer 1: Interface

  • UI (Gradio): ui.py - Testing/demo interface
  • Etymology API (FastAPI): etymology_api.py - Production integration endpoint

Layer 2: Generation Pipeline

CoraCurator → CoraEngine → CoraVision → CoraMemory
   (LLM)       (SDXL)       (CLIP)      (ChromaDB)

Layer 3: Data Sources

  • Primary: Hugging Face Inference API (SDXL-Lightning)
  • Fallback: Museum Archives (Smithsonian + Met)

Data Flow

Generation Request Flow

1. User Request
   ↓
2. Curator: Refine prompt with LLM
   ↓
3. Engine: Attempt SDXL generation
   ├─ Success → Continue to step 4
   └─ 402 Error → RAG Fallback
       ↓
       Search Memory by embedding
       ↓
       Return museum artifact
   ↓
4. Vision: Generate embedding + tags
   ↓
5. Memory: Archive for future retrieval
   ↓
6. Response: Image URL + metadata
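The fallback branch in step 3 can be sketched as below. This is a minimal illustration, not the project's actual code: `PaymentRequiredError`, `IllustrationResult`, `illustrate`, and the two injected callables are hypothetical names standing in for the engine call and the memory search.

```python
from dataclasses import dataclass
from typing import Callable


class PaymentRequiredError(Exception):
    """Hypothetical error standing in for an HTTP 402 from the inference API."""


@dataclass
class IllustrationResult:
    image_path: str
    source: str  # "generated" or "archive"


def illustrate(
    prompt: str,
    generate: Callable[[str], str],
    search_archive: Callable[[str], str],
) -> IllustrationResult:
    """Try generation first; on a 402-style failure, fall back to the archive."""
    try:
        return IllustrationResult(generate(prompt), source="generated")
    except PaymentRequiredError:
        # RAG fallback: serve the closest museum artifact instead of failing.
        return IllustrationResult(search_archive(prompt), source="archive")


def always_402(prompt: str) -> str:
    """Stub generator that simulates an exhausted free-tier quota."""
    raise PaymentRequiredError("quota exhausted")
```

Calling `illustrate("roman armor", always_402, lambda p: "met_123.jpg")` returns a result whose `source` is "archive", which is the behaviour the Graceful Degradation principle asks for.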

Ingestion Flow (Museums)

1. Loader (smithsonian_loader.py or met_loader.py)
   ↓
2. API Query → Download images
   ↓
3. Vision: Generate embedding + detect tags
   ↓
4. Memory: Index with metadata
   ↓
5. Persistent storage in ChromaDB
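Steps 3 to 5 of the ingestion flow can be sketched as a single loop. The names `InMemoryIndex` and `ingest`, and the `embed` / `detect_tags` callables, are assumptions made for illustration; the in-memory list only stands in for the ChromaDB collection so the sketch runs without extra dependencies.

```python
from datetime import datetime, timezone


class InMemoryIndex:
    """Stand-in for CoraMemory; the real system persists to ChromaDB."""

    def __init__(self):
        self.records = []

    def add(self, embedding, metadata):
        self.records.append((embedding, metadata))


def ingest(image_paths, query, source_tag, embed, detect_tags, index):
    """Embed, tag, and index each downloaded image; return how many were indexed."""
    count = 0
    for path in image_paths:
        # Detected tags plus a tag naming the museum source.
        tags = list(detect_tags(path)) + [source_tag]
        index.add(
            embed(path),
            {
                "path": path,
                "prompt": query,
                "tags": ",".join(tags),
                "timestamp": datetime.now(timezone.utc).isoformat(),
            },
        )
        count += 1
    return count
```

The metadata keys follow the schema listed under CoraMemory; everything else here is a placeholder for the loader, vision, and memory components.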

Search Strategy

Hybrid Search Algorithm

Input: Query text (e.g., "roman armor")

Process:

  1. Text → Vector: CLIP text encoder
  2. Keyword Detection: Extract cultural markers ("roman", "greek", etc.)
  3. Over-Retrieve: Fetch 3× the requested top-k candidates via semantic search
  4. Filter: Keep only candidates whose tags contain each detected marker (e.g., must contain "roman")
  5. Rank: Return the top-k filtered results, ordered by similarity

Advantage: Tag filtering drops candidates that are semantically close to the query but culturally irrelevant (e.g., Roman Catholic art returned for "roman armor")
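The five steps above can be sketched in plain Python. This is an illustrative version only: the marker set, the function names, and the brute-force cosine ranking are assumptions, and in the real system step 1 is done by CLIP and step 3 by the vector database.

```python
import math

# Illustrative marker list; the full list used by Cora is not given here.
CULTURAL_MARKERS = {"roman", "greek"}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_text, query_vec, records, k=3, over_fetch=3):
    """records: list of (vector, metadata); metadata["tags"] is comma-separated."""
    # Step 2: detect cultural markers in the query text.
    required = {w for w in query_text.lower().split() if w in CULTURAL_MARKERS}
    # Step 3: over-retrieve k * over_fetch candidates by similarity.
    ranked = sorted(records, key=lambda r: cosine(query_vec, r[0]), reverse=True)
    candidates = ranked[: k * over_fetch]
    # Step 4: keep candidates whose tags include every detected marker.
    filtered = [
        meta for _, meta in candidates
        if required <= set(meta["tags"].split(","))
    ]
    # Step 5: candidates are already in similarity order, so truncate to k.
    return filtered[:k]
```

Over-retrieving before filtering matters because the filter can only discard candidates; fetching exactly k first could leave fewer than k results after the tag constraint is applied.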


Model Details

CoraCurator (LLM)

  • Model: meta-llama/Llama-3.2-3B-Instruct
  • Purpose: Prompt refinement
  • System Instruction: Guide toward "Daily Life" or "Epic Dimension" scenes
  • Context: Etymology → Visual description

CoraEngine (Image Gen)

  • Primary Model: ByteDance/SDXL-Lightning
  • Params: guidance_scale=0.0, steps=4
  • Style: Historical Illustration / Strategy Game Art
  • Fallback: RAG → Museum artifacts

CoraVision (Embeddings)

  • CLIP Model: sentence-transformers/clip-ViT-L-14
  • Output: 768-dimensional vectors
  • YOLO: yolov8n.pt for object detection/tagging

CoraMemory (Vector DB)

  • Database: ChromaDB (persistent, local)
  • Storage: ./archive_db
  • Metadata Schema:
    • path: Local file path
    • prompt: Original search query
    • tags: Comma-separated (e.g., "roman,armor,met_museum_open_access")
    • timestamp: ISO format
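Because the schema stores tags as one comma-separated string, code on either side of the database needs to convert between that string and a list. A small sketch follows; the helper names are hypothetical, and the lowercasing and de-duplication are assumptions rather than documented behaviour.

```python
def join_tags(tags):
    """Serialise tags into the comma-separated form used in the metadata."""
    seen = []
    for tag in tags:
        clean = tag.strip().lower()
        # Skip empty entries and duplicates, keeping first-seen order.
        if clean and clean not in seen:
            seen.append(clean)
    return ",".join(seen)


def split_tags(value):
    """Parse the stored string back into a list of tags."""
    return [t for t in value.split(",") if t]
```

For example, `split_tags("roman,armor,met_museum_open_access")` yields the three tags as separate list items, which is the form the tag filter in the hybrid search needs.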

API Design

Etymology API Endpoints

POST /api/v1/generate_illustration

Purpose: Single endpoint for full pipeline

Design Decisions:

  • Returns both image_url and image_base64, so clients can either link to the image or embed it inline
  • Includes source field ("generated" vs "archive")
  • Auto-archives all results for future retrieval
  • CORS-enabled for cross-origin integration
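A response carrying the fields named above might be assembled as in the sketch below. Only `image_url`, `image_base64`, and `source` come from this document; the function name and the validation step are assumptions, and the real endpoint may return additional metadata.

```python
import base64

VALID_SOURCES = ("generated", "archive")


def build_response(image_bytes, image_url, source):
    """Assemble the response body with both a URL and an inline copy of the image."""
    if source not in VALID_SOURCES:
        raise ValueError(f"source must be one of {VALID_SOURCES}, got {source!r}")
    return {
        "image_url": image_url,
        # Base64 text is safe to embed directly in a JSON payload.
        "image_base64": base64.b64encode(image_bytes).decode("ascii"),
        "source": source,
    }
```

Returning both forms costs bandwidth, since base64 inflates the image by roughly a third, so a client that only needs the URL can ignore the inline field.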

GET /api/v1/search_archive

Purpose: Direct access to historical artifacts

Use Case: Browse mode in etymology app

GET /health

Purpose: Monitor component status

Returns:

{
  "status": "healthy",
  "components": {
    "engine": true,
    "curator": true,
    "vision": true,
    "memory": true
  }
}
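A payload of this shape could be derived from per-component flags as sketched here. The function name is hypothetical, and the "degraded" status for a partially failing system is an assumption; this document only shows the all-healthy case.

```python
def health_payload(components):
    """Build the /health body from a mapping of component name to up/down flag."""
    all_up = all(components.values())
    return {
        # "degraded" is an assumed label; only "healthy" appears in the docs.
        "status": "healthy" if all_up else "degraded",
        "components": dict(components),
    }
```

With all four components set to `True`, the function reproduces the example payload above.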

Scaling Considerations

Current Constraints

  • Single Instance: No load balancing
  • Local Storage: ChromaDB in-process
  • API Limits: Hugging Face free tier; HTTP 402 (Payment Required) errors are common

Future Optimizations

  1. Archive Curator (Priority): Intelligent system to manage and curate the museum archive
    • Auto-Tagging: Enhance metadata with historical period, culture, object type
    • Quality Scoring: Rate artifact relevance for different etymology contexts
    • Deduplication: Detect and merge similar artifacts
    • Smart Indexing: Organize by historical timeline, geography, theme
    • Active Curation: Suggest best artifacts for specific words/contexts
    • Gap Analysis: Identify missing periods/cultures and trigger targeted ingestion
  2. Caching: Hash etymology text → serve cached images
  3. Queue System: Celery for async generation
  4. CDN: Serve archive_images/ via CloudFront/similar
  5. Model Hosting: Self-host SDXL on GPU server to avoid 402 errors
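The caching idea (item 2) can be sketched with a content hash as the cache key. This is one possible approach, not a planned design: `cache_key`, `ImageCache`, and the whitespace and case normalisation are all assumptions.

```python
import hashlib


def cache_key(etymology_text):
    """Hash normalised etymology text so trivially different inputs share a key."""
    normalised = " ".join(etymology_text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class ImageCache:
    """In-memory cache mapping etymology hashes to image paths."""

    def __init__(self):
        self._store = {}

    def get_or_create(self, etymology_text, generate):
        key = cache_key(etymology_text)
        if key not in self._store:
            # Only call the (expensive) generator on a cache miss.
            self._store[key] = generate(etymology_text)
        return self._store[key]
```

A cache of this kind would also reduce the number of calls that count against the inference API quota, since repeated requests for the same word never reach the engine.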


Security Notes

API Keys

  • Stored in .env (gitignored)
  • Never exposed in responses or logs

CORS

  • Currently set to allow_origins=["*"] for development
  • Production: Restrict to etymology app domain
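One way to switch between the two settings is to read the allowed origins from the environment. The sketch below is an assumption, not the current configuration: `CORA_ALLOWED_ORIGINS` is a hypothetical variable name, and falling back to the wildcard when it is unset mirrors the development default described above.

```python
import os


def allowed_origins(environ=None):
    """Return the CORS origin list, defaulting to the development wildcard."""
    env = os.environ if environ is None else environ
    # Hypothetical variable: comma-separated list of permitted origins.
    raw = env.get("CORA_ALLOWED_ORIGINS", "")
    origins = [o.strip() for o in raw.split(",") if o.strip()]
    return origins or ["*"]
```

The returned list would be passed as the `allow_origins` argument when registering FastAPI's `CORSMiddleware`. A stricter variant could refuse to start in production when the variable is missing instead of falling back to `["*"]`.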

Static Files

  • archive_images/ served directly via FastAPI
  • No authentication (museum artifacts are public domain)
  • Consider rate limiting for public deployments
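If rate limiting is added, a per-client sliding window is one simple option. The sketch below is illustrative only: the class name and parameters are hypothetical, and the caller supplies the current time so the logic stays easy to test.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)

    def allow(self, client_id, now):
        hits = self._hits[client_id]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

In a deployment, `client_id` would typically be the caller's IP address and `now` would come from `time.monotonic()`. Because this state lives in process memory, it fits the current single-instance setup but would not be shared across multiple instances.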