# Architecture Overview
## System Design Philosophy
Cora is built on three core principles:
1. **Graceful Degradation**: Never fail completely; always serve a visual result
2. **RAG over Fine-Tuning**: Use museum archives to provide context without costly training
3. **Hybrid Intelligence**: Combine AI generation with curated historical data
---
## Component Architecture
### Layer 1: Interface
- **UI (Gradio)**: `ui.py` - Testing/demo interface
- **Etymology API (FastAPI)**: `etymology_api.py` - Production integration endpoint
### Layer 2: Generation Pipeline
```
CoraCurator → CoraEngine → CoraVision → CoraMemory
   (LLM)        (SDXL)       (CLIP)     (ChromaDB)
```
### Layer 3: Data Sources
- **Primary**: Hugging Face Inference API (SDXL-Lightning)
- **Fallback**: Museum Archives (Smithsonian + Met)
---
## Data Flow
### Generation Request Flow
```
1. User Request
2. Curator: Refine prompt with LLM
3. Engine: Attempt SDXL generation
   ├─ Success → Continue to step 4
   └─ 402 Error → RAG Fallback
      Search Memory by embedding
      Return museum artifact
4. Vision: Generate embedding + tags
5. Memory: Archive for future retrieval
6. Response: Image URL + metadata
```
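The fallback branch in step 3 can be sketched as follows. This is a minimal illustration, not Cora's actual code: `PaymentRequiredError`, `Illustration`, `generate_illustration`, and the callables passed in are hypothetical stand-ins for the Curator, Engine, Vision, and Memory components. The sketch also assumes the archive is searched with the refined prompt and that a fallback artifact is not archived again.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class PaymentRequiredError(Exception):
    """Hypothetical error standing in for an HTTP 402 from the inference API."""


@dataclass
class Illustration:
    image_ref: str  # path or URL of the image
    source: str     # "generated" or "archive"


def generate_illustration(
    prompt: str,
    refine: Callable[[str], str],
    generate: Callable[[str], str],
    search_archive: Callable[[str], Optional[str]],
    archive: Callable[[str, str], None],
) -> Illustration:
    refined = refine(prompt)                # step 2: Curator refines the prompt
    try:
        image_ref = generate(refined)       # step 3: Engine attempts generation
    except PaymentRequiredError:
        # RAG fallback: serve the closest museum artifact instead of failing.
        fallback = search_archive(refined)
        if fallback is None:
            raise RuntimeError("generation failed and the archive returned nothing")
        return Illustration(fallback, "archive")
    archive(image_ref, refined)             # steps 4-5: embed, tag, and index
    return Illustration(image_ref, "generated")


def failing_engine(_prompt: str) -> str:
    """Simulates the primary model being unavailable."""
    raise PaymentRequiredError("402")


result = generate_illustration(
    "origin of the word 'salary'",
    refine=lambda p: p + ", historical illustration",
    generate=failing_engine,
    search_archive=lambda q: "archive_images/roman_coin.jpg",  # placeholder path
    archive=lambda ref, prompt: None,
)
print(result.source)  # archive
```

Passing the components in as callables keeps the degradation logic testable without network access or a GPU.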
### Ingestion Flow (Museums)
```
1. Loader (smithsonian_loader.py or met_loader.py)
2. API Query → Download images
3. Vision: Generate embedding + detect tags
4. Memory: Index with metadata
5. Persistent storage in ChromaDB
```
---
## Search Strategy
### Hybrid Search Algorithm
**Input:** Query text (e.g., "roman armor")
**Process:**
1. **Text → Vector**: CLIP text encoder
2. **Keyword Detection**: Extract cultural markers ("roman", "greek", etc.)
3. **Over-Retrieve**: Fetch 3× the requested number of candidates (3 × k) via semantic search
4. **Filter**: Keep only candidates whose tags contain the detected marker (e.g., "roman")
5. **Rank**: Return top-k filtered results
**Advantage:** Filters out semantically close but culturally irrelevant matches (e.g., Roman Catholic art returned for a "roman armor" query)
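Steps 2 through 5 can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation: the query vector is passed in precomputed (in Cora it would come from the CLIP text encoder), the archive is an in-memory list rather than ChromaDB, and `CULTURAL_MARKERS`, `cosine`, and `hybrid_search` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence

# Assumed marker vocabulary; the source only names "roman" and "greek".
CULTURAL_MARKERS = {"roman", "greek"}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(
    query_text: str,
    query_vec: Sequence[float],
    items: List[Dict],
    k: int = 3,
    over_retrieve: int = 3,
) -> List[Dict]:
    # Step 2: detect cultural markers in the query text.
    required = {w for w in query_text.lower().split() if w in CULTURAL_MARKERS}
    # Step 3: over-retrieve k * over_retrieve candidates by similarity.
    ranked = sorted(
        items, key=lambda it: cosine(query_vec, it["embedding"]), reverse=True
    )
    candidates = ranked[: k * over_retrieve]
    # Step 4: keep candidates whose comma-separated tags include every marker.
    filtered = [
        it for it in candidates if required <= set(it["tags"].split(","))
    ]
    # Step 5: return the top-k survivors, still ordered by similarity.
    return filtered[:k]


# Toy 2-D embeddings; real CLIP vectors are 768-dimensional.
ITEMS = [
    {"id": "a", "embedding": [1.0, 0.0], "tags": "roman,armor"},
    {"id": "b", "embedding": [0.9, 0.1], "tags": "catholic,painting"},
    {"id": "c", "embedding": [0.0, 1.0], "tags": "greek,vase"},
]

hits = hybrid_search("roman armor", [1.0, 0.0], ITEMS, k=2)
print([it["id"] for it in hits])  # ['a']
```

Item `b` is close to the query in embedding space but lacks the `roman` tag, so the filter drops it. Over-retrieving first leaves enough candidates for the filter to work with before the list is cut to k.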
---
## Model Details
### CoraCurator (LLM)
- **Model**: `meta-llama/Llama-3.2-3B-Instruct`
- **Purpose**: Prompt refinement
- **System Instruction**: Guide toward "Daily Life" or "Epic Dimension" scenes
- **Context**: Etymology → Visual description
### CoraEngine (Image Gen)
- **Primary Model**: `ByteDance/SDXL-Lightning`
- **Params**: `guidance_scale=0.0`, `steps=4`
- **Style**: Historical Illustration / Strategy Game Art
- **Fallback**: RAG → Museum artifacts
### CoraVision (Embeddings)
- **CLIP Model**: `sentence-transformers/clip-ViT-L-14`
- **Output**: 768-dimensional vectors
- **YOLO**: `yolov8n.pt` for object detection/tagging
### CoraMemory (Vector DB)
- **Database**: ChromaDB (persistent, local)
- **Storage**: `./archive_db`
- **Metadata Schema**:
  - `path`: Local file path
  - `prompt`: Original search query
  - `tags`: Comma-separated (e.g., "roman,armor,met_museum_open_access")
  - `timestamp`: ISO format
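A record matching this schema could be built and read back as sketched below. The helper names `build_metadata` and `parse_tags` are hypothetical, and lowercasing tags and using UTC timestamps are assumptions rather than documented behavior.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List


def build_metadata(path: str, prompt: str, tags: Iterable[str]) -> Dict[str, str]:
    """Build a flat, string-valued metadata record for one archived image."""
    # Assumption: tags are trimmed and lowercased before being joined.
    cleaned = [t.strip().lower() for t in tags if t.strip()]
    return {
        "path": path,
        "prompt": prompt,
        "tags": ",".join(cleaned),
        # Assumption: timestamps are recorded in UTC, ISO 8601 format.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def parse_tags(metadata: Dict[str, str]) -> List[str]:
    """Split the comma-separated tag string back into a list."""
    return [t for t in metadata["tags"].split(",") if t]


record = build_metadata(
    "archive_images/example.jpg",  # placeholder path
    "roman armor",
    ["Roman", " armor ", "met_museum_open_access"],
)
print(record["tags"])  # roman,armor,met_museum_open_access
```

Storing tags as one string keeps every metadata value a scalar, at the cost of splitting the string again when filtering.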
---
## API Design
### Etymology API Endpoints
#### POST `/api/v1/generate_illustration`
**Purpose**: Single endpoint for full pipeline
**Design Decisions**:
- Returns both `image_url` and `image_base64` (flexibility)
- Includes `source` field ("generated" vs "archive")
- Auto-archives all results for future retrieval
- CORS-enabled for cross-origin integration
#### GET `/api/v1/search_archive`
**Purpose**: Direct access to historical artifacts
**Use Case**: Browse mode in etymology app
#### GET `/health`
**Purpose**: Monitor component status
**Returns**:
```json
{
  "status": "healthy",
  "components": {
    "engine": true,
    "curator": true,
    "vision": true,
    "memory": true
  }
}
```
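A payload of this shape could be assembled from per-component flags as sketched below. The function name `health_payload` is hypothetical, and the `"degraded"` status is an assumption: the source only shows the `"healthy"` case.

```python
from typing import Dict


def health_payload(components: Dict[str, bool]) -> Dict[str, object]:
    """Aggregate per-component flags into a /health response body."""
    # Assumption: any failing component downgrades the overall status.
    status = "healthy" if all(components.values()) else "degraded"
    return {"status": status, "components": dict(components)}


payload = health_payload(
    {"engine": False, "curator": True, "vision": True, "memory": True}
)
print(payload["status"])  # degraded
```

Reporting each component separately lets a monitor distinguish an unavailable Engine, which the RAG fallback can cover, from an unavailable Memory, which it cannot.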
---
## Scaling Considerations
### Current Constraints
- **Single Instance**: No load balancing
- **Local Storage**: ChromaDB in-process
- **API Limits**: HF free tier (HTTP 402 "Payment Required" errors are common)
### Future Optimizations
1. **Archive Curator (Priority)**: Intelligent system to manage and curate the museum archive
   - **Auto-Tagging**: Enhance metadata with historical period, culture, object type
   - **Quality Scoring**: Rate artifact relevance for different etymology contexts
   - **Deduplication**: Detect and merge similar artifacts
   - **Smart Indexing**: Organize by historical timeline, geography, theme
   - **Active Curation**: Suggest best artifacts for specific words/contexts
   - **Gap Analysis**: Identify missing periods/cultures and trigger targeted ingestion
2. **Caching**: Hash etymology text → serve cached images
3. **Queue System**: Celery for async generation
4. **CDN**: Serve `archive_images/` via CloudFront/similar
5. **Model Hosting**: Self-host SDXL on GPU server to avoid 402 errors
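The caching item in the list above could look roughly like this. It is a sketch of one possible approach, not a planned design: `cache_key` and `IllustrationCache` are hypothetical names, the cache is an in-memory dictionary, and normalizing case and whitespace before hashing is an assumption.

```python
import hashlib
from typing import Callable, Dict


def cache_key(etymology_text: str) -> str:
    """Hash the etymology text so equivalent requests share one key."""
    # Assumption: case and whitespace differences should not bypass the cache.
    normalized = " ".join(etymology_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class IllustrationCache:
    """In-memory cache mapping etymology text to an image reference."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get_or_create(self, text: str, generate: Callable[[str], str]) -> str:
        key = cache_key(text)
        if key not in self._store:
            # Cache miss: run the (expensive) generation pipeline once.
            self._store[key] = generate(text)
        return self._store[key]


cache = IllustrationCache()
calls = []


def fake_generate(text: str) -> str:
    """Stand-in for the generation pipeline; records each invocation."""
    calls.append(text)
    return "archive_images/cached.png"  # placeholder path


cache.get_or_create("Salary: from Latin salarium", fake_generate)
cache.get_or_create("  salary:  from latin SALARIUM ", fake_generate)
print(len(calls))  # 1
```

A cache hit avoids the inference API entirely, which matters when the free-tier limits are the main constraint.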
---
## Security Notes
### API Keys
- Stored in `.env` (gitignored)
- Never exposed in responses or logs
### CORS
- Currently set to `allow_origins=["*"]` for development
- **Production**: Restrict `allow_origins` to the etymology app's domain
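One way to switch between the two modes is to read the allowed origins from the environment, as sketched below. The variable name `CORA_ALLOWED_ORIGINS`, the helper `cors_settings`, and the example domain are assumptions for illustration. The keyword names match FastAPI's `CORSMiddleware` parameters.

```python
import os
from typing import Dict, List, Mapping, Optional


def cors_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, List[str]]:
    """Build CORS keyword arguments from a comma-separated env variable."""
    env = os.environ if env is None else env
    # CORA_ALLOWED_ORIGINS is a hypothetical variable name.
    raw = env.get("CORA_ALLOWED_ORIGINS", "")
    origins = [o.strip() for o in raw.split(",") if o.strip()]
    return {
        # Falls back to the permissive development default when unset.
        "allow_origins": origins or ["*"],
        "allow_methods": ["GET", "POST"],
        "allow_headers": ["*"],
    }


# Wiring it into the app (requires FastAPI, shown as comments only):
# from fastapi.middleware.cors import CORSMiddleware
# app.add_middleware(CORSMiddleware, **cors_settings())

prod = cors_settings({"CORA_ALLOWED_ORIGINS": "https://etymology.example.com"})
print(prod["allow_origins"])  # ['https://etymology.example.com']
```

Keeping the origin list in `.env` alongside the API keys means the production restriction does not require a code change.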
### Static Files
- `archive_images/` served directly via FastAPI
- No authentication (museum artifacts are public domain)
- Consider rate limiting for public deployments