# Architecture Overview

## System Design Philosophy

Cora is built on three core principles:

1. **Graceful Degradation**: Never fail completely; always serve a visual result
2. **RAG over Fine-Tuning**: Use museum archives to provide context without costly training
3. **Hybrid Intelligence**: Combine AI generation with curated historical data

---

## Component Architecture

### Layer 1: Interface
- **UI (Gradio)**: `ui.py` - Testing/demo interface
- **Etymology API (FastAPI)**: `etymology_api.py` - Production integration endpoint

### Layer 2: Generation Pipeline
```
CoraCurator → CoraEngine → CoraVision → CoraMemory
   (LLM)       (SDXL)       (CLIP)      (ChromaDB)
```

### Layer 3: Data Sources
- **Primary**: Hugging Face Inference API (SDXL-Lightning)
- **Fallback**: Museum Archives (Smithsonian + Met)

---

## Data Flow

### Generation Request Flow
```
1. User Request
2. Curator: Refine prompt with LLM
3. Engine: Attempt SDXL generation
   ├─ Success → Continue to step 4
   └─ 402 Error → RAG Fallback
       Search Memory by embedding
       Return museum artifact
4. Vision: Generate embedding + tags
5. Memory: Archive for future retrieval
6. Response: Image URL + metadata
```
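
The branch at step 3 can be sketched as follows. This is a minimal illustration and not the project's actual code: `PaymentRequiredError`, `Illustration`, and `generate_with_fallback` are hypothetical names, and the curator, engine, and memory components are passed in as plain callables. Steps 4 and 5 (embedding and archiving) are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class PaymentRequiredError(Exception):
    """Hypothetical error raised when the inference API answers HTTP 402."""


@dataclass
class Illustration:
    image_path: str
    source: str  # "generated" or "archive"


def generate_with_fallback(
    prompt: str,
    refine: Callable[[str], str],
    generate: Callable[[str], str],
    search_archive: Callable[[str], Optional[str]],
) -> Illustration:
    """Try generation first; on a 402, fall back to the museum archive."""
    refined = refine(prompt)  # step 2: curator refines the prompt
    try:
        # step 3, success branch: the engine returns a path to the new image
        return Illustration(generate(refined), "generated")
    except PaymentRequiredError:
        # step 3, fallback branch: look up the closest archived artifact
        path = search_archive(refined)
        if path is None:
            # An empty archive is the one case this sketch cannot cover.
            raise LookupError("no archive match for prompt")
        return Illustration(path, "archive")
```

Passing the components as callables keeps the fallback logic testable without network access or a GPU.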

### Ingestion Flow (Museums)
```
1. Loader (smithsonian_loader.py or met_loader.py)
2. API Query → Download images
3. Vision: Generate embedding + detect tags
4. Memory: Index with metadata
5. Persistent storage in ChromaDB
```
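
A rough sketch of steps 3 and 4, assuming the loader has already downloaded the images. The function name `ingest`, the record shape (`path`, `query`), and the injected callables are illustrative assumptions; only the metadata keys come from the schema described in this document.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, Iterable, List


def ingest(
    records: Iterable[Dict[str, str]],
    embed: Callable[[str], List[float]],
    detect_tags: Callable[[str], List[str]],
    index: Callable[..., None],
    source_tag: str,
) -> int:
    """Embed, tag, and index each downloaded image; return how many were indexed."""
    count = 0
    for record in records:
        path = record["path"]
        # Detected tags first, then the source tag; dict.fromkeys removes
        # duplicates while preserving order.
        tags = list(dict.fromkeys([*detect_tags(path), source_tag]))
        index(
            embedding=embed(path),
            metadata={
                "path": path,
                "prompt": record["query"],
                "tags": ",".join(tags),
                "timestamp": datetime.now(timezone.utc).isoformat(),
            },
        )
        count += 1
    return count
```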

---

## Search Strategy

### Hybrid Search Algorithm

**Input:** Query text (e.g., "roman armor")

**Process:**
1. **Text → Vector**: CLIP text encoder
2. **Keyword Detection**: Extract cultural markers ("roman", "greek", etc.)
3. **Over-Retrieve**: Fetch 3× the requested number of candidates (3·k) via semantic search
4. **Filter**: Apply tag constraints (must contain "roman")
5. **Rank**: Return top-k filtered results

**Advantage:** Tag filtering removes semantically close but off-topic matches (e.g., "Roman Catholic art" returned for the query "roman armor")
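
The steps above could look roughly like this. It is a sketch under assumptions: `CULTURAL_MARKERS` is an illustrative list (only "roman" and "greek" appear in this document), `semantic_search` stands in for the vector-store query and is assumed to return candidates best-first, and each candidate carries a comma-separated `tags` string as in the metadata schema.

```python
from typing import Callable, Dict, List

# Illustrative marker list; the real set of cultural markers is not specified here.
CULTURAL_MARKERS = ("roman", "greek", "egyptian")


def detect_markers(query: str) -> List[str]:
    """Return the cultural markers that appear as whole words in the query."""
    words = query.lower().split()
    return [marker for marker in CULTURAL_MARKERS if marker in words]


def hybrid_search(
    query: str,
    semantic_search: Callable[[str, int], List[Dict[str, str]]],
    k: int = 3,
    over_retrieve: int = 3,
) -> List[Dict[str, str]]:
    """Over-retrieve semantically, keep candidates carrying every marker, return top k."""
    required = detect_markers(query)
    # Ask for k * over_retrieve candidates so filtering still leaves enough results.
    candidates = semantic_search(query, k * over_retrieve)
    kept = [
        candidate
        for candidate in candidates
        if all(tag in candidate["tags"].split(",") for tag in required)
    ]
    # Candidates arrive ranked, so slicing preserves the semantic order.
    return kept[:k]
```

When the query contains no marker, `required` is empty and the function reduces to plain semantic search.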

---

## Model Details

### CoraCurator (LLM)
- **Model**: `meta-llama/Llama-3.2-3B-Instruct`
- **Purpose**: Prompt refinement
- **System Instruction**: Guide toward "Daily Life" or "Epic Dimension" scenes
- **Context**: Etymology → Visual description

### CoraEngine (Image Gen)
- **Primary Model**: `ByteDance/SDXL-Lightning`
- **Params**: `guidance_scale=0.0`, `steps=4`
- **Style**: Historical Illustration / Strategy Game Art
- **Fallback**: RAG → Museum artifacts

### CoraVision (Embeddings)
- **CLIP Model**: `sentence-transformers/clip-ViT-L-14`
- **Output**: 768-dimensional vectors
- **YOLO**: `yolov8n.pt` for object detection/tagging

### CoraMemory (Vector DB)
- **Database**: ChromaDB (persistent, local)
- **Storage**: `./archive_db`
- **Metadata Schema**:
  - `path`: Local file path
  - `prompt`: Original search query
  - `tags`: Comma-separated (e.g., "roman,armor,met_museum_open_access")
  - `timestamp`: ISO format

---



## API Design

### Etymology API Endpoints

#### POST `/api/v1/generate_illustration`
**Purpose**: Single endpoint for full pipeline

**Design Decisions**:
- Returns both `image_url` and `image_base64` (flexibility)
- Includes `source` field ("generated" vs "archive")
- Auto-archives all results for future retrieval
- CORS-enabled for cross-origin integration
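
As an illustration of the response shape, the sketch below builds a payload with the three fields named above. The helper `build_response`, the dataclass, and the `/archive_images` URL prefix are assumptions for the example; the actual response model may carry more fields.

```python
import base64
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class IllustrationResponse:
    image_url: str      # link the client can load directly
    image_base64: str   # same image inlined, for clients that cannot fetch URLs
    source: str         # "generated" or "archive"


def build_response(
    image_bytes: bytes,
    filename: str,
    source: str,
    base_url: str = "/archive_images",  # assumed static mount point
) -> Dict[str, str]:
    """Package an image as both a URL and a base64 string."""
    if source not in ("generated", "archive"):
        raise ValueError(f"unknown source: {source!r}")
    response = IllustrationResponse(
        image_url=f"{base_url}/{filename}",
        image_base64=base64.b64encode(image_bytes).decode("ascii"),
        source=source,
    )
    return asdict(response)
```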

#### GET `/api/v1/search_archive`
**Purpose**: Direct access to historical artifacts

**Use Case**: Browse mode in etymology app

#### GET `/health`
**Purpose**: Monitor component status

**Returns**:
```json
{
  "status": "healthy",
  "components": {
    "engine": true,
    "curator": true,
    "vision": true,
    "memory": true
  }
}
```
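
One way such a payload might be assembled is sketched below. The function `health_report` is hypothetical, and so is the `"degraded"` status: this document only shows the `"healthy"` case, so the value used when a component is missing is an assumption.

```python
from typing import Any, Dict, Optional


def health_report(components: Dict[str, Optional[Any]]) -> Dict[str, Any]:
    """Report each component as True if it was initialized (is not None)."""
    flags = {name: instance is not None for name, instance in components.items()}
    # "degraded" is an assumed label for the partially available case.
    status = "healthy" if all(flags.values()) else "degraded"
    return {"status": status, "components": flags}
```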



---



## Scaling Considerations

### Current Constraints
- **Single Instance**: No load balancing
- **Local Storage**: ChromaDB in-process
- **API Limits**: HF free tier (402 errors common)

### Future Optimizations
1. **Archive Curator (Priority)**: Intelligent system to manage and curate the museum archive
   - **Auto-Tagging**: Enhance metadata with historical period, culture, object type
   - **Quality Scoring**: Rate artifact relevance for different etymology contexts
   - **Deduplication**: Detect and merge similar artifacts
   - **Smart Indexing**: Organize by historical timeline, geography, theme
   - **Active Curation**: Suggest best artifacts for specific words/contexts
   - **Gap Analysis**: Identify missing periods/cultures and trigger targeted ingestion
2. **Caching**: Hash etymology text → serve cached images
3. **Queue System**: Celery for async generation
4. **CDN**: Serve `archive_images/` via CloudFront/similar
5. **Model Hosting**: Self-host SDXL on GPU server to avoid 402 errors
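
The caching idea in item 2 could be sketched as below. Nothing here exists in the codebase yet: `cache_key`, `cached_image_path`, the whitespace and case normalization, and the `.png` naming scheme are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Optional


def cache_key(etymology_text: str) -> str:
    """Hash the normalized etymology text into a stable hex key."""
    # Collapse whitespace and lowercase so trivial variations share one entry.
    normalized = " ".join(etymology_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cached_image_path(etymology_text: str, cache_dir: str) -> Optional[Path]:
    """Return the cached image for this text, or None on a cache miss."""
    candidate = Path(cache_dir) / f"{cache_key(etymology_text)}.png"
    return candidate if candidate.exists() else None
```

A hit skips both the LLM and the image model, which also reduces exposure to 402 errors on the free tier.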

---

## Security Notes

### API Keys
- Stored in `.env` (gitignored)
- Never exposed in responses or logs

### CORS
- Currently set to `allow_origins=["*"]` for development
- **Production**: Restrict to etymology app domain
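
A possible way to switch between the two modes is to read the allowed origins from the environment. The variable name `CORA_ALLOWED_ORIGINS` and the helper `cors_settings` are hypothetical; the returned keys match the keyword arguments accepted by FastAPI's `CORSMiddleware`.

```python
import os
from typing import Dict, List, Mapping, Optional


def cors_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, List[str]]:
    """Build CORS keyword arguments from a comma-separated origin list."""
    env = os.environ if env is None else env
    raw = env.get("CORA_ALLOWED_ORIGINS", "")  # hypothetical variable name
    origins = [origin.strip() for origin in raw.split(",") if origin.strip()]
    return {
        # Falls back to the permissive development default when unset.
        "allow_origins": origins or ["*"],
        "allow_methods": ["GET", "POST"],
        "allow_headers": ["*"],
    }


# Intended usage (requires FastAPI):
#   from fastapi.middleware.cors import CORSMiddleware
#   app.add_middleware(CORSMiddleware, **cors_settings())
```

Because the fallback is the wildcard, a production deployment would need the variable set explicitly.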

### Static Files
- `archive_images/` served directly via FastAPI
- No authentication (museum artifacts are public domain)
- Consider rate limiting for public deployments
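
If rate limiting is added, a small in-process sliding-window limiter may be enough for a single instance. The class below is a generic sketch, not part of the current codebase; the limits and the choice of client identifier (for example, the request IP) are left to the caller.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        """Record the request and return True, or return False if over the limit."""
        now = self._clock()
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._max:
            return False
        hits.append(now)
        return True
```

State lives in memory, so this only works while the service runs as a single instance, which matches the current constraints.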