Big backend rewrite for using HF datasets

Files changed:
- README.md (+19 −12)
- backend/runner/app.py (+7 −15)
- backend/runner/config.py (+53 −14)
- backend/runner/filtering.py (+10 −28)
- backend/runner/inference.py (+51 −114)
- requirements.txt (+2 −1)
README.md

````diff
@@ -12,6 +12,7 @@ models:
 - samwaugh/paintingclip-lora
 datasets:
 - samwaugh/artefact-embeddings
+- samwaugh/artefact-json
 - samwaugh/artefact-markdown
 ---

@@ -46,6 +47,12 @@ datasets:
 - `clip_embeddings.safetensors` (6.39GB) - CLIP model embeddings
 - `paintingclip_embeddings.safetensors` (6.39GB) - PaintingCLIP embeddings
 - `*_sentence_ids.json` (71.7MB each) - Sentence ID mappings
+- **`artefact-json`**: Metadata and structured data
+  - `sentences.json` - 3.1M sentence metadata
+  - `works.json` - 7,200 work records
+  - `creators.json` - Artist/creator mappings
+  - `topics.json` - Topic classifications
+  - `topic_names.json` - Human-readable topic names
 - **`artefact-markdown`**: Source documents and images (planned)
   - 7,200 work directories with markdown files and associated images
   - Organized by work ID for efficient retrieval

@@ -87,7 +94,7 @@ git push hf main:main
 # Force rebuild if needed (use HF Space settings → Factory Reset)
 ```

-##
+## ⚙️ Configuration

 ### **Environment Variables**
 - `STUB_MODE`: Set to `1` for stub responses, `0` for real ML inference

@@ -96,11 +103,11 @@ git push hf main:main
 - `MAX_WORKERS`: Thread pool size for ML inference (default: 2)

 ### **Data Sources**
-The application connects to distributed
+The application automatically connects to distributed Hugging Face datasets:
 - **Embeddings**: `samwaugh/artefact-embeddings` for fast similarity search
-- **
+- **Metadata**: `samwaugh/artefact-json` for sentence, work, and topic information
+- **Documents**: `samwaugh/artefact-markdown` for source documents and context
 - **Models**: Local `data/models/` directory for ML model weights
-- **Metadata**: Local `data/json_info/` for fast access to sentence and work information

 ## 📊 Data Processing Pipeline

@@ -118,14 +125,12 @@ ArteFact processes a massive corpus of art historical texts:
 data/
 ├── models/
 │   └── PaintingCLIP/        # LoRA fine-tuned weights
-├── embeddings/              # Local cache (if needed)
-├── json_info/               # Metadata files
-│   ├── sentences.json       # 3.1M sentence metadata
-│   ├── works.json           # 7,200 work records
-│   ├── creators.json        # Artist/creator mappings
-│   ├── topics.json          # Topic classifications
-│   └── topic_names.json     # Human-readable topic names
 └── marker_output/           # Document analysis outputs
+
+# Data hosted on Hugging Face Hub:
+# - samwaugh/artefact-embeddings: 12.8GB embeddings
+# - samwaugh/artefact-json: Metadata files
+# - samwaugh/artefact-markdown: Source documents
 ```

@@ -162,6 +167,7 @@ data/
 - **Memory-Optimized Inference**: Caching and batch processing
 - **Real-Time Analysis**: Sub-second response times for similarity search
 - **Scalable Architecture**: Designed for production deployment
+- **Distributed Data**: Hugging Face datasets for scalable data management

 ### **Academic Applications**
 - **Art Historical Research**: Discover connections across large corpora

@@ -203,8 +209,9 @@ This work made use of the facilities of the N8 Centre of Excellence in Computati
 - **Source Code**: [GitHub Repository](https://github.com/sammwaughh/artefact-context)
 - **Research Paper**: [Download PDF](paper/waugh2025artcontext.pdf)
 - **Embeddings Dataset**: [artefact-embeddings on HF](https://huggingface.co/datasets/samwaugh/artefact-embeddings)
+- **JSON Dataset**: [artefact-json on HF](https://huggingface.co/datasets/samwaugh/artefact-json)
 - **Markdown Dataset**: [artefact-markdown on HF](https://huggingface.co/datasets/samwaugh/artefact-markdown) (planned)

 ---

-*ArteFact represents a significant contribution to computational art history, making large-scale scholarly resources accessible through AI-powered visual analysis while maintaining academic rigor and providing transparent explanations of AI decision-making.*
+*ArteFact represents a significant contribution to computational art history, making large-scale scholarly resources accessible through AI-powered visual analysis while maintaining academic rigor and providing transparent explanations of AI decision-making. The application now leverages Hugging Face's distributed data infrastructure for scalable and collaborative research.*
````
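The README lists `STUB_MODE` and `MAX_WORKERS` as environment variables. A minimal sketch of how a runner might parse them with safe defaults — the helper name `read_runner_env` is illustrative, not from the repo:

```python
import os

def read_runner_env(environ=os.environ):
    """Parse runner settings from the environment with safe defaults."""
    # STUB_MODE: "1" -> stub responses, anything else (default "0") -> real ML inference
    stub_mode = environ.get("STUB_MODE", "0") == "1"
    # MAX_WORKERS: thread pool size for ML inference (default: 2)
    try:
        max_workers = int(environ.get("MAX_WORKERS", "2"))
    except ValueError:
        max_workers = 2
    return {"stub_mode": stub_mode, "max_workers": max(1, max_workers)}

settings = read_runner_env({"STUB_MODE": "1"})
print(settings)  # {'stub_mode': True, 'max_workers': 2}
```

Defaulting in code (rather than requiring the variables) matches the Space-deployment flow described above, where a Factory Reset may start the container with a bare environment.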
backend/runner/app.py

```diff
@@ -101,25 +101,17 @@ from .config import (
     MARKER_DIR
 )

+# Import data from config (loaded from HF datasets)
+from .config import sentences, works, creators, topics, topic_names
+
 # --------------------------------------------------------------------------- #
-# Global Data (
+# Global Data (loaded from HF datasets via config)                            #
 # --------------------------------------------------------------------------- #
-
-def _load_json(p, default):
-    try:
-        return json.loads(p.read_text(encoding="utf-8")) if p.is_file() else default
-    except Exception:
-        return default
-
-# Load data/sentences.json into variables (safe for missing files)
-sentences = _load_json(JSON_INFO_DIR / "sentences.json", {})
-works = _load_json(JSON_INFO_DIR / "works.json", {})
-creators = _load_json(JSON_INFO_DIR / "creators.json", {})
-topics = _load_json(JSON_INFO_DIR / "topics.json", {})
-topic_names = _load_json(JSON_INFO_DIR / "topic_names.json", {})
+# Data is now loaded from Hugging Face datasets in config.py
+# No need to load from local files anymore

 # Debug logging for data loading
-print(f"📊 Data loaded:")
+print(f"📊 Data loaded from HF datasets:")
 print(f"📊 Sentences: {len(sentences)} entries")
 print(f"📊 Works: {len(works)} entries")
 print(f"📊 Topics: {len(topics)} entries")
```
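One subtlety in the new `from .config import sentences, ...` pattern: because `load_all_data()` runs when `config` is first imported, the names are already populated by the time app.py imports them. But a `from`-import copies the binding at that moment, so if `load_all_data()` were ever re-run later (rebinding `config.sentences` via `global`), importers would keep the stale object while module-attribute access would see the update. A stand-in sketch — the `config` module below is simulated, not the real one:

```python
import types

# Stand-in for backend.runner.config: a module whose loader rebinds a global.
config = types.ModuleType("config")
config.sentences = {}

def load_all_data():
    # Mirrors the diff's pattern: rebinds the module-level name.
    config.sentences = {"W1_s0001": {"text": "example"}}

sentences = config.sentences   # from-import style: copies the current (empty) dict
load_all_data()                # rebinding happens after the import

print(len(sentences))          # 0 - stale reference held by the importer
print(len(config.sentences))   # 1 - attribute access sees the rebind
```

This is why reload-style data refreshes usually either mutate the existing dicts in place (`sentences.clear(); sentences.update(...)`) or expose accessor functions instead of bare module globals.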
backend/runner/config.py

```diff
@@ -1,10 +1,16 @@
 """
-Unified configuration for
+Unified configuration for Hugging Face datasets integration.
 All runner modules should import from this module instead of defining their own paths.
 """

 import os
 from pathlib import Path
+from datasets import load_dataset
+
+# HF Dataset IDs
+EMBEDDINGS_DATASET = "samwaugh/artefact-embeddings"
+JSON_DATASET = "samwaugh/artefact-json"
+MARKDOWN_DATASET = "samwaugh/artefact-markdown"

 # READ root (repo data - read-only)
 PROJECT_ROOT = Path(__file__).resolve().parents[2]

@@ -35,8 +41,6 @@ print(f"✅ Using WRITE_ROOT: {WRITE_ROOT}")
 print(f"✅ Using READ_ROOT: {DATA_READ_ROOT}")

 # Read-only directories (from repo)
-EMBEDDINGS_DIR = DATA_READ_ROOT / "embeddings"
-JSON_INFO_DIR = DATA_READ_ROOT / "json_info"
 MODELS_DIR = DATA_READ_ROOT / "models"
 MARKER_DIR = DATA_READ_ROOT / "marker_output"

@@ -55,16 +59,51 @@ for dir_path in [OUTPUTS_DIR, ARTIFACTS_DIR]:
     except Exception as e:
         print(f"⚠️ Could not create directory {dir_path}: {e}")

-#
-
-
+# Global data variables (will be populated from HF datasets)
+sentences = {}
+works = {}
+creators = {}
+topics = {}
+topic_names = {}
+
+def load_json_from_hf(dataset_name: str, file_name: str):
+    """Load JSON data from Hugging Face dataset"""
+    try:
+        dataset = load_dataset(dataset_name, split="train")
+        # Access the specific file content
+        return dataset[file_name]
+    except Exception as e:
+        print(f"Failed to load {file_name} from HF: {e}")
+        return None
+
+def load_all_data():
+    """Load all data from Hugging Face datasets"""
+    global sentences, works, creators, topics, topic_names
+
+    print("🔄 Loading data from Hugging Face datasets...")
+
+    sentences = load_json_from_hf(JSON_DATASET, "sentences.json")
+    works = load_json_from_hf(JSON_DATASET, "works.json")
+    creators = load_json_from_hf(JSON_DATASET, "creators.json")
+    topics = load_json_from_hf(JSON_DATASET, "topics.json")
+    topic_names = load_json_from_hf(JSON_DATASET, "topic_names.json")
+
+    # Validate data loading
+    if sentences and works and creators and topics and topic_names:
+        print(f"✅ Successfully loaded data from HF:")
+        print(f"   Sentences: {len(sentences)} entries")
+        print(f"   Works: {len(works)} entries")
+        print(f"   Topics: {len(topics)} entries")
+        print(f"   Creators: {len(creators)} entries")
+        print(f"   Topic names: {len(topic_names)} entries")
+    else:
+        print("⚠️ Some data failed to load from HF datasets")
+        # Fallback to empty dicts to prevent crashes
+        sentences = sentences or {}
+        works = works or {}
+        creators = creators or {}
+        topics = topics or {}
+        topic_names = topic_names or {}

+# Initialize data loading
+load_all_data()
```
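A caveat on `load_json_from_hf` above: `load_dataset(name, split="train")` returns a `Dataset` indexed by column name, not by file name, so `dataset["sentences.json"]` will not fetch that file as written. Per-file JSON from a dataset repo is usually pulled with `huggingface_hub.hf_hub_download(repo_id, filename, repo_type="dataset")` and then parsed. Whichever fetch mechanism is used, the fall-back-to-default shape (the same one the old app.py used for local files) can be sketched offline with the stdlib:

```python
import json
import tempfile
from pathlib import Path

def load_json_or_default(path: Path, default):
    """Load a JSON file, falling back to `default` on any failure."""
    try:
        return json.loads(path.read_text(encoding="utf-8")) if path.is_file() else default
    except Exception:
        return default

with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "works.json"
    good.write_text(json.dumps({"W1": {"title": "Mona Lisa"}}), encoding="utf-8")
    print(load_json_or_default(good, {}))                        # parsed dict
    print(load_json_or_default(Path(tmp) / "missing.json", {}))  # {}
```

Wrapping the downloaded path in this helper keeps the "empty dicts to prevent crashes" behaviour from `load_all_data` in one place.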
backend/runner/filtering.py

```diff
@@ -2,31 +2,13 @@
 Filtering logic for sentence selection based on topics and creators.
 """

-import json
-from pathlib import Path
 from typing import Any, Dict, List, Set

-# Import
-from .config import (
-    SENTENCES_JSON,
-    WORKS_JSON,
-    TOPICS_JSON,
-    CREATORS_JSON
-)
-
-# Load data files
-with open(SENTENCES_JSON, "r", encoding="utf-8") as f:
-    SENTENCES = json.load(f)
-
-with open(WORKS_JSON, "r", encoding="utf-8") as f:
-    WORKS = json.load(f)
-
-with open(TOPICS_JSON, "r", encoding="utf-8") as f:
-    TOPICS = json.load(f)
-
-with open(CREATORS_JSON, "r", encoding="utf-8") as f:
-    CREATORS_MAP = json.load(f)
+# Import data from config (loaded from HF datasets)
+from .config import sentences, works, creators, topics
+
+# Data is now loaded from Hugging Face datasets in config.py
+# No need to load from local files anymore

 def get_filtered_sentence_ids(
     filter_topics: List[str] = None, filter_creators: List[str] = None

@@ -42,7 +24,7 @@ def get_filtered_sentence_ids(
         Set of sentence IDs that match all filters
     """
     # Start with all sentence IDs
-    valid_sentence_ids = set(SENTENCES.keys())
+    valid_sentence_ids = set(sentences.keys())

     # If no filters, return all sentences
     if not filter_topics and not filter_creators:

@@ -56,21 +38,21 @@ def get_filtered_sentence_ids(
         # Using topics.json (topic -> works mapping)
         # For each selected topic, get all works that have it
         for topic_id in filter_topics:
-            if topic_id in TOPICS:
+            if topic_id in topics:
                 # Add all works that have this topic
-                valid_work_ids.update(TOPICS[topic_id])
+                valid_work_ids.update(topics[topic_id])
     else:
         # If no topic filter, all works are valid so far
-        valid_work_ids = set(WORKS.keys())
+        valid_work_ids = set(works.keys())

     # Apply creator filter
     if filter_creators:
         # Direct lookup in creators.json (more efficient)
         creator_work_ids = set()
         for creator_name in filter_creators:
-            if creator_name in CREATORS_MAP:
+            if creator_name in creators:
                 # Get all works by this creator directly from creators.json
-                creator_work_ids.update(CREATORS_MAP[creator_name])
+                creator_work_ids.update(creators[creator_name])

         # Intersect with existing valid_work_ids if topics were filtered
         if filter_topics:
```
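The filter logic above unions works per topic and per creator, then intersects the two sets before mapping back to sentences. A self-contained sketch of that set algebra with toy mappings (the IDs are made up for illustration):

```python
# Toy versions of the mappings loaded from the HF datasets.
topics = {"T1": ["W1", "W2"], "T2": ["W3"]}          # topic -> works
creators = {"Leonardo": ["W1", "W3"]}                # creator -> works
works = {"W1": {}, "W2": {}, "W3": {}}

def filtered_work_ids(filter_topics=None, filter_creators=None):
    """Intersect topic-derived and creator-derived work sets."""
    valid = set(works)
    if filter_topics:
        topic_works = set()
        for t in filter_topics:
            topic_works.update(topics.get(t, []))    # union across selected topics
        valid &= topic_works
    if filter_creators:
        creator_works = set()
        for c in filter_creators:
            creator_works.update(creators.get(c, []))
        valid &= creator_works                       # AND between filter kinds
    return valid

print(sorted(filtered_work_ids(["T1"])))                  # ['W1', 'W2']
print(sorted(filtered_work_ids(["T1"], ["Leonardo"])))    # ['W1']
```

Keeping the two filter kinds as an intersection (topic AND creator) while treating multiple values within one kind as a union matches the behaviour described in the diff's comments.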
backend/runner/inference.py

```diff
@@ -25,19 +25,20 @@ import torch.nn.functional as F
 from peft import PeftModel
 from PIL import Image
 from transformers import CLIPModel, CLIPProcessor
-from
+from datasets import load_dataset

 from .filtering import get_filtered_sentence_ids
 # on-demand Grad-ECLIP & region-aware ranking
 from .heatmap import generate_heatmap
 from .config import (
-    CLIP_EMBEDDINGS_DIR,
-    PAINTINGCLIP_EMBEDDINGS_DIR,
     PAINTINGCLIP_MODEL_DIR,
+    EMBEDDINGS_DATASET,
+    JSON_DATASET,
+    sentences,
+    works,
+    creators,
+    topics,
+    topic_names
 )

 # ─── Configuration ───────────────────────────────────────────────────────────

@@ -47,115 +48,51 @@ MODEL_TYPE: Literal["clip", "paintingclip"] = "paintingclip"
 MODEL_CONFIG = {
     "clip": {
         "model_id": "openai/clip-vit-base-patch32",
-        "embeddings_dir": CLIP_EMBEDDINGS_DIR,
         "use_lora": False,
         "lora_dir": None,
     },
     "paintingclip": {
         "model_id": "openai/clip-vit-base-patch32",
-        "embeddings_dir": PAINTINGCLIP_EMBEDDINGS_DIR,
         "use_lora": True,
         "lora_dir": PAINTINGCLIP_MODEL_DIR,
     },
 }

-# Data paths
-# SENTENCES_JSON = ROOT / "data" / "json_info" / "sentences.json"
-
 # Inference settings
 TOP_K = 25  # Number of results to return
 # ─────────────────────────────────────────────────────────────────────────────

+def load_embeddings_from_hf():
+    """Load embeddings from HF dataset"""
+    try:
+        print(f"🔄 Loading embeddings from {EMBEDDINGS_DATASET}...")
+        dataset = load_dataset(EMBEDDINGS_DATASET, split="train")
+
+        # Load CLIP embeddings
+        clip_embeddings = dataset["clip_embeddings"]
+        clip_sentence_ids = dataset["clip_embeddings_sentence_ids"]
+
+        # Load PaintingCLIP embeddings
+        paintingclip_embeddings = dataset["paintingclip_embeddings"]
+        paintingclip_sentence_ids = dataset["paintingclip_embeddings_sentence_ids"]
+
+        print(f"✅ Successfully loaded embeddings from HF:")
+        print(f"   CLIP: {len(clip_sentence_ids)} embeddings")
+        print(f"   PaintingCLIP: {len(paintingclip_sentence_ids)} embeddings")
+
+        return {
+            "clip": (clip_embeddings, clip_sentence_ids),
+            "paintingclip": (paintingclip_embeddings, paintingclip_sentence_ids)
+        }
+    except Exception as e:
+        print(f"❌ Failed to load embeddings from HF: {e}")
+        return None

-def _load_embeddings(embeddings_dir):
-    """
-    Load pre-computed sentence embeddings from individual .pt files.
-
-    Each embedding file follows the naming convention:
-    - CLIP: {sentence_id}_clip.pt (e.g., W1982215463_s0001_clip.pt)
-    - PaintingCLIP: {sentence_id}_painting_clip.pt (e.g., W1982215463_s0001_painting_clip.pt)
-
-    Args:
-        embeddings_dir: Directory containing individual embedding files
-
-    Returns:
-        embeddings: Stacked tensor of shape (N, embedding_dim)
-        sentence_ids: List of sentence IDs corresponding to each embedding
-
-    Raises:
-        ValueError: If no embedding files are found in the directory
-    """
-    embeddings = []
-    sentence_ids = []
-
-    # Glob all .pt files and sort for consistent ordering
-    pt_files = sorted(embeddings_dir.glob("*.pt"))
-
-    if not pt_files:
-        raise ValueError(
-            f"No embedding files (*.pt) found in {embeddings_dir}. "
-            f"Please ensure embeddings are generated and stored correctly."
-        )
-
-    for pt_file in pt_files:
-        # Extract sentence ID by removing the appropriate suffix based on model type
-        stem = pt_file.stem
-
-        # Remove the suffix based on which embeddings we're loading
-        if "_painting_clip" in stem:
-            # PaintingCLIP embeddings: remove "_painting_clip"
-            sentence_id = stem.replace("_painting_clip", "")
-        elif "_clip" in stem:
-            # Regular CLIP embeddings: remove "_clip"
-            sentence_id = stem.replace("_clip", "")
-        else:
-            # Fallback: use the stem as-is
-            sentence_id = stem
-
-        # Load the embedding tensor
-        embedding = torch.load(pt_file, map_location="cpu", weights_only=True)
-
-        # Handle various storage formats (dict vs direct tensor)
-        if isinstance(embedding, dict):
-            # Try common dictionary keys
-            for key in ["embedding", "embeddings", "features"]:
-                if key in embedding:
-                    embedding = embedding[key]
-                    break
-
-        # Ensure 1D tensor shape
-        if embedding.ndim > 1:
-            embedding = embedding.squeeze()
-
-        # Validate embedding dimension
-        if embedding.ndim != 1:
-            raise ValueError(
-                f"Invalid embedding shape {embedding.shape} in {pt_file}. "
-                f"Expected 1D tensor."
-            )
-
-        embeddings.append(embedding)
-        sentence_ids.append(sentence_id)
-
-    # Stack all embeddings into a single tensor
-    embeddings_tensor = torch.stack(embeddings, dim=0)
-
-    return embeddings_tensor, sentence_ids
-
-
-def _load_sentences_metadata(sentences_path: Path) -> Dict[str, Dict[str, Any]]:
+def _load_sentences_metadata() -> Dict[str, Dict[str, Any]]:
     """
-
-    Args:
-        sentences_path: Path to sentences.json file
-
-    Returns:
-        Dictionary mapping sentence IDs to their metadata
+    Get sentence metadata from global config (loaded from HF datasets).
     """
-    with open(sentences_path, "r", encoding="utf-8") as f:
-        return json.load(f)
-
+    return sentences

 @lru_cache(maxsize=1)
 def _initialize_pipeline():

@@ -164,8 +101,8 @@ def _initialize_pipeline():

     This function loads all heavy resources once and caches them:
     - CLIP model (with optional LoRA adapter)
-    - Pre-computed sentence embeddings
-    - Sentence metadata
+    - Pre-computed sentence embeddings from HF
+    - Sentence metadata from HF

     Returns:
         Tuple of (processor, model, embeddings, sentence_ids, sentences_data, device)

@@ -215,12 +152,16 @@ def _initialize_pipeline():

     model = model.eval()

-    # Load pre-computed embeddings
+    # Load pre-computed embeddings from HF
     try:
+        embeddings_data = load_embeddings_from_hf()
+        if embeddings_data is None:
+            raise ValueError(f"Failed to load embeddings from HF dataset: {EMBEDDINGS_DATASET}")
+
         if MODEL_TYPE == "clip":
-            embeddings, sentence_ids = _load_embeddings(CLIP_EMBEDDINGS_DIR)
+            embeddings, sentence_ids = embeddings_data["clip"]
         else:
-            embeddings, sentence_ids = _load_embeddings(PAINTINGCLIP_EMBEDDINGS_DIR)
+            embeddings, sentence_ids = embeddings_data["paintingclip"]

         if embeddings is None or sentence_ids is None:
             raise ValueError(f"Failed to load embeddings for model type: {MODEL_TYPE}")

@@ -230,16 +171,12 @@ def _initialize_pipeline():
         print(f"❌ Error loading embeddings: {e}")
         raise

-    # Load sentence metadata
-    try:
-        sentences_data = _load_sentences_metadata(SENTENCES_JSON)
-        print(f"📊 Loaded {len(sentences_data)} sentence metadata entries")
-        if sentences_data:
-            sample_key = next(iter(sentences_data.keys()))
-            print(f"📊 Sample sentence data structure: {sentences_data[sample_key]}")
-    except Exception as e:
-        print(f"❌ Error loading sentence metadata: {e}")
-        sentences_data = {}
+    # Get sentence metadata from global config
+    sentences_data = _load_sentences_metadata()
+    print(f"📊 Loaded {len(sentences_data)} sentence metadata entries")
+    if sentences_data:
+        sample_key = next(iter(sentences_data.keys()))
+        print(f"📊 Sample sentence data structure: {sentences_data[sample_key]}")

     return processor, model, embeddings, sentence_ids, sentences_data, device
```
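Once embeddings and their sentence IDs are in memory, the retrieval step reduces to cosine similarity against the query image embedding followed by a top-k cut (the diff keeps `TOP_K = 25`). A dependency-free sketch of that ranking step, using toy 3-d vectors rather than real CLIP embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_sentences(query_emb, embeddings, sentence_ids, top_k=25):
    """Return (sentence_id, score) pairs sorted by descending similarity."""
    scored = [(sid, cosine(query_emb, emb))
              for sid, emb in zip(sentence_ids, embeddings)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
ids = ["W1_s0001", "W1_s0002", "W2_s0001"]
top = rank_sentences([1.0, 0.0, 0.0], embs, ids, top_k=2)
print([sid for sid, _ in top])  # ['W1_s0001', 'W2_s0001']
```

In the real pipeline this is done in one batched tensor operation (normalize the stacked embeddings once and take a matrix-vector product), which is what makes the sub-second response times in the README plausible over millions of sentences.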
requirements.txt

```diff
@@ -5,7 +5,8 @@ flask-cors

 # Hugging Face ecosystem
 huggingface_hub>=0.20
-hf_transfer>=0.1.4
+hf_transfer>=0.1.4
+datasets>=2.14.0

 # Core ML libraries
 torch>=2.0.0
```