samwaugh committed
Commit efbac81 · 1 Parent(s): 1f1001b

Big backend rewrite for using HF datasets

README.md CHANGED
@@ -12,6 +12,7 @@ models:
   - samwaugh/paintingclip-lora
 datasets:
   - samwaugh/artefact-embeddings
+  - samwaugh/artefact-json
   - samwaugh/artefact-markdown
 ---
 
@@ -46,6 +47,12 @@ datasets:
   - `clip_embeddings.safetensors` (6.39GB) - CLIP model embeddings
   - `paintingclip_embeddings.safetensors` (6.39GB) - PaintingCLIP embeddings
   - `*_sentence_ids.json` (71.7MB each) - Sentence ID mappings
+- **`artefact-json`**: Metadata and structured data
+  - `sentences.json` - 3.1M sentence metadata
+  - `works.json` - 7,200 work records
+  - `creators.json` - Artist/creator mappings
+  - `topics.json` - Topic classifications
+  - `topic_names.json` - Human-readable topic names
 - **`artefact-markdown`**: Source documents and images (planned)
   - 7,200 work directories with markdown files and associated images
   - Organized by work ID for efficient retrieval
@@ -87,7 +94,7 @@ git push hf main:main
 # Force rebuild if needed (use HF Space settings → Factory Reset)
 ```
 
-## Configuration
+## ⚙️ Configuration
 
 ### **Environment Variables**
 - `STUB_MODE`: Set to `1` for stub responses, `0` for real ML inference
@@ -96,11 +103,11 @@ git push hf main:main
 - `MAX_WORKERS`: Thread pool size for ML inference (default: 2)
 
 ### **Data Sources**
-The application connects to distributed data sources:
+The application automatically connects to distributed Hugging Face datasets:
 - **Embeddings**: `samwaugh/artefact-embeddings` for fast similarity search
-- **Markdown**: `samwaugh/artefact-markdown` for source documents and context
+- **Metadata**: `samwaugh/artefact-json` for sentence, work, and topic information
+- **Documents**: `samwaugh/artefact-markdown` for source documents and context
 - **Models**: Local `data/models/` directory for ML model weights
-- **Metadata**: Local `data/json_info/` for fast access to sentence and work information
 
 ## 📊 Data Processing Pipeline
 
@@ -118,14 +125,12 @@ ArteFact processes a massive corpus of art historical texts:
 data/
 ├── models/
 │   └── PaintingCLIP/      # LoRA fine-tuned weights
-├── embeddings/            # Local cache (if needed)
-├── json_info/             # Metadata files
-│   ├── sentences.json     # 3.1M sentence metadata
-│   ├── works.json         # 7,200 work records
-│   ├── creators.json      # Artist/creator mappings
-│   ├── topics.json        # Topic classifications
-│   └── topic_names.json   # Human-readable topic names
 └── marker_output/         # Document analysis outputs
+
+# Data hosted on Hugging Face Hub:
+# - samwaugh/artefact-embeddings: 12.8GB embeddings
+# - samwaugh/artefact-json: Metadata files
+# - samwaugh/artefact-markdown: Source documents
 ```
 
 ## 🧠 AI Models & Features
@@ -162,6 +167,7 @@ data/
 - **Memory-Optimized Inference**: Caching and batch processing
 - **Real-Time Analysis**: Sub-second response times for similarity search
 - **Scalable Architecture**: Designed for production deployment
+- **Distributed Data**: Hugging Face datasets for scalable data management
 
 ### **Academic Applications**
 - **Art Historical Research**: Discover connections across large corpora
@@ -203,8 +209,9 @@ This work made use of the facilities of the N8 Centre of Excellence in Computati
 - **Source Code**: [GitHub Repository](https://github.com/sammwaughh/artefact-context)
 - **Research Paper**: [Download PDF](paper/waugh2025artcontext.pdf)
 - **Embeddings Dataset**: [artefact-embeddings on HF](https://huggingface.co/datasets/samwaugh/artefact-embeddings)
+- **JSON Dataset**: [artefact-json on HF](https://huggingface.co/datasets/samwaugh/artefact-json)
 - **Markdown Dataset**: [artefact-markdown on HF](https://huggingface.co/datasets/samwaugh/artefact-markdown) (planned)
 
 ---
 
-*ArteFact represents a significant contribution to computational art history, making large-scale scholarly resources accessible through AI-powered visual analysis while maintaining academic rigor and providing transparent explanations of AI decision-making.*
+*ArteFact represents a significant contribution to computational art history, making large-scale scholarly resources accessible through AI-powered visual analysis while maintaining academic rigor and providing transparent explanations of AI decision-making. The application now leverages Hugging Face's distributed data infrastructure for scalable and collaborative research.*
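The three data sources the README now lists map directly onto Hub dataset repo IDs. A minimal sketch of resolving a file inside one of those repos to its download URL; the repo IDs are the ones named in the README, and the `resolve/{revision}` path is the standard Hub layout:

```python
# The three distributed data sources named in the README, keyed by role.
DATA_SOURCES = {
    "embeddings": "samwaugh/artefact-embeddings",
    "metadata": "samwaugh/artefact-json",
    "documents": "samwaugh/artefact-markdown",
}

def hub_file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the download URL for a file in a Hub *dataset* repo."""
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{filename}"

print(hub_file_url(DATA_SOURCES["metadata"], "works.json"))
```

In practice the backend fetches these through the `datasets`/`huggingface_hub` libraries rather than raw URLs, but the mapping from role to repo is the same.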
backend/runner/app.py CHANGED
@@ -101,25 +101,17 @@ from .config import (
     MARKER_DIR
 )
 
+# Import data from config (loaded from HF datasets)
+from .config import sentences, works, creators, topics, topic_names
+
 # --------------------------------------------------------------------------- #
-# Global Data (safe loading for Phase 1)                                      #
+# Global Data (loaded from HF datasets via config)                            #
 # --------------------------------------------------------------------------- #
-def _load_json(p: Path, default):
-    """Safely load JSON file, return default if missing or corrupted."""
-    try:
-        return json.loads(p.read_text(encoding="utf-8")) if p.is_file() else default
-    except Exception:
-        return default
-
-# Load data/sentences.json into variables (safe for missing files)
-sentences = _load_json(JSON_INFO_DIR / "sentences.json", {})
-works = _load_json(JSON_INFO_DIR / "works.json", {})
-creators = _load_json(JSON_INFO_DIR / "creators.json", {})
-topics = _load_json(JSON_INFO_DIR / "topics.json", {})
-topic_names = _load_json(JSON_INFO_DIR / "topic_names.json", {})
+# Data is now loaded from Hugging Face datasets in config.py
+# No need to load from local files anymore
 
 # Debug logging for data loading
-print(f"📊 Data loaded:")
+print(f"📊 Data loaded from HF datasets:")
 print(f"📊 Sentences: {len(sentences)} entries")
 print(f"📊 Works: {len(works)} entries")
 print(f"📊 Topics: {len(topics)} entries")
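One subtlety of `from .config import sentences, ...`: a `from`-import copies the current binding, so it only sees data that config.py has already loaded at import time. If the source module later *rebinds* the name (as a `global sentences` assignment does), importers keep the old object; only in-place mutation is shared. A self-contained sketch of the difference (the module name and keys are illustrative):

```python
import types

# Stand-in for backend.runner.config: a module-level dict that a loader rebinds.
config = types.ModuleType("config_demo")
config.sentences = {}

# What `from .config import sentences` does: copy the current binding.
sentences = config.sentences

# A later rebinding in the source module is invisible to the imported name...
config.sentences = {"W1_s0001": {"text": "example"}}
assert sentences == {}

# ...whereas in-place mutation of a shared object is visible to both.
config.sentences = sentences                      # share one dict again
config.sentences.update({"W1_s0001": {"text": "example"}})
assert "W1_s0001" in sentences
```

This works in the committed code because config.py calls `load_all_data()` at module import time, before app.py imports the names.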
backend/runner/config.py CHANGED
@@ -1,10 +1,16 @@
 """
-Unified configuration for data paths in Hugging Face Spaces.
+Unified configuration for Hugging Face datasets integration.
 All runner modules should import from this module instead of defining their own paths.
 """
 
 import os
 from pathlib import Path
+from datasets import load_dataset
+
+# HF Dataset IDs
+EMBEDDINGS_DATASET = "samwaugh/artefact-embeddings"
+JSON_DATASET = "samwaugh/artefact-json"
+MARKDOWN_DATASET = "samwaugh/artefact-markdown"
 
 # READ root (repo data - read-only)
 PROJECT_ROOT = Path(__file__).resolve().parents[2]
@@ -35,8 +41,6 @@ print(f"✅ Using WRITE_ROOT: {WRITE_ROOT}")
 print(f"✅ Using READ_ROOT: {DATA_READ_ROOT}")
 
 # Read-only directories (from repo)
-EMBEDDINGS_DIR = DATA_READ_ROOT / "embeddings"
-JSON_INFO_DIR = DATA_READ_ROOT / "json_info"
 MODELS_DIR = DATA_READ_ROOT / "models"
 MARKER_DIR = DATA_READ_ROOT / "marker_output"
 
@@ -55,16 +59,51 @@ for dir_path in [OUTPUTS_DIR, ARTIFACTS_DIR]:
     except Exception as e:
         print(f"⚠️ Could not create directory {dir_path}: {e}")
 
-# Metadata files
-SENTENCES_JSON = JSON_INFO_DIR / "sentences.json"
-WORKS_JSON = JSON_INFO_DIR / "works.json"
-TOPICS_JSON = JSON_INFO_DIR / "topics.json"
-CREATORS_JSON = JSON_INFO_DIR / "creators.json"
-TOPIC_NAMES_JSON = JSON_INFO_DIR / "topic_names.json"
+# Global data variables (will be populated from HF datasets)
+sentences = {}
+works = {}
+creators = {}
+topics = {}
+topic_names = {}
+
+def load_json_from_hf(dataset_name: str, file_name: str):
+    """Load JSON data from Hugging Face dataset"""
+    try:
+        dataset = load_dataset(dataset_name, split="train")
+        # Access the specific file content
+        return dataset[file_name]
+    except Exception as e:
+        print(f"Failed to load {file_name} from HF: {e}")
+        return None
 
-# Embedding files (lowercase for backend compatibility)
-CLIP_EMBEDDINGS_ST = EMBEDDINGS_DIR / "clip_embeddings.safetensors"
-CLIP_SENTENCE_IDS = EMBEDDINGS_DIR / "clip_embeddings_sentence_ids.json"
+def load_all_data():
+    """Load all data from Hugging Face datasets"""
+    global sentences, works, creators, topics, topic_names
+
+    print("🔄 Loading data from Hugging Face datasets...")
+
+    sentences = load_json_from_hf(JSON_DATASET, "sentences.json")
+    works = load_json_from_hf(JSON_DATASET, "works.json")
+    creators = load_json_from_hf(JSON_DATASET, "creators.json")
+    topics = load_json_from_hf(JSON_DATASET, "topics.json")
+    topic_names = load_json_from_hf(JSON_DATASET, "topic_names.json")
+
+    # Validate data loading
+    if sentences and works and creators and topics and topic_names:
+        print(f"✅ Successfully loaded data from HF:")
+        print(f"   Sentences: {len(sentences)} entries")
+        print(f"   Works: {len(works)} entries")
+        print(f"   Topics: {len(topics)} entries")
+        print(f"   Creators: {len(creators)} entries")
+        print(f"   Topic names: {len(topic_names)} entries")
+    else:
+        print("⚠️ Some data failed to load from HF datasets")
+        # Fallback to empty dicts to prevent crashes
+        sentences = sentences or {}
+        works = works or {}
+        creators = creators or {}
+        topics = topics or {}
+        topic_names = topic_names or {}
 
-PAINTINGCLIP_EMBEDDINGS_ST = EMBEDDINGS_DIR / "paintingclip_embeddings.safetensors"
-PAINTINGCLIP_SENTENCE_IDS = EMBEDDINGS_DIR / "paintingclip_embeddings_sentence_ids.json"
+# Initialize data loading
+load_all_data()
backend/runner/filtering.py CHANGED
@@ -2,31 +2,13 @@
 Filtering logic for sentence selection based on topics and creators.
 """
 
-import json
-from pathlib import Path
 from typing import Any, Dict, List, Set
 
-# Import configuration from unified config module
-from .config import (
-    SENTENCES_JSON,
-    WORKS_JSON,
-    TOPICS_JSON,
-    CREATORS_JSON
-)
-
-# Load data files
-with open(SENTENCES_JSON, "r", encoding="utf-8") as f:
-    SENTENCES = json.load(f)
-
-with open(WORKS_JSON, "r", encoding="utf-8") as f:
-    WORKS = json.load(f)
-
-with open(TOPICS_JSON, "r", encoding="utf-8") as f:
-    TOPICS = json.load(f)
-
-with open(CREATORS_JSON, "r", encoding="utf-8") as f:
-    CREATORS_MAP = json.load(f)
+# Import data from config (loaded from HF datasets)
+from .config import sentences, works, creators, topics
 
+# Data is now loaded from Hugging Face datasets in config.py
+# No need to load from local files anymore
 
 def get_filtered_sentence_ids(
     filter_topics: List[str] = None, filter_creators: List[str] = None
@@ -42,7 +24,7 @@ def get_filtered_sentence_ids(
         Set of sentence IDs that match all filters
     """
     # Start with all sentence IDs
-    valid_sentence_ids = set(SENTENCES.keys())
+    valid_sentence_ids = set(sentences.keys())
 
     # If no filters, return all sentences
     if not filter_topics and not filter_creators:
@@ -56,21 +38,21 @@
         # Using topics.json (topic -> works mapping)
         # For each selected topic, get all works that have it
         for topic_id in filter_topics:
-            if topic_id in TOPICS:
+            if topic_id in topics:
                 # Add all works that have this topic
-                valid_work_ids.update(TOPICS[topic_id])
+                valid_work_ids.update(topics[topic_id])
     else:
         # If no topic filter, all works are valid so far
-        valid_work_ids = set(WORKS.keys())
+        valid_work_ids = set(works.keys())
 
     # Apply creator filter
     if filter_creators:
         # Direct lookup in creators.json (more efficient)
         creator_work_ids = set()
         for creator_name in filter_creators:
-            if creator_name in CREATORS_MAP:
+            if creator_name in creators:
                 # Get all works by this creator directly from creators.json
-                creator_work_ids.update(CREATORS_MAP[creator_name])
+                creator_work_ids.update(creators[creator_name])
 
     # Intersect with existing valid_work_ids if topics were filtered
     if filter_topics:
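The filtering change is a pure rename (module-level dicts instead of locally loaded JSON); the underlying logic is set algebra over two inverted indexes, topic → work IDs and creator → work IDs, where an absent filter means "match everything". A condensed, self-contained sketch of that idea (toy data with hypothetical IDs, not the real indexes):

```python
def filter_work_ids(topics_index, creators_index, all_works,
                    filter_topics=None, filter_creators=None):
    """Intersect topic-filtered and creator-filtered work IDs,
    treating an absent filter as a match-all."""
    valid = set(all_works)
    if filter_topics:
        by_topic = set()
        for t in filter_topics:
            by_topic.update(topics_index.get(t, []))   # union within a filter
        valid &= by_topic                              # intersection across filters
    if filter_creators:
        by_creator = set()
        for c in filter_creators:
            by_creator.update(creators_index.get(c, []))
        valid &= by_creator
    return valid

topics_index = {"T1": ["W1", "W2"], "T2": ["W2", "W3"]}
creators_index = {"Vermeer": ["W2"], "Turner": ["W3"]}
works = ["W1", "W2", "W3"]

assert filter_work_ids(topics_index, creators_index, works, ["T1"], ["Vermeer"]) == {"W2"}
```

Values within one filter are OR-ed (union) while different filters are AND-ed (intersection), matching the `valid_work_ids` handling in `get_filtered_sentence_ids`.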
backend/runner/inference.py CHANGED
@@ -25,19 +25,20 @@ import torch.nn.functional as F
 from peft import PeftModel
 from PIL import Image
 from transformers import CLIPModel, CLIPProcessor
-from safetensors.torch import load_file as st_load_file
+from datasets import load_dataset
 
 from .filtering import get_filtered_sentence_ids
 # on-demand Grad-ECLIP & region-aware ranking
 from .heatmap import generate_heatmap
 from .config import (
-    CLIP_EMBEDDINGS_DIR,
-    PAINTINGCLIP_EMBEDDINGS_DIR,
     PAINTINGCLIP_MODEL_DIR,
-    SENTENCES_JSON,
-    EMBEDDINGS_DIR,
-    CLIP_EMBEDDINGS_ST, CLIP_SENTENCE_IDS,
-    PAINTINGCLIP_EMBEDDINGS_ST, PAINTINGCLIP_SENTENCE_IDS,
+    EMBEDDINGS_DATASET,
+    JSON_DATASET,
+    sentences,
+    works,
+    creators,
+    topics,
+    topic_names
 )
 
 # ─── Configuration ───────────────────────────────────────────────────────────
@@ -47,115 +48,51 @@ MODEL_TYPE: Literal["clip", "paintingclip"] = "paintingclip"
 MODEL_CONFIG = {
     "clip": {
         "model_id": "openai/clip-vit-base-patch32",
-        "embeddings_dir": CLIP_EMBEDDINGS_DIR,
         "use_lora": False,
         "lora_dir": None,
     },
     "paintingclip": {
         "model_id": "openai/clip-vit-base-patch32",
-        "embeddings_dir": PAINTINGCLIP_EMBEDDINGS_DIR,
         "use_lora": True,
        "lora_dir": PAINTINGCLIP_MODEL_DIR,
     },
 }
 
-# Data paths
-# SENTENCES_JSON = ROOT / "data" / "json_info" / "sentences.json"
-
 # Inference settings
 TOP_K = 25  # Number of results to return
 # ─────────────────────────────────────────────────────────────────────────────
 
+def load_embeddings_from_hf():
+    """Load embeddings from HF dataset"""
+    try:
+        print(f"🔍 Loading embeddings from {EMBEDDINGS_DATASET}...")
+        dataset = load_dataset(EMBEDDINGS_DATASET, split="train")
+
+        # Load CLIP embeddings
+        clip_embeddings = dataset["clip_embeddings"]
+        clip_sentence_ids = dataset["clip_embeddings_sentence_ids"]
+
+        # Load PaintingCLIP embeddings
+        paintingclip_embeddings = dataset["paintingclip_embeddings"]
+        paintingclip_sentence_ids = dataset["paintingclip_embeddings_sentence_ids"]
+
+        print(f"✅ Successfully loaded embeddings from HF:")
+        print(f"   CLIP: {len(clip_sentence_ids)} embeddings")
+        print(f"   PaintingCLIP: {len(paintingclip_sentence_ids)} embeddings")
+
+        return {
+            "clip": (clip_embeddings, clip_sentence_ids),
+            "paintingclip": (paintingclip_embeddings, paintingclip_sentence_ids)
+        }
+    except Exception as e:
+        print(f"❌ Failed to load embeddings from HF: {e}")
+        return None
 
-def _load_embeddings(embeddings_dir: Path) -> Tuple[torch.Tensor, List[str]]:
-    """
-    Load pre-computed sentence embeddings from individual .pt files.
-
-    Each embedding file follows the naming convention:
-    - CLIP: {sentence_id}_clip.pt (e.g., W1982215463_s0001_clip.pt)
-    - PaintingCLIP: {sentence_id}_painting_clip.pt (e.g., W1982215463_s0001_painting_clip.pt)
-
-    Args:
-        embeddings_dir: Directory containing individual embedding files
-
-    Returns:
-        embeddings: Stacked tensor of shape (N, embedding_dim)
-        sentence_ids: List of sentence IDs corresponding to each embedding
-
-    Raises:
-        ValueError: If no embedding files are found in the directory
-    """
-    embeddings = []
-    sentence_ids = []
-
-    # Glob all .pt files and sort for consistent ordering
-    pt_files = sorted(embeddings_dir.glob("*.pt"))
-
-    if not pt_files:
-        raise ValueError(
-            f"No embedding files (*.pt) found in {embeddings_dir}. "
-            f"Please ensure embeddings are generated and stored correctly."
-        )
-
-    for pt_file in pt_files:
-        # Extract sentence ID by removing the appropriate suffix based on model type
-        stem = pt_file.stem
-
-        # Remove the suffix based on which embeddings we're loading
-        if "_painting_clip" in stem:
-            # PaintingCLIP embeddings: remove "_painting_clip"
-            sentence_id = stem.replace("_painting_clip", "")
-        elif "_clip" in stem:
-            # Regular CLIP embeddings: remove "_clip"
-            sentence_id = stem.replace("_clip", "")
-        else:
-            # Fallback: use the stem as-is
-            sentence_id = stem
-
-        # Load the embedding tensor
-        embedding = torch.load(pt_file, map_location="cpu", weights_only=True)
-
-        # Handle various storage formats (dict vs direct tensor)
-        if isinstance(embedding, dict):
-            # Try common dictionary keys
-            for key in ["embedding", "embeddings", "features"]:
-                if key in embedding:
-                    embedding = embedding[key]
-                    break
-
-        # Ensure 1D tensor shape
-        if embedding.ndim > 1:
-            embedding = embedding.squeeze()
-
-        # Validate embedding dimension
-        if embedding.ndim != 1:
-            raise ValueError(
-                f"Invalid embedding shape {embedding.shape} in {pt_file}. "
-                f"Expected 1D tensor."
-            )
-
-        embeddings.append(embedding)
-        sentence_ids.append(sentence_id)
-
-    # Stack all embeddings into a single tensor
-    embeddings_tensor = torch.stack(embeddings, dim=0)
-
-    return embeddings_tensor, sentence_ids
-
-
-def _load_sentences_metadata(sentences_path: Path) -> Dict[str, Dict[str, Any]]:
+def _load_sentences_metadata() -> Dict[str, Dict[str, Any]]:
     """
-    Load sentence metadata from sentences.json.
-
-    Args:
-        sentences_path: Path to sentences.json file
-
-    Returns:
-        Dictionary mapping sentence IDs to their metadata
+    Get sentence metadata from global config (loaded from HF datasets).
     """
-    with open(sentences_path, "r", encoding="utf-8") as f:
-        return json.load(f)
-
+    return sentences
 
 @lru_cache(maxsize=1)
 def _initialize_pipeline():
@@ -164,8 +101,8 @@ def _initialize_pipeline():
 
     This function loads all heavy resources once and caches them:
     - CLIP model (with optional LoRA adapter)
-    - Pre-computed sentence embeddings
-    - Sentence metadata
+    - Pre-computed sentence embeddings from HF
+    - Sentence metadata from HF
 
     Returns:
         Tuple of (processor, model, embeddings, sentence_ids, sentences_data, device)
@@ -215,12 +152,16 @@ def _initialize_pipeline():
 
     model = model.eval()
 
-    # Load pre-computed embeddings - USE CONSOLIDATED LOADING
+    # Load pre-computed embeddings from HF
     try:
+        embeddings_data = load_embeddings_from_hf()
+        if embeddings_data is None:
+            raise ValueError(f"Failed to load embeddings from HF dataset: {EMBEDDINGS_DATASET}")
+
         if MODEL_TYPE == "clip":
-            embeddings, sentence_ids = load_embeddings_for_model("clip")
+            embeddings, sentence_ids = embeddings_data["clip"]
         else:
-            embeddings, sentence_ids = load_embeddings_for_model("paintingclip")
+            embeddings, sentence_ids = embeddings_data["paintingclip"]
 
         if embeddings is None or sentence_ids is None:
             raise ValueError(f"Failed to load embeddings for model type: {MODEL_TYPE}")
@@ -230,16 +171,12 @@ def _initialize_pipeline():
         print(f"❌ Error loading embeddings: {e}")
         raise
 
-    # Load sentence metadata
-    try:
-        sentences_data = _load_sentences_metadata(SENTENCES_JSON)
-        print(f"🔍 Loaded {len(sentences_data)} sentence metadata entries")
-        if sentences_data:
-            sample_key = next(iter(sentences_data.keys()))
-            print(f"🔍 Sample sentence data structure: {sentences_data[sample_key]}")
-    except Exception as e:
-        print(f"❌ Error loading sentence metadata: {e}")
-        sentences_data = {}
+    # Get sentence metadata from global config
+    sentences_data = _load_sentences_metadata()
+    print(f"🔍 Loaded {len(sentences_data)} sentence metadata entries")
+    if sentences_data:
+        sample_key = next(iter(sentences_data.keys()))
+        print(f"🔍 Sample sentence data structure: {sentences_data[sample_key]}")
 
     return processor, model, embeddings, sentence_ids, sentences_data, device
 
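Once `_initialize_pipeline` has an embedding matrix and its aligned `sentence_ids`, inference ranks sentences by similarity to the image embedding and returns the `TOP_K` best. A pure-Python sketch of that ranking step; the real code operates on torch tensors of 512-d CLIP features, and the vectors and IDs below are toy values:

```python
import math

def top_k_sentences(query, embeddings, sentence_ids, k=25):
    """Rank sentence embeddings by cosine similarity to a query embedding
    and return the k best sentence IDs (k mirrors TOP_K = 25)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0
    ranked = sorted(zip(embeddings, sentence_ids),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    return [sid for _, sid in ranked[:k]]

# Toy 2-d example (real CLIP embeddings are 512-d):
ids = ["s1", "s2", "s3"]
vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
assert top_k_sentences([1.0, 0.1], vecs, ids, k=2) == ["s1", "s2"]
```

The `get_filtered_sentence_ids` result from filtering.py would simply restrict which `(embedding, sentence_id)` pairs enter this ranking.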
requirements.txt CHANGED
@@ -5,7 +5,8 @@ flask-cors
 
 # Hugging Face ecosystem
 huggingface_hub>=0.20
-hf_transfer>=0.1.4  # ← Add this line
+hf_transfer>=0.1.4
+datasets>=2.14.0
 
 # Core ML libraries
 torch>=2.0.0
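Installing `hf_transfer` alone is not enough: `huggingface_hub` only uses it when the corresponding environment variable is set. A minimal sketch of the opt-in (the variable is the one documented by `huggingface_hub`; the launch command is illustrative):

```shell
# hf_transfer accelerates large downloads (e.g. the 6.39GB safetensors files),
# but huggingface_hub only activates it when this variable is set:
export HF_HUB_ENABLE_HF_TRANSFER=1
echo "$HF_HUB_ENABLE_HF_TRANSFER"
# ...then launch the Space / backend as usual.
```

On a Hugging Face Space this would typically be set in the Space's environment variables rather than in a shell.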