Big backend rewrite for using HF datasets

Files changed:
- README.md (+19 −12)
- backend/runner/app.py (+7 −15)
- backend/runner/config.py (+53 −14)
- backend/runner/filtering.py (+10 −28)
- backend/runner/inference.py (+51 −114)
- requirements.txt (+2 −1)
README.md

````diff
@@ -12,6 +12,7 @@ models:
 - samwaugh/paintingclip-lora
 datasets:
 - samwaugh/artefact-embeddings
+- samwaugh/artefact-json
 - samwaugh/artefact-markdown
 ---

@@ -46,6 +47,12 @@ datasets:
 - `clip_embeddings.safetensors` (6.39GB) - CLIP model embeddings
 - `paintingclip_embeddings.safetensors` (6.39GB) - PaintingCLIP embeddings
 - `*_sentence_ids.json` (71.7MB each) - Sentence ID mappings
+- **`artefact-json`**: Metadata and structured data
+  - `sentences.json` - 3.1M sentence metadata
+  - `works.json` - 7,200 work records
+  - `creators.json` - Artist/creator mappings
+  - `topics.json` - Topic classifications
+  - `topic_names.json` - Human-readable topic names
 - **`artefact-markdown`**: Source documents and images (planned)
   - 7,200 work directories with markdown files and associated images
   - Organized by work ID for efficient retrieval

@@ -87,7 +94,7 @@ git push hf main:main
 # Force rebuild if needed (use HF Space settings → Factory Reset)
 ```

-##
+## ⚙️ Configuration

 ### **Environment Variables**
 - `STUB_MODE`: Set to `1` for stub responses, `0` for real ML inference

@@ -96,11 +103,11 @@ git push hf main:main
 - `MAX_WORKERS`: Thread pool size for ML inference (default: 2)

 ### **Data Sources**
-The application connects to distributed
+The application automatically connects to distributed Hugging Face datasets:
 - **Embeddings**: `samwaugh/artefact-embeddings` for fast similarity search
-- **
+- **Metadata**: `samwaugh/artefact-json` for sentence, work, and topic information
+- **Documents**: `samwaugh/artefact-markdown` for source documents and context
 - **Models**: Local `data/models/` directory for ML model weights
-- **Metadata**: Local `data/json_info/` for fast access to sentence and work information

 ## 📊 Data Processing Pipeline

@@ -118,14 +125,12 @@ ArteFact processes a massive corpus of art historical texts:
 data/
 ├── models/
 │   └── PaintingCLIP/        # LoRA fine-tuned weights
-├── embeddings/              # Local cache (if needed)
-├── json_info/               # Metadata files
-│   ├── sentences.json       # 3.1M sentence metadata
-│   ├── works.json           # 7,200 work records
-│   ├── creators.json        # Artist/creator mappings
-│   ├── topics.json          # Topic classifications
-│   └── topic_names.json     # Human-readable topic names
 └── marker_output/           # Document analysis outputs
+
+# Data hosted on Hugging Face Hub:
+# - samwaugh/artefact-embeddings: 12.8GB embeddings
+# - samwaugh/artefact-json: Metadata files
+# - samwaugh/artefact-markdown: Source documents
 ```

@@ -162,6 +167,7 @@ data/
 - **Memory-Optimized Inference**: Caching and batch processing
 - **Real-Time Analysis**: Sub-second response times for similarity search
 - **Scalable Architecture**: Designed for production deployment
+- **Distributed Data**: Hugging Face datasets for scalable data management

 ### **Academic Applications**
 - **Art Historical Research**: Discover connections across large corpora

@@ -203,8 +209,9 @@ This work made use of the facilities of the N8 Centre of Excellence in Computati
 - **Source Code**: [GitHub Repository](https://github.com/sammwaughh/artefact-context)
 - **Research Paper**: [Download PDF](paper/waugh2025artcontext.pdf)
 - **Embeddings Dataset**: [artefact-embeddings on HF](https://huggingface.co/datasets/samwaugh/artefact-embeddings)
+- **JSON Dataset**: [artefact-json on HF](https://huggingface.co/datasets/samwaugh/artefact-json)
 - **Markdown Dataset**: [artefact-markdown on HF](https://huggingface.co/datasets/samwaugh/artefact-markdown) (planned)

 ---

-*ArteFact represents a significant contribution to computational art history, making large-scale scholarly resources accessible through AI-powered visual analysis while maintaining academic rigor and providing transparent explanations of AI decision-making.*
+*ArteFact represents a significant contribution to computational art history, making large-scale scholarly resources accessible through AI-powered visual analysis while maintaining academic rigor and providing transparent explanations of AI decision-making. The application now leverages Hugging Face's distributed data infrastructure for scalable and collaborative research.*
````
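The README lists `STUB_MODE` and `MAX_WORKERS` as environment variables. A minimal sketch of how a runner might parse them with safe defaults — the helper name `read_runner_env` is illustrative, not from the repo:

```python
import os

def read_runner_env(environ=os.environ):
    """Parse runner settings from the environment with safe defaults."""
    # STUB_MODE: "1" -> stub responses, anything else (default "0") -> real ML inference
    stub_mode = environ.get("STUB_MODE", "0") == "1"
    # MAX_WORKERS: thread pool size for ML inference (default: 2)
    try:
        max_workers = int(environ.get("MAX_WORKERS", "2"))
    except ValueError:
        max_workers = 2
    return {"stub_mode": stub_mode, "max_workers": max(1, max_workers)}

settings = read_runner_env({"STUB_MODE": "1"})
print(settings)  # {'stub_mode': True, 'max_workers': 2}
```

Defaulting in code (rather than requiring the variables) matches the Space-deployment flow described above, where a Factory Reset may start the container with a bare environment.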
backend/runner/app.py

```diff
@@ -101,25 +101,17 @@ from .config import (
     MARKER_DIR
 )

+# Import data from config (loaded from HF datasets)
+from .config import sentences, works, creators, topics, topic_names
+
 # --------------------------------------------------------------------------- #
-# Global Data (
+# Global Data (loaded from HF datasets via config)                            #
 # --------------------------------------------------------------------------- #
-
-def _load_json(p, default):
-    try:
-        return json.loads(p.read_text(encoding="utf-8")) if p.is_file() else default
-    except Exception:
-        return default
-
-# Load data/sentences.json into variables (safe for missing files)
-sentences = _load_json(JSON_INFO_DIR / "sentences.json", {})
-works = _load_json(JSON_INFO_DIR / "works.json", {})
-creators = _load_json(JSON_INFO_DIR / "creators.json", {})
-topics = _load_json(JSON_INFO_DIR / "topics.json", {})
-topic_names = _load_json(JSON_INFO_DIR / "topic_names.json", {})
+# Data is now loaded from Hugging Face datasets in config.py
+# No need to load from local files anymore

 # Debug logging for data loading
-print(f"📊 Data loaded:")
+print(f"📊 Data loaded from HF datasets:")
 print(f"📊 Sentences: {len(sentences)} entries")
 print(f"📊 Works: {len(works)} entries")
 print(f"📊 Topics: {len(topics)} entries")
```
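One subtlety in the new `from .config import sentences, ...` pattern: because `load_all_data()` runs when `config` is first imported, the names are already populated by the time app.py imports them. But a `from`-import copies the binding at that moment, so if `load_all_data()` were ever re-run later (rebinding `config.sentences` via `global`), importers would keep the stale object while module-attribute access would see the update. A stand-in sketch — the `config` module below is simulated, not the real one:

```python
import types

# Stand-in for backend.runner.config: a module whose loader rebinds a global.
config = types.ModuleType("config")
config.sentences = {}

def load_all_data():
    # Mirrors the diff's pattern: rebinds the module-level name.
    config.sentences = {"W1_s0001": {"text": "example"}}

sentences = config.sentences   # from-import style: copies the current (empty) dict
load_all_data()                # rebinding happens after the import

print(len(sentences))          # 0 - stale reference held by the importer
print(len(config.sentences))   # 1 - attribute access sees the rebind
```

This is why reload-style data refreshes usually either mutate the existing dicts in place (`sentences.clear(); sentences.update(...)`) or expose accessor functions instead of bare module globals.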
backend/runner/config.py

```diff
@@ -1,10 +1,16 @@
 """
-Unified configuration for
+Unified configuration for Hugging Face datasets integration.
 All runner modules should import from this module instead of defining their own paths.
 """

 import os
 from pathlib import Path
+from datasets import load_dataset
+
+# HF Dataset IDs
+EMBEDDINGS_DATASET = "samwaugh/artefact-embeddings"
+JSON_DATASET = "samwaugh/artefact-json"
+MARKDOWN_DATASET = "samwaugh/artefact-markdown"

 # READ root (repo data - read-only)
 PROJECT_ROOT = Path(__file__).resolve().parents[2]

@@ -35,8 +41,6 @@ print(f"✅ Using WRITE_ROOT: {WRITE_ROOT}")
 print(f"✅ Using READ_ROOT: {DATA_READ_ROOT}")

 # Read-only directories (from repo)
-EMBEDDINGS_DIR = DATA_READ_ROOT / "embeddings"
-JSON_INFO_DIR = DATA_READ_ROOT / "json_info"
 MODELS_DIR = DATA_READ_ROOT / "models"
 MARKER_DIR = DATA_READ_ROOT / "marker_output"

@@ -55,16 +59,51 @@ for dir_path in [OUTPUTS_DIR, ARTIFACTS_DIR]:
     except Exception as e:
         print(f"⚠️ Could not create directory {dir_path}: {e}")

-#
-
-
+# Global data variables (will be populated from HF datasets)
+sentences = {}
+works = {}
+creators = {}
+topics = {}
+topic_names = {}
+
+def load_json_from_hf(dataset_name: str, file_name: str):
+    """Load JSON data from Hugging Face dataset"""
+    try:
+        dataset = load_dataset(dataset_name, split="train")
+        # Access the specific file content
+        return dataset[file_name]
+    except Exception as e:
+        print(f"Failed to load {file_name} from HF: {e}")
+        return None
+
+def load_all_data():
+    """Load all data from Hugging Face datasets"""
+    global sentences, works, creators, topics, topic_names
+
+    print("🔄 Loading data from Hugging Face datasets...")
+
+    sentences = load_json_from_hf(JSON_DATASET, "sentences.json")
+    works = load_json_from_hf(JSON_DATASET, "works.json")
+    creators = load_json_from_hf(JSON_DATASET, "creators.json")
+    topics = load_json_from_hf(JSON_DATASET, "topics.json")
+    topic_names = load_json_from_hf(JSON_DATASET, "topic_names.json")
+
+    # Validate data loading
+    if sentences and works and creators and topics and topic_names:
+        print(f"✅ Successfully loaded data from HF:")
+        print(f"   Sentences: {len(sentences)} entries")
+        print(f"   Works: {len(works)} entries")
+        print(f"   Topics: {len(topics)} entries")
+        print(f"   Creators: {len(creators)} entries")
+        print(f"   Topic names: {len(topic_names)} entries")
+    else:
+        print("⚠️ Some data failed to load from HF datasets")
+        # Fallback to empty dicts to prevent crashes
+        sentences = sentences or {}
+        works = works or {}
+        creators = creators or {}
+        topics = topics or {}
+        topic_names = topic_names or {}

+# Initialize data loading
+load_all_data()
```
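A caveat on `load_json_from_hf` above: `load_dataset(name, split="train")` returns a `Dataset` indexed by column name, not by file name, so `dataset["sentences.json"]` will not fetch that file as written. Per-file JSON from a dataset repo is usually pulled with `huggingface_hub.hf_hub_download(repo_id, filename, repo_type="dataset")` and then parsed. Whichever fetch mechanism is used, the fall-back-to-default shape (the same one the old app.py used for local files) can be sketched offline with the stdlib:

```python
import json
import tempfile
from pathlib import Path

def load_json_or_default(path: Path, default):
    """Load a JSON file, falling back to `default` on any failure."""
    try:
        return json.loads(path.read_text(encoding="utf-8")) if path.is_file() else default
    except Exception:
        return default

with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "works.json"
    good.write_text(json.dumps({"W1": {"title": "Mona Lisa"}}), encoding="utf-8")
    print(load_json_or_default(good, {}))                        # parsed dict
    print(load_json_or_default(Path(tmp) / "missing.json", {}))  # {}
```

Wrapping the downloaded path in this helper keeps the "empty dicts to prevent crashes" behaviour from `load_all_data` in one place.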
backend/runner/filtering.py

```diff
@@ -2,31 +2,13 @@
 Filtering logic for sentence selection based on topics and creators.
 """

-import json
-from pathlib import Path
 from typing import Any, Dict, List, Set

-# Import
-from .config import (
-    SENTENCES_JSON,
-    WORKS_JSON,
-    TOPICS_JSON,
-    CREATORS_JSON
-)
-
-# Load data files
-with open(SENTENCES_JSON, "r", encoding="utf-8") as f:
-    SENTENCES = json.load(f)
-
-with open(WORKS_JSON, "r", encoding="utf-8") as f:
-    WORKS = json.load(f)
-
-with open(TOPICS_JSON, "r", encoding="utf-8") as f:
-    TOPICS = json.load(f)
-
-with open(CREATORS_JSON, "r", encoding="utf-8") as f:
-    CREATORS_MAP = json.load(f)
+# Import data from config (loaded from HF datasets)
+from .config import sentences, works, creators, topics
+
+# Data is now loaded from Hugging Face datasets in config.py
+# No need to load from local files anymore

 def get_filtered_sentence_ids(
     filter_topics: List[str] = None, filter_creators: List[str] = None

@@ -42,7 +24,7 @@ def get_filtered_sentence_ids(
         Set of sentence IDs that match all filters
     """
     # Start with all sentence IDs
-    valid_sentence_ids = set(SENTENCES.keys())
+    valid_sentence_ids = set(sentences.keys())

     # If no filters, return all sentences
     if not filter_topics and not filter_creators:

@@ -56,21 +38,21 @@ def get_filtered_sentence_ids(
         # Using topics.json (topic -> works mapping)
         # For each selected topic, get all works that have it
         for topic_id in filter_topics:
-            if topic_id in TOPICS:
+            if topic_id in topics:
                 # Add all works that have this topic
-                valid_work_ids.update(TOPICS[topic_id])
+                valid_work_ids.update(topics[topic_id])
     else:
         # If no topic filter, all works are valid so far
-        valid_work_ids = set(WORKS.keys())
+        valid_work_ids = set(works.keys())

     # Apply creator filter
     if filter_creators:
         # Direct lookup in creators.json (more efficient)
         creator_work_ids = set()
         for creator_name in filter_creators:
-            if creator_name in CREATORS_MAP:
+            if creator_name in creators:
                 # Get all works by this creator directly from creators.json
-                creator_work_ids.update(CREATORS_MAP[creator_name])
+                creator_work_ids.update(creators[creator_name])

         # Intersect with existing valid_work_ids if topics were filtered
         if filter_topics:
```
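The filter logic above unions works per topic and per creator, then intersects the two sets before mapping back to sentences. A self-contained sketch of that set algebra with toy mappings (the IDs are made up for illustration):

```python
# Toy versions of the mappings loaded from the HF datasets.
topics = {"T1": ["W1", "W2"], "T2": ["W3"]}          # topic -> works
creators = {"Leonardo": ["W1", "W3"]}                # creator -> works
works = {"W1": {}, "W2": {}, "W3": {}}

def filtered_work_ids(filter_topics=None, filter_creators=None):
    """Intersect topic-derived and creator-derived work sets."""
    valid = set(works)
    if filter_topics:
        topic_works = set()
        for t in filter_topics:
            topic_works.update(topics.get(t, []))    # union across selected topics
        valid &= topic_works
    if filter_creators:
        creator_works = set()
        for c in filter_creators:
            creator_works.update(creators.get(c, []))
        valid &= creator_works                       # AND between filter kinds
    return valid

print(sorted(filtered_work_ids(["T1"])))                  # ['W1', 'W2']
print(sorted(filtered_work_ids(["T1"], ["Leonardo"])))    # ['W1']
```

Keeping the two filter kinds as an intersection (topic AND creator) while treating multiple values within one kind as a union matches the behaviour described in the diff's comments.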
backend/runner/inference.py

```diff
@@ -25,19 +25,20 @@ import torch.nn.functional as F
 from peft import PeftModel
 from PIL import Image
 from transformers import CLIPModel, CLIPProcessor
-from
+from datasets import load_dataset

 from .filtering import get_filtered_sentence_ids
 # on-demand Grad-ECLIP & region-aware ranking
 from .heatmap import generate_heatmap
 from .config import (
-    CLIP_EMBEDDINGS_DIR,
-    PAINTINGCLIP_EMBEDDINGS_DIR,
     PAINTINGCLIP_MODEL_DIR,
+    EMBEDDINGS_DATASET,
+    JSON_DATASET,
+    sentences,
+    works,
+    creators,
+    topics,
+    topic_names
 )

 # ─── Configuration ───────────────────────────────────────────────────────────

@@ -47,115 +48,51 @@ MODEL_TYPE: Literal["clip", "paintingclip"] = "paintingclip"
 MODEL_CONFIG = {
     "clip": {
         "model_id": "openai/clip-vit-base-patch32",
-        "embeddings_dir": CLIP_EMBEDDINGS_DIR,
         "use_lora": False,
         "lora_dir": None,
     },
     "paintingclip": {
         "model_id": "openai/clip-vit-base-patch32",
-        "embeddings_dir": PAINTINGCLIP_EMBEDDINGS_DIR,
         "use_lora": True,
         "lora_dir": PAINTINGCLIP_MODEL_DIR,
     },
 }

-# Data paths
-# SENTENCES_JSON = ROOT / "data" / "json_info" / "sentences.json"
-
 # Inference settings
 TOP_K = 25  # Number of results to return
 # ─────────────────────────────────────────────────────────────────────────────

+def load_embeddings_from_hf():
+    """Load embeddings from HF dataset"""
+    try:
+        print(f"🔄 Loading embeddings from {EMBEDDINGS_DATASET}...")
+        dataset = load_dataset(EMBEDDINGS_DATASET, split="train")
+
+        # Load CLIP embeddings
+        clip_embeddings = dataset["clip_embeddings"]
+        clip_sentence_ids = dataset["clip_embeddings_sentence_ids"]
+
+        # Load PaintingCLIP embeddings
+        paintingclip_embeddings = dataset["paintingclip_embeddings"]
+        paintingclip_sentence_ids = dataset["paintingclip_embeddings_sentence_ids"]
+
+        print(f"✅ Successfully loaded embeddings from HF:")
+        print(f"   CLIP: {len(clip_sentence_ids)} embeddings")
+        print(f"   PaintingCLIP: {len(paintingclip_sentence_ids)} embeddings")
+
+        return {
+            "clip": (clip_embeddings, clip_sentence_ids),
+            "paintingclip": (paintingclip_embeddings, paintingclip_sentence_ids)
+        }
+    except Exception as e:
+        print(f"❌ Failed to load embeddings from HF: {e}")
+        return None

-def _load_embeddings(embeddings_dir):
-    """
-    Load pre-computed sentence embeddings from individual .pt files.
-
-    Each embedding file follows the naming convention:
-    - CLIP: {sentence_id}_clip.pt (e.g., W1982215463_s0001_clip.pt)
-    - PaintingCLIP: {sentence_id}_painting_clip.pt (e.g., W1982215463_s0001_painting_clip.pt)
-
-    Args:
-        embeddings_dir: Directory containing individual embedding files
-
-    Returns:
-        embeddings: Stacked tensor of shape (N, embedding_dim)
-        sentence_ids: List of sentence IDs corresponding to each embedding
-
-    Raises:
-        ValueError: If no embedding files are found in the directory
-    """
-    embeddings = []
-    sentence_ids = []
-
-    # Glob all .pt files and sort for consistent ordering
-    pt_files = sorted(embeddings_dir.glob("*.pt"))
-
-    if not pt_files:
-        raise ValueError(
-            f"No embedding files (*.pt) found in {embeddings_dir}. "
-            f"Please ensure embeddings are generated and stored correctly."
-        )
-
-    for pt_file in pt_files:
-        # Extract sentence ID by removing the appropriate suffix based on model type
-        stem = pt_file.stem
-
-        # Remove the suffix based on which embeddings we're loading
-        if "_painting_clip" in stem:
-            # PaintingCLIP embeddings: remove "_painting_clip"
-            sentence_id = stem.replace("_painting_clip", "")
-        elif "_clip" in stem:
-            # Regular CLIP embeddings: remove "_clip"
-            sentence_id = stem.replace("_clip", "")
-        else:
-            # Fallback: use the stem as-is
-            sentence_id = stem
-
-        # Load the embedding tensor
-        embedding = torch.load(pt_file, map_location="cpu", weights_only=True)
-
-        # Handle various storage formats (dict vs direct tensor)
-        if isinstance(embedding, dict):
-            # Try common dictionary keys
-            for key in ["embedding", "embeddings", "features"]:
-                if key in embedding:
-                    embedding = embedding[key]
-                    break
-
-        # Ensure 1D tensor shape
-        if embedding.ndim > 1:
-            embedding = embedding.squeeze()
-
-        # Validate embedding dimension
-        if embedding.ndim != 1:
-            raise ValueError(
-                f"Invalid embedding shape {embedding.shape} in {pt_file}. "
-                f"Expected 1D tensor."
-            )
-
-        embeddings.append(embedding)
-        sentence_ids.append(sentence_id)
-
-    # Stack all embeddings into a single tensor
-    embeddings_tensor = torch.stack(embeddings, dim=0)
-
-    return embeddings_tensor, sentence_ids
-
-
-def _load_sentences_metadata(sentences_path: Path) -> Dict[str, Dict[str, Any]]:
+def _load_sentences_metadata() -> Dict[str, Dict[str, Any]]:
     """
-
-    Args:
-        sentences_path: Path to sentences.json file
-
-    Returns:
-        Dictionary mapping sentence IDs to their metadata
+    Get sentence metadata from global config (loaded from HF datasets).
     """
-    with open(sentences_path, "r", encoding="utf-8") as f:
-        return json.load(f)
-
+    return sentences

 @lru_cache(maxsize=1)
 def _initialize_pipeline():

@@ -164,8 +101,8 @@ def _initialize_pipeline():

     This function loads all heavy resources once and caches them:
     - CLIP model (with optional LoRA adapter)
-    - Pre-computed sentence embeddings
-    - Sentence metadata
+    - Pre-computed sentence embeddings from HF
+    - Sentence metadata from HF

     Returns:
         Tuple of (processor, model, embeddings, sentence_ids, sentences_data, device)

@@ -215,12 +152,16 @@ def _initialize_pipeline():

     model = model.eval()

-    # Load pre-computed embeddings
+    # Load pre-computed embeddings from HF
     try:
+        embeddings_data = load_embeddings_from_hf()
+        if embeddings_data is None:
+            raise ValueError(f"Failed to load embeddings from HF dataset: {EMBEDDINGS_DATASET}")
+
         if MODEL_TYPE == "clip":
-            embeddings, sentence_ids = _load_embeddings(CLIP_EMBEDDINGS_DIR)
+            embeddings, sentence_ids = embeddings_data["clip"]
         else:
-            embeddings, sentence_ids = _load_embeddings(PAINTINGCLIP_EMBEDDINGS_DIR)
+            embeddings, sentence_ids = embeddings_data["paintingclip"]

         if embeddings is None or sentence_ids is None:
             raise ValueError(f"Failed to load embeddings for model type: {MODEL_TYPE}")

@@ -230,16 +171,12 @@ def _initialize_pipeline():
         print(f"❌ Error loading embeddings: {e}")
         raise

-    # Load sentence metadata
-    try:
-        sentences_data = _load_sentences_metadata(SENTENCES_JSON)
-        print(f"📊 Loaded {len(sentences_data)} sentence metadata entries")
-        if sentences_data:
-            sample_key = next(iter(sentences_data.keys()))
-            print(f"📊 Sample sentence data structure: {sentences_data[sample_key]}")
-    except Exception as e:
-        print(f"❌ Error loading sentence metadata: {e}")
-        sentences_data = {}
+    # Get sentence metadata from global config
+    sentences_data = _load_sentences_metadata()
+    print(f"📊 Loaded {len(sentences_data)} sentence metadata entries")
+    if sentences_data:
+        sample_key = next(iter(sentences_data.keys()))
+        print(f"📊 Sample sentence data structure: {sentences_data[sample_key]}")

     return processor, model, embeddings, sentence_ids, sentences_data, device
```
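Once embeddings and their sentence IDs are in memory, the retrieval step reduces to cosine similarity against the query image embedding followed by a top-k cut (the diff keeps `TOP_K = 25`). A dependency-free sketch of that ranking step, using toy 3-d vectors rather than real CLIP embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_sentences(query_emb, embeddings, sentence_ids, top_k=25):
    """Return (sentence_id, score) pairs sorted by descending similarity."""
    scored = [(sid, cosine(query_emb, emb))
              for sid, emb in zip(sentence_ids, embeddings)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
ids = ["W1_s0001", "W1_s0002", "W2_s0001"]
top = rank_sentences([1.0, 0.0, 0.0], embs, ids, top_k=2)
print([sid for sid, _ in top])  # ['W1_s0001', 'W2_s0001']
```

In the real pipeline this is done in one batched tensor operation (normalize the stacked embeddings once and take a matrix-vector product), which is what makes the sub-second response times in the README plausible over millions of sentences.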
requirements.txt

```diff
@@ -5,7 +5,8 @@ flask-cors

 # Hugging Face ecosystem
 huggingface_hub>=0.20
-hf_transfer>=0.1.4
+hf_transfer>=0.1.4
+datasets>=2.14.0

 # Core ML libraries
 torch>=2.0.0
```