samwaugh committed on
Commit 1f1001b · 1 Parent(s): 6e524f4

New README

Files changed (1):
  1. README.md +169 -69

README.md CHANGED
@@ -11,100 +11,200 @@ models:
  - openai/clip-vit-base-patch32
  - samwaugh/paintingclip-lora
  datasets:
- - samwaugh/artefact-embeddings-clip
- - samwaugh/artefact-embeddings-paintingclip
  ---
 
- # ArteFact — Hugging Face Space

- This branch contains the files required to run the **ArteFact** web app on Hugging Face **Spaces** using Docker.
- The full project documentation lives in the main GitHub repo (`main` branch).

- ## What runs here
- - **Flask server** (`backend/runner/app.py`) serving the SPA from `frontend/` (UI + API share one origin)
- - Built with the provided **Dockerfile**; the app listens on `$PORT` (set by Spaces)
- - **Phase 1**: Stub mode with fake ML responses (`STUB_MODE=1`)
- - **Phase 2**: Full ML inference with CLIP and PaintingCLIP models

- ## Current Status
- - ✅ **Phase 1 Complete**: Basic Flask app with stub responses
- - ✅ **Phase 2 Complete**: Real ML inference with large-scale corpus
- - 🎯 **Dataset**: 3.1M-sentence corpus (~33GB total) processed on Durham University's Bede HPC cluster
 
- ## New: Large-Scale Art History Corpus

- **Now featuring 3.1 million sentences** from art-historical texts, processed through our **ArtContext pipeline** on Durham University's Bede HPC cluster using **Grace Hopper GPUs**. This is one of the largest art-history text corpora available for computational analysis.

- ### **Processing Scale**
- - **Total sentences processed**: 3,119,199
- - **Embedding models**: CLIP + PaintingCLIP
- - **Processing time**: ~12 minutes on Grace Hopper
- - **Total data generated**: ~33GB
- - **GPU**: NVIDIA H100 with 32GB memory

- ### **Data Files**
- - **CLIP embeddings**: 6.2GB safetensors file
- - **PaintingCLIP embeddings**: 6.2GB safetensors file
- - **Updated metadata**: `sentences.json` with embedding status
- - **Marker outputs**: Document analysis results

- ## Deploy / update

  ```bash
- # one-time setup
  git remote add hf https://huggingface.co/spaces/samwaugh/ArteFact

- # deploy this branch to the Space
- git push hf space-clean:main

- # force rebuild if needed
- # (use Hugging Face Space settings → Factory Reset)
  ```

- ## Environment Variables
- - `STUB_MODE`: Set to `1` for stub responses, `0` for real ML
- - `DATA_ROOT`: Data directory path (default: `/data`)
- - `PORT`: Server port (set by Hugging Face)
 
- ## Architecture
- - **Backend**: Flask API with ML inference pipeline
- - **Frontend**: Single-page application (HTML/CSS/JS)
- - **Models**: CLIP base + PaintingCLIP LoRA fine-tune
- - **Data**: **Large-scale embeddings (12.4GB total)** with comprehensive metadata

- ## 🎯 Performance Improvements

- With the new large-scale corpus:
- - **Search quality**: Significantly improved with 3.1M sentences
- - **Coverage**: Broader art-historical context
- - **Efficiency**: Safetensors format for faster loading
- - **Scalability**: Ready for production deployment
 
- ## Data Structure

- The Space now includes:
- - **`data/embeddings/`**: Large-scale sentence vectors (12.4GB total)
-   - `clip_embeddings.safetensors` (6.2GB)
-   - `paintingclip_embeddings.safetensors` (6.2GB)
-   - Sentence ID mapping files
- - **`data/json_info/`**: Metadata for 3.1M sentences
- - **`data/marker_output/`**: Document analysis outputs
 
- ## HPC Pipeline: Bede Cluster Processing

- The **ArtContext pipeline** has been successfully executed on Durham University's **Bede HPC cluster**:

- ### **HPC Job Details**
- - **Partition**: Grace Hopper (gh)
- - **GPU**: NVIDIA H100
- - **Memory**: 32GB
- - **Batch size**: 1,024 sentences
- - **Processing speed**: ~9 batches/second

- ### **Pipeline Outputs**
- All data is now available in this Space for real-time art analysis at scale.
 
- ## Acknowledgements

- **Special thanks to Durham University's Bede HPC cluster** for providing the computational resources needed to process this large-scale art-history corpus using Grace Hopper GPUs.

- This work made use of the facilities of the N8 Centre of Excellence in Computationally Intensive Research (N8 CIR) provided and funded by the N8 research partnership and EPSRC (Grant No. EP/T022167/1). The Centre is coordinated by the Universities of Durham, Manchester and York.
 
  - openai/clip-vit-base-patch32
  - samwaugh/paintingclip-lora
  datasets:
+ - samwaugh/artefact-embeddings
+ - samwaugh/artefact-markdown
  ---
 
+ # ArteFact — Art History AI Research Platform

+ **ArteFact** is a web application that bridges visual art and textual scholarship using AI. By automatically linking visual elements in artworks to scholarly descriptions, it helps researchers, students, and art enthusiasts discover new connections and understand artworks in their broader academic context.
 
+ ## What ArteFact Does

+ - **Upload or select artwork images** and find scholarly passages that describe similar visual elements
+ - **Search by region**: crop specific areas of paintings to find text about those visual details
+ - **Filter results** by art-historical topics or specific creators
+ - **Access scholarly sources** with full citations, DOI links, and BibTeX references
+ - **Generate heatmaps** showing which image regions contribute to text similarity, using Grad-ECLIP

+ ## 🏗️ Architecture Overview

+ ### **Backend: Flask API with ML Pipeline**
+ - **Flask server** (`backend/runner/app.py`) serving the SPA from `frontend/`
+ - **ML Models**: CLIP base + PaintingCLIP LoRA fine-tune
+ - **Inference Engine**: Region-aware analysis with a 7×7 grid overlay
+ - **Background Processing**: Thread-based task queue for ML inference
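A thread-based task queue of the kind listed above can be sketched in a few lines; the names `submit_task`, `task_result`, and `tasks` are illustrative assumptions, not the app's actual API.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

# Minimal sketch of a thread-based task queue for background ML jobs.
# Names here (submit_task, tasks) are illustrative, not the app's real API.
executor = ThreadPoolExecutor(max_workers=2)
tasks = {}  # task_id -> Future

def submit_task(fn, *args):
    """Queue a job and return an id the client can poll later."""
    task_id = str(uuid.uuid4())
    tasks[task_id] = executor.submit(fn, *args)
    return task_id

def task_result(task_id):
    """Block until the job finishes and return its result."""
    return tasks[task_id].result()
```

A request handler would call `submit_task(run_inference, image)` and return the id immediately, letting the client poll for the result.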

+ ### **Frontend: Interactive Web Application**
+ - **Single-page application** with responsive Bootstrap design
+ - **Image Tools**: Upload, crop, edit, and analyze specific regions
+ - **Grid Analysis**: Click-to-analyze 7×7 grid cells for spatial understanding
+ - **Academic Integration**: Full citation management and source verification
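A click on one cell of the 7×7 overlay has to be translated into pixel coordinates before the region can be cropped and analyzed; a minimal sketch (the function name is hypothetical, not the app's actual helper):

```python
def grid_cell_to_box(row, col, width, height, n=7):
    """Map a clicked (row, col) cell of an n-by-n overlay to a pixel box.

    Returns (left, top, right, bottom), the form PIL's Image.crop accepts.
    Illustrative sketch only.
    """
    cell_w = width / n
    cell_h = height / n
    return (
        int(col * cell_w),
        int(row * cell_h),
        int((col + 1) * cell_w),
        int((row + 1) * cell_h),
    )
```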
+
+ ### **Data Architecture: Distributed Hugging Face Datasets**
+ - **`artefact-embeddings`**: Pre-computed sentence embeddings (12.8GB total)
+   - `clip_embeddings.safetensors` (6.39GB): CLIP model embeddings
+   - `paintingclip_embeddings.safetensors` (6.39GB): PaintingCLIP embeddings
+   - `*_sentence_ids.json` (71.7MB each): Sentence ID mappings
+ - **`artefact-markdown`**: Source documents and images (planned)
+   - 7,200 work directories with markdown files and associated images
+   - Organized by work ID for efficient retrieval
+ - **Local Models**: PaintingCLIP LoRA weights in `data/models/PaintingCLIP/`

+ ## 🚀 Getting Started

+ ### **Prerequisites**
+ - Python 3.9+
+ - Docker (for containerized deployment)
+ - Access to the Hugging Face datasets
+
+ ### **Local Development**
  ```bash
+ # Clone the repository
+ git clone https://github.com/sammwaughh/artefact-context.git
+ cd artefact-context
+
+ # Install backend dependencies
+ cd backend
+ pip install -e .
+
+ # Set environment variables
+ export STUB_MODE=1   # Use stub responses for development
+ export DATA_ROOT=./data
+
+ # Run the Flask development server
+ python -m backend.runner.app
+ ```
+
+ ### **Hugging Face Spaces Deployment**
+ ```bash
+ # Add HF Spaces remote
  git remote add hf https://huggingface.co/spaces/samwaugh/ArteFact

+ # Deploy to Space
+ git push hf main:main
+
+ # Force rebuild if needed (use HF Space settings → Factory Reset)
+ ```
+
+ ## Configuration
+
+ ### **Environment Variables**
+ - `STUB_MODE`: Set to `1` for stub responses, `0` for real ML inference
+ - `DATA_ROOT`: Data directory path (default: `/data` for HF Spaces)
+ - `PORT`: Server port (set by Hugging Face Spaces)
+ - `MAX_WORKERS`: Thread pool size for ML inference (default: 2)
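How the server might read these variables at startup; the `/data` and `2` defaults come from the list above, while the `7860` port fallback is an assumption (Spaces sets `PORT` itself).

```python
import os

# Sketch of reading the configuration described above at startup.
# The 7860 fallback is an assumption for local runs; HF Spaces sets PORT.
STUB_MODE = os.environ.get("STUB_MODE", "0") == "1"
DATA_ROOT = os.environ.get("DATA_ROOT", "/data")
PORT = int(os.environ.get("PORT", "7860"))
MAX_WORKERS = int(os.environ.get("MAX_WORKERS", "2"))
```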
+
+ ### **Data Sources**
+ The application connects to distributed data sources:
+ - **Embeddings**: `samwaugh/artefact-embeddings` for fast similarity search
+ - **Markdown**: `samwaugh/artefact-markdown` for source documents and context
+ - **Models**: Local `data/models/` directory for ML model weights
+ - **Metadata**: Local `data/json_info/` for fast access to sentence and work information
104
+
105
+ ## πŸ“Š Data Processing Pipeline
106
 
107
+ ### **ArtContext Research Pipeline**
108
+ ArteFact processes a massive corpus of art historical texts:
109
+
110
+ - **Scale**: 3.1 million sentences from scholarly articles
111
+ - **Processing**: Executed on Durham University's Bede HPC cluster
112
+ - **GPU**: NVIDIA H100 with 32GB memory
113
+ - **Processing Time**: ~12 minutes for full corpus
114
+ - **Output**: Structured embeddings and metadata for real-time analysis
+
+ ### **Data Organization**
+ ```
+ data/
+ ├── models/
+ │   └── PaintingCLIP/      # LoRA fine-tuned weights
+ ├── embeddings/            # Local cache (if needed)
+ ├── json_info/             # Metadata files
+ │   ├── sentences.json     # 3.1M sentence metadata
+ │   ├── works.json         # 7,200 work records
+ │   ├── creators.json      # Artist/creator mappings
+ │   ├── topics.json        # Topic classifications
+ │   └── topic_names.json   # Human-readable topic names
+ └── marker_output/         # Document analysis outputs
  ```

+ ## 🧠 AI Models & Features
+
+ ### **Core Models**
+ - **CLIP**: OpenAI's CLIP-ViT-B/32 for general image-text understanding
+ - **PaintingCLIP**: Fine-tuned version specialized for art-historical content
+ - **Model Switching**: Users can choose between models for different analysis types
+
+ ### **Advanced AI Features**
+ - **Region-Aware Analysis**: 7×7 grid overlay for spatial understanding
+ - **Grad-ECLIP Heatmaps**: Visual explanations of AI decision-making
+ - **Smart Filtering**: Topic- and creator-based result filtering
+ - **Patch-Level Attention**: ViT patch embeddings for detailed analysis
+
+ ## 🎨 User Interface Features
+
+ ### **Image Analysis Tools**
+ - **Drag & Drop Upload**: Easy image input with preview
+ - **Interactive Grid**: Click-to-analyze specific image regions
+ - **Crop & Edit**: Built-in image manipulation tools
+ - **Image History**: Track and compare different analyses

+ ### **Academic Integration**
+ - **Citation Management**: One-click BibTeX copying
+ - **Source Verification**: Direct links to scholarly articles
+ - **Context Preservation**: Full paragraph context for matched sentences
+ - **Work Exploration**: Browse related images and metadata

+ ## 🔬 Research & Development

+ ### **Technical Innovations**
+ - **Efficient Embedding Storage**: Safetensors format for fast loading
+ - **Memory-Optimized Inference**: Caching and batch processing
+ - **Real-Time Analysis**: Sub-second response times for similarity search
+ - **Scalable Architecture**: Designed for production deployment
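Sub-second similarity search over pre-computed embeddings reduces to a normalized matrix-vector product; a sketch under toy data, with all names illustrative:

```python
import numpy as np

def top_k(query_emb, corpus_embs, k=5):
    """Return indices and cosine scores of the k closest sentence embeddings.

    Illustrative sketch of the similarity search described above;
    not the app's actual implementation.
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q          # cosine similarity against every corpus row
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Toy corpus: row 2 points in the same direction as the query.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
idx, scores = top_k(np.array([1.0, 1.0]), corpus, k=1)
print(idx[0])  # 2
```

In production the same product runs against the 3.1M-row embedding matrices loaded from the safetensors files.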

+ ### **Academic Applications**
+ - **Art Historical Research**: Discover connections across large corpora
+ - **Digital Humanities**: Computational analysis of visual-textual relationships
+ - **Educational Tools**: Interactive learning for art history students
+ - **Scholarly Discovery**: AI-powered literature review and citation analysis

+ ## 🤝 Contributing

+ ### **Development Setup**
+ 1. Fork the repository
+ 2. Create a feature branch
+ 3. Install development dependencies: `pip install -e ".[dev]"`
+ 4. Run tests: `pytest backend/tests/`
+ 5. Submit a pull request

+ ### **Data Contributions**
+ - **Embeddings**: Process new art-historical texts
+ - **Models**: Improve fine-tuning and model performance
+ - **Documentation**: Enhance user guides and API documentation

+ ## 📄 License & Acknowledgments

+ **License**: MIT License

+ **Created by**: [Samuel Waugh](https://www.linkedin.com/in/samuel-waugh-31903b1bb/)

+ **Supervised by**: [Dr. Stuart James](https://stuart-james.com), Department of Computer Science, Durham University
+
+ **Supported by**: [N8 Centre of Excellence in Computationally Intensive Research (N8 CIR)](https://n8cir.org.uk/themes/internships/internships-2025/)
+
+ **Special Thanks**: Durham University's Bede HPC cluster, which provided the computational resources needed to process the large-scale art-history corpus on Grace Hopper GPUs.
+
+ This work made use of the facilities of the N8 Centre of Excellence in Computationally Intensive Research (N8 CIR) provided and funded by the N8 research partnership and EPSRC (Grant No. EP/T022167/1). The Centre is coordinated by the Universities of Durham, Manchester and York.
199
+
200
+ ## πŸ”— Links
201
+
202
+ - **Live Application**: [ArteFact on Hugging Face Spaces](https://huggingface.co/spaces/samwaugh/ArteFact)
203
+ - **Source Code**: [GitHub Repository](https://github.com/sammwaughh/artefact-context)
204
+ - **Research Paper**: [Download PDF](paper/waugh2025artcontext.pdf)
205
+ - **Embeddings Dataset**: [artefact-embeddings on HF](https://huggingface.co/datasets/samwaugh/artefact-embeddings)
206
+ - **Markdown Dataset**: [artefact-markdown on HF](https://huggingface.co/datasets/samwaugh/artefact-markdown) (planned)
207
+
208
+ ---
209
 
210
+ *ArteFact represents a significant contribution to computational art history, making large-scale scholarly resources accessible through AI-powered visual analysis while maintaining academic rigor and providing transparent explanations of AI decision-making.*