models:
- openai/clip-vit-base-patch32
- samwaugh/paintingclip-lora
datasets:
- samwaugh/artefact-embeddings
- samwaugh/artefact-markdown
---

# ArteFact – Art History AI Research Platform

**ArteFact** is a web application that bridges visual art and textual scholarship using AI. By automatically linking visual elements in artworks to scholarly descriptions, it helps researchers, students, and art enthusiasts discover new connections and understand artworks in their broader academic context.

## What ArteFact Does

- **Upload or select artwork images** and find scholarly passages that describe similar visual elements
- **Search by region**: crop specific areas of paintings to find text about those visual details
- **Filter results** by art-historical topic or specific creator
- **Access scholarly sources** with full citations, DOI links, and BibTeX references
- **Generate heatmaps** showing which image regions contribute to text similarity using Grad-ECLIP

## Architecture Overview

### **Backend: Flask API with ML Pipeline**

- **Flask server** (`backend/runner/app.py`) serving the SPA from `frontend/`
- **ML Models**: CLIP base + PaintingCLIP LoRA fine-tune
- **Inference Engine**: Region-aware analysis with 7×7 grid overlay
- **Background Processing**: Thread-based task queue for ML inference

### **Frontend: Interactive Web Application**

- **Single-page application** with responsive Bootstrap design
- **Image Tools**: Upload, crop, edit, and analyze specific regions
- **Grid Analysis**: Click-to-analyze 7×7 grid cells for spatial understanding
- **Academic Integration**: Full citation management and source verification
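As a rough illustration of the grid interaction, a clicked cell can be mapped to a pixel region as below. `cell_to_box` is a hypothetical helper; the app's actual mapping may differ:

```python
def cell_to_box(row: int, col: int, width: int, height: int, n: int = 7):
    """Pixel bounds (left, top, right, bottom) of cell (row, col) on an n x n grid."""
    cw, ch = width / n, height / n
    return (round(col * cw), round(row * ch),
            round((col + 1) * cw), round((row + 1) * ch))

# Top-left and bottom-right cells of a 224x224 CLIP input image
top_left = cell_to_box(0, 0, 224, 224)
bottom_right = cell_to_box(6, 6, 224, 224)
```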

### **Data Architecture: Distributed Hugging Face Datasets**

- **`artefact-embeddings`**: Pre-computed sentence embeddings (12.8GB total)
  - `clip_embeddings.safetensors` (6.39GB) - CLIP model embeddings
  - `paintingclip_embeddings.safetensors` (6.39GB) - PaintingCLIP embeddings
  - `*_sentence_ids.json` (71.7MB each) - Sentence ID mappings
- **`artefact-markdown`**: Source documents and images (planned)
  - 7,200 work directories with markdown files and associated images
  - Organized by work ID for efficient retrieval
- **Local Models**: PaintingCLIP LoRA weights in `data/models/PaintingCLIP/`
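With embeddings pre-computed, retrieval reduces to cosine similarity against one matrix. The sketch below uses small random data so it is self-contained; in the Space, the matrix would instead be loaded from `clip_embeddings.safetensors`, and `top_k_sentences` is a hypothetical name:

```python
import numpy as np

def top_k_sentences(query: np.ndarray, embeddings: np.ndarray, k: int = 5):
    """Indices of the k rows most cosine-similar to `query`, best first."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q              # one dot product per stored sentence
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 512)).astype(np.float32)  # stand-in for the 512-d CLIP vectors
query = emb[7] + 0.01 * rng.normal(size=512).astype(np.float32)  # near row 7
idx = top_k_sentences(query, emb, k=3)
```

Because the vectors are computed offline, a runtime query costs one matrix-vector product rather than a model forward pass per sentence.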

## Getting Started

### **Prerequisites**

- Python 3.9+
- Docker (for containerized deployment)
- Access to the Hugging Face datasets

### **Local Development**

```bash
# Clone the repository
git clone https://github.com/sammwaughh/artefact-context.git
cd artefact-context

# Install backend dependencies
cd backend
pip install -e .

# Set environment variables
export STUB_MODE=1  # Use stub responses for development
export DATA_ROOT=./data

# Run the Flask development server
python -m backend.runner.app
```

### **Hugging Face Spaces Deployment**

```bash
# Add HF Spaces remote
git remote add hf https://huggingface.co/spaces/samwaugh/ArteFact

# Deploy to Space
git push hf main:main

# Force rebuild if needed (use HF Space settings → Factory Reset)
```

## Configuration

### **Environment Variables**

- `STUB_MODE`: Set to `1` for stub responses, `0` for real ML inference
- `DATA_ROOT`: Data directory path (default: `/data` for HF Spaces)
- `PORT`: Server port (set by Hugging Face Spaces)
- `MAX_WORKERS`: Thread pool size for ML inference (default: 2)
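Resolved at startup, these settings might look like the following sketch (illustrative only; the `7860` port fallback is an assumption, not documented behaviour):

```python
import os

def load_config(env=None) -> dict:
    """Resolve runtime settings using the defaults documented above."""
    env = os.environ if env is None else env
    return {
        "stub_mode": env.get("STUB_MODE", "0") == "1",
        "data_root": env.get("DATA_ROOT", "/data"),
        "port": int(env.get("PORT", "7860")),  # assumed local fallback
        "max_workers": int(env.get("MAX_WORKERS", "2")),
    }

cfg = load_config({"STUB_MODE": "1", "DATA_ROOT": "./data"})
```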

### **Data Sources**

The application connects to distributed data sources:

- **Embeddings**: `samwaugh/artefact-embeddings` for fast similarity search
- **Markdown**: `samwaugh/artefact-markdown` for source documents and context
- **Models**: Local `data/models/` directory for ML model weights
- **Metadata**: Local `data/json_info/` for fast access to sentence and work information
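Fetching a shard from the embeddings dataset could look like this sketch using `huggingface_hub` (the filename matches the dataset listing above; `fetch_clip_embeddings` is a hypothetical helper, and the app may cache or stream the file differently):

```python
from huggingface_hub import hf_hub_download

def fetch_clip_embeddings(cache_dir=None) -> str:
    """Download (or reuse the cached copy of) the CLIP embeddings file."""
    return hf_hub_download(
        repo_id="samwaugh/artefact-embeddings",
        filename="clip_embeddings.safetensors",
        repo_type="dataset",   # a dataset repo, not a model repo
        cache_dir=cache_dir,
    )
```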

## Data Processing Pipeline

### **ArtContext Research Pipeline**

ArteFact processes a massive corpus of art-historical texts:

- **Scale**: 3.1 million sentences from scholarly articles
- **Processing**: Executed on Durham University's Bede HPC cluster
- **GPU**: NVIDIA H100 with 32GB memory
- **Processing Time**: ~12 minutes for the full corpus
- **Output**: Structured embeddings and metadata for real-time analysis

### **Data Organization**

```
data/
├── models/
│   └── PaintingCLIP/     # LoRA fine-tuned weights
├── embeddings/           # Local cache (if needed)
├── json_info/            # Metadata files
│   ├── sentences.json    # 3.1M sentence metadata
│   ├── works.json        # 7,200 work records
│   ├── creators.json     # Artist/creator mappings
│   ├── topics.json       # Topic classifications
│   └── topic_names.json  # Human-readable topic names
└── marker_output/        # Document analysis outputs
```

## AI Models & Features

### **Core Models**

- **CLIP**: OpenAI's CLIP-ViT-B/32 for general image-text understanding
- **PaintingCLIP**: Fine-tuned version specialized for art-historical content
- **Model Switching**: Users can choose between models for different analysis types

### **Advanced AI Features**

- **Region-Aware Analysis**: 7×7 grid overlay for spatial understanding
- **Grad-ECLIP Heatmaps**: Visual explanations of AI decision-making
- **Smart Filtering**: Topic- and creator-based result filtering
- **Patch-Level Attention**: ViT patch embeddings for detailed analysis

## User Interface Features

### **Image Analysis Tools**

- **Drag & Drop Upload**: Easy image input with preview
- **Interactive Grid**: Click-to-analyze specific image regions
- **Crop & Edit**: Built-in image manipulation tools
- **Image History**: Track and compare different analyses

### **Academic Integration**

- **Citation Management**: One-click BibTeX copying
- **Source Verification**: Direct links to scholarly articles
- **Context Preservation**: Full paragraph context for matched sentences
- **Work Exploration**: Browse related images and metadata

## Research & Development

### **Technical Innovations**

- **Efficient Embedding Storage**: Safetensors format for fast loading
- **Memory-Optimized Inference**: Caching and batch processing
- **Real-Time Analysis**: Sub-second response times for similarity search
- **Scalable Architecture**: Designed for production deployment

### **Academic Applications**

- **Art Historical Research**: Discover connections across large corpora
- **Digital Humanities**: Computational analysis of visual-textual relationships
- **Educational Tools**: Interactive learning for art history students
- **Scholarly Discovery**: AI-powered literature review and citation analysis

## Contributing

### **Development Setup**

1. Fork the repository
2. Create a feature branch
3. Install development dependencies: `pip install -e ".[dev]"`
4. Run tests: `pytest backend/tests/`
5. Submit a pull request

### **Data Contributions**

- **Embeddings**: Process new art-historical texts
- **Models**: Improve fine-tuning and model performance
- **Documentation**: Enhance user guides and API documentation

## License & Acknowledgments

**License**: MIT License

**Created by**: [Samuel Waugh](https://www.linkedin.com/in/samuel-waugh-31903b1bb/)

**Supervised by**: [Dr. Stuart James](https://stuart-james.com), Department of Computer Science, Durham University

**Supported by**: [N8 Centre of Excellence in Computationally Intensive Research (N8 CIR)](https://n8cir.org.uk/themes/internships/internships-2025/)

**Special Thanks**: Durham University's Bede HPC cluster for providing the computational resources needed to process the large-scale art history corpus using Grace Hopper GPUs.

This work made use of the facilities of the N8 Centre of Excellence in Computationally Intensive Research (N8 CIR) provided and funded by the N8 research partnership and EPSRC (Grant No. EP/T022167/1). The Centre is coordinated by the Universities of Durham, Manchester and York.

## Links

- **Live Application**: [ArteFact on Hugging Face Spaces](https://huggingface.co/spaces/samwaugh/ArteFact)
- **Source Code**: [GitHub Repository](https://github.com/sammwaughh/artefact-context)
- **Research Paper**: [Download PDF](paper/waugh2025artcontext.pdf)
- **Embeddings Dataset**: [artefact-embeddings on HF](https://huggingface.co/datasets/samwaugh/artefact-embeddings)
- **Markdown Dataset**: [artefact-markdown on HF](https://huggingface.co/datasets/samwaugh/artefact-markdown) (planned)

---

*ArteFact makes large-scale scholarly resources accessible through AI-powered visual analysis, while maintaining academic rigor and providing transparent explanations of AI decision-making.*