samwaugh committed on
Commit 1f1001b · 1 Parent(s): 6e524f4

New README

Files changed (1):
  1. README.md +169 -69

README.md CHANGED
@@ -11,100 +11,200 @@ models:
  - openai/clip-vit-base-patch32
  - samwaugh/paintingclip-lora
  datasets:
- - samwaugh/artefact-embeddings-clip
- - samwaugh/artefact-embeddings-paintingclip
  ---
 
- # ArteFact — Hugging Face Space

- This branch contains the files required to run the **ArteFact** web app on Hugging Face **Spaces** using Docker.
- The full project documentation lives in the main GitHub repo (`main` branch).

- ## What runs here
- - **Flask server** (`backend/runner/app.py`) serving the SPA from `frontend/` (UI + API share one origin)
- - Built with the provided **Dockerfile**; the app listens on `$PORT` (set by Spaces)
- - **Phase 1**: Stub mode with fake ML responses (`STUB_MODE=1`)
- - **Phase 2**: Full ML inference with CLIP and PaintingCLIP models

- ## Current Status
- - ✅ **Phase 1 Complete**: Basic Flask app with stub responses
- - ✅ **Phase 2 Complete**: Real ML inference with large-scale corpus
- - 🎯 **Dataset**: 3.1M-sentence corpus (~33GB total) processed on Durham University's Bede HPC cluster
 
- ## New: Large-Scale Art History Corpus

- **Now featuring 3.1 million sentences** from art-historical texts, processed through our **ArtContext pipeline** on Durham University's Bede HPC cluster using **Grace Hopper GPUs**. This is one of the largest art-history text corpora available for computational analysis.

- ### **Processing Scale**
- - **Total sentences processed**: 3,119,199
- - **Embedding models**: CLIP + PaintingCLIP
- - **Processing time**: ~12 minutes on Grace Hopper
- - **Total data generated**: ~33GB
- - **GPU**: NVIDIA H100 with 32GB memory

- ### **Data Files**
- - **CLIP embeddings**: 6.2GB safetensors file
- - **PaintingCLIP embeddings**: 6.2GB safetensors file
- - **Updated metadata**: `sentences.json` with embedding status
- - **Marker outputs**: Document analysis results

- ## Deploy / update

  ```bash
- # one-time setup
  git remote add hf https://huggingface.co/spaces/samwaugh/ArteFact

- # deploy this branch to the Space
- git push hf space-clean:main

- # force rebuild if needed
- # (use Hugging Face Space settings → Factory Reset)
  ```

- ## Environment Variables
- - `STUB_MODE`: Set to `1` for stub responses, `0` for real ML
- - `DATA_ROOT`: Data directory path (default: `/data`)
- - `PORT`: Server port (set by Hugging Face)
 
- ## Architecture
- - **Backend**: Flask API with ML inference pipeline
- - **Frontend**: Single-page application (HTML/CSS/JS)
- - **Models**: CLIP base + PaintingCLIP LoRA fine-tune
- - **Data**: **Large-scale embeddings (12.4GB total)** with comprehensive metadata

- ## 🎯 Performance Improvements

- With the new large-scale corpus:
- - **Search quality**: Significantly improved with 3.1M sentences
- - **Coverage**: Broader art-historical context
- - **Efficiency**: Safetensors format for faster loading
- - **Scalability**: Ready for production deployment
 
- ## Data Structure

- The Space now includes:
- - **`data/embeddings/`**: Large-scale sentence vectors (12.4GB total)
-   - `clip_embeddings.safetensors` (6.2GB)
-   - `paintingclip_embeddings.safetensors` (6.2GB)
-   - Sentence ID mapping files
- - **`data/json_info/`**: Metadata for 3.1M sentences
- - **`data/marker_output/`**: Document analysis outputs
 
- ## HPC Pipeline: Bede Cluster Processing

- The **ArtContext pipeline** has been successfully executed on Durham University's **Bede HPC cluster**:

- ### **HPC Job Details**
- - **Partition**: Grace Hopper (gh)
- - **GPU**: NVIDIA H100
- - **Memory**: 32GB
- - **Batch size**: 1,024 sentences
- - **Processing speed**: ~9 batches/second

- ### **Pipeline Outputs**
- All data is now available in this Space for real-time art analysis at scale.
 
- ## Acknowledgements

- **Special thanks to Durham University's Bede HPC cluster** for providing the computational resources needed to process this large-scale art-history corpus using Grace Hopper GPUs.

- This work made use of the facilities of the N8 Centre of Excellence in Computationally Intensive Research (N8 CIR) provided and funded by the N8 research partnership and EPSRC (Grant No. EP/T022167/1). The Centre is coordinated by the Universities of Durham, Manchester and York.
 
  - openai/clip-vit-base-patch32
  - samwaugh/paintingclip-lora
  datasets:
+ - samwaugh/artefact-embeddings
+ - samwaugh/artefact-markdown
  ---
 
+ # ArteFact — Art History AI Research Platform

+ **ArteFact** is a web application that bridges visual art and textual scholarship using AI. By automatically linking visual elements in artworks to scholarly descriptions, it helps researchers, students, and art enthusiasts discover new connections and understand artworks in their broader academic context.
 
+ ## What ArteFact Does

+ - **Upload or select artwork images** and find scholarly passages that describe similar visual elements
+ - **Search by region**: crop specific areas of paintings to find text about those visual details
+ - **Filter results** by art-historical topics or specific creators
+ - **Access scholarly sources** with full citations, DOI links, and BibTeX references
+ - **Generate heatmaps** showing which image regions contribute to text similarity, using Grad-ECLIP

+ ## 🏗️ Architecture Overview

+ ### **Backend: Flask API with ML Pipeline**
+ - **Flask server** (`backend/runner/app.py`) serving the SPA from `frontend/`
+ - **ML Models**: CLIP base + PaintingCLIP LoRA fine-tune
+ - **Inference Engine**: Region-aware analysis with a 7×7 grid overlay
+ - **Background Processing**: Thread-based task queue for ML inference
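A thread-based task queue of the kind listed above can be sketched in a few lines; the names `submit_task`, `task_result`, and `tasks` are illustrative assumptions, not the app's actual API.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

# Minimal sketch of a thread-based task queue for background ML jobs.
# Names here (submit_task, tasks) are illustrative, not the app's real API.
executor = ThreadPoolExecutor(max_workers=2)
tasks = {}  # task_id -> Future

def submit_task(fn, *args):
    """Queue a job and return an id the client can poll later."""
    task_id = str(uuid.uuid4())
    tasks[task_id] = executor.submit(fn, *args)
    return task_id

def task_result(task_id):
    """Block until the job finishes and return its result."""
    return tasks[task_id].result()
```

A request handler would call `submit_task(run_inference, image)` and return the id immediately, letting the client poll for the result.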

+ ### **Frontend: Interactive Web Application**
+ - **Single-page application** with responsive Bootstrap design
+ - **Image Tools**: Upload, crop, edit, and analyze specific regions
+ - **Grid Analysis**: Click-to-analyze 7×7 grid cells for spatial understanding
+ - **Academic Integration**: Full citation management and source verification
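A click on one cell of the 7×7 overlay has to be translated into pixel coordinates before the region can be cropped and analyzed; a minimal sketch (the function name is hypothetical, not the app's actual helper):

```python
def grid_cell_to_box(row, col, width, height, n=7):
    """Map a clicked (row, col) cell of an n-by-n overlay to a pixel box.

    Returns (left, top, right, bottom), the form PIL's Image.crop accepts.
    Illustrative sketch only.
    """
    cell_w = width / n
    cell_h = height / n
    return (
        int(col * cell_w),
        int(row * cell_h),
        int((col + 1) * cell_w),
        int((row + 1) * cell_h),
    )
```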
+
+ ### **Data Architecture: Distributed Hugging Face Datasets**
+ - **`artefact-embeddings`**: Pre-computed sentence embeddings (12.8GB total)
+   - `clip_embeddings.safetensors` (6.39GB): CLIP model embeddings
+   - `paintingclip_embeddings.safetensors` (6.39GB): PaintingCLIP embeddings
+   - `*_sentence_ids.json` (71.7MB each): Sentence ID mappings
+ - **`artefact-markdown`**: Source documents and images (planned)
+   - 7,200 work directories with markdown files and associated images
+   - Organized by work ID for efficient retrieval
+ - **Local Models**: PaintingCLIP LoRA weights in `data/models/PaintingCLIP/`

+ ## 🚀 Getting Started

+ ### **Prerequisites**
+ - Python 3.9+
+ - Docker (for containerized deployment)
+ - Access to the Hugging Face datasets
+
+ ### **Local Development**
  ```bash
+ # Clone the repository
+ git clone https://github.com/sammwaughh/artefact-context.git
+ cd artefact-context
+
+ # Install backend dependencies
+ cd backend
+ pip install -e .
+
+ # Set environment variables
+ export STUB_MODE=1   # Use stub responses for development
+ export DATA_ROOT=./data
+
+ # Run the Flask development server
+ python -m backend.runner.app
+ ```
+
+ ### **Hugging Face Spaces Deployment**
+ ```bash
+ # Add HF Spaces remote
  git remote add hf https://huggingface.co/spaces/samwaugh/ArteFact

+ # Deploy to Space
+ git push hf main:main
+
+ # Force rebuild if needed (use HF Space settings → Factory Reset)
+ ```
+
+ ## Configuration
+
+ ### **Environment Variables**
+ - `STUB_MODE`: Set to `1` for stub responses, `0` for real ML inference
+ - `DATA_ROOT`: Data directory path (default: `/data` for HF Spaces)
+ - `PORT`: Server port (set by Hugging Face Spaces)
+ - `MAX_WORKERS`: Thread pool size for ML inference (default: 2)
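How the server might read these variables at startup; the `/data` and `2` defaults come from the list above, while the `7860` port fallback is an assumption (Spaces sets `PORT` itself).

```python
import os

# Sketch of reading the configuration described above at startup.
# The 7860 fallback is an assumption for local runs; HF Spaces sets PORT.
STUB_MODE = os.environ.get("STUB_MODE", "0") == "1"
DATA_ROOT = os.environ.get("DATA_ROOT", "/data")
PORT = int(os.environ.get("PORT", "7860"))
MAX_WORKERS = int(os.environ.get("MAX_WORKERS", "2"))
```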
+
+ ### **Data Sources**
+ The application connects to distributed data sources:
+ - **Embeddings**: `samwaugh/artefact-embeddings` for fast similarity search
+ - **Markdown**: `samwaugh/artefact-markdown` for source documents and context
+ - **Models**: Local `data/models/` directory for ML model weights
+ - **Metadata**: Local `data/json_info/` for fast access to sentence and work information
104
+
105
+ ## πŸ“Š Data Processing Pipeline
106
 
107
+ ### **ArtContext Research Pipeline**
108
+ ArteFact processes a massive corpus of art historical texts:
109
+
110
+ - **Scale**: 3.1 million sentences from scholarly articles
111
+ - **Processing**: Executed on Durham University's Bede HPC cluster
112
+ - **GPU**: NVIDIA H100 with 32GB memory
113
+ - **Processing Time**: ~12 minutes for full corpus
114
+ - **Output**: Structured embeddings and metadata for real-time analysis
+
+ ### **Data Organization**
+ ```
+ data/
+ ├── models/
+ │   └── PaintingCLIP/      # LoRA fine-tuned weights
+ ├── embeddings/            # Local cache (if needed)
+ ├── json_info/             # Metadata files
+ │   ├── sentences.json     # 3.1M sentence metadata
+ │   ├── works.json         # 7,200 work records
+ │   ├── creators.json      # Artist/creator mappings
+ │   ├── topics.json        # Topic classifications
+ │   └── topic_names.json   # Human-readable topic names
+ └── marker_output/         # Document analysis outputs
  ```

+ ## 🧠 AI Models & Features
+
+ ### **Core Models**
+ - **CLIP**: OpenAI's CLIP-ViT-B/32 for general image-text understanding
+ - **PaintingCLIP**: Fine-tuned version specialized for art-historical content
+ - **Model Switching**: Users can choose between models for different analysis types
+
+ ### **Advanced AI Features**
+ - **Region-Aware Analysis**: 7×7 grid overlay for spatial understanding
+ - **Grad-ECLIP Heatmaps**: Visual explanations of AI decision-making
+ - **Smart Filtering**: Topic- and creator-based result filtering
+ - **Patch-Level Attention**: ViT patch embeddings for detailed analysis
+
+ ## 🎨 User Interface Features
+
+ ### **Image Analysis Tools**
+ - **Drag & Drop Upload**: Easy image input with preview
+ - **Interactive Grid**: Click-to-analyze specific image regions
+ - **Crop & Edit**: Built-in image manipulation tools
+ - **Image History**: Track and compare different analyses

+ ### **Academic Integration**
+ - **Citation Management**: One-click BibTeX copying
+ - **Source Verification**: Direct links to scholarly articles
+ - **Context Preservation**: Full paragraph context for matched sentences
+ - **Work Exploration**: Browse related images and metadata

+ ## 🔬 Research & Development

+ ### **Technical Innovations**
+ - **Efficient Embedding Storage**: Safetensors format for fast loading
+ - **Memory-Optimized Inference**: Caching and batch processing
+ - **Real-Time Analysis**: Sub-second response times for similarity search
+ - **Scalable Architecture**: Designed for production deployment
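Sub-second similarity search over pre-computed embeddings reduces to a normalized matrix-vector product; a sketch under toy data, with all names illustrative:

```python
import numpy as np

def top_k(query_emb, corpus_embs, k=5):
    """Return indices and cosine scores of the k closest sentence embeddings.

    Illustrative sketch of the similarity search described above;
    not the app's actual implementation.
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q          # cosine similarity against every corpus row
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Toy corpus: row 2 points in the same direction as the query.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
idx, scores = top_k(np.array([1.0, 1.0]), corpus, k=1)
print(idx[0])  # 2
```

In production the same product runs against the 3.1M-row embedding matrices loaded from the safetensors files.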

+ ### **Academic Applications**
+ - **Art Historical Research**: Discover connections across large corpora
+ - **Digital Humanities**: Computational analysis of visual-textual relationships
+ - **Educational Tools**: Interactive learning for art history students
+ - **Scholarly Discovery**: AI-powered literature review and citation analysis

+ ## 🤝 Contributing

+ ### **Development Setup**
+ 1. Fork the repository
+ 2. Create a feature branch
+ 3. Install development dependencies: `pip install -e ".[dev]"`
+ 4. Run tests: `pytest backend/tests/`
+ 5. Submit a pull request

+ ### **Data Contributions**
+ - **Embeddings**: Process new art-historical texts
+ - **Models**: Improve fine-tuning and model performance
+ - **Documentation**: Enhance user guides and API documentation

+ ## 📄 License & Acknowledgments

+ **License**: MIT License

+ **Created by**: [Samuel Waugh](https://www.linkedin.com/in/samuel-waugh-31903b1bb/)

+ **Supervised by**: [Dr. Stuart James](https://stuart-james.com), Department of Computer Science, Durham University
+
+ **Supported by**: [N8 Centre of Excellence in Computationally Intensive Research (N8 CIR)](https://n8cir.org.uk/themes/internships/internships-2025/)
+
+ **Special Thanks**: Durham University's Bede HPC cluster, which provided the computational resources needed to process the large-scale art-history corpus on Grace Hopper GPUs.
+
+ This work made use of the facilities of the N8 Centre of Excellence in Computationally Intensive Research (N8 CIR) provided and funded by the N8 research partnership and EPSRC (Grant No. EP/T022167/1). The Centre is coordinated by the Universities of Durham, Manchester and York.
199
+
200
+ ## πŸ”— Links
201
+
202
+ - **Live Application**: [ArteFact on Hugging Face Spaces](https://huggingface.co/spaces/samwaugh/ArteFact)
203
+ - **Source Code**: [GitHub Repository](https://github.com/sammwaughh/artefact-context)
204
+ - **Research Paper**: [Download PDF](paper/waugh2025artcontext.pdf)
205
+ - **Embeddings Dataset**: [artefact-embeddings on HF](https://huggingface.co/datasets/samwaugh/artefact-embeddings)
206
+ - **Markdown Dataset**: [artefact-markdown on HF](https://huggingface.co/datasets/samwaugh/artefact-markdown) (planned)
207
+
208
+ ---
209
 
210
+ *ArteFact represents a significant contribution to computational art history, making large-scale scholarly resources accessible through AI-powered visual analysis while maintaining academic rigor and providing transparent explanations of AI decision-making.*