# ArteFact Markdown Download Optimization

## Problem

The original markdown download process was extremely slow, taking over 24 hours to download 7,202 work directories with their associated images. The process was:

- **Sequential**: Downloading one work directory at a time
- **Inefficient**: Downloading both markdown files and images together
- **No parallelization**: Single-threaded approach
- **Rate**: ~112 directories per hour (extremely slow)

## Solution: Optimized Parallel Download

### Key Improvements

1. **Two-Phase Download**:
   - **Phase 1**: Download only markdown files in parallel (fast)
   - **Phase 2**: Download images in batches (manageable)
2. **Parallel Processing**:
   - **Markdown files**: 10 concurrent downloads
   - **Images**: 5 concurrent downloads per batch
   - **Batch processing**: 50 works per batch for images
3. **Smart Error Handling**:
   - Graceful failure handling
   - Progress reporting every 500 files
   - Limited error spam (only the first 3 errors per work)
4. **Server-Friendly**:
   - Small delays between batches
   - Reasonable concurrency limits
   - Respectful of Hugging Face rate limits

### Performance Expectations

- **Markdown files**: Should complete in minutes (not hours)
- **Images**: Will take longer, but in manageable batches
- **Overall**: 10-50x faster than the original approach
- **Resumable**: Can be interrupted and restarted

## New API Endpoints

### 1. `/cache/optimized-download` (POST)

Starts the optimized download process with parallel processing.

**Response**:

```json
{
  "message": "Optimized download completed successfully",
  "cache_info": {
    "exists": true,
    "work_count": 7202,
    "size_gb": 15.2,
    "file_count": 45000
  }
}
```
### 2. Existing Endpoints

- `/cache/info` (GET): Get cache information
- `/cache/clear` (POST): Clear the cache
- `/cache/refresh` (POST): Force refresh (uses the optimized approach)

## Usage

### Option 1: Via API

```bash
# Start optimized download
curl -X POST http://localhost:7860/cache/optimized-download

# Check progress
curl http://localhost:7860/cache/info
```

### Option 2: Via Environment Variable

```bash
# Force full download on startup
FORCE_FULL_DOWNLOAD=true python -m backend.runner.app
```

### Option 3: Via Test Script

```bash
python test_optimized_download.py
```

## Technical Details

### File Structure

```
/data/markdown_cache/
└── works/
    ├── W1009740230/
    │   ├── W1009740230.md
    │   └── images/
    │       ├── image-001.png
    │       └── image-002.png
    └── W1014119368/
        ├── W1014119368.md
        └── images/
            └── image-001.png
```

### Concurrency Settings

- **Markdown downloads**: 10 workers
- **Image downloads**: 5 workers per batch
- **Batch size**: 50 works per batch
- **Batch delay**: 1 second between batches

### Error Handling

- Individual file failures don't stop the process
- Progress is reported every 500 files
- The first 3 errors per work are logged
- Graceful degradation on network issues

## Monitoring

The system provides detailed logging:

- File discovery progress
- Phase 1 completion (markdown files)
- Phase 2 progress (images by batch)
- Final statistics

Example output:

```
🔍 Discovering files in dataset...
📊 Found 7202 work directories to download
📄 Phase 1: Downloading markdown files only...
📄 Downloaded 500/7202 markdown files (failed: 0)
✅ Phase 1 complete: 7202 markdown files downloaded, 0 failed
🖼️ Phase 2: Downloading images in batches...
🖼️ Processing image batch 1/145 (50 works)
✅ Phase 2 complete: 45000 images downloaded, 12 failed
✅ Successfully downloaded markdown dataset to /data/markdown_cache/works
```

## Benefits

1. **Speed**: 10-50x faster than the original approach
2. **Reliability**: Better error handling and recovery
3. **Monitoring**: Clear progress reporting
4. **Flexibility**: Can be triggered via API or environment variable
5. **Resumable**: Can be restarted if interrupted
6. **Server-friendly**: Respects rate limits and server resources

This optimization transforms the markdown download from a 24+ hour process into a manageable task that completes in a reasonable timeframe.
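For callers who prefer Python over `curl`, the API option above can also be driven programmatically. This is a standard-library-only sketch; the endpoints come from this document, while the default base URL (`http://localhost:7860`, as used in the `curl` examples) is a parameter so it can point at any deployment.

```python
import json
from urllib import request

def trigger_optimized_download(base_url="http://localhost:7860"):
    """POST /cache/optimized-download and return the parsed JSON response."""
    req = request.Request(f"{base_url}/cache/optimized-download", method="POST")
    with request.urlopen(req) as resp:
        return json.load(resp)

def get_cache_info(base_url="http://localhost:7860"):
    """GET /cache/info to check cache state and download progress."""
    with request.urlopen(f"{base_url}/cache/info") as resp:
        return json.load(resp)
```

Because the download is resumable, `trigger_optimized_download()` can simply be called again after an interruption, with `get_cache_info()` used to confirm the final `work_count`.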