# ArteFact Markdown Download Optimization

## Problem

The original markdown download process was extremely slow, taking over 24 hours to download 7,202 work directories with their associated images. The process was:

- **Sequential**: Downloading one work directory at a time
- **Inefficient**: Downloading both markdown files and images together
- **Single-threaded**: No parallelization at all
- **Slow**: Roughly 112 directories per hour

## Solution: Optimized Parallel Download

### Key Improvements
1. **Two-Phase Download**:
   - **Phase 1**: Download only markdown files in parallel (fast)
   - **Phase 2**: Download images in batches (manageable)
2. **Parallel Processing**:
   - **Markdown files**: 10 concurrent downloads
   - **Images**: 5 concurrent downloads per batch
   - **Batch processing**: 50 works per batch for images
3. **Smart Error Handling**:
   - Graceful failure handling
   - Progress reporting every 500 files
   - Limited error spam (only the first 3 errors per work are logged)
4. **Server-Friendly**:
   - Small delays between batches
   - Reasonable concurrency limits
   - Respectful of Hugging Face rate limits
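The two-phase scheme above can be sketched with Python's `concurrent.futures`. This is a minimal sketch, not the project's actual implementation: `fetch_markdown` and `fetch_images` are hypothetical per-work download callables, and the default worker and batch numbers mirror the settings listed above.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def two_phase_download(works, fetch_markdown, fetch_images,
                       md_workers=10, img_workers=5,
                       batch_size=50, batch_delay=1.0):
    """Phase 1: all markdown files at high concurrency; Phase 2: images
    in fixed-size batches with a pause between batches."""
    # Phase 1: markdown files are small, so use the wider worker pool.
    with ThreadPoolExecutor(max_workers=md_workers) as pool:
        list(pool.map(fetch_markdown, works))
    # Phase 2: images in batches of `batch_size`, fewer workers per batch.
    for start in range(0, len(works), batch_size):
        batch = works[start:start + batch_size]
        with ThreadPoolExecutor(max_workers=img_workers) as pool:
            list(pool.map(fetch_images, batch))
        if start + batch_size < len(works):
            time.sleep(batch_delay)  # small delay to stay server-friendly
```

Draining each `pool.map` with `list()` ensures a batch fully finishes before the next one starts, which is what keeps the load on the server bounded.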
### Performance Expectations

- **Markdown files**: Should complete in minutes (not hours)
- **Images**: Will take longer, but in manageable batches
- **Overall**: 10-50x faster than the original approach
- **Resumable**: Can be interrupted and restarted
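Resumability can be as simple as skipping works whose files are already on disk. A sketch of one such check, assuming the cache layout shown under "File Structure" below (the helper name is hypothetical):

```python
from pathlib import Path

def needs_markdown(cache_root, work_id):
    """Resumability check: skip any work whose markdown file is already
    cached, so an interrupted run picks up where it left off."""
    return not (Path(cache_root) / "works" / work_id / f"{work_id}.md").exists()
```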
## New API Endpoints

### 1. `/cache/optimized-download` (POST)

Starts the optimized download process with parallel processing.

**Response**:

```json
{
  "message": "Optimized download completed successfully",
  "cache_info": {
    "exists": true,
    "work_count": 7202,
    "size_gb": 15.2,
    "file_count": 45000
  }
}
```
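A client can condense that response body into a one-line summary. This helper is a hypothetical example, not part of the API; it only assumes the response shape shown above:

```python
import json

def summarize_cache_info(payload):
    """Condense the /cache/optimized-download response into one line."""
    info = json.loads(payload)["cache_info"]
    return (f"{info['work_count']} works, {info['file_count']} files, "
            f"{info['size_gb']} GB cached")
```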
### 2. Existing Endpoints

- `/cache/info` (GET): Get cache information
- `/cache/clear` (POST): Clear the cache
- `/cache/refresh` (POST): Force refresh (uses the optimized approach)
## Usage

### Option 1: Via API

```bash
# Start optimized download
curl -X POST http://localhost:7860/cache/optimized-download

# Check progress
curl http://localhost:7860/cache/info
```

### Option 2: Via Environment Variable

```bash
# Force full download on startup
FORCE_FULL_DOWNLOAD=true python -m backend.runner.app
```

### Option 3: Via Test Script

```bash
python test_optimized_download.py
```
## Technical Details

### File Structure

```
/data/markdown_cache/
└── works/
    ├── W1009740230/
    │   ├── W1009740230.md
    │   └── images/
    │       ├── image-001.png
    │       └── image-002.png
    └── W1014119368/
        ├── W1014119368.md
        └── images/
            └── image-001.png
```
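Given that layout, a few lines of `pathlib` can compute the cache statistics the API reports. A sketch (the function name is hypothetical, and it assumes images are PNGs as in the tree above):

```python
from pathlib import Path

def cache_stats(cache_root):
    """Count work directories and cached PNG images under the layout above."""
    works_dir = Path(cache_root) / "works"
    if not works_dir.is_dir():
        return {"work_count": 0, "image_count": 0}
    work_dirs = [d for d in works_dir.iterdir() if d.is_dir()]
    images = sum(1 for d in work_dirs for _ in (d / "images").glob("*.png"))
    return {"work_count": len(work_dirs), "image_count": images}
```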
### Concurrency Settings

- **Markdown downloads**: 10 workers
- **Image downloads**: 5 workers per batch
- **Batch size**: 50 works per batch
- **Batch delay**: 1 second between batches

### Error Handling

- Individual file failures don't stop the process
- Progress is reported every 500 files
- Only the first 3 errors per work are logged
- Graceful degradation on network issues
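The "first 3 errors per work" rule above can be sketched as a per-work loop that counts failures but caps how many are printed. This is an illustrative sketch, not the project's code; `fetch` is a hypothetical download callable:

```python
def download_work_images(work_id, urls, fetch, max_logged=3):
    """Try every image URL for one work; individual failures never abort
    the work, and only the first `max_logged` errors are printed."""
    failed = 0
    for url in urls:
        try:
            fetch(url)
        except Exception as exc:
            failed += 1
            if failed <= max_logged:  # cap error spam per work
                print(f"{work_id}: failed to fetch {url}: {exc}")
    return failed
```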
## Monitoring

The system provides detailed logging:

- File discovery progress
- Phase 1 completion (markdown files)
- Phase 2 progress (images by batch)
- Final statistics

Example output:

```
Discovering files in dataset...
Found 7202 work directories to download
Phase 1: Downloading markdown files only...
Downloaded 500/7202 markdown files (failed: 0)
Phase 1 complete: 7202 markdown files downloaded, 0 failed
Phase 2: Downloading images in batches...
Processing image batch 1/145 (50 works)
Phase 2 complete: 45000 images downloaded, 12 failed
Successfully downloaded markdown dataset to /data/markdown_cache/works
```
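The milestone cadence in the example output (a line every 500 files, plus one at completion) can be sketched as a small pure function. The function name is hypothetical:

```python
def progress_line(done, total, failed, every=500):
    """Return a progress line at each `every`-file milestone and at
    completion, matching the example output; otherwise return None."""
    if done % every == 0 or done == total:
        return f"Downloaded {done}/{total} markdown files (failed: {failed})"
    return None
```

Keeping it pure (returning a string or `None` instead of printing) makes the milestone logic trivial to unit-test.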
## Benefits

1. **Speed**: 10-50x faster than the original approach
2. **Reliability**: Better error handling and recovery
3. **Monitoring**: Clear progress reporting
4. **Flexibility**: Can be triggered via API or environment variable
5. **Resumable**: Can be restarted if interrupted
6. **Server-friendly**: Respects rate limits and server resources

This optimization transforms the markdown download from a 24+ hour process into a manageable task that completes in a reasonable timeframe.