# ArteFact Markdown Download Optimization

## Problem
The original markdown download process was extremely slow, taking over 24 hours to download 7,202 work directories with their associated images. The process was:
- Sequential: Downloading one work directory at a time
- Inefficient: Downloading both markdown files and images together
- No parallelization: Single-threaded approach
- Rate: ~112 directories per hour (extremely slow)
## Solution: Optimized Parallel Download

### Key Improvements

**Two-Phase Download:**
- Phase 1: Download only markdown files in parallel (fast)
- Phase 2: Download images in batches (manageable)
**Parallel Processing:**
- Markdown files: 10 concurrent downloads
- Images: 5 concurrent downloads per batch
- Batch processing: 50 works per batch for images
**Smart Error Handling:**
- Graceful failure handling
- Progress reporting every 500 files
- Limited error spam (only first 3 errors per work)
**Server-Friendly:**
- Small delays between batches
- Reasonable concurrency limits
- Respectful of Hugging Face rate limits
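The improvements above can be sketched as a two-phase downloader. This is a minimal illustration, not the actual implementation: `fetch_markdown` and `fetch_images` are hypothetical callables standing in for the real per-work download logic, and the constants mirror the settings documented below.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

MARKDOWN_WORKERS = 10  # concurrent markdown downloads
IMAGE_WORKERS = 5      # concurrent image downloads per batch
BATCH_SIZE = 50        # works per image batch
BATCH_DELAY = 1.0      # seconds between image batches

def download_all(works, fetch_markdown, fetch_images):
    """Two-phase download: markdown first (fast), then images in batches.

    `fetch_markdown(work)` and `fetch_images(work)` are stand-ins that
    download files for one work directory and return a failure count.
    """
    # Phase 1: markdown files only, fully parallel
    failed_md = 0
    with ThreadPoolExecutor(max_workers=MARKDOWN_WORKERS) as pool:
        futures = [pool.submit(fetch_markdown, w) for w in works]
        for future in as_completed(futures):
            failed_md += future.result()

    # Phase 2: images, batch by batch to stay server-friendly
    failed_img = 0
    for start in range(0, len(works), BATCH_SIZE):
        batch = works[start:start + BATCH_SIZE]
        with ThreadPoolExecutor(max_workers=IMAGE_WORKERS) as pool:
            futures = [pool.submit(fetch_images, w) for w in batch]
            for future in as_completed(futures):
                failed_img += future.result()
        if start + BATCH_SIZE < len(works):
            time.sleep(BATCH_DELAY)  # small delay between batches
    return failed_md, failed_img
```

Separating the phases means the small markdown files (which make the app usable) arrive in minutes, while the bulkier images trickle in afterwards.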
## Performance Expectations
- Markdown files: Should complete in minutes (not hours)
- Images: Will take longer but in manageable batches
- Overall: 10-50x faster than original approach
- Resumable: Can be interrupted and restarted
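Resumability can be achieved by checking the local cache before each fetch. The helper below is a hypothetical sketch of that idea (the real code may compare hashes or ETags instead of sizes); a restarted run re-walks the file list but only downloads what is missing or incomplete.

```python
from pathlib import Path

def needs_download(local_path, expected_size=None):
    """Return True if the file is absent or looks incomplete.

    Hypothetical resumability check: skip files already in the cache,
    and re-fetch files whose size differs from the expected size
    (e.g. a partial file left by an interrupted run).
    """
    local_path = Path(local_path)
    if not local_path.exists():
        return True
    if expected_size is not None and local_path.stat().st_size != expected_size:
        return True  # partial file from an interrupted run
    return False
```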
## New API Endpoints

### 1. `/cache/optimized-download` (POST)

Starts the optimized download process with parallel processing.

**Response:**

```json
{
  "message": "Optimized download completed successfully",
  "cache_info": {
    "exists": true,
    "work_count": 7202,
    "size_gb": 15.2,
    "file_count": 45000
  }
}
```
### 2. Existing Endpoints

- `/cache/info` (GET): Get cache information
- `/cache/clear` (POST): Clear the cache
- `/cache/refresh` (POST): Force refresh (uses the optimized approach)
## Usage

### Option 1: Via API

```bash
# Start the optimized download
curl -X POST http://localhost:7860/cache/optimized-download

# Check progress
curl http://localhost:7860/cache/info
```
### Option 2: Via Environment Variable

```bash
# Force a full download on startup
FORCE_FULL_DOWNLOAD=true python -m backend.runner.app
```
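On the Python side, the startup check for this flag could look like the sketch below. This is an assumption about how the flag is parsed, not the actual implementation; in particular, the set of accepted truthy spellings is illustrative.

```python
import os

def force_full_download_requested(env=None):
    """Interpret the FORCE_FULL_DOWNLOAD environment flag.

    Hypothetical sketch: treats common truthy spellings as "yes",
    so FORCE_FULL_DOWNLOAD=true, =1, or =yes would all trigger
    a full re-download on startup.
    """
    env = os.environ if env is None else env
    return env.get("FORCE_FULL_DOWNLOAD", "").strip().lower() in {"1", "true", "yes"}
```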
### Option 3: Via Test Script

```bash
python test_optimized_download.py
```
## Technical Details

### File Structure

```
/data/markdown_cache/
└── works/
    ├── W1009740230/
    │   ├── W1009740230.md
    │   └── images/
    │       ├── image-001.png
    │       └── image-002.png
    └── W1014119368/
        ├── W1014119368.md
        └── images/
            └── image-001.png
```
### Concurrency Settings
- Markdown downloads: 10 workers
- Image downloads: 5 workers per batch
- Batch size: 50 works per batch
- Batch delay: 1 second between batches
### Error Handling
- Individual file failures don't stop the process
- Progress is reported every 500 files
- First 3 errors per work are logged
- Graceful degradation on network issues
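The "first 3 errors per work" behaviour can be sketched as a small per-work counter. `ErrorLimiter` is a hypothetical name for illustration; the real code may implement this differently.

```python
from collections import defaultdict

MAX_ERRORS_PER_WORK = 3  # cap log spam per work directory

class ErrorLimiter:
    """Log only the first few errors per work; silently count the rest.

    Hypothetical sketch of the "limited error spam" behaviour:
    the first MAX_ERRORS_PER_WORK errors are logged, then a single
    suppression notice, then nothing further for that work.
    """

    def __init__(self, log=print):
        self.counts = defaultdict(int)
        self.log = log

    def report(self, work_id, error):
        self.counts[work_id] += 1
        if self.counts[work_id] <= MAX_ERRORS_PER_WORK:
            self.log(f"{work_id}: {error}")
        elif self.counts[work_id] == MAX_ERRORS_PER_WORK + 1:
            self.log(f"{work_id}: further errors suppressed")
```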
## Monitoring
The system provides detailed logging:
- File discovery progress
- Phase 1 completion (markdown files)
- Phase 2 progress (images by batch)
- Final statistics
Example output:

```
Discovering files in dataset...
Found 7202 work directories to download
Phase 1: Downloading markdown files only...
Downloaded 500/7202 markdown files (failed: 0)
Phase 1 complete: 7202 markdown files downloaded, 0 failed
Phase 2: Downloading images in batches...
Processing image batch 1/145 (50 works)
Phase 2 complete: 45000 images downloaded, 12 failed
Successfully downloaded markdown dataset to /data/markdown_cache/works
```
## Benefits
- Speed: 10-50x faster than original approach
- Reliability: Better error handling and recovery
- Monitoring: Clear progress reporting
- Flexibility: Can be triggered via API or environment
- Resumable: Can be restarted if interrupted
- Server-friendly: Respects rate limits and server resources
This optimization transforms the markdown download from a 24+ hour process into a manageable task that completes in a reasonable timeframe.