ArteFact / OPTIMIZATION_SUMMARY.md

samwaugh

Try to speed up markdown download

33b499e 5 months ago

ArteFact Markdown Download Optimization

Problem

The original markdown download process took over 24 hours to download 7,202 work directories with their associated images. The process was:

Sequential: Downloaded one work directory at a time
Inefficient: Downloaded markdown files and images together
Single-threaded: No parallel downloads
Rate: ~112 directories per hour
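
As a quick sanity check on these figures, the arithmetic below estimates how long a full run takes at the observed rate, and what the 10-50x speedup quoted under Performance Expectations would imply. It uses only numbers stated in this document; the function name is illustrative.

```python
def hours_to_finish(total_dirs: int, dirs_per_hour: float) -> float:
    """Estimated wall-clock hours for a full sequential run."""
    return total_dirs / dirs_per_hour


baseline_hours = hours_to_finish(7202, 112)
print(f"Baseline: {baseline_hours:.1f} hours (~{baseline_hours / 24:.1f} days)")
# Baseline: 64.3 hours (~2.7 days)

# Projected duration if the expected speedup holds (an estimate, not a measurement).
for speedup in (10, 50):
    print(f"{speedup}x faster: {baseline_hours / speedup:.1f} hours")
# 10x faster: 6.4 hours
# 50x faster: 1.3 hours
```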

Solution: Optimized Parallel Download

Key Improvements

Two-Phase Download:
- Phase 1: Download only markdown files in parallel (fast)
- Phase 2: Download images in batches (manageable)
Parallel Processing:
- Markdown files: 10 concurrent downloads
- Images: 5 concurrent downloads per batch
- Batch processing: 50 works per batch for images
Smart Error Handling:
- Graceful failure handling
- Progress reporting every 500 files
- Limited error spam (only first 3 errors per work)
Server-Friendly:
- Small delays between batches
- Reasonable concurrency limits
- Respectful of Hugging Face rate limits
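
The two-phase approach above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `fetch` stands in for whatever function downloads a single file, the `works` mapping is a hypothetical data layout, and the one-second delay is the value listed under Concurrency Settings.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Values taken from the settings described in this document.
MARKDOWN_WORKERS = 10
IMAGE_WORKERS = 5
BATCH_SIZE = 50
BATCH_DELAY_S = 1.0


def run_parallel(fetch: Callable[[str], None], paths: List[str],
                 workers: int) -> Tuple[int, int]:
    """Fetch every path with a thread pool; return (succeeded, failed)."""
    def safe_fetch(path: str) -> bool:
        try:
            fetch(path)
            return True
        except Exception:
            return False  # one bad file must not stop the run

    if not paths:
        return 0, 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(safe_fetch, paths))
    succeeded = sum(results)
    return succeeded, len(results) - succeeded


def two_phase_download(works: Dict[str, List[str]],
                       fetch: Callable[[str], None],
                       delay_s: float = BATCH_DELAY_S):
    """`works` maps a work ID to its image paths (hypothetical layout)."""
    work_ids = sorted(works)

    # Phase 1: markdown files only, with the higher concurrency limit.
    md_paths = [f"works/{w}/{w}.md" for w in work_ids]
    md_stats = run_parallel(fetch, md_paths, MARKDOWN_WORKERS)

    # Phase 2: images, one batch of works at a time.
    img_ok = img_failed = 0
    for start in range(0, len(work_ids), BATCH_SIZE):
        batch = work_ids[start:start + BATCH_SIZE]
        img_paths = [p for w in batch for p in works[w]]
        ok, failed = run_parallel(fetch, img_paths, IMAGE_WORKERS)
        img_ok += ok
        img_failed += failed
        if start + BATCH_SIZE < len(work_ids):
            time.sleep(delay_s)  # small pause between batches
    return md_stats, (img_ok, img_failed)
```

Passing `fetch` in as a parameter keeps the concurrency logic independent of the download client, which also makes it easy to exercise with a stub.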

Performance Expectations

Markdown files: Should complete in minutes (not hours)
Images: Will take longer but in manageable batches
Overall: 10-50x faster than original approach
Resumable: Can be interrupted and restarted
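
This document does not say how resumption is implemented. One common way to get that behavior, shown here purely as an assumption, is to skip files that already exist in the cache. The function name and the zero-byte check are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def pending_downloads(cache_root: Path, relative_paths: Iterable[str]) -> List[str]:
    """Return only the paths that are missing or empty in the cache."""
    todo = []
    for rel in relative_paths:
        target = cache_root / rel
        # Treat zero-byte files as incomplete so they are fetched again.
        if not target.is_file() or target.stat().st_size == 0:
            todo.append(rel)
    return todo
```

Running a restarted download over `pending_downloads(...)` instead of the full file list means an interrupted run only repeats the work it had not finished.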

New API Endpoints

1. `/cache/optimized-download` (POST)

Starts the optimized download process with parallel processing.

Example response once the download has finished:

{
  "message": "Optimized download completed successfully",
  "cache_info": {
    "exists": true,
    "work_count": 7202,
    "size_gb": 15.2,
    "file_count": 45000
  }
}

2. Existing Endpoints

/cache/info (GET): Get cache information
/cache/clear (POST): Clear the cache
/cache/refresh (POST): Force refresh (uses optimized approach)

Usage

Option 1: Via API

# Start optimized download
curl -X POST http://localhost:7860/cache/optimized-download

# Check progress
curl http://localhost:7860/cache/info
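
The same calls can be made from Python with the standard library. This is a minimal sketch: the helper names are illustrative, the timeout values are assumptions, and it assumes every endpoint returns a JSON body like the one shown for `/cache/optimized-download`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # same host and port as the curl commands


def build_request(path: str, method: str = "GET") -> urllib.request.Request:
    """Build (but do not send) a request for one of the cache endpoints."""
    return urllib.request.Request(f"{BASE_URL}{path}", method=method)


def call(path: str, method: str = "GET", timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body."""
    request = build_request(path, method)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the app to be running):
#   call("/cache/optimized-download", method="POST", timeout=3600)
#   call("/cache/info")
```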

Option 2: Via Environment Variable

# Force full download on startup
FORCE_FULL_DOWNLOAD=true python -m backend.runner.app
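
For illustration, an application might read this flag at startup along the following lines. The function name and the set of accepted values are assumptions; only the variable name `FORCE_FULL_DOWNLOAD` and the value `true` come from this document.

```python
import os
from typing import Mapping, Optional


def force_full_download(env: Optional[Mapping[str, str]] = None) -> bool:
    """True when FORCE_FULL_DOWNLOAD is set to a truthy string."""
    env = os.environ if env is None else env
    value = env.get("FORCE_FULL_DOWNLOAD", "")
    # Accepting "1" and "yes" as well as "true" is an assumption of this sketch.
    return value.strip().lower() in {"true", "1", "yes"}
```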

Option 3: Via Test Script

python test_optimized_download.py

Technical Details

File Structure

/data/markdown_cache/
└── works/
    ├── W1009740230/
    │   ├── W1009740230.md
    │   └── images/
    │       ├── image-001.png
    │       └── image-002.png
    └── W1014119368/
        ├── W1014119368.md
        └── images/
            └── image-001.png
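
Given this layout, paths and simple cache statistics can be derived with `pathlib`. The helpers below are a hypothetical sketch, not the code behind `/cache/info`; they assume each work directory holds one `<work_id>.md` file and an optional `images/` folder, as in the tree above.

```python
from pathlib import Path
from typing import Dict

CACHE_ROOT = Path("/data/markdown_cache")


def work_markdown_path(work_id: str, root: Path = CACHE_ROOT) -> Path:
    """Path of the markdown file for one work."""
    return root / "works" / work_id / f"{work_id}.md"


def summarize_cache(root: Path = CACHE_ROOT) -> Dict[str, int]:
    """Count work directories, markdown files, and images in the cache."""
    summary = {"work_count": 0, "markdown_count": 0, "image_count": 0}
    works_dir = root / "works"
    if not works_dir.is_dir():
        return summary
    for work_dir in works_dir.iterdir():
        if not work_dir.is_dir():
            continue
        summary["work_count"] += 1
        if (work_dir / f"{work_dir.name}.md").is_file():
            summary["markdown_count"] += 1
        images_dir = work_dir / "images"
        if images_dir.is_dir():
            summary["image_count"] += sum(1 for f in images_dir.iterdir() if f.is_file())
    return summary
```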

Concurrency Settings

Markdown downloads: 10 workers
Image downloads: 5 workers per batch
Batch size: 50 works per batch
Batch delay: 1 second between batches
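
These settings determine how many image batches a run needs. The sketch below groups them in a hypothetical `DownloadSettings` class and reproduces the 145 batches that appear in the example output under Monitoring. It assumes the delay is applied only between consecutive batches, not after the last one.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DownloadSettings:
    markdown_workers: int = 10
    image_workers: int = 5
    batch_size: int = 50
    batch_delay_s: float = 1.0

    def batch_count(self, total_works: int) -> int:
        """Number of image batches, rounding the last partial batch up."""
        return math.ceil(total_works / self.batch_size)

    def total_delay_s(self, total_works: int) -> float:
        """Time spent sleeping: one delay between each pair of batches."""
        return max(self.batch_count(total_works) - 1, 0) * self.batch_delay_s


settings = DownloadSettings()
print(settings.batch_count(7202))    # 145
print(settings.total_delay_s(7202))  # 144.0
```

The delays therefore add under three minutes to a full run, so the image download time is dominated by the transfers themselves.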

Error Handling

Individual file failures don't stop the process
Progress is reported every 500 files
First 3 errors per work are logged
Graceful degradation on network issues
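
The reporting rules above could be implemented with a small helper like the one below. This is a sketch under assumptions: `DownloadReporter` and its method names are hypothetical, and the message wording only approximates the example output shown under Monitoring.

```python
import threading
from collections import defaultdict
from typing import Callable, DefaultDict

MAX_ERRORS_PER_WORK = 3
PROGRESS_EVERY = 500


class DownloadReporter:
    """Counts results, caps error logging per work, reports every 500 files."""

    def __init__(self, total: int, log: Callable[[str], None] = print) -> None:
        self.total = total
        self.log = log
        self.done = 0
        self.failed = 0
        self._errors: DefaultDict[str, int] = defaultdict(int)
        self._lock = threading.Lock()  # workers may report concurrently

    def success(self) -> None:
        with self._lock:
            self.done += 1
            self._maybe_report()

    def failure(self, work_id: str, error: Exception) -> None:
        with self._lock:
            self.failed += 1
            self._errors[work_id] += 1
            # Log only the first few errors for each work to limit noise.
            if self._errors[work_id] <= MAX_ERRORS_PER_WORK:
                self.log(f"Failed in {work_id}: {error}")
            self._maybe_report()

    def _maybe_report(self) -> None:
        # Called with the lock held.
        if (self.done + self.failed) % PROGRESS_EVERY == 0:
            self.log(f"Downloaded {self.done}/{self.total} files (failed: {self.failed})")
```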

Monitoring

The system provides detailed logging:

File discovery progress
Phase 1 completion (markdown files)
Phase 2 progress (images by batch)
Final statistics

Example output:

🔍 Discovering files in dataset...
📊 Found 7202 work directories to download
📄 Phase 1: Downloading markdown files only...
📄 Downloaded 500/7202 markdown files (failed: 0)
✅ Phase 1 complete: 7202 markdown files downloaded, 0 failed
🖼️  Phase 2: Downloading images in batches...
🖼️  Processing image batch 1/145 (50 works)
✅ Phase 2 complete: 45000 images downloaded, 12 failed
✅ Successfully downloaded markdown dataset to /data/markdown_cache/works

Benefits

Speed: Expected to be 10-50x faster than the original approach
Reliability: Better error handling and recovery
Monitoring: Clear progress reporting
Flexibility: Can be triggered via API, environment variable, or test script
Resumable: Can be restarted if interrupted
Server-friendly: Respects rate limits and server resources

This optimization is intended to turn the markdown download from a 24+ hour process into one where the markdown files arrive within minutes and the images follow in manageable batches.