# ArteFact Markdown Download Optimization

## Problem

The original markdown download process was extremely slow, taking over 24 hours to download 7,202 work directories with their associated images. The process was:

- **Sequential**: Downloading one work directory at a time
- **Inefficient**: Downloading both markdown files and images together
- **No parallelization**: Single-threaded approach
- **Rate**: ~112 directories per hour (extremely slow)

## Solution: Optimized Parallel Download

### Key Improvements

1. **Two-Phase Download**:
   - **Phase 1**: Download only markdown files in parallel (fast)
   - **Phase 2**: Download images in batches (manageable)
2. **Parallel Processing**:
   - **Markdown files**: 10 concurrent downloads
   - **Images**: 5 concurrent downloads per batch
   - **Batch processing**: 50 works per batch for images
3. **Smart Error Handling**:
   - Graceful failure handling
   - Progress reporting every 500 files
   - Limited error spam (only the first 3 errors per work)
4. **Server-Friendly**:
   - Small delays between batches
   - Reasonable concurrency limits
   - Respectful of Hugging Face rate limits

### Performance Expectations

- **Markdown files**: Should complete in minutes (not hours)
- **Images**: Will take longer, but in manageable batches
- **Overall**: 10-50x faster than the original approach
- **Resumable**: Can be interrupted and restarted

## New API Endpoints

### 1. `/cache/optimized-download` (POST)

Starts the optimized download process with parallel processing.

**Response**:

```json
{
  "message": "Optimized download completed successfully",
  "cache_info": {
    "exists": true,
    "work_count": 7202,
    "size_gb": 15.2,
    "file_count": 45000
  }
}
```
### 2. Existing Endpoints

- `/cache/info` (GET): Get cache information
- `/cache/clear` (POST): Clear the cache
- `/cache/refresh` (POST): Force refresh (uses the optimized approach)

## Usage

### Option 1: Via API

```bash
# Start optimized download
curl -X POST http://localhost:7860/cache/optimized-download

# Check progress
curl http://localhost:7860/cache/info
```

### Option 2: Via Environment Variable

```bash
# Force full download on startup
FORCE_FULL_DOWNLOAD=true python -m backend.runner.app
```

### Option 3: Via Test Script

```bash
python test_optimized_download.py
```

## Technical Details

### File Structure

```
/data/markdown_cache/
└── works/
    ├── W1009740230/
    │   ├── W1009740230.md
    │   └── images/
    │       ├── image-001.png
    │       └── image-002.png
    └── W1014119368/
        ├── W1014119368.md
        └── images/
            └── image-001.png
```

### Concurrency Settings

- **Markdown downloads**: 10 workers
- **Image downloads**: 5 workers per batch
- **Batch size**: 50 works per batch
- **Batch delay**: 1 second between batches

### Error Handling

- Individual file failures don't stop the process
- Progress is reported every 500 files
- The first 3 errors per work are logged
- Graceful degradation on network issues

## Monitoring

The system provides detailed logging:

- File discovery progress
- Phase 1 completion (markdown files)
- Phase 2 progress (images by batch)
- Final statistics

Example output:

```
🔍 Discovering files in dataset...
📊 Found 7202 work directories to download
📄 Phase 1: Downloading markdown files only...
📄 Downloaded 500/7202 markdown files (failed: 0)
✅ Phase 1 complete: 7202 markdown files downloaded, 0 failed
🖼️ Phase 2: Downloading images in batches...
🖼️ Processing image batch 1/145 (50 works)
✅ Phase 2 complete: 45000 images downloaded, 12 failed
✅ Successfully downloaded markdown dataset to /data/markdown_cache/works
```

## Benefits

1. **Speed**: 10-50x faster than the original approach
2. **Reliability**: Better error handling and recovery
3. **Monitoring**: Clear progress reporting
4. **Flexibility**: Can be triggered via API or environment variable
5. **Resumable**: Can be restarted if interrupted
6. **Server-friendly**: Respects rate limits and server resources

This optimization transforms the markdown download from a 24+ hour process into a manageable task that completes in a reasonable timeframe.
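For callers who prefer Python over `curl`, the API option above can also be driven programmatically. This is a standard-library-only sketch; the endpoints come from this document, while the default base URL (`http://localhost:7860`, as used in the `curl` examples) is a parameter so it can point at any deployment.

```python
import json
from urllib import request

def trigger_optimized_download(base_url="http://localhost:7860"):
    """POST /cache/optimized-download and return the parsed JSON response."""
    req = request.Request(f"{base_url}/cache/optimized-download", method="POST")
    with request.urlopen(req) as resp:
        return json.load(resp)

def get_cache_info(base_url="http://localhost:7860"):
    """GET /cache/info to check cache state and download progress."""
    with request.urlopen(f"{base_url}/cache/info") as resp:
        return json.load(resp)
```

Because the download is resumable, `trigger_optimized_download()` can simply be called again after an interruption, with `get_cache_info()` used to confirm the final `work_count`.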