Add initial implementation for Phase 0 validation of HF Inference API with vision models
Browse files- Introduced a new test script `test_phase0_hf_vision_api.py` to validate multimodal support for vision models.
- Implemented functions to encode images to base64, test models with base64 and file path inputs, and handle OCR models.
- Configured logging for detailed output during testing.
- Added a sample test image `test_image_red_square.jpg` for validation purposes.
- Established a decision gate to proceed to Phase 1 based on test results.
- CHANGELOG.md +129 -0
- PLAN.md +37 -15
- README.md +2 -0
- output/phase0_vision_validation_20260107_174146.json +33 -0
- output/phase0_vision_validation_20260107_174401.json +78 -0
- output/phase0_vision_validation_20260107_182113.json +70 -0
- output/phase0_vision_validation_20260107_182155.json +54 -0
- output/phase0_vision_validation_20260107_183155.json +63 -0
- output/phase0_vision_validation_20260107_184839.json +45 -0
- output/phase0_vision_validation_20260111_162124.json +86 -0
- output/phase0_vision_validation_20260111_163647.json +17 -0
- output/phase0_vision_validation_20260111_164531.json +29 -0
- output/phase0_vision_validation_20260111_164945.json +29 -0
- test/fixtures/{test_image.jpg → test_image_red_square.jpg} +0 -0
- test/test_phase0_hf_vision_api.py +378 -0
CHANGELOG.md
CHANGED
|
@@ -1,5 +1,134 @@
|
|
| 1 |
# Session Changelog
|
| 2 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 3 |
## [2026-01-06] [Plan Revision] [COMPLETED] HuggingFace Vision Integration Plan - Corrected Architecture
|
| 4 |
|
| 5 |
**Problem:** Initial plan had critical gaps that would waste implementation time:
|
|
|
|
| 1 |
# Session Changelog
|
| 2 |
|
| 3 |
+
## [2026-01-07] [Phase 0: API Validation] [COMPLETED] HF Inference Vision Support - GO Decision
|
| 4 |
+
|
| 5 |
+
**Problem:** Needed to validate HF Inference API supports vision models before implementation.
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
### Knowledge Updates
|
| 10 |
+
|
| 11 |
+
**Solution - Phase 0 Validation Results:**
|
| 12 |
+
|
| 13 |
+
**✅ GO Decision - Proceed to Phase 1**
|
| 14 |
+
|
| 15 |
+
**Final Working Model (Recommended):**
|
| 16 |
+
|
| 17 |
+
- **CohereLabs/aya-vision-32b** (32B params, Cohere provider) - ✅ **PRODUCTION READY**
|
| 18 |
+
- Handles small images (1KB base64): ~1-3 seconds
|
| 19 |
+
- Handles large images (2.8MB base64): ~10 seconds, no timeout
|
| 20 |
+
- Excellent quality: Detailed scene understanding, object identification, spatial relationships
|
| 21 |
+
- Sample response on workspace image: "The image depicts a serene workspace setup on a wooden desk...white ceramic mug filled with dark liquid...silver laptop...rolled-up paper secured with rubber band..."
|
| 22 |
+
|
| 23 |
+
**Partially Working Models (Timeout Issues with Large Images):**
|
| 24 |
+
|
| 25 |
+
1. **Qwen/Qwen3-VL-8B-Instruct** (8B params, Novita provider) - ⚠️ Conditionally working
|
| 26 |
+
- Small images (1KB): ✅ Works
|
| 27 |
+
- Large images (2.8MB): ❌ 504 Gateway Timeout (>120 seconds)
|
| 28 |
+
- Only works with models that have `?inference_provider=` in URL
|
| 29 |
+
2. **baidu/ERNIE-4.5-VL-424B-A47B-Base-PT** (424B params, Novita provider) - ⚠️ Conditionally working
|
| 30 |
+
- Small images (1KB): ✅ Works
|
| 31 |
+
- Large images (2.8MB): ❌ 504 Gateway Timeout (>120 seconds)
|
| 32 |
+
|
| 33 |
+
**Failed Models:**
|
| 34 |
+
|
| 35 |
+
1. `deepseek-ai/DeepSeek-OCR` - Not exposed via Inference API (requires local GPU)
|
| 36 |
+
- Attempted both chat_completion and image_to_text endpoints
|
| 37 |
+
- Error: "Task 'image-to-text' not supported for provider 'novita'"
|
| 38 |
+
- Solution: Must use transformers library locally (not serverless API)
|
| 39 |
+
2. `CohereLabs/command-a-vision-07-2025` - 429 rate limit (try later)
|
| 40 |
+
3. `zai-org/GLM-4.1V-9B-Thinking` - Provider doesn't support model
|
| 41 |
+
4. `microsoft/Phi-3.5-vision-instruct` - Not enabled for serverless
|
| 42 |
+
5. `meta-llama/Llama-3.2-11B-Vision-Instruct` - Not enabled for serverless
|
| 43 |
+
6. `Qwen/Qwen2-VL-72B-Instruct` - Not enabled for serverless
|
| 44 |
+
|
| 45 |
+
**Working Format:** Base64 encoding only
|
| 46 |
+
|
| 47 |
+
- ✅ Base64: Works for all providers
|
| 48 |
+
- ❌ File path (file:// URL): Failed with 400 Bad Request
|
| 49 |
+
- ❌ Direct image parameter: API doesn't support
|
| 50 |
+
|
| 51 |
+
**Critical Discovery - Large Image Handling:**
|
| 52 |
+
|
| 53 |
+
| Model | Small Image (1KB) | Large Image (2.8MB) | Recommendation |
|
| 54 |
+
|-------|-------------------|---------------------|----------------|
|
| 55 |
+
| aya-vision-32b | ✅ 1-3s | ✅ ~10s | **Use for production** |
|
| 56 |
+
| Qwen3-VL-8B-Instruct | ✅ 1-3s | ❌ >120s timeout | Use with image preprocessing |
|
| 57 |
+
| ERNIE-4.5-VL-424B | ✅ 1-3s | ❌ >120s timeout | Use with image preprocessing |
|
| 58 |
+
|
| 59 |
+
**API Behavior:**
|
| 60 |
+
|
| 61 |
+
- Response format: Standard chat completion with content field
|
| 62 |
+
- Rate limits: 429 possible (command-a-vision hit this)
|
| 63 |
+
- Errors: Clear error messages in JSON format
|
| 64 |
+
- Latency: 1-3 seconds for small images, 10 seconds for large images (aya only)
|
| 65 |
+
- Timeout: 120 seconds default (Novita times out on large images)
|
| 66 |
+
|
| 67 |
+
**Key Learning - Inference Provider Pattern:**
|
| 68 |
+
|
| 69 |
+
- Models with `?inference_provider=PROVIDER` in URL ARE accessible via serverless API
|
| 70 |
+
- Example: `huggingface.co/Qwen/Qwen3-VL-8B-Instruct?inference_provider=novita` ✅
|
| 71 |
+
- Models without provider parameter (DeepSeek-OCR) require local deployment
|
| 72 |
+
|
| 73 |
+
**Recommendation for Phase 1:**
|
| 74 |
+
|
| 75 |
+
- Primary: `CohereLabs/aya-vision-32b` (handles all image sizes, Cohere provider reliable)
|
| 76 |
+
- Format: Base64 encode images in messages array
|
| 77 |
+
- Consider image preprocessing (resize/compress) for non-Cohere providers
|
| 78 |
+
- Set 120+ second timeouts for large images
|
| 79 |
+
|
| 80 |
+
**HF Pro Account Context:**
|
| 81 |
+
|
| 82 |
+
- Free accounts: $0.10/month credits, NO pay-as-you-go
|
| 83 |
+
- Pro accounts ($9/mo): $2.00/month credits, CAN use pay-as-you-go after credits
|
| 84 |
+
- Provider charges pass-through directly (no HF markup)
|
| 85 |
+
- Pro required for production workloads with uninterrupted access
|
| 86 |
+
|
| 87 |
+
**Next Steps:**
|
| 88 |
+
|
| 89 |
+
- Phase 1: Implement `analyze_image_hf()` using aya-vision-32b
|
| 90 |
+
- Phase 1: Fix vision tool routing to respect `LLM_PROVIDER`
|
| 91 |
+
- Phase 1: Add image preprocessing for large files (resize if >1MB)
|
| 92 |
+
|
| 93 |
+
**Test Images:**
|
| 94 |
+
|
| 95 |
+
- `test/fixtures/test_image_red_square.jpg` - Simple test image (825 bytes)
|
| 96 |
+
- `test/fixtures/test_image_real.png` - Complex workspace photo (2.1MB file, 2.8MB base64)
|
| 97 |
+
|
| 98 |
+
---
|
| 99 |
+
|
| 100 |
+
### Code Changes
|
| 101 |
+
|
| 102 |
+
**Modified Files:**
|
| 103 |
+
|
| 104 |
+
- **test/test_phase0_hf_vision_api.py** (NEW - ~400 lines)
|
| 105 |
+
- Phase 0 validation script
|
| 106 |
+
- Tests multiple vision models
|
| 107 |
+
- Tests multiple image formats
|
| 108 |
+
- Exports results to JSON
|
| 109 |
+
- OCR model testing support (image_to_text endpoint)
|
| 110 |
+
|
| 111 |
+
**Output Files:**
|
| 112 |
+
|
| 113 |
+
- **output/phase0_vision_validation_20260107_174401.json** - Initial test (red square image)
|
| 114 |
+
- **output/phase0_vision_validation_20260107_174146.json** - First attempt (no models worked)
|
| 115 |
+
- **output/phase0_vision_validation_20260107_182113.json** - DeepSeek-OCR test
|
| 116 |
+
- **output/phase0_vision_validation_20260107_182155.json** - Qwen3-VL discovery
|
| 117 |
+
- **output/phase0_vision_validation_20260107_184839.json** - Real image test (workspace photo)
|
| 118 |
+
|
| 119 |
+
**Next Steps:**
|
| 120 |
+
|
| 121 |
+
- Phase 1: Implement `analyze_image_hf()` using aya-vision-32b
|
| 122 |
+
- Phase 1: Fix vision tool routing to respect `LLM_PROVIDER`
|
| 123 |
+
- Phase 1: Add image preprocessing for large files (resize if >1MB)
|
| 124 |
+
|
| 125 |
+
**Test Images:**
|
| 126 |
+
|
| 127 |
+
- `test/fixtures/test_image_red_square.jpg` - Simple test image (825 bytes)
|
| 128 |
+
- `test/fixtures/test_image_real.png` - Complex workspace photo (2.1MB file, 2.8MB base64)
|
| 129 |
+
|
| 130 |
+
---
|
| 131 |
+
|
| 132 |
## [2026-01-06] [Plan Revision] [COMPLETED] HuggingFace Vision Integration Plan - Corrected Architecture
|
| 133 |
|
| 134 |
**Problem:** Initial plan had critical gaps that would waste implementation time:
|
PLAN.md
CHANGED
|
@@ -22,6 +22,7 @@ Fix LLM selection routing so UI provider selection propagates to ALL tools (plan
|
|
| 22 |
**Changes needed:**
|
| 23 |
|
| 24 |
1. **Vision tool (src/tools/vision.py):**
|
|
|
|
| 25 |
- Add `analyze_image_hf()` function for HuggingFace multimodal models
|
| 26 |
- Modify `analyze_image()` to check `os.getenv("LLM_PROVIDER")`
|
| 27 |
- Route to correct provider: `gemini`, `huggingface`, `groq`, `claude`
|
|
@@ -44,12 +45,14 @@ Fix LLM selection routing so UI provider selection propagates to ALL tools (plan
|
|
| 44 |
**Candidate models:**
|
| 45 |
|
| 46 |
1. **Qwen/Qwen2-VL-72B-Instruct** (Recommended)
|
|
|
|
| 47 |
- 72B parameters, vision-language model
|
| 48 |
- Supports: images, video, text
|
| 49 |
- API: HuggingFace Inference API (paid tier)
|
| 50 |
- Format: Base64 image + text prompt
|
| 51 |
|
| 52 |
2. **meta-llama/Llama-3.2-90B-Vision-Instruct**
|
|
|
|
| 53 |
- 90B parameters, multimodal
|
| 54 |
- Supports: images + text
|
| 55 |
- API: HuggingFace Inference API
|
|
@@ -84,10 +87,12 @@ Fix LLM selection routing so UI provider selection propagates to ALL tools (plan
|
|
| 84 |
**Tools available:**
|
| 85 |
|
| 86 |
1. **BLIP-2** (Salesforce/blip2-opt-2.7b)
|
|
|
|
| 87 |
- Image captioning model
|
| 88 |
- Converts image → text description
|
| 89 |
|
| 90 |
2. **LLaVA** (llava-hf/llava-1.5-7b-hf)
|
|
|
|
| 91 |
- Vision-language assistant
|
| 92 |
- Image → detailed text
|
| 93 |
|
|
@@ -167,19 +172,28 @@ Fix LLM selection routing so UI provider selection propagates to ALL tools (plan
|
|
| 167 |
- **Option D:** Local transformers library (no API)
|
| 168 |
- **Option E:** Hybrid (HF text + Gemini/Claude vision only)
|
| 169 |
|
|
|
|
|
|
|
| 170 |
---
|
| 171 |
|
| 172 |
-
### Phase 1: HuggingFace Vision Implementation
|
| 173 |
|
| 174 |
**Goal:** Implement `analyze_image_hf()` using validated API pattern
|
| 175 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 176 |
#### Step 1.1: Implement `analyze_image_hf()` in vision.py
|
| 177 |
|
| 178 |
- [ ] Add function signature matching existing pattern
|
| 179 |
-
- [ ] Use validated
|
| 180 |
-
- [ ] Format
|
| 181 |
-
- [ ] Add retry logic with exponential backoff
|
| 182 |
- [ ] Handle API errors with clear error messages
|
|
|
|
| 183 |
- [ ] **NO fallback logic** - fail loudly for debugging
|
| 184 |
|
| 185 |
#### Step 1.2: Fix Vision Tool Routing (NO FALLBACKS)
|
|
@@ -202,9 +216,9 @@ Fix LLM selection routing so UI provider selection propagates to ALL tools (plan
|
|
| 202 |
|
| 203 |
#### Step 1.3: Update Configuration
|
| 204 |
|
| 205 |
-
- [ ] Add `HF_VISION_MODEL` to .env (
|
| 206 |
- [ ] Update `src/config/settings.py` with vision model setting
|
| 207 |
-
- [ ] Document
|
| 208 |
|
| 209 |
---
|
| 210 |
|
|
@@ -273,10 +287,10 @@ Fix LLM selection routing so UI provider selection propagates to ALL tools (plan
|
|
| 273 |
|
| 274 |
- [ ] Document per-provider results:
|
| 275 |
|
| 276 |
-
| Provider
|
| 277 |
-
|
| 278 |
-
| HuggingFace (Phi-3.5) | 8/8 attempted
|
| 279 |
-
| Gemini (baseline)
|
| 280 |
|
| 281 |
#### Step 3.4: Decision Gate - Optimization Decision
|
| 282 |
|
|
@@ -352,12 +366,14 @@ Fix LLM selection routing so UI provider selection propagates to ALL tools (plan
|
|
| 352 |
### Phase 0-1: Core Vision Integration
|
| 353 |
|
| 354 |
1. **src/tools/vision.py** (~150 lines added/modified)
|
|
|
|
| 355 |
- Add `analyze_image_hf()` function (Phase 1)
|
| 356 |
- Modify `analyze_image()` routing logic - NO FALLBACKS (Phase 1)
|
| 357 |
- Add retry logic with exponential backoff
|
| 358 |
- Clear error messages for debugging
|
| 359 |
|
| 360 |
2. **.env** (~3 lines added)
|
|
|
|
| 361 |
- Add `HF_VISION_MODEL=microsoft/Phi-3.5-vision-instruct` (start small)
|
| 362 |
- Document alternatives: Llama-3.2-11B-Vision, Qwen2-VL-72B
|
| 363 |
|
|
@@ -368,6 +384,7 @@ Fix LLM selection routing so UI provider selection propagates to ALL tools (plan
|
|
| 368 |
### Phase 2-3: Testing Infrastructure
|
| 369 |
|
| 370 |
1. **test/test_vision_smoke.py** (NEW - ~100 lines)
|
|
|
|
| 371 |
- Smoke test suite: description, OCR, counting, single GAIA
|
| 372 |
- Export individual test results
|
| 373 |
|
|
@@ -378,14 +395,17 @@ Fix LLM selection routing so UI provider selection propagates to ALL tools (plan
|
|
| 378 |
### Phase 4: Media Processing
|
| 379 |
|
| 380 |
1. **src/tools/youtube.py** (NEW - ~80 lines)
|
|
|
|
| 381 |
- YouTube transcript extraction
|
| 382 |
- Use `youtube-transcript-api`
|
| 383 |
|
| 384 |
2. **src/tools/audio.py** (NEW - ~80 lines)
|
|
|
|
| 385 |
- Audio transcription (Whisper or HF audio models)
|
| 386 |
- Convert audio → text
|
| 387 |
|
| 388 |
-
3. **src/tools
|
|
|
|
| 389 |
- Register new tools: youtube_transcript, audio_transcribe
|
| 390 |
|
| 391 |
4. **requirements.txt** (~3 lines)
|
|
@@ -395,9 +415,9 @@ Fix LLM selection routing so UI provider selection propagates to ALL tools (plan
|
|
| 395 |
### Phase 6: Documentation
|
| 396 |
|
| 397 |
1. **README.md** (~30 lines modified)
|
| 398 |
-
|
| 399 |
-
|
| 400 |
-
|
| 401 |
|
| 402 |
## Success Criteria
|
| 403 |
|
|
@@ -475,11 +495,13 @@ If Phase 0 reveals HF Inference API doesn't support vision:
|
|
| 475 |
## Phase 0 Research Questions (Answer These First)
|
| 476 |
|
| 477 |
1. **Does HF Inference API support vision models?**
|
|
|
|
| 478 |
- Test Phi-3.5-vision-instruct with simple image
|
| 479 |
- Test Llama-3.2-11B-Vision-Instruct
|
| 480 |
- Test Qwen2-VL-72B-Instruct
|
| 481 |
|
| 482 |
2. **What's the image input format?**
|
|
|
|
| 483 |
- Base64 encoding in messages?
|
| 484 |
- Direct URL support?
|
| 485 |
- File path support?
|
|
@@ -493,7 +515,7 @@ If Phase 0 reveals HF Inference API doesn't support vision:
|
|
| 493 |
|
| 494 |
**Phase 0 starts with:**
|
| 495 |
|
| 496 |
-
1. Research HF Inference API documentation for vision support
|
| 497 |
2. Test simple vision API call with Phi-3.5-vision-instruct
|
| 498 |
3. Document working pattern or confirm API doesn't support vision
|
| 499 |
4. Decision gate: GO to Phase 1 or pivot to backup options
|
|
|
|
| 22 |
**Changes needed:**
|
| 23 |
|
| 24 |
1. **Vision tool (src/tools/vision.py):**
|
| 25 |
+
|
| 26 |
- Add `analyze_image_hf()` function for HuggingFace multimodal models
|
| 27 |
- Modify `analyze_image()` to check `os.getenv("LLM_PROVIDER")`
|
| 28 |
- Route to correct provider: `gemini`, `huggingface`, `groq`, `claude`
|
|
|
|
| 45 |
**Candidate models:**
|
| 46 |
|
| 47 |
1. **Qwen/Qwen2-VL-72B-Instruct** (Recommended)
|
| 48 |
+
|
| 49 |
- 72B parameters, vision-language model
|
| 50 |
- Supports: images, video, text
|
| 51 |
- API: HuggingFace Inference API (paid tier)
|
| 52 |
- Format: Base64 image + text prompt
|
| 53 |
|
| 54 |
2. **meta-llama/Llama-3.2-90B-Vision-Instruct**
|
| 55 |
+
|
| 56 |
- 90B parameters, multimodal
|
| 57 |
- Supports: images + text
|
| 58 |
- API: HuggingFace Inference API
|
|
|
|
| 87 |
**Tools available:**
|
| 88 |
|
| 89 |
1. **BLIP-2** (Salesforce/blip2-opt-2.7b)
|
| 90 |
+
|
| 91 |
- Image captioning model
|
| 92 |
- Converts image → text description
|
| 93 |
|
| 94 |
2. **LLaVA** (llava-hf/llava-1.5-7b-hf)
|
| 95 |
+
|
| 96 |
- Vision-language assistant
|
| 97 |
- Image → detailed text
|
| 98 |
|
|
|
|
| 172 |
- **Option D:** Local transformers library (no API)
|
| 173 |
- **Option E:** Hybrid (HF text + Gemini/Claude vision only)
|
| 174 |
|
| 175 |
+
**Phase 0 Status:** ✅ COMPLETED - See CHANGELOG.md for results
|
| 176 |
+
|
| 177 |
---
|
| 178 |
|
| 179 |
+
### Phase 1: HuggingFace Vision Implementation
|
| 180 |
|
| 181 |
**Goal:** Implement `analyze_image_hf()` using validated API pattern
|
| 182 |
|
| 183 |
+
**Validated from Phase 0:**
|
| 184 |
+
|
| 185 |
+
- Model: `CohereLabs/aya-vision-32b` (Cohere provider)
|
| 186 |
+
- Format: Base64 encoding in messages array
|
| 187 |
+
- Timeout: 120+ seconds for large images
|
| 188 |
+
|
| 189 |
#### Step 1.1: Implement `analyze_image_hf()` in vision.py
|
| 190 |
|
| 191 |
- [ ] Add function signature matching existing pattern
|
| 192 |
+
- [ ] Use **CohereLabs/aya-vision-32b** (validated from Phase 0)
|
| 193 |
+
- [ ] Format: Base64 encode images in messages array
|
| 194 |
+
- [ ] Add retry logic with exponential backoff (3 attempts)
|
| 195 |
- [ ] Handle API errors with clear error messages
|
| 196 |
+
- [ ] Set 120s timeout for large images
|
| 197 |
- [ ] **NO fallback logic** - fail loudly for debugging
|
| 198 |
|
| 199 |
#### Step 1.2: Fix Vision Tool Routing (NO FALLBACKS)
|
|
|
|
| 216 |
|
| 217 |
#### Step 1.3: Update Configuration
|
| 218 |
|
| 219 |
+
- [ ] Add `HF_VISION_MODEL=CohereLabs/aya-vision-32b` to .env (validated from Phase 0)
|
| 220 |
- [ ] Update `src/config/settings.py` with vision model setting
|
| 221 |
+
- [ ] Document alternatives (Qwen/Qwen3-VL-8B-Instruct for small images only)
|
| 222 |
|
| 223 |
---
|
| 224 |
|
|
|
|
| 287 |
|
| 288 |
- [ ] Document per-provider results:
|
| 289 |
|
| 290 |
+
| Provider | Vision Questions | Accuracy | Notes |
|
| 291 |
+
| --------------------- | ---------------- | -------- | -------------- |
|
| 292 |
+
| HuggingFace (Phi-3.5) | 8/8 attempted | X% | [observations] |
|
| 293 |
+
| Gemini (baseline) | 8/8 attempted | Y% | [comparison] |
|
| 294 |
|
| 295 |
#### Step 3.4: Decision Gate - Optimization Decision
|
| 296 |
|
|
|
|
| 366 |
### Phase 0-1: Core Vision Integration
|
| 367 |
|
| 368 |
1. **src/tools/vision.py** (~150 lines added/modified)
|
| 369 |
+
|
| 370 |
- Add `analyze_image_hf()` function (Phase 1)
|
| 371 |
- Modify `analyze_image()` routing logic - NO FALLBACKS (Phase 1)
|
| 372 |
- Add retry logic with exponential backoff
|
| 373 |
- Clear error messages for debugging
|
| 374 |
|
| 375 |
2. **.env** (~3 lines added)
|
| 376 |
+
|
| 377 |
- Add `HF_VISION_MODEL=microsoft/Phi-3.5-vision-instruct` (start small)
|
| 378 |
- Document alternatives: Llama-3.2-11B-Vision, Qwen2-VL-72B
|
| 379 |
|
|
|
|
| 384 |
### Phase 2-3: Testing Infrastructure
|
| 385 |
|
| 386 |
1. **test/test_vision_smoke.py** (NEW - ~100 lines)
|
| 387 |
+
|
| 388 |
- Smoke test suite: description, OCR, counting, single GAIA
|
| 389 |
- Export individual test results
|
| 390 |
|
|
|
|
| 395 |
### Phase 4: Media Processing
|
| 396 |
|
| 397 |
1. **src/tools/youtube.py** (NEW - ~80 lines)
|
| 398 |
+
|
| 399 |
- YouTube transcript extraction
|
| 400 |
- Use `youtube-transcript-api`
|
| 401 |
|
| 402 |
2. **src/tools/audio.py** (NEW - ~80 lines)
|
| 403 |
+
|
| 404 |
- Audio transcription (Whisper or HF audio models)
|
| 405 |
- Convert audio → text
|
| 406 |
|
| 407 |
+
3. **src/tools/**init**.py** (~10 lines)
|
| 408 |
+
|
| 409 |
- Register new tools: youtube_transcript, audio_transcribe
|
| 410 |
|
| 411 |
4. **requirements.txt** (~3 lines)
|
|
|
|
| 415 |
### Phase 6: Documentation
|
| 416 |
|
| 417 |
1. **README.md** (~30 lines modified)
|
| 418 |
+
- Document HF vision support
|
| 419 |
+
- List model options and selection strategy
|
| 420 |
+
- Update architecture diagram with media processing tools
|
| 421 |
|
| 422 |
## Success Criteria
|
| 423 |
|
|
|
|
| 495 |
## Phase 0 Research Questions (Answer These First)
|
| 496 |
|
| 497 |
1. **Does HF Inference API support vision models?**
|
| 498 |
+
|
| 499 |
- Test Phi-3.5-vision-instruct with simple image
|
| 500 |
- Test Llama-3.2-11B-Vision-Instruct
|
| 501 |
- Test Qwen2-VL-72B-Instruct
|
| 502 |
|
| 503 |
2. **What's the image input format?**
|
| 504 |
+
|
| 505 |
- Base64 encoding in messages?
|
| 506 |
- Direct URL support?
|
| 507 |
- File path support?
|
|
|
|
| 515 |
|
| 516 |
**Phase 0 starts with:**
|
| 517 |
|
| 518 |
+
1. ==Research HF Inference API documentation for vision support==
|
| 519 |
2. Test simple vision API call with Phi-3.5-vision-instruct
|
| 520 |
3. Document working pattern or confirm API doesn't support vision
|
| 521 |
4. Decision gate: GO to Phase 1 or pivot to backup options
|
README.md
CHANGED
|
@@ -403,6 +403,7 @@ When /update-dev runs:
|
|
| 403 |
**Phase 1: Current State (What's happening NOW)**
|
| 404 |
|
| 405 |
1. **Read workspace files:**
|
|
|
|
| 406 |
- `CHANGELOG.md` - Active session changes (reverse chronological, newest first)
|
| 407 |
- `PLAN.md` - Current implementation plan (if exists)
|
| 408 |
- `TODO.md` - Active task tracking (if exists)
|
|
@@ -427,6 +428,7 @@ When /update-dev runs:
|
|
| 427 |
**Phase 3: Project Structure (How it works)**
|
| 428 |
|
| 429 |
4. **Read README.md sections in order:**
|
|
|
|
| 430 |
- Section 1: Overview (purpose, objectives)
|
| 431 |
- Section 2: Architecture (tech stack, components, diagrams)
|
| 432 |
- Section 3: Specification (current state, workflows, requirements)
|
|
|
|
| 403 |
**Phase 1: Current State (What's happening NOW)**
|
| 404 |
|
| 405 |
1. **Read workspace files:**
|
| 406 |
+
|
| 407 |
- `CHANGELOG.md` - Active session changes (reverse chronological, newest first)
|
| 408 |
- `PLAN.md` - Current implementation plan (if exists)
|
| 409 |
- `TODO.md` - Active task tracking (if exists)
|
|
|
|
| 428 |
**Phase 3: Project Structure (How it works)**
|
| 429 |
|
| 430 |
4. **Read README.md sections in order:**
|
| 431 |
+
|
| 432 |
- Section 1: Overview (purpose, objectives)
|
| 433 |
- Section 2: Architecture (tech stack, components, diagrams)
|
| 434 |
- Section 3: Specification (current state, workflows, requirements)
|
output/phase0_vision_validation_20260107_174146.json
ADDED
|
@@ -0,0 +1,33 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"total_tests": 3,
|
| 3 |
+
"successful": 0,
|
| 4 |
+
"failed": 3,
|
| 5 |
+
"working_models": [],
|
| 6 |
+
"working_formats": [],
|
| 7 |
+
"results": [
|
| 8 |
+
{
|
| 9 |
+
"model": "microsoft/Phi-3.5-vision-instruct",
|
| 10 |
+
"format": "base64",
|
| 11 |
+
"question": "What is in this image?",
|
| 12 |
+
"status": "failed",
|
| 13 |
+
"response": null,
|
| 14 |
+
"error": "(Request ID: Root=1-695e8cc9-10fc913b2c5bd9646e264dbc;f037df6f-d7d9-450e-9004-ed2373079cd1)\n\nBad request:\n{'message': \"The requested model 'microsoft/Phi-3.5-vision-instruct' is not supported by any provider you have enabled.\", 'type': 'invalid_request_error', 'param': 'model', 'code': 'model_not_supported'}"
|
| 15 |
+
},
|
| 16 |
+
{
|
| 17 |
+
"model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
|
| 18 |
+
"format": "base64",
|
| 19 |
+
"question": "What is in this image?",
|
| 20 |
+
"status": "failed",
|
| 21 |
+
"response": null,
|
| 22 |
+
"error": "Client error '404 Not Found' for url 'https://router.huggingface.co/v1/chat/completions' (Request ID: oSMo8MM-2kFHot-9ba4e78d4a58ffbc)\nFor more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404\n\n{'message': 'Unable to access model meta-llama/Llama-3.2-11B-Vision-Instruct. Please visit https://api.together.ai/models to view the list of supported models.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_available'}"
|
| 23 |
+
},
|
| 24 |
+
{
|
| 25 |
+
"model": "Qwen/Qwen2-VL-72B-Instruct",
|
| 26 |
+
"format": "base64",
|
| 27 |
+
"question": "What is in this image?",
|
| 28 |
+
"status": "failed",
|
| 29 |
+
"response": null,
|
| 30 |
+
"error": "(Request ID: Root=1-695e8cca-76332aa653509ea749a10232;3e104e95-9f53-4571-8dd4-7122c99800d5)\n\nBad request:"
|
| 31 |
+
}
|
| 32 |
+
]
|
| 33 |
+
}
|
output/phase0_vision_validation_20260107_174401.json
ADDED
|
@@ -0,0 +1,78 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"total_tests": 8,
|
| 3 |
+
"successful": 2,
|
| 4 |
+
"failed": 6,
|
| 5 |
+
"working_models": [
|
| 6 |
+
"CohereLabs/aya-vision-32b",
|
| 7 |
+
"baidu/ERNIE-4.5-VL-424B-A47B-Base-PT"
|
| 8 |
+
],
|
| 9 |
+
"working_formats": [
|
| 10 |
+
"base64"
|
| 11 |
+
],
|
| 12 |
+
"results": [
|
| 13 |
+
{
|
| 14 |
+
"model": "CohereLabs/command-a-vision-07-2025",
|
| 15 |
+
"format": "base64",
|
| 16 |
+
"question": "What is in this image?",
|
| 17 |
+
"status": "failed",
|
| 18 |
+
"response": null,
|
| 19 |
+
"error": "(Request ID: Root=1-695e8d49-1b3bcbe670ba9df15d6d2c42;ef8bca12-16e4-429d-9fb8-36d160e3a272)\n\n429 Too Many Requests for url: https://router.huggingface.co/v1/chat/completions."
|
| 20 |
+
},
|
| 21 |
+
{
|
| 22 |
+
"model": "CohereLabs/aya-vision-32b",
|
| 23 |
+
"format": "base64",
|
| 24 |
+
"question": "What is in this image?",
|
| 25 |
+
"status": "success",
|
| 26 |
+
"response": "The image is a solid red square with no additional details or objects within it. The color is vibrant and uniform across the entire frame.",
|
| 27 |
+
"error": null
|
| 28 |
+
},
|
| 29 |
+
{
|
| 30 |
+
"model": "CohereLabs/aya-vision-32b",
|
| 31 |
+
"format": "file_path",
|
| 32 |
+
"question": "What is in this image?",
|
| 33 |
+
"status": "failed",
|
| 34 |
+
"response": null,
|
| 35 |
+
"error": "(Request ID: Root=1-695e8d4a-0a03ab902bce96f455386eef;a6cae202-9058-4837-9c9b-afe475360b65)\n\nBad request:"
|
| 36 |
+
},
|
| 37 |
+
{
|
| 38 |
+
"model": "CohereLabs/aya-vision-32b",
|
| 39 |
+
"format": "direct_image",
|
| 40 |
+
"question": "What is in this image?",
|
| 41 |
+
"status": "failed",
|
| 42 |
+
"response": null,
|
| 43 |
+
"error": "InferenceClient.chat_completion() got an unexpected keyword argument 'message'"
|
| 44 |
+
},
|
| 45 |
+
{
|
| 46 |
+
"model": "zai-org/GLM-4.1V-9B-Thinking",
|
| 47 |
+
"format": "base64",
|
| 48 |
+
"question": "What is in this image?",
|
| 49 |
+
"status": "failed",
|
| 50 |
+
"response": null,
|
| 51 |
+
"error": "(Request ID: Root=1-695e8d4a-1b9a5cc8212823c92840be83;cf83885e-1bad-4acb-9057-71b5d28fc401)\n\nBad request:\n{'message': \"The requested model 'zai-org/GLM-4.1V-9B-Thinking' is not supported by provider 'zai-org'.\", 'type': 'invalid_request_error', 'param': 'model', 'code': 'model_not_supported'}"
|
| 52 |
+
},
|
| 53 |
+
{
|
| 54 |
+
"model": "baidu/ERNIE-4.5-VL-424B-A47B-Base-PT",
|
| 55 |
+
"format": "base64",
|
| 56 |
+
"question": "What is in this image?",
|
| 57 |
+
"status": "success",
|
| 58 |
+
"response": "This image is a solid red color. There are no discernible objects, shapes, or features within it\u2014just a uniform red background.",
|
| 59 |
+
"error": null
|
| 60 |
+
},
|
| 61 |
+
{
|
| 62 |
+
"model": "baidu/ERNIE-4.5-VL-424B-A47B-Base-PT",
|
| 63 |
+
"format": "file_path",
|
| 64 |
+
"question": "What is in this image?",
|
| 65 |
+
"status": "failed",
|
| 66 |
+
"response": null,
|
| 67 |
+
"error": "(Request ID: Root=1-695e8d4d-682117514be7d3b870ab0f34;44295a40-7291-4d39-a258-4763f3c74dd2)\n\nBad request:"
|
| 68 |
+
},
|
| 69 |
+
{
|
| 70 |
+
"model": "baidu/ERNIE-4.5-VL-424B-A47B-Base-PT",
|
| 71 |
+
"format": "direct_image",
|
| 72 |
+
"question": "What is in this image?",
|
| 73 |
+
"status": "failed",
|
| 74 |
+
"response": null,
|
| 75 |
+
"error": "InferenceClient.chat_completion() got an unexpected keyword argument 'message'"
|
| 76 |
+
}
|
| 77 |
+
]
|
| 78 |
+
}
|
output/phase0_vision_validation_20260107_182113.json
ADDED
|
@@ -0,0 +1,70 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"total_tests": 7,
|
| 3 |
+
"successful": 2,
|
| 4 |
+
"failed": 5,
|
| 5 |
+
"working_models": [
|
| 6 |
+
"CohereLabs/aya-vision-32b",
|
| 7 |
+
"baidu/ERNIE-4.5-VL-424B-A47B-Base-PT"
|
| 8 |
+
],
|
| 9 |
+
"working_formats": [
|
| 10 |
+
"base64"
|
| 11 |
+
],
|
| 12 |
+
"results": [
|
| 13 |
+
{
|
| 14 |
+
"model": "CohereLabs/aya-vision-32b",
|
| 15 |
+
"format": "base64",
|
| 16 |
+
"question": "What is in this image?",
|
| 17 |
+
"status": "success",
|
| 18 |
+
"response": "The image is a solid red square with no additional details or objects present. It is a uniform color throughout, and there are no variations in shade or texture. The red is vibrant and intense, filling the entire frame of the image. There are no borders or edges visible, giving the impression that the red extends infinitely in all directions. The simplicity of the image draws attention to the color itself, making it the sole focus of the viewer's gaze.",
|
| 19 |
+
"error": null
|
| 20 |
+
},
|
| 21 |
+
{
|
| 22 |
+
"model": "CohereLabs/aya-vision-32b",
|
| 23 |
+
"format": "file_path",
|
| 24 |
+
"question": "What is in this image?",
|
| 25 |
+
"status": "failed",
|
| 26 |
+
"response": null,
|
| 27 |
+
"error": "(Request ID: Root=1-695e9605-6005e15e4e97777133dd6086;ebd2d288-9e0f-4a56-898d-c63ff990db2f)\n\nBad request:"
|
| 28 |
+
},
|
| 29 |
+
{
|
| 30 |
+
"model": "CohereLabs/aya-vision-32b",
|
| 31 |
+
"format": "direct_image",
|
| 32 |
+
"question": "What is in this image?",
|
| 33 |
+
"status": "failed",
|
| 34 |
+
"response": null,
|
| 35 |
+
"error": "InferenceClient.chat_completion() got an unexpected keyword argument 'message'"
|
| 36 |
+
},
|
| 37 |
+
{
|
| 38 |
+
"model": "deepseek-ai/DeepSeek-OCR",
|
| 39 |
+
"format": "base64",
|
| 40 |
+
"question": "What is in this image?",
|
| 41 |
+
"status": "failed",
|
| 42 |
+
"response": null,
|
| 43 |
+
"error": "(Request ID: Root=1-695e9605-2ca5fcd415abf4ed4ab69c3f;02f77bac-3fee-420f-aa97-dd8c7e829619)\n\nBad request:\n{'message': \"The requested model 'deepseek-ai/DeepSeek-OCR' is not a chat model.\", 'type': 'invalid_request_error', 'param': 'model', 'code': 'model_not_supported'}"
|
| 44 |
+
},
|
| 45 |
+
{
|
| 46 |
+
"model": "baidu/ERNIE-4.5-VL-424B-A47B-Base-PT",
|
| 47 |
+
"format": "base64",
|
| 48 |
+
"question": "What is in this image?",
|
| 49 |
+
"status": "success",
|
| 50 |
+
"response": "This image is a solid red color. There are no discernible objects, patterns, or variations within the image\u2014it is uniformly red.",
|
| 51 |
+
"error": null
|
| 52 |
+
},
|
| 53 |
+
{
|
| 54 |
+
"model": "baidu/ERNIE-4.5-VL-424B-A47B-Base-PT",
|
| 55 |
+
"format": "file_path",
|
| 56 |
+
"question": "What is in this image?",
|
| 57 |
+
"status": "failed",
|
| 58 |
+
"response": null,
|
| 59 |
+
"error": "(Request ID: Root=1-695e9608-465bea4365c79b9b27ec8cd0;bb5eec23-4c50-48f6-a2e9-cc0dfc516e8f)\n\nBad request:"
|
| 60 |
+
},
|
| 61 |
+
{
|
| 62 |
+
"model": "baidu/ERNIE-4.5-VL-424B-A47B-Base-PT",
|
| 63 |
+
"format": "direct_image",
|
| 64 |
+
"question": "What is in this image?",
|
| 65 |
+
"status": "failed",
|
| 66 |
+
"response": null,
|
| 67 |
+
"error": "InferenceClient.chat_completion() got an unexpected keyword argument 'message'"
|
| 68 |
+
}
|
| 69 |
+
]
|
| 70 |
+
}
|
output/phase0_vision_validation_20260107_182155.json
ADDED
|
@@ -0,0 +1,54 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"total_tests": 5,
|
| 3 |
+
"successful": 2,
|
| 4 |
+
"failed": 3,
|
| 5 |
+
"working_models": [
|
| 6 |
+
"baidu/ERNIE-4.5-VL-424B-A47B-Base-PT",
|
| 7 |
+
"CohereLabs/aya-vision-32b"
|
| 8 |
+
],
|
| 9 |
+
"working_formats": [
|
| 10 |
+
"base64"
|
| 11 |
+
],
|
| 12 |
+
"results": [
|
| 13 |
+
{
|
| 14 |
+
"model": "CohereLabs/aya-vision-32b",
|
| 15 |
+
"format": "base64",
|
| 16 |
+
"question": "What is in this image?",
|
| 17 |
+
"status": "success",
|
| 18 |
+
"response": "The image is a solid red color with no discernible features or objects. It appears to be a uniform, flat red surface.",
|
| 19 |
+
"error": null
|
| 20 |
+
},
|
| 21 |
+
{
|
| 22 |
+
"model": "CohereLabs/aya-vision-32b",
|
| 23 |
+
"format": "file_path",
|
| 24 |
+
"question": "What is in this image?",
|
| 25 |
+
"status": "failed",
|
| 26 |
+
"response": null,
|
| 27 |
+
"error": "(Request ID: Root=1-695e962e-5d464285113a3b4f217795e5;a67e30e1-f65c-4781-ab09-a8ac9735c2bd)\n\nBad request:"
|
| 28 |
+
},
|
| 29 |
+
{
|
| 30 |
+
"model": "deepseek-ai/DeepSeek-OCR",
|
| 31 |
+
"format": "image_to_text",
|
| 32 |
+
"question": "OCR/Text extraction",
|
| 33 |
+
"status": "failed",
|
| 34 |
+
"response": null,
|
| 35 |
+
"error": "Task 'image-to-text' not supported for provider 'novita'. Available tasks: ['text-generation', 'conversational', 'text-to-video']"
|
| 36 |
+
},
|
| 37 |
+
{
|
| 38 |
+
"model": "baidu/ERNIE-4.5-VL-424B-A47B-Base-PT",
|
| 39 |
+
"format": "base64",
|
| 40 |
+
"question": "What is in this image?",
|
| 41 |
+
"status": "success",
|
| 42 |
+
"response": "This image is a solid red color. There are no discernible objects, shapes, or features within it\u2014just a uniform red background.",
|
| 43 |
+
"error": null
|
| 44 |
+
},
|
| 45 |
+
{
|
| 46 |
+
"model": "baidu/ERNIE-4.5-VL-424B-A47B-Base-PT",
|
| 47 |
+
"format": "file_path",
|
| 48 |
+
"question": "What is in this image?",
|
| 49 |
+
"status": "failed",
|
| 50 |
+
"response": null,
|
| 51 |
+
"error": "(Request ID: Root=1-695e9631-56a310713b7db1415df2e897;2f0603e3-267b-469a-9058-6cb75a1b3cf8)\n\nBad request:"
|
| 52 |
+
}
|
| 53 |
+
]
|
| 54 |
+
}
|
output/phase0_vision_validation_20260107_183155.json
ADDED
|
@@ -0,0 +1,63 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"total_tests": 6,
|
| 3 |
+
"successful": 3,
|
| 4 |
+
"failed": 3,
|
| 5 |
+
"working_models": [
|
| 6 |
+
"baidu/ERNIE-4.5-VL-424B-A47B-Base-PT",
|
| 7 |
+
"Qwen/Qwen3-VL-8B-Instruct",
|
| 8 |
+
"CohereLabs/aya-vision-32b"
|
| 9 |
+
],
|
| 10 |
+
"working_formats": [
|
| 11 |
+
"base64"
|
| 12 |
+
],
|
| 13 |
+
"results": [
|
| 14 |
+
{
|
| 15 |
+
"model": "CohereLabs/aya-vision-32b",
|
| 16 |
+
"format": "base64",
|
| 17 |
+
"question": "What is in this image?",
|
| 18 |
+
"status": "success",
|
| 19 |
+
"response": "The image is a solid red square with no additional details or objects within it. The color is vibrant and uniform across the entire square.",
|
| 20 |
+
"error": null
|
| 21 |
+
},
|
| 22 |
+
{
|
| 23 |
+
"model": "CohereLabs/aya-vision-32b",
|
| 24 |
+
"format": "file_path",
|
| 25 |
+
"question": "What is in this image?",
|
| 26 |
+
"status": "failed",
|
| 27 |
+
"response": null,
|
| 28 |
+
"error": "(Request ID: Root=1-695e9884-316ead350578ba0345ae9d34;6929231a-570a-4e2f-8eb2-56c67ee79a9a)\n\nBad request:"
|
| 29 |
+
},
|
| 30 |
+
{
|
| 31 |
+
"model": "Qwen/Qwen3-VL-8B-Instruct",
|
| 32 |
+
"format": "base64",
|
| 33 |
+
"question": "What is in this image?",
|
| 34 |
+
"status": "success",
|
| 35 |
+
"response": "The image contains a solid red background with no other visible elements or details.",
|
| 36 |
+
"error": null
|
| 37 |
+
},
|
| 38 |
+
{
|
| 39 |
+
"model": "Qwen/Qwen3-VL-8B-Instruct",
|
| 40 |
+
"format": "file_path",
|
| 41 |
+
"question": "What is in this image?",
|
| 42 |
+
"status": "failed",
|
| 43 |
+
"response": null,
|
| 44 |
+
"error": "(Request ID: Root=1-695e9885-2c2036d2593274cf4ea4a6d3;e74bcbbd-0bcf-493c-a07a-7c4965d015e5)\n\nBad request:"
|
| 45 |
+
},
|
| 46 |
+
{
|
| 47 |
+
"model": "baidu/ERNIE-4.5-VL-424B-A47B-Base-PT",
|
| 48 |
+
"format": "base64",
|
| 49 |
+
"question": "What is in this image?",
|
| 50 |
+
"status": "success",
|
| 51 |
+
"response": "This image is a solid red color. There are no discernible objects, shapes, or features within it\u2014just a uniform red background.",
|
| 52 |
+
"error": null
|
| 53 |
+
},
|
| 54 |
+
{
|
| 55 |
+
"model": "baidu/ERNIE-4.5-VL-424B-A47B-Base-PT",
|
| 56 |
+
"format": "file_path",
|
| 57 |
+
"question": "What is in this image?",
|
| 58 |
+
"status": "failed",
|
| 59 |
+
"response": null,
|
| 60 |
+
"error": "(Request ID: Root=1-695e988b-2827007a4f9c183643e4b477;b968a082-1630-4891-a654-260a0a1b9120)\n\nBad request:"
|
| 61 |
+
}
|
| 62 |
+
]
|
| 63 |
+
}
|
output/phase0_vision_validation_20260107_184839.json
ADDED
|
@@ -0,0 +1,45 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"total_tests": 4,
|
| 3 |
+
"successful": 1,
|
| 4 |
+
"failed": 3,
|
| 5 |
+
"working_models": [
|
| 6 |
+
"CohereLabs/aya-vision-32b"
|
| 7 |
+
],
|
| 8 |
+
"working_formats": [
|
| 9 |
+
"base64"
|
| 10 |
+
],
|
| 11 |
+
"results": [
|
| 12 |
+
{
|
| 13 |
+
"model": "CohereLabs/aya-vision-32b",
|
| 14 |
+
"format": "base64",
|
| 15 |
+
"question": "What is in this image?",
|
| 16 |
+
"status": "success",
|
| 17 |
+
"response": "The image depicts a serene workspace setup on a wooden desk. The desk is positioned near a window, allowing natural light to illuminate the scene. On the desk, there is a white ceramic mug filled with dark liquid, likely coffee, placed to the left of a silver laptop. The laptop is open, revealing its keyboard and trackpad. To the right of the laptop, there is a rolled-up piece of paper secured with a rubber band, a pen, and a smartphone. The arrangement suggests a productive environment, with tools for both digital and analog work at hand. The overall ambiance is calm and conducive to focused work or study.",
|
| 18 |
+
"error": null
|
| 19 |
+
},
|
| 20 |
+
{
|
| 21 |
+
"model": "CohereLabs/aya-vision-32b",
|
| 22 |
+
"format": "file_path",
|
| 23 |
+
"question": "What is in this image?",
|
| 24 |
+
"status": "failed",
|
| 25 |
+
"response": null,
|
| 26 |
+
"error": "(Request ID: Root=1-695e9b86-6ebf64cb17bc45654337e8dc;85a49805-27d5-42c5-bb92-d9fc1542e6e4)\n\nBad request:"
|
| 27 |
+
},
|
| 28 |
+
{
|
| 29 |
+
"model": "Qwen/Qwen3-VL-8B-Instruct",
|
| 30 |
+
"format": "base64",
|
| 31 |
+
"question": "What is in this image?",
|
| 32 |
+
"status": "failed",
|
| 33 |
+
"response": null,
|
| 34 |
+
"error": "Server error '504 Gateway Time-out' for url 'https://router.huggingface.co/v1/chat/completions'\nFor more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/504"
|
| 35 |
+
},
|
| 36 |
+
{
|
| 37 |
+
"model": "baidu/ERNIE-4.5-VL-424B-A47B-Base-PT",
|
| 38 |
+
"format": "base64",
|
| 39 |
+
"question": "What is in this image?",
|
| 40 |
+
"status": "failed",
|
| 41 |
+
"response": null,
|
| 42 |
+
"error": "Server error '504 Gateway Time-out' for url 'https://router.huggingface.co/v1/chat/completions'\nFor more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/504"
|
| 43 |
+
}
|
| 44 |
+
]
|
| 45 |
+
}
|
output/phase0_vision_validation_20260111_162124.json
ADDED
|
@@ -0,0 +1,86 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"total_tests": 9,
|
| 3 |
+
"successful": 2,
|
| 4 |
+
"failed": 7,
|
| 5 |
+
"working_models": [
|
| 6 |
+
"CohereLabs/aya-vision-32b",
|
| 7 |
+
"Qwen/Qwen3-VL-30B-A3B-Instruct:novita"
|
| 8 |
+
],
|
| 9 |
+
"working_formats": [
|
| 10 |
+
"base64"
|
| 11 |
+
],
|
| 12 |
+
"results": [
|
| 13 |
+
{
|
| 14 |
+
"model": "zai-org/GLM-4.7:cerebras",
|
| 15 |
+
"format": "base64",
|
| 16 |
+
"question": "What is in this image?",
|
| 17 |
+
"status": "failed",
|
| 18 |
+
"response": null,
|
| 19 |
+
"error": "Client error '422 Unprocessable Entity' for url 'https://router.huggingface.co/v1/chat/completions' (Request ID: Root=1-6963bee6-07aa59e62ab80f481dbbdb81;150f4278-75e3-402e-8d96-7a2fe5e3185e)\nFor more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/422\n{\"message\":\"Content type 'image_url' is not supported by selected model. Only 'text' content type can be used.\",\"type\":\"invalid_request_error\",\"param\":\"prompt\",\"code\":\"wrong_api_format\"}\n"
|
| 20 |
+
},
|
| 21 |
+
{
|
| 22 |
+
"model": "openai/gpt-oss-120b:novita",
|
| 23 |
+
"format": "base64",
|
| 24 |
+
"question": "What is in this image?",
|
| 25 |
+
"status": "failed",
|
| 26 |
+
"response": null,
|
| 27 |
+
"error": "(Request ID: Root=1-6963bee7-4988ea004e9a7a78658d68f7;3eb63dde-d2d0-4311-bf87-971e2be945f1)\n\nBad request:"
|
| 28 |
+
},
|
| 29 |
+
{
|
| 30 |
+
"model": "moonshotai/Kimi-K2-Instruct-0905:novita",
|
| 31 |
+
"format": "base64",
|
| 32 |
+
"question": "What is in this image?",
|
| 33 |
+
"status": "failed",
|
| 34 |
+
"response": null,
|
| 35 |
+
"error": "(Request ID: Root=1-6963bee9-1a34b45108bede994967f991;e8dcc3b5-78ae-479b-9194-5d36c4904c84)\n\nBad request:"
|
| 36 |
+
},
|
| 37 |
+
{
|
| 38 |
+
"model": "Qwen/Qwen3-VL-30B-A3B-Instruct:novita",
|
| 39 |
+
"format": "base64",
|
| 40 |
+
"question": "What is in this image?",
|
| 41 |
+
"status": "success",
|
| 42 |
+
"response": "Based on the image provided, here is a detailed description of what is present:\n\nThe image displays a work or study setup on a wooden desk. The scene is composed of several common items arranged in a way that suggests a focused work environment.\n\n- **Laptop:** On the left side of the frame, there is a silver laptop, likely a MacBook, with its screen open but turned away from the camera.\n- **Coffee Mug:** In the center of the desk, there is a white ceramic mug filled with black coffee.\n- **Notepad and Pen:** To the right of the mug, there is a small notepad with handwritten notes on it. A pen is resting on top of the notepad.\n- **Smartphone:** Further to the right, a black smartphone lies flat on the desk with its screen off.\n- **Background:** The desk is positioned next to a window with a dark frame. Behind the desk, there is a gray cinder block wall. The lighting appears to be natural light coming from the window, creating a soft, ambient glow.",
|
| 43 |
+
"error": null
|
| 44 |
+
},
|
| 45 |
+
{
|
| 46 |
+
"model": "Qwen/Qwen3-VL-30B-A3B-Instruct:novita",
|
| 47 |
+
"format": "file_path",
|
| 48 |
+
"question": "What is in this image?",
|
| 49 |
+
"status": "failed",
|
| 50 |
+
"response": null,
|
| 51 |
+
"error": "(Request ID: Root=1-6963bef8-0b95400506a54e322453beda;06443217-384c-4bd9-a080-dbc5acada0de)\n\nBad request:"
|
| 52 |
+
},
|
| 53 |
+
{
|
| 54 |
+
"model": "CohereLabs/aya-vision-32b",
|
| 55 |
+
"format": "base64",
|
| 56 |
+
"question": "What is in this image?",
|
| 57 |
+
"status": "success",
|
| 58 |
+
"response": "The image depicts a serene workspace setup on a wooden desk. The desk is positioned near a window, allowing natural light to illuminate the scene. On the desk, there is a sleek, silver laptop with its lid open, revealing a black keyboard and trackpad. To the right of the laptop, there is a white ceramic mug filled with a dark liquid, presumably coffee or tea, and a black ceramic mug placed upside down. Next to the mugs, there is a rolled-up piece of paper with handwritten notes, secured with a black pen. A black smartphone lies next to the paper, and a white notebook is placed slightly further away. The overall atmosphere suggests a calm and organized environment conducive to work or study.",
|
| 59 |
+
"error": null
|
| 60 |
+
},
|
| 61 |
+
{
|
| 62 |
+
"model": "CohereLabs/aya-vision-32b",
|
| 63 |
+
"format": "file_path",
|
| 64 |
+
"question": "What is in this image?",
|
| 65 |
+
"status": "failed",
|
| 66 |
+
"response": null,
|
| 67 |
+
"error": "(Request ID: Root=1-6963bf02-1513fc57216cd6a30267e34c;5fb8879e-b4d3-47ef-8503-07ecfc439c3e)\n\nBad request:"
|
| 68 |
+
},
|
| 69 |
+
{
|
| 70 |
+
"model": "Qwen/Qwen3-VL-8B-Instruct",
|
| 71 |
+
"format": "base64",
|
| 72 |
+
"question": "What is in this image?",
|
| 73 |
+
"status": "failed",
|
| 74 |
+
"response": null,
|
| 75 |
+
"error": "Server error '504 Gateway Time-out' for url 'https://router.huggingface.co/v1/chat/completions'\nFor more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/504"
|
| 76 |
+
},
|
| 77 |
+
{
|
| 78 |
+
"model": "baidu/ERNIE-4.5-VL-424B-A47B-Base-PT",
|
| 79 |
+
"format": "base64",
|
| 80 |
+
"question": "What is in this image?",
|
| 81 |
+
"status": "failed",
|
| 82 |
+
"response": null,
|
| 83 |
+
"error": "Server error '504 Gateway Time-out' for url 'https://router.huggingface.co/v1/chat/completions'\nFor more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/504"
|
| 84 |
+
}
|
| 85 |
+
]
|
| 86 |
+
}
|
output/phase0_vision_validation_20260111_163647.json
ADDED
|
@@ -0,0 +1,17 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"total_tests": 1,
|
| 3 |
+
"successful": 0,
|
| 4 |
+
"failed": 1,
|
| 5 |
+
"working_models": [],
|
| 6 |
+
"working_formats": [],
|
| 7 |
+
"results": [
|
| 8 |
+
{
|
| 9 |
+
"model": "openai/gpt-oss-120b:groq",
|
| 10 |
+
"format": "base64",
|
| 11 |
+
"question": "What is in this image?",
|
| 12 |
+
"status": "failed",
|
| 13 |
+
"response": null,
|
| 14 |
+
"error": "(Request ID: req_01kepv7rhff3gs2gy852xqyvbj)\n\nBad request:\n{'message': 'messages[0].content must be a string', 'type': 'invalid_request_error', 'param': 'messages[0].content'}"
|
| 15 |
+
}
|
| 16 |
+
]
|
| 17 |
+
}
|
output/phase0_vision_validation_20260111_164531.json
ADDED
|
@@ -0,0 +1,29 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"total_tests": 2,
|
| 3 |
+
"successful": 1,
|
| 4 |
+
"failed": 1,
|
| 5 |
+
"working_models": [
|
| 6 |
+
"zai-org/GLM-4.6V-Flash:zai-org"
|
| 7 |
+
],
|
| 8 |
+
"working_formats": [
|
| 9 |
+
"base64"
|
| 10 |
+
],
|
| 11 |
+
"results": [
|
| 12 |
+
{
|
| 13 |
+
"model": "zai-org/GLM-4.6V-Flash:zai-org",
|
| 14 |
+
"format": "base64",
|
| 15 |
+
"question": "What is in this image?",
|
| 16 |
+
"status": "success",
|
| 17 |
+
"response": "\nThe image shows a wooden desk with several items: a partially open laptop (with a white keyboard visible) on the left, a white mug filled with black coffee next to the laptop, a rolled notepad with a pen resting on it, a black smartphone lying flat on the desk, and a window with light coming through (and a dark gray brick wall in the background).",
|
| 18 |
+
"error": null
|
| 19 |
+
},
|
| 20 |
+
{
|
| 21 |
+
"model": "zai-org/GLM-4.6V-Flash:zai-org",
|
| 22 |
+
"format": "file_path",
|
| 23 |
+
"question": "What is in this image?",
|
| 24 |
+
"status": "failed",
|
| 25 |
+
"response": null,
|
| 26 |
+
"error": "(Request ID: Root=1-6963c59b-22bf0ca92d51cac41557d483;7707ded2-a6e5-43c4-b381-a65dbfdbc3b8)\n\nBad request:\n{'code': '1210', 'message': '\u56fe\u7247\u8f93\u5165\u683c\u5f0f/\u89e3\u6790\u9519\u8bef'}"
|
| 27 |
+
}
|
| 28 |
+
]
|
| 29 |
+
}
|
output/phase0_vision_validation_20260111_164945.json
ADDED
|
@@ -0,0 +1,29 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"total_tests": 2,
|
| 3 |
+
"successful": 1,
|
| 4 |
+
"failed": 1,
|
| 5 |
+
"working_models": [
|
| 6 |
+
"google/gemma-3-27b-it:scaleway"
|
| 7 |
+
],
|
| 8 |
+
"working_formats": [
|
| 9 |
+
"base64"
|
| 10 |
+
],
|
| 11 |
+
"results": [
|
| 12 |
+
{
|
| 13 |
+
"model": "google/gemma-3-27b-it:scaleway",
|
| 14 |
+
"format": "base64",
|
| 15 |
+
"question": "What is in this image?",
|
| 16 |
+
"status": "success",
|
| 17 |
+
"response": "Here's a breakdown of what's in the image:\n\n* **Laptop:** A silver laptop is open on a wooden desk.\n* **Coffee Mug:** A white coffee mug filled with a dark liquid (likely coffee) sits on the desk.\n* **Notebook/Paper Roll:** There's a small roll of paper and a notepad with handwritten notes next to the mug.\n* **Pen:** A pen is lying on top of the notepad.\n* **Smartphone:** A black smartphone is also on the desk.\n* **Desk:** All the items are arranged on a warm-toned wooden desk. \n* **Window:** A window is partially visible in the background, with a gray brick wall next to it.\n\nThe overall scene suggests a workspace, possibly for a writer or someone working remotely.",
|
| 18 |
+
"error": null
|
| 19 |
+
},
|
| 20 |
+
{
|
| 21 |
+
"model": "google/gemma-3-27b-it:scaleway",
|
| 22 |
+
"format": "file_path",
|
| 23 |
+
"question": "What is in this image?",
|
| 24 |
+
"status": "failed",
|
| 25 |
+
"response": null,
|
| 26 |
+
"error": "(Request ID: e07b229c-af92-4701-bf1b-729eaf165c48)\n\nBad request:"
|
| 27 |
+
}
|
| 28 |
+
]
|
| 29 |
+
}
|
test/fixtures/{test_image.jpg → test_image_red_square.jpg}
RENAMED
|
File without changes
|
test/test_phase0_hf_vision_api.py
ADDED
|
@@ -0,0 +1,378 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Phase 0: HuggingFace Inference API Vision Validation
|
| 4 |
+
Author: @mangobee
|
| 5 |
+
Date: 2026-01-07
|
| 6 |
+
|
| 7 |
+
Tests HF Inference API with vision models to validate multimodal support BEFORE
|
| 8 |
+
implementation. Decision gate: Only proceed to Phase 1 if ≥1 model works.
|
| 9 |
+
|
| 10 |
+
Models to test (smallest → largest):
|
| 11 |
+
1. microsoft/Phi-3.5-vision-instruct (3.8B)
|
| 12 |
+
2. meta-llama/Llama-3.2-11B-Vision-Instruct (11B)
|
| 13 |
+
3. Qwen/Qwen2-VL-72B-Instruct (72B)
|
| 14 |
+
"""
|
| 15 |
+
|
| 16 |
+
import os
|
| 17 |
+
import base64
|
| 18 |
+
import logging
|
| 19 |
+
from pathlib import Path
|
| 20 |
+
from typing import Dict, Any, Optional
|
| 21 |
+
from huggingface_hub import InferenceClient
|
| 22 |
+
|
| 23 |
+
# Load environment variables from .env file
|
| 24 |
+
from dotenv import load_dotenv
|
| 25 |
+
load_dotenv()
|
| 26 |
+
|
| 27 |
+
# ============================================================================
|
| 28 |
+
# CONFIG
|
| 29 |
+
# ============================================================================
|
| 30 |
+
|
| 31 |
+
HF_TOKEN = os.getenv("HF_TOKEN")
|
| 32 |
+
TEST_IMAGE_PATH = "test/fixtures/test_image_real.png" # Real image for better testing
|
| 33 |
+
|
| 34 |
+
# Models to test (user specified with provider routing)
|
| 35 |
+
VISION_MODELS = [
|
| 36 |
+
"google/gemma-3-27b-it:scaleway",
|
| 37 |
+
]
|
| 38 |
+
|
| 39 |
+
# Test questions (progressive complexity)
|
| 40 |
+
TEST_QUESTIONS = [
|
| 41 |
+
"What is in this image?",
|
| 42 |
+
"Describe the image in detail.",
|
| 43 |
+
"What colors do you see?",
|
| 44 |
+
]
|
| 45 |
+
|
| 46 |
+
# Logging setup
|
| 47 |
+
logging.basicConfig(
|
| 48 |
+
level=logging.INFO,
|
| 49 |
+
format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
|
| 50 |
+
)
|
| 51 |
+
logger = logging.getLogger(__name__)
|
| 52 |
+
|
| 53 |
+
# ============================================================================
|
| 54 |
+
# Helper Functions
|
| 55 |
+
# ============================================================================
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
def encode_image_to_base64(image_path: str) -> str:
|
| 59 |
+
"""Encode image file to base64 string."""
|
| 60 |
+
with open(image_path, "rb") as f:
|
| 61 |
+
return base64.b64encode(f.read()).decode("utf-8")
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
def get_test_image() -> str:
|
| 65 |
+
"""Get test image path, verify it exists."""
|
| 66 |
+
path = Path(TEST_IMAGE_PATH)
|
| 67 |
+
if not path.exists():
|
| 68 |
+
raise FileNotFoundError(f"Test image not found: {TEST_IMAGE_PATH}")
|
| 69 |
+
return TEST_IMAGE_PATH
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+
# ============================================================================
|
| 73 |
+
# Test Functions
|
| 74 |
+
# ============================================================================
|
| 75 |
+
|
| 76 |
+
|
| 77 |
+
def test_vision_model_with_base64(model: str, image_b64: str, question: str) -> Dict[str, Any]:
|
| 78 |
+
"""
|
| 79 |
+
Test HF Inference API with base64-encoded image.
|
| 80 |
+
|
| 81 |
+
Args:
|
| 82 |
+
model: Model name (e.g., "microsoft/Phi-3.5-vision-instruct")
|
| 83 |
+
image_b64: Base64-encoded image string
|
| 84 |
+
question: Question to ask about the image
|
| 85 |
+
|
| 86 |
+
Returns:
|
| 87 |
+
dict: Test result with status, response, error
|
| 88 |
+
"""
|
| 89 |
+
result = {
|
| 90 |
+
"model": model,
|
| 91 |
+
"format": "base64",
|
| 92 |
+
"question": question,
|
| 93 |
+
"status": "unknown",
|
| 94 |
+
"response": None,
|
| 95 |
+
"error": None,
|
| 96 |
+
}
|
| 97 |
+
|
| 98 |
+
try:
|
| 99 |
+
client = InferenceClient(token=HF_TOKEN)
|
| 100 |
+
|
| 101 |
+
# Try chat_completion with image content
|
| 102 |
+
messages = [
|
| 103 |
+
{
|
| 104 |
+
"role": "user",
|
| 105 |
+
"content": [
|
| 106 |
+
{"type": "text", "text": question},
|
| 107 |
+
{
|
| 108 |
+
"type": "image_url",
|
| 109 |
+
"image_url": {
|
| 110 |
+
"url": f"data:image/jpeg;base64,{image_b64}"
|
| 111 |
+
}
|
| 112 |
+
}
|
| 113 |
+
]
|
| 114 |
+
}
|
| 115 |
+
]
|
| 116 |
+
|
| 117 |
+
response = client.chat_completion(
|
| 118 |
+
model=model,
|
| 119 |
+
messages=messages,
|
| 120 |
+
max_tokens=500,
|
| 121 |
+
)
|
| 122 |
+
|
| 123 |
+
result["status"] = "success"
|
| 124 |
+
result["response"] = response.choices[0].message.content
|
| 125 |
+
logger.info(f"✓ {model} (base64): Success")
|
| 126 |
+
|
| 127 |
+
except Exception as e:
|
| 128 |
+
result["status"] = "failed"
|
| 129 |
+
result["error"] = str(e)
|
| 130 |
+
logger.error(f"✗ {model} (base64): {e}")
|
| 131 |
+
|
| 132 |
+
return result
|
| 133 |
+
|
| 134 |
+
|
| 135 |
+
def test_vision_model_with_url(model: str, image_path: str, question: str) -> Dict[str, Any]:
|
| 136 |
+
"""
|
| 137 |
+
Test HF Inference API with local file path (converted to URL).
|
| 138 |
+
|
| 139 |
+
Args:
|
| 140 |
+
model: Model name
|
| 141 |
+
image_path: Path to local image file
|
| 142 |
+
question: Question to ask
|
| 143 |
+
|
| 144 |
+
Returns:
|
| 145 |
+
dict: Test result
|
| 146 |
+
"""
|
| 147 |
+
result = {
|
| 148 |
+
"model": model,
|
| 149 |
+
"format": "file_path",
|
| 150 |
+
"question": question,
|
| 151 |
+
"status": "unknown",
|
| 152 |
+
"response": None,
|
| 153 |
+
"error": None,
|
| 154 |
+
}
|
| 155 |
+
|
| 156 |
+
try:
|
| 157 |
+
client = InferenceClient(token=HF_TOKEN)
|
| 158 |
+
|
| 159 |
+
# Try with file:// URL
|
| 160 |
+
file_url = f"file://{Path(image_path).absolute()}"
|
| 161 |
+
|
| 162 |
+
messages = [
|
| 163 |
+
{
|
| 164 |
+
"role": "user",
|
| 165 |
+
"content": [
|
| 166 |
+
{"type": "text", "text": question},
|
| 167 |
+
{
|
| 168 |
+
"type": "image_url",
|
| 169 |
+
"image_url": {"url": file_url}
|
| 170 |
+
}
|
| 171 |
+
]
|
| 172 |
+
}
|
| 173 |
+
]
|
| 174 |
+
|
| 175 |
+
response = client.chat_completion(
|
| 176 |
+
model=model,
|
| 177 |
+
messages=messages,
|
| 178 |
+
max_tokens=500,
|
| 179 |
+
)
|
| 180 |
+
|
| 181 |
+
result["status"] = "success"
|
| 182 |
+
result["response"] = response.choices[0].message.content
|
| 183 |
+
logger.info(f"✓ {model} (file_path): Success")
|
| 184 |
+
|
| 185 |
+
except Exception as e:
|
| 186 |
+
result["status"] = "failed"
|
| 187 |
+
result["error"] = str(e)
|
| 188 |
+
logger.error(f"✗ {model} (file_path): {e}")
|
| 189 |
+
|
| 190 |
+
return result
|
| 191 |
+
|
| 192 |
+
|
| 193 |
+
def test_ocr_model(model: str, image_path: str) -> Dict[str, Any]:
|
| 194 |
+
"""
|
| 195 |
+
Test OCR model using image-to-text approach (not chat completion).
|
| 196 |
+
|
| 197 |
+
For models like DeepSeek-OCR that are image-to-text, not chat models.
|
| 198 |
+
|
| 199 |
+
Args:
|
| 200 |
+
model: Model name
|
| 201 |
+
image_path: Path to local image file
|
| 202 |
+
|
| 203 |
+
Returns:
|
| 204 |
+
dict: Test result
|
| 205 |
+
"""
|
| 206 |
+
result = {
|
| 207 |
+
"model": model,
|
| 208 |
+
"format": "image_to_text",
|
| 209 |
+
"question": "OCR/Text extraction",
|
| 210 |
+
"status": "unknown",
|
| 211 |
+
"response": None,
|
| 212 |
+
"error": None,
|
| 213 |
+
}
|
| 214 |
+
|
| 215 |
+
try:
|
| 216 |
+
client = InferenceClient(model=model, token=HF_TOKEN)
|
| 217 |
+
|
| 218 |
+
# Try image-to-text endpoint
|
| 219 |
+
with open(image_path, "rb") as f:
|
| 220 |
+
image_data = f.read()
|
| 221 |
+
|
| 222 |
+
response = client.image_to_text(image=image_data)
|
| 223 |
+
|
| 224 |
+
result["status"] = "success"
|
| 225 |
+
result["response"] = str(response)
|
| 226 |
+
logger.info(f"✓ {model} (image_to_text): Success")
|
| 227 |
+
|
| 228 |
+
except Exception as e:
|
| 229 |
+
result["status"] = "failed"
|
| 230 |
+
result["error"] = str(e)
|
| 231 |
+
logger.error(f"✗ {model} (image_to_text): {e}")
|
| 232 |
+
|
| 233 |
+
return result
|
| 234 |
+
|
| 235 |
+
|
| 236 |
+
# ============================================================================
|
| 237 |
+
# Main Test Execution
|
| 238 |
+
# ============================================================================
|
| 239 |
+
|
| 240 |
+
|
| 241 |
+
def run_phase0_validation() -> Dict[str, Any]:
|
| 242 |
+
"""
|
| 243 |
+
Run Phase 0 validation: Test all models with all formats.
|
| 244 |
+
|
| 245 |
+
Returns:
|
| 246 |
+
dict: Summary of all test results
|
| 247 |
+
"""
|
| 248 |
+
if not HF_TOKEN:
|
| 249 |
+
raise ValueError("HF_TOKEN environment variable not set")
|
| 250 |
+
|
| 251 |
+
# Get test image
|
| 252 |
+
image_path = get_test_image()
|
| 253 |
+
image_b64 = encode_image_to_base64(image_path)
|
| 254 |
+
|
| 255 |
+
logger.info(f"Test image: {image_path}")
|
| 256 |
+
logger.info(f"Image size: {len(image_b64)} chars (base64)")
|
| 257 |
+
logger.info(f"Testing {len(VISION_MODELS)} models with 3 formats each")
|
| 258 |
+
logger.info("=" * 60)
|
| 259 |
+
|
| 260 |
+
all_results = []
|
| 261 |
+
|
| 262 |
+
# Test each model
|
| 263 |
+
for model in VISION_MODELS:
|
| 264 |
+
logger.info(f"\nTesting model: {model}")
|
| 265 |
+
logger.info("-" * 60)
|
| 266 |
+
|
| 267 |
+
model_results = []
|
| 268 |
+
|
| 269 |
+
# Check if this is an OCR model (contains "OCR" in name)
|
| 270 |
+
is_ocr_model = "ocr" in model.lower()
|
| 271 |
+
|
| 272 |
+
if is_ocr_model:
|
| 273 |
+
# Test with image_to_text endpoint for OCR models
|
| 274 |
+
result = test_ocr_model(model, image_path)
|
| 275 |
+
model_results.append(result)
|
| 276 |
+
else:
|
| 277 |
+
# Test with base64 (most likely to work for chat models)
|
| 278 |
+
for question in TEST_QUESTIONS[:1]: # Just 1 question for speed
|
| 279 |
+
result = test_vision_model_with_base64(model, image_b64, question)
|
| 280 |
+
model_results.append(result)
|
| 281 |
+
|
| 282 |
+
# If base64 works, test other formats
|
| 283 |
+
if result["status"] == "success":
|
| 284 |
+
# Test file path
|
| 285 |
+
result_fp = test_vision_model_with_url(model, image_path, question)
|
| 286 |
+
model_results.append(result_fp)
|
| 287 |
+
|
| 288 |
+
# Don't test other questions if first worked
|
| 289 |
+
break
|
| 290 |
+
|
| 291 |
+
all_results.extend(model_results)
|
| 292 |
+
|
| 293 |
+
# Compile summary
|
| 294 |
+
summary = {
|
| 295 |
+
"total_tests": len(all_results),
|
| 296 |
+
"successful": sum(1 for r in all_results if r["status"] == "success"),
|
| 297 |
+
"failed": sum(1 for r in all_results if r["status"] == "failed"),
|
| 298 |
+
"working_models": list(set(r["model"] for r in all_results if r["status"] == "success")),
|
| 299 |
+
"working_formats": list(set(r["format"] for r in all_results if r["status"] == "success")),
|
| 300 |
+
"results": all_results,
|
| 301 |
+
}
|
| 302 |
+
|
| 303 |
+
return summary
|
| 304 |
+
|
| 305 |
+
|
| 306 |
+
def print_summary(summary: Dict[str, Any]) -> None:
|
| 307 |
+
"""Print test summary and decision gate."""
|
| 308 |
+
logger.info("\n" + "=" * 60)
|
| 309 |
+
logger.info("PHASE 0 VALIDATION SUMMARY")
|
| 310 |
+
logger.info("=" * 60)
|
| 311 |
+
|
| 312 |
+
logger.info(f"\nTotal tests: {summary['total_tests']}")
|
| 313 |
+
logger.info(f"✓ Successful: {summary['successful']}")
|
| 314 |
+
logger.info(f"✗ Failed: {summary['failed']}")
|
| 315 |
+
|
| 316 |
+
logger.info(f"\nWorking models: {summary['working_models']}")
|
| 317 |
+
logger.info(f"Working formats: {summary['working_formats']}")
|
| 318 |
+
|
| 319 |
+
# Decision gate
|
| 320 |
+
logger.info("\n" + "=" * 60)
|
| 321 |
+
logger.info("DECISION GATE")
|
| 322 |
+
logger.info("=" * 60)
|
| 323 |
+
|
| 324 |
+
if summary['successful'] > 0:
|
| 325 |
+
logger.info("\n✅ GO - Proceed to Phase 1 (Implementation)")
|
| 326 |
+
logger.info(f"Recommended model: {summary['working_models'][0]} (smallest working)")
|
| 327 |
+
logger.info(f"Use format: {summary['working_formats'][0]}")
|
| 328 |
+
else:
|
| 329 |
+
logger.info("\n❌ NO-GO - Pivot to backup options")
|
| 330 |
+
logger.info("Backup options:")
|
| 331 |
+
logger.info(" - Option C: HF Spaces deployment (custom endpoint)")
|
| 332 |
+
logger.info(" - Option D: Local transformers library (no API)")
|
| 333 |
+
logger.info(" - Option E: Hybrid (HF text + Gemini/Claude vision only)")
|
| 334 |
+
|
| 335 |
+
# Print detailed results
|
| 336 |
+
logger.info("\n" + "=" * 60)
|
| 337 |
+
logger.info("DETAILED RESULTS")
|
| 338 |
+
logger.info("=" * 60)
|
| 339 |
+
|
| 340 |
+
for result in summary['results']:
|
| 341 |
+
logger.info(f"\nModel: {result['model']}")
|
| 342 |
+
logger.info(f"Format: {result['format']}")
|
| 343 |
+
logger.info(f"Status: {result['status']}")
|
| 344 |
+
if result['error']:
|
| 345 |
+
logger.info(f"Error: {result['error']}")
|
| 346 |
+
if result['response']:
|
| 347 |
+
logger.info(f"Response: {result['response'][:200]}...")
|
| 348 |
+
|
| 349 |
+
|
| 350 |
+
if __name__ == "__main__":
|
| 351 |
+
print("\n" + "=" * 60)
|
| 352 |
+
print("PHASE 0: HF INFERENCE API VISION VALIDATION")
|
| 353 |
+
print("=" * 60)
|
| 354 |
+
print(f"HF Token: {'Set' if HF_TOKEN else 'NOT SET'}")
|
| 355 |
+
print(f"Test image: {TEST_IMAGE_PATH}")
|
| 356 |
+
print("=" * 60 + "\n")
|
| 357 |
+
|
| 358 |
+
try:
|
| 359 |
+
summary = run_phase0_validation()
|
| 360 |
+
print_summary(summary)
|
| 361 |
+
|
| 362 |
+
# Export results for documentation
|
| 363 |
+
import json
|
| 364 |
+
from datetime import datetime
|
| 365 |
+
|
| 366 |
+
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
| 367 |
+
output_dir = Path("output")
|
| 368 |
+
output_dir.mkdir(exist_ok=True)
|
| 369 |
+
|
| 370 |
+
output_file = output_dir / f"phase0_vision_validation_{timestamp}.json"
|
| 371 |
+
with open(output_file, "w") as f:
|
| 372 |
+
json.dump(summary, f, indent=2)
|
| 373 |
+
|
| 374 |
+
logger.info(f"\n✓ Results exported to: {output_file}")
|
| 375 |
+
|
| 376 |
+
except Exception as e:
|
| 377 |
+
logger.error(f"\nPhase 0 validation failed: {e}")
|
| 378 |
+
raise
|