Phase 0-2: HF Vision Integration Complete - Google Gemma 3 Selected

Phase 0 (Extended Testing):
- Tested 4 additional user-requested models
- Validated: google/gemma-3-27b-it:scaleway (6s, RECOMMENDED)
- Validated: zai-org/GLM-4.6V-Flash:zai-org (16s)
- Validated: Qwen/Qwen3-VL-30B-A3B-Instruct:novita (14s)
- Failed: GLM-4.7, gpt-oss-120b (text-only models)

Phase 1 (Implementation):
- Added analyze_image_hf() in src/tools/vision.py
- Fixed analyze_image() routing to respect LLM_PROVIDER
- Added HF_TOKEN, HF_VISION_MODEL to settings
- Each provider fails independently (no fallback chains)

Phase 2 (Smoke Tests):
- Created test/test_smoke_hf_vision.py
- Smoke test PASSED: red square image correctly identified
- Fixed Settings.hf_token integration
- Removed unsupported timeout parameter

Modified Files:
- src/tools/vision.py: +120 lines (HF vision function + routing fix)
- src/config/settings.py: +5 lines (HF config)
- .env.example: +7 lines (HF_TOKEN, HF_VISION_MODEL docs)
- CHANGELOG.md: +93 lines (Phase 0-2 documentation)
- PLAN.md: +20 lines (Phase 0 results, Phase 1 updates)
- test/test_smoke_hf_vision.py: NEW (smoke test script)

Co-Authored-By: Claude <noreply@anthropic.com>

Files changed:
- .env.example +7 -0
- CHANGELOG.md +93 -0
- PLAN.md +16 -4
- src/config/settings.py +5 -0
- src/tools/vision.py +142 -17
- test/test_smoke_hf_vision.py +63 -0
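
For orientation before the per-file diffs, the end-to-end flow added here can be exercised as follows. A minimal sketch mirroring test/test_smoke_hf_vision.py (it assumes HF_TOKEN is set in the environment; the fixture path and question are the ones the smoke test uses):

```python
import os

# Select the HuggingFace provider up front, as the smoke test does
os.environ["LLM_PROVIDER"] = "huggingface"

from src.tools.vision import analyze_image

# Fixture path and question taken from test/test_smoke_hf_vision.py
result = analyze_image("test/fixtures/test_image_red_square.jpg", "What is in this image?")
print(result["model"], "->", result["answer"][:120])
```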
.env.example
@@ -15,6 +15,13 @@ ANTHROPIC_API_KEY=your_anthropic_api_key_here
 # Free baseline alternative: Gemini 2.0 Flash
 GOOGLE_API_KEY=your_google_api_key_here
 
+# HuggingFace Inference API (for vision and text models)
+HF_TOKEN=your_huggingface_token_here
+
+# HuggingFace Vision Model (validated from Phase 0)
+# Options: google/gemma-3-27b-it:scaleway (recommended), CohereLabs/aya-vision-32b
+HF_VISION_MODEL=google/gemma-3-27b-it:scaleway
+
 # ============================================================================
 # Tool API Keys (Level 5 - Component Selection)
 # ============================================================================
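
Once HF_TOKEN is filled in, the credential can be sanity-checked before any vision call. A minimal sketch, assuming huggingface_hub is installed (the printed field depends on the token's account):

```python
import os
from huggingface_hub import whoami

# Raises for a missing or invalid token; returns account metadata otherwise
info = whoami(token=os.getenv("HF_TOKEN"))
print(f"Authenticated as: {info.get('name')}")
```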
CHANGELOG.md
@@ -1,5 +1,98 @@
 # Session Changelog
 
+## [2026-01-11] [Phase 2: Smoke Tests] [COMPLETED] HF Vision Validated - Ready for GAIA
+
+**Problem:** Need to validate HF vision works before complex GAIA evaluation.
+
+**Solution:** Single smoke test with simple red square image.
+
+**Result:** ✅ PASSED
+- Model: `google/gemma-3-27b-it:scaleway`
+- Answer: "The image is a solid, uniform field of red color..."
+- Provider routing: Working correctly
+- Settings integration: Fixed
+
+**Modified Files:**
+- **src/config/settings.py** (~5 lines added)
+  - Added `HF_TOKEN` and `HF_VISION_MODEL` config
+  - Added `hf_token` and `hf_vision_model` to Settings class
+  - Updated `validate_api_keys()` to include huggingface
+- **test/test_smoke_hf_vision.py** (NEW - ~50 lines)
+  - Simple smoke test script
+  - Tests basic image description
+
+**Bug Fixes:**
+- Removed unsupported `timeout` parameter from `chat_completion()`
+
+**Next Steps:** Phase 3 - GAIA evaluation with HF vision
+
+---
+
+## [2026-01-11] [Phase 1: Implementation] [COMPLETED] HF Vision Integration - Routing Fixed
+
+**Problem:** Vision tool hardcoded to Gemini → Claude, ignoring UI LLM selection.
+
+**Solution:**
+- Added `analyze_image_hf()` function using `google/gemma-3-27b-it:scaleway` (fastest, ~6s)
+- Fixed `analyze_image()` routing to respect `LLM_PROVIDER` environment variable
+- Each provider fails independently (NO fallback chains during testing)
+
+**Modified Files:**
+- **src/tools/vision.py** (~120 lines added/modified)
+  - Added `analyze_image_hf()` function with retry logic
+  - Updated `analyze_image()` routing with provider selection
+  - Added HF_VISION_MODEL and HF_TIMEOUT config
+- **.env.example** (~4 lines added)
+  - Documented HF_TOKEN and HF_VISION_MODEL settings
+
+**Validated Models (Phase 0 Extended Testing):**
+
+| Rank | Model | Provider | Speed | Notes |
+|------|-------|----------|-------|-------|
+| 1 | `google/gemma-3-27b-it` | Scaleway | ~6s | **RECOMMENDED** - Google brand |
+| 2 | `CohereLabs/aya-vision-32b` | Cohere | ~7s | Fast, less known brand |
+| 3 | `Qwen/Qwen3-VL-30B-A3B-Instruct` | Novita | ~14s | Qwen brand, reputable |
+| 4 | `zai-org/GLM-4.6V-Flash` | zai-org | ~16s | Zhipu AI brand |
+
+**Failed Models (not vision-capable):**
+- `zai-org/GLM-4.7:cerebras` - Text-only (422 error: "Content type 'image_url' not supported")
+- `openai/gpt-oss-120b:novita` - Text-only (400 Bad request)
+- `openai/gpt-oss-120b:groq` - Text-only (400: "content must be a string")
+- `moonshotai/Kimi-K2-Instruct-0905:novita` - 400 Bad request
+
+**Next Steps:** Smoke tests (Phase 2) to validate integration
+
+---
+
+## [2026-01-11] [Phase 0 Extended] [COMPLETED] Additional Vision Models Tested - Google Gemma 3 Selected
+
+**Problem:** Needed to find more reputable vision models (aya-vision-32b brand unknown to user).
+
+**Solution:** Tested user-requested models with provider routing.
+
+**Test Results:**
+
+**Working Models:**
+- `google/gemma-3-27b-it:scaleway` ✅ - ~6s, Google brand, **RECOMMENDED**
+- `zai-org/GLM-4.6V-Flash:zai-org` ✅ - ~16s, Zhipu AI brand
+- `Qwen/Qwen3-VL-30B-A3B-Instruct:novita` ✅ - ~14s, Qwen brand
+
+**Failed Models:**
+- `zai-org/GLM-4.7:cerebras` ❌ - Text-only model (422: "image_url not supported")
+- `openai/gpt-oss-120b:novita` ❌ - Generic 400 Bad request
+- `openai/gpt-oss-120b:groq` ❌ - Text-only (400: "content must be a string")
+- `moonshotai/Kimi-K2-Instruct-0905:novita` ❌ - Generic 400 Bad request
+
+**Output Files:**
+- `output/phase0_vision_validation_20260111_162124.json` - 4 new models test
+- `output/phase0_vision_validation_20260111_163647.json` - Groq provider test
+- `output/phase0_vision_validation_20260111_164531.json` - GLM-4.6V test
+- `output/phase0_vision_validation_20260111_164945.json` - Gemma-3-27B test
+
+**Decision:** Use `google/gemma-3-27b-it:scaleway` for production (fastest, most reputable brand)
+
+---
+
 ## [2026-01-07] [Phase 0: API Validation] [COMPLETED] HF Inference Vision Support - GO Decision
 
 **Problem:** Needed to validate HF Inference API supports vision models before implementation.
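
The Phase 2 bug fix above removed the unsupported `timeout` keyword from the `chat_completion()` call. If a request bound is still wanted, one option is to set it on the client itself rather than per call. A minimal sketch, relying on the `timeout` argument of the `huggingface_hub.InferenceClient` constructor (this is not part of the diffs in this commit):

```python
import os
from huggingface_hub import InferenceClient

# The timeout belongs on the client; chat_completion() itself rejects the keyword
client = InferenceClient(token=os.getenv("HF_TOKEN"), timeout=120)

response = client.chat_completion(
    model=os.getenv("HF_VISION_MODEL", "google/gemma-3-27b-it:scaleway"),
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
    max_tokens=16,
)
print(response.choices[0].message.content)
```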
PLAN.md
@@ -172,7 +172,19 @@ Fix LLM selection routing so UI provider selection propagates to ALL tools (plan
 - **Option D:** Local transformers library (no API)
 - **Option E:** Hybrid (HF text + Gemini/Claude vision only)
 
-**Phase 0 Status:** ✅ COMPLETED -
+**Phase 0 Status:** ✅ COMPLETED - Multiple working models found
+
+**Validated Models (Ranked by Speed):**
+
+| Rank | Model | Provider | Speed | Notes |
+|------|-------|----------|-------|-------|
+| 1 | `google/gemma-3-27b-it` | Scaleway | ~6s | **RECOMMENDED** - Google brand, fastest |
+| 2 | `CohereLabs/aya-vision-32b` | Cohere | ~7s | Fast, less known brand |
+| 3 | `Qwen/Qwen3-VL-30B-A3B-Instruct` | Novita | ~14s | Qwen brand, reputable |
+| 4 | `zai-org/GLM-4.6V-Flash` | zai-org | ~16s | Zhipu AI brand |
+
+**Format:** Base64 encoding only (file:// URLs don't work)
+**Test image:** 2.1MB workspace photo (realistic large image)
 
 ---
 
@@ -182,14 +194,14 @@ Fix LLM selection routing so UI provider selection propagates to ALL tools (plan
 
 **Validated from Phase 0:**
 
-- Model: `
+- Model: `google/gemma-3-27b-it:scaleway` (RECOMMENDED - fastest, Google brand)
 - Format: Base64 encoding in messages array
-- Timeout:
+- Timeout: ~6 seconds for 2.1MB image
 
 #### Step 1.1: Implement `analyze_image_hf()` in vision.py
 
 - [ ] Add function signature matching existing pattern
-- [ ] Use **
+- [ ] Use **google/gemma-3-27b-it:scaleway** (validated, fastest)
 - [ ] Format: Base64 encode images in messages array
 - [ ] Add retry logic with exponential backoff (3 attempts)
 - [ ] Handle API errors with clear error messages
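
The plan's core Phase 0 finding is that images must be sent base64-encoded inside the messages array (file:// URLs are rejected). A minimal sketch of that payload shape, using the smoke-test fixture as an illustrative path; the production code builds the same structure in src/tools/vision.py below:

```python
import base64
from pathlib import Path

image_path = Path("test/fixtures/test_image_red_square.jpg")  # illustrative path
b64 = base64.b64encode(image_path.read_bytes()).decode("utf-8")

# Chat-completions payload: one text part plus one data-URL image part
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
    ],
}]
```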
src/config/settings.py
@@ -21,6 +21,8 @@ load_dotenv()
 # LLM Configuration (Level 5 - Component Selection)
 ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
 GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY", "")
+HF_TOKEN = os.getenv("HF_TOKEN", "")
+HF_VISION_MODEL = os.getenv("HF_VISION_MODEL", "google/gemma-3-27b-it:scaleway")
 DEFAULT_LLM_MODEL: Literal["gemini", "claude"] = os.getenv("DEFAULT_LLM_MODEL", "gemini")  # type: ignore
 
 # Tool API Keys (Level 5 - Component Selection)
@@ -53,6 +55,8 @@ class Settings:
     def __init__(self):
         self.anthropic_api_key = ANTHROPIC_API_KEY
         self.google_api_key = GOOGLE_API_KEY
+        self.hf_token = HF_TOKEN
+        self.hf_vision_model = HF_VISION_MODEL
         self.default_llm_model = DEFAULT_LLM_MODEL
 
         self.exa_api_key = EXA_API_KEY
@@ -77,6 +81,7 @@ class Settings:
         return {
             "anthropic": bool(self.anthropic_api_key),
             "google": bool(self.google_api_key),
+            "huggingface": bool(self.hf_token),
             "exa": bool(self.exa_api_key),
             "tavily": bool(self.tavily_api_key),
         }
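
With the huggingface entry added to `validate_api_keys()`, callers can gate provider selection on configuration. A small sketch of that check; it assumes `Settings` is importable as `src.config.settings`, matching the file path shown above:

```python
from src.config.settings import Settings

# Fail fast if the HF token is missing before routing vision calls to HuggingFace
keys = Settings().validate_api_keys()
if not keys.get("huggingface"):
    raise SystemExit("HF_TOKEN missing - set it in .env before using LLM_PROVIDER=huggingface")
```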
src/tools/vision.py
@@ -4,8 +4,9 @@ Author: @mangobee
 Date: 2026-01-02
 
 Provides image analysis functionality using:
--
--
+- HuggingFace Inference API (Gemma-3-27B, recommended)
+- Gemini 2.0 Flash (fallback)
+- Claude Sonnet 4.5 (fallback)
 
 Supports:
 - Image file loading and encoding
@@ -15,6 +16,7 @@ Supports:
 - Visual reasoning
 """
 
+import os
 import base64
 import logging
 from pathlib import Path
@@ -36,6 +38,8 @@ RETRY_MIN_WAIT = 1 # seconds
 RETRY_MAX_WAIT = 10 # seconds
 MAX_IMAGE_SIZE_MB = 10 # Maximum image size in MB
 SUPPORTED_IMAGE_FORMATS = {'.jpg', '.jpeg', '.png', '.gif', '.webp', '.bmp'}
+HF_VISION_MODEL = os.getenv("HF_VISION_MODEL", "google/gemma-3-27b-it:scaleway")
+HF_TIMEOUT = 120 # seconds for large images
 
 # ============================================================================
 # Logging Setup
@@ -296,44 +300,165 @@ def analyze_image_claude(image_path: str, question: Optional[str] = None) -> Dic
         raise Exception(f"Claude vision failed: {str(e)}")
 
 
+# ============================================================================
+# HuggingFace Vision
+# ============================================================================
+
+@retry(
+    stop=stop_after_attempt(MAX_RETRIES),
+    wait=wait_exponential(multiplier=1, min=RETRY_MIN_WAIT, max=RETRY_MAX_WAIT),
+    retry=retry_if_exception_type((ConnectionError, TimeoutError)),
+    reraise=True,
+)
+def analyze_image_hf(image_path: str, question: Optional[str] = None) -> Dict:
+    """
+    Analyze image using HuggingFace Inference API.
+
+    Validated models (Phase 0 testing):
+    - google/gemma-3-27b-it:scaleway (recommended, ~6s)
+    - CohereLabs/aya-vision-32b (~7s)
+    - Qwen/Qwen3-VL-30B-A3B-Instruct:novita (~14s)
+
+    Args:
+        image_path: Path to image file
+        question: Optional question about the image (default: "Describe this image")
+
+    Returns:
+        Dict with structure: {
+            "answer": str,
+            "model": str,
+            "image_path": str,
+            "question": str
+        }
+
+    Raises:
+        ValueError: If HF_TOKEN not configured or image invalid
+        ConnectionError: If API connection fails (triggers retry)
+    """
+    try:
+        from huggingface_hub import InferenceClient
+
+        settings = Settings()
+        hf_token = settings.hf_token
+
+        if not hf_token:
+            raise ValueError("HF_TOKEN not configured in settings")
+
+        # Load and encode image
+        image_data = load_and_encode_image(image_path)
+
+        # Default question
+        if not question:
+            question = "Describe this image in detail."
+
+        logger.info(f"HF vision analysis: {Path(image_path).name} - '{question}'")
+        logger.info(f"Using model: {HF_VISION_MODEL}")
+
+        # Configure HF client
+        client = InferenceClient(token=hf_token)
+
+        # Create messages with base64 image
+        messages = [
+            {
+                "role": "user",
+                "content": [
+                    {"type": "text", "text": question},
+                    {
+                        "type": "image_url",
+                        "image_url": {
+                            "url": f"data:{image_data['mime_type']};base64,{image_data['data']}"
+                        }
+                    }
+                ]
+            }
+        ]
+
+        # Call chat completion
+        response = client.chat_completion(
+            model=HF_VISION_MODEL,
+            messages=messages,
+            max_tokens=1024,
+        )
+
+        answer = response.choices[0].message.content.strip()
+
+        logger.info(f"HF vision successful: {len(answer)} chars")
+
+        return {
+            "answer": answer,
+            "model": HF_VISION_MODEL,
+            "image_path": image_path,
+            "question": question,
+        }
+
+    except ValueError as e:
+        logger.error(f"HF configuration/input error: {e}")
+        raise
+    except (ConnectionError, TimeoutError) as e:
+        logger.warning(f"HF connection error (will retry): {e}")
+        raise
+    except Exception as e:
+        logger.error(f"HF vision error: {e}")
+        raise Exception(f"HF vision failed: {str(e)}")
+
+
 # ============================================================================
 # Unified Vision Analysis
 # ============================================================================
 
 def analyze_image(image_path: str, question: Optional[str] = None) -> Dict:
     """
-    Analyze image using
+    Analyze image using provider specified by LLM_PROVIDER environment variable.
 
-
+    Respects LLM_PROVIDER setting:
+    - "huggingface" -> Uses HF Inference API
+    - "gemini" -> Uses Gemini 2.0 Flash
+    - "claude" -> Uses Claude Sonnet 4.5
+    - "groq" -> Not yet implemented
 
     Args:
         image_path: Path to image file
         question: Optional question about the image
 
     Returns:
-        Dict with analysis results from
+        Dict with analysis results from selected provider
 
     Raises:
-        Exception: If
+        Exception: If selected provider fails or is not configured
     """
+    provider = os.getenv("LLM_PROVIDER", "gemini").lower()
     settings = Settings()
 
-
-
+    logger.info(f"Vision analysis with provider: {provider}")
+
+    # Route to selected provider (each fails independently - NO fallback chains)
+    if provider == "huggingface":
+        try:
+            return analyze_image_hf(image_path, question)
+        except Exception as e:
+            logger.error(f"HF vision failed: {e}")
+            raise Exception(f"HF vision failed: {str(e)}")
+
+    elif provider == "gemini":
+        if not settings.google_api_key:
+            raise ValueError("GOOGLE_API_KEY not configured for Gemini provider")
         try:
            return analyze_image_gemini(image_path, question)
         except Exception as e:
-            logger.
+            logger.error(f"Gemini vision failed: {e}")
+            raise Exception(f"Gemini vision failed: {str(e)}")
 
-
-
+    elif provider == "claude":
+        if not settings.anthropic_api_key:
+            raise ValueError("ANTHROPIC_API_KEY not configured for Claude provider")
         try:
             return analyze_image_claude(image_path, question)
         except Exception as e:
-            logger.error(f"Claude
-            raise Exception(f"
+            logger.error(f"Claude vision failed: {e}")
+            raise Exception(f"Claude vision failed: {str(e)}")
+
+    elif provider == "groq":
+        raise NotImplementedError("Groq vision not yet implemented (Phase 5)")
 
-
-        "No vision API configured. Please set GOOGLE_API_KEY or ANTHROPIC_API_KEY"
-    )
+    else:
+        raise ValueError(f"Unknown LLM_PROVIDER: {provider}. Valid: huggingface, gemini, claude, groq")
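
`analyze_image_hf()` relies on a `load_and_encode_image()` helper that sits outside these hunks. For readers without the full file, a hypothetical stand-in with the same return shape (a dict with `mime_type` and `data` keys); the real helper presumably also validates against MAX_IMAGE_SIZE_MB and SUPPORTED_IMAGE_FORMATS:

```python
import base64
import mimetypes
from pathlib import Path
from typing import Dict


def load_and_encode_image_sketch(image_path: str) -> Dict[str, str]:
    """Hypothetical stand-in: base64-encode a local image for a data-URL payload."""
    path = Path(image_path)
    if not path.is_file():
        raise ValueError(f"Image not found: {image_path}")
    mime_type, _ = mimetypes.guess_type(path.name)
    return {
        "mime_type": mime_type or "image/jpeg",
        "data": base64.b64encode(path.read_bytes()).decode("utf-8"),
    }
```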
test/test_smoke_hf_vision.py
@@ -0,0 +1,63 @@
+#!/usr/bin/env python3
+"""
+Phase 2: Smoke Tests for HF Vision Integration
+Author: @mangobee
+Date: 2026-01-11
+
+Quick validation that HF vision works before GAIA evaluation.
+"""
+
+import os
+import sys
+import logging
+from pathlib import Path
+
+# Add project root to path
+sys.path.insert(0, str(Path(__file__).parent.parent))
+
+from dotenv import load_dotenv
+load_dotenv()
+
+# Set HF provider for testing
+os.environ["LLM_PROVIDER"] = "huggingface"
+
+from src.tools.vision import analyze_image
+
+# ============================================================================
+# CONFIG
+# ============================================================================
+TEST_IMAGE = "test/fixtures/test_image_red_square.jpg"
+logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
+logger = logging.getLogger(__name__)
+
+# ============================================================================
+# Smoke Test
+# ============================================================================
+
+def run_smoke_test():
+    """Run single smoke test: simple image description."""
+    logger.info("=" * 60)
+    logger.info("PHASE 2: SMOKE TEST - HF Vision Integration")
+    logger.info("=" * 60)
+    logger.info(f"Test image: {TEST_IMAGE}")
+    logger.info(f"Provider: {os.getenv('LLM_PROVIDER')}")
+    logger.info(f"Model: {os.getenv('HF_VISION_MODEL', 'google/gemma-3-27b-it:scaleway')}")
+    logger.info("=" * 60)
+
+    try:
+        result = analyze_image(TEST_IMAGE, "What is in this image?")
+
+        logger.info("\n✅ SMOKE TEST PASSED")
+        logger.info("-" * 60)
+        logger.info(f"Model used: {result['model']}")
+        logger.info(f"Answer: {result['answer'][:200]}...")
+        logger.info("-" * 60)
+        return True
+
+    except Exception as e:
+        logger.error(f"\n❌ SMOKE TEST FAILED: {e}")
+        return False
+
+if __name__ == "__main__":
+    success = run_smoke_test()
+    sys.exit(0 if success else 1)