Spaces: Blablablab/codebook

davidjurgens committed 13 days ago

Commit aceb1b2 (verified) · 1 Parent(s): e6c39d1

Deploy: Potato — Codebook Annotation

This view is limited to 50 files because it contains too many changes.

Files changed (50)

.gitattributes +4 -0
Dockerfile +41 -0
README.md +42 -5
annotation_output/.gitkeep +0 -0
config.yaml +27 -0
data/codebook-example.json +7 -0
entrypoint.sh +30 -0
layouts/task_layout.html +90 -0
potato/__init__.py +45 -0
potato/__main__.py +3 -0
potato/active_learning_manager.py +1623 -0
potato/adjudication.py +1224 -0
potato/adjudication_export.py +162 -0
potato/admin.py +0 -0
potato/agent_proxy/__init__.py +44 -0
potato/agent_proxy/base.py +138 -0
potato/agent_proxy/coding_proxy.py +466 -0
potato/agent_proxy/echo_proxy.py +55 -0
potato/agent_proxy/http_proxy.py +108 -0
potato/agent_proxy/openai_proxy.py +105 -0
potato/agent_proxy/sandbox.py +76 -0
potato/agent_proxy/session.py +119 -0
potato/agent_runner.py +1008 -0
potato/agent_runner_manager.py +226 -0
potato/agreement.py +278 -0
potato/ai/__init__.py +1 -0
potato/ai/ai_cache.py +1473 -0
potato/ai/ai_endpoint.py +688 -0
potato/ai/ai_help_wrapper.py +203 -0
potato/ai/ai_prompt.py +94 -0
potato/ai/anthropic_endpoint.py +118 -0
potato/ai/anthropic_vision_endpoint.py +405 -0
potato/ai/gemini_endpoint.py +58 -0
potato/ai/huggingface_endpoint.py +62 -0
potato/ai/icl_labeler.py +1110 -0
potato/ai/icl_prompt_builder.py +315 -0
potato/ai/judge.py +265 -0
potato/ai/llm_active_learning.py +733 -0
potato/ai/ollama_endpoint.py +160 -0
potato/ai/ollama_vision_endpoint.py +313 -0
potato/ai/openai_endpoint.py +94 -0
potato/ai/openai_vision_endpoint.py +324 -0
potato/ai/openrouter_endpoint.py +93 -0
potato/ai/prompt/image_annotation.json +44 -0
potato/ai/prompt/likert.json +20 -0
potato/ai/prompt/models_module.py +258 -0
potato/ai/prompt/multiselect.json +20 -0
potato/ai/prompt/number.json +18 -0
potato/ai/prompt/option_highlight.json +8 -0
potato/ai/prompt/radio.json +18 -0

.gitattributes CHANGED

@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+potato/static/vendor/font-awesome-6.7.2/webfonts/fa-brands-400.ttf filter=lfs diff=lfs merge=lfs -text
+potato/static/vendor/font-awesome-6.7.2/webfonts/fa-brands-400.woff2 filter=lfs diff=lfs merge=lfs -text
+potato/static/vendor/font-awesome-6.7.2/webfonts/fa-solid-900.ttf filter=lfs diff=lfs merge=lfs -text
+potato/static/vendor/font-awesome-6.7.2/webfonts/fa-solid-900.woff2 filter=lfs diff=lfs merge=lfs -text

Dockerfile ADDED

	@@ -0,0 +1,41 @@

+FROM python:3.11-slim
+# Create non-root user (HF Spaces requires UID 1000)
+RUN useradd -m -u 1000 potato
+# Install system dependencies
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git && \
+    rm -rf /var/lib/apt/lists/*
+# Set working directory
+WORKDIR /app
+# Copy requirements first for layer caching
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt gunicorn
+# Copy the application source
+COPY . .
+# Install potato
+RUN pip install --no-cache-dir -e .
+# Copy entrypoint
+COPY entrypoint.sh /entrypoint.sh
+RUN chmod +x /entrypoint.sh
+# Create directories for output
+RUN mkdir -p /app/annotation_output && \
+    chown -R potato:potato /app
+# Switch to non-root user
+USER potato
+# HuggingFace Spaces expects port 7860
+EXPOSE 7860
+ENV POTATO_CONFIG=config.yaml
+ENV PORT=7860
+ENTRYPOINT ["/entrypoint.sh"]

README.md CHANGED

@@ -1,10 +1,47 @@
 ---
-title: Codebook
-emoji: 🐨
-colorFrom: purple
-colorTo: yellow
+title: Potato — Codebook Annotation
+emoji: 🥔
+colorFrom: blue
+colorTo: indigo
 sdk: docker
+app_port: 7860
 pinned: false
+license: apache-2.0
+tags:
+  - annotation
+  - potato
+  - advanced
+  - qda
+  - codebook
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# Potato — Codebook Annotation
+Shared evolving codebook across annotators.
+A live demo of **[Potato](https://www.potatoannotator.com)** — the free, self-hosted annotation platform for NLP,
+agentic, and GenAI research, configured entirely through YAML.
+Visit **[www.potatoannotator.com](https://www.potatoannotator.com)** for docs, the schema gallery, and more demos.
+## Try it out
+1. Enter any username to log in (no password required).
+2. Read the item shown in the main panel.
+3. Annotate using the schemes on the right.
+4. Click **Next** to continue.
+> **Run your own copy:** click the **⋮ → Duplicate this Space** button (top-right) to launch
+> this exact demo in your own account on free hardware — change the data and config to make it yours.
+> Annotations in this demo are ephemeral. To collect and keep data, deploy your own
+> Space — see the [deployment guide](https://github.com/davidjurgens/potato/blob/master/deployment/huggingface-spaces/deploy.md).
+## About Potato
+Potato supports 20+ annotation types — text, spans, images, audio, video, documents,
+and agent traces — with AI-assisted labeling, quality control, and adjudication.
+🥔 **Website: [www.potatoannotator.com](https://www.potatoannotator.com)** &nbsp;·&nbsp;
+[Documentation](https://www.potatoannotator.com) &nbsp;·&nbsp;
+[GitHub](https://github.com/davidjurgens/potato) &nbsp;·&nbsp;
+[All demos](https://github.com/davidjurgens/potato/blob/master/docs/data-export/potato_on_huggingface.md)

annotation_output/.gitkeep ADDED

File without changes

config.yaml ADDED

	@@ -0,0 +1,27 @@

+port: 8000
+annotation_task_name: Codebook Example
+task_dir: .
+output_annotation_dir: annotation_output/codebook-example/
+output_annotation_format: tsv
+data_files:
+- data/codebook-example.json
+item_properties:
+  id_key: id
+  text_key: text
+user_config:
+  allow_all_users: true
+  users: []
+alert_time_each_instance: 10000000
+codebook_mode: open
+annotation_schemes:
+- annotation_type: multiselect
+  name: themes
+  description: Which qualitative themes appear in this excerpt?
+  codebook: true
+  labels:
+  - access barriers
+  - cost concerns
+  - provider trust
+  - wait times
+site_dir: default
+require_no_password: true

data/codebook-example.json ADDED

	@@ -0,0 +1,7 @@

+[
+  {"id": "1", "text": "I couldn't get an appointment for three weeks, and even then the clinic was far from a bus line."},
+  {"id": "2", "text": "The doctor explained everything clearly and I felt listened to for the first time."},
+  {"id": "3", "text": "My copay went up again this year, so I've been skipping the follow-up visits."},
+  {"id": "4", "text": "Front desk staff were dismissive, and I waited over an hour past my scheduled time."},
+  {"id": "5", "text": "Telehealth made it much easier, but I still worry whether they have my full history."}
+]
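
Note on the data file: `item_properties` in config.yaml names `id` and `text` as the keys each record carries. A minimal sketch of a check to run before deploying a modified data file follows. The helper `validate_items` is hypothetical and not part of Potato, and the rule that ids must be unique is an assumption rather than something this commit states.

```python
import json


def validate_items(raw_json, id_key="id", text_key="text"):
    """Return a list of problems found in a JSON array of annotation items."""
    problems = []
    seen_ids = set()
    for position, item in enumerate(json.loads(raw_json)):
        # Every record needs both keys named under item_properties.
        for key in (id_key, text_key):
            if key not in item:
                problems.append(f"item {position}: missing '{key}'")
        # Assumed rule (not confirmed by the source): ids must be unique.
        item_id = item.get(id_key)
        if item_id is not None:
            if item_id in seen_ids:
                problems.append(f"item {position}: duplicate id {item_id!r}")
            seen_ids.add(item_id)
    return problems


sample = '[{"id": "1", "text": "a"}, {"id": "1", "text": "b"}, {"id": "2"}]'
for problem in validate_items(sample):
    print(problem)
```

An empty list means every record has both keys and no id repeats.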

entrypoint.sh ADDED

	@@ -0,0 +1,30 @@

+#!/bin/bash
+set -e
+# Configuration
+CONFIG_FILE="${POTATO_CONFIG:-config.yaml}"
+PORT="${PORT:-7860}"
+WORKERS="${GUNICORN_WORKERS:-2}"
+THREADS="${GUNICORN_THREADS:-4}"
+TIMEOUT="${GUNICORN_TIMEOUT:-120}"
+echo "Starting Potato Demo Space..."
+echo "  Config: ${CONFIG_FILE}"
+echo "  Port: ${PORT}"
+echo "  Workers: ${WORKERS}"
+# Validate config exists
+if [ ! -f "${CONFIG_FILE}" ]; then
+    echo "ERROR: Config file not found: ${CONFIG_FILE}"
+    exit 1
+fi
+# Start with gunicorn using the factory pattern
+exec gunicorn \
+    --bind "0.0.0.0:${PORT}" \
+    --workers "${WORKERS}" \
+    --threads "${THREADS}" \
+    --timeout "${TIMEOUT}" \
+    --access-logfile - \
+    --error-logfile - \
+    "potato.flask_server:create_app('${CONFIG_FILE}')"
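
Note on the entrypoint: every setting is read from the environment with a `${VAR:-default}` fallback, so the Space runs without any variables set. The sketch below restates that resolution in Python to make the effective command easy to inspect. `build_gunicorn_command` is a hypothetical helper written for this note, not something shipped in the commit.

```python
def build_gunicorn_command(env):
    """Mirror entrypoint.sh: read settings from env, falling back to its defaults."""

    def setting(name, default):
        # Like "${VAR:-default}" in bash, fall back when VAR is unset or empty.
        return env.get(name) or default

    config_file = setting("POTATO_CONFIG", "config.yaml")
    return [
        "gunicorn",
        "--bind", f"0.0.0.0:{setting('PORT', '7860')}",
        "--workers", setting("GUNICORN_WORKERS", "2"),
        "--threads", setting("GUNICORN_THREADS", "4"),
        "--timeout", setting("GUNICORN_TIMEOUT", "120"),
        "--access-logfile", "-",
        "--error-logfile", "-",
        f"potato.flask_server:create_app('{config_file}')",
    ]


print(" ".join(build_gunicorn_command({"PORT": "8080", "GUNICORN_WORKERS": ""})))
```

Because gunicorn binds to `PORT`, the `port: 8000` entry in config.yaml does not appear anywhere in this command.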

layouts/task_layout.html ADDED

	@@ -0,0 +1,90 @@

+<!-- CONFIG_HASH: 80d495277c377949581be1771e4d8c3e_c4da35694b3e1122725c151f4a96e755 -->
+<!-- Generated annotation layout file -->
+<!-- This file was automatically generated based on the annotation schemes in your config -->
+<!-- You can customize this file to modify the layout of your annotation interface -->
+<!-- Changes to this file will be preserved across server restarts -->
+<div class="annotation_schema">
+    <form id="themes" class="annotation-form multiselect shadcn-multiselect-container" action="javascript:void(0)" data-annotation-id="0" data-annotation-type="multiselect" data-schema-name="themes" data-grid-columns="1">
+        <fieldset schema="themes">
+            <legend class="shadcn-multiselect-title">Which qualitative themes appear in this excerpt?</legend>
+    <div class="shadcn-multiselect-grid" style="grid-template-columns: repeat(1, 1fr);">
+            <div class="shadcn-multiselect-item">
+                <input class="themes shadcn-multiselect-checkbox annotation-input"
+                       type="checkbox"
+                       id="themes_access barriers_checkbox"
+                       name="themes:::access barriers"
+                       value="access barriers"
+                       label_name="access barriers"
+                       schema="themes"
+                       onclick="whetherNone(this);registerAnnotation(this)"
+                       validation="">
+                <label for="themes_access barriers_checkbox"  schema="themes" class="shadcn-multiselect-label">
+                    Access Barriers
+                </label>
+            </div>
+            <div class="shadcn-multiselect-item">
+                <input class="themes shadcn-multiselect-checkbox annotation-input"
+                       type="checkbox"
+                       id="themes_cost concerns_checkbox"
+                       name="themes:::cost concerns"
+                       value="cost concerns"
+                       label_name="cost concerns"
+                       schema="themes"
+                       onclick="whetherNone(this);registerAnnotation(this)"
+                       validation="">
+                <label for="themes_cost concerns_checkbox"  schema="themes" class="shadcn-multiselect-label">
+                    Cost Concerns
+                </label>
+            </div>
+            <div class="shadcn-multiselect-item">
+                <input class="themes shadcn-multiselect-checkbox annotation-input"
+                       type="checkbox"
+                       id="themes_provider trust_checkbox"
+                       name="themes:::provider trust"
+                       value="provider trust"
+                       label_name="provider trust"
+                       schema="themes"
+                       onclick="whetherNone(this);registerAnnotation(this)"
+                       validation="">
+                <label for="themes_provider trust_checkbox"  schema="themes" class="shadcn-multiselect-label">
+                    Provider Trust
+                </label>
+            </div>
+            <div class="shadcn-multiselect-item">
+                <input class="themes shadcn-multiselect-checkbox annotation-input"
+                       type="checkbox"
+                       id="themes_wait times_checkbox"
+                       name="themes:::wait times"
+                       value="wait times"
+                       label_name="wait times"
+                       schema="themes"
+                       onclick="whetherNone(this);registerAnnotation(this)"
+                       validation="">
+                <label for="themes_wait times_checkbox"  schema="themes" class="shadcn-multiselect-label">
+                    Wait Times
+                </label>
+            </div>
+            <div class="shadcn-multiselect-item">
+                <input class="themes shadcn-multiselect-checkbox annotation-input"
+                       type="checkbox"
+                       id="themes_InspectCode_checkbox"
+                       name="themes:::InspectCode"
+                       value="InspectCode"
+                       label_name="InspectCode"
+                       schema="themes"
+                       onclick="whetherNone(this);registerAnnotation(this)"
+                       validation="">
+                <label for="themes_InspectCode_checkbox"  schema="themes" class="shadcn-multiselect-label">
+                    InspectCode
+                </label>
+            </div>
+        </div></fieldset></form>
+</div>
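
Note on the generated layout: each checkbox carries its schema and label in the `name` attribute, joined by `:::`, and its `id` is built as `<schema>_<label>_checkbox`. A small sketch of reading those values back follows, assuming the naming pattern visible in this file holds for other schemas as well. Both helper functions are hypothetical.

```python
def parse_input_name(name, separator=":::"):
    """Split a checkbox name such as 'themes:::wait times' into (schema, label)."""
    schema, found, label = name.partition(separator)
    if not found or not schema or not label:
        raise ValueError(f"not a '<schema>{separator}<label>' name: {name!r}")
    return schema, label


def checkbox_id(schema, label):
    """Rebuild the element id in the pattern used by the layout above."""
    return f"{schema}_{label}_checkbox"


schema, label = parse_input_name("themes:::access barriers")
print(schema, "|", label, "|", checkbox_id(schema, label))
```

Labels with spaces produce ids with spaces (for example `themes_access barriers_checkbox`). The HTML standard does not allow whitespace in `id` values, so code that looks these elements up by id may be worth checking.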

potato/__init__.py ADDED

	@@ -0,0 +1,45 @@

+"""
+Potato Annotation Platform
+A flexible, web-based platform for text annotation tasks.
+This package provides a comprehensive annotation system with the following features:
+- Multi-phase annotation workflows (consent, instructions, training, annotation, post-study)
+- Support for various annotation types (labels, spans, text, likert scales, best-worst scaling)
+- User authentication and session management
+- Active learning capabilities
+- Admin dashboard for monitoring progress
+- Configurable assignment strategies
+- Multi-language and multi-task support
+Main Components:
+- flask_server: Core Flask application and server logic
+- routes: HTTP route handlers and request processing
+- user_state_management: User progress tracking and state persistence
+- item_state_management: Data item management and assignment
+- authentificaton: User authentication backends
+- admin: Admin dashboard functionality
+- activelearning: Active learning algorithms and model training
+Usage:
+    from potato.flask_server import create_app
+    app = create_app()
+    app.run()
+"""
+from .flask_server import create_app
+__version__ = "2.6.0"
+__author__ = "Potato Annotation Platform Team"
+__description__ = "A flexible, web-based platform for text annotation tasks"
+def __getattr__(name):
+    """Lazy imports for optional heavy dependencies."""
+    if name == "load_as_dataset":
+        from .datasets_integration import load_as_dataset
+        return load_as_dataset
+    if name == "load_annotations":
+        from .datasets_integration import load_annotations
+        return load_annotations
+    raise AttributeError(f"module 'potato' has no attribute {name!r}")
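
Note on the lazy imports: the module-level `__getattr__` above is the PEP 562 hook, which Python calls only when normal attribute lookup on the module fails. The sketch below reproduces the mechanism on a throwaway module so it can be run in isolation. `make_lazy_module` is a hypothetical helper, and caching the resolved value on the module is an addition for illustration that the code above does not do.

```python
import importlib
import types


def make_lazy_module(name, lazy_attrs):
    """Build a module whose listed attributes are imported on first access.

    lazy_attrs maps an attribute name to (module path, attribute name).
    """
    module = types.ModuleType(name)

    def __getattr__(attr):
        if attr in lazy_attrs:
            target_module, target_attr = lazy_attrs[attr]
            value = getattr(importlib.import_module(target_module), target_attr)
            setattr(module, attr, value)  # cache so the hook is skipped next time
            return value
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    # PEP 562: a __getattr__ stored in the module namespace handles failed lookups.
    module.__getattr__ = __getattr__
    return module


demo = make_lazy_module("demo", {"dumps": ("json", "dumps")})
print("dumps" in vars(demo))   # False: nothing imported yet
print(demo.dumps([1, 2]))      # triggers the import of json.dumps
print("dumps" in vars(demo))   # True: cached after first access
```

Raising `AttributeError` for unknown names, as the original does, keeps `hasattr` working as expected.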

potato/__main__.py ADDED

	@@ -0,0 +1,3 @@


+from potato.flask_server import main
+
+main()

potato/active_learning_manager.py ADDED

	@@ -0,0 +1,1623 @@

+"""
+Enhanced Active Learning Manager with Database Persistence
+This module provides a comprehensive active learning system with optional
+database persistence, model saving, LLM integration, and multiple query
+strategies including uncertainty sampling, diversity sampling, BADGE, BALD,
+and hybrid combinations.
+References:
+    [1] Ash et al. (2020) "Deep Batch Active Learning by Diverse, Uncertain
+        Gradient Lower Bounds" (BADGE). ICLR 2020.
+    [2] Houlsby et al. (2011) "Bayesian Active Learning for Classification
+        and Preference Learning" (BALD).
+    [3] Bayer et al. (2024) "ActiveLLM: Large Language Model-Based Active
+        Learning for Textual Few-Shot Scenarios". TACL.
+    [4] Yuan et al. (2024) "Hide and Seek in Noise Labels: Noise-Robust
+        Collaborative Active Learning" (NoiseAL). ACL 2024.
+    [5] Mavromatis et al. (2024) "CoverICL: Selective Annotation for
+        In-Context Learning via Active Graph Coverage". EMNLP 2024.
+"""
+import threading
+import logging
+import time
+import os
+import pickle
+import json
+from typing import Dict, List, Optional, Tuple, Any, Union
+from collections import defaultdict, Counter
+import dataclasses
+from dataclasses import dataclass, field, asdict
+from enum import Enum
+import random
+import queue
+from datetime import datetime
+from abc import ABC, abstractmethod
+from sklearn.pipeline import Pipeline
+from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
+from sklearn.linear_model import LogisticRegression
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.svm import SVC
+from sklearn.metrics import accuracy_score, classification_report
+import numpy as np
+from potato.item_state_management import ItemStateManager, get_item_state_manager
+from potato.user_state_management import get_user_state_manager
+logger = logging.getLogger(__name__)
+class ResolutionStrategy(Enum):
+    """Strategies for resolving multiple annotations per instance."""
+    MAJORITY_VOTE = "majority_vote"
+    RANDOM = "random"
+    CONSENSUS = "consensus"
+    WEIGHTED_AVERAGE = "weighted_average"
+# ---------------------------------------------------------------------------
+# SentenceTransformerVectorizer
+# ---------------------------------------------------------------------------
+class SentenceTransformerVectorizer:
+    """sklearn-compatible wrapper for sentence-transformers.
+    Uses dense embeddings from pre-trained transformer models instead of
+    bag-of-words features. Produces 384-dim vectors (for default model)
+    that capture semantic meaning, enabling better classification with
+    fewer training examples.
+    The ``sentence-transformers`` package is an **optional** dependency and
+    is only imported when this vectorizer is actually used.
+    """
+    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
+        self.model_name = model_name
+        self._model = None
+    def fit(self, X, y=None):
+        from sentence_transformers import SentenceTransformer
+        self._model = SentenceTransformer(self.model_name)
+        return self
+    def transform(self, X):
+        if self._model is None:
+            raise RuntimeError("SentenceTransformerVectorizer has not been fitted yet")
+        return self._model.encode(list(X), show_progress_bar=False)
+    def fit_transform(self, X, y=None):
+        self.fit(X, y)
+        return self.transform(X)
+# ---------------------------------------------------------------------------
+# Query Strategies
+# ---------------------------------------------------------------------------
+class QueryStrategy(ABC):
+    """Base class for active learning query strategies."""
+    @abstractmethod
+    def rank(self, texts: List[str], model, vectorizer,
+             annotated_texts: Optional[List[str]] = None) -> List[Tuple[int, float]]:
+        """Return list of (index, score) sorted by selection priority (highest first)."""
+class UncertaintySampling(QueryStrategy):
+    """Select instances where classifier is least confident.
+    Selects x* = argmax_x (1 - max_y P(y|x)), i.e., instances where the
+    model's best guess has lowest confidence.
+    """
+    def rank(self, texts, model, vectorizer, annotated_texts=None):
+        try:
+            features = vectorizer.transform(texts)
+            probas = model.predict_proba(features)
+            # Score = 1 - max_prob (higher = more uncertain = higher priority)
+            scores = 1.0 - np.max(probas, axis=1)
+            ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
+            return ranked
+        except Exception as e:
+            logger.warning(f"UncertaintySampling failed: {e}")
+            return [(i, 0.5) for i in range(len(texts))]
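
Note on `UncertaintySampling`: the score is `1 - max_y P(y|x)`, so an item whose top class has probability 0.55 outranks one at 0.9. Below is a dependency-free sketch of the same ranking on hand-written probabilities. The function name and the sample numbers are made up for illustration.

```python
def least_confidence_ranking(probabilities):
    """Rank rows of class probabilities by 1 - max(p), most uncertain first."""
    scores = [1.0 - max(row) for row in probabilities]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# One row per item, one column per class (illustrative values).
probas = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]]
ranking = least_confidence_ranking(probas)
print([index for index, _ in ranking])  # [1, 2, 0]
```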
+class DiversitySampling(QueryStrategy):
+    """Select instances that maximize feature-space coverage.
+    Uses cosine distance from already-annotated instances in the vectorized
+    feature space. Ensures the training set covers the full data distribution
+    rather than over-sampling one region.
+    """
+    def rank(self, texts, model, vectorizer, annotated_texts=None):
+        from sklearn.metrics.pairwise import cosine_distances
+        try:
+            features = vectorizer.transform(texts)
+            if hasattr(features, 'toarray'):
+                features = features.toarray()
+            if annotated_texts:
+                annotated_features = vectorizer.transform(annotated_texts)
+                if hasattr(annotated_features, 'toarray'):
+                    annotated_features = annotated_features.toarray()
+                # Score = min cosine distance to any annotated instance
+                distances = cosine_distances(features, annotated_features)
+                scores = np.min(distances, axis=1)
+            else:
+                # No annotated texts yet: use distance from centroid
+                centroid = np.mean(features, axis=0, keepdims=True)
+                scores = cosine_distances(features, centroid).ravel()
+            ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
+            return ranked
+        except Exception as e:
+            logger.warning(f"DiversitySampling failed: {e}")
+            return [(i, 0.5) for i in range(len(texts))]
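
Note on `DiversitySampling`: each candidate is scored by its smallest cosine distance to any annotated item, so candidates far from everything labeled so far come first. The sketch below uses plain lists instead of scikit-learn. The helper names and vectors are illustrative, and returning distance 1.0 for a zero vector is a choice made for this sketch.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; a zero vector is treated as distance 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def diversity_ranking(candidates, annotated):
    """Rank candidates by minimum distance to the annotated set, farthest first."""
    scores = [min(cosine_distance(c, a) for a in annotated) for c in candidates]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


annotated = [[1.0, 0.0]]
candidates = [[1.0, 0.1], [0.0, 1.0], [1.0, 1.0]]
diverse = diversity_ranking(candidates, annotated)
print([index for index, _ in diverse])  # [1, 2, 0]
```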
+class BadgeStrategy(QueryStrategy):
+    """BADGE approximation: uncertainty-weighted diversity.
+    Inspired by Ash et al. (2020) [Ref 1]. Full BADGE uses gradient embeddings
+    from neural networks. Our approximation:
+      1. Weight feature vectors by (1 - max_prob) as uncertainty proxy
+      2. Run k-means++ initialization on weighted vectors to select
+         diverse-uncertain instances.
+    """
+    def rank(self, texts, model, vectorizer, annotated_texts=None):
+        try:
+            features = vectorizer.transform(texts)
+            if hasattr(features, 'toarray'):
+                features = features.toarray()
+            probas = model.predict_proba(features)
+            uncertainty = 1.0 - np.max(probas, axis=1)
+            # Weight features by uncertainty
+            weighted = features * uncertainty[:, np.newaxis]
+            # Use k-means++ initialization to select diverse-uncertain points
+            from sklearn.cluster import kmeans_plusplus
+            n_clusters = min(len(texts), max(1, len(texts) // 2))
+            _, indices = kmeans_plusplus(weighted, n_clusters=n_clusters,
+                                        random_state=42)
+            # Build score: selected centroids get highest scores
+            scores = np.zeros(len(texts))
+            for rank_pos, idx in enumerate(indices):
+                scores[idx] = len(indices) - rank_pos  # highest for first-selected
+            # For non-selected, use uncertainty as tiebreaker
+            for i in range(len(texts)):
+                if scores[i] == 0:
+                    scores[i] = uncertainty[i] * 0.01
+            ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
+            return ranked
+        except Exception as e:
+            logger.warning(f"BadgeStrategy failed, falling back to uncertainty: {e}")
+            return UncertaintySampling().rank(texts, model, vectorizer, annotated_texts)
+class BaldStrategy(QueryStrategy):
+    """BALD: Bayesian Active Learning by Disagreement.
+    Based on Houlsby et al. (2011) [Ref 2]. Trains an ensemble of classifiers
+    with different random seeds/bootstrap samples. Selects instances with
+    highest mutual information: H[y|x] - E_theta[H[y|x,theta]], i.e.,
+    where the ensemble disagrees most.
+    """
+    def __init__(self, n_estimators: int = 5, bootstrap_fraction: float = 0.8):
+        self.n_estimators = n_estimators
+        self.bootstrap_fraction = bootstrap_fraction
+    def rank(self, texts, model, vectorizer, annotated_texts=None):
+        try:
+            features = vectorizer.transform(texts)
+            if hasattr(features, 'toarray'):
+                features = features.toarray()
+            probas = model.predict_proba(features)
+            # Predictive entropy of the single fitted model (no ensemble here)
+            entropy_avg = -np.sum(probas * np.log(probas + 1e-10), axis=1)
+            # For a single model, we approximate BALD by using dropout-like noise
+            # or by comparing with uniform. Since we store the ensemble models
+            # on the manager, we just use the single model's entropy here and
+            # the ensemble version is handled in ActiveLearningManager._train_bald_ensemble
+            scores = entropy_avg
+            ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
+            return ranked
+        except Exception as e:
+            logger.warning(f"BaldStrategy failed: {e}")
+            return [(i, 0.5) for i in range(len(texts))]
+    def rank_with_ensemble(self, texts, ensemble_models, vectorizer):
+        """Rank using actual ensemble disagreement (mutual information)."""
+        try:
+            features = vectorizer.transform(texts)
+            if hasattr(features, 'toarray'):
+                features = features.toarray()
+            all_probas = []
+            for m in ensemble_models:
+                all_probas.append(m.predict_proba(features))
+            all_probas = np.array(all_probas)  # (n_estimators, n_samples, n_classes)
+            # Mean prediction across ensemble
+            mean_proba = np.mean(all_probas, axis=0)  # (n_samples, n_classes)
+            # H[y|x] - entropy of mean prediction
+            entropy_mean = -np.sum(mean_proba * np.log(mean_proba + 1e-10), axis=1)
+            # E_theta[H[y|x,theta]] - mean of individual entropies
+            individual_entropies = -np.sum(all_probas * np.log(all_probas + 1e-10), axis=2)
+            mean_entropy = np.mean(individual_entropies, axis=0)
+            # Mutual information = H[y|x] - E[H[y|x,theta]]
+            mutual_info = entropy_mean - mean_entropy
+            ranked = sorted(enumerate(mutual_info), key=lambda x: x[1], reverse=True)
+            return ranked
+        except Exception as e:
+            logger.warning(f"BaldStrategy ensemble ranking failed: {e}")
+            return [(i, 0.5) for i in range(len(texts))]
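
Note on `rank_with_ensemble`: the score is the entropy of the averaged prediction minus the average of each member's entropy. Members that are uncertain but agree score near zero, while confident members that contradict each other score high. Below is a per-item sketch without NumPy. It skips zero probabilities instead of adding `1e-10` as the code above does, and the sample distributions are invented.

```python
import math


def entropy(distribution):
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(p * math.log(p) for p in distribution if p > 0)


def mutual_information(member_probas):
    """BALD score for one item: H[mean prediction] - mean of member entropies."""
    n_members = len(member_probas)
    n_classes = len(member_probas[0])
    mean = [sum(m[c] for m in member_probas) / n_members for c in range(n_classes)]
    return entropy(mean) - sum(entropy(m) for m in member_probas) / n_members


agree = [[0.5, 0.5], [0.5, 0.5]]          # both members unsure, but identical
disagree = [[0.99, 0.01], [0.01, 0.99]]   # both confident, opposite answers
print(mutual_information(agree), mutual_information(disagree))
```

Plain entropy would rank both items the same, since their averaged predictions are identical. The disagreement term is what separates them.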
+class HybridStrategy(QueryStrategy):
+    """Weighted combination of uncertainty and diversity scores.
+    Combines strategies with configurable weights. Default: 0.7 uncertainty +
+    0.3 diversity.
+    """
+    def __init__(self, weights: Optional[Dict[str, float]] = None):
+        self.weights = weights or {"uncertainty": 0.7, "diversity": 0.3}
+    def rank(self, texts, model, vectorizer, annotated_texts=None):
+        try:
+            strategies = {}
+            if self.weights.get("uncertainty", 0) > 0:
+                strategies["uncertainty"] = UncertaintySampling()
+            if self.weights.get("diversity", 0) > 0:
+                strategies["diversity"] = DiversitySampling()
+            # Collect raw scores from each strategy
+            all_scores = {}
+            for name, strategy in strategies.items():
+                rankings = strategy.rank(texts, model, vectorizer, annotated_texts)
+                score_map = {idx: score for idx, score in rankings}
+                all_scores[name] = score_map
+            # Normalize each strategy's scores to [0, 1]
+            for name in all_scores:
+                vals = list(all_scores[name].values())
+                min_val, max_val = min(vals), max(vals)
+                rng = max_val - min_val if max_val > min_val else 1.0
+                all_scores[name] = {
+                    idx: (s - min_val) / rng for idx, s in all_scores[name].items()
+                }
+            # Weighted combination
+            combined = {}
+            for i in range(len(texts)):
+                combined[i] = sum(
+                    self.weights.get(name, 0) * all_scores.get(name, {}).get(i, 0)
+                    for name in self.weights
+                )
+            ranked = sorted(combined.items(), key=lambda x: x[1], reverse=True)
+            return ranked
+        except Exception as e:
+            logger.warning(f"HybridStrategy failed: {e}")
+            return UncertaintySampling().rank(texts, model, vectorizer, annotated_texts)
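
Note on `HybridStrategy`: each strategy's scores are min-max normalized to [0, 1] before the weighted sum, so strategies on different scales stay comparable. The sketch below walks through that arithmetic with the default 0.7 / 0.3 weights. The function names and raw scores are illustrative only.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    low, high = min(scores), max(scores)
    span = high - low if high > low else 1.0
    return [(s - low) / span for s in scores]


def combine(score_lists, weights):
    """Weighted sum of normalized per-strategy scores, best item first."""
    normalized = {name: min_max(scores) for name, scores in score_lists.items()}
    n_items = len(next(iter(normalized.values())))
    combined = [
        sum(weights.get(name, 0.0) * normalized[name][i] for name in normalized)
        for i in range(n_items)
    ]
    return sorted(enumerate(combined), key=lambda pair: pair[1], reverse=True)


raw = {"uncertainty": [0.10, 0.45, 0.30], "diversity": [0.9, 0.1, 0.5]}
hybrid = combine(raw, {"uncertainty": 0.7, "diversity": 0.3})
print([index for index, _ in hybrid])  # [1, 2, 0]
```

Item 0 is the most diverse but the least uncertain, so with these weights it ends up last.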
+# Strategy registry
+STRATEGY_REGISTRY = {
+    "uncertainty": UncertaintySampling,
+    "diversity": DiversitySampling,
+    "badge": BadgeStrategy,
+    "bald": BaldStrategy,
+    "hybrid": HybridStrategy,
+}
+def create_query_strategy(config: 'ActiveLearningConfig') -> QueryStrategy:
+    """Create a query strategy from config."""
+    strategy_name = config.query_strategy
+    if strategy_name == "hybrid":
+        return HybridStrategy(weights=config.hybrid_weights)
+    elif strategy_name == "bald":
+        params = config.bald_params
+        return BaldStrategy(
+            n_estimators=params.get("n_estimators", 5),
+            bootstrap_fraction=params.get("bootstrap_fraction", 0.8),
+        )
+    elif strategy_name in STRATEGY_REGISTRY:
+        return STRATEGY_REGISTRY[strategy_name]()
+    else:
+        logger.warning(f"Unknown strategy '{strategy_name}', falling back to uncertainty")
+        return UncertaintySampling()
+# ---------------------------------------------------------------------------
+# ICLClassifier wrapper (Phase 5A)
+# ---------------------------------------------------------------------------
+class ICLClassifier:
+    """Wraps ICLLabeler as an sklearn-compatible classifier for ensemble use.
+    Enables combining LLM-based ICL predictions with traditional classifier
+    predictions in a hybrid ensemble for active learning scoring.
+    """
+    def __init__(self, icl_labeler, schema_name: str, label_names: List[str]):
+        self.icl_labeler = icl_labeler
+        self.schema_name = schema_name
+        self.label_names = label_names
+        self.classes_ = np.array(label_names)
+    def predict_proba(self, texts: List[str]) -> np.ndarray:
+        """Get label probabilities from LLM via ICL."""
+        n_classes = len(self.label_names)
+        probas = np.full((len(texts), n_classes), 1.0 / n_classes)
+        for i, text in enumerate(texts):
+            try:
+                prediction = self.icl_labeler.label_instance(
+                    instance_id=f"_al_query_{i}",
+                    schema_name=self.schema_name,
+                    instance_text=text,
+                )
+                if prediction and prediction.predicted_label in self.label_names:
+                    idx = self.label_names.index(prediction.predicted_label)
+                    conf = prediction.confidence_score
+                    # Distribute: conf to predicted label, (1-conf)/(n-1) to others
+                    remaining = (1.0 - conf) / max(1, n_classes - 1)
+                    probas[i] = remaining
+                    probas[i, idx] = conf
+            except Exception:
+                pass  # Keep uniform distribution
+        return probas
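
Note on `ICLClassifier.predict_proba`: the LLM returns one label and a confidence, which is expanded into a full distribution by giving the confidence to the predicted class and splitting the remainder evenly. A standalone sketch of that step follows. `spread_confidence` is a hypothetical name, and the four classes match the number of labels in config.yaml.

```python
def spread_confidence(n_classes, predicted_index, confidence):
    """Give `confidence` to the predicted class and split the rest evenly."""
    remaining = (1.0 - confidence) / max(1, n_classes - 1)
    row = [remaining] * n_classes
    row[predicted_index] = confidence
    return row


row = spread_confidence(4, 2, 0.7)
print(row, sum(row))
```

With a single class the row is just `[confidence]`, which sums to less than 1 whenever the confidence is below 1. The original code shares this edge case.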
+# ---------------------------------------------------------------------------
+# Configuration
+# ---------------------------------------------------------------------------
+@dataclass
+class ActiveLearningConfig:
+    """Enhanced configuration for active learning."""
+    enabled: bool = False
+    classifier_name: str = "sklearn.linear_model.LogisticRegression"
+    classifier_kwargs: Optional[Dict[str, Any]] = None
+    vectorizer_name: str = "sklearn.feature_extraction.text.TfidfVectorizer"
+    vectorizer_kwargs: Optional[Dict[str, Any]] = None
+    min_annotations_per_instance: int = 1
+    min_instances_for_training: int = 10
+    max_instances_to_reorder: Optional[int] = None
+    resolution_strategy: ResolutionStrategy = ResolutionStrategy.MAJORITY_VOTE
+    random_sample_percent: float = 0.2
+    update_frequency: int = 5
+    schema_names: Optional[List[str]] = None
+    # Classifier/vectorizer passthrough params (Phase 1C)
+    classifier_params: Dict[str, Any] = field(default_factory=dict)
+    vectorizer_params: Dict[str, Any] = field(default_factory=dict)
+    # Probability calibration (Phase 1D)
+    calibrate_probabilities: bool = True
+    # Query strategy (Phase 2)
+    query_strategy: str = "uncertainty"
+    hybrid_weights: Dict[str, float] = field(
+        default_factory=lambda: {"uncertainty": 0.7, "diversity": 0.3}
+    )
+    bald_params: Dict[str, Any] = field(
+        default_factory=lambda: {"n_estimators": 5, "bootstrap_fraction": 0.8}
+    )
+    # Cold-start (Phase 3)
+    cold_start_strategy: str = "random"
+    cold_start_batch_size: int = 20
+    # ICL ensemble (Phase 5)
+    use_icl_ensemble: bool = False
+    icl_ensemble_params: Dict[str, Any] = field(default_factory=lambda: {
+        "initial_icl_weight": 0.7,
+        "final_icl_weight": 0.2,
+        "transition_instances": 100,
+    })
+    # Annotation routing (Phase 5D)
+    annotation_routing: bool = False
+    routing_thresholds: Dict[str, float] = field(default_factory=lambda: {
+        "auto_label_min_confidence": 0.9,
+        "show_suggestion_below": 0.5,
+    })
+    verification_sample_rate: float = 0.2
+    # Database persistence
+    database_enabled: bool = False
+    database_config: Optional[Dict[str, Any]] = None
+    # Model persistence
+    model_persistence_enabled: bool = False
+    model_save_directory: Optional[str] = None
+    model_retention_count: int = 2
+    # LLM integration
+    llm_enabled: bool = False
+    llm_config: Optional[Dict[str, Any]] = None
+    def __post_init__(self):
+        if self.classifier_kwargs is None:
+            self.classifier_kwargs = {}
+        if self.vectorizer_kwargs is None:
+            self.vectorizer_kwargs = {}
+        if self.schema_names is None:
+            self.schema_names = []
+        if self.database_config is None:
+            self.database_config = {}
+        if self.llm_config is None:
+            self.llm_config = {}
+        # Merge classifier_params into classifier_kwargs
+        if self.classifier_params:
+            self.classifier_kwargs.update(self.classifier_params)
+        # Merge vectorizer_params into vectorizer_kwargs
+        if self.vectorizer_params:
+            self.vectorizer_kwargs.update(self.vectorizer_params)
+@dataclass
+class TrainingMetrics:
+    """Metrics for a training run."""
+    schema_name: str
+    training_time: float
+    accuracy: float
+    instance_count: int
+    timestamp: datetime
+    model_file_path: Optional[str] = None
+    confidence_distribution: Optional[Dict[str, float]] = None
+    error_message: Optional[str] = None
+class ModelPersistence:
+    """Handles model saving and loading with metadata."""
+    def __init__(self, save_directory: str, retention_count: int = 2):
+        self.save_directory = save_directory
+        self.retention_count = retention_count
+        self.logger = logging.getLogger(__name__)
+        # Ensure directory exists
+        os.makedirs(save_directory, exist_ok=True)
+    def save_model(self, model: Pipeline, schema_name: str, instance_count: int) -> str:
+        """Save a trained model with metadata."""
+        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+        filename = f"{schema_name}_{instance_count}_{timestamp}.pkl"
+        filepath = os.path.join(self.save_directory, filename)
+        try:
+            # Save the complete model (including vectorizer)
+            with open(filepath, 'wb') as f:
+                pickle.dump(model, f)
+            self.logger.info(f"Saved model to {filepath}")
+            # Clean up old models
+            self._cleanup_old_models(schema_name)
+            return filepath
+        except Exception as e:
+            self.logger.error(f"Failed to save model: {e}")
+            raise
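+    def load_model(self, filepath: str) -> Optional[Pipeline]:
+        """Load a saved model.
+
+        Only load files written by save_model(): unpickling untrusted data
+        can execute arbitrary code.
+        """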
+        try:
+            with open(filepath, 'rb') as f:
+                model = pickle.load(f)
+            # TODO: Add schema validation here in the future
+            # This is a placeholder for future schema validation enhancement
+            self.logger.info(f"Loaded model from {filepath}")
+            return model
+        except Exception as e:
+            self.logger.error(f"Failed to load model from {filepath}: {e}")
+            return None
+    def _cleanup_old_models(self, schema_name: str):
+        """Clean up old models based on retention policy."""
+        try:
+            # Find all model files for this schema
+            model_files = []
+            for filename in os.listdir(self.save_directory):
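+                # Filenames are "{schema_name}_{count}_{YYYYmmdd}_{HHMMSS}.pkl"; match the
+                # schema exactly so "a" does not also claim files saved for schema "a_b"
+                stem, ext = os.path.splitext(filename)
+                parts = stem.rsplit("_", 3)
+                if ext == ".pkl" and len(parts) == 4 and parts[0] == schema_name: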
+                    filepath = os.path.join(self.save_directory, filename)
+                    model_files.append((filepath, os.path.getmtime(filepath)))
+            # Sort by modification time (newest first)
+            model_files.sort(key=lambda x: x[1], reverse=True)
+            # Remove old models beyond retention count
+            for filepath, _ in model_files[self.retention_count:]:
+                try:
+                    os.remove(filepath)
+                    self.logger.info(f"Removed old model: {filepath}")
+                except Exception as e:
+                    self.logger.warning(f"Failed to remove old model {filepath}: {e}")
+        except Exception as e:
+            self.logger.error(f"Error during model cleanup: {e}")
+class DatabaseStateManager:
+    """Manages database persistence for active learning state."""
+    def __init__(self, config: Dict[str, Any]):
+        self.config = config
+        self.logger = logging.getLogger(__name__)
+        self.connection = None
+        self._initialize_database()
+    def _initialize_database(self):
+        """Initialize database connection and create tables."""
+        try:
+            # Use the same database system as main Potato application
+            if self.config.get('type') == 'mysql':
+                self._init_mysql_connection()
+            else:
+                self._init_file_based_connection()
+            self._create_tables()
+            self.logger.info("Active learning database initialized successfully")
+        except Exception as e:
+            self.logger.error(f"Failed to initialize database: {e}")
+            raise
+    def _init_mysql_connection(self):
+        """Initialize MySQL connection."""
+        # TODO: Implement MySQL connection
+        pass
+    def _init_file_based_connection(self):
+        """Initialize file-based database connection."""
+        # TODO: Implement file-based database
+        pass
+    def _create_tables(self):
+        """Create database tables for active learning."""
+        # TODO: Implement table creation
+        pass
+    def save_training_metrics(self, metrics: TrainingMetrics):
+        """Save training metrics to database."""
+        # TODO: Implement metrics saving
+        pass
+    def get_training_history(self, schema_name: Optional[str] = None) -> List[TrainingMetrics]:
+        """Get training history from database."""
+        # TODO: Implement history retrieval
+        return []
+    def save_schema_cycling_state(self, current_schema: str, schema_order: List[str]):
+        """Save current schema cycling state."""
+        # TODO: Implement state saving
+        pass
+    def get_schema_cycling_state(self) -> Tuple[str, List[str]]:
+        """Get current schema cycling state."""
+        # TODO: Implement state retrieval
+        return "", []
+class SchemaCycler:
+    """Manages cycling through multiple annotation schemes."""
+    def __init__(self, schema_names: List[str], database_manager: Optional[DatabaseStateManager] = None):
+        self.schema_names = self._validate_schemas(schema_names)
+        self.database_manager = database_manager
+        self.current_index = 0
+        self.logger = logging.getLogger(__name__)
+        self._lock = threading.Lock()
+        # Load state from database if available
+        if self.database_manager:
+            self._load_state()
+    def _validate_schemas(self, schema_names: List[str]) -> List[str]:
+        """Validate schema names, rejecting unsupported annotation types."""
+        valid_schemas = []
+        for schema in schema_names:
+            # Text and span annotation schemes cannot be used for active learning
+            if schema in ['text', 'span']:
+                raise ValueError(f"Text and span annotation schemes are not supported for active learning: {schema}")
+            valid_schemas.append(schema)
+        return valid_schemas
+    def _load_state(self):
+        """Load cycling state from database."""
+        try:
+            current_schema, schema_order = self.database_manager.get_schema_cycling_state()
+            with self._lock:
+                if current_schema in self.schema_names:
+                    self.current_index = self.schema_names.index(current_schema)
+        except Exception as e:
+            self.logger.warning(f"Failed to load schema cycling state: {e}")
+    def get_current_schema(self) -> Optional[str]:
+        """Get the current schema for training."""
+        if not self.schema_names:
+            return None
+        with self._lock:
+            return self.schema_names[self.current_index]
+    def advance_schema(self):
+        """Advance to the next schema in the cycle."""
+        if not self.schema_names:
+            return
+        with self._lock:
+            self.current_index = (self.current_index + 1) % len(self.schema_names)
+            current_schema = self.schema_names[self.current_index]
+        # Save state to database if available
+        if self.database_manager:
+            try:
+                self.database_manager.save_schema_cycling_state(
+                    current_schema,
+                    self.schema_names
+                )
+            except Exception as e:
+                self.logger.warning(f"Failed to save schema cycling state: {e}")
+    def get_schema_order(self) -> List[str]:
+        """Get the current schema cycling order."""
+        return self.schema_names.copy()
+class ActiveLearningManager:
+    """
+    Manages active learning operations including classifier training and instance reordering.
+    This class provides thread-safe operations for:
+    - Training classifiers on annotated data
+    - Predicting confidence scores for unlabeled instances
+    - Reordering instances based on configurable query strategies
+    - Cold-start LLM-based instance selection
+    - ICL/classifier ensemble for improved ranking
+    - Noise-aware annotation routing
+    - Managing training state and progress
+    - Database persistence and model saving
+    """
+    def __init__(self, config: ActiveLearningConfig):
+        self.config = config
+        self.logger = logging.getLogger(__name__)
+        # Thread safety
+        self._lock = threading.RLock()
+        self._training_queue = queue.Queue()
+        self._training_thread = None
+        self._stop_training = threading.Event()
+        # State tracking
+        self._last_training_time = 0
+        self._training_count = 0
+        self._models = {}  # schema_name -> trained_model
+        self._vectorizers = {}  # schema_name -> fitted vectorizer
+        self._bald_ensembles = {}  # schema_name -> list of classifiers
+        self._last_annotation_count = 0
+        self._training_metrics = []  # List of TrainingMetrics
+        self._annotated_texts = {}  # schema_name -> list of annotated texts
+        # Query strategy
+        self._query_strategy = create_query_strategy(config)
+        # Database and persistence
+        self.database_manager = None
+        self.model_persistence = None
+        self.schema_cycler = None
+        # Initialize components
+        self._initialize_components()
+        # Start training thread if enabled
+        if self.config.enabled:
+            self._start_training_thread()
+    def _initialize_components(self):
+        """Initialize database, model persistence, and schema cycler."""
+        # Initialize database manager if enabled
+        if self.config.database_enabled:
+            try:
+                self.database_manager = DatabaseStateManager(self.config.database_config)
+            except Exception as e:
+                self.logger.error(f"Failed to initialize database manager: {e}")
+                # Continue without database persistence
+        # Initialize model persistence if enabled
+        if self.config.model_persistence_enabled and self.config.model_save_directory:
+            try:
+                self.model_persistence = ModelPersistence(
+                    self.config.model_save_directory,
+                    self.config.model_retention_count
+                )
+            except Exception as e:
+                self.logger.error(f"Failed to initialize model persistence: {e}")
+                # Continue without model persistence
+        # Initialize schema cycler
+        try:
+            self.schema_cycler = SchemaCycler(self.config.schema_names, self.database_manager)
+        except Exception as e:
+            self.logger.error(f"Failed to initialize schema cycler: {e}")
+            raise  # Schema cycler is critical
+    def _start_training_thread(self):
+        """Start the background training thread."""
+        if self._training_thread is None or not self._training_thread.is_alive():
+            self._training_thread = threading.Thread(target=self._training_worker, daemon=True)
+            self._training_thread.start()
+            self.logger.info("Active learning training thread started")
+    def _training_worker(self):
+        """Background worker for training classifiers."""
+        while not self._stop_training.is_set():
+            try:
+                # Wait for training request
+                training_request = self._training_queue.get(timeout=1.0)
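+                if training_request is None:  # Shutdown signal
+                    # Mark the sentinel as done so a queue.join() caller is not left waiting
+                    self._training_queue.task_done()
+                    break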
+                self._perform_training()
+                self._training_queue.task_done()
+            except queue.Empty:
+                continue
+            except Exception as e:
+                self.logger.error(f"Error in training worker: {e}")
+    def _perform_training(self):
+        """Perform the actual classifier training."""
+        with self._lock:
+            try:
+                self.logger.info("Starting active learning classifier training")
+                start_time = time.time()
+                # Get current schema for training
+                current_schema = self.schema_cycler.get_current_schema()
+                if not current_schema:
+                    self.logger.warning("No schema available for training")
+                    return
+                # Get current annotation state
+                item_manager = get_item_state_manager()
+                user_manager = get_user_state_manager()
+                # Collect training data
+                training_data = self._collect_training_data(item_manager, user_manager, current_schema)
+                # The dict always has keys (so it is always truthy); test the collected texts
+                if not training_data["texts"]:
+                    self.logger.warning(f"No training data available for schema {current_schema}")
+                    # If in cold-start phase, try LLM-based reordering
+                    if self.config.cold_start_strategy == "llm" and self.config.llm_enabled:
+                        self._cold_start_reorder(item_manager)
+                    return
+                # Train classifier
+                model, metrics = self._train_classifier(training_data, current_schema)
+                if model:
+                    self._models[current_schema] = model
+                    self._annotated_texts[current_schema] = training_data["texts"]
+                    # Save model if persistence is enabled
+                    if self.model_persistence:
+                        try:
+                            model_path = self.model_persistence.save_model(
+                                model, current_schema, len(training_data["texts"])
+                            )
+                            metrics.model_file_path = model_path
+                        except Exception as e:
+                            self.logger.error(f"Failed to save model: {e}")
+                    # Save metrics to database if available
+                    if self.database_manager:
+                        try:
+                            self.database_manager.save_training_metrics(metrics)
+                        except Exception as e:
+                            self.logger.error(f"Failed to save metrics: {e}")
+                    # Reorder instances
+                    self._reorder_instances(item_manager, current_schema)
+                    # Advance to next schema
+                    self.schema_cycler.advance_schema()
+                    self._training_count += 1
+                    self._last_training_time = time.time()
+                    training_duration = time.time() - start_time
+                    self.logger.info(f"Active learning training completed for schema {current_schema} "
+                                   f"(run #{self._training_count}, duration: {training_duration:.2f}s)")
+                else:
+                    self.logger.warning(f"Failed to train model for schema {current_schema}")
+                    # Try cold-start if not enough data
+                    if (self.config.cold_start_strategy == "llm"
+                            and self.config.llm_enabled
+                            and len(training_data.get("texts", [])) < self.config.min_instances_for_training):
+                        self._cold_start_reorder(item_manager)
+            except Exception as e:
+                self.logger.error(f"Error during training: {e}")
+                # Continue without failing the entire system
+    def _collect_training_data(self, item_manager: ItemStateManager, user_manager, schema_name: str) -> Dict:
+        """Collect training data for a specific schema."""
+        training_data = {"texts": [], "labels": [], "instance_ids": []}
+        # Get all user states
+        user_states = user_manager.get_all_users()
+        self.logger.debug(f"Found {len(user_states)} user states")
+        # Collect annotations per instance
+        instance_annotations = defaultdict(list)
+        for user_state in user_states:
+            user_annotations = user_state.get_all_annotations()
+            self.logger.debug(f"User {user_state.user_id} has {len(user_annotations)} annotations")
+            for instance_id, annotations in user_annotations.items():
+                # Check if the schema exists in the labels section
+                if 'labels' in annotations:
+                    labels_dict = annotations['labels']
+                    # Handle Label objects as keys
+                    for label_obj, value in labels_dict.items():
+                        if hasattr(label_obj, 'get_schema') and label_obj.get_schema() == schema_name:
+                            instance_annotations[instance_id].append({
+                                "label": label_obj.get_name(),
+                                "value": value,
+                                "user": user_state.user_id
+                            })
+        self.logger.debug(f"Collected annotations for {len(instance_annotations)} instances")
+        # Filter instances with sufficient annotations
+        for instance_id, annotations in instance_annotations.items():
+            if len(annotations) >= self.config.min_annotations_per_instance:
+                # Resolve multiple annotations
+                resolved_label = self._resolve_annotations(annotations)
+                if resolved_label:
+                    item = item_manager.get_item(instance_id)
+                    if item:
+                        text = item.get_text()
+                        training_data["texts"].append(text)
+                        training_data["labels"].append(resolved_label)
+                        training_data["instance_ids"].append(instance_id)
+        self.logger.debug(f"Training data collected: {len(training_data['texts'])} texts, {len(training_data['labels'])} labels")
+        return training_data
+    def _resolve_annotations(self, annotations: List[Dict]) -> Optional[str]:
+        """Resolve multiple annotations using the configured strategy."""
+        if not annotations:
+            return None
+        if self.config.resolution_strategy == ResolutionStrategy.MAJORITY_VOTE:
+            return self._majority_vote(annotations)
+        elif self.config.resolution_strategy == ResolutionStrategy.RANDOM:
+            return self._random_selection(annotations)
+        elif self.config.resolution_strategy == ResolutionStrategy.CONSENSUS:
+            return self._consensus_resolution(annotations)
+        else:
+            return self._majority_vote(annotations)  # Default fallback
+    def _majority_vote(self, annotations: List[Dict]) -> str:
+        """Resolve annotations using majority vote with random tie-breaking."""
+        label_counts = Counter(ann["label"] for ann in annotations)
+        max_count = max(label_counts.values())
+        # Find all labels with the maximum count (handles ties)
+        tied_labels = [label for label, count in label_counts.items() if count == max_count]
+        # Break ties randomly
+        return random.choice(tied_labels)
+    def _random_selection(self, annotations: List[Dict]) -> str:
+        """Resolve annotations by random selection."""
+        return random.choice(annotations)["label"]
+    def _consensus_resolution(self, annotations: List[Dict]) -> Optional[str]:
+        """Resolve annotations by consensus (all must agree)."""
+        labels = [ann["label"] for ann in annotations]
+        if len(set(labels)) == 1:
+            return labels[0]
+        return None
+    def _train_classifier(self, training_data: Dict, schema_name: str) -> Tuple[Optional[Pipeline], TrainingMetrics]:
+        """Train a classifier for a specific schema."""
+        start_time = time.time()
+        if len(training_data["texts"]) < self.config.min_instances_for_training:
+            error_msg = f"Insufficient training data for schema {schema_name}: {len(training_data['texts'])} < {self.config.min_instances_for_training}"
+            self.logger.warning(error_msg)
+            return None, TrainingMetrics(
+                schema_name=schema_name,
+                training_time=time.time() - start_time,
+                accuracy=0.0,
+                instance_count=len(training_data["texts"]),
+                timestamp=datetime.now(),
+                error_message=error_msg
+            )
+        # Check for sufficient label diversity
+        unique_labels = set(training_data["labels"])
+        if len(unique_labels) < 2:
+            error_msg = f"Insufficient label diversity for schema {schema_name}: {len(unique_labels)} unique labels"
+            self.logger.warning(error_msg)
+            return None, TrainingMetrics(
+                schema_name=schema_name,
+                training_time=time.time() - start_time,
+                accuracy=0.0,
+                instance_count=len(training_data["texts"]),
+                timestamp=datetime.now(),
+                error_message=error_msg
+            )
+        try:
+            # Create and train classifier
+            classifier = self._create_classifier()
+            vectorizer = self._create_vectorizer()
+            pipeline = Pipeline([
+                ("vectorizer", vectorizer),
+                ("classifier", classifier)
+            ])
+            pipeline.fit(training_data["texts"], training_data["labels"])
+            # Apply probability calibration if enabled
+            if self.config.calibrate_probabilities and hasattr(classifier, 'predict_proba'):
+                num_samples = len(training_data["texts"])
+                if num_samples >= 5:
+                    try:
+                        from sklearn.calibration import CalibratedClassifierCV
+                        cv_folds = min(3, num_samples // 2)
+                        if cv_folds >= 2:
+                            calibrated = CalibratedClassifierCV(
+                                pipeline, cv=cv_folds, method='isotonic'
+                            )
+                            calibrated.fit(training_data["texts"], training_data["labels"])
+                            pipeline = calibrated
+                            self.logger.debug(f"Applied probability calibration with {cv_folds}-fold CV")
+                    except Exception as e:
+                        self.logger.warning(f"Calibration failed, using uncalibrated model: {e}")
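+            # Store vectorizer separately for strategy use. A calibrated wrapper has
+            # no named_steps, so fall back to the vectorizer fitted above.
+            if hasattr(pipeline, 'named_steps'):
+                self._vectorizers[schema_name] = pipeline.named_steps.get("vectorizer", vectorizer)
+            else:
+                self._vectorizers[schema_name] = vectorizer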
+            # Train BALD ensemble if needed
+            if self.config.query_strategy == "bald":
+                self._train_bald_ensemble(training_data, schema_name)
+            # Calculate accuracy
+            predictions = pipeline.predict(training_data["texts"])
+            accuracy = accuracy_score(training_data["labels"], predictions)
+            # Calculate confidence distribution
+            confidence_distribution = self._calculate_confidence_distribution(pipeline, training_data["texts"])
+            training_time = time.time() - start_time
+            metrics = TrainingMetrics(
+                schema_name=schema_name,
+                training_time=training_time,
+                accuracy=accuracy,
+                instance_count=len(training_data["texts"]),
+                timestamp=datetime.now(),
+                confidence_distribution=confidence_distribution
+            )
+            self.logger.info(f"Trained classifier for schema {schema_name} with {len(training_data['texts'])} instances, "
+                           f"accuracy: {accuracy:.3f}, time: {training_time:.2f}s")
+            return pipeline, metrics
+        except Exception as e:
+            error_msg = f"Error training classifier for schema {schema_name}: {e}"
+            self.logger.error(error_msg)
+            return None, TrainingMetrics(
+                schema_name=schema_name,
+                training_time=time.time() - start_time,
+                accuracy=0.0,
+                instance_count=len(training_data["texts"]),
+                timestamp=datetime.now(),
+                error_message=error_msg
+            )
+    def _train_bald_ensemble(self, training_data: Dict, schema_name: str):
+        """Train an ensemble of classifiers for BALD strategy."""
+        params = self.config.bald_params
+        n_estimators = params.get("n_estimators", 5)
+        bootstrap_fraction = params.get("bootstrap_fraction", 0.8)
+        texts = training_data["texts"]
+        labels = training_data["labels"]
+        n_samples = len(texts)
+        bootstrap_size = max(2, int(n_samples * bootstrap_fraction))
+        ensemble = []
+        for i in range(n_estimators):
+            indices = np.random.choice(n_samples, size=bootstrap_size, replace=True)
+            boot_texts = [texts[j] for j in indices]
+            boot_labels = [labels[j] for j in indices]
+            # Need at least 2 classes
+            if len(set(boot_labels)) < 2:
+                continue
+            clf = self._create_classifier()
+            vec = self._create_vectorizer()
+            pipe = Pipeline([("vectorizer", vec), ("classifier", clf)])
+            pipe.fit(boot_texts, boot_labels)
+            ensemble.append(pipe)
+        if ensemble:
+            self._bald_ensembles[schema_name] = ensemble
+            self.logger.info(f"Trained BALD ensemble with {len(ensemble)} models for {schema_name}")
+    def _calculate_confidence_distribution(self, pipeline, texts: List[str]) -> Dict[str, float]:
+        """Calculate confidence score distribution."""
+        try:
+            probas = pipeline.predict_proba(texts)
+            max_confidences = np.max(probas, axis=1)
+            # Create histogram bins
+            bins = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
+            hist, _ = np.histogram(max_confidences, bins=bins)
+            # Convert to percentages
+            total = len(max_confidences)
+            distribution = {}
+            for i, count in enumerate(hist):
+                bin_label = f"{bins[i]:.1f}-{bins[i+1]:.1f}"
+                distribution[bin_label] = (count / total) * 100 if total > 0 else 0
+            return distribution
+        except Exception as e:
+            self.logger.warning(f"Failed to calculate confidence distribution: {e}")
+            return {}
+    def _create_classifier(self):
+        """Create classifier instance based on configuration."""
+        kwargs = dict(self.config.classifier_kwargs)
+        if self.config.classifier_name == "sklearn.linear_model.LogisticRegression":
+            return LogisticRegression(**kwargs)
+        elif self.config.classifier_name == "sklearn.ensemble.RandomForestClassifier":
+            return RandomForestClassifier(**kwargs)
+        elif self.config.classifier_name == "sklearn.svm.SVC":
+            kwargs.setdefault("probability", True)
+            return SVC(**kwargs)
+        else:
+            # Try to import dynamically
+            try:
+                module_name, class_name = self.config.classifier_name.rsplit('.', 1)
+                module = __import__(module_name, fromlist=[class_name])
+                classifier_class = getattr(module, class_name)
+                return classifier_class(**kwargs)
+            except Exception as e:
+                self.logger.error(f"Failed to create classifier {self.config.classifier_name}: {e}")
+                return LogisticRegression()  # Fallback
+    def _create_vectorizer(self):
+        """Create vectorizer instance based on configuration."""
+        kwargs = dict(self.config.vectorizer_kwargs)
+        if self.config.vectorizer_name == "sklearn.feature_extraction.text.CountVectorizer":
+            return CountVectorizer(**kwargs)
+        elif self.config.vectorizer_name == "sklearn.feature_extraction.text.TfidfVectorizer":
+            return TfidfVectorizer(**kwargs)
+        elif self.config.vectorizer_name == "sentence-transformers":
+            model_name = kwargs.pop("model_name", "all-MiniLM-L6-v2")
+            return SentenceTransformerVectorizer(model_name=model_name)
+        else:
+            # Try to import dynamically
+            try:
+                module_name, class_name = self.config.vectorizer_name.rsplit('.', 1)
+                module = __import__(module_name, fromlist=[class_name])
+                vectorizer_class = getattr(module, class_name)
+                return vectorizer_class(**kwargs)
+            except Exception as e:
+                self.logger.error(f"Failed to create vectorizer {self.config.vectorizer_name}: {e}")
+                return TfidfVectorizer()  # Fallback
+    def _reorder_instances(self, item_manager: ItemStateManager, schema_name: str):
+        """Reorder instances based on the configured query strategy."""
+        if schema_name not in self._models:
+            self.logger.warning(f"No trained model available for schema {schema_name}")
+            return
+        # Get unlabeled instances
+        unlabeled_instances = []
+        unlabeled_texts = []
+        for instance_id in item_manager.get_instance_ids():
+            if not item_manager.get_annotators_for_item(instance_id):
+                item = item_manager.get_item(instance_id)
+                if item:
+                    unlabeled_instances.append(instance_id)
+                    unlabeled_texts.append(item.get_text())
+        if not unlabeled_texts:
+            self.logger.info("No unlabeled instances to reorder")
+            return
+        # Limit number of instances to process
+        if self.config.max_instances_to_reorder:
+            limit = self.config.max_instances_to_reorder
+            unlabeled_instances = unlabeled_instances[:limit]
+            unlabeled_texts = unlabeled_texts[:limit]
+        model = self._models[schema_name]
+        annotated = self._annotated_texts.get(schema_name, [])
+        # Get rankings from strategy
+        if (self.config.query_strategy == "bald"
+                and schema_name in self._bald_ensembles
+                and isinstance(self._query_strategy, BaldStrategy)):
+            vectorizer = self._vectorizers.get(schema_name)
+            if vectorizer:
+                rankings = self._query_strategy.rank_with_ensemble(
+                    unlabeled_texts, self._bald_ensembles[schema_name], vectorizer
+                )
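+            else:
+                # No vectorizer is registered for this schema, so the model is
+                # passed in both the classifier and the vectorizer positions.
+                rankings = self._query_strategy.rank(unlabeled_texts, model, model, annotated)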
+        else:
+            # Extract vectorizer and classifier from pipeline for strategy use
+            vectorizer = self._vectorizers.get(schema_name)
+            classifier = model
+            if vectorizer:
+                rankings = self._query_strategy.rank(
+                    unlabeled_texts, classifier, vectorizer, annotated
+                )
+            else:
+                # Fallback: use confidence scores directly
+                instance_scores = self._calculate_confidence_scores(
+                    unlabeled_instances, item_manager, schema_name
+                )
+                sorted_instances = sorted(instance_scores, key=lambda x: x[1])
+                self._apply_reordering(sorted_instances, item_manager)
+                return
+        # ICL ensemble blending (Phase 5B)
+        if self.config.use_icl_ensemble:
+            rankings = self._blend_icl_scores(
+                rankings, unlabeled_texts, schema_name
+            )
+        # Map rankings back to instance IDs
+        sorted_instances = [
+            (unlabeled_instances[idx], score) for idx, score in rankings
+            if idx < len(unlabeled_instances)
+        ]
+        # Apply reordering with random sampling
+        self._apply_reordering(sorted_instances, item_manager)
+    def _blend_icl_scores(self, rankings: List[Tuple[int, float]],
+                          texts: List[str], schema_name: str) -> List[Tuple[int, float]]:
+        """Blend query strategy scores with ICL predictions."""
+        try:
+            from potato.ai.icl_labeler import get_icl_labeler
+            icl_labeler = get_icl_labeler()
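+            if icl_labeler is None or not icl_labeler.has_enough_examples(schema_name):
+                return rankings
+            # Nothing to blend; this also avoids min()/max() on an empty sequence below
+            if not rankings or not texts:
+                return rankings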
+            # Determine interpolation weight based on annotation count
+            params = self.config.icl_ensemble_params
+            initial_w = params.get("initial_icl_weight", 0.7)
+            final_w = params.get("final_icl_weight", 0.2)
+            transition = params.get("transition_instances", 100)
+            annotated_count = len(self._annotated_texts.get(schema_name, []))
+            progress = min(1.0, annotated_count / max(1, transition))
+            icl_weight = initial_w + (final_w - initial_w) * progress
+            strategy_weight = 1.0 - icl_weight
+            # Get ICL confidence for each text
+            icl_scores = {}
+            for idx, text in enumerate(texts):
+                try:
+                    pred = icl_labeler.label_instance(
+                        instance_id=f"_al_blend_{idx}",
+                        schema_name=schema_name,
+                        instance_text=text,
+                    )
+                    if pred:
+                        # Lower confidence = higher priority (more uncertain)
+                        icl_scores[idx] = 1.0 - pred.confidence_score
+                    else:
+                        icl_scores[idx] = 0.5
+                except Exception:
+                    icl_scores[idx] = 0.5
+            # Normalize strategy scores
+            strategy_map = {idx: score for idx, score in rankings}
+            s_vals = list(strategy_map.values())
+            s_min, s_max = min(s_vals), max(s_vals)
+            s_rng = s_max - s_min if s_max > s_min else 1.0
+            # Normalize ICL scores
+            i_vals = list(icl_scores.values())
+            i_min, i_max = min(i_vals), max(i_vals)
+            i_rng = i_max - i_min if i_max > i_min else 1.0
+            blended = []
+            for idx, score in rankings:
+                norm_s = (score - s_min) / s_rng
+                norm_i = (icl_scores.get(idx, 0.5) - i_min) / i_rng
+                combined = strategy_weight * norm_s + icl_weight * norm_i
+                blended.append((idx, combined))
+            blended.sort(key=lambda x: x[1], reverse=True)
+            return blended
+        except ImportError:
+            return rankings
+        except Exception as e:
+            self.logger.warning(f"ICL blending failed: {e}")
+            return rankings
+    def _cold_start_reorder(self, item_manager: ItemStateManager):
+        """LLM-based cold-start instance selection (Phase 3A).
+        Based on Bayer et al. (2024) ActiveLLM approach. Before enough
+        annotations exist for classifier training, use an LLM to estimate
+        which instances are most informative by finding those where the LLM's
+        confidence is moderate (on the decision boundary).
+        """
+        try:
+            from potato.ai.llm_active_learning import create_llm_active_learning
+            llm = create_llm_active_learning(self.config.llm_config)
+            # Sample candidate instances
+            all_ids = list(item_manager.get_instance_ids())
+            unannotated = [
+                iid for iid in all_ids
+                if not item_manager.get_annotators_for_item(iid)
+            ]
+            if not unannotated:
+                return
+            batch_size = min(self.config.cold_start_batch_size, len(unannotated))
+            candidates = random.sample(unannotated, batch_size)
+            instances = []
+            for iid in candidates:
+                item = item_manager.get_item(iid)
+                if item:
+                    instances.append({"id": iid, "text": item.get_text()})
+            if not instances:
+                return
+            # Get LLM predictions
+            schema_name = self.schema_cycler.get_current_schema() if self.schema_cycler else None
+            predictions = llm.predict_instances(
+                instances=instances,
+                annotation_instructions="Rate your confidence in labeling this text.",
+                schema_name=schema_name or "default",
+                label_options=["positive", "negative", "neutral"],
+            )
+            # Select instances with moderate confidence (decision boundary)
+            moderate = []
+            other = []
+            for pred in predictions:
+                if 0.4 <= pred.confidence_score <= 0.7:
+                    moderate.append((pred.instance_id, pred.confidence_score))
+                else:
+                    other.append((pred.instance_id, pred.confidence_score))
+            # Moderate-confidence first, then the other sampled instances
+            # (the unsampled remainder is shuffled and appended below)
+            reordered = [iid for iid, _ in moderate] + [iid for iid, _ in other]
+            # Add remaining unannotated instances not in the sample
+            sampled_set = set(candidates)
+            remaining = [iid for iid in unannotated if iid not in sampled_set]
+            random.shuffle(remaining)
+            reordered.extend(remaining)
+            item_manager.reorder_instances(reordered)
+            self.logger.info(f"Cold-start LLM reordering: {len(moderate)} moderate-confidence, "
+                           f"{len(other)} other, {len(remaining)} remaining")
+        except Exception as e:
+            self.logger.warning(f"Cold-start LLM reordering failed: {e}")
+    def _route_annotation(self, instance_id: str, instance_text: str,
+                          schema_name: str) -> Dict[str, Any]:
+        """Noise-aware annotation routing (Phase 5D).
+        Based on Yuan et al. (2024) NoiseAL approach. Routes instances
+        between LLM auto-labeling and human annotation based on LLM
+        confidence levels.
+        Returns:
+            Dict with 'route' ('human'|'auto') and, depending on the route,
+            'auto_label', 'suggestion', 'confidence', and
+            'needs_verification'.
+        """
+        if not self.config.annotation_routing:
+            return {"route": "human"}
+        thresholds = self.config.routing_thresholds
+        auto_min = thresholds.get("auto_label_min_confidence", 0.9)
+        suggest_below = thresholds.get("show_suggestion_below", 0.5)
+        try:
+            from potato.ai.icl_labeler import get_icl_labeler
+            icl_labeler = get_icl_labeler()
+            if icl_labeler is None or not icl_labeler.has_enough_examples(schema_name):
+                return {"route": "human"}
+            prediction = icl_labeler.label_instance(
+                instance_id=instance_id,
+                schema_name=schema_name,
+                instance_text=instance_text,
+            )
+            if prediction is None:
+                return {"route": "human"}
+            confidence = prediction.confidence_score
+            if confidence >= auto_min:
+                # High confidence: auto-label with periodic verification
+                should_verify = random.random() < self.config.verification_sample_rate
+                return {
+                    "route": "auto",
+                    "auto_label": prediction.predicted_label,
+                    "confidence": confidence,
+                    "needs_verification": should_verify,
+                }
+            elif confidence < suggest_below:
+                # Low confidence: route to human with LLM suggestion
+                return {
+                    "route": "human",
+                    "suggestion": prediction.predicted_label,
+                    "confidence": confidence,
+                }
+            else:
+                # Medium confidence: route to human (most informative)
+                return {"route": "human"}
+        except ImportError:
+            return {"route": "human"}
+        except Exception as e:
+            self.logger.warning(f"Annotation routing failed for {instance_id}: {e}")
+            return {"route": "human"}
+    def _calculate_confidence_scores(self, instance_ids: List[str], item_manager: ItemStateManager, schema_name: str) -> List[Tuple[str, float]]:
+        """Calculate confidence scores for instances."""
+        instance_scores = []
+        model = self._models[schema_name]
+        for instance_id in instance_ids:
+            item = item_manager.get_item(instance_id)
+            if not item:
+                continue
+            text = item.get_text()
+            try:
+                # Get prediction probabilities
+                probas = model.predict_proba([text])[0]
+                confidence = np.max(probas)
+                instance_scores.append((instance_id, confidence))
+            except Exception as e:
+                self.logger.warning(f"Error predicting for instance {instance_id}: {e}")
+                # Default to low confidence for failed predictions
+                instance_scores.append((instance_id, 0.1))
+        return instance_scores
+    def _apply_reordering(self, sorted_instances: List[Tuple[str, float]], item_manager: ItemStateManager):
+        """Apply the new ordering to the item manager."""
+        # Extract instance IDs in new order
+        new_order = [instance_id for instance_id, _ in sorted_instances]
+        if not new_order:
+            return
+        # Apply random sampling
+        random_count = int(len(new_order) * self.config.random_sample_percent)
+        if 0 < random_count <= len(new_order):
+            random_instances = random.sample(new_order, random_count)
+        else:
+            random_instances = []
+        # Remove the randomly sampled IDs from the ranked list so that no
+        # instance appears twice in the final ordering
+        random_set = set(random_instances)
+        al_instances = [iid for iid in new_order if iid not in random_set]
+        # Interleave active learning and random instances
+        final_order = []
+        al_idx = 0
+        rand_idx = 0
+        while al_idx < len(al_instances) or rand_idx < len(random_instances):
+            if al_idx < len(al_instances):
+                final_order.append(al_instances[al_idx])
+                al_idx += 1
+            if rand_idx < len(random_instances):
+                final_order.append(random_instances[rand_idx])
+                rand_idx += 1
+        # Update item manager ordering
+        item_manager.reorder_instances(final_order)
+        self.logger.info(f"Reordered {len(final_order)} instances")
+    def check_and_trigger_training(self):
+        """Check if training should be triggered and queue it if needed."""
+        if not self.config.enabled:
+            self.logger.debug("Active learning is disabled")
+            return
+        with self._lock:
+            # Count current annotations
+            user_manager = get_user_state_manager()
+            current_annotation_count = sum(
+                len(user_state.get_all_annotations())
+                for user_state in user_manager.get_all_users()
+            )
+            self.logger.debug(f"Current annotation count: {current_annotation_count}, last count: {self._last_annotation_count}, update_frequency: {self.config.update_frequency}")
+            # Check if we should trigger training
+            if (current_annotation_count - self._last_annotation_count) >= self.config.update_frequency:
+                self._training_queue.put("train")
+                self._last_annotation_count = current_annotation_count
+                self.logger.info(f"Queued active learning training (annotations: {current_annotation_count})")
+            else:
+                self.logger.debug("Not enough new annotations to trigger training")
+    def force_training(self):
+        """Force immediate training (for testing purposes)."""
+        if not self.config.enabled:
+            self.logger.debug("Active learning is disabled")
+            return
+        self.logger.info("Forcing immediate active learning training")
+        self._training_queue.put("train")
+    def get_stats(self) -> Dict[str, Any]:
+        """Get active learning statistics."""
+        with self._lock:
+            stats = {
+                "enabled": self.config.enabled,
+                "training_count": self._training_count,
+                "last_training_time": self._last_training_time,
+                "models_trained": list(self._models.keys()),
+                "current_schema": self.schema_cycler.get_current_schema() if self.schema_cycler else None,
+                "schema_order": self.schema_cycler.get_schema_order() if self.schema_cycler else [],
+                "database_enabled": self.config.database_enabled,
+                "model_persistence_enabled": self.config.model_persistence_enabled,
+                "llm_enabled": self.config.llm_enabled,
+                "query_strategy": self.config.query_strategy,
+                "calibrate_probabilities": self.config.calibrate_probabilities,
+                "cold_start_strategy": self.config.cold_start_strategy,
+                "use_icl_ensemble": self.config.use_icl_ensemble,
+                "annotation_routing": self.config.annotation_routing,
+            }
+            # Add training metrics if available
+            if self.database_manager:
+                try:
+                    stats["training_history"] = [
+                        asdict(metrics) for metrics in self.database_manager.get_training_history()
+                    ]
+                except Exception as e:
+                    self.logger.warning(f"Failed to get training history: {e}")
+                    stats["training_history"] = []
+            return stats
+    def shutdown(self):
+        """Shutdown the active learning manager."""
+        self._stop_training.set()
+        if self._training_thread and self._training_thread.is_alive():
+            self._training_queue.put(None)  # Shutdown signal
+            self._training_thread.join(timeout=5.0)
+        self.logger.info("Active learning manager shutdown complete")
+# Global singleton instance
+ACTIVE_LEARNING_MANAGER = None
+def parse_active_learning_config(config_data: Dict[str, Any]) -> Optional[ActiveLearningConfig]:
+    """Build an ``ActiveLearningConfig`` from a Potato project config dict.
+    Returns None when active learning is not enabled. Maps the keys under the
+    ``active_learning:`` section onto the dataclass fields (unknown keys are
+    ignored), and defaults ``schema_names`` to the project's labelable
+    annotation schemes when not given.
+    """
+    al_dict = (config_data or {}).get("active_learning", {}) or {}
+    if not al_dict.get("enabled"):
+        return None
+    valid_fields = {f.name for f in dataclasses.fields(ActiveLearningConfig)}
+    kwargs = {k: v for k, v in al_dict.items() if k in valid_fields}
+    # Honor the nested `active_learning.llm:` block (LLM cold-start / ICL).
+    # The dataclass uses flat fields (llm_enabled / llm_config), so translate.
+    llm_block = al_dict.get("llm")
+    if isinstance(llm_block, dict):
+        kwargs.setdefault("llm_enabled", bool(llm_block.get("enabled", False)))
+        kwargs.setdefault("llm_config", llm_block)
+    # YAML parses sequences as lists, but sklearn's vectorizers require a tuple
+    # for ngram_range (e.g. (1, 2)). Coerce it so training doesn't fail.
+    vec_params = kwargs.get("vectorizer_params")
+    if isinstance(vec_params, dict) and isinstance(vec_params.get("ngram_range"), list):
+        vec_params = dict(vec_params)
+        vec_params["ngram_range"] = tuple(vec_params["ngram_range"])
+        kwargs["vectorizer_params"] = vec_params
+    # resolution_strategy may arrive as a string; coerce to the enum.
+    rs = kwargs.get("resolution_strategy")
+    if isinstance(rs, str):
+        try:
+            kwargs["resolution_strategy"] = ResolutionStrategy(rs)
+        except ValueError:
+            kwargs.pop("resolution_strategy", None)
+    # Default schema_names to the labelable schemes in the project.
+    if not kwargs.get("schema_names"):
+        schemes = config_data.get("annotation_schemes", []) or []
+        kwargs["schema_names"] = [
+            s.get("name") for s in schemes
+            if s.get("name") and s.get("annotation_type") in (
+                "radio", "multiselect", "likert", "select"
+            )
+        ]
+    return ActiveLearningConfig(**kwargs)
+def init_active_learning_manager(config: ActiveLearningConfig) -> ActiveLearningManager:
+    """Initialize the global active learning manager."""
+    global ACTIVE_LEARNING_MANAGER
+    if ACTIVE_LEARNING_MANAGER is None:
+        ACTIVE_LEARNING_MANAGER = ActiveLearningManager(config)
+    return ACTIVE_LEARNING_MANAGER
+def get_active_learning_manager() -> Optional[ActiveLearningManager]:
+    """Get the global active learning manager."""
+    return ACTIVE_LEARNING_MANAGER
+def clear_active_learning_manager():
+    """Clear the global active learning manager (for testing)."""
+    global ACTIVE_LEARNING_MANAGER
+    if ACTIVE_LEARNING_MANAGER:
+        ACTIVE_LEARNING_MANAGER.shutdown()
+    ACTIVE_LEARNING_MANAGER = None
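
The score blending performed by `_blend_icl_scores` in the file above can be tried in isolation. The following is a minimal sketch, not the module's code: the helper names `icl_weight`, `min_max`, and `blend` are hypothetical, the defaults mirror the `icl_ensemble_params` fallbacks (0.7, 0.2, 100), and it assumes both score lists are aligned by index, which simplifies the dictionary lookups used in the method.

```python
from typing import List, Tuple


def icl_weight(annotated_count: int, initial_w: float = 0.7,
               final_w: float = 0.2, transition: int = 100) -> float:
    """Move the ICL weight linearly from initial_w to final_w."""
    progress = min(1.0, annotated_count / max(1, transition))
    return initial_w + (final_w - initial_w) * progress


def min_max(values: List[float]) -> List[float]:
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    rng = hi - lo if hi > lo else 1.0
    return [(v - lo) / rng for v in values]


def blend(strategy_scores: List[float], icl_uncertainty: List[float],
          annotated_count: int) -> List[Tuple[int, float]]:
    """Return (index, blended score) pairs, highest priority first."""
    w_icl = icl_weight(annotated_count)
    norm_s = min_max(strategy_scores)
    norm_i = min_max(icl_uncertainty)
    blended = [
        (idx, (1.0 - w_icl) * s + w_icl * i)
        for idx, (s, i) in enumerate(zip(norm_s, norm_i))
    ]
    blended.sort(key=lambda pair: pair[1], reverse=True)
    return blended


if __name__ == "__main__":
    # Early on the ICL uncertainty dominates; later the query strategy does.
    print(blend([0.1, 0.9], [0.8, 0.2], annotated_count=0))
    print(blend([0.1, 0.9], [0.8, 0.2], annotated_count=100))
```

With few annotations the ICL uncertainty decides the order, and once `transition` annotations exist the query strategy's score carries most of the weight.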

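
The three confidence bands used by `_route_annotation` can likewise be sketched as a pure function. This is an illustrative sketch under stated assumptions: the function name `route` is hypothetical, the defaults mirror the `routing_thresholds` fallbacks (0.9 and 0.5), and the random `needs_verification` flag is left out so the result is deterministic.

```python
from typing import Any, Dict


def route(confidence: float, label: str, auto_min: float = 0.9,
          suggest_below: float = 0.5) -> Dict[str, Any]:
    """Pick a route for one prediction from its confidence score."""
    if confidence >= auto_min:
        # High confidence: accept the predicted label automatically
        return {"route": "auto", "auto_label": label, "confidence": confidence}
    if confidence < suggest_below:
        # Low confidence: a human decides, with the prediction as a hint
        return {"route": "human", "suggestion": label, "confidence": confidence}
    # Medium confidence: a human decides without a suggestion
    return {"route": "human"}


if __name__ == "__main__":
    for score in (0.95, 0.7, 0.3):
        print(score, route(score, "positive"))
```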
potato/adjudication.py ADDED

	@@ -0,0 +1,1224 @@

+"""
+Adjudication Module
+This module provides a comprehensive adjudication system where designated users
+review items with multiple annotations, resolve disagreements, and produce
+gold-standard final decisions.
+Adjudication is NOT a phase — it's a parallel workflow accessible via a dedicated
+/adjudicate route, available to users with adjudicator privileges. This avoids
+disrupting the existing phase progression system.
+Key Components:
+- AdjudicationConfig: Configuration dataclass for adjudication settings
+- AdjudicationItem: Represents an item eligible for adjudication with all annotations
+- AdjudicationDecision: Represents an adjudicator's final decision on an item
+- AdjudicationManager: Singleton manager for the adjudication workflow
+The workflow:
+1. Annotators complete annotations via /annotate (existing workflow)
+2. AdjudicationManager monitors annotation counts and agreement
+3. Items are flagged when criteria are met (min annotations, low agreement)
+4. Adjudicators review items via /adjudicate and submit decisions
+5. Final dataset CLI merges unanimous + adjudicated decisions
+"""
+import json
+import logging
+import math
+import os
+import threading
+from collections import Counter, defaultdict
+from dataclasses import dataclass, field
+from datetime import datetime
+from typing import Dict, List, Optional, Any, Set
+logger = logging.getLogger(__name__)
+# Singleton instance
+_ADJUDICATION_MANAGER = None
+_ADJUDICATION_LOCK = threading.Lock()
+@dataclass
+class AdjudicationConfig:
+    """Configuration for adjudication features."""
+    enabled: bool = False
+    adjudicator_users: List[str] = field(default_factory=list)
+    # Trigger criteria
+    min_annotations: int = 2
+    require_fully_annotated: bool = False
+    agreement_threshold: float = 0.75
+    show_all_items: bool = False
+    # Display options
+    show_annotator_names: bool = True
+    show_timing_data: bool = True
+    show_agreement_scores: bool = True
+    fast_decision_warning_ms: int = 2000
+    # Adjudicator metadata fields
+    require_confidence: bool = True
+    require_notes_on_override: bool = False
+    error_taxonomy: List[str] = field(default_factory=lambda: [
+        "ambiguous_text", "guideline_gap", "annotator_error",
+        "edge_case", "subjective_disagreement", "other"
+    ])
+    # Similarity (Phase 3, optional)
+    similarity_enabled: bool = False
+    similarity_model: str = "all-MiniLM-L6-v2"
+    similarity_top_k: int = 5
+    similarity_precompute: bool = True
+    # Output
+    output_subdir: str = "adjudication"
+@dataclass
+class AdjudicationItem:
+    """Represents an item eligible for adjudication with all annotator data."""
+    instance_id: str
+    annotations: Dict[str, Dict[str, Any]]  # user_id -> {schema: {label: value}}
+    span_annotations: Dict[str, List[Dict]]  # user_id -> [span_dict, ...]
+    behavioral_data: Dict[str, Dict]  # user_id -> {total_time_ms, ...}
+    agreement_scores: Dict[str, float]  # schema_name -> agreement score
+    overall_agreement: float
+    num_annotators: int
+    status: str = "pending"  # pending, in_progress, completed, skipped
+    assigned_adjudicator: Optional[str] = None
+    mace_predictions: Dict[str, Any] = field(default_factory=dict)  # schema -> predicted label
+    def to_dict(self) -> Dict[str, Any]:
+        """Serialize to dictionary for JSON output."""
+        result = {
+            "instance_id": self.instance_id,
+            "annotations": self.annotations,
+            "span_annotations": self.span_annotations,
+            "behavioral_data": self.behavioral_data,
+            "agreement_scores": self.agreement_scores,
+            "overall_agreement": self.overall_agreement,
+            "num_annotators": self.num_annotators,
+            "status": self.status,
+            "assigned_adjudicator": self.assigned_adjudicator,
+        }
+        if self.mace_predictions:
+            result["mace_predictions"] = self.mace_predictions
+        return result
+@dataclass
+class AdjudicationDecision:
+    """Represents an adjudicator's final decision on an item."""
+    instance_id: str
+    adjudicator_id: str
+    timestamp: str  # ISO format string
+    label_decisions: Dict[str, Any]  # schema -> value
+    span_decisions: List[Dict]  # list of span dicts
+    source: Dict[str, str]  # schema -> "annotator_X" | "adjudicator" | "merged"
+    confidence: str  # "high", "medium", "low"
+    notes: str
+    error_taxonomy: List[str]
+    guideline_update_flag: bool = False
+    guideline_update_notes: str = ""
+    time_spent_ms: int = 0
+    def to_dict(self) -> Dict[str, Any]:
+        """Serialize to dictionary for JSON output."""
+        return {
+            "instance_id": self.instance_id,
+            "adjudicator_id": self.adjudicator_id,
+            "timestamp": self.timestamp,
+            "label_decisions": self.label_decisions,
+            "span_decisions": self.span_decisions,
+            "source": self.source,
+            "confidence": self.confidence,
+            "notes": self.notes,
+            "error_taxonomy": self.error_taxonomy,
+            "guideline_update_flag": self.guideline_update_flag,
+            "guideline_update_notes": self.guideline_update_notes,
+            "time_spent_ms": self.time_spent_ms,
+        }
+    @classmethod
+    def from_dict(cls, d: Dict[str, Any]) -> "AdjudicationDecision":
+        """Deserialize from dictionary."""
+        return cls(
+            instance_id=d["instance_id"],
+            adjudicator_id=d["adjudicator_id"],
+            timestamp=d["timestamp"],
+            label_decisions=d.get("label_decisions", {}),
+            span_decisions=d.get("span_decisions", []),
+            source=d.get("source", {}),
+            confidence=d.get("confidence", "medium"),
+            notes=d.get("notes", ""),
+            error_taxonomy=d.get("error_taxonomy", []),
+            guideline_update_flag=d.get("guideline_update_flag", False),
+            guideline_update_notes=d.get("guideline_update_notes", ""),
+            time_spent_ms=d.get("time_spent_ms", 0),
+        )
+class AdjudicationManager:
+    """
+    Manages the adjudication workflow including queue building, agreement
+    computation, decision storage, and final dataset generation.
+    Follows the singleton pattern used by QualityControlManager.
+    """
+    def __init__(self, config: Dict[str, Any]):
+        """
+        Initialize the adjudication manager.
+        Args:
+            config: The full application configuration dictionary
+        """
+        self.config = config
+        self.logger = logging.getLogger(__name__)
+        self._lock = threading.RLock()
+        # Parse configuration
+        self.adj_config = self._parse_config(config)
+        # Queue and decisions
+        self.queue: Dict[str, AdjudicationItem] = {}  # instance_id -> AdjudicationItem
+        self.decisions: Dict[str, AdjudicationDecision] = {}  # instance_id -> decision
+        self._queue_built = False
+        # Load any previously saved decisions
+        self._load_decisions()
+        # Initialize similarity engine (Phase 3)
+        self.similarity_engine = None
+        if self.adj_config.similarity_enabled:
+            from potato.similarity import init_similarity_engine
+            self.similarity_engine = init_similarity_engine(config, self.adj_config)
+            if (self.similarity_engine and self.similarity_engine.enabled
+                    and self.adj_config.similarity_precompute):
+                self._precompute_similarities()
+        self.logger.info(
+            f"AdjudicationManager initialized: enabled={self.adj_config.enabled}, "
+            f"adjudicators={self.adj_config.adjudicator_users}"
+        )
+    def _parse_config(self, config: Dict[str, Any]) -> AdjudicationConfig:
+        """Parse adjudication configuration from the main config."""
+        adj = AdjudicationConfig()
+        adj_config = config.get("adjudication", {})
+        if not adj_config or not adj_config.get("enabled", False):
+            return adj
+        adj.enabled = True
+        adj.adjudicator_users = adj_config.get("adjudicator_users", [])
+        adj.min_annotations = adj_config.get("min_annotations", 2)
+        adj.require_fully_annotated = adj_config.get("require_fully_annotated", False)
+        adj.agreement_threshold = adj_config.get("agreement_threshold", 0.75)
+        adj.show_all_items = adj_config.get("show_all_items", False)
+        adj.show_annotator_names = adj_config.get("show_annotator_names", True)
+        adj.show_timing_data = adj_config.get("show_timing_data", True)
+        adj.show_agreement_scores = adj_config.get("show_agreement_scores", True)
+        adj.fast_decision_warning_ms = adj_config.get("fast_decision_warning_ms", 2000)
+        adj.require_confidence = adj_config.get("require_confidence", True)
+        adj.require_notes_on_override = adj_config.get("require_notes_on_override", False)
+        if "error_taxonomy" in adj_config:
+            adj.error_taxonomy = adj_config["error_taxonomy"]
+        # Similarity settings
+        sim_config = adj_config.get("similarity", {})
+        if sim_config.get("enabled", False):
+            adj.similarity_enabled = True
+            adj.similarity_model = sim_config.get("model", "all-MiniLM-L6-v2")
+            adj.similarity_top_k = sim_config.get("top_k", 5)
+            adj.similarity_precompute = sim_config.get("precompute_on_start", True)
+        adj.output_subdir = adj_config.get("output_subdir", "adjudication")
+        return adj
+    def is_adjudicator(self, username: str) -> bool:
+        """Check if a user is an authorized adjudicator."""
+        if not self.adj_config.enabled:
+            return False
+        return username in self.adj_config.adjudicator_users
+    def build_queue(self) -> List[AdjudicationItem]:
+        """
+        Scan all user annotations and build the adjudication queue.
+        Items become eligible when they have enough annotations and
+        agreement is below the threshold.
+        Returns:
+            List of AdjudicationItem objects
+        """
+        from potato.user_state_management import get_user_state_manager
+        from potato.item_state_management import get_item_state_manager
+        with self._lock:
+            usm = get_user_state_manager()
+            ism = get_item_state_manager()
+            # Get all annotation schemes from config
+            annotation_schemes = self.config.get("annotation_schemes", [])
+            scheme_names = [s.get("name", "") for s in annotation_schemes]
+            # Iterate over all items
+            for instance_id, item in ism.instance_id_to_instance.items():
+                instance_id_str = str(instance_id)
+                # Skip if already decided, marking any queued entry as completed
+                if instance_id_str in self.decisions:
+                    if instance_id_str in self.queue:
+                        self.queue[instance_id_str].status = "completed"
+                    continue
+                # Get all annotators for this item
+                annotators = ism.instance_annotators.get(instance_id, set())
+                # Filter out adjudicators from annotator list
+                annotators = {
+                    u for u in annotators
+                    if u not in self.adj_config.adjudicator_users
+                }
+                if len(annotators) < self.adj_config.min_annotations:
+                    continue
+                # Check if we require fully annotated items
+                if self.adj_config.require_fully_annotated:
+                    max_per_item = ism.max_annotations_per_item
+                    if max_per_item > 0 and len(annotators) < max_per_item:
+                        continue
+                # Collect annotations from all annotators
+                item_annotations = {}
+                item_spans = {}
+                item_behavioral = {}
+                for user_id in annotators:
+                    user_state = usm.get_user_state(user_id)
+                    if not user_state:
+                        continue
+                    # Get label annotations
+                    label_annots = user_state.instance_id_to_label_to_value.get(
+                        instance_id_str, {}
+                    )
+                    if label_annots:
+                        item_annotations[user_id] = self._serialize_labels(label_annots)
+                    # Get span annotations
+                    span_annots = user_state.instance_id_to_span_to_value.get(
+                        instance_id_str, {}
+                    )
+                    if span_annots:
+                        item_spans[user_id] = self._serialize_spans(span_annots)
+                    # Get behavioral data
+                    bd = user_state.instance_id_to_behavioral_data.get(
+                        instance_id_str, {}
+                    )
+                    if bd:
+                        item_behavioral[user_id] = self._serialize_behavioral(bd)
+                if not item_annotations and not item_spans:
+                    continue
+                # Compute agreement scores
+                agreement_scores = self._compute_agreement(
+                    item_annotations, scheme_names
+                )
+                overall = self._compute_overall_agreement(agreement_scores)
+                # Filter by agreement threshold
+                if not self.adj_config.show_all_items:
+                    if overall >= self.adj_config.agreement_threshold:
+                        continue
+                # Preserve existing status if already in queue
+                existing = self.queue.get(instance_id_str)
+                status = existing.status if existing else "pending"
+                assigned = existing.assigned_adjudicator if existing else None
+                # Enrich with MACE predictions if available
+                mace_preds = {}
+                try:
+                    from potato.mace_manager import get_mace_manager
+                    mace_mgr = get_mace_manager()
+                    if mace_mgr and mace_mgr.results:
+                        for sname in scheme_names:
+                            pred = mace_mgr.get_prediction(instance_id_str, sname)
+                            if pred is not None:
+                                mace_preds[sname] = pred
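+                except Exception:
+                    # MACE is optional; skip enrichment but leave a trace for debugging
+                    self.logger.debug("MACE predictions unavailable", exc_info=True)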
+                self.queue[instance_id_str] = AdjudicationItem(
+                    instance_id=instance_id_str,
+                    annotations=item_annotations,
+                    span_annotations=item_spans,
+                    behavioral_data=item_behavioral,
+                    agreement_scores=agreement_scores,
+                    overall_agreement=overall,
+                    num_annotators=len(annotators),
+                    status=status,
+                    assigned_adjudicator=assigned,
+                    mace_predictions=mace_preds,
+                )
+            self._queue_built = True
+            return list(self.queue.values())
+    def try_enqueue_item(self, instance_id: str) -> bool:
+        """
+        Evaluate a single item and, if it qualifies, add it to the queue.
+        Called when an overlap-sample item saturates so that low-agreement
+        items show up in the adjudication queue without needing a full
+        ``build_queue()`` rescan. Returns True if the item ended up in the
+        queue, False otherwise.
+        """
+        if not self.adj_config.enabled:
+            return False
+        from potato.user_state_management import get_user_state_manager
+        from potato.item_state_management import get_item_state_manager
+        usm = get_user_state_manager()
+        ism = get_item_state_manager()
+        if usm is None or ism is None:
+            return False
+        with self._lock:
+            instance_id_str = str(instance_id)
+            if instance_id_str in self.decisions:
+                return False
+            item = ism.instance_id_to_instance.get(instance_id)
+            if item is None:
+                return False
+            annotators = {
+                u for u in ism.instance_annotators.get(instance_id, set())
+                if u not in self.adj_config.adjudicator_users
+            }
+            if len(annotators) < self.adj_config.min_annotations:
+                return False
+            scheme_names = [s.get("name", "") for s in self.config.get("annotation_schemes", [])]
+            item_annotations: Dict[str, Any] = {}
+            item_spans: Dict[str, Any] = {}
+            item_behavioral: Dict[str, Any] = {}
+            for user_id in annotators:
+                ustate = usm.get_user_state(user_id)
+                if not ustate:
+                    continue
+                la = ustate.instance_id_to_label_to_value.get(instance_id_str, {})
+                if la:
+                    item_annotations[user_id] = self._serialize_labels(la)
+                sa = ustate.instance_id_to_span_to_value.get(instance_id_str, {})
+                if sa:
+                    item_spans[user_id] = self._serialize_spans(sa)
+                bd = ustate.instance_id_to_behavioral_data.get(instance_id_str, {})
+                if bd:
+                    item_behavioral[user_id] = self._serialize_behavioral(bd)
+            if not item_annotations and not item_spans:
+                return False
+            agreement_scores = self._compute_agreement(item_annotations, scheme_names)
+            overall = self._compute_overall_agreement(agreement_scores)
+            if not self.adj_config.show_all_items:
+                if overall >= self.adj_config.agreement_threshold:
+                    return False
+            existing = self.queue.get(instance_id_str)
+            self.queue[instance_id_str] = AdjudicationItem(
+                instance_id=instance_id_str,
+                annotations=item_annotations,
+                span_annotations=item_spans,
+                behavioral_data=item_behavioral,
+                agreement_scores=agreement_scores,
+                overall_agreement=overall,
+                num_annotators=len(annotators),
+                status=existing.status if existing else "pending",
+                assigned_adjudicator=existing.assigned_adjudicator if existing else None,
+            )
+            self.logger.info(
+                "Auto-routed item %s into adjudication queue (overall agreement=%.3f, "
+                "threshold=%.3f, annotators=%d)",
+                instance_id_str, overall, self.adj_config.agreement_threshold, len(annotators),
+            )
+            return True
+    def _serialize_labels(self, label_data: Dict) -> Dict[str, Any]:
+        """Convert label annotation data to serializable dict."""
+        result = {}
+        for key, value in label_data.items():
+            # Key might be a Label object or a string
+            if hasattr(key, 'get_schema'):
+                schema = key.get_schema()
+                name = key.get_name()
+                if schema not in result:
+                    result[schema] = {}
+                result[schema][name] = value
+            elif isinstance(key, str):
+                result[key] = value
+            else:
+                result[str(key)] = value
+        return result
+    def _serialize_spans(self, span_data: Dict) -> List[Dict]:
+        """Convert span annotation data to serializable list."""
+        spans = []
+        for key, value in span_data.items():
+            if hasattr(key, 'get_schema'):
+                spans.append({
+                    "schema": key.get_schema(),
+                    "name": key.get_name(),
+                    "title": key.get_title() if hasattr(key, 'get_title') else "",
+                    "start": key.get_start(),
+                    "end": key.get_end(),
+                    "id": key.get_id(),
+                    "target_field": key.get_target_field() if hasattr(key, 'get_target_field') else None,
+                })
+            elif isinstance(value, dict):
+                spans.append(value)
+        return spans
+    def _serialize_behavioral(self, bd) -> Dict:
+        """Convert behavioral data to serializable dict."""
+        if hasattr(bd, 'to_dict'):
+            return bd.to_dict()
+        elif isinstance(bd, dict):
+            return bd
+        return {}
+    def _compute_agreement(
+        self, item_annotations: Dict[str, Dict], scheme_names: List[str]
+    ) -> Dict[str, float]:
+        """
+        Compute per-schema agreement for an item.
+        Uses pairwise percentage agreement: the fraction of annotator pairs
+        whose normalized values for the schema are identical. Schemas with
+        fewer than two values are omitted from the result. For more
+        sophisticated metrics, simpledorff can be used but requires
+        multiple items.
+        Returns:
+            Dict mapping schema_name to agreement score (0.0 - 1.0)
+        """
+        agreement_scores = {}
+        for schema in scheme_names:
+            values = []
+            for user_id, user_annots in item_annotations.items():
+                if schema in user_annots:
+                    val = user_annots[schema]
+                    # Normalize to comparable form
+                    if isinstance(val, dict):
+                        # Radio stores {label: label} (value is the label string)
+                        # and multiselect stores {label: value/true}. A label is
+                        # "selected" when its value is present/truthy. The old
+                        # filter (v is True / == "true" / == 1) dropped radio's
+                        # string value, collapsing every annotator to an empty
+                        # frozenset -> a spurious 1.0 agreement even on total
+                        # disagreement.
+                        falsey = (False, None, "", "false", "False", 0, "0")
+                        selected = frozenset(
+                            k for k, v in val.items() if v not in falsey
+                        )
+                        values.append(selected)
+                    else:
+                        values.append(val)
+            if len(values) < 2:
+                continue
+            # Compute pairwise agreement (percentage)
+            agree_count = 0
+            total_pairs = 0
+            for i in range(len(values)):
+                for j in range(i + 1, len(values)):
+                    total_pairs += 1
+                    if values[i] == values[j]:
+                        agree_count += 1
+            agreement_scores[schema] = (
+                agree_count / total_pairs if total_pairs > 0 else 1.0
+            )
+        return agreement_scores
+    def _compute_overall_agreement(self, agreement_scores: Dict[str, float]) -> float:
+        """Compute overall agreement as the mean of per-schema scores."""
+        if not agreement_scores:
+            return 1.0
+        return sum(agreement_scores.values()) / len(agreement_scores)
+    def get_queue(
+        self,
+        adjudicator_id: Optional[str] = None,
+        filter_status: Optional[str] = None,
+    ) -> List[AdjudicationItem]:
+        """
+        Get the adjudication queue, optionally filtered by status.
+        Args:
+            adjudicator_id: Optional adjudicator ID; accepted but not yet
+                used to filter by assignment
+            filter_status: Optional status filter ("pending", "completed", etc.)
+        Returns:
+            List of AdjudicationItem objects
+        """
+        with self._lock:
+            if not self._queue_built:
+                self.build_queue()
+            items = list(self.queue.values())
+            if filter_status:
+                items = [i for i in items if i.status == filter_status]
+            # Sort: pending, then in_progress, then everything else;
+            # lowest agreement first within each group
+            items.sort(key=lambda x: (
+                0 if x.status == "pending" else 1 if x.status == "in_progress" else 2,
+                x.overall_agreement,
+            ))
+            return items
+    def get_item(self, instance_id: str) -> Optional[AdjudicationItem]:
+        """
+        Get full item data for adjudication.
+        Args:
+            instance_id: The instance ID to retrieve
+        Returns:
+            AdjudicationItem or None if not in queue
+        """
+        with self._lock:
+            if not self._queue_built:
+                self.build_queue()
+            return self.queue.get(str(instance_id))
+    def get_item_text(self, instance_id: str) -> str:
+        """Get the text content for an item."""
+        from potato.item_state_management import get_item_state_manager
+        ism = get_item_state_manager()
+        item = ism.instance_id_to_instance.get(instance_id)
+        if item:
+            # Use text_key from config if available
+            text_key = self.config.get("item_properties", {}).get("text_key", "text")
+            data = item.get_data()
+            if isinstance(data, dict) and text_key in data:
+                return data[text_key]
+            return item.get_text()
+        return ""
+    def get_item_data(self, instance_id: str) -> Dict[str, Any]:
+        """Get the full raw data for an item."""
+        from potato.item_state_management import get_item_state_manager
+        ism = get_item_state_manager()
+        item = ism.instance_id_to_instance.get(instance_id)
+        if item:
+            data = item.get_data()
+            if isinstance(data, dict):
+                return data
+            return {"text": str(data)}
+        return {}
+    def get_next_item(self, adjudicator_id: str) -> Optional[AdjudicationItem]:
+        """Get the next pending item for an adjudicator."""
+        items = self.get_queue(filter_status="pending")
+        if items:
+            return items[0]
+        return None
+    def skip_item(self, instance_id: str, adjudicator_id: str) -> bool:
+        """Mark an item as skipped."""
+        with self._lock:
+            item = self.queue.get(str(instance_id))
+            if item:
+                item.status = "skipped"
+                return True
+            return False
+    def submit_decision(self, decision: AdjudicationDecision) -> bool:
+        """
+        Submit an adjudication decision.
+        Args:
+            decision: The AdjudicationDecision to save
+        Returns:
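+            True once the decision is recorded in memory; disk write
+            failures are logged by _save_decisions() rather than raised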
+        """
+        with self._lock:
+            instance_id = str(decision.instance_id)
+            self.decisions[instance_id] = decision
+            # Update queue status
+            if instance_id in self.queue:
+                self.queue[instance_id].status = "completed"
+                self.queue[instance_id].assigned_adjudicator = decision.adjudicator_id
+            # Persist to disk
+            self._save_decisions()
+            self.logger.info(
+                f"Adjudication decision saved for {instance_id} "
+                f"by {decision.adjudicator_id}"
+            )
+            return True
+    def get_stats(self) -> Dict[str, Any]:
+        """Get adjudication progress statistics."""
+        with self._lock:
+            if not self._queue_built:
+                self.build_queue()
+            total = len(self.queue)
+            completed = sum(
+                1 for i in self.queue.values() if i.status == "completed"
+            )
+            pending = sum(
+                1 for i in self.queue.values() if i.status == "pending"
+            )
+            skipped = sum(
+                1 for i in self.queue.values() if i.status == "skipped"
+            )
+            in_progress = sum(
+                1 for i in self.queue.values() if i.status == "in_progress"
+            )
+            avg_agreement = 0.0
+            if self.queue:
+                avg_agreement = sum(
+                    i.overall_agreement for i in self.queue.values()
+                ) / len(self.queue)
+            # Per-adjudicator stats
+            adjudicator_stats = defaultdict(lambda: {"completed": 0, "total_time_ms": 0})
+            for decision in self.decisions.values():
+                adj_id = decision.adjudicator_id
+                adjudicator_stats[adj_id]["completed"] += 1
+                adjudicator_stats[adj_id]["total_time_ms"] += decision.time_spent_ms
+            return {
+                "total": total,
+                "completed": completed,
+                "pending": pending,
+                "skipped": skipped,
+                "in_progress": in_progress,
+                "completion_rate": completed / total if total > 0 else 0.0,
+                "avg_agreement": avg_agreement,
+                "adjudicator_stats": dict(adjudicator_stats),
+            }
+    def get_decision(self, instance_id: str) -> Optional[AdjudicationDecision]:
+        """Get the decision for an item, if one exists."""
+        return self.decisions.get(str(instance_id))
+    # ------------------------------------------------------------------
+    # Phase 3: Similarity integration
+    # ------------------------------------------------------------------
+    def _precompute_similarities(self) -> None:
+        """Precompute embeddings for all items in the item state manager."""
+        if not self.similarity_engine or not self.similarity_engine.enabled:
+            return
+        from potato.item_state_management import get_item_state_manager
+        try:
+            ism = get_item_state_manager()
+            item_texts = {}
+            for instance_id, item in ism.instance_id_to_instance.items():
+                # Look up with the original key; str() is applied only to the cache key below
+                text = self.get_item_text(instance_id)
+                if text:
+                    item_texts[str(instance_id)] = text
+            if item_texts:
+                count = self.similarity_engine.precompute_embeddings(item_texts)
+                self.logger.info(f"Precomputed {count} similarity embeddings")
+        except Exception as e:
+            self.logger.error(f"Error precomputing similarities: {e}")
+    def get_similar_items(
+        self, instance_id: str, include_metadata: bool = True
+    ) -> List[Dict[str, Any]]:
+        """
+        Get similar items for a given instance, enriched with queue metadata.
+        Args:
+            instance_id: The reference instance ID
+            include_metadata: Whether to include decision/consensus data
+        Returns:
+            List of dicts with instance_id, similarity, and optional metadata
+        """
+        if not self.similarity_engine or not self.similarity_engine.enabled:
+            return []
+        similar = self.similarity_engine.find_similar(instance_id)
+        results = []
+        for other_id, score in similar:
+            entry = {
+                "instance_id": other_id,
+                "similarity": round(score, 4),
+                "text_preview": self.similarity_engine.text_cache.get(
+                    other_id, ""
+                ),
+            }
+            if include_metadata:
+                queue_item = self.queue.get(other_id)
+                decision = self.decisions.get(other_id)
+                entry["in_queue"] = queue_item is not None
+                entry["status"] = queue_item.status if queue_item else None
+                entry["overall_agreement"] = (
+                    queue_item.overall_agreement if queue_item else None
+                )
+                if decision:
+                    entry["decision"] = "completed"
+                    entry["consensus_label"] = None
+                else:
+                    entry["decision"] = None
+                    if queue_item:
+                        entry["consensus_label"] = self._get_consensus_label(
+                            queue_item
+                        )
+                    else:
+                        entry["consensus_label"] = None
+            results.append(entry)
+        return results
+    def _get_consensus_label(self, item: AdjudicationItem) -> Optional[str]:
+        """
+        Get the majority/consensus label for an item across the first schema.
+        Args:
+            item: The AdjudicationItem
+        Returns:
+            The most common label value as a string, or None
+        """
+        if not item.annotations:
+            return None
+        # Use the first schema that has values
+        for user_annots in item.annotations.values():
+            for schema_name in user_annots:
+                # Collect all values for this schema
+                values = []
+                for ua in item.annotations.values():
+                    val = ua.get(schema_name)
+                    if val is not None:
+                        if isinstance(val, dict):
+                            # Dict-valued (e.g. multiselect): join the selected label
+                            # names into a sorted, comma-separated string
+                            selected = sorted(
+                                k for k, v in val.items()
+                                if v is True or v == "true" or v == 1
+                            )
+                            values.append(", ".join(selected) if selected else str(val))
+                        else:
+                            values.append(str(val))
+                if values:
+                    counter = Counter(values)
+                    return counter.most_common(1)[0][0]
+        return None
+    # ------------------------------------------------------------------
+    # Phase 3: Behavioral signal analysis
+    # ------------------------------------------------------------------
+    def get_annotator_signals(
+        self, user_id: str, instance_id: str
+    ) -> Dict[str, Any]:
+        """
+        Compute per-annotator quality signals for a specific item.
+        Returns:
+            Dict with user_id, instance_id, flags list, and metrics dict
+        """
+        flags = []
+        metrics = {}
+        instance_id = str(instance_id)
+        item = self.queue.get(instance_id)
+        if not item:
+            return {"user_id": user_id, "instance_id": instance_id,
+                    "flags": [], "metrics": {}}
+        # Get behavioral data for this user on this item
+        bd = item.behavioral_data.get(user_id, {})
+        if hasattr(bd, 'to_dict'):
+            bd = bd.to_dict()
+        total_time = bd.get("total_time_ms", 0)
+        metrics["total_time_ms"] = total_time
+        # 1. Speed z-score vs user's typical time
+        user_times = self._get_user_times(user_id)
+        if len(user_times) >= 3 and total_time > 0:
+            mean_time = sum(user_times) / len(user_times)
+            std_time = math.sqrt(
+                sum((t - mean_time) ** 2 for t in user_times) / len(user_times)
+            )
+            if std_time > 0:
+                z_score = (total_time - mean_time) / std_time
+                metrics["speed_z_score"] = round(z_score, 2)
+                if z_score < -2.0:
+                    flags.append({
+                        "type": "unusually_fast",
+                        "severity": "high",
+                        "message": f"Annotation time ({total_time}ms) is {abs(z_score):.1f} std devs below average"
+                    })
+        # 2. Fast decision warning
+        fast_threshold = self.adj_config.fast_decision_warning_ms
+        if fast_threshold > 0 and 0 < total_time < fast_threshold:
+            flags.append({
+                "type": "fast_decision",
+                "severity": "medium",
+                "message": f"Decision made in {total_time}ms (below {fast_threshold}ms threshold)"
+            })
+        # 3. Annotation change count
+        raw_changes = bd.get("annotation_changes", [])
+        change_count = len(raw_changes) if isinstance(raw_changes, list) else int(raw_changes or 0)
+        metrics["annotation_changes"] = change_count
+        if change_count > 5:
+            flags.append({
+                "type": "excessive_changes",
+                "severity": "medium",
+                "message": f"Made {change_count} annotation changes on this item"
+            })
+        # 4. Historical agreement rate with consensus
+        agreement_rate = self._compute_user_agreement_rate(user_id)
+        if agreement_rate is not None:
+            metrics["agreement_rate"] = round(agreement_rate, 3)
+            if agreement_rate < 0.4:
+                flags.append({
+                    "type": "low_agreement",
+                    "severity": "high",
+                    "message": f"Agreement rate with consensus: {agreement_rate:.0%}"
+                })
+        # 5. Similar item consistency
+        if self.similarity_engine and self.similarity_engine.enabled:
+            inconsistencies = self._check_similar_item_consistency(
+                user_id, instance_id
+            )
+            metrics["similar_item_inconsistencies"] = inconsistencies
+            if inconsistencies > 0:
+                flags.append({
+                    "type": "similar_item_inconsistency",
+                    "severity": "medium",
+                    "message": f"Different label on {inconsistencies} similar item(s)"
+                })
+        return {
+            "user_id": user_id,
+            "instance_id": instance_id,
+            "flags": flags,
+            "metrics": metrics,
+        }
+    def _get_user_times(self, user_id: str) -> List[float]:
+        """Collect all annotation times for a user across queue items."""
+        times = []
+        for item in self.queue.values():
+            bd = item.behavioral_data.get(user_id, {})
+            if hasattr(bd, 'to_dict'):
+                bd = bd.to_dict()
+            t = bd.get("total_time_ms", 0)
+            if t > 0:
+                times.append(t)
+        return times
+    def _compute_user_agreement_rate(self, user_id: str) -> Optional[float]:
+        """
+        Compute how often a user agrees with the consensus across all items.
+        Returns:
+            Float 0-1 or None if insufficient data (needs >= 3 items)
+        """
+        agree_count = 0
+        total_count = 0
+        for item in self.queue.values():
+            if user_id not in item.annotations:
+                continue
+            consensus = self._get_consensus_label(item)
+            if consensus is None:
+                continue
+            user_annots = item.annotations[user_id]
+            # Check the first schema
+            for schema_name, val in user_annots.items():
+                if isinstance(val, dict):
+                    selected = sorted(
+                        k for k, v in val.items()
+                        if v is True or v == "true" or v == 1
+                    )
+                    user_label = ", ".join(selected) if selected else str(val)
+                else:
+                    user_label = str(val)
+                if user_label == consensus:
+                    agree_count += 1
+                total_count += 1
+                break  # Only check first schema
+        if total_count < 3:
+            return None
+        return agree_count / total_count
+    def _check_similar_item_consistency(
+        self, user_id: str, instance_id: str
+    ) -> int:
+        """
+        Check if user's label on similar items (>= 0.8 similarity) is consistent.
+        Only the first schema in the user's annotations is compared.
+        Returns:
+            Count of similar items where user's label differs
+        """
+        if not self.similarity_engine:
+            return 0
+        similar = self.similarity_engine.find_similar(instance_id)
+        if not similar:
+            return 0
+        # Get user's label on the current item
+        item = self.queue.get(instance_id)
+        if not item or user_id not in item.annotations:
+            return 0
+        user_annots = item.annotations[user_id]
+        current_label = None
+        current_schema = None
+        for schema_name, val in user_annots.items():
+            current_schema = schema_name
+            if isinstance(val, dict):
+                selected = sorted(
+                    k for k, v in val.items()
+                    if v is True or v == "true" or v == 1
+                )
+                current_label = ", ".join(selected) if selected else str(val)
+            else:
+                current_label = str(val)
+            break
+        if current_label is None:
+            return 0
+        inconsistencies = 0
+        for other_id, score in similar:
+            if score < 0.8:
+                break  # Results are sorted by score desc
+            other_item = self.queue.get(other_id)
+            if not other_item or user_id not in other_item.annotations:
+                continue
+            other_annots = other_item.annotations[user_id]
+            other_val = other_annots.get(current_schema)
+            if other_val is None:
+                continue
+            if isinstance(other_val, dict):
+                selected = sorted(
+                    k for k, v in other_val.items()
+                    if v is True or v == "true" or v == 1
+                )
+                other_label = ", ".join(selected) if selected else str(other_val)
+            else:
+                other_label = str(other_val)
+            if other_label != current_label:
+                inconsistencies += 1
+        return inconsistencies
+    def _get_output_dir(self) -> str:
+        """Get the adjudication output directory."""
+        output_dir = self.config.get("output_annotation_dir", "annotation_output")
+        adj_dir = os.path.join(output_dir, self.adj_config.output_subdir)
+        os.makedirs(adj_dir, exist_ok=True)
+        return adj_dir
+    def _save_decisions(self) -> None:
+        """Persist all decisions to disk."""
+        try:
+            adj_dir = self._get_output_dir()
+            decisions_file = os.path.join(adj_dir, "decisions.json")
+            data = {
+                "decisions": [d.to_dict() for d in self.decisions.values()],
+                "last_updated": datetime.now().isoformat(),
+            }
+            with open(decisions_file, "w", encoding="utf-8") as f:
+                json.dump(data, f, indent=2)
+        except Exception as e:
+            self.logger.error(f"Failed to save adjudication decisions: {e}")
+    def _load_decisions(self) -> None:
+        """Load previously saved decisions from disk."""
+        try:
+            output_dir = self.config.get("output_annotation_dir", "annotation_output")
+            adj_dir = os.path.join(output_dir, self.adj_config.output_subdir)
+            decisions_file = os.path.join(adj_dir, "decisions.json")
+            if not os.path.exists(decisions_file):
+                return
+            with open(decisions_file, "r", encoding="utf-8") as f:
+                data = json.load(f)
+            for d in data.get("decisions", []):
+                decision = AdjudicationDecision.from_dict(d)
+                # Keys are looked up as str(instance_id) everywhere else
+                self.decisions[str(decision.instance_id)] = decision
+            self.logger.info(
+                f"Loaded {len(self.decisions)} previous adjudication decisions"
+            )
+        except Exception as e:
+            self.logger.warning(f"Failed to load adjudication decisions: {e}")
+    def generate_final_dataset(self) -> List[Dict[str, Any]]:
+        """
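+        Generate the final dataset by merging adjudication decisions with
+        unanimous agreements. Remaining items without a decision are emitted
+        with source "unresolved"; items with fewer than two non-adjudicator
+        annotators, or with no label annotations, are skipped.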
+        Returns:
+            List of item dicts with final labels and provenance
+        """
+        from potato.user_state_management import get_user_state_manager
+        from potato.item_state_management import get_item_state_manager
+        usm = get_user_state_manager()
+        ism = get_item_state_manager()
+        annotation_schemes = self.config.get("annotation_schemes", [])
+        scheme_names = [s.get("name", "") for s in annotation_schemes]
+        results = []
+        for instance_id, item in ism.instance_id_to_instance.items():
+            instance_id_str = str(instance_id)
+            result = {
+                "instance_id": instance_id_str,
+                "item_data": item.get_data() if hasattr(item, 'get_data') else {},
+            }
+            # Check if we have an adjudication decision
+            decision = self.decisions.get(instance_id_str)
+            if decision:
+                result["labels"] = decision.label_decisions
+                result["spans"] = decision.span_decisions
+                result["source"] = "adjudicated"
+                result["adjudicator"] = decision.adjudicator_id
+                result["confidence"] = decision.confidence
+                result["provenance"] = decision.source
+                results.append(result)
+                continue
+            # Check for unanimous agreement
+            annotators = ism.instance_annotators.get(instance_id, set())
+            annotators = {
+                u for u in annotators
+                if u not in self.adj_config.adjudicator_users
+            }
+            if len(annotators) < 2:
+                continue
+            # Collect annotations
+            annotations = {}
+            for user_id in annotators:
+                user_state = usm.get_user_state(user_id)
+                if not user_state:
+                    continue
+                labels = user_state.instance_id_to_label_to_value.get(
+                    instance_id_str, {}
+                )
+                if labels:
+                    annotations[user_id] = self._serialize_labels(labels)
+            if not annotations:
+                continue
+            # Check for unanimity per schema
+            unanimous_labels = {}
+            is_unanimous = True
+            for schema in scheme_names:
+                values = []
+                for user_annots in annotations.values():
+                    if schema in user_annots:
+                        values.append(json.dumps(user_annots[schema], sort_keys=True))
+                if len(values) < 2:
+                    continue
+                if len(set(values)) == 1:
+                    unanimous_labels[schema] = json.loads(values[0])
+                else:
+                    is_unanimous = False
+            if is_unanimous and unanimous_labels:
+                result["labels"] = unanimous_labels
+                result["source"] = "unanimous"
+            else:
+                result["labels"] = {}
+                result["source"] = "unresolved"
+            # Both outcomes record the annotator count and are kept.
+            result["num_annotators"] = len(annotators)
+            results.append(result)
+        return results
+def init_adjudication_manager(config: Dict[str, Any]) -> Optional[AdjudicationManager]:
+    """Initialize the singleton AdjudicationManager."""
+    global _ADJUDICATION_MANAGER
+    with _ADJUDICATION_LOCK:
+        if _ADJUDICATION_MANAGER is None:
+            _ADJUDICATION_MANAGER = AdjudicationManager(config)
+    return _ADJUDICATION_MANAGER
+def get_adjudication_manager() -> Optional[AdjudicationManager]:
+    """Get the singleton AdjudicationManager instance."""
+    return _ADJUDICATION_MANAGER
+def clear_adjudication_manager():
+    """Clear the singleton (for testing)."""
+    global _ADJUDICATION_MANAGER
+    with _ADJUDICATION_LOCK:
+        _ADJUDICATION_MANAGER = None
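Reviewer note on the unanimity check in `generate_final_dataset`: the method compares annotator values after serializing them with `json.dumps(..., sort_keys=True)`, so dicts that differ only in key order count as equal. The following is a minimal standalone sketch of that idea, not the project's code; the function name `unanimous_value` and the sample annotator ids are hypothetical. Note that this sketch returns `None` both for disagreement and for too few votes.

```python
import json
from typing import Any, Dict, Optional


def unanimous_value(
    annotations: Dict[str, Dict[str, Any]], schema: str
) -> Optional[Any]:
    """Return the shared value for `schema` if at least two annotators
    labeled it and all of them agree; otherwise return None."""
    # Canonical JSON (sorted keys) makes key order irrelevant for comparison.
    encoded = [
        json.dumps(labels[schema], sort_keys=True)
        for labels in annotations.values()
        if schema in labels
    ]
    if len(encoded) < 2 or len(set(encoded)) != 1:
        return None
    return json.loads(encoded[0])


# Hypothetical annotations keyed by annotator id.
annotations = {
    "alice": {"sentiment": {"positive": True, "strong": False}, "topic": "sports"},
    "bob": {"sentiment": {"strong": False, "positive": True}, "topic": "politics"},
}
print(unanimous_value(annotations, "sentiment"))  # agreement despite key order
print(unanimous_value(annotations, "topic"))      # disagreement
```

Comparing serialized strings keeps the check hashable (values can go into a `set`) even when the raw label values are dicts.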

potato/adjudication_export.py ADDED

	@@ -0,0 +1,162 @@

+"""
+Adjudication Export CLI
+Generate final datasets by merging unanimous agreements and adjudication decisions.
+Usage:
+    python -m potato.adjudication_export --config config.yaml --output final_dataset.jsonl
+    python -m potato.adjudication_export --config config.yaml --output final.csv --format csv
+    python -m potato.adjudication_export --config config.yaml --output final.json --format json
+"""
+import argparse
+import csv
+import json
+import os
+import sys
+import logging
+logger = logging.getLogger(__name__)
+def main():
+    parser = argparse.ArgumentParser(
+        description="Export adjudicated dataset from Potato annotation project"
+    )
+    parser.add_argument(
+        "--config", required=True,
+        help="Path to the Potato config YAML file"
+    )
+    parser.add_argument(
+        "--output", required=True,
+        help="Output file path"
+    )
+    parser.add_argument(
+        "--format", choices=["jsonl", "json", "csv"], default="jsonl",
+        help="Output format (default: jsonl)"
+    )
+    parser.add_argument(
+        "--include-unresolved", action="store_true",
+        help="Include items without adjudication or consensus"
+    )
+    parser.add_argument(
+        "--verbose", "-v", action="store_true",
+        help="Verbose output"
+    )
+    args = parser.parse_args()
+    if args.verbose:
+        logging.basicConfig(level=logging.DEBUG)
+    else:
+        logging.basicConfig(level=logging.INFO)
+    # Load config
+    from potato.server_utils.config_module import init_config, config
+    try:
+        init_config(args.config)
+    except Exception as e:
+        print(f"Error loading config: {e}", file=sys.stderr)
+        sys.exit(1)
+    # Initialize state managers
+    from potato.item_state_management import init_item_state_manager
+    from potato.user_state_management import init_user_state_manager
+    init_user_state_manager(config)
+    init_item_state_manager(config)
+    # Load data (this loads items and user annotations from disk)
+    # We need a minimal load - just items and user states
+    from potato.flask_server import load_instance_data, load_user_data
+    load_instance_data(config)
+    load_user_data(config)
+    # Initialize adjudication manager
+    from potato.adjudication import init_adjudication_manager
+    adj_mgr = init_adjudication_manager(config)
+    if not adj_mgr or not adj_mgr.adj_config.enabled:
+        print("Adjudication is not enabled in this config.", file=sys.stderr)
+        sys.exit(1)
+    # Build queue to compute agreements
+    adj_mgr.build_queue()
+    # Generate final dataset
+    results = adj_mgr.generate_final_dataset()
+    # Filter unresolved if not requested
+    if not args.include_unresolved:
+        results = [r for r in results if r.get("source") != "unresolved"]
+    # Write output
+    output_path = args.output
+    fmt = args.format
+    if fmt == "jsonl":
+        with open(output_path, "w") as f:
+            for item in results:
+                f.write(json.dumps(item) + "\n")
+    elif fmt == "json":
+        with open(output_path, "w") as f:
+            json.dump(results, f, indent=2)
+    elif fmt == "csv":
+        if not results:
+            print("No results to export.", file=sys.stderr)
+            sys.exit(0)
+        # Flatten for CSV
+        fieldnames = set()
+        flat_results = []
+        for item in results:
+            flat = {
+                "instance_id": item["instance_id"],
+                "source": item.get("source", ""),
+            }
+            # Flatten labels
+            labels = item.get("labels", {})
+            for schema, value in labels.items():
+                if isinstance(value, (dict, list)):
+                    flat[schema] = json.dumps(value)
+                else:
+                    flat[schema] = value
+            # Add provenance fields
+            if "adjudicator" in item:
+                flat["adjudicator"] = item["adjudicator"]
+            if "confidence" in item:
+                flat["confidence"] = item["confidence"]
+            if "num_annotators" in item:
+                flat["num_annotators"] = item["num_annotators"]
+            fieldnames.update(flat.keys())
+            flat_results.append(flat)
+        # Sort fieldnames for consistent output
+        fieldnames = sorted(fieldnames)
+        with open(output_path, "w", newline="", encoding="utf-8") as f:
+            writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
+            writer.writeheader()
+            writer.writerows(flat_results)
+    # Summary
+    total = len(results)
+    unanimous = sum(1 for r in results if r.get("source") == "unanimous")
+    adjudicated = sum(1 for r in results if r.get("source") == "adjudicated")
+    unresolved = sum(1 for r in results if r.get("source") == "unresolved")
+    print(f"\nExport complete: {output_path}")
+    print(f"  Total items: {total}")
+    print(f"  Unanimous:   {unanimous}")
+    print(f"  Adjudicated: {adjudicated}")
+    if args.include_unresolved:
+        print(f"  Unresolved:  {unresolved}")
+    print(f"  Format:      {fmt}")
+if __name__ == "__main__":
+    main()
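Reviewer note on the CSV branch of the export CLI: each result is flattened to one row, nested label values are stored as JSON text, and the header is the sorted union of all row keys. The sketch below illustrates that flattening in isolation. It is an assumption-laden example rather than the CLI itself: `flatten_item` and `to_csv` are hypothetical names, the sketch serializes lists as well as dicts, and it writes to an in-memory buffer instead of a file.

```python
import csv
import io
import json
from typing import Any, Dict, List


def flatten_item(item: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one exported item into a single CSV row."""
    flat = {"instance_id": item["instance_id"], "source": item.get("source", "")}
    for schema, value in item.get("labels", {}).items():
        # Nested values become JSON text so each label stays in one cell.
        flat[schema] = json.dumps(value) if isinstance(value, (dict, list)) else value
    return flat


def to_csv(items: List[Dict[str, Any]]) -> str:
    """Render items as CSV text with a sorted union-of-keys header."""
    rows = [flatten_item(item) for item in items]
    fieldnames = sorted({key for row in rows for key in row})
    buf = io.StringIO(newline="")
    # restval fills cells for schemas that a given row does not have.
    writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical export results.
items = [
    {"instance_id": "1", "source": "unanimous", "labels": {"sentiment": "pos"}},
    {"instance_id": "2", "source": "adjudicated", "labels": {"topics": ["a", "b"]}},
]
print(to_csv(items))
```

Because rows can have different label columns, computing the header from all rows first avoids losing schemas that only appear late in the dataset.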

potato/admin.py ADDED

The diff for this file is too large to render.

potato/agent_proxy/__init__.py ADDED

	@@ -0,0 +1,44 @@

+"""
+Agent Proxy Package
+Provides agent proxy implementations for live agent interaction during annotation.
+Proxies communicate with AI agent backends (echo, HTTP, OpenAI) and return
+responses to the annotation interface.
+Usage:
+    from potato.agent_proxy import AgentProxyFactory
+    proxy = AgentProxyFactory.create(config)
+    context = proxy.start_session("Book a flight to Paris")
+    response = proxy.send_message("Hello", context)
+"""
+from .base import AgentMessage, AgentResponse, BaseAgentProxy, AgentProxyFactory
+from .session import (
+    AgentSession,
+    AgentSessionManager,
+    init_agent_session_manager,
+    get_agent_session_manager,
+    clear_agent_session_manager,
+)
+from .sandbox import SafetySandbox, SandboxViolation
+# Import proxy implementations to trigger registration
+from . import echo_proxy
+from . import http_proxy
+from . import openai_proxy
+from . import coding_proxy  # subprocess_coding + docker_coding
+__all__ = [
+    "AgentMessage",
+    "AgentResponse",
+    "BaseAgentProxy",
+    "AgentProxyFactory",
+    "AgentSession",
+    "AgentSessionManager",
+    "init_agent_session_manager",
+    "get_agent_session_manager",
+    "clear_agent_session_manager",
+    "SafetySandbox",
+    "SandboxViolation",
+]

potato/agent_proxy/base.py ADDED

	@@ -0,0 +1,138 @@

+"""
+Agent Proxy Base Module
+Provides the abstract base class and data structures for agent proxies,
+plus a factory registry for creating proxy instances from configuration.
+Agent proxies allow annotators to interact with AI agents live during
+annotation tasks. Each proxy type (echo, http, openai) handles
+communication with a specific kind of agent backend.
+"""
+from abc import ABC, abstractmethod
+from dataclasses import dataclass, field
+from typing import Dict, Any, Optional, List
+import logging
+import time
+logger = logging.getLogger(__name__)
+@dataclass
+class AgentMessage:
+    """A single message in an agent conversation."""
+    role: str  # "user", "agent", "system", "error"
+    content: str
+    timestamp: float = field(default_factory=time.time)
+    metadata: Dict[str, Any] = field(default_factory=dict)
+@dataclass
+class AgentResponse:
+    """Response from an agent proxy after sending a message."""
+    message: AgentMessage
+    done: bool = False
+    error: Optional[str] = None
+class BaseAgentProxy(ABC):
+    """
+    Abstract base class for agent proxies.
+    Subclasses implement communication with specific agent backends
+    (echo for testing, HTTP for generic REST APIs, OpenAI for chat completions).
+    """
+    proxy_type: str = ""
+    def __init__(self, config: dict):
+        self.config = config
+        self._initialize()
+    @abstractmethod
+    def _initialize(self):
+        """Set up connections, validate config. Called by __init__."""
+        pass
+    @abstractmethod
+    def start_session(self, task_description: str) -> dict:
+        """
+        Start a new interaction session.
+        Args:
+            task_description: The task the annotator should accomplish with the agent.
+        Returns:
+            Proxy-specific session context dict (stored in AgentSession.proxy_context).
+        """
+        pass
+    @abstractmethod
+    def send_message(self, message: str, session_context: dict) -> AgentResponse:
+        """
+        Send a message to the agent and get a blocking response.
+        Args:
+            message: The user's message text.
+            session_context: The proxy-specific context from start_session.
+        Returns:
+            AgentResponse with the agent's reply.
+        """
+        pass
+    def end_session(self, session_context: dict):
+        """
+        Clean up session resources. Override if needed.
+        Args:
+            session_context: The proxy-specific context from start_session.
+        """
+        pass
+class AgentProxyFactory:
+    """Factory registry for creating agent proxy instances."""
+    _proxies: Dict[str, type] = {}
+    @classmethod
+    def register(cls, proxy_type: str, proxy_class: type):
+        """Register a proxy type."""
+        cls._proxies[proxy_type] = proxy_class
+        logger.debug(f"Registered agent proxy type: {proxy_type}")
+    @classmethod
+    def create(cls, config: dict) -> BaseAgentProxy:
+        """
+        Create an agent proxy from configuration.
+        Args:
+            config: The full config dict. Reads from config["agent_proxy"].
+        Returns:
+            Configured BaseAgentProxy instance.
+        Raises:
+            ValueError: If proxy type is unknown or missing.
+        """
+        agent_config = config.get("agent_proxy", {})
+        proxy_type = agent_config.get("type")
+        if not proxy_type:
+            raise ValueError("agent_proxy.type is required")
+        if proxy_type not in cls._proxies:
+            supported = ", ".join(sorted(cls._proxies.keys()))
+            raise ValueError(
+                f"Unknown agent proxy type: '{proxy_type}'. "
+                f"Supported types: {supported}"
+            )
+        proxy_class = cls._proxies[proxy_type]
+        return proxy_class(agent_config)
+    @classmethod
+    def get_supported_types(cls) -> List[str]:
+        """Get list of registered proxy type names."""
+        return sorted(cls._proxies.keys())
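Reviewer note on `AgentProxyFactory`: the class keeps a name-to-class registry, and `create` looks up `config["agent_proxy"]["type"]` and passes the `agent_proxy` sub-dict to the constructor. A minimal self-contained sketch of the same registry pattern follows. The names `Proxy`, `ProxyRegistry`, and `UppercaseProxy` are hypothetical stand-ins, not part of Potato.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class Proxy(ABC):
    """Stand-in for an abstract proxy base class."""

    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    def send(self, message: str) -> str:
        """Return the agent's reply to `message`."""


class ProxyRegistry:
    """Maps a type name from the config to a proxy class."""

    _proxies: Dict[str, Type[Proxy]] = {}

    @classmethod
    def register(cls, name: str, proxy_class: Type[Proxy]) -> None:
        cls._proxies[name] = proxy_class

    @classmethod
    def create(cls, config: dict) -> Proxy:
        section = config.get("agent_proxy", {})
        name = section.get("type")
        if not name:
            raise ValueError("agent_proxy.type is required")
        if name not in cls._proxies:
            supported = ", ".join(sorted(cls._proxies))
            raise ValueError(f"Unknown type {name!r}. Supported: {supported}")
        # Only the agent_proxy section is handed to the proxy.
        return cls._proxies[name](section)


class UppercaseProxy(Proxy):
    """Toy proxy used only to exercise the registry."""

    def send(self, message: str) -> str:
        return message.upper()


ProxyRegistry.register("upper", UppercaseProxy)
proxy = ProxyRegistry.create({"agent_proxy": {"type": "upper"}})
print(proxy.send("hello"))  # HELLO
```

Registration at import time is what lets a package `__init__` enable a proxy type simply by importing its module.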

potato/agent_proxy/coding_proxy.py ADDED

	@@ -0,0 +1,466 @@

+"""
+Coding-agent proxies — LLM plans + sandboxed code execution.
+Two implementations behind a shared base class:
+- :class:`SubprocessCodingAgentProxy` (default, ``type: subprocess_coding``)
+  runs each Python or shell action in a per-session temp workspace via
+  ``subprocess.run`` with a per-step timeout and an output cap. Light
+  isolation — suitable for trusted-input research workflows. The
+  workspace is sandboxed (separate cwd) but **not** a security boundary;
+  malicious code can still touch the host filesystem outside ``cwd``.
+- :class:`DockerCodingAgentProxy` (``type: docker_coding``) runs each
+  action inside an ephemeral Docker container with ``--network=none``,
+  ``--memory``, ``--cpus``, ``--read-only`` and a writable workspace
+  bind-mounted at ``/work``. Real isolation — survives untrusted code.
+  Requires the ``docker`` CLI on ``PATH`` and a running Docker daemon.
+Both inherit per-step / per-session / rate-limit enforcement from the
+existing :mod:`potato.agent_proxy.sandbox` framework via the standard
+``send_message`` flow in ``routes.py:agent_chat_send``.
+Configuration shape (both proxies):
+    agent_proxy:
+      type: subprocess_coding | docker_coding
+      llm:
+        endpoint_type: ollama
+        model: llama3.2:3b
+        base_url: http://localhost:11434
+        temperature: 0.2
+        max_tokens: 800
+      execution:
+        per_step_timeout: 8           # seconds
+        max_output_chars: 4000
+        starter_files: {}             # {filename: contents} written into workspace
+      docker:                          # only for docker_coding
+        image: python:3.11-slim
+        memory: 512m
+        cpus: 1.0
+        network: none                 # "none" or "bridge"
+      sandbox: { max_steps: 20, ... } # standard agent-proxy sandbox knobs
+"""
+from __future__ import annotations
+import json
+import logging
+import os
+import re
+import shutil
+import subprocess
+import tempfile
+from abc import abstractmethod
+from dataclasses import dataclass
+from typing import Any, Dict, List, Optional
+from .base import AgentMessage, AgentProxyFactory, AgentResponse, BaseAgentProxy
+logger = logging.getLogger(__name__)
+_PLANNER_SYSTEM_PROMPT = (
+    "You are an autonomous coding agent. The user will give you a coding task. "
+    "Each turn, decide a SINGLE next action and respond with ONLY a JSON object "
+    "of the form: {\"thought\": str, \"action\": {\"type\": str, \"code\": str}}. "
+    "The 'type' must be one of:\n"
+    "  - \"python\": run the contents of 'code' as a python script in the workspace\n"
+    "  - \"shell\": run 'code' as a bash command in the workspace\n"
+    "  - \"finish\": stop and return your final answer in 'code' (which is then shown "
+    "to the user as your conclusion -- no execution happens)\n"
+    "Keep each action small and focused. Use 'finish' as soon as the task is done."
+)
+@dataclass
+class _ExecResult:
+    stdout: str
+    stderr: str
+    exit_code: Optional[int]
+    timed_out: bool = False
+    error: Optional[str] = None
+class CodingAgentProxy(BaseAgentProxy):
+    """Shared planner/executor scaffold for coding agents.
+    Subclasses implement :meth:`_execute` to run an action in their
+    chosen sandbox. Each call to :meth:`send_message`:
+      1. Appends the user message to the running history.
+      2. Asks an LLM (configurable endpoint) for a JSON ``{thought, action}``.
+      3. Hands ``action`` to the subclass for execution.
+      4. Returns a single reply combining thought + tool output.
+    """
+    def _initialize(self):
+        llm_cfg = self.config.get("llm") or {}
+        self.llm_endpoint_type = llm_cfg.get("endpoint_type", "ollama")
+        self.llm_model = llm_cfg.get("model")
+        self.llm_base_url = llm_cfg.get("base_url")
+        self.llm_temperature = llm_cfg.get("temperature", 0.2)
+        self.llm_max_tokens = llm_cfg.get("max_tokens", 800)
+        # OpenAI-compatible servers (vLLM etc.) ignore the key but the SDK
+        # requires a non-empty string. Ollama needs none. Without forwarding
+        # this the planner silently failed with "planner_unavailable".
+        self.llm_api_key = llm_cfg.get("api_key")
+        # Last endpoint init / call error, surfaced to the user instead of
+        # an opaque "planner unavailable" message.
+        self._llm_error: Optional[str] = None
+        execution_cfg = self.config.get("execution") or {}
+        self.per_step_timeout = execution_cfg.get("per_step_timeout", 8)
+        self.max_output_chars = execution_cfg.get("max_output_chars", 4000)
+        self.starter_files: Dict[str, str] = execution_cfg.get("starter_files", {}) or {}
+        self._llm = None  # lazy
+    # ------------------------------------------------------------------
+    # LLM lazy-init
+    # ------------------------------------------------------------------
+    def _get_llm(self):
+        if self._llm is not None:
+            return self._llm
+        try:
+            from potato.ai.ai_endpoint import AIEndpointFactory
+            ai_cfg: Dict[str, Any] = {
+                "model": self.llm_model,
+                "max_tokens": self.llm_max_tokens,
+                "temperature": self.llm_temperature,
+            }
+            if self.llm_base_url:
+                ai_cfg["base_url"] = self.llm_base_url
+            # Forward the key for OpenAI-compatible endpoints; vLLM ignores
+            # its value but the OpenAI SDK rejects an empty one. Fall back to
+            # env then a non-empty placeholder so local servers just work.
+            ai_cfg["api_key"] = (
+                self.llm_api_key
+                or os.environ.get("OPENAI_API_KEY")
+                or os.environ.get("ANTHROPIC_API_KEY")
+                or "EMPTY"
+            )
+            self._llm = AIEndpointFactory.create_endpoint({
+                "ai_support": {
+                    "enabled": True,
+                    "endpoint_type": self.llm_endpoint_type,
+                    "ai_config": ai_cfg,
+                }
+            })
+            self._llm_error = None
+        except Exception as e:
+            logger.warning("CodingAgentProxy: planner LLM init failed: %s", e)
+            self._llm_error = f"{type(e).__name__}: {e}"
+            self._llm = None
+        return self._llm
+    # ------------------------------------------------------------------
+    # Lifecycle
+    # ------------------------------------------------------------------
+    def start_session(self, task_description: str) -> dict:
+        workspace = tempfile.mkdtemp(prefix="potato_coding_agent_")
+        for filename, contents in self.starter_files.items():
+            target = os.path.join(workspace, filename)
+            os.makedirs(os.path.dirname(target) or workspace, exist_ok=True)
+            with open(target, "w") as f:
+                f.write(contents)
+        history = [
+            {"role": "system", "content": _PLANNER_SYSTEM_PROMPT},
+            {
+                "role": "system",
+                "content": f"Workspace: {workspace}\nTask: {task_description}",
+            },
+        ]
+        return {
+            "workspace": workspace,
+            "history": history,
+            "step": 0,
+            "finished": False,
+        }
+    def end_session(self, session_context: dict):
+        workspace = session_context.get("workspace") if session_context else None
+        if workspace and os.path.isdir(workspace):
+            shutil.rmtree(workspace, ignore_errors=True)
+    # ------------------------------------------------------------------
+    # Per-turn flow
+    # ------------------------------------------------------------------
+    def send_message(self, message: str, session_context: dict) -> AgentResponse:
+        if session_context.get("finished"):
+            return AgentResponse(
+                message=AgentMessage(role="agent", content="(session already finished)"),
+                done=True,
+            )
+        history: List[Dict[str, str]] = session_context.setdefault("history", [])
+        history.append({"role": "user", "content": message})
+        session_context["step"] = session_context.get("step", 0) + 1
+        plan = self._plan_next_action(history)
+        if plan is None:
+            detail = self._llm_error or "no response from planner LLM"
+            reply = (
+                f"Planner LLM unavailable ({self.llm_endpoint_type}): "
+                f"{detail}"
+            )
+            history.append({"role": "assistant", "content": reply})
+            session_context["finished"] = True
+            return AgentResponse(
+                message=AgentMessage(role="error", content=reply), error="planner_unavailable",
+            )
+        thought = plan.get("thought", "")
+        action = plan.get("action") or {}
+        atype = (action.get("type") or "").strip().lower()
+        code = action.get("code") or ""
+        if atype == "finish":
+            reply = self._format_finish_reply(thought, code)
+            history.append({"role": "assistant", "content": reply})
+            session_context["finished"] = True
+            return AgentResponse(
+                message=AgentMessage(role="agent", content=reply), done=True,
+            )
+        if atype not in ("python", "shell"):
+            reply = (
+                f"{thought}\n\n[invalid action type {atype!r}; try 'python', "
+                "'shell', or 'finish']"
+            )
+            history.append({"role": "assistant", "content": reply})
+            return AgentResponse(message=AgentMessage(role="agent", content=reply))
+        result = self._execute(atype, code, session_context)
+        reply = self._format_exec_reply(thought, atype, code, result)
+        history.append({"role": "assistant", "content": reply})
+        return AgentResponse(message=AgentMessage(role="agent", content=reply))
+    # ------------------------------------------------------------------
+    # Planning
+    # ------------------------------------------------------------------
+    def _plan_next_action(self, history: List[Dict[str, str]]) -> Optional[Dict[str, Any]]:
+        endpoint = self._get_llm()
+        if endpoint is None:
+            return None
+        try:
+            if hasattr(endpoint, "chat_query"):
+                raw = endpoint.chat_query(history)
+            else:
+                flat = "\n".join(f'{m["role"]}: {m["content"]}' for m in history)
+                raw = endpoint.query(flat + "\nassistant:", None)
+        except Exception as e:
+            logger.warning("Planner LLM call failed: %s", e)
+            self._llm_error = f"{type(e).__name__}: {e}"
+            return None
+        if isinstance(raw, dict):
+            return raw  # already parsed JSON
+        text = str(raw or "").strip()
+        if not text:
+            return None
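+        try:
+            parsed = json.loads(text)
+        except json.JSONDecodeError:
+            parsed = None
+            match = re.search(r"\{.*\}", text, flags=re.DOTALL)
+            if match:
+                try:
+                    parsed = json.loads(match.group(0))
+                except json.JSONDecodeError:
+                    pass
+        # Only a JSON object is a usable plan; a bare string, list, or number
+        # would break the later plan.get(...) calls, so fall through instead.
+        if isinstance(parsed, dict):
+            return parsed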
+        # Last-ditch: treat the whole text as a finish reply.
+        return {"thought": "", "action": {"type": "finish", "code": text[:500]}}
+    # ------------------------------------------------------------------
+    # Reply formatting
+    # ------------------------------------------------------------------
+    def _format_exec_reply(
+        self, thought: str, atype: str, code: str, result: _ExecResult
+    ) -> str:
+        parts: List[str] = []
+        if thought:
+            parts.append(thought.strip())
+        parts.append(f"```{atype}\n{code.strip()}\n```")
+        out_block: List[str] = []
+        if result.timed_out:
+            out_block.append(f"[timeout after {self.per_step_timeout}s]")
+        if result.error:
+            out_block.append(f"[error: {result.error}]")
+        if result.exit_code is not None:
+            out_block.append(f"[exit={result.exit_code}]")
+        if result.stdout:
+            out_block.append("stdout:\n" + self._truncate(result.stdout))
+        if result.stderr:
+            out_block.append("stderr:\n" + self._truncate(result.stderr))
+        if not out_block:
+            out_block.append("(no output)")
+        parts.append("\n".join(out_block))
+        return "\n\n".join(parts)
+    def _format_finish_reply(self, thought: str, final_text: str) -> str:
+        parts = []
+        if thought:
+            parts.append(thought.strip())
+        if final_text:
+            parts.append(final_text.strip())
+        if not parts:
+            parts.append("(done)")
+        return "\n\n".join(parts)
+    def _truncate(self, text: str) -> str:
+        if len(text) <= self.max_output_chars:
+            return text
+        cut = self.max_output_chars
+        return text[:cut] + f"\n[...truncated {len(text) - cut} chars]"
+    # ------------------------------------------------------------------
+    # Sandbox-specific execution
+    # ------------------------------------------------------------------
+    @abstractmethod
+    def _execute(
+        self, action_type: str, code: str, session_context: dict
+    ) -> _ExecResult:
+        """Execute ``code`` (one of 'python' / 'shell') in the sandbox."""
+class SubprocessCodingAgentProxy(CodingAgentProxy):
+    """Local subprocess-based execution.
+    NOT a security boundary -- the per-step timeout + tempdir cwd is the
+    only protection. Use ``DockerCodingAgentProxy`` for untrusted input.
+    """
+    proxy_type = "subprocess_coding"
+    def _execute(
+        self, action_type: str, code: str, session_context: dict
+    ) -> _ExecResult:
+        workspace = session_context["workspace"]
+        env = self._build_env()
+        try:
+            if action_type == "python":
+                script_path = os.path.join(workspace, "_action.py")
+                with open(script_path, "w") as f:
+                    f.write(code)
+                proc = subprocess.run(
+                    ["python", script_path],
+                    cwd=workspace,
+                    env=env,
+                    capture_output=True,
+                    text=True,
+                    timeout=self.per_step_timeout,
+                )
+            else:  # shell
+                proc = subprocess.run(
+                    ["bash", "-c", code],
+                    cwd=workspace,
+                    env=env,
+                    capture_output=True,
+                    text=True,
+                    timeout=self.per_step_timeout,
+                )
+        except subprocess.TimeoutExpired as e:
+            return _ExecResult(
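+                # TimeoutExpired can carry bytes even with text=True, so decode
+                # any partial output rather than discarding it.
+                stdout=e.stdout.decode(errors="replace") if isinstance(e.stdout, bytes) else (e.stdout or ""),
+                stderr=e.stderr.decode(errors="replace") if isinstance(e.stderr, bytes) else (e.stderr or ""),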
+                exit_code=None,
+                timed_out=True,
+            )
+        except FileNotFoundError as e:
+            return _ExecResult(stdout="", stderr="", exit_code=None, error=str(e))
+        return _ExecResult(
+            stdout=proc.stdout or "",
+            stderr=proc.stderr or "",
+            exit_code=proc.returncode,
+        )
+    def _build_env(self) -> Dict[str, str]:
+        # Strip env down to a minimal set so subprocess code can't
+        # accidentally exfiltrate the host's secrets via env vars.
+        keep = {"PATH", "HOME", "LANG", "LC_ALL"}
+        return {k: v for k, v in os.environ.items() if k in keep}
+class DockerCodingAgentProxy(CodingAgentProxy):
+    """Ephemeral-container execution. Real isolation; requires Docker."""
+    proxy_type = "docker_coding"
+    def _initialize(self):
+        super()._initialize()
+        docker_cfg = self.config.get("docker") or {}
+        self.docker_image = docker_cfg.get("image", "python:3.11-slim")
+        self.docker_memory = docker_cfg.get("memory", "512m")
+        self.docker_cpus = str(docker_cfg.get("cpus", 1.0))
+        self.docker_network = docker_cfg.get("network", "none")
+        self._docker = None  # lazy
+        # Sanity check: warn if docker CLI isn't on PATH
+        if shutil.which("docker") is None:
+            logger.warning(
+                "DockerCodingAgentProxy: 'docker' CLI not found on PATH. "
+                "Container execution will fail at runtime."
+            )
+    def _execute(
+        self, action_type: str, code: str, session_context: dict
+    ) -> _ExecResult:
+        workspace = session_context["workspace"]
+        # Materialise the code as a file in the workspace so the container
+        # can run it without inline injection through `-c`.
+        if action_type == "python":
+            target = os.path.join(workspace, "_action.py")
+            with open(target, "w") as f:
+                f.write(code)
+            container_cmd = ["python", "/work/_action.py"]
+        else:  # shell
+            target = os.path.join(workspace, "_action.sh")
+            with open(target, "w") as f:
+                f.write(code)
+            os.chmod(target, 0o755)
+            container_cmd = ["bash", "/work/_action.sh"]
+        cmd = [
+            "docker", "run", "--rm",
+            f"--network={self.docker_network}",
+            f"--memory={self.docker_memory}",
+            f"--cpus={self.docker_cpus}",
+            "--read-only",
+            "--tmpfs", "/tmp:exec,size=64m",
+            "-v", f"{workspace}:/work",
+            "-w", "/work",
+            self.docker_image,
+        ] + container_cmd
+        try:
+            proc = subprocess.run(
+                cmd,
+                capture_output=True,
+                text=True,
+                timeout=self.per_step_timeout + 5,  # docker pull/start overhead
+            )
+        except subprocess.TimeoutExpired as e:
+            return _ExecResult(
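+                # TimeoutExpired can carry bytes even with text=True, so decode
+                # any partial output rather than discarding it.
+                stdout=e.stdout.decode(errors="replace") if isinstance(e.stdout, bytes) else (e.stdout or ""),
+                stderr=e.stderr.decode(errors="replace") if isinstance(e.stderr, bytes) else (e.stderr or ""),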
+                exit_code=None,
+                timed_out=True,
+            )
+        except FileNotFoundError as e:
+            return _ExecResult(stdout="", stderr="", exit_code=None, error=str(e))
+        return _ExecResult(
+            stdout=proc.stdout or "",
+            stderr=proc.stderr or "",
+            exit_code=proc.returncode,
+        )
+# Register both with the factory so configs can refer to them by name.
+AgentProxyFactory.register("subprocess_coding", SubprocessCodingAgentProxy)
+AgentProxyFactory.register("docker_coding", DockerCodingAgentProxy)
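
The `docker run` invocation assembled above can be sketched as a standalone helper, which makes the flag list easy to inspect without starting a container. This is a minimal sketch, not the proxy's actual API: `build_docker_cmd`, its default values for `network`, `memory`, and `cpus`, and the image name are assumptions chosen for illustration.

```python
from typing import List


def build_docker_cmd(
    image: str,
    workspace: str,
    container_cmd: List[str],
    network: str = "none",   # assumed default, not taken from the diff
    memory: str = "256m",    # assumed default, not taken from the diff
    cpus: str = "1.0",       # assumed default, not taken from the diff
) -> List[str]:
    """Assemble a `docker run` argv list with the same flags as the diff."""
    return [
        "docker", "run", "--rm",
        f"--network={network}",
        f"--memory={memory}",
        f"--cpus={cpus}",
        "--read-only",                    # root filesystem is immutable
        "--tmpfs", "/tmp:exec,size=64m",  # small writable scratch space
        "-v", f"{workspace}:/work",       # workspace is the only bind mount
        "-w", "/work",
        image,
    ] + container_cmd


# Placeholder image and workspace path, for illustration only.
cmd = build_docker_cmd(
    "python:3.12-slim", "/tmp/ws", ["python", "/work/_action.py"]
)
print(" ".join(cmd))
```

Passing the command as a list rather than a shell string means the code written to `_action.py` never goes through shell quoting on the host.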

potato/agent_proxy/echo_proxy.py ADDED

	@@ -0,0 +1,55 @@

+"""
+Echo Agent Proxy
+A testing/demo proxy that returns responses from a configurable list.
+Cycles through responses in order, wrapping around when exhausted.
+Configuration:
+    agent_proxy:
+      type: echo
+      responses:
+        - "I understand your request."
+        - "Working on it now."
+        - "Here's what I found."
+"""
+import logging
+from .base import BaseAgentProxy, AgentMessage, AgentResponse, AgentProxyFactory
+logger = logging.getLogger(__name__)
+class EchoProxy(BaseAgentProxy):
+    """Test proxy that returns canned responses in order."""
+    proxy_type = "echo"
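+    def _initialize(self):
+        # Fall back to the defaults when `responses` is missing or empty;
+        # an empty list would make the modulo in send_message divide by zero.
+        self.responses = self.config.get("responses") or [
+            "I understand.",
+            "Working on it.",
+            "Done!",
+        ]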
+    def start_session(self, task_description: str) -> dict:
+        return {"response_index": 0, "task_description": task_description}
+    def send_message(self, message: str, session_context: dict) -> AgentResponse:
+        idx = session_context.get("response_index", 0)
+        response_text = self.responses[idx % len(self.responses)]
+        session_context["response_index"] = idx + 1
+        return AgentResponse(
+            message=AgentMessage(
+                role="agent",
+                content=response_text,
+            )
+        )
+    def end_session(self, session_context: dict):
+        pass
+# Register with factory
+AgentProxyFactory.register("echo", EchoProxy)
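
The wrap-around behaviour of `send_message` above comes down to a modulo on a per-session counter. The sketch below isolates that logic; `next_canned_response` is a hypothetical helper written for illustration and is not part of the module.

```python
from typing import Dict, List


def next_canned_response(responses: List[str], session_context: Dict[str, int]) -> str:
    """Return the next canned response and advance the per-session index."""
    if not responses:
        # Guard explicitly: `idx % 0` would raise ZeroDivisionError.
        raise ValueError("responses must not be empty")
    idx = session_context.get("response_index", 0)
    session_context["response_index"] = idx + 1
    return responses[idx % len(responses)]


ctx = {"response_index": 0}
replies = [next_canned_response(["a", "b", "c"], ctx) for _ in range(5)]
print(replies)  # ['a', 'b', 'c', 'a', 'b']
```

Because the index lives in the session context rather than on the proxy, each session cycles through the list independently.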

potato/agent_proxy/http_proxy.py ADDED

	@@ -0,0 +1,108 @@

+"""
+Generic HTTP Agent Proxy
+POSTs to any REST endpoint with configurable field mapping.
+Supports sending full conversation history and custom headers.
+Configuration:
+    agent_proxy:
+      type: http
+      url: "http://localhost:8080/chat"
+      headers:
+        Authorization: "Bearer YOUR_KEY"
+      message_key: "message"        # key in request body for user message
+      response_key: "response"      # key in response JSON for agent reply
+      session_id_key: "session_id"  # key in request/response for session tracking
+      send_history: false           # whether to send full conversation history
+      history_key: "messages"       # key for history array in request body
+"""
+import logging
+import uuid
+import requests
+from .base import BaseAgentProxy, AgentMessage, AgentResponse, AgentProxyFactory
+logger = logging.getLogger(__name__)
+class GenericHTTPProxy(BaseAgentProxy):
+    """Generic REST API proxy with configurable field mapping."""
+    proxy_type = "http"
+    def _initialize(self):
+        self.url = self.config.get("url")
+        if not self.url:
+            raise ValueError("http proxy requires 'url' in agent_proxy config")
+        self.headers = self.config.get("headers", {})
+        self.message_key = self.config.get("message_key", "message")
+        self.response_key = self.config.get("response_key", "response")
+        self.session_id_key = self.config.get("session_id_key", "session_id")
+        self.send_history = self.config.get("send_history", False)
+        self.history_key = self.config.get("history_key", "messages")
+        self.timeout = self.config.get("sandbox", {}).get(
+            "request_timeout_seconds", 60
+        )
+    def start_session(self, task_description: str) -> dict:
+        return {
+            "session_id": str(uuid.uuid4()),
+            "task_description": task_description,
+            "history": [],
+        }
+    def send_message(self, message: str, session_context: dict) -> AgentResponse:
+        payload = {
+            self.message_key: message,
+            self.session_id_key: session_context["session_id"],
+        }
+        if self.send_history:
+            payload[self.history_key] = session_context.get("history", [])
+        try:
+            resp = requests.post(
+                self.url,
+                json=payload,
+                headers=self.headers,
+                timeout=self.timeout,
+            )
+            resp.raise_for_status()
+            data = resp.json()
+            # A bare JSON string has no .get(); only index into objects.
+            if isinstance(data, dict):
+                response_text = data.get(self.response_key, "")
+            else:
+                response_text = data if isinstance(data, str) else ""
+            # Update history
+            session_context.setdefault("history", []).append(
+                {"role": "user", "content": message}
+            )
+            session_context["history"].append(
+                {"role": "agent", "content": response_text}
+            )
+            return AgentResponse(
+                message=AgentMessage(role="agent", content=str(response_text))
+            )
+        except requests.Timeout:
+            return AgentResponse(
+                message=AgentMessage(role="error", content="Agent request timed out."),
+                error="timeout",
+            )
+        except requests.RequestException as e:
+            logger.error(f"HTTP proxy request failed: {e}")
+            return AgentResponse(
+                message=AgentMessage(
+                    role="error", content=f"Agent communication error: {e}"
+                ),
+                error=str(e),
+            )
+# Register with factory
+AgentProxyFactory.register("http", GenericHTTPProxy)
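
The configurable field mapping described in the module docstring above can be illustrated without any network call. The two helpers below, `build_payload` and `extract_reply`, are hypothetical names used only for this sketch; they mirror how the request body is keyed and how a reply could be pulled from either a JSON object or a bare JSON string.

```python
from typing import Any, Dict, List, Optional


def build_payload(
    message: str,
    session_id: str,
    history: Optional[List[Dict[str, str]]] = None,
    message_key: str = "message",
    session_id_key: str = "session_id",
    history_key: str = "messages",
) -> Dict[str, Any]:
    """Build the request body using the configured key names."""
    payload: Dict[str, Any] = {message_key: message, session_id_key: session_id}
    if history is not None:
        # Only attach history when the caller opts in (send_history: true).
        payload[history_key] = history
    return payload


def extract_reply(data: Any, response_key: str = "response") -> str:
    """Pull the agent reply out of a decoded JSON body."""
    if isinstance(data, dict):
        return str(data.get(response_key, ""))
    if isinstance(data, str):
        return data
    return ""


body = build_payload("hi", "abc-123", message_key="input")
print(body)
print(extract_reply({"response": "hello"}), extract_reply("plain text"))
```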

potato/agent_proxy/openai_proxy.py ADDED

	@@ -0,0 +1,105 @@

+"""
+OpenAI Chat Completions Agent Proxy
+Uses the OpenAI SDK to communicate with chat completion models.
+Maintains conversation history in session context for multi-turn dialogue.
+Configuration:
+    agent_proxy:
+      type: openai
+      api_key: "${OPENAI_API_KEY}"   # or set OPENAI_API_KEY env var
+      model: "gpt-4o"
+      system_prompt: "You are a helpful travel agent."
+      temperature: 0.7
+      max_tokens: 1024
+"""
+import logging
+import os
+from .base import BaseAgentProxy, AgentMessage, AgentResponse, AgentProxyFactory
+logger = logging.getLogger(__name__)
+class OpenAIChatProxy(BaseAgentProxy):
+    """OpenAI Chat Completions proxy."""
+    proxy_type = "openai"
+    def _initialize(self):
+        # `or ""` also covers an explicit `api_key: null` in the YAML config.
+        api_key = self.config.get("api_key") or ""
+        # Support environment variable references like ${OPENAI_API_KEY}
+        if api_key.startswith("${") and api_key.endswith("}"):
+            env_var = api_key[2:-1]
+            api_key = os.environ.get(env_var, "")
+        if not api_key:
+            api_key = os.environ.get("OPENAI_API_KEY", "")
+        if not api_key:
+            raise ValueError(
+                "OpenAI proxy requires api_key in config or OPENAI_API_KEY env var"
+            )
+        try:
+            import openai
+        except ImportError as e:
+            raise ImportError(
+                "openai package is required for the OpenAI proxy. "
+                "Install with: pip install openai"
+            ) from e
+        self.client = openai.OpenAI(api_key=api_key)
+        self.model = self.config.get("model", "gpt-4o")
+        self.system_prompt = self.config.get("system_prompt", "")
+        self.temperature = self.config.get("temperature", 0.7)
+        self.max_tokens = self.config.get("max_tokens", 1024)
+        self.timeout = self.config.get("sandbox", {}).get(
+            "request_timeout_seconds", 60
+        )
+    def start_session(self, task_description: str) -> dict:
+        messages = []
+        if self.system_prompt:
+            messages.append({"role": "system", "content": self.system_prompt})
+        # Include task description as system context
+        messages.append({
+            "role": "system",
+            "content": f"The user's task: {task_description}",
+        })
+        return {"messages": messages}
+    def send_message(self, message: str, session_context: dict) -> AgentResponse:
+        messages = session_context.get("messages", [])
+        messages.append({"role": "user", "content": message})
+        try:
+            response = self.client.chat.completions.create(
+                model=self.model,
+                messages=messages,
+                temperature=self.temperature,
+                max_tokens=self.max_tokens,
+                timeout=self.timeout,
+            )
+            content = response.choices[0].message.content or ""
+            messages.append({"role": "assistant", "content": content})
+            session_context["messages"] = messages
+            return AgentResponse(
+                message=AgentMessage(role="agent", content=content)
+            )
+        except Exception as e:
+            logger.error(f"OpenAI proxy error: {e}")
+            return AgentResponse(
+                message=AgentMessage(
+                    role="error", content=f"Agent error: {e}"
+                ),
+                error=str(e),
+            )
+# Register with factory
+AgentProxyFactory.register("openai", OpenAIChatProxy)
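
The key lookup order in `_initialize` above (literal value, then `${VAR}` reference, then the `OPENAI_API_KEY` fallback) can be sketched as a pure function. `resolve_api_key` and its `env` parameter are assumptions introduced here so the logic can be exercised against a plain dictionary instead of the real environment.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    configured: Optional[str],
    fallback_var: str = "OPENAI_API_KEY",
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Resolve an API key from config, a ${VAR} reference, or a fallback variable."""
    env = os.environ if env is None else env
    key = configured or ""
    if key.startswith("${") and key.endswith("}"):
        # "${NAME}" means: read NAME from the environment.
        key = env.get(key[2:-1], "")
    # An empty result falls through to the conventional variable.
    return key or env.get(fallback_var, "")


fake_env = {"MY_KEY": "from-ref", "OPENAI_API_KEY": "from-fallback"}
print(resolve_api_key("${MY_KEY}", env=fake_env))
print(resolve_api_key(None, env=fake_env))
```

An empty string return signals that no key was found; the caller decides whether that is an error.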

potato/agent_proxy/sandbox.py ADDED

	@@ -0,0 +1,76 @@

+"""
+Agent Proxy Safety Sandbox
+Enforces limits on agent interactions: step counts, session timeouts,
+rate limits, and request timeouts. Prevents runaway or abusive sessions.
+"""
+import time
+import threading
+import logging
+from collections import defaultdict
+from typing import Dict, List
+logger = logging.getLogger(__name__)
+class SandboxViolation(Exception):
+    """Raised when a safety limit is exceeded."""
+    pass
+class SafetySandbox:
+    """Enforces safety limits on agent interactions."""
+    def __init__(self, config: dict):
+        sandbox_config = config.get("sandbox", {})
+        self.max_steps = sandbox_config.get("max_steps", 20)
+        self.max_session_seconds = sandbox_config.get("max_session_seconds", 600)
+        self.rate_limit_per_minute = sandbox_config.get("rate_limit_per_minute", 10)
+        self.request_timeout = sandbox_config.get("request_timeout_seconds", 60)
+        # Sliding window rate limit tracking: user_id -> list of timestamps
+        self._rate_windows: Dict[str, List[float]] = defaultdict(list)
+        self._lock = threading.Lock()
+    def check_step_limit(self, current_steps: int):
+        """Raise SandboxViolation if step limit reached."""
+        if current_steps >= self.max_steps:
+            raise SandboxViolation(
+                f"Step limit reached ({self.max_steps}). "
+                f"Please finish the conversation."
+            )
+    def check_session_timeout(self, session_start: float):
+        """Raise SandboxViolation if session has timed out."""
+        elapsed = time.time() - session_start
+        if elapsed > self.max_session_seconds:
+            raise SandboxViolation(
+                f"Session timeout ({self.max_session_seconds}s). "
+                f"Please finish the conversation."
+            )
+    def check_rate_limit(self, user_id: str):
+        """Raise SandboxViolation if user is sending too fast."""
+        now = time.time()
+        window_start = now - 60.0
+        with self._lock:
+            # Remove old entries outside the 1-minute window
+            timestamps = self._rate_windows[user_id]
+            self._rate_windows[user_id] = [
+                t for t in timestamps if t > window_start
+            ]
+            if len(self._rate_windows[user_id]) >= self.rate_limit_per_minute:
+                raise SandboxViolation(
+                    f"Rate limit exceeded ({self.rate_limit_per_minute}/min). "
+                    f"Please wait before sending another message."
+                )
+            # Record this request
+            self._rate_windows[user_id].append(now)
+    def get_request_timeout(self) -> float:
+        """Get the timeout in seconds for proxy HTTP requests."""
+        return self.request_timeout
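
The sliding-window check in `check_rate_limit` above can be exercised deterministically by injecting the clock. This is a sketch under stated assumptions: `SlidingWindowLimiter` is a hypothetical class, and it returns a boolean rather than raising `SandboxViolation` so the example stays self-contained.

```python
import threading
import time
from collections import defaultdict
from typing import Callable, Dict, List


class SlidingWindowLimiter:
    """Allow at most `limit` requests per user within `window_seconds`."""

    def __init__(
        self,
        limit: int,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.time,
    ):
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._hits: Dict[str, List[float]] = defaultdict(list)
        self._lock = threading.Lock()

    def allow(self, user_id: str) -> bool:
        now = self._clock()
        cutoff = now - self.window_seconds
        with self._lock:
            # Keep only timestamps that are still inside the window.
            recent = [t for t in self._hits[user_id] if t > cutoff]
            if len(recent) >= self.limit:
                self._hits[user_id] = recent
                return False
            recent.append(now)  # record this request
            self._hits[user_id] = recent
            return True


fake_now = [0.0]  # mutable cell standing in for the wall clock
limiter = SlidingWindowLimiter(limit=2, clock=lambda: fake_now[0])
results = [limiter.allow("u1"), limiter.allow("u1"), limiter.allow("u1")]
fake_now[0] = 61.0  # advance past the 60-second window
results.append(limiter.allow("u1"))
print(results)  # [True, True, False, True]
```

Rejected requests are not recorded, so a client that keeps retrying while blocked does not extend its own lockout.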

potato/agent_proxy/session.py ADDED

	@@ -0,0 +1,119 @@

+"""
+Agent Session Manager
+Thread-safe singleton that tracks active agent interaction sessions.
+Each session maps a (user_id, instance_id) pair to an AgentSession
+containing the proxy, conversation history, and step count.
+Follows the same singleton pattern as ItemStateManager and UserStateManager.
+"""
+import threading
+import time
+import logging
+from dataclasses import dataclass, field
+from typing import Dict, List, Optional, Tuple
+from .base import AgentMessage, BaseAgentProxy
+logger = logging.getLogger(__name__)
+@dataclass
+class AgentSession:
+    """An active agent interaction session."""
+    user_id: str
+    instance_id: str
+    proxy: BaseAgentProxy
+    task_description: str
+    proxy_context: dict = field(default_factory=dict)
+    messages: List[AgentMessage] = field(default_factory=list)
+    step_count: int = 0
+    started_at: float = field(default_factory=time.time)
+    finished: bool = False
+class AgentSessionManager:
+    """Thread-safe manager for active agent sessions."""
+    def __init__(self, config: dict):
+        self.config = config
+        self._sessions: Dict[Tuple[str, str], AgentSession] = {}
+        self._lock = threading.RLock()
+    def create_session(
+        self,
+        user_id: str,
+        instance_id: str,
+        proxy: BaseAgentProxy,
+        task_description: str,
+    ) -> AgentSession:
+        """Create a new session for a user/instance pair."""
+        with self._lock:
+            key = (user_id, instance_id)
+            if key in self._sessions and not self._sessions[key].finished:
+                logger.warning(
+                    f"Session already exists for {key}, returning existing"
+                )
+                return self._sessions[key]
+            proxy_context = proxy.start_session(task_description)
+            session = AgentSession(
+                user_id=user_id,
+                instance_id=instance_id,
+                proxy=proxy,
+                task_description=task_description,
+                proxy_context=proxy_context,
+            )
+            self._sessions[key] = session
+            logger.debug(f"Created agent session for {key}")
+            return session
+    def get_session(
+        self, user_id: str, instance_id: str
+    ) -> Optional[AgentSession]:
+        """Return the stored session (finished or not), or None if absent."""
+        with self._lock:
+            return self._sessions.get((user_id, instance_id))
+    def remove_session(self, user_id: str, instance_id: str):
+        """Remove a session and clean up proxy resources."""
+        with self._lock:
+            key = (user_id, instance_id)
+            session = self._sessions.pop(key, None)
+            if session:
+                try:
+                    session.proxy.end_session(session.proxy_context)
+                except Exception as e:
+                    logger.warning(f"Error ending proxy session for {key}: {e}")
+                logger.debug(f"Removed agent session for {key}")
+# Singleton management
+_AGENT_SESSION_MANAGER: Optional[AgentSessionManager] = None
+_AGENT_SESSION_MANAGER_LOCK = threading.Lock()
+def init_agent_session_manager(config: dict) -> AgentSessionManager:
+    """Initialize the singleton AgentSessionManager."""
+    global _AGENT_SESSION_MANAGER
+    if _AGENT_SESSION_MANAGER is None:
+        with _AGENT_SESSION_MANAGER_LOCK:
+            if _AGENT_SESSION_MANAGER is None:
+                _AGENT_SESSION_MANAGER = AgentSessionManager(config)
+    return _AGENT_SESSION_MANAGER
+def get_agent_session_manager() -> AgentSessionManager:
+    """Get the singleton AgentSessionManager."""
+    global _AGENT_SESSION_MANAGER
+    if _AGENT_SESSION_MANAGER is None:
+        raise ValueError("AgentSessionManager has not been initialized yet!")
+    return _AGENT_SESSION_MANAGER
+def clear_agent_session_manager():
+    """Clear the singleton instance (for testing)."""
+    global _AGENT_SESSION_MANAGER
+    with _AGENT_SESSION_MANAGER_LOCK:
+        _AGENT_SESSION_MANAGER = None
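
The singleton helpers above use double-checked locking: a lock-free fast path, then a second check under the lock. The sketch below reproduces the pattern with a stand-in class; `Registry`, `init_registry`, and `clear_registry` are hypothetical names, not part of the module.

```python
import threading
from typing import List, Optional


class Registry:
    """Stand-in for a manager object that should exist once per process."""

    def __init__(self, config: dict):
        self.config = config


_INSTANCE: Optional[Registry] = None
_INSTANCE_LOCK = threading.Lock()


def init_registry(config: dict) -> Registry:
    global _INSTANCE
    if _INSTANCE is None:          # fast path: no lock once initialised
        with _INSTANCE_LOCK:
            if _INSTANCE is None:  # re-check: another thread may have won
                _INSTANCE = Registry(config)
    return _INSTANCE


def clear_registry() -> None:
    global _INSTANCE
    with _INSTANCE_LOCK:
        _INSTANCE = None


# Eight threads race to initialise; all should receive the same object.
seen: List[Registry] = []
threads = [
    threading.Thread(target=lambda i=i: seen.append(init_registry({"n": i})))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len({id(obj) for obj in seen}))  # 1
```

Only the first caller's `config` takes effect; later calls to `init_registry` return the existing instance and ignore their argument.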

potato/agent_runner.py ADDED

	@@ -0,0 +1,1008 @@

+"""
+Live Agent Runner
+Manages an AI agent that browses the web via Playwright, controlled by an LLM.
+Annotators can observe, pause, instruct, or take over the agent in real time.
+The agent loop runs in a background thread with its own asyncio event loop.
+Communication with Flask routes happens through thread-safe state and queues.
+"""
+import asyncio
+import base64
+import json
+import logging
+import os
+import threading
+import time
+import uuid
+from dataclasses import dataclass, field
+from enum import Enum
+from queue import Queue, Empty
+from typing import Any, Callable, Dict, List, Optional
+logger = logging.getLogger(__name__)
+class AgentState(Enum):
+    """States of the agent lifecycle."""
+    IDLE = "idle"
+    RUNNING = "running"
+    PAUSED = "paused"
+    TAKEOVER = "takeover"
+    COMPLETED = "completed"
+    ERROR = "error"
+@dataclass
+class AgentStep:
+    """A single step in the agent's execution."""
+    step_index: int
+    screenshot_path: str
+    action: Dict[str, Any]
+    thought: str
+    observation: str
+    timestamp: float
+    url: str = ""
+    viewport: Optional[Dict[str, int]] = None
+    coordinates: Optional[Dict[str, int]] = None
+    element: Optional[Dict[str, Any]] = None
+    annotator_instruction: Optional[str] = None
+    def to_dict(self) -> Dict[str, Any]:
+        d = {
+            "step_index": self.step_index,
+            "screenshot_url": self.screenshot_path,
+            "action_type": self.action.get("type", "unknown"),
+            "action": self.action,
+            "thought": self.thought,
+            "observation": self.observation,
+            "timestamp": self.timestamp,
+            "url": self.url,
+        }
+        if self.viewport:
+            d["viewport"] = self.viewport
+        if self.coordinates:
+            d["coordinates"] = self.coordinates
+        if self.element:
+            d["element"] = self.element
+        if self.annotator_instruction:
+            d["annotator_instruction"] = self.annotator_instruction
+        return d
+@dataclass
+class AgentConfig:
+    """Configuration for the agent runner."""
+    max_steps: int = 30
+    step_delay: float = 1.0
+    viewport_width: int = 1280
+    viewport_height: int = 720
+    system_prompt: str = ""
+    model: str = "claude-sonnet-4-20250514"
+    api_key: str = ""
+    max_tokens: int = 4096
+    temperature: float = 0.3
+    endpoint_type: str = "anthropic_vision"
+    history_window: int = 5  # Number of recent steps to include in LLM context
+    timeout: int = 60  # Per-request timeout in seconds
+    base_url: str = ""  # For Ollama: server URL
+    @classmethod
+    def from_config(cls, config: Dict[str, Any]) -> "AgentConfig":
+        """Create AgentConfig from a live_agent YAML config dict."""
+        ai_config = config.get("ai_config", {})
+        viewport = config.get("viewport", {})
+        endpoint_type = config.get("endpoint_type", "anthropic_vision")
+        # API key: Ollama doesn't need one; OpenAI-compatible servers
+        # (e.g. vLLM) ignore it but the SDK requires a non-empty string.
+        if endpoint_type == "ollama_vision":
+            api_key = ai_config.get("api_key", "")
+            default_model = "gemma3:4b"
+        elif endpoint_type == "openai_vision":
+            api_key = ai_config.get("api_key", os.environ.get("OPENAI_API_KEY", "EMPTY"))
+            default_model = ""  # must be set explicitly (e.g. served model id)
+        else:
+            api_key = ai_config.get("api_key", os.environ.get("ANTHROPIC_API_KEY", ""))
+            default_model = "claude-sonnet-4-20250514"
+        return cls(
+            max_steps=config.get("max_steps", 30),
+            step_delay=config.get("step_delay", 1.0),
+            viewport_width=viewport.get("width", 1280),
+            viewport_height=viewport.get("height", 720),
+            system_prompt=config.get("system_prompt", DEFAULT_SYSTEM_PROMPT),
+            model=ai_config.get("model", default_model),
+            api_key=api_key,
+            max_tokens=ai_config.get("max_tokens", 4096),
+            temperature=ai_config.get("temperature", 0.3),
+            endpoint_type=endpoint_type,
+            history_window=config.get("history_window", 5),
+            timeout=ai_config.get("timeout", 60),
+            base_url=ai_config.get("base_url", "http://localhost:11434"),
+        )
+DEFAULT_SYSTEM_PROMPT = """You are a web browsing agent. You can see screenshots of web pages and take actions to complete tasks.
+For each step, analyze the current screenshot and respond with a JSON object:
+{
+  "thought": "Your reasoning about what you see and what to do next",
+  "action": {
+    "type": "click|type|scroll|navigate|wait|done",
+    // For click: "x": 100, "y": 200
+    // For type: "text": "hello world"
+    // For scroll: "direction": "up|down", "amount": 300
+    // For navigate: "url": "https://..."
+    // For wait: (no extra fields)
+    // For done: "summary": "Task completed because..."
+  }
+}
+Always respond with valid JSON only. No markdown, no extra text."""
+class AgentRunner:
+    """
+    Runs an AI agent that browses the web via Playwright.
+    The agent loop:
+    1. Takes a screenshot
+    2. Sends it to the LLM with context/history
+    3. Parses the LLM response for an action
+    4. Executes the action via Playwright
+    5. Emits events to all listeners (for SSE)
+    6. Repeats until done, error, or max_steps
+    Thread-safe control methods allow pause/resume/instruct/takeover.
+    """
+    def __init__(self, session_id: str, config: AgentConfig, screenshot_dir: str):
+        self.session_id = session_id
+        self.config = config
+        self.screenshot_dir = screenshot_dir
+        # State
+        self._state = AgentState.IDLE
+        self._state_lock = threading.Lock()
+        self._steps: List[AgentStep] = []
+        self._error: Optional[str] = None
+        # Control
+        self._pause_event = threading.Event()
+        self._pause_event.set()  # Not paused initially
+        self._stop_flag = threading.Event()
+        self._instruction_queue: Queue = Queue()
+        self._takeover_actions: Queue = Queue()
+        # Listeners for SSE
+        self._listeners: List[Callable] = []
+        self._listeners_lock = threading.Lock()
+        # Annotator interactions log
+        self._interactions: List[Dict[str, Any]] = []
+        # Playwright session (set during run)
+        self._playwright_session = None
+        self._llm_client = None
+        # Background thread
+        self._thread: Optional[threading.Thread] = None
+    @property
+    def state(self) -> AgentState:
+        with self._state_lock:
+            return self._state
+    @state.setter
+    def state(self, new_state: AgentState):
+        with self._state_lock:
+            old_state = self._state
+            self._state = new_state
+        self._emit_event("state_change", {
+            "old_state": old_state.value,
+            "new_state": new_state.value,
+            "timestamp": time.time(),
+        })
+    @property
+    def steps(self) -> List[AgentStep]:
+        return list(self._steps)
+    @property
+    def step_count(self) -> int:
+        return len(self._steps)
+    @property
+    def error(self) -> Optional[str]:
+        return self._error
+    # --- Control methods (thread-safe) ---
+    def pause(self):
+        """Pause the agent loop after the current step completes."""
+        if self.state == AgentState.RUNNING:
+            self._pause_event.clear()
+            self.state = AgentState.PAUSED
+            logger.info(f"[{self.session_id}] Agent paused")
+    def resume(self):
+        """Resume a paused agent."""
+        if self.state == AgentState.PAUSED:
+            self.state = AgentState.RUNNING
+            self._pause_event.set()
+            logger.info(f"[{self.session_id}] Agent resumed")
+    def inject_instruction(self, instruction: str):
+        """Send an instruction to the agent (processed at next step)."""
+        self._instruction_queue.put(instruction)
+        self._interactions.append({
+            "type": "instruction",
+            "text": instruction,
+            "timestamp": time.time(),
+            "step_index": self.step_count,
+        })
+        self._emit_event("instruction_received", {"instruction": instruction})
+        logger.info(f"[{self.session_id}] Instruction injected: {instruction[:100]}")
+    def enter_takeover(self):
+        """Switch to manual takeover mode."""
+        if self.state in (AgentState.RUNNING, AgentState.PAUSED):
+            self._pause_event.clear()  # Pause the agent loop
+            self.state = AgentState.TAKEOVER
+            self._interactions.append({
+                "type": "takeover_start",
+                "timestamp": time.time(),
+                "step_index": self.step_count,
+            })
+            logger.info(f"[{self.session_id}] Takeover mode entered")
+    def exit_takeover(self):
+        """Exit manual takeover and resume the agent."""
+        if self.state == AgentState.TAKEOVER:
+            self._interactions.append({
+                "type": "takeover_end",
+                "timestamp": time.time(),
+                "step_index": self.step_count,
+            })
+            self.state = AgentState.RUNNING
+            self._pause_event.set()
+            logger.info(f"[{self.session_id}] Takeover mode exited")
+    def submit_manual_action(self, action: Dict[str, Any]):
+        """Submit a manual action during takeover mode."""
+        if self.state == AgentState.TAKEOVER:
+            self._takeover_actions.put(action)
+    def stop(self):
+        """Stop the agent loop."""
+        self._stop_flag.set()
+        self._pause_event.set()  # Unblock if paused
+        logger.info(f"[{self.session_id}] Stop requested")
+    # --- Listener management ---
+    def add_listener(self, callback: Callable):
+        """Add an SSE listener callback."""
+        with self._listeners_lock:
+            self._listeners.append(callback)
+    def remove_listener(self, callback: Callable):
+        """Remove an SSE listener callback."""
+        with self._listeners_lock:
+            self._listeners = [l for l in self._listeners if l is not callback]
+    def _emit_event(self, event_type: str, data: Dict[str, Any]):
+        """Emit an event to all listeners."""
+        event = {"type": event_type, "data": data, "session_id": self.session_id}
+        with self._listeners_lock:
+            for listener in self._listeners:
+                try:
+                    listener(event)
+                except Exception as e:
+                    logger.warning(f"Listener error: {e}")
+    # --- Main agent loop ---
+    def start(self, task_description: str, start_url: str):
+        """Start the agent in a background thread."""
+        if self.state != AgentState.IDLE:
+            raise RuntimeError(f"Cannot start agent in state {self.state}")
+        self._thread = threading.Thread(
+            target=self._run_thread,
+            args=(task_description, start_url),
+            daemon=True,
+            name=f"agent-{self.session_id}",
+        )
+        self._thread.start()
+    def _run_thread(self, task_description: str, start_url: str):
+        """Thread target: runs the async agent loop."""
+        loop = asyncio.new_event_loop()
+        asyncio.set_event_loop(loop)
+        try:
+            loop.run_until_complete(self._run_async(task_description, start_url))
+        except Exception as e:
+            logger.error(f"[{self.session_id}] Agent thread error: {e}")
+            self._error = str(e)
+            self.state = AgentState.ERROR
+            self._emit_event("error", {"message": str(e)})
+        finally:
+            loop.close()
+    async def _run_async(self, task_description: str, start_url: str):
+        """Async agent loop."""
+        from potato.web_playwright import PlaywrightSession
+        self.state = AgentState.RUNNING
+        # Initialize Playwright
+        self._playwright_session = PlaywrightSession(
+            width=self.config.viewport_width,
+            height=self.config.viewport_height,
+        )
+        started = await self._playwright_session.start(start_url)
+        if not started:
+            raise RuntimeError("Failed to start Playwright browser session")
+        # Initialize LLM client
+        self._init_llm_client()
+        self._emit_event("started", {
+            "task": task_description,
+            "start_url": start_url,
+            "max_steps": self.config.max_steps,
+        })
+        try:
+            for step_index in range(self.config.max_steps):
+                # Check stop flag
+                if self._stop_flag.is_set():
+                    logger.info(f"[{self.session_id}] Stopped by user")
+                    break
+                # Wait if paused (blocks until resume/stop)
+                while not self._pause_event.is_set():
+                    if self._stop_flag.is_set():
+                        break
+                    # Handle takeover actions while paused in takeover mode
+                    if self.state == AgentState.TAKEOVER:
+                        await self._process_takeover_actions()
+                    await asyncio.sleep(0.1)
+                if self._stop_flag.is_set():
+                    break
+                # Check for injected instructions
+                instruction = None
+                try:
+                    instruction = self._instruction_queue.get_nowait()
+                except Empty:
+                    pass
+                # Execute one agent step
+                step = await self._agent_step(
+                    step_index, task_description, instruction
+                )
+                self._steps.append(step)
+                # Check if agent decided it's done
+                if step.action.get("type") == "done":
+                    logger.info(f"[{self.session_id}] Agent completed task")
+                    break
+                # Step delay
+                if self.config.step_delay > 0:
+                    await asyncio.sleep(self.config.step_delay)
+            self.state = AgentState.COMPLETED
+            self._emit_event("complete", {
+                "total_steps": len(self._steps),
+                "final_url": (await self._playwright_session.get_state()).get("url", ""),
+            })
+        finally:
+            await self._playwright_session.stop()
+            self._playwright_session = None
+    async def _agent_step(
+        self,
+        step_index: int,
+        task_description: str,
+        instruction: Optional[str] = None,
+    ) -> AgentStep:
+        """Execute a single agent step: screenshot → LLM → action → emit."""
+        # 1. Take screenshot
+        screenshot_bytes = await self._playwright_session.screenshot()
+        if not screenshot_bytes:
+            raise RuntimeError("Failed to capture screenshot")
+        screenshot_path = os.path.join(
+            self.screenshot_dir, f"step_{step_index:03d}.png"
+        )
+        os.makedirs(os.path.dirname(screenshot_path), exist_ok=True)
+        with open(screenshot_path, "wb") as f:
+            f.write(screenshot_bytes)
+        # 2. Get page state
+        page_state = await self._playwright_session.get_state()
+        # 3. Emit thinking event
+        self._emit_event("thinking", {
+            "step_index": step_index,
+            "screenshot_url": screenshot_path,
+            "url": page_state.get("url", ""),
+        })
+        # 4. Build messages and query LLM
+        screenshot_b64 = base64.b64encode(screenshot_bytes).decode("utf-8")
+        messages = self._build_llm_messages(
+            screenshot_b64, task_description, instruction
+        )
+        llm_response = self._query_llm(messages)
+        # 5. Parse action from response
+        thought, action = self._parse_action(llm_response)
+        # 6. Execute action
+        observation = await self._execute_action(action)
+        # 7. Build step
+        step = AgentStep(
+            step_index=step_index,
+            screenshot_path=screenshot_path,
+            action=action,
+            thought=thought,
+            observation=observation,
+            timestamp=time.time(),
+            url=page_state.get("url", ""),
+            viewport=page_state.get("viewport"),
+            coordinates=_extract_coordinates(action),
+            annotator_instruction=instruction,
+        )
+        # 8. Emit step event
+        self._emit_event("step", step.to_dict())
+        return step
+    def _build_llm_messages(
+        self,
+        screenshot_b64: str,
+        task_description: str,
+        instruction: Optional[str] = None,
+    ) -> List[Dict[str, Any]]:
+        """Build message list for the LLM vision API."""
+        messages = []
+        # System message
+        system_prompt = self.config.system_prompt or DEFAULT_SYSTEM_PROMPT
+        messages.append({"role": "system", "content": system_prompt})
+        # Task description
+        task_msg = f"Task: {task_description}"
+        if instruction:
+            task_msg += f"\n\nAnnotator instruction: {instruction}"
+        # Include recent step history
+        history_steps = self._steps[-self.config.history_window:]
+        if history_steps:
+            history_parts = []
+            for s in history_steps:
+                entry = f"Step {s.step_index}: thought='{s.thought}', action={json.dumps(s.action)}, observation='{s.observation}'"
+                history_parts.append(entry)
+            task_msg += "\n\nRecent history:\n" + "\n".join(history_parts)
+        messages.append({"role": "user", "content": task_msg})
+        # Current screenshot (as a separate user message with image)
+        messages.append({
+            "role": "user",
+            "content": [
+                {
+                    "type": "image",
+                    "source": {
+                        "type": "base64",
+                        "media_type": "image/png",
+                        "data": screenshot_b64,
+                    },
+                },
+                {
+                    "type": "text",
+                    "text": f"Current page screenshot (step {len(self._steps)}). What action should I take next?",
+                },
+            ],
+        })
+        return messages
+    def _init_llm_client(self):
+        """Initialize the LLM client based on endpoint_type."""
+        if self.config.endpoint_type == "anthropic_vision":
+            try:
+                import anthropic
+            except ImportError:
+                raise RuntimeError(
+                    "anthropic package required. Install with: pip install anthropic"
+                )
+            api_key = self.config.api_key or os.environ.get("ANTHROPIC_API_KEY")
+            if not api_key:
+                raise RuntimeError(
+                    "Anthropic API key required. Set in config or ANTHROPIC_API_KEY env var."
+                )
+            self._llm_client = anthropic.Anthropic(
+                api_key=api_key, timeout=self.config.timeout
+            )
+        elif self.config.endpoint_type == "ollama_vision":
+            try:
+                import ollama
+            except ImportError:
+                raise RuntimeError(
+                    "ollama package required. Install with: pip install ollama"
+                )
+            host = self.config.base_url or "http://localhost:11434"
+            self._llm_client = ollama.Client(
+                host=host, timeout=self.config.timeout
+            )
+            # Verify connectivity
+            try:
+                self._llm_client.list()
+                logger.info(f"Connected to Ollama at {host}, model: {self.config.model}")
+            except Exception as e:
+                raise RuntimeError(f"Failed to connect to Ollama at {host}: {e}")
+        elif self.config.endpoint_type == "openai_vision":
+            try:
+                from openai import OpenAI
+            except ImportError:
+                raise RuntimeError(
+                    "openai package required. Install with: pip install openai"
+                )
+            base_url = self.config.base_url or "https://api.openai.com/v1"
+            self._llm_client = OpenAI(
+                base_url=base_url,
+                api_key=self.config.api_key or "EMPTY",
+                timeout=self.config.timeout,
+            )
+            try:
+                self._llm_client.models.list()
+                logger.info(
+                    f"Connected to OpenAI-compatible endpoint at {base_url}, "
+                    f"model: {self.config.model}"
+                )
+            except Exception as e:
+                # Non-fatal: some servers gate /models; the chat call will
+                # surface a real error if the endpoint is truly unreachable.
+                logger.warning(
+                    f"Could not list models at {base_url} ({e}); continuing."
+                )
+        else:
+            raise RuntimeError(
+                f"Unsupported endpoint_type: {self.config.endpoint_type}. "
+                f"Supported: 'anthropic_vision', 'ollama_vision', 'openai_vision'."
+            )
+    def _query_llm(self, messages: List[Dict[str, Any]]) -> str:
+        """Send messages to the LLM and return the text response."""
+        if self.config.endpoint_type == "anthropic_vision":
+            return self._query_anthropic(messages)
+        elif self.config.endpoint_type == "ollama_vision":
+            return self._query_ollama(messages)
+        elif self.config.endpoint_type == "openai_vision":
+            return self._query_openai(messages)
+        raise RuntimeError(f"Unsupported endpoint type: {self.config.endpoint_type}")
+    def _query_openai(self, messages: List[Dict[str, Any]]) -> str:
+        """Query an OpenAI-compatible vision endpoint (OpenAI, vLLM, etc.).
+        Converts the internal Anthropic-style message blocks into OpenAI
+        chat-completions format (image blocks become ``image_url`` data
+        URIs). Requests a JSON object response when the server supports it,
+        falling back gracefully if it does not.
+        """
+        oai_messages = []
+        for msg in messages:
+            role = msg["role"]
+            content = msg.get("content", "")
+            if isinstance(content, str):
+                oai_messages.append({"role": role, "content": content})
+                continue
+            parts = []
+            for block in content:
+                if not isinstance(block, dict):
+                    continue
+                if block.get("type") == "text":
+                    parts.append({"type": "text", "text": block.get("text", "")})
+                elif block.get("type") == "image":
+                    src = block.get("source", {})
+                    if src.get("type") == "base64":
+                        media = src.get("media_type", "image/png")
+                        parts.append({
+                            "type": "image_url",
+                            "image_url": {
+                                "url": f"data:{media};base64,{src['data']}"
+                            },
+                        })
+            oai_messages.append({"role": role, "content": parts})
+        kwargs = {
+            "model": self.config.model,
+            "messages": oai_messages,
+            "max_tokens": self.config.max_tokens,
+            "temperature": self.config.temperature,
+        }
+        def _is_rate_limit(exc) -> bool:
+            if getattr(exc, "status_code", None) == 429:
+                return True
+            s = str(exc).lower()
+            return ("429" in s or "rate limit" in s or "quota" in s
+                    or "resource_exhausted" in s)
+        def _create(use_rf: bool):
+            if use_rf:
+                return self._llm_client.chat.completions.create(
+                    response_format={"type": "json_object"}, **kwargs)
+            return self._llm_client.chat.completions.create(**kwargs)
+        # Transient 429s (per-minute rate/token bursts) are common mid-run
+        # even on paid tiers; back off and retry instead of failing the
+        # whole agent session.
+        backoffs = [5, 15, 30, 30, 30]
+        use_rf = True
+        attempt = 0
+        while True:
+            try:
+                resp = _create(use_rf)
+                break
+            except Exception as e:
+                if _is_rate_limit(e):
+                    if attempt >= len(backoffs):
+                        raise
+                    wait = backoffs[attempt]
+                    attempt += 1
+                    logger.warning(
+                        f"[{self.session_id}] LLM 429/rate-limited; "
+                        f"retry {attempt}/{len(backoffs)} in {wait}s"
+                    )
+                    self._emit_event("thinking", {
+                        "text": f"Rate-limited by the model API; "
+                                f"waiting {wait}s before retrying…"
+                    })
+                    time.sleep(wait)
+                    continue
+                if use_rf:
+                    # Server may not support response_format; drop it once.
+                    use_rf = False
+                    continue
+                raise
+        return resp.choices[0].message.content or ""
+    def _query_anthropic(self, messages: List[Dict[str, Any]]) -> str:
+        """Query Anthropic Claude with vision support."""
+        # Separate system message
+        system = ""
+        api_messages = []
+        for msg in messages:
+            if msg["role"] == "system":
+                system = msg["content"]
+            else:
+                api_messages.append(msg)
+        kwargs = {
+            "model": self.config.model,
+            "max_tokens": self.config.max_tokens,
+            "temperature": self.config.temperature,
+            "messages": api_messages,
+        }
+        if system:
+            kwargs["system"] = system
+        response = self._llm_client.messages.create(**kwargs)
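+        # Join all text blocks; the first block is not guaranteed to be text.
+        return "".join(
+            getattr(block, "text", "") or ""
+            for block in response.content
+            if getattr(block, "type", None) == "text"
+        )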
+    def _query_ollama(self, messages: List[Dict[str, Any]]) -> str:
+        """Query Ollama vision model.
+        Converts Anthropic-format messages to Ollama format:
+        - System messages are prepended to the prompt text
+        - Multiple user messages are merged into a single message
+        - Content blocks with images use Ollama's 'images' key
+        """
+        # Extract text and images from Anthropic-format messages
+        all_text_parts = []
+        all_images = []
+        for msg in messages:
+            content = msg.get("content", "")
+            if msg["role"] == "system":
+                if isinstance(content, str) and content:
+                    all_text_parts.insert(0, content)
+                continue
+            if isinstance(content, list):
+                for block in content:
+                    if isinstance(block, dict):
+                        if block.get("type") == "text":
+                            all_text_parts.append(block["text"])
+                        elif block.get("type") == "image":
+                            source = block.get("source", {})
+                            if source.get("type") == "base64":
+                                all_images.append(source["data"])
+            elif isinstance(content, str) and content:
+                all_text_parts.append(content)
+        ollama_msg = {
+            "role": "user",
+            "content": "\n\n".join(all_text_parts),
+        }
+        if all_images:
+            ollama_msg["images"] = all_images
+        options = {
+            "temperature": self.config.temperature,
+            "num_predict": self.config.max_tokens,
+        }
+        # Use Ollama's format schema to force structured JSON output
+        agent_schema = {
+            "type": "object",
+            "properties": {
+                "thought": {"type": "string"},
+                "action": {
+                    "type": "object",
+                    "properties": {
+                        "type": {"type": "string"},
+                        "x": {"type": "integer"},
+                        "y": {"type": "integer"},
+                        "text": {"type": "string"},
+                        "url": {"type": "string"},
+                        "direction": {"type": "string"},
+                        "amount": {"type": "integer"},
+                        "summary": {"type": "string"},
+                    },
+                    "required": ["type"],
+                },
+            },
+            "required": ["thought", "action"],
+        }
+        response = self._llm_client.chat(
+            model=self.config.model,
+            messages=[ollama_msg],
+            options=options,
+            format=agent_schema,
+        )
+        # Extract content from response (handle both dict and Pydantic model)
+        message = (
+            response.get("message")
+            if hasattr(response, "get")
+            else getattr(response, "message", None)
+        )
+        if message is None:
+            raise RuntimeError("No message in Ollama response")
+        content = (
+            message.get("content")
+            if hasattr(message, "get")
+            else getattr(message, "content", None)
+        )
+        # Some models (e.g. qwen3-vl) put responses in 'thinking' field
+        # and leave content empty. Extract the agent JSON from thinking.
+        if not content:
+            thinking = (
+                message.get("thinking")
+                if hasattr(message, "get")
+                else getattr(message, "thinking", None)
+            )
+            if thinking:
+                content = _extract_agent_json(thinking)
+        return content or ""
+    def _parse_action(self, llm_response: str) -> tuple:
+        """Parse thought and action from LLM JSON response.
+        Returns:
+            (thought, action_dict)
+        """
+        # Try to extract JSON from response
+        text = llm_response.strip()
+        # Handle markdown code blocks
+        if "```json" in text:
+            import re
+            match = re.search(r"```json\s*([\s\S]*?)\s*```", text)
+            if match:
+                text = match.group(1).strip()
+        elif "```" in text:
+            import re
+            match = re.search(r"```\s*([\s\S]*?)\s*```", text)
+            if match:
+                text = match.group(1).strip()
+        try:
+            parsed = json.loads(text)
+        except json.JSONDecodeError:
+            logger.warning(f"Failed to parse LLM response as JSON: {text[:200]}")
+            return text, {"type": "wait"}
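+        if not isinstance(parsed, dict):
+            # Valid JSON that is not an object (e.g. a list or bare string)
+            return text, {"type": "wait"}
+        thought = parsed.get("thought", "")
+        action = parsed.get("action")
+        # Validate action is a dict with a type
+        if not isinstance(action, dict):
+            action = {"type": "wait"}
+        elif "type" not in action:
+            action["type"] = "wait"
+        return thought, action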
+    async def _execute_action(self, action: Dict[str, Any]) -> str:
+        """Execute an action via Playwright and return observation."""
+        action_type = action.get("type", "wait")
+        pw = self._playwright_session
+        try:
+            if action_type == "click":
+                x = int(action.get("x", 0))
+                y = int(action.get("y", 0))
+                success = await pw.click(x, y)
+                return f"Clicked at ({x}, {y})" if success else f"Click failed at ({x}, {y})"
+            elif action_type == "type":
+                text = action.get("text", "")
+                # Handle control characters via keyboard.press
+                if text == "\b":
+                    await pw.page.keyboard.press("Backspace")
+                    return "Pressed Backspace"
+                elif text == "\n":
+                    await pw.page.keyboard.press("Enter")
+                    return "Pressed Enter"
+                elif text == "\t":
+                    await pw.page.keyboard.press("Tab")
+                    return "Pressed Tab"
+                else:
+                    success = await pw.type_text(text)
+                    return f"Typed '{text}'" if success else f"Type failed: '{text}'"
+            elif action_type == "scroll":
+                direction = action.get("direction", "down")
+                amount = int(action.get("amount", 300))
+                dy = amount if direction == "down" else -amount
+                success = await pw.scroll(0, dy)
+                return f"Scrolled {direction} by {amount}px" if success else "Scroll failed"
+            elif action_type == "navigate":
+                url = action.get("url", "")
+                success = await pw.navigate(url)
+                return f"Navigated to {url}" if success else f"Navigation failed: {url}"
+            elif action_type == "wait":
+                await asyncio.sleep(1)
+                return "Waited 1 second"
+            elif action_type == "done":
+                summary = action.get("summary", "Task completed")
+                return summary
+            else:
+                logger.warning(f"Unknown action type: {action_type}")
+                return f"Unknown action: {action_type}"
+        except Exception as e:
+            logger.error(f"Action execution error: {e}")
+            return f"Error executing {action_type}: {e}"
+    async def _process_takeover_actions(self):
+        """Process manual actions submitted during takeover mode."""
+        try:
+            action = self._takeover_actions.get_nowait()
+        except Empty:
+            return
+        pw = self._playwright_session
+        if not pw:
+            return
+        observation = await self._execute_action(action)
+        # Take screenshot after manual action
+        screenshot_bytes = await pw.screenshot()
+        step_index = len(self._steps)
+        screenshot_path = os.path.join(
+            self.screenshot_dir, f"step_{step_index:03d}_manual.png"
+        )
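+        if screenshot_bytes:
+            # A takeover can happen before any agent step has created the directory
+            os.makedirs(self.screenshot_dir, exist_ok=True)
+            with open(screenshot_path, "wb") as f:
+                f.write(screenshot_bytes)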
+        page_state = await pw.get_state()
+        step = AgentStep(
+            step_index=step_index,
+            screenshot_path=screenshot_path,
+            action={**action, "_manual": True},
+            thought="[Manual takeover action]",
+            observation=observation,
+            timestamp=time.time(),
+            url=page_state.get("url", ""),
+            viewport=page_state.get("viewport"),
+            coordinates=_extract_coordinates(action),
+        )
+        self._steps.append(step)
+        self._emit_event("step", step.to_dict())
+    # --- Trace export ---
+    def get_trace(self) -> Dict[str, Any]:
+        """Export the session as a web_agent_trace-compatible dict."""
+        return {
+            "steps": [s.to_dict() for s in self._steps],
+            "task_description": "",  # Set by caller
+            "session_id": self.session_id,
+            "agent_config": {
+                "model": self.config.model,
+                "endpoint_type": self.config.endpoint_type,
+                "max_steps": self.config.max_steps,
+            },
+            "annotator_interactions": self._interactions,
+            "state": self.state.value,
+            "total_steps": len(self._steps),
+        }
+    def get_state_summary(self) -> Dict[str, Any]:
+        """Get a summary of current state for API responses."""
+        return {
+            "session_id": self.session_id,
+            "state": self.state.value,
+            "step_count": len(self._steps),
+            "error": self._error,
+            "has_instructions_pending": not self._instruction_queue.empty(),
+        }
+def _extract_agent_json(text: str) -> str:
+    """Extract the last valid JSON object containing 'thought' or 'action' from text.
+    Some models (qwen3-vl) put their chain-of-thought in the thinking field
+    with the actual JSON answer embedded in the text. This function finds
+    that JSON, skipping any example/template JSON from the prompt.
+    """
+    import re
+    # Find all JSON-like blocks (balanced braces)
+    candidates = []
+    depth = 0
+    start = None
+    for i, ch in enumerate(text):
+        if ch == "{":
+            if depth == 0:
+                start = i
+            depth += 1
+        elif ch == "}" and depth > 0:
+            # Ignore unmatched closing braces so depth never goes negative
+            depth -= 1
+            if depth == 0 and start is not None:
+                candidates.append(text[start : i + 1])
+                start = None
+    # Try each candidate (last first — most likely to be the final answer)
+    for candidate in reversed(candidates):
+        try:
+            parsed = json.loads(candidate)
+            if isinstance(parsed, dict) and ("thought" in parsed or "action" in parsed):
+                return candidate
+        except (json.JSONDecodeError, ValueError):
+            continue
+    # Fallback: first flat (non-nested) brace-delimited block; may not be valid JSON
+    match = re.search(r"\{[^{}]*\}", text)
+    return match.group(0) if match else ""
+def _extract_coordinates(action: Dict[str, Any]) -> Optional[Dict[str, int]]:
+    """Extract x, y coordinates from an action if present."""
+    if "x" in action and "y" in action:
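+        try:
+            return {"x": int(action["x"]), "y": int(action["y"])}
+        except (TypeError, ValueError):
+            # Null or non-numeric coordinates from the model; treat as absent
+            return None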
+    return None
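
Review note on `_extract_agent_json` above: the brace-scanning approach can be sketched in isolation as follows. This is a minimal standalone illustration, not the module's code; the function names `balanced_brace_blocks` and `last_agent_json` are hypothetical. Like the original, it does not account for braces that appear inside JSON string values.

```python
import json
from typing import List, Optional


def balanced_brace_blocks(text: str) -> List[str]:
    """Return every top-level {...} span in text, in order of appearance."""
    blocks: List[str] = []
    depth = 0
    start: Optional[int] = None
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:  # skip unmatched closing braces
            depth -= 1
            if depth == 0 and start is not None:
                blocks.append(text[start : i + 1])
                start = None
    return blocks


def last_agent_json(text: str) -> str:
    """Return the last block that parses to a dict with 'thought' or 'action'."""
    for candidate in reversed(balanced_brace_blocks(text)):
        try:
            parsed = json.loads(candidate)
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            continue
        if isinstance(parsed, dict) and ("thought" in parsed or "action" in parsed):
            return candidate
    return ""
```

Scanning from the last candidate backwards is what lets the final answer win over any example JSON echoed earlier in the model's reasoning text.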

potato/agent_runner_manager.py ADDED

	@@ -0,0 +1,226 @@

+"""
+Agent Runner Session Manager
+Singleton that manages active AgentRunner sessions.
+Keyed by "{user_id}:{instance_id}" for per-user, per-instance isolation.
+Includes TTL-based cleanup and max concurrent session limits.
+"""
+import atexit
+import logging
+import threading
+import time
+from typing import Dict, Optional
+from potato.agent_runner import AgentConfig, AgentRunner, AgentState
+logger = logging.getLogger(__name__)
+# Default limits
+DEFAULT_MAX_SESSIONS = 10
+DEFAULT_SESSION_TTL = 3600  # 1 hour
+class AgentRunnerManager:
+    """
+    Manages active AgentRunner sessions with lifecycle control.
+    Thread-safe singleton. Sessions are keyed by "{user_id}:{instance_id}".
+    """
+    _instance = None
+    _lock = threading.Lock()
+    def __init__(
+        self,
+        max_sessions: int = DEFAULT_MAX_SESSIONS,
+        session_ttl: int = DEFAULT_SESSION_TTL,
+    ):
+        self._sessions: Dict[str, AgentRunner] = {}
+        self._session_created: Dict[str, float] = {}
+        self._session_meta: Dict[str, Dict] = {}
+        self._lock = threading.Lock()
+        self.max_sessions = max_sessions
+        self.session_ttl = session_ttl
+        # Start cleanup thread
+        self._cleanup_stop = threading.Event()
+        self._cleanup_thread = threading.Thread(
+            target=self._cleanup_loop, daemon=True, name="agent-cleanup"
+        )
+        self._cleanup_thread.start()
+    @classmethod
+    def get_instance(cls, **kwargs) -> "AgentRunnerManager":
+        """Get or create the singleton instance."""
+        if cls._instance is None:
+            with cls._lock:
+                if cls._instance is None:
+                    cls._instance = cls(**kwargs)
+        return cls._instance
+    @classmethod
+    def clear_instance(cls):
+        """Clear the singleton (for testing)."""
+        with cls._lock:
+            if cls._instance is not None:
+                cls._instance.shutdown()
+                cls._instance = None
+    def create_session(
+        self,
+        user_id: str,
+        instance_id: str,
+        config: AgentConfig,
+        screenshot_dir: str,
+    ) -> AgentRunner:
+        """
+        Create a new agent session.
+        Args:
+            user_id: Annotator user ID
+            instance_id: Annotation instance ID
+            config: Agent configuration
+            screenshot_dir: Directory to store screenshots
+        Returns:
+            AgentRunner instance
+        Raises:
+            RuntimeError: If max sessions reached or session already exists
+        """
+        session_key = f"{user_id}:{instance_id}"
+        with self._lock:
+            # Clean up expired sessions first
+            self._cleanup_expired_locked()
+            # Check for existing active session
+            if session_key in self._sessions:
+                existing = self._sessions[session_key]
+                if existing.state in (AgentState.RUNNING, AgentState.PAUSED, AgentState.TAKEOVER):
+                    raise RuntimeError(
+                        f"Active session already exists for {session_key}. "
+                        f"Stop it first."
+                    )
+                # Old completed/error session — remove it
+                del self._sessions[session_key]
+                del self._session_created[session_key]
+                if session_key in self._session_meta:
+                    del self._session_meta[session_key]
+            # Check capacity
+            active_count = sum(
+                1
+                for s in self._sessions.values()
+                if s.state in (AgentState.RUNNING, AgentState.PAUSED, AgentState.TAKEOVER)
+            )
+            if active_count >= self.max_sessions:
+                raise RuntimeError(
+                    f"Maximum concurrent sessions ({self.max_sessions}) reached"
+                )
+            import uuid
+            session_id = str(uuid.uuid4())[:12]
+            runner = AgentRunner(session_id, config, screenshot_dir)
+            self._sessions[session_key] = runner
+            self._session_created[session_key] = time.time()
+            self._session_meta[session_key] = {
+                "user_id": user_id,
+                "instance_id": instance_id,
+                "session_id": session_id,
+            }
+            logger.info(
+                f"Created agent session {session_id} for {session_key}"
+            )
+            return runner
+    def get_session(self, session_id: str) -> Optional[AgentRunner]:
+        """Get a session by its session_id."""
+        with self._lock:
+            for runner in self._sessions.values():
+                if runner.session_id == session_id:
+                    return runner
+        return None
+    def get_session_by_key(self, user_id: str, instance_id: str) -> Optional[AgentRunner]:
+        """Get a session by user_id and instance_id."""
+        session_key = f"{user_id}:{instance_id}"
+        with self._lock:
+            return self._sessions.get(session_key)
+    def remove_session(self, session_id: str):
+        """Remove a session by session_id."""
+        with self._lock:
+            key_to_remove = None
+            for key, runner in self._sessions.items():
+                if runner.session_id == session_id:
+                    key_to_remove = key
+                    break
+            if key_to_remove:
+                runner = self._sessions.pop(key_to_remove)
+                self._session_created.pop(key_to_remove, None)
+                self._session_meta.pop(key_to_remove, None)
+                runner.stop()
+                logger.info(f"Removed agent session {session_id}")
+    def list_sessions(self) -> list:
+        """List all tracked sessions, including finished ones not yet cleaned up."""
+        with self._lock:
+            result = []
+            for key, runner in self._sessions.items():
+                meta = self._session_meta.get(key, {})
+                result.append({
+                    "session_id": runner.session_id,
+                    "user_id": meta.get("user_id"),
+                    "instance_id": meta.get("instance_id"),
+                    "state": runner.state.value,
+                    "step_count": runner.step_count,
+                    "created": self._session_created.get(key),
+                })
+            return result
+    def _cleanup_expired_locked(self):
+        """Remove expired sessions. Must be called with self._lock held."""
+        now = time.time()
+        expired_keys = []
+        for key, created_at in self._session_created.items():
+            if now - created_at > self.session_ttl:
+                runner = self._sessions.get(key)
+                if runner and runner.state in (AgentState.COMPLETED, AgentState.ERROR, AgentState.IDLE):
+                    expired_keys.append(key)
+                elif runner and now - created_at > self.session_ttl * 2:
+                    # Force-stop sessions that have been running too long
+                    runner.stop()
+                    expired_keys.append(key)
+        for key in expired_keys:
+            self._sessions.pop(key, None)
+            self._session_created.pop(key, None)
+            self._session_meta.pop(key, None)
+            logger.info(f"Cleaned up expired session: {key}")
+    def _cleanup_loop(self):
+        """Background cleanup thread."""
+        while not self._cleanup_stop.is_set():
+            self._cleanup_stop.wait(60)  # Check every 60 seconds
+            if self._cleanup_stop.is_set():
+                break
+            with self._lock:
+                self._cleanup_expired_locked()
+    def shutdown(self):
+        """Stop all sessions and cleanup thread."""
+        self._cleanup_stop.set()
+        with self._lock:
+            for key, runner in self._sessions.items():
+                try:
+                    runner.stop()
+                except Exception as e:
+                    logger.warning(f"Error stopping session {key}: {e}")
+            self._sessions.clear()
+            self._session_created.clear()
+            self._session_meta.clear()
+        logger.info("AgentRunnerManager shut down")
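
Review note on `_cleanup_expired_locked` above: the expiry rule (finished sessions are dropped after one TTL, anything still active is force-stopped after two) can be expressed as a pure function, which makes it easy to test without threads or real runners. This is a hedged sketch: `expired_keys` is a hypothetical helper, and the state strings are assumed stand-ins for the `AgentState` values, which are not shown in this diff.

```python
from typing import Dict, List

# Assumed string stand-ins for AgentState.COMPLETED / ERROR / IDLE
FINISHED_STATES = {"completed", "error", "idle"}


def expired_keys(
    created: Dict[str, float],
    states: Dict[str, str],
    ttl: float,
    now: float,
) -> List[str]:
    """Return the session keys that the TTL rule would remove at time `now`."""
    expired: List[str] = []
    for key, created_at in created.items():
        age = now - created_at
        if age <= ttl:
            continue  # still within its TTL
        state = states.get(key)
        if state is None:
            continue  # no runner tracked for this key; leave it alone
        if state in FINISHED_STATES or age > 2 * ttl:
            # Finished sessions expire after one TTL; active ones after two
            expired.append(key)
    return expired
```

With the decision separated from the side effects, the manager would only need to stop and pop the returned keys while holding its lock.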

potato/agreement.py ADDED

	@@ -0,0 +1,278 @@

+"""
+Inter-Annotator Agreement Calculation Module
+This module provides functionality for calculating inter-annotator agreement metrics,
+including Krippendorff's alpha, Cohen's kappa (pairwise), and Fleiss' kappa
+(N raters), from annotation data. It supports both rating agreement (interval
+metric) and skip agreement (nominal metric) calculations.
+The module processes annotation files in JSON format and outputs agreement statistics
+along with a CSV file containing the processed annotation data.
+"""
+import argparse
+from itertools import combinations
+import simpledorff
+from simpledorff.metrics import *
+import ujson
+import pandas as pd
+from collections import defaultdict
+import numpy as np
+def get_nans(shape):
+    """
+    Create a numpy array filled with NaN values.
+    Args:
+        shape: The shape of the array to create
+    Returns:
+        numpy.ndarray: Array filled with NaN values
+    """
+    ar = np.empty(shape)
+    ar[:] = np.nan  # np.NaN alias was removed in NumPy 2.0
+    return ar
+def cohen_kappa_pairwise(reliability_df):
+    """
+    Compute Cohen's kappa for every pair of annotators and return aggregate stats.
+    Cohen's kappa is defined for exactly two raters. With N>2 raters we compute
+    kappa for each pair on the items they both rated, then return the mean and the
+    per-pair breakdown. Pairs that share fewer than 2 items are skipped.
+    Args:
+        reliability_df: long-format DataFrame with columns
+            unit (item id), annotator (user), annotation (label value).
+    Returns:
+        dict with keys: mean_kappa (float | None), pairs (list of
+            {annotator_a, annotator_b, kappa, n_items}), n_pairs_evaluated,
+            n_pairs_skipped.
+    """
+    from sklearn.metrics import cohen_kappa_score
+    annotators = sorted(reliability_df["annotator"].unique())
+    pairs = []
+    skipped = 0
+    for a, b in combinations(annotators, 2):
+        a_rows = reliability_df[reliability_df["annotator"] == a].set_index("unit")["annotation"]
+        b_rows = reliability_df[reliability_df["annotator"] == b].set_index("unit")["annotation"]
+        shared = a_rows.index.intersection(b_rows.index)
+        if len(shared) < 2:
+            skipped += 1
+            continue
+        y_a = a_rows.loc[shared].astype(str).tolist()
+        y_b = b_rows.loc[shared].astype(str).tolist()
+        try:
+            kappa = float(cohen_kappa_score(y_a, y_b))
+        except Exception:
+            skipped += 1
+            continue
+        pairs.append({
+            "annotator_a": a,
+            "annotator_b": b,
+            "kappa": round(kappa, 4),
+            "n_items": int(len(shared)),
+        })
+    mean_kappa = (sum(p["kappa"] for p in pairs) / len(pairs)) if pairs else None
+    return {
+        "mean_kappa": round(mean_kappa, 4) if mean_kappa is not None else None,
+        "pairs": pairs,
+        "n_pairs_evaluated": len(pairs),
+        "n_pairs_skipped": skipped,
+    }
+def fleiss_kappa(reliability_df):
+    """
+    Compute Fleiss' kappa for N raters over a categorical label set.
+    Fleiss' kappa assumes the same number of ratings per item but tolerates
+    different rater identities per item. Items with fewer than 2 ratings are
+    dropped; each remaining item's category counts are rescaled proportionally
+    so that they sum to `n_raters` (the maximum number of ratings on any item).
+    When per-item rater counts vary widely the metric is approximate; we report
+    `n_raters` and `n_items_evaluated` so the caller can judge.
+    Args:
+        reliability_df: long-format DataFrame with columns
+            unit (item id), annotator (user), annotation (label value).
+    Returns:
+        dict with keys: kappa (float | None), n_items_evaluated (int),
+            n_raters (int), n_categories (int), interpretation (str).
+    """
+    if reliability_df.empty:
+        return {"kappa": None, "n_items_evaluated": 0, "n_raters": 0,
+                "n_categories": 0, "interpretation": "No data"}
+    df = reliability_df.copy()
+    df["annotation"] = df["annotation"].astype(str)
+    counts_by_item = df.groupby(["unit", "annotation"]).size().unstack(fill_value=0)
+    items_with_ratings = counts_by_item.sum(axis=1)
+    counts_by_item = counts_by_item.loc[items_with_ratings >= 2]
+    if counts_by_item.empty:
+        return {"kappa": None, "n_items_evaluated": 0, "n_raters": 0,
+                "n_categories": int(df["annotation"].nunique()),
+                "interpretation": "No items with >=2 raters"}
+    n_raters = int(counts_by_item.sum(axis=1).max())
+    n_items = int(counts_by_item.shape[0])
+    n_categories = int(counts_by_item.shape[1])
+    matrix = counts_by_item.to_numpy(dtype=float)
+    row_sums = matrix.sum(axis=1, keepdims=True)
+    row_sums[row_sums == 0] = 1.0
+    matrix = matrix * (n_raters / row_sums)
+    p_j = matrix.sum(axis=0) / (n_items * n_raters)
+    if n_raters < 2:
+        return {"kappa": None, "n_items_evaluated": n_items, "n_raters": n_raters,
+                "n_categories": n_categories,
+                "interpretation": "Need >=2 raters per item"}
+    p_i = (np.sum(matrix ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
+    p_bar = float(p_i.mean())
+    p_e = float(np.sum(p_j ** 2))
+    if p_e >= 1.0:
+        kappa = 1.0 if p_bar >= 1.0 else 0.0
+    else:
+        kappa = (p_bar - p_e) / (1 - p_e)
+    return {
+        "kappa": round(float(kappa), 4),
+        "n_items_evaluated": n_items,
+        "n_raters": n_raters,
+        "n_categories": n_categories,
+        "interpretation": interpret_kappa(kappa),
+    }
+def interpret_kappa(kappa):
+    """Landis & Koch (1977) interpretation bands for kappa-family metrics."""
+    if kappa is None:
+        return "No agreement computable"
+    if kappa < 0:
+        return "Worse than chance"
+    if kappa < 0.21:
+        return "Slight"
+    if kappa < 0.41:
+        return "Fair"
+    if kappa < 0.61:
+        return "Moderate"
+    if kappa < 0.81:
+        return "Substantial"
+    return "Almost perfect"
+def flatten(annotations):
+    """
+    Flatten annotation data structure for processing.
+    Converts a list of per-item annotation lists into a list with one
+    dictionary per item, mapping each user ID to that user's label.
+    Args:
+        annotations: List of items, each a list of {"user", "label"} dictionaries
+    Returns:
+        list: One {user: label} dictionary per item
+    Example:
+        Input: [[{"user": "user1", "label": "positive"}, {"user": "user2", "label": "negative"}]]
+        Output: [{"user1": "positive", "user2": "negative"}]
+    """
+    return [{a["user"]: a["label"] for a in ann} for ann in annotations]
+def main(args):
+    """
+    Main function for calculating inter-annotator agreement.
+    This function processes annotation data from a JSON file, calculates
+    Krippendorff's alpha for both rating agreement and skip agreement,
+    and outputs the results along with a CSV file of the processed data.
+    Args:
+        args: Command line arguments containing file paths
+    Side Effects:
+        - Reads annotation data from input file
+        - Prints agreement statistics to console
+        - Writes processed data to output CSV file
+    Only the first 385 items are processed (a hard-coded limit). Missing
+    annotations and skipped items (label -1) are treated as NaN in the rating matrix.
+    """
+    # Load annotation data from JSON file
+    with open(args.file, "r") as f:
+        annotations = [ujson.loads(line)["annotations"] for line in f]
+    # Extract unique user IDs from all annotations
+    users = set([a["user"] for ann in annotations for a in ann])
+    annotations = flatten(annotations)
+    # Limit to the first 385 items (hard-coded limit)
+    annotations = annotations[:385]
+    # Create data matrix for agreement calculation
+    # Each row represents a user, each column represents an item
+    # Skipped (-1) and missing annotations are both stored as NaN
+    data = [
+        [np.nan if user not in a or int(a[user]) == -1 else int(a[user]) for a in annotations]
+        for user in users
+    ]
+    # Create skip data matrix (boolean indicating if annotation was skipped)
+    skip_data = [
+        [np.nan if user not in a else int(a[user]) < 0 for a in annotations] for user in users
+    ]
+    # Calculate statistics for each user
+    labeled = ~np.isnan(data)
+    skipped = [
+        [False if user not in a else int(a[user]) < 0 for a in annotations] for user in users
+    ]
+    # Print summary statistics
+    print("calculating over:")
+    for user, skip in zip(labeled, skipped):
+        print("labeled:", sum(user))
+        print("skipped:", sum(skip))
+    # Count instances where all users provided annotations
+    print(np.all(labeled, axis=0).sum())
+    # Calculate and print Krippendorff's alpha for rating agreement
+    # Uses interval metric for continuous rating scales
+    print("rating agreement:")
+    print(simpledorff.calculate_krippendorffs_alpha(pd.DataFrame(data), metric_fn=interval_metric))
+    # Calculate and print Krippendorff's alpha for skip agreement
+    # Uses nominal metric for binary skip/no-skip decisions
+    print("skip agreement:")
+    print(simpledorff.calculate_krippendorffs_alpha(pd.DataFrame(skip_data), metric_fn=nominal_metric))
+    # Write processed data to CSV file
+    with open(args.outfile, "w") as f:
+        for row in zip(*data):
+            f.write(",".join([str(a) for a in row]) + "\n")
+if __name__ == "__main__":
+    # Set up command line argument parsing
+    parser = argparse.ArgumentParser(
+        description="Calculate Krippendorff's alpha from given JSON file of annotations"
+    )
+    parser.add_argument("file", help="path to JSON file")
+    parser.add_argument("outfile", help="write path to CSV")
+    main(parser.parse_args())
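
For reference, the Fleiss' kappa arithmetic used in `fleiss_kappa` above can be checked by hand on a toy input. The sketch below is not part of the commit: the function name `fleiss_kappa_equal_raters` and the toy data are hypothetical, and it assumes every item carries the same number of ratings, so no rescaling step is needed. It uses only the standard library.

```python
from collections import Counter


def fleiss_kappa_equal_raters(ratings_per_item):
    """Fleiss' kappa for items that all carry the same number of ratings.

    ratings_per_item: non-empty list of equal-length lists of labels.
    (Hypothetical helper for illustration; not the module's own function.)
    """
    if not ratings_per_item:
        raise ValueError("need at least one item")
    n_items = len(ratings_per_item)
    n_raters = len(ratings_per_item[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings_per_item):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # Per-item category counts, e.g. {"A": 2, "B": 1}
    counts = [Counter(r) for r in ratings_per_item]
    categories = {label for c in counts for label in c}

    # p_j: share of all ratings that fell into category j
    p_j = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    # P_i: proportion of agreeing rater pairs on item i
    p_i = [
        (sum(v * v for v in c.values()) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    if p_e >= 1.0:
        # Only one category was ever used; same convention as the code above.
        return 1.0 if p_bar >= 1.0 else 0.0
    return (p_bar - p_e) / (1 - p_e)


# Four items, three raters each.
toy = [["A", "A", "A"], ["A", "A", "B"], ["B", "B", "B"], ["A", "B", "B"]]
kappa = fleiss_kappa_equal_raters(toy)
print(round(kappa, 4))
```

On this toy input the mean observed agreement is 2/3 and the chance agreement is 1/2, so kappa works out to 1/3, which falls in the "Fair" band of `interpret_kappa`.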

potato/ai/__init__.py ADDED

	@@ -0,0 +1 @@


+from .ai_help_wrapper import generate_ai_help_html

potato/ai/ai_cache.py ADDED

	@@ -0,0 +1,1473 @@

+from __future__ import annotations
+import json
+import logging
+import os
+from typing import Any, Dict, Union
+import requests
+from tqdm import tqdm
+import time
+from concurrent.futures import ThreadPoolExecutor
+import threading
+from builtins import open
+from potato.server_utils.config_module import config
+logger = logging.getLogger(__name__)
+from potato.item_state_management import get_item_state_manager
+from potato.ai.ai_endpoint import (
+    AIEndpointFactory,
+    Annotation_Type,
+    AnnotationInput,
+    ImageData,
+    VisualAnnotationInput,
+    ModelCapabilities,
+)
+from potato.ai.ollama_endpoint import OllamaEndpoint
+from potato.ai.openrouter_endpoint import OpenRouterEndpoint
+from potato.ai.ai_prompt import ModelManager, get_ai_prompt
+AICACHEMANAGER = None
+def _get_scheme_field(annotation_id: int, field: str, default=None):
+    """Safely get a field from an annotation scheme with a clear error message."""
+    schemes = config.get("annotation_schemes", [])
+    if annotation_id >= len(schemes):
+        raise ValueError(
+            f"AI cache: annotation_id {annotation_id} out of range "
+            f"(only {len(schemes)} scheme(s) configured)"
+        )
+    scheme = schemes[annotation_id]
+    if default is not None:
+        return scheme.get(field, default)
+    if field not in scheme:
+        scheme_name = scheme.get("name", f"index {annotation_id}")
+        scheme_type = scheme.get("annotation_type", "unknown")
+        raise ValueError(
+            f"AI cache: annotation scheme '{scheme_name}' (type '{scheme_type}') "
+            f"missing required field '{field}'"
+        )
+    return scheme[field]
+def _get_instance_text(instance_id: int) -> str:
+    """Get the text content from an instance using the configured text_key."""
+    item = get_item_state_manager().items()[instance_id]
+    item_data = item.get_data()
+    # Get the configured text_key
+    text_key = config.get("item_properties", {}).get("text_key", "text")
+    # Try the configured text_key first
+    if text_key in item_data:
+        return item_data[text_key]
+    # Fall back to common keys
+    for key in ['text', 'content', 'message']:
+        if key in item_data:
+            return item_data[key]
+    # Last resort: return any string value
+    for value in item_data.values():
+        if isinstance(value, str):
+            return value
+    return str(item_data)
+def _is_image_url(text: str) -> bool:
+    """Check if text appears to be an image URL."""
+    if not isinstance(text, str):
+        return False
+    text_lower = text.lower()
+    # Check for image extensions
+    image_extensions = ['.jpg', '.jpeg', '.png', '.gif', '.webp', '.bmp']
+    if any(ext in text_lower for ext in image_extensions):
+        return True
+    # Check for common image hosting services
+    image_hosts = ['unsplash.com', 'imgur.com', 'flickr.com', 'picsum.photos']
+    if any(host in text_lower for host in image_hosts):
+        return True
+    # Check if URL starts with http and might be an image
+    if text_lower.startswith(('http://', 'https://')) and 'image' in text_lower:
+        return True
+    return False
+def _get_image_data_from_url(url: str) -> ImageData | None:
+    """Download image from URL and return as ImageData.
+    Includes SSRF protection to prevent fetching from private/internal IPs.
+    """
+    import base64
+    import ipaddress
+    import socket
+    from urllib.parse import urlparse
+    # SSRF protection: validate URL scheme and resolve hostname
+    try:
+        parsed = urlparse(url)
+        if parsed.scheme not in ('http', 'https'):
+            logger.warning(f"Blocked non-HTTP image URL: {url[:100]}")
+            return None
+        hostname = parsed.hostname
+        if hostname:
+            addr_info = socket.getaddrinfo(hostname, None)
+            for info in addr_info:
+                ip_str = info[4][0]
+                try:
+                    ip = ipaddress.ip_address(ip_str)
+                    if ip.is_private or ip.is_loopback or ip.is_link_local:
+                        logger.warning(
+                            f"Blocked image URL resolving to private IP: "
+                            f"{hostname} -> {ip_str}"
+                        )
+                        return None
+                except ValueError:
+                    pass
+    except Exception as e:
+        logger.warning(f"Failed to validate image URL {url[:100]}: {e}")
+        return None
+    try:
+        response = requests.get(url, timeout=30)
+        response.raise_for_status()
+        b64_data = base64.b64encode(response.content).decode('utf-8')
+        # Determine mime type from content-type header or URL
+        content_type = response.headers.get('content-type', 'image/jpeg')
+        return ImageData(source='base64', data=b64_data, mime_type=content_type)
+    except Exception as e:
+        logger.error(f"Failed to download image from {url}: {e}")
+        return None
+def init_ai_cache_manager():
+    global AICACHEMANAGER
+    if AICACHEMANAGER is None:
+        AICACHEMANAGER = AiCacheManager()
+    return AICACHEMANAGER
+def get_ai_cache_manager():
+    """Get the AI cache manager instance. Returns None if not initialized (AI support disabled)."""
+    global AICACHEMANAGER
+    return AICACHEMANAGER
+def clear_ai_cache_manager():
+    """Clear the AI cache manager singleton. Used for testing."""
+    global AICACHEMANAGER
+    AICACHEMANAGER = None
+class AiCacheManager:
+    def __init__(self):
+        ai_support = config["ai_support"]
+        if not ai_support["enabled"]:
+            return
+        cache_config = ai_support.get("cache_config", {})
+        ai_config = ai_support.get("ai_config", {})
+        include = ai_config.get("include") or {}
+        special_include = include.get("special_include", None)
+        self.include_all = include.get("all", False)
+        self.special_includes = {}
+        self.model_manager = ModelManager()
+        self.model_manager.load_models_module()
+        if special_include:
+            for page_key, page_value in special_include.items():
+                # Convert string keys to integers for easier lookup
+                page_index = int(page_key)
+                self.special_includes[page_index] = {}
+                for annotation_id, annotation_types in page_value.items():
+                    annotation_id_int = int(annotation_id)
+                    self.special_includes[page_index][annotation_id_int] = annotation_types
+        # Disk cache configuration.
+        # F-028: tolerate a partial/absent ai_cache config (e.g. AI support
+        # enabled for ICL with no disk_cache block) instead of crashing boot
+        # with KeyError: 'disk_cache'.
+        disk_cache_cfg = cache_config.get("disk_cache", {}) if isinstance(cache_config, dict) else {}
+        self.disk_cache_enabled = disk_cache_cfg.get("enabled", False)
+        disk_cache_path = disk_cache_cfg.get("path")
+        if self.disk_cache_enabled and not disk_cache_path:
+            raise Exception("Disk cache is enabled, but no cache path was specified!")
+        self.disk_persistence_path = disk_cache_path
+        # Validate cache path stays within task directory
+        if self.disk_persistence_path:
+            task_dir = os.path.abspath(config.get("task_dir", "."))
+            cache_abs = os.path.abspath(
+                os.path.join(task_dir, self.disk_persistence_path)
+                if not os.path.isabs(self.disk_persistence_path)
+                else self.disk_persistence_path
+            )
+            if not cache_abs.startswith(task_dir + os.sep) and cache_abs != task_dir:
+                raise ValueError(
+                    f"Cache path '{self.disk_persistence_path}' resolves to "
+                    f"'{cache_abs}' which is outside the task directory "
+                    f"'{task_dir}'. Path traversal is not allowed."
+                )
+        # Prefetch configuration — clamp to sane ranges.
+        # F-028: default to no prefetch when the prefetch block is absent
+        # (e.g. cache_config: {enabled: false}) instead of KeyError on boot.
+        prefetch_cfg = cache_config.get("prefetch", {}) if isinstance(cache_config, dict) else {}
+        self.warm_up_page_count = max(0, min(int(prefetch_cfg.get("warm_up_page_count", 0)), 10000))
+        self.prefetch_page_count_on_next = max(0, min(int(prefetch_cfg.get("on_next", 0)), 10000))
+        self.prefetch_page_count_on_prev = max(0, min(int(prefetch_cfg.get("on_prev", 0)), 10000))
+        # Option highlighting configuration
+        option_highlighting = ai_support.get("option_highlighting", {})
+        self.option_highlighting_enabled = option_highlighting.get("enabled", False)
+        self.option_highlighting_top_k = option_highlighting.get("top_k", 3)
+        self.option_highlighting_dim_opacity = option_highlighting.get("dim_opacity", 0.4)
+        self.option_highlighting_auto_apply = option_highlighting.get("auto_apply", True)
+        self.option_highlighting_schemas = option_highlighting.get("schemas", None)  # None means all
+        # Prefetch count for option highlighting — clamp to sane range
+        self.option_highlighting_prefetch_count = max(0, min(
+            int(option_highlighting.get("prefetch_count", 20)), 10000
+        ))
+        # Threading
+        self.in_progress = {}
+        self.lock = threading.RLock()
+        self.executor = ThreadPoolExecutor(max_workers=20)
+        AIEndpointFactory.register_endpoint("ollama", OllamaEndpoint)
+        AIEndpointFactory.register_endpoint("open_router", OpenRouterEndpoint)
+        # Register visual AI endpoints
+        try:
+            from potato.ai.yolo_endpoint import YOLOEndpoint
+            AIEndpointFactory.register_endpoint("yolo", YOLOEndpoint)
+        except ImportError:
+            logger.debug("YOLO endpoint not available (ultralytics not installed)")
+        try:
+            from potato.ai.ollama_vision_endpoint import OllamaVisionEndpoint
+            AIEndpointFactory.register_endpoint("ollama_vision", OllamaVisionEndpoint)
+        except ImportError:
+            logger.debug("Ollama Vision endpoint not available")
+        try:
+            from potato.ai.openai_vision_endpoint import OpenAIVisionEndpoint
+            AIEndpointFactory.register_endpoint("openai_vision", OpenAIVisionEndpoint)
+        except ImportError:
+            logger.debug("OpenAI Vision endpoint not available")
+        try:
+            from potato.ai.anthropic_vision_endpoint import AnthropicVisionEndpoint
+            AIEndpointFactory.register_endpoint("anthropic_vision", AnthropicVisionEndpoint)
+        except ImportError:
+            logger.debug("Anthropic Vision endpoint not available")
+        # Degrade gracefully if the AI backend (e.g. a local Ollama/vLLM server)
+        # is unreachable at boot: log a warning and serve the task with AI
+        # support disabled rather than aborting server startup.
+        try:
+            self.ai_endpoint = AIEndpointFactory.create_endpoint(config)
+        except Exception as e:
+            logger.warning(
+                "AI endpoint unavailable at startup (%s). Continuing with AI "
+                "support disabled. Check that your AI backend is running.", e
+            )
+            self.ai_endpoint = None
+        # Create visual endpoint if different from main endpoint
+        self.visual_endpoint = None
+        visual_endpoint_type = config.get("ai_support", {}).get("visual_endpoint_type")
+        if visual_endpoint_type and visual_endpoint_type != config.get("ai_support", {}).get("endpoint_type"):
+            visual_config = {
+                "ai_support": {
+                    "enabled": True,
+                    "endpoint_type": visual_endpoint_type,
+                    "ai_config": config.get("ai_support", {}).get("visual_ai_config", config.get("ai_support", {}).get("ai_config", {}))
+                }
+            }
+            try:
+                self.visual_endpoint = AIEndpointFactory.create_endpoint(visual_config)
+            except Exception as e:
+                logger.warning(
+                    "Visual AI endpoint unavailable at startup (%s). Continuing "
+                    "without visual AI support.", e
+                )
+                self.visual_endpoint = None
+        annotation_scheme = config.get("annotation_schemes")
+        self.annotations = []
+        for scheme in annotation_scheme:
+            self.annotations.append(scheme)
+        # Check if main endpoint supports vision
+        self.endpoint_supports_vision = hasattr(self.ai_endpoint, 'query_with_image')
+        logger.info(f"AI endpoint supports vision: {self.endpoint_supports_vision}")
+        # Initialize cache
+        if self.disk_cache_enabled:
+            self.load_cache_from_disk()
+            self.start_warmup()
+    def _validate_assistant_compatibility(
+        self, instance_id: int, annotation_id: int, ai_assistant: str
+    ) -> tuple:
+        """
+        Validate that the AI assistant is compatible with the input type and model capabilities.
+        Args:
+            instance_id: The instance/item index
+            annotation_id: The annotation scheme index
+            ai_assistant: Type of assistance ('hint', 'keyword', 'rationale', 'detection', etc.)
+        Returns:
+            Tuple of (is_valid: bool, error_message: str)
+            If valid, error_message is empty string.
+        """
+        try:
+            text = _get_instance_text(instance_id)
+            is_image = _is_image_url(text)
+            # Determine which endpoint to use
+            if is_image and self.visual_endpoint:
+                endpoint = self.visual_endpoint
+            elif is_image and self.endpoint_supports_vision:
+                endpoint = self.ai_endpoint
+            else:
+                endpoint = self.ai_endpoint
+            # Get capabilities from endpoint
+            capabilities = getattr(endpoint, 'CAPABILITIES', None)
+            if capabilities is None:
+                # No capabilities declared - allow all (backward compatibility)
+                logger.debug(f"Endpoint {type(endpoint).__name__} has no CAPABILITIES, allowing {ai_assistant}")
+                return True, ""
+            # Check if the assistant type is supported
+            if not capabilities.supports_assistant(ai_assistant, is_image):
+                input_type = "image" if is_image else "text"
+                return False, (
+                    f"Model {type(endpoint).__name__} does not support '{ai_assistant}' "
+                    f"for {input_type} content"
+                )
+            return True, ""
+        except Exception as e:
+            logger.warning(f"Error validating assistant compatibility: {e}")
+            # On validation error, allow the request (fail open for now)
+            return True, ""
+    def get_endpoint_capabilities(self, for_image: bool = False) -> ModelCapabilities:
+        """
+        Get the capabilities of the appropriate endpoint for the given input type.
+        Args:
+            for_image: Whether the input is an image
+        Returns:
+            ModelCapabilities instance, or a default permissive one if not declared
+        """
+        if for_image and self.visual_endpoint:
+            endpoint = self.visual_endpoint
+        elif for_image and self.endpoint_supports_vision:
+            endpoint = self.ai_endpoint
+        else:
+            endpoint = self.ai_endpoint
+        capabilities = getattr(endpoint, 'CAPABILITIES', None)
+        if capabilities is None:
+            # Return permissive defaults for backward compatibility
+            return ModelCapabilities(
+                text_generation=True,
+                vision_input=for_image,
+                bounding_box_output=False,
+                text_classification=True,
+                image_classification=for_image,
+                rationale_generation=True,
+                keyword_extraction=not for_image,
+            )
+        return capabilities
+    def _get_ai_with_vision_support(self, text: str, prompt: str, output_format) -> str:
+        """
+        Get AI response, using vision if text is an image URL and endpoint supports it.
+        """
+        # Check if we should use vision
+        if self.endpoint_supports_vision and _is_image_url(text):
+            logger.debug(f"Using vision query for image URL: {text[:50]}...")
+            image_data = _get_image_data_from_url(text)
+            if image_data:
+                try:
+                    return self.ai_endpoint.query_with_image(prompt, image_data, output_format)
+                except Exception as e:
+                    logger.error(f"Vision query failed: {e}")
+                    # Fall back to text query
+        # Fall back to regular text query
+        return self.ai_endpoint.query(prompt, output_format)
+    def start_warmup(self):
+        self.start_prefetch(0, self.warm_up_page_count)
+        # Also prefetch option highlights if enabled
+        if self.option_highlighting_enabled:
+            self.start_option_highlight_prefetch(0, self.warm_up_page_count)
+        total = len(self.in_progress)
+        desc = "Preloading the AI"
+        progress_bar = tqdm(total=total, desc=desc, unit="item")
+        def count_completed():
+            return total - len(self.in_progress)
+        prev_done = 0
+        while self.in_progress:
+            current_done = count_completed()
+            progress_bar.update(current_done - prev_done)
+            prev_done = current_done
+            time.sleep(0.2)
+        final_done = count_completed()
+        if final_done > prev_done:
+            progress_bar.update(final_done - prev_done)
+        progress_bar.close()
+    def load_disk_cache_data(self, file_path: str) -> Dict[str, Any]:
+        """Load the cache JSON from disk and return a dictionary mapping stringified keys to values."""
+        try:
+            with open(file_path, 'r', encoding='utf-8') as f:
+                return json.load(f)
+        except Exception as e:
+            logger.error(f"Error loading disk cache: {e}")
+            return {}
+    def load_cache_from_disk(self):
+        """Initializes disk cache file if it doesn't exist."""
+        if not self.disk_cache_enabled or not self.disk_persistence_path:
+            return
+        if os.path.exists(self.disk_persistence_path):
+            data = self.load_disk_cache_data(self.disk_persistence_path)
+            logger.info(f"Disk cache initialized with {len(data)} items")
+        else:
+            try:
+                # Create parent directory if it doesn't exist ("" means the current directory)
+                os.makedirs(os.path.dirname(self.disk_persistence_path) or ".", exist_ok=True)
+                with open(self.disk_persistence_path, 'w', encoding='utf-8') as file:
+                    json.dump({}, file)
+                logger.info(f"Initialized empty disk cache at {self.disk_persistence_path}")
+            except Exception as e:
+                logger.error(f"Failed to create disk cache: {e}")
+    def save_cache_to_disk(self, key, value):
+        """Save a single key-value pair to the disk cache using an atomic write."""
+        if not self.disk_cache_enabled or not self.disk_persistence_path:
+            return
+        try:
+            # dirname is "" for a bare filename; fall back to the current directory
+            os.makedirs(os.path.dirname(self.disk_persistence_path) or ".", exist_ok=True)
+            # Load existing disk data first
+            existing_disk_data = {}
+            if os.path.exists(self.disk_persistence_path):
+                existing_disk_data = self.load_disk_cache_data(self.disk_persistence_path)
+            # Add the new key-value pair
+            existing_disk_data[str(key)] = value
+            temp_path = self.disk_persistence_path + ".tmp"
+            with open(temp_path, 'w', encoding='utf-8') as f:
+                json.dump(existing_disk_data, f, indent=2, ensure_ascii=False)
+            os.replace(temp_path, self.disk_persistence_path)  # overwrites atomically, also on Windows
+        except Exception as e:
+            logger.error(f"Error saving cache to disk: {e}")
+    def add_to_cache(self, key, value):
+        """Insert a key-value pair into the disk cache."""
+        with self.lock:
+            if self.disk_cache_enabled:
+                self.save_cache_to_disk(key, value)
+    def get_from_cache(self, key):
+        """Tries to retrieve the item from disk cache."""
+        with self.lock:
+            # Try disk cache
+            if self.disk_cache_enabled and self.disk_persistence_path and os.path.exists(self.disk_persistence_path):
+                try:
+                    disk_data = self.load_disk_cache_data(self.disk_persistence_path)
+                    key_str = str(key)
+                    if key_str in disk_data:
+                        return disk_data[key_str]
+                except Exception as e:
+                    logger.error(f"Error reading from disk: {e}")
+            return None
+    def generate_likert(self, instance_id: int, annotation_id: int, ai_assistant: str) -> str:
+        from string import Template
+        annotation_type = _get_scheme_field(annotation_id, "annotation_type")
+        description = _get_scheme_field(annotation_id, "description")
+        text = _get_instance_text(instance_id)
+        min_label = _get_scheme_field(annotation_id, "min_label")
+        max_label = _get_scheme_field(annotation_id, "max_label")
+        size = _get_scheme_field(annotation_id, "size")
+        ai_prompt = get_ai_prompt()
+        output_format = self.model_manager.get_model_class_by_name(ai_prompt[annotation_type].get(ai_assistant).get("output_format"))
+        # Check if we should use vision endpoint for image-based content
+        if self.endpoint_supports_vision and _is_image_url(text):
+            logger.debug(f"Using vision for likert {ai_assistant} on image: {text[:50]}...")
+            image_data = _get_image_data_from_url(text)
+            if image_data:
+                # Build vision-specific prompts based on ai_assistant type
+                if ai_assistant == "hint":
+                    prompt = f"""Look at this image and help with the following annotation task:
+Task: {description}
+Rating scale: {size} points, from "{min_label}" (1) to "{max_label}" ({size})
+Please analyze the image and suggest an appropriate rating with a brief explanation.
+Respond in JSON format: {{"hint": "<explanation>", "suggestive_choice": "<rating label>"}}"""
+                elif ai_assistant == "rationale":
+                    prompt = f"""Look at this image and explain the reasoning for different rating choices:
+Task: {description}
+Rating scale: {size} points, from "{min_label}" (1) to "{max_label}" ({size})
+For each possible rating, explain what visual evidence in the image would support that rating.
+Respond in JSON format: {{"rationales": [{{"label": "<rating>", "reasoning": "<explanation>"}}]}}"""
+                elif ai_assistant == "keyword":
+                    prompt = f"""Look at this image and identify visual features relevant to the rating task:
+Task: {description}
+Rating scale: {size} points, from "{min_label}" (1) to "{max_label}" ({size})
+Identify key visual elements that would influence the rating.
+Respond in JSON format: {{"keywords": ["<visual_feature_1>", "<visual_feature_2>"]}}"""
+                else:
+                    prompt = f"Analyze this image for: {description}"
+                try:
+                    return self.ai_endpoint.query_with_image(prompt, image_data, output_format)
+                except Exception as e:
+                    logger.error(f"Vision query failed for likert {ai_assistant}: {e}")
+        # Fall back to standard text-based generation
+        data = AnnotationInput(
+            ai_assistant=ai_assistant,
+            annotation_type=annotation_type,
+            text=text,
+            description=description,
+            min_label=min_label,
+            max_label=max_label,
+            size=size
+        )
+        res = self.ai_endpoint.get_ai(data, output_format)
+        return res
+    def generate_multiselect(self, instance_id: int, annotation_id: int, ai_assistant: str) -> str:
+        annotation_type = _get_scheme_field(annotation_id, "annotation_type")
+        description = _get_scheme_field(annotation_id, "description")
+        labels = _get_scheme_field(annotation_id, "labels")
+        text = _get_instance_text(instance_id)
+        ai_prompt = get_ai_prompt()
+        output_format = self.model_manager.get_model_class_by_name(ai_prompt[annotation_type].get(ai_assistant).get("output_format"))
+        # Check if we should use vision endpoint for image-based content
+        if self.endpoint_supports_vision and _is_image_url(text):
+            logger.debug(f"Using vision for multiselect {ai_assistant} on image: {text[:50]}...")
+            image_data = _get_image_data_from_url(text)
+            if image_data:
+                # Format labels for the prompt
+                label_names = [l.get('name', l) if isinstance(l, dict) else l for l in labels]
+                labels_str = ', '.join(f'"{name}"' for name in label_names)
+                # Build vision-specific prompts based on ai_assistant type
+                if ai_assistant == "hint":
+                    prompt = f"""Look at this image and help with the following annotation task:
+Task: {description}
+Available options (select all that apply): {labels_str}
+Please analyze the image and suggest which options apply.
+Respond in JSON format: {{"hint": "<explanation>", "suggestive_choices": ["<option1>", "<option2>"]}}"""
+                elif ai_assistant == "rationale":
+                    prompt = f"""Look at this image and explain the reasoning for each option:
+Task: {description}
+Available options: {labels_str}
+For each option, explain what visual evidence supports or contradicts it.
+Respond in JSON format: {{"rationales": [{{"label": "<option>", "reasoning": "<explanation>"}}]}}"""
+                elif ai_assistant == "keyword":
+                    prompt = f"""Look at this image and identify visual features for each option:
+Task: {description}
+Available options: {labels_str}
+For each option, identify visual cues that indicate its presence.
+Respond in JSON format: {{"label_keywords": [{{"label": "<option>", "keywords": ["<feature1>", "<feature2>"]}}]}}"""
+                else:
+                    prompt = f"Analyze this image for: {description}. Options: {labels_str}"
+                try:
+                    return self.ai_endpoint.query_with_image(prompt, image_data, output_format)
+                except Exception as e:
+                    logger.error(f"Vision query failed for multiselect {ai_assistant}: {e}")
+        # Fall back to standard text-based generation
+        data = AnnotationInput(
+            ai_assistant=ai_assistant,
+            annotation_type=annotation_type,
+            text=text,
+            description=description,
+            labels=labels
+        )
+        res = self.ai_endpoint.get_ai(data, output_format)
+        return res
+    def generate_radio(self, instance_id: int, annotation_id: int, ai_assistant: str) -> str:
+        annotation_type = _get_scheme_field(annotation_id, "annotation_type")
+        description = _get_scheme_field(annotation_id, "description")
+        text = _get_instance_text(instance_id)
+        labels = _get_scheme_field(annotation_id, "labels")
+        ai_prompt = get_ai_prompt()
+        output_format = self.model_manager.get_model_class_by_name(ai_prompt[annotation_type].get(ai_assistant).get("output_format"))
+        # Check if we should use vision endpoint for image-based content
+        if self.endpoint_supports_vision and _is_image_url(text):
+            logger.debug(f"Using vision for radio {ai_assistant} on image: {text[:50]}...")
+            image_data = _get_image_data_from_url(text)
+            if image_data:
+                # Format labels for the prompt
+                label_names = [l.get('name', l) if isinstance(l, dict) else l for l in labels]
+                labels_str = ', '.join(f'"{name}"' for name in label_names)
+                # Build vision-specific prompts based on ai_assistant type
+                if ai_assistant == "hint":
+                    prompt = f"""Look at this image and help with the following annotation task:
+Task: {description}
+Available options: {labels_str}
+Please analyze the image and suggest the most appropriate option.
+Respond in JSON format: {{"hint": "<explanation>", "suggestive_choice": "<selected option>"}}"""
+                elif ai_assistant == "rationale":
+                    prompt = f"""Look at this image and explain the reasoning for each option:
+Task: {description}
+Available options: {labels_str}
+For each option, explain what visual evidence in the image supports or contradicts it.
+Respond in JSON format: {{"rationales": [{{"label": "<option>", "reasoning": "<explanation>"}}]}}"""
+                elif ai_assistant == "keyword":
+                    prompt = f"""Look at this image and identify visual features for each option:
+Task: {description}
+Available options: {labels_str}
+For each option, identify visual cues that would indicate its presence.
+Respond in JSON format: {{"label_keywords": [{{"label": "<option>", "keywords": ["<feature1>", "<feature2>"]}}]}}"""
+                else:
+                    prompt = f"Analyze this image for: {description}. Options: {labels_str}"
+                try:
+                    return self.ai_endpoint.query_with_image(prompt, image_data, output_format)
+                except Exception as e:
+                    logger.error(f"Vision query failed for radio {ai_assistant}: {e}")
+        # Fall back to standard text-based generation
+        data = AnnotationInput(
+            ai_assistant=ai_assistant,
+            annotation_type=annotation_type,
+            text=text,
+            description=description,
+            labels=labels
+        )
+        res = self.ai_endpoint.get_ai(data, output_format)
+        return res
+    def generate_number(self, instance_id: int, annotation_id: int, ai_assistant: str) -> str:
+        annotation_type = _get_scheme_field(annotation_id, "annotation_type")
+        description = _get_scheme_field(annotation_id, "description")
+        text = _get_instance_text(instance_id)
+        data = AnnotationInput(
+            ai_assistant=ai_assistant,
+            annotation_type=annotation_type,
+            text=text,
+            description=description,
+        )
+        ai_prompt = get_ai_prompt()
+        output_format = self.model_manager.get_model_class_by_name(ai_prompt[annotation_type].get(ai_assistant).get("output_format"))
+        res = self.ai_endpoint.get_ai(data, output_format)
+        return res
+    def generate_select(self, instance_id: int, annotation_id: int, ai_assistant: str) -> str:
+        annotation_type = _get_scheme_field(annotation_id, "annotation_type")
+        description = _get_scheme_field(annotation_id, "description")
+        labels = _get_scheme_field(annotation_id, "labels")
+        text = _get_instance_text(instance_id)
+        data = AnnotationInput(
+            ai_assistant=ai_assistant,
+            annotation_type=annotation_type,
+            text=text,
+            description=description,
+            labels=labels
+        )
+        ai_prompt = get_ai_prompt()
+        output_format = self.model_manager.get_model_class_by_name(ai_prompt[annotation_type].get(ai_assistant).get("output_format"))
+        res = self.ai_endpoint.get_ai(data, output_format)
+        return res
+    def generate_slider(self, instance_id: int, annotation_id: int, ai_assistant: str) -> str:
+        annotation_type = _get_scheme_field(annotation_id, "annotation_type")
+        description = _get_scheme_field(annotation_id, "description")
+        min_value = _get_scheme_field(annotation_id, "min_value")
+        max_value = _get_scheme_field(annotation_id, "max_value")
+        step = _get_scheme_field(annotation_id, "step", default=1)
+        text = _get_instance_text(instance_id)
+        data = AnnotationInput(
+            ai_assistant=ai_assistant,
+            annotation_type=annotation_type,
+            text=text,
+            description=description,
+            min_value=min_value,
+            max_value=max_value,
+            step=step
+        )
+        ai_prompt = get_ai_prompt()
+        output_format = self.model_manager.get_model_class_by_name(ai_prompt[annotation_type].get(ai_assistant).get("output_format"))
+        res = self.ai_endpoint.get_ai(data, output_format)
+        return res
+    def generate_span(self, instance_id: int, annotation_id: int, ai_assistant: str) -> str:
+        annotation_type = _get_scheme_field(annotation_id, "annotation_type")
+        description = _get_scheme_field(annotation_id, "description")
+        labels = _get_scheme_field(annotation_id, "labels")
+        text = _get_instance_text(instance_id)
+        data = AnnotationInput(
+            ai_assistant=ai_assistant,
+            annotation_type=annotation_type,
+            text=text,
+            description=description,
+            labels=labels
+        )
+        ai_prompt = get_ai_prompt()
+        logger.debug(f"Generating span annotation with labels: {labels}")
+        output_format = self.model_manager.get_model_class_by_name(ai_prompt[annotation_type].get(ai_assistant).get("output_format"))
+        res = self.ai_endpoint.get_ai(data, output_format)
+        return res
+    def generate_textbox(self, instance_id: int, annotation_id: int, ai_assistant: str) -> str:
+        logger.debug(f"Generating textbox for annotation_id: {annotation_id}")
+        annotation_type = _get_scheme_field(annotation_id, "annotation_type")
+        description = _get_scheme_field(annotation_id, "description")
+        text = _get_instance_text(instance_id)
+        data = AnnotationInput(
+            ai_assistant=ai_assistant,
+            annotation_type=annotation_type,
+            text=text,
+            description=description,
+        )
+        ai_prompt = get_ai_prompt()
+        output_format = self.model_manager.get_model_class_by_name(ai_prompt[annotation_type].get(ai_assistant).get("output_format"))
+        res = self.ai_endpoint.get_ai(data, output_format)
+        return res
+    def generate_image_annotation(self, instance_id: int, annotation_id: int, ai_assistant: str) -> Dict:
+        """Generate AI assistance for image annotation tasks.
+        Args:
+            instance_id: The instance/item index
+            annotation_id: The annotation scheme index
+            ai_assistant: Type of assistance ('detection', 'classification', 'hint', 'pre_annotate', etc.)
+        Returns:
+            Dict with AI suggestions (detections, classifications, hints, etc.)
+        """
+        logger.debug(f"Generating image annotation for instance={instance_id}, annotation={annotation_id}, assistant={ai_assistant}")
+        annotation_type = _get_scheme_field(annotation_id, "annotation_type")
+        description = _get_scheme_field(annotation_id, "description", default="")
+        labels = _get_scheme_field(annotation_id, "labels", default=[])
+        # Extract label names if labels are dicts
+        if labels and isinstance(labels[0], dict):
+            labels = [l.get("name", str(l)) for l in labels]
+        # Get image URL from item data
+        item_data = get_item_state_manager().items()[instance_id].get_data()
+        image_url = self._extract_image_url(item_data)
+        if not image_url:
+            return {"error": "No image URL found in instance data"}
+        # Determine which endpoint to use
+        endpoint = self._get_visual_endpoint()
+        if not endpoint:
+            return {"error": "No visual AI endpoint configured"}
+        # Check if endpoint supports visual queries
+        if not hasattr(endpoint, 'query_with_image'):
+            # Fall back to text-based hint
+            return self._generate_text_hint_for_visual(instance_id, annotation_id, ai_assistant)
+        # Prepare image data
+        image_data = self._prepare_image_data(image_url)
+        # Get confidence threshold from config
+        confidence_threshold = _get_scheme_field(annotation_id, "ai_support", default={}).get(
+            "confidence_threshold", 0.5
+        )
+        # Build VisualAnnotationInput
+        data = VisualAnnotationInput(
+            ai_assistant=ai_assistant,
+            annotation_type=annotation_type,
+            task_type=ai_assistant,  # detection, classification, hint, etc.
+            image_data=image_data,
+            description=description,
+            labels=labels,
+            confidence_threshold=confidence_threshold
+        )
+        # Get output format from prompt config
+        ai_prompt = get_ai_prompt()
+        prompt_config = ai_prompt.get(annotation_type, {}).get(ai_assistant, {})
+        output_format_name = prompt_config.get("output_format", "visual_detection")
+        output_format = self.model_manager.get_model_class_by_name(output_format_name)
+        # Query the visual endpoint
+        result = endpoint.get_visual_ai(data, output_format)
+        return result
+    def generate_video_annotation(self, instance_id: int, annotation_id: int, ai_assistant: str) -> Dict:
+        """Generate AI assistance for video annotation tasks.
+        Args:
+            instance_id: The instance/item index
+            annotation_id: The annotation scheme index
+            ai_assistant: Type of assistance ('scene_detection', 'frame_classification', etc.)
+        Returns:
+            Dict with AI suggestions (segments, keyframes, etc.)
+        """
+        logger.debug(f"Generating video annotation for instance={instance_id}, annotation={annotation_id}, assistant={ai_assistant}")
+        annotation_type = _get_scheme_field(annotation_id, "annotation_type")
+        description = _get_scheme_field(annotation_id, "description", default="")
+        labels = _get_scheme_field(annotation_id, "labels", default=[])
+        # Extract label names if labels are dicts
+        if labels and isinstance(labels[0], dict):
+            labels = [l.get("name", str(l)) for l in labels]
+        # Get video URL from item data
+        item_data = get_item_state_manager().items()[instance_id].get_data()
+        video_url = self._extract_video_url(item_data)
+        if not video_url:
+            return {"error": "No video URL found in instance data"}
+        # Determine which endpoint to use
+        endpoint = self._get_visual_endpoint()
+        if not endpoint:
+            return {"error": "No visual AI endpoint configured"}
+        # Check if endpoint supports visual queries
+        if not hasattr(endpoint, 'query_with_image'):
+            return self._generate_text_hint_for_visual(instance_id, annotation_id, ai_assistant)
+        # Extract video frames
+        try:
+            frames = endpoint.extract_video_frames(video_url)
+            video_metadata = endpoint.get_video_metadata(video_url)
+        except Exception as e:
+            logger.error(f"Failed to extract video frames: {e}")
+            return {"error": f"Failed to process video: {str(e)}"}
+        # Build VisualAnnotationInput
+        data = VisualAnnotationInput(
+            ai_assistant=ai_assistant,
+            annotation_type=annotation_type,
+            task_type=ai_assistant,
+            image_data=frames,  # List of frame images
+            description=description,
+            labels=labels,
+            video_metadata=video_metadata
+        )
+        # Get output format
+        ai_prompt = get_ai_prompt()
+        prompt_config = ai_prompt.get(annotation_type, {}).get(ai_assistant, {})
+        output_format_name = prompt_config.get("output_format", "video_scene_detection")
+        output_format = self.model_manager.get_model_class_by_name(output_format_name)
+        # Query the visual endpoint
+        result = endpoint.get_visual_ai(data, output_format)
+        return result
+    def _get_visual_endpoint(self):
+        """Get the appropriate endpoint for visual tasks."""
+        # Use dedicated visual endpoint if configured
+        if self.visual_endpoint:
+            return self.visual_endpoint
+        # Check if main endpoint supports vision
+        if hasattr(self.ai_endpoint, 'query_with_image'):
+            return self.ai_endpoint
+        # Try to find a visual endpoint from registered types
+        visual_types = ['yolo', 'ollama_vision', 'openai_vision', 'anthropic_vision']
+        for vtype in visual_types:
+            if vtype in AIEndpointFactory._endpoints:
+                try:
+                    visual_config = {
+                        "ai_support": {
+                            "enabled": True,
+                            "endpoint_type": vtype,
+                            "ai_config": config.get("ai_support", {}).get("ai_config", {})
+                        }
+                    }
+                    return AIEndpointFactory.create_endpoint(visual_config)
+                except Exception as e:
+                    logger.debug(f"Could not create {vtype} endpoint: {e}")
+                    continue
+        return None
+    def _extract_image_url(self, item_data: Dict) -> str:
+        """Extract image URL from item data.
+        Looks for common field names that might contain image URLs.
+        """
+        # Common field names for images
+        image_fields = ['image', 'image_url', 'img', 'img_url', 'url', 'path', 'file', 'src']
+        for field in image_fields:
+            if field in item_data:
+                value = item_data[field]
+                if isinstance(value, str) and (
+                    value.startswith(('http://', 'https://', '/')) or
+                    value.endswith(('.jpg', '.jpeg', '.png', '.gif', '.webp'))
+                ):
+                    return value
+        # Check 'text' field for URL (common in simple configs)
+        if 'text' in item_data:
+            text = item_data['text']
+            if isinstance(text, str) and (
+                text.startswith(('http://', 'https://')) and
+                any(ext in text.lower() for ext in ['.jpg', '.jpeg', '.png', '.gif', '.webp'])
+            ):
+                return text
+        return None
+    def _extract_video_url(self, item_data: Dict) -> str:
+        """Extract video URL from item data, or return None if no video field matches."""
+        # Common field names for videos
+        video_fields = ['video', 'video_url', 'url', 'path', 'file', 'src', 'media']
+        for field in video_fields:
+            if field in item_data:
+                value = item_data[field]
+                if isinstance(value, str) and (
+                    value.startswith(('http://', 'https://', '/')) or
+                    value.endswith(('.mp4', '.webm', '.ogg', '.avi', '.mov'))
+                ):
+                    return value
+        # Check 'text' field for URL
+        if 'text' in item_data:
+            text = item_data['text']
+            if isinstance(text, str) and (
+                text.startswith(('http://', 'https://')) and
+                any(ext in text.lower() for ext in ['.mp4', '.webm', '.ogg', '.avi', '.mov'])
+            ):
+                return text
+        return None
+    def _prepare_image_data(self, image_url: str) -> ImageData:
+        """Prepare ImageData from URL or path."""
+        if image_url.startswith(('http://', 'https://')):
+            return ImageData(source="url", data=image_url)
+        else:
+            # Local file path - encode as base64
+            from potato.ai.visual_ai_endpoint import BaseVisualAIEndpoint
+            return BaseVisualAIEndpoint.encode_image_to_base64(image_url)
+    def _generate_text_hint_for_visual(self, instance_id: int, annotation_id: int, ai_assistant: str) -> Dict:
+        """Generate text-based hint when visual endpoint is not available."""
+        description = config["annotation_schemes"][annotation_id].get("description", "")
+        labels = config["annotation_schemes"][annotation_id].get("labels", [])
+        if labels and isinstance(labels[0], dict):
+            labels = [l.get("name", str(l)) for l in labels]
+        return {
+            "hint": f"Review the {'image' if 'image' in config['annotation_schemes'][annotation_id]['annotation_type'] else 'video'} carefully. "
+                    f"Look for: {', '.join(labels) if labels else 'relevant content'}. "
+                    f"Task: {description}",
+            "suggestive_choice": ""
+        }
+    def is_option_highlighting_enabled_for_scheme(self, annotation_id: int) -> bool:
+        """Check if option highlighting is enabled for a specific annotation scheme."""
+        if not self.option_highlighting_enabled:
+            return False
+        scheme = config["annotation_schemes"][annotation_id]
+        annotation_type = scheme.get("annotation_type", "")
+        scheme_name = scheme.get("name", "")
+        # Only applicable to discrete option types
+        discrete_types = ["radio", "multiselect", "likert", "select"]
+        if annotation_type not in discrete_types:
+            return False
+        # Check if schemas filter is set
+        if self.option_highlighting_schemas is not None:
+            if scheme_name not in self.option_highlighting_schemas:
+                return False
+        return True
+    def get_option_highlighting_config(self) -> Dict:
+        """Get the option highlighting configuration for the frontend."""
+        return {
+            "enabled": self.option_highlighting_enabled,
+            "top_k": self.option_highlighting_top_k,
+            "dim_opacity": self.option_highlighting_dim_opacity,
+            "auto_apply": self.option_highlighting_auto_apply,
+            "schemas": self.option_highlighting_schemas,
+            "prefetch_count": self.option_highlighting_prefetch_count,
+        }
+    def generate_option_highlights(self, instance_id: int, annotation_id: int) -> Dict:
+        """Generate option highlighting suggestions for an annotation.
+        Args:
+            instance_id: The instance/item index
+            annotation_id: The annotation scheme index
+        Returns:
+            Dict with highlighted options and configuration:
+            {
+                "highlighted": ["option1", "option2"],
+                "top_k": 3,
+                "confidence": 0.85
+            }
+        """
+        from string import Template
+        if not self.is_option_highlighting_enabled_for_scheme(annotation_id):
+            return {"error": "Option highlighting not enabled for this scheme"}
+        annotation_type = _get_scheme_field(annotation_id, "annotation_type", default="")
+        description = _get_scheme_field(annotation_id, "description", default="")
+        labels = _get_scheme_field(annotation_id, "labels", default=[])
+        # Extract label names
+        if labels and isinstance(labels[0], dict):
+            label_names = [l.get("name", str(l)) for l in labels]
+        else:
+            label_names = [str(l) for l in labels]
+        # For likert scales, generate label names from min/max labels
+        if annotation_type == "likert":
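+            # `scheme` was previously undefined here (NameError); look it up from the config
+            scheme = config["annotation_schemes"][annotation_id]
+            size = scheme.get("size", 5)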
+            min_label = scheme.get("min_label", "1")
+            max_label = scheme.get("max_label", str(size))
+            label_names = [f"{i+1} ({min_label if i == 0 else max_label if i == size-1 else ''})" for i in range(size)]
+            # Clean up empty parentheses
+            label_names = [l.replace(" ()", "") for l in label_names]
+        text = _get_instance_text(instance_id)
+        top_k = min(self.option_highlighting_top_k, len(label_names))
+        # Get prompt template
+        ai_prompt = get_ai_prompt()
+        prompt_config = ai_prompt.get("option_highlight", {}).get("option_highlight", {})
+        if not prompt_config:
+            return {"error": "Option highlight prompt not configured"}
+        prompt_template = prompt_config.get("prompt", "")
+        output_format_name = prompt_config.get("output_format", "option_highlight")
+        output_format = self.model_manager.get_model_class_by_name(output_format_name)
+        # Build the prompt with clear delimiters to mitigate prompt injection.
+        # The user content is wrapped in XML-style tags so the LLM can
+        # distinguish between instructions and untrusted data.
+        delimited_text = (
+            f"<user_content>\n{text}\n</user_content>"
+        )
+        template = Template(prompt_template)
+        prompt = template.safe_substitute(
+            text=delimited_text,
+            description=description,
+            labels=", ".join(label_names),
+            top_k=top_k
+        )
+        # Query the AI endpoint
+        try:
+            result = self.ai_endpoint.query(prompt, output_format)
+            logger.debug(f"Option highlight raw result: {result}")
+            # Parse the result
+            if isinstance(result, str):
+                import json as json_module
+                try:
+                    # Try to parse JSON from the response
+                    result = json_module.loads(result)
+                except json_module.JSONDecodeError:
+                    # Try to extract JSON from markdown code block
+                    if "```json" in result:
+                        json_start = result.find("```json") + 7
+                        json_end = result.find("```", json_start)
+                        result = json_module.loads(result[json_start:json_end].strip())
+                    elif "```" in result:
+                        json_start = result.find("```") + 3
+                        json_end = result.find("```", json_start)
+                        result = json_module.loads(result[json_start:json_end].strip())
+                    else:
+                        return {"error": f"Could not parse response: {result[:100]}"}
+            highlighted = result.get("highlighted_options", [])
+            confidence = result.get("confidence", None)
+            # Validate highlighted options against available labels
+            valid_highlighted = [opt for opt in highlighted if opt in label_names]
+            return {
+                "highlighted": valid_highlighted[:top_k],
+                "top_k": top_k,
+                "confidence": confidence
+            }
+        except Exception as e:
+            logger.error(f"Error generating option highlights: {e}")
+            return {"error": str(e)}
+    def get_option_highlights(self, instance_id: int, annotation_id: int) -> Dict:
+        """Get option highlights from cache or generate them.
+        Args:
+            instance_id: The instance/item index
+            annotation_id: The annotation scheme index
+        Returns:
+            Dict with highlighted options
+        """
+        key = (instance_id, annotation_id, "option_highlight")
+        # Try cache first
+        if self.disk_cache_enabled:
+            cached = self.get_from_cache(key)
+            if cached is not None:
+                logger.debug(f"Option highlight cache hit for {key}")
+                return cached
+        # Generate
+        result = self.generate_option_highlights(instance_id, annotation_id)
+        # Cache if successful
+        if "error" not in result and self.disk_cache_enabled:
+            self.add_to_cache(key, result)
+        return result
+    def start_option_highlight_prefetch(self, page_id: int, prefetch_amount: int = None):
+        """Prefetch option highlights for upcoming items.
+        Args:
+            page_id: Current page/instance index
+            prefetch_amount: Number of items to prefetch (uses config default if None)
+        """
+        if not self.option_highlighting_enabled or not self.disk_cache_enabled:
+            return
+        if prefetch_amount is None:
+            prefetch_amount = self.option_highlighting_prefetch_count
+        ism = get_item_state_manager()
+        with self.lock:
+            # Calculate range
+            if prefetch_amount >= 0:
+                start_idx = page_id
+                end_idx = min(start_idx + prefetch_amount, len(ism.items()))
+            else:
+                start_idx = max(page_id + prefetch_amount, 0)
+                end_idx = page_id
+            keys = []
+            for i in range(start_idx, end_idx):
+                for annotation_id, scheme in enumerate(config["annotation_schemes"]):
+                    if self.is_option_highlighting_enabled_for_scheme(annotation_id):
+                        key = (i, annotation_id, "option_highlight")
+                        # Check if not already cached or in progress
+                        if self.get_from_cache(key) is None and key not in self.in_progress:
+                            keys.append(key)
+            # Submit prefetch jobs
+            for key in keys:
+                instance_id, annotation_id, _ = key
+                future = self.executor.submit(self.generate_option_highlights, instance_id, annotation_id)
+                self.in_progress[key] = future
+                def callback(fut, cache_key=key):
+                    with self.lock:
+                        try:
+                            result = fut.result()
+                            if "error" not in result:
+                                self.add_to_cache(cache_key, result)
+                        except Exception as e:
+                            logger.error(f"Option highlight prefetch failed for {cache_key}: {e}")
+                        self.in_progress.pop(cache_key, None)
+                future.add_done_callback(callback)
+            if keys:
+                logger.debug(f"Started option highlight prefetch for {len(keys)} items")
+    def get_include_all(self):
+        return self.include_all
+    def get_special_include(self, page_number_int, annotation_id_int):
+        logger.debug(f"get_special_include: page={page_number_int}, annotation_id={annotation_id_int}")
+        if not self.special_includes.get(page_number_int):
+            return None
+        elif not self.special_includes.get(page_number_int).get(annotation_id_int):
+            return None
+        return self.special_includes.get(page_number_int).get(annotation_id_int)
+    def start_prefetch(self, page_id, prefetch_amount):
+        """Prefetch items to warm the cache: upcoming items if prefetch_amount >= 0, preceding items if negative."""
+        if not config.get("ai_support", {}).get("enabled") or not self.disk_cache_enabled:
+            return
+        ism = get_item_state_manager()
+        with self.lock:
+            # Calculate range bounds
+            if prefetch_amount >= 0:
+                start_idx = page_id
+                end_idx = min(start_idx + prefetch_amount, len(ism.items()))
+            else:
+                # prefetch_amount is negative here, so add it to step backwards
+                start_idx = max(page_id + prefetch_amount, 0)
+                end_idx = page_id
+            logger.debug(f"Prefetch range: start_idx={start_idx}, end_idx={end_idx}")
+            keys = []
+            for i in range(start_idx, end_idx):
+                # Check if this page should be included
+                if not self.should_include_page(i):
+                    continue
+                # Process each annotation scheme for this page
+                for annotation_id, scheme in enumerate(config["annotation_schemes"]):
+                    if not self.should_include_scheme(i, annotation_id):
+                        continue
+                    annotation_type = scheme["annotation_type"]
+                    ai_prompt = get_ai_prompt()
+                    if not ai_prompt.get(annotation_type):
+                        raise Exception(f"{annotation_type} is not defined in ai_prompt")
+                    # Generate keys for this page/scheme combination
+                    scheme_keys = self.get_keys_for_scheme(i, annotation_type, annotation_id, ai_prompt)
+                    keys.extend(scheme_keys)
+            if keys:
+                self.prefetch(keys)
+    def should_include_page(self, page_index):
+        """Determine if a page should be included based on include_all and special_includes."""
+        if self.include_all:
+            return True
+        return page_index in self.special_includes
+    def should_include_scheme(self, page_index, annotation_id):
+        """Determine if a scheme should be included for a given page."""
+        if self.include_all:
+            return True
+        # Check if page is in special_includes and scheme is specified
+        if page_index in self.special_includes:
+            page_includes = self.special_includes[page_index]
+            # Membership is checked the same way for both list and dict formats
+            if isinstance(page_includes, (dict, list)):
+                return annotation_id in page_includes
+        return False
+    def get_keys_for_scheme(self, page_index, annotation_type, annotation_id, ai_prompt):
+        """Get all keys for a specific page/scheme combination."""
+        keys = []
+        # Check if this page/annotation has specific overrides in special_includes
+        if (page_index in self.special_includes and
+            isinstance(self.special_includes[page_index], dict) and
+            annotation_id in self.special_includes[page_index]):
+            # Use special_includes (overrides include_all setting)
+            specified_keys = self.special_includes[page_index][annotation_id]
+            for key in specified_keys:
+                keys.append((page_index, annotation_id, key))
+        elif self.include_all:
+            # No specific override, so include all available keys for this annotation type
+            for key in ai_prompt[annotation_type]:
+                keys.append((page_index, annotation_id, key))
+        # If include_all is False and no special_include entry, return empty keys
+        return keys
+    def prefetch(self, keys: list):
+        """Checks which keys are already cached and asynchronously generates the missing ones."""
+        with self.lock:
+            for key in keys:
+                if self.get_from_cache(key) is None and key not in self.in_progress:
+                    # Each key is (instance_id, annotation_id, ai_assistant)
+                    instance_id, annotation_id, ai_assistant = key
+                    future = self.executor.submit(self.compute_help, instance_id, annotation_id, ai_assistant)
+                    self.in_progress[key] = future
+                    def callback(fut, cache_key=key):
+                        with self.lock:
+                            try:
+                                result = fut.result()
+                                self.add_to_cache(cache_key, result)
+                            except Exception as e:
+                                logger.error(f"Prefetch failed for key {cache_key}: {e}")
+                            self.in_progress.pop(cache_key, None)
+                    future.add_done_callback(callback)
+    def get_ai_help(self, instance_id: int, annotation_id: int, ai_assistant: str) -> str:
+        """retrieves AI help either from cache, waits for in-progress, or computes on-demand."""
+        key = (instance_id, annotation_id, ai_assistant)
+        # Check if caching is enabled for this help type
+        if not self.disk_cache_enabled:
+            return self.compute_help(instance_id, annotation_id, ai_assistant)
+        # Try to get from cache if caching is enabled
+        cached_value = self.get_from_cache(key)
+        if cached_value is not None:
+            logger.debug(f"Cache hit for key: {key}")
+            return cached_value
+        with self.lock:
+            if key in self.in_progress:
+                future = self.in_progress[key]
+            else:
+                future = self.executor.submit(self.compute_help, instance_id, annotation_id, ai_assistant)
+                self.in_progress[key] = future
+        try:
+            result = future.result(timeout=60)
+            # Don't cache error responses
+            is_error_response = (
+                isinstance(result, str) and
+                (result.startswith("Unable to generate") or
+                 result.startswith("Error:") or
+                 "error" in result.lower()[:50])
+            )
+            if self.disk_cache_enabled and not is_error_response:
+                self.add_to_cache(key, result)
+            elif is_error_response:
+                logger.warning(f"Not caching error response for key {key}: {result[:100]}")
+            with self.lock:
+                self.in_progress.pop(key, None)
+            return result
+        except Exception as e:
+            logger.error(f"Error computing help for key {key}: {e}")
+            with self.lock:
+                self.in_progress.pop(key, None)
+            return f"Error: {str(e)}"
+    def compute_help(self, instance_id: int, annotation_id: int, ai_assistant: str):
+        """validates the request, then dispatches to the generator for the scheme's annotation type."""
+        # Validate that the assistant type is compatible with the model and input
+        is_valid, error_message = self._validate_assistant_compatibility(
+            instance_id, annotation_id, ai_assistant
+        )
+        if not is_valid:
+            logger.warning(f"Assistant compatibility check failed: {error_message}")
+            return {"error": error_message}
+        annotation_type_str = config["annotation_schemes"][annotation_id]["annotation_type"]
+        annotation_type = Annotation_Type(annotation_type_str)
+        if annotation_type == Annotation_Type.LIKERT:
+            return self.generate_likert(instance_id, annotation_id, ai_assistant)
+        elif annotation_type == Annotation_Type.RADIO:
+            return self.generate_radio(instance_id, annotation_id, ai_assistant)
+        elif annotation_type == Annotation_Type.MULTISELECT:
+            return self.generate_multiselect(instance_id, annotation_id, ai_assistant)
+        elif annotation_type == Annotation_Type.NUMBER:
+            return self.generate_number(instance_id, annotation_id, ai_assistant)
+        elif annotation_type == Annotation_Type.SELECT:
+            return self.generate_select(instance_id, annotation_id, ai_assistant)
+        elif annotation_type == Annotation_Type.SLIDER:
+            return self.generate_slider(instance_id, annotation_id, ai_assistant)
+        elif annotation_type == Annotation_Type.SPAN:
+            return self.generate_span(instance_id, annotation_id, ai_assistant)
+        elif annotation_type == Annotation_Type.TEXTBOX:
+            return self.generate_textbox(instance_id, annotation_id, ai_assistant)
+        elif annotation_type == Annotation_Type.IMAGE_ANNOTATION:
+            return self.generate_image_annotation(instance_id, annotation_id, ai_assistant)
+        elif annotation_type == Annotation_Type.VIDEO_ANNOTATION:
+            return self.generate_video_annotation(instance_id, annotation_id, ai_assistant)
+        else:
+            raise ValueError(f"Unknown annotation type: {annotation_type}")
+    def get_cache_stats(self) -> Dict[str, int]:
+        """returns statistics on disk cache and in-progress cache entries."""
+        with self.lock:
+            disk_count = 0
+            if self.disk_cache_enabled and self.disk_persistence_path and os.path.exists(self.disk_persistence_path):
+                try:
+                    disk_data = self.load_disk_cache_data(self.disk_persistence_path)
+                    disk_count = len(disk_data)
+                    except Exception:
+                    pass
+            return {
+                'disk_cache_enabled': self.disk_cache_enabled,
+                'cached_items_disk': disk_count,
+                'in_progress_items': len(self.in_progress)
+            }
+    def clear_cache(self):
+        """clears disk cache and cancels any ongoing generation."""
+        with self.lock:
+            for future in self.in_progress.values():
+                future.cancel()
+            self.in_progress.clear()
+            if self.disk_cache_enabled and self.disk_persistence_path and os.path.exists(self.disk_persistence_path):
+                try:
+                    os.remove(self.disk_persistence_path)
+                    logger.info("Disk cache file removed")
+                except Exception as e:
+                    logger.error(f"Error removing disk cache file: {e}")
+            logger.info("Cache cleared")
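
The lookup order used by `get_ai_help` above (cache first, then an in-flight future, then a fresh submission) can be illustrated in isolation. The following is a minimal sketch, not the module's implementation: `TinyHelpCache`, its in-memory dict cache, and the `compute` callable are hypothetical stand-ins for the disk cache and `compute_help`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class TinyHelpCache:
    """Hypothetical, in-memory stand-in for the cache shown in the diff."""

    def __init__(self, compute):
        self._compute = compute          # assumed: a callable taking the key
        self._cache = {}                 # finished results
        self._in_progress = {}           # key -> Future for work already submitted
        self._lock = threading.Lock()
        self._executor = ThreadPoolExecutor(max_workers=2)

    def get(self, key, timeout=5):
        with self._lock:
            # 1. Serve from cache when possible.
            if key in self._cache:
                return self._cache[key]
            # 2. Reuse an in-flight computation instead of submitting a duplicate.
            future = self._in_progress.get(key)
            if future is None:
                # 3. Otherwise start the computation and record it.
                future = self._executor.submit(self._compute, key)
                self._in_progress[key] = future
        try:
            result = future.result(timeout=timeout)
        finally:
            # Always clear the in-progress marker, even on failure or timeout.
            with self._lock:
                self._in_progress.pop(key, None)
        with self._lock:
            self._cache[key] = result
        return result


calls = []


def fake_compute(key):
    calls.append(key)
    return f"help for {key}"


cache = TinyHelpCache(fake_compute)
first = cache.get((0, 0, "hint"))
second = cache.get((0, 0, "hint"))   # served from the cache, no second compute
```

Waiting on the future outside the lock matters here: holding the lock during `future.result()` would block every other caller for the full duration of the model request.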

potato/ai/ai_endpoint.py ADDED

	@@ -0,0 +1,688 @@

+"""
+Unified AI endpoint interface for various LLM providers.
+This module provides a common interface for interacting with different LLM providers
+including OpenAI, Anthropic, Hugging Face, Ollama, and VLLM endpoints.
+"""
+from dataclasses import dataclass, field
+from enum import Enum
+import logging
+from abc import ABC, abstractmethod
+import os
+from typing import Dict, Any, Optional, List, Type, Union
+import json
+from string import Template
+from pydantic import BaseModel
+from .ai_prompt import get_ai_prompt
+logger = logging.getLogger(__name__)
+class Annotation_Type(Enum):
+    RADIO = "radio"
+    LIKERT = "likert"
+    NUMBER = "number"
+    TEXTBOX = "text"
+    MULTISELECT = "multiselect"
+    SPAN = "span"
+    SELECT = "select"
+    SLIDER = "slider"
+    IMAGE_ANNOTATION = "image_annotation"
+    VIDEO_ANNOTATION = "video_annotation"
+@dataclass
+class ImageData:
+    """Data structure for image input to visual AI endpoints."""
+    source: str  # 'url' | 'base64'
+    data: str    # The URL or base64-encoded image data
+    width: Optional[int] = None
+    height: Optional[int] = None
+    mime_type: Optional[str] = None  # e.g., 'image/jpeg', 'image/png'
+@dataclass
+class VisualAnnotationInput:
+    """Input data structure for visual annotation AI assistance."""
+    ai_assistant: str           # 'detection', 'classification', 'hint', 'pre_annotate', etc.
+    annotation_type: str        # 'image_annotation' | 'video_annotation'
+    task_type: str              # Specific task: 'detection', 'classification', 'scene_detection', etc.
+    image_data: Union[ImageData, List[ImageData]]  # Single image or list of frames
+    description: str            # Task description from annotation scheme
+    labels: Optional[List[str]] = None  # Available labels for the task
+    video_metadata: Optional[Dict[str, Any]] = field(default_factory=dict)  # fps, duration for video
+    region: Optional[Dict[str, float]] = None  # Selected region for classification (x, y, width, height)
+    confidence_threshold: float = 0.5  # Minimum confidence for detections
+@dataclass
+class AnnotationInput:
+    ai_assistant: str
+    annotation_type: Annotation_Type
+    text: str
+    description: str
+    min_label: Optional[str] = ""
+    max_label: Optional[str] = ""
+    size: Optional[int] = -1
+    labels: Optional[List[str]] = None
+    min_value: Optional[int] = -1
+    max_value: Optional[int] = -1
+    step: Optional[int] = -1
+@dataclass
+class ModelCapabilities:
+    """
+    Declares what operations an AI endpoint can perform.
+    This dataclass is used to define the capabilities of different AI endpoints,
+    enabling the system to automatically filter AI assistant buttons and validate
+    requests based on what each model can actually do.
+    Attributes:
+        text_generation: Can generate text (hints, rationales, descriptions)
+        vision_input: Can process images as input
+        bounding_box_output: Can output precise coordinate detections
+        text_classification: Can classify text into categories
+        image_classification: Can classify images into categories
+        rationale_generation: Can generate explanations/rationales for labels
+        keyword_extraction: Can extract keywords from text (not applicable to images)
+    """
+    text_generation: bool = False
+    vision_input: bool = False
+    bounding_box_output: bool = False
+    text_classification: bool = False
+    image_classification: bool = False
+    rationale_generation: bool = False
+    keyword_extraction: bool = False
+    def supports_assistant(self, assistant_type: str, has_image_input: bool = False) -> bool:
+        """
+        Check if model supports a specific AI assistant type.
+        Args:
+            assistant_type: The type of AI assistant ('hint', 'keyword', 'rationale',
+                          'detection', 'pre_annotate', 'classification')
+            has_image_input: Whether the current content is an image
+        Returns:
+            True if the model supports this assistant type for the given input type
+        """
+        if assistant_type == "hint":
+            # Hints require text generation; for images, also need vision
+            if has_image_input:
+                return self.text_generation and self.vision_input
+            return self.text_generation
+        elif assistant_type == "keyword":
+            # Keywords require keyword extraction AND text input (not images)
+            # Keyword highlighting doesn't make sense for images
+            return self.keyword_extraction and not has_image_input
+        elif assistant_type == "rationale":
+            # Rationales require rationale generation; for images, also need vision
+            if has_image_input:
+                return self.rationale_generation and self.vision_input
+            return self.rationale_generation
+        elif assistant_type in ("detection", "detect", "pre_annotate"):
+            # Detection requires vision and bounding box output
+            return self.bounding_box_output and self.vision_input
+        elif assistant_type == "classification":
+            # Classification depends on input type
+            if has_image_input:
+                return self.image_classification and self.vision_input
+            return self.text_classification
+        # Unknown assistant type - default to False for safety
+        return False
+    def get_supported_assistants(self, has_image_input: bool = False) -> List[str]:
+        """
+        Get list of assistant types supported for the given input type.
+        Args:
+            has_image_input: Whether the current content is an image
+        Returns:
+            List of supported assistant type names
+        """
+        all_types = ["hint", "keyword", "rationale", "detection", "pre_annotate", "classification"]
+        return [t for t in all_types if self.supports_assistant(t, has_image_input)]
+class AIEndpointError(Exception):
+    """Base exception for AI endpoint errors."""
+    pass
+class AIEndpointConfigError(AIEndpointError):
+    """Exception raised for configuration errors."""
+    pass
+class AIEndpointRequestError(AIEndpointError):
+    """Exception raised for request/API errors."""
+    pass
+class BaseAIEndpoint(ABC):
+    """
+    Abstract base class for AI endpoints.
+    All AI endpoint implementations should inherit from this class
+    and implement the required methods.
+    """
+    def __init__(self, config: Dict[str, Any]):
+        """
+        Initialize the AI endpoint with configuration.
+        Args:
+            config: Configuration dictionary containing endpoint-specific settings
+        """
+        self.config = config
+        self.description = config.get("description", "")
+        self.annotation_type = config.get("annotation_type", "")
+        self.ai_config = config.get("ai_config", {})
+        # Model configuration
+        self.model = self.ai_config.get("model", self._get_default_model())
+        self.max_tokens = self.ai_config.get("max_tokens", 100)
+        self.temperature = self.ai_config.get("temperature", 0.1)
+        # prompt
+        self.prompts = get_ai_prompt()
+        # Initialize the client
+        self._initialize_client()
+    @abstractmethod
+    def _initialize_client(self) -> None:
+        """Initialize the client for the specific AI provider."""
+        pass
+    @abstractmethod
+    def _get_default_model(self) -> str:
+        """Get the default model name for this provider."""
+        pass
+    @abstractmethod
+    def query(self, prompt: str, output_format: Type[BaseModel]):
+        """
+        Send a query to the AI model and return the response.
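+        Args:
+            prompt: The prompt to send to the model
+            output_format: Pydantic model class describing the expected structured output
+        Returns:
+            The model's response (a string or parsed JSON, depending on the endpoint)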
+        Raises:
+            AIEndpointRequestError: If the request fails
+        """
+        pass
+    def chat_query_with_image(
+        self,
+        messages: List[Dict[str, Any]],
+        images: Optional[List["ImageData"]] = None,
+    ) -> str:
+        """
+        Send a multi-turn chat with interleaved images to the AI model.
+        Messages may contain content blocks (text + image) instead of plain strings.
+        Used by the live agent runner for vision-based agent loops.
+        Default implementation raises NotImplementedError — only vision-capable
+        endpoints should override this.
+        Args:
+            messages: List of message dicts. 'content' may be a string or a list
+                      of content blocks (e.g., {"type": "text", "text": "..."} or
+                      {"type": "image", "source": {...}}).
+            images: Optional list of ImageData to include (alternative to inline images).
+        Returns:
+            The model's response as a plain text string.
+        Raises:
+            NotImplementedError: If the endpoint doesn't support vision.
+        """
+        raise NotImplementedError(
+            f"{self.__class__.__name__} does not support chat_query_with_image. "
+            f"Use a vision-capable endpoint (e.g., anthropic_vision)."
+        )
+    def chat_query(self, messages: List[Dict[str, str]]) -> str:
+        """
+        Send a multi-turn chat conversation to the AI model.
+        Default implementation flattens messages into a single prompt and calls query().
+        Subclasses should override with native multi-turn support.
+        Args:
+            messages: List of message dicts with 'role' and 'content' keys.
+                      Roles: 'system', 'user', 'assistant'
+        Returns:
+            The model's response as a plain text string.
+        """
+        # Flatten messages into a single prompt
+        parts = []
+        for msg in messages:
+            role = msg.get("role", "user")
+            content = msg.get("content", "")
+            if role == "system":
+                parts.append(f"System: {content}")
+            elif role == "assistant":
+                parts.append(f"Assistant: {content}")
+            else:
+                parts.append(f"User: {content}")
+        prompt = "\n\n".join(parts) + "\n\nAssistant:"
+        try:
+            result = self.query(prompt)
+            # query() may return parsed JSON or a string; ensure we return a string
+            if isinstance(result, dict):
+                return result.get("response", result.get("content", str(result)))
+            return str(result)
+        except Exception as e:
+            raise AIEndpointRequestError(f"Chat query failed: {e}") from e
+    def parseStringToJson(self, response_content: str) -> Any:
+        """
+        Parse structured output from any LLM response, with robust fallbacks.
+        Handles common issues across all endpoint types (ollama, vllm, openai, etc.):
+        1. Clean JSON responses -> direct parse
+        2. JSON wrapped in markdown code blocks (```json ... ```)
+        3. JSON wrapped in tool-call or output tags (<tool_call>, <output>, ...)
+        4. JSON embedded in surrounding prose text
+        5. Truncated JSON (max_tokens exceeded) -> extract complete key-value pairs
+        6. XML-style output (<label>joy</label>) -> tag/content pairs
+        7. Plain text with no JSON structure -> return as {"response": text}
+        Thinking blocks (<think>...</think>) are stripped before any parsing.
+        Args:
+            response_content: Raw response string from the LLM
+        Returns:
+            The parsed value (usually a dict), or {"response": text} if nothing could be parsed
+        Raises:
+            ValueError: Only if response is completely empty
+        """
+        import re
+        # Handle empty or None content
+        if not response_content:
+            raise ValueError("Empty response content received from AI endpoint")
+        # If it's already a dict, return it
+        if isinstance(response_content, dict):
+            return response_content
+        # Convert to string if needed
+        content_str = str(response_content).strip()
+        if not content_str:
+            raise ValueError("Empty response content received from AI endpoint")
+        # Strategy 0: Strip thinking/reasoning blocks that wrap the actual output
+        # Many models (qwen3, deepseek, etc.) produce <think>...</think> blocks
+        cleaned = content_str
+        for tag in ['think', 'thinking', 'thought', 'inner_monologue']:
+            cleaned = re.sub(
+                rf'<{tag}>[\s\S]*?</{tag}>\s*',
+                '', cleaned, flags=re.IGNORECASE
+            ).strip()
+        if cleaned and cleaned != content_str:
+            content_str = cleaned
+        # Strategy 1: Try direct JSON parse
+        try:
+            return json.loads(content_str)
+        except json.JSONDecodeError:
+            pass
+        # Strategy 2: Extract from markdown code blocks
+        for pattern in [
+            r'```json\s*([\s\S]*?)\s*```',
+            r'```\s*([\s\S]*?)\s*```',
+        ]:
+            match = re.search(pattern, content_str)
+            if match:
+                try:
+                    return json.loads(match.group(1).strip())
+                except json.JSONDecodeError:
+                    pass
+        # Strategy 3: Extract from tool call format
+        # Some models return: <tool_call>{"name": "...", "arguments": {...}}</tool_call>
+        # or function_call blocks
+        for pattern in [
+            r'<tool_call>\s*([\s\S]*?)\s*</tool_call>',
+            r'<function_call>\s*([\s\S]*?)\s*</function_call>',
+            r'<output>\s*([\s\S]*?)\s*</output>',
+            r'<result>\s*([\s\S]*?)\s*</result>',
+            r'<answer>\s*([\s\S]*?)\s*</answer>',
+        ]:
+            match = re.search(pattern, content_str, re.IGNORECASE)
+            if match:
+                try:
+                    parsed = json.loads(match.group(1).strip())
+                    # If it's a tool call wrapper, extract the arguments
+                    if isinstance(parsed, dict) and 'arguments' in parsed:
+                        return parsed['arguments']
+                    return parsed
+                except json.JSONDecodeError:
+                    pass
+        # Strategy 4: Find a JSON object anywhere in the text
+        # Greedy match for the outermost { ... }
+        match = re.search(r'\{[\s\S]*\}', content_str)
+        if match:
+            try:
+                return json.loads(match.group(0))
+            except json.JSONDecodeError:
+                pass
+        # Strategy 5: Salvage truncated JSON
+        # Extract complete "key": "value" and "key": number pairs
+        salvaged = self._salvage_key_value_pairs(content_str)
+        if salvaged:
+            logger.warning(
+                f"Salvaged {len(salvaged)} fields from truncated/malformed response"
+            )
+            return salvaged
+        # Strategy 6: Parse XML-style output
+        # Some models (especially larger ones) produce XML like:
+        # <label>joy</label><confidence>90</confidence>
+        # or <response><label>joy</label></response>
+        xml_result = self._parse_xml_to_dict(content_str)
+        if xml_result:
+            logger.info(
+                f"Parsed {len(xml_result)} fields from XML-style response"
+            )
+            return xml_result
+        # Strategy 7: Return raw text wrapped in a dict
+        logger.warning(
+            f"Could not parse JSON or XML from response ({len(content_str)} chars), "
+            f"returning as raw text"
+        )
+        return {"response": content_str}
+    @staticmethod
+    def _salvage_key_value_pairs(text: str) -> Optional[dict]:
+        """Extract key-value pairs from truncated or malformed JSON.
+        Handles cases where max_tokens cuts off a response mid-field, e.g.:
+        {"label": "joy", "confidence": 90, "reasoning": "The text expres...
+        Returns:
+            Dict of extracted key-value pairs, or None if nothing found.
+        """
+        import re
+        result = {}
+        # Extract "key": "value" pairs (string values)
+        for match in re.finditer(r'"(\w+)"\s*:\s*"([^"]*)"', text):
+            result[match.group(1)] = match.group(2)
+        # Extract "key": number pairs
+        for match in re.finditer(r'"(\w+)"\s*:\s*(-?\d+(?:\.\d+)?)\b', text):
+            key = match.group(1)
+            if key not in result:
+                try:
+                    val = float(match.group(2))
+                    result[key] = int(val) if val == int(val) else val
+                except ValueError:
+                    pass
+        # Extract "key": true/false/null
+        for match in re.finditer(r'"(\w+)"\s*:\s*(true|false|null)\b', text):
+            key = match.group(1)
+            if key not in result:
+                val_str = match.group(2)
+                result[key] = (
+                    True if val_str == 'true'
+                    else False if val_str == 'false'
+                    else None
+                )
+        return result if result else None
+    @staticmethod
+    def _parse_xml_to_dict(text: str) -> Optional[dict]:
+        """Extract key-value pairs from XML-style LLM output.
+        Handles patterns like:
+        - <label>joy</label><confidence>90</confidence>
+        - <response><label>joy</label></response>
+        - Mixed XML with text: "The emotion is <label>joy</label>"
+        Returns:
+            Dict of tag->content pairs, or None if no XML tags found.
+        """
+        import re
+        result = {}
+        # Find all <tag>content</tag> pairs (non-nested simple tags)
+        for match in re.finditer(
+            r'<(\w+)>([^<]*)</\1>', text, re.IGNORECASE
+        ):
+            tag = match.group(1).lower()
+            value = match.group(2).strip()
+            # Skip wrapper tags that contain other tags
+            if tag in ('response', 'output', 'result', 'answer', 'root'):
+                continue
+            # Try to parse numeric values
+            try:
+                if '.' in value:
+                    result[tag] = float(value)
+                else:
+                    result[tag] = int(value)
+            except ValueError:
+                # Boolean
+                if value.lower() in ('true', 'false'):
+                    result[tag] = value.lower() == 'true'
+                else:
+                    result[tag] = value
+        return result if result else None
+    def get_ai(self, data: AnnotationInput, output_format) -> str:
+        """
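+        Build the prompt for the requested assistant and query the model.
+        Args:
+            data: Annotation input holding the text, description, and scheme settings
+            output_format: Expected structured output format, passed through to query()
+        Returns:
+            The model's response, or a fallback message if the request cannot be served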
+        """
+        try:
+            # Check if annotation type exists (comparing string against enum values)
+            valid_types = [e.value for e in Annotation_Type]
+            if data.annotation_type not in valid_types:
+                logger.warning(f"Annotation type '{data.annotation_type}' not found")
+                return "Unable to generate suggestion - annotation type not configured"
+            # Check if ai_assistant exists
+            ai_prompt = get_ai_prompt()
+            if data.ai_assistant not in ai_prompt[data.annotation_type]:
+                logger.warning(f"Assistant '{data.ai_assistant}' not found for {data.annotation_type}")
+                return "Unable to generate suggestion - prompt not configured"
+            template_str = self.prompts.get(data.annotation_type).get(data.ai_assistant).get("prompt")
+            template = Template(template_str)
+            prompt = template.substitute(
+                text=data.text,
+                description=data.description,
+                min_label=data.min_label,
+                max_label=data.max_label,
+                size=data.size,
+                labels=data.labels,
+                min_value=data.min_value,
+                max_value=data.max_value,
+                step=data.step
+            )
+            return self.query(prompt, output_format)
+        except Exception as e:
+            logger.error(f"[get_ai] AnnotationInput: {data}")
+            logger.error(f"[get_ai] Error for {data.annotation_type}/{data.ai_assistant}: {type(e).__name__}: {e}")
+            import traceback
+            logger.error(f"[get_ai] Traceback:\n{traceback.format_exc()}")
+            return "Unable to generate hint at this time."
+    def health_check(self) -> bool:
+        """
+        Check if the AI endpoint is healthy and accessible.
+        Returns:
+            True if the endpoint is healthy, False otherwise
+        """
+        try:
+            # Simple test query
+            test_response = self.query("Hello")
+            return bool(test_response and test_response.strip())
+        except Exception as e:
+            logger.error(f"Health check failed: {e}")
+            return False
+class AIEndpointFactory:
+    """
+    Factory class for creating AI endpoint instances.
+    """
+    _endpoints = { }
+    @classmethod
+    def register_endpoint(cls, endpoint_type: str, endpoint_class: type):
+        """Register a new endpoint type."""
+        cls._endpoints[endpoint_type] = endpoint_class
+    @classmethod
+    def create_endpoint(cls, config: Dict[str, Any]) -> Optional[BaseAIEndpoint]:
+        """
+        Create an AI endpoint instance based on configuration.
+        Args:
+            config: Configuration dictionary containing ai_support settings
+        Returns:
+            An AI endpoint instance or None if AI support is disabled
+        Raises:
+            AIEndpointConfigError: If the configuration is invalid
+        """
+        if not config.get("ai_support", {}).get("enabled", False):
+            return None
+        ai_support = config["ai_support"]
+        endpoint_type = ai_support.get("endpoint_type")
+        if not endpoint_type:
+            raise AIEndpointConfigError("endpoint_type is required when ai_support is enabled")
+        if endpoint_type not in cls._endpoints:
+            raise AIEndpointConfigError(f"Unknown endpoint type: {endpoint_type}")
+        # Prepare endpoint configuration
+        endpoint_config = {
+            "ai_config": ai_support.get("ai_config", {})
+        }
+        try:
+            endpoint_class = cls._endpoints[endpoint_type]
+            return endpoint_class(endpoint_config)
+        except Exception as e:
+            raise AIEndpointConfigError(f"Failed to create {endpoint_type} endpoint: {e}")
+# Legacy function for backward compatibility
+def get_ai_endpoint(config: dict):
+    """
+    Get an AI endpoint instance (legacy function).
+    This function is maintained for backward compatibility.
+    New code should use AIEndpointFactory.create_endpoint().
+    """
+    return AIEndpointFactory.create_endpoint(config)
+# Register built-in endpoints
+try:
+    from .ollama_endpoint import OllamaEndpoint
+    AIEndpointFactory.register_endpoint("ollama", OllamaEndpoint)
+except ImportError:
+    logger.debug("Ollama endpoint not available")
+try:
+    from .openai_endpoint import OpenAIEndpoint
+    AIEndpointFactory.register_endpoint("openai", OpenAIEndpoint)
+except ImportError:
+    logger.debug("OpenAI endpoint not available")
+try:
+    from .huggingface_endpoint import HuggingfaceEndpoint
+    AIEndpointFactory.register_endpoint("huggingface", HuggingfaceEndpoint)
+except ImportError:
+    logger.debug("Hugging Face endpoint not available")
+try:
+    from .gemini_endpoint import GeminiEndpoint
+    AIEndpointFactory.register_endpoint("gemini", GeminiEndpoint)
+except ImportError:
+    logger.debug("Gemini endpoint not available")
+try:
+    from .anthropic_endpoint import AnthropicEndpoint
+    AIEndpointFactory.register_endpoint("anthropic", AnthropicEndpoint)
+except ImportError:
+    logger.debug("Anthropic endpoint not available")
+try:
+    from .vllm_endpoint import VLLMEndpoint
+    AIEndpointFactory.register_endpoint("vllm", VLLMEndpoint)
+except ImportError:
+    logger.debug("VLLM endpoint not available")
+# Register visual AI endpoints
+try:
+    from .yolo_endpoint import YOLOEndpoint
+    AIEndpointFactory.register_endpoint("yolo", YOLOEndpoint)
+except ImportError:
+    logger.debug("YOLO endpoint not available (ultralytics not installed)")
+try:
+    from .ollama_vision_endpoint import OllamaVisionEndpoint
+    AIEndpointFactory.register_endpoint("ollama_vision", OllamaVisionEndpoint)
+except ImportError:
+    logger.debug("Ollama Vision endpoint not available")
+try:
+    from .openai_vision_endpoint import OpenAIVisionEndpoint
+    AIEndpointFactory.register_endpoint("openai_vision", OpenAIVisionEndpoint)
+except ImportError:
+    logger.debug("OpenAI Vision endpoint not available")
+try:
+    from .anthropic_vision_endpoint import AnthropicVisionEndpoint
+    AIEndpointFactory.register_endpoint("anthropic_vision", AnthropicVisionEndpoint)
+except ImportError:
+    logger.debug("Anthropic Vision endpoint not available")
+try:
+    from .openrouter_endpoint import OpenRouterEndpoint
+    AIEndpointFactory.register_endpoint("openrouter", OpenRouterEndpoint)
+except ImportError:
+    logger.debug("OpenRouter endpoint not available")
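The factory above combines a class-level registry with guarded imports, so a backend whose SDK is not installed is simply never registered. A minimal, self-contained sketch of that pattern follows. `EndpointFactory`, `EndpointConfigError`, `EchoEndpoint`, and the module name `some_missing_sdk_for_illustration` are hypothetical stand-ins, not the project's actual classes.

```python
import logging
from typing import Dict, Optional, Type

logger = logging.getLogger(__name__)


class EndpointConfigError(Exception):
    """Raised when the ai_support configuration is invalid (hypothetical name)."""


class EchoEndpoint:
    """Stand-in endpoint used only for this illustration."""

    def __init__(self, endpoint_config: dict):
        self.ai_config = endpoint_config.get("ai_config", {})


class EndpointFactory:
    # Maps an endpoint_type string to the class that implements it.
    _endpoints: Dict[str, Type] = {}

    @classmethod
    def register_endpoint(cls, name: str, endpoint_class: Type) -> None:
        cls._endpoints[name] = endpoint_class

    @classmethod
    def create_endpoint(cls, config: dict) -> Optional[object]:
        ai_support = config.get("ai_support", {})
        if not ai_support.get("enabled", False):
            return None  # AI support disabled: no endpoint at all
        endpoint_type = ai_support.get("endpoint_type")
        if not endpoint_type:
            raise EndpointConfigError("endpoint_type is required when ai_support is enabled")
        if endpoint_type not in cls._endpoints:
            raise EndpointConfigError(f"Unknown endpoint type: {endpoint_type}")
        endpoint_class = cls._endpoints[endpoint_type]
        return endpoint_class({"ai_config": ai_support.get("ai_config", {})})


EndpointFactory.register_endpoint("echo", EchoEndpoint)

# An optional backend registers itself only if its dependency imports cleanly.
try:
    import some_missing_sdk_for_illustration  # hypothetical module, expected to be absent
    EndpointFactory.register_endpoint("optional", EchoEndpoint)
except ImportError:
    logger.debug("Optional endpoint not available")
```

With this layout, a configuration that names an uninstalled backend fails with the same "Unknown endpoint type" error as a typo, which is worth keeping in mind when debugging a deployment.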

potato/ai/ai_help_wrapper.py ADDED

	@@ -0,0 +1,203 @@

+from flask import render_template_string
+from typing import Optional, Dict, Any, List
+from potato.ai.ai_cache import get_ai_cache_manager, _is_image_url, _get_instance_text
+from potato.ai.ai_prompt import get_ai_prompt
+from potato.server_utils.config_module import config
+import logging
+logger = logging.getLogger(__name__)
+# Global instance
+DYNAMICAIHELP = None
+def init_dynamic_ai_help():
+    ai_enabled = config.get("ai_support", {}).get("enabled", False)
+    logger.info(f"[init_dynamic_ai_help] Called. ai_support.enabled={ai_enabled}")
+    if not ai_enabled:
+        logger.info("[init_dynamic_ai_help] AI support disabled, returning")
+        return
+    global DYNAMICAIHELP
+    if DYNAMICAIHELP is None:
+        DYNAMICAIHELP = DynamicAIHelp()
+        logger.info(f"[init_dynamic_ai_help] Created DYNAMICAIHELP instance: {id(DYNAMICAIHELP)}")
+    else:
+        logger.info(f"[init_dynamic_ai_help] DYNAMICAIHELP already exists: {id(DYNAMICAIHELP)}")
+    return DYNAMICAIHELP
+def get_dynamic_ai_help():
+    global DYNAMICAIHELP
+    return DYNAMICAIHELP
+class DynamicAIHelp:
+    def __init__(self):
+        self.template = """
+        {% if ai_assistant %}
+        {{ ai_assistant | safe }}
+        {% elif error_message %}
+        <span class="error">{{ error_message }}</span>
+        {% endif %}
+        """
+    def get_empty_wrapper(self):
+        return '<div class="ai-help none"><div class="tooltip"></div></div>'
+    def generate_ai_assistant(self, ai_prompts, annotation_type, ai_assistant):
+        str_html = f'<div class="{ai_assistant} ai-assistant-containter">'
+        img_url = ai_prompts[annotation_type].get(ai_assistant).get("img")
+        if img_url:
+            # Use empty alt since the button already has a text label
+            str_html += f'<span class="ai-assistant-img"><img src="{img_url}" alt=""></span>'
+        name = ai_prompts[annotation_type].get(ai_assistant).get("name", ai_assistant.capitalize())
+        str_html += f'<span>{name}</span>'
+        str_html += "</div>"
+        return str_html
+    def _filter_assistants_by_capability(
+        self, ai_cache_manager, assistant_keys: List[str], is_image_content: bool
+    ) -> List[str]:
+        """
+        Filter assistant types based on model capabilities.
+        Args:
+            ai_cache_manager: The AI cache manager instance
+            assistant_keys: List of assistant type keys to filter
+            is_image_content: Whether the current content is an image
+        Returns:
+            Filtered list of assistant keys that the model supports
+        """
+        # Get capabilities from the cache manager
+        capabilities = ai_cache_manager.get_endpoint_capabilities(for_image=is_image_content)
+        filtered_keys = []
+        for key in assistant_keys:
+            if capabilities.supports_assistant(key, is_image_content):
+                filtered_keys.append(key)
+            else:
+                logger.debug(
+                    f"[get_ai_help_data] Skipping '{key}' button - "
+                    f"not supported for {'image' if is_image_content else 'text'} content"
+                )
+        return filtered_keys
+    def get_ai_help_data(self, instance: int, annotation_id: int, annotation_type: str) -> Dict[str, Any]:
+        """Get current AI help configuration with the new prompt structure"""
+        try:
+            context = {
+                'ai_assistant': None,
+                'error_message': None,
+            }
+            ai_prompts = get_ai_prompt()
+            logger.debug(f"[get_ai_help_data] ai_prompts keys: {list(ai_prompts.keys()) if ai_prompts else 'None'}")
+            if not ai_prompts:
+                context["error_message"] = 'No AI prompt configured'
+                logger.debug("[get_ai_help_data] No AI prompts configured")
+                return context
+            elif annotation_type not in ai_prompts:
+                context["error_message"] = f'annotation type {annotation_type} does not exist in ai_prompts'
+                logger.debug(f"[get_ai_help_data] annotation type {annotation_type} not in prompts")
+                return context
+            ai_cache_manager = get_ai_cache_manager()
+            logger.debug(f"[get_ai_help_data] ai_cache_manager: {ai_cache_manager is not None}")
+            if ai_cache_manager is None:
+                context["error_message"] = "AI cache manager not initialized"
+                logger.debug("[get_ai_help_data] AI cache manager is None")
+                return context
+            ai_assistant_html_parts = []
+            # Determine if content is an image for capability-based filtering
+            is_image_content = False
+            try:
+                text = _get_instance_text(instance)
+                is_image_content = _is_image_url(text)
+                if is_image_content:
+                    logger.debug("[get_ai_help_data] Content is an image URL")
+            except Exception as e:
+                logger.debug(f"[get_ai_help_data] Could not determine if content is image: {e}")
+            # Check if user specified specific assistant types
+            special_include_types = ai_cache_manager.get_special_include(instance, annotation_id)
+            logger.debug(f"[get_ai_help_data] special_include_types: {special_include_types}")
+            if special_include_types:
+                # Generate HTML for specific included keys
+                logger.debug(f"[get_ai_help_data] Using special include types: {special_include_types}")
+                # Filter by capability
+                valid_keys = [k for k in special_include_types if k in ai_prompts[annotation_type]]
+                filtered_keys = self._filter_assistants_by_capability(
+                    ai_cache_manager, valid_keys, is_image_content
+                )
+                for key in filtered_keys:
+                    ai_assistant_html_parts.append(self.generate_ai_assistant(ai_prompts, annotation_type, key))
+            elif ai_cache_manager.get_include_all():
+                # Generate HTML for all keys in the annotation type
+                all_keys = list(ai_prompts[annotation_type].keys())
+                logger.debug(f"[get_ai_help_data] include_all=True, available keys: {all_keys}")
+                # Filter by capability
+                filtered_keys = self._filter_assistants_by_capability(
+                    ai_cache_manager, all_keys, is_image_content
+                )
+                logger.debug(f"[get_ai_help_data] After capability filter: {filtered_keys}")
+                for key in filtered_keys:
+                    ai_assistant_html_parts.append(self.generate_ai_assistant(ai_prompts, annotation_type, key))
+            else:
+                logger.debug("[get_ai_help_data] No special includes and include_all=False")
+            # Combine all HTML parts
+            ai_assistant_html = '<span>|</span>'.join(ai_assistant_html_parts) if ai_assistant_html_parts else None
+            logger.debug(f"[get_ai_help_data] ai_assistant_html_parts count: {len(ai_assistant_html_parts)}")
+            if ai_assistant_html:
+                context['ai_assistant'] = ai_assistant_html
+            logger.debug(f"[get_ai_help_data] Final context: ai_assistant={'set' if context['ai_assistant'] else 'None'}, error={'set' if context['error_message'] else 'None'}")
+            return context
+        except Exception as e:
+            logger.error(f"[get_ai_help_data] Exception: {e}", exc_info=True)
+            return {
+                'ai_assistant': None,
+                'error_message': f'Error loading AI help: {str(e)}',
+            }
+    def render(self, instance: int, annotation_id: int, annotation_type) -> str:
+        """Render AI help HTML with current data"""
+        context = self.get_ai_help_data(instance, annotation_id, annotation_type)
+        context.update({
+            'instance': instance,
+            'annotation_id': annotation_id
+        })
+        return render_template_string(self.template, **context)
+def generate_ai_help_html(instance: int, annotation_id: int, annotation_type: str) -> Optional[str]:
+    """
+    Generates dynamic AI help HTML using template rendering.
+    Now works with the new prompt structure: {annotation_type: {prompt: ..., outputformat: ...}}
+    """
+    if DYNAMICAIHELP is None:
+        logger.debug("[generate_ai_help_html] DYNAMICAIHELP is None - AI support not enabled")
+        return ""  # AI support not enabled
+    result = DYNAMICAIHELP.render(instance, annotation_id, annotation_type)
+    logger.debug(f"[generate_ai_help_html] Rendered result: '{result[:100] if result else 'empty'}...'")
+    return result
+def get_ai_wrapper():
+    helper = get_dynamic_ai_help()
+    logger.debug(f"[get_ai_wrapper] DYNAMICAIHELP is {'set' if helper else 'None'}")
+    result = helper.get_empty_wrapper() if helper else ""
+    logger.debug(f"[get_ai_wrapper] Returning: '{result[:50] if result else 'empty'}...'")
+    return result
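`_filter_assistants_by_capability` relies on `capabilities.supports_assistant(key, is_image_content)`, whose implementation is not part of this file. The sketch below shows one way such a check could behave. The field names mirror the `ModelCapabilities` keywords used elsewhere in this commit, but the rule set and the assistant keys (`"hint"`, `"keyword"`, `"rationale"`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical, reduced stand-in for ModelCapabilities."""

    vision_input: bool = False
    rationale_generation: bool = False
    keyword_extraction: bool = False

    def supports_assistant(self, key: str, is_image_content: bool) -> bool:
        # Assumed rule: image content needs a vision-capable model.
        if is_image_content and not self.vision_input:
            return False
        # Assumed rule: some assistant keys need a matching capability flag;
        # any other key (e.g. "hint") is allowed by default.
        required = {
            "keyword": self.keyword_extraction,
            "rationale": self.rationale_generation,
        }
        return required.get(key, True)


def filter_assistants(caps: Capabilities, keys: List[str], is_image_content: bool) -> List[str]:
    """Keep only the assistant keys the model can serve for this content type."""
    return [key for key in keys if caps.supports_assistant(key, is_image_content)]


text_only = Capabilities(keyword_extraction=True)
text_buttons = filter_assistants(text_only, ["hint", "keyword", "rationale"], False)
image_buttons = filter_assistants(text_only, ["hint", "keyword", "rationale"], True)
```

Under these assumed rules a text-only model shows no buttons at all on image items, which matches the debug message in the diff that reports skipped buttons per content type.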

potato/ai/ai_prompt.py ADDED

	@@ -0,0 +1,94 @@

+import importlib.util
+import json
+import os
+from pathlib import Path
+from typing import Optional, Type
+from pydantic import BaseModel
+from potato.server_utils.config_module import config
+ANNOTATIONS = None
+class ModelManager:
+    def __init__(self):
+        self.models_module = None
+    def load_models_module(self):
+        """Load the models module if not already loaded"""
+        if self.models_module is None:
+            # Path to a user-supplied models module; falls back to the bundled default
+            module_path = config.get("ai_support", {}).get("model_module")
+            if module_path:
+                file_path = Path(module_path)
+                if not file_path.exists():
+                    raise FileNotFoundError(f"Model module file not found: {file_path}")
+                module_name = file_path.stem
+                spec = importlib.util.spec_from_file_location(module_name, file_path)
+                self.models_module = importlib.util.module_from_spec(spec)
+                spec.loader.exec_module(self.models_module)
+            else:
+                default_path = Path(__file__).resolve().parent / "prompt" / "models_module.py"
+                if not default_path.exists():
+                    raise FileNotFoundError(f"Default model module file not found: {default_path}")
+                module_name = default_path.stem
+                spec = importlib.util.spec_from_file_location(module_name, default_path)
+                self.models_module = importlib.util.module_from_spec(spec)
+                spec.loader.exec_module(self.models_module)
+        return self.models_module
+    def get_model_class_by_name(self, name: str) -> Optional[Type[BaseModel]]:
+        """
+        Return a Pydantic model class based on the provided name.
+        """
+        models_module = self.load_models_module()
+        return models_module.CLASS_REGISTRY.get(name)
+def init_ai_prompt(config):
+    global ANNOTATIONS
+    if not config.get("ai_support", {}).get("enabled", False):
+        return
+    try:
+        annotation_paths = config.get("ai_support", {}).get("annotation_path")
+        ANNOTATIONS = {}
+        if annotation_paths:
+            # Load files from specified paths
+            for key, path in annotation_paths.items():
+                if path and os.path.exists(path):
+                    with open(path, "r", encoding="utf-8") as f:
+                        ANNOTATIONS[key] = json.load(f)
+                else:
+                    raise Exception(f"File path for annotations does not exist: {path}")
+        else:
+            # Load all JSON files from default directory (parent/prompt)
+            default_path = Path(__file__).resolve().parent / "prompt"
+            if default_path.exists() and default_path.is_dir():
+                # Find all JSON files in the directory
+                json_files = list(default_path.glob("*.json"))
+                if not json_files:
+                    raise Exception(f"No JSON files found in default directory: {default_path}")
+                # Load each JSON file, using filename (without extension) as key
+                for file_path in json_files:
+                    key = file_path.stem
+                    with open(file_path, "r", encoding="utf-8") as f:
+                        ANNOTATIONS[key] = json.load(f)
+            else:
+                raise Exception(f"Default annotation directory does not exist: {default_path}")
+    except json.JSONDecodeError as e:
+        raise ValueError(f"Invalid JSON in annotation file: {e}") from e
+    except Exception as e:
+        raise RuntimeError(f"Unexpected error loading AI prompt: {e}") from e
+def get_ai_prompt():
+    global ANNOTATIONS
+    return ANNOTATIONS
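`ModelManager.load_models_module` loads a Python file by path through `importlib.util`. The sketch below isolates that technique in a single helper and exercises it against a temporary file. `load_module_from_path` and `demo_models.py` are hypothetical names, and the extra `spec is None` guard is an addition that the diff does not have.

```python
import importlib.util
import tempfile
from pathlib import Path
from types import ModuleType


def load_module_from_path(file_path: Path) -> ModuleType:
    """Import a Python source file by path without touching sys.path."""
    if not file_path.exists():
        raise FileNotFoundError(f"Model module file not found: {file_path}")
    spec = importlib.util.spec_from_file_location(file_path.stem, file_path)
    # spec_from_file_location can return None for unsupported file types.
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot build an import spec for: {file_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Write a tiny module exposing CLASS_REGISTRY, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_models.py"
    demo_path.write_text("CLASS_REGISTRY = {'likert': dict}\n", encoding="utf-8")
    demo_module = load_module_from_path(demo_path)

registry = demo_module.CLASS_REGISTRY
```

Note that `import importlib` alone does not guarantee that the `importlib.util` submodule is loaded, so the explicit `import importlib.util` matters here.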

potato/ai/anthropic_endpoint.py ADDED

	@@ -0,0 +1,118 @@

+"""
+Anthropic AI endpoint implementation.
+This module provides integration with Anthropic's Claude API for LLM inference.
+"""
+from typing import Dict, List
+import anthropic
+from .ai_endpoint import BaseAIEndpoint, AIEndpointRequestError, ModelCapabilities
+DEFAULT_MODEL = "claude-3-5-sonnet-20241022"
+DEFAULT_HINT_PROMPT = '''
+    You are assisting a user with an annotation task.
+        The annotation instruction is : {description}
+        The annotation task type is: {annotation_type}
+        The sentence (or item) to annotate is : {text}
+        Your goal is to generate a short, helpful hint that guides the annotator in how to think about the input — **without providing the answer**.
+        The hint should:
+        - Highlight key aspects of the input relevant to the task
+        - Encourage thoughtful reasoning or observation
+        - Point to subtle features (tone, wording, structure, implication) that matter for the annotation
+        - Be specific and informative, not vague or generic
+        '''
+DEFAULT_KEYWORD_PROMPT = '''
+    You are assisting a user with an annotation task.
+        The annotation instruction is : {description}
+        The annotation task type is: {annotation_type}
+        The sentence (or item) to annotate is : {text}
+        Your goal is : Print out just a sequence of keywords, not sentences, in the text that most relate to the task. Do not explain your answer. Do not print out the entire text. If no part of the text relates to the task, print the empty string.
+    '''
+class AnthropicEndpoint(BaseAIEndpoint):
+    """Anthropic Claude endpoint for cloud-based LLM inference."""
+    # Capabilities declaration for text-based Anthropic Claude models
+    CAPABILITIES = ModelCapabilities(
+        text_generation=True,
+        vision_input=False,
+        bounding_box_output=False,
+        text_classification=True,
+        image_classification=False,
+        rationale_generation=True,
+        keyword_extraction=True,
+    )
+    def _initialize_client(self) -> None:
+        """Initialize the Anthropic client."""
+        api_key = self.ai_config.get("api_key", "")
+        if not api_key:
+            raise AIEndpointRequestError("Anthropic API key is required")
+        # Default timeout of 30 seconds, configurable via ai_config
+        timeout = self.ai_config.get("timeout", 30)
+        self.client = anthropic.Anthropic(api_key=api_key, timeout=timeout)
+    def _get_default_model(self) -> str:
+        """Get the default Anthropic model."""
+        return DEFAULT_MODEL
+    def _get_default_hint_prompt(self) -> str:
+        """Get the default hint prompt for Anthropic."""
+        return DEFAULT_HINT_PROMPT
+    def _get_default_keyword_prompt(self) -> str:
+        """Get the default keyword prompt for Anthropic."""
+        return DEFAULT_KEYWORD_PROMPT
+    def query(self, prompt: str) -> str:
+        """
+        Send a query to Anthropic Claude and return the response.
+        Args:
+            prompt: The prompt to send to the model
+        Returns:
+            The model's response as a string
+        Raises:
+            AIEndpointRequestError: If the request fails
+        """
+        try:
+            response = self.client.messages.create(
+                model=self.model,
+                max_tokens=self.max_tokens,
+                temperature=self.temperature,
+                messages=[{"role": "user", "content": prompt}]
+            )
+            return response.content[0].text
+        except Exception as e:
+            raise AIEndpointRequestError(f"Anthropic request failed: {e}") from e
+    def chat_query(self, messages: List[Dict[str, str]]) -> str:
+        """Send a multi-turn chat to Anthropic using native messages API."""
+        try:
+            # Extract system message if present
+            system_text = ""
+            chat_messages = []
+            for msg in messages:
+                if msg["role"] == "system":
+                    system_text = msg["content"]
+                else:
+                    chat_messages.append({"role": msg["role"], "content": msg["content"]})
+            kwargs = {
+                "model": self.model,
+                "max_tokens": self.max_tokens,
+                "temperature": self.temperature,
+                "messages": chat_messages,
+            }
+            if system_text:
+                kwargs["system"] = system_text
+            response = self.client.messages.create(**kwargs)
+            return response.content[0].text
+        except Exception as e:
+            raise AIEndpointRequestError(f"Anthropic chat request failed: {e}")
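`chat_query` moves any system message out of the message list and passes it as a separate `system` argument. The sketch below extracts that transformation into pure functions so it can be checked without a network call. The helper names and the `"example-model"` value are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def split_system_message(messages: List[Dict[str, str]]) -> Tuple[str, List[Dict[str, str]]]:
    """Separate the system prompt from the user/assistant turns."""
    system_text = ""
    chat_messages = []
    for msg in messages:
        if msg["role"] == "system":
            # If several system messages exist, the last one wins (as in the diff).
            system_text = msg["content"]
        else:
            chat_messages.append({"role": msg["role"], "content": msg["content"]})
    return system_text, chat_messages


def build_request_kwargs(
    messages: List[Dict[str, str]], model: str, max_tokens: int, temperature: float
) -> Dict[str, Any]:
    """Assemble keyword arguments in the shape chat_query passes to the client."""
    system_text, chat_messages = split_system_message(messages)
    kwargs: Dict[str, Any] = {
        "model": model,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "messages": chat_messages,
    }
    if system_text:
        kwargs["system"] = system_text  # omitted entirely when there is no system prompt
    return kwargs


example_kwargs = build_request_kwargs(
    [
        {"role": "system", "content": "You are an annotation assistant."},
        {"role": "user", "content": "Is this sentence sarcastic?"},
    ],
    model="example-model",  # placeholder, not a real model identifier
    max_tokens=256,
    temperature=0.1,
)
```

Keeping the split in a pure function makes the "last system message wins" behavior explicit and easy to test.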

potato/ai/anthropic_vision_endpoint.py ADDED

	@@ -0,0 +1,405 @@

+"""
+Anthropic Vision AI Endpoint
+This module provides integration with Anthropic's Claude models for visual
+analysis using the image content block format.
+"""
+import base64
+import logging
+from typing import Any, Dict, List, Type, Union
+from pydantic import BaseModel
+from .ai_endpoint import AIEndpointRequestError, ImageData, ModelCapabilities
+from .visual_ai_endpoint import BaseVisualAIEndpoint
+logger = logging.getLogger(__name__)
+DEFAULT_MODEL = "claude-sonnet-4-20250514"
+# Supported image types for Claude
+SUPPORTED_MEDIA_TYPES = [
+    "image/jpeg",
+    "image/png",
+    "image/gif",
+    "image/webp"
+]
+class AnthropicVisionEndpoint(BaseVisualAIEndpoint):
+    """
+    Anthropic Vision endpoint for Claude models with vision capabilities.
+    Uses the image content block format for multimodal inputs.
+    Configuration options:
+    - model: Model to use (default: claude-sonnet-4-20250514)
+    - api_key: Anthropic API key (can also use ANTHROPIC_API_KEY env var)
+    - max_tokens: Maximum response tokens (default: 1024)
+    - temperature: Sampling temperature (default: 0.1)
+    """
+    # Capabilities declaration for Anthropic Claude vision models
+    # Claude models can understand images and generate detailed reasoning but bboxes are approximate
+    CAPABILITIES = ModelCapabilities(
+        text_generation=True,
+        vision_input=True,
+        bounding_box_output=False,  # Claude bboxes are approximate, not precise
+        text_classification=True,
+        image_classification=True,
+        rationale_generation=True,
+        keyword_extraction=False,  # Keywords don't apply to images
+    )
+    def _initialize_client(self) -> None:
+        """Initialize the Anthropic client."""
+        try:
+            import anthropic
+        except ImportError:
+            raise AIEndpointRequestError(
+                "anthropic package is required. Install it with: pip install anthropic"
+            )
+        import os
+        api_key = self.ai_config.get("api_key") or os.environ.get("ANTHROPIC_API_KEY")
+        if not api_key:
+            raise AIEndpointRequestError(
+                "Anthropic API key is required. Set it in config or ANTHROPIC_API_KEY env var."
+            )
+        timeout = self.ai_config.get("timeout", 60)
+        self.client = anthropic.Anthropic(api_key=api_key, timeout=timeout)
+        logger.info(f"Anthropic Vision client initialized with model: {self.model}")
+    def _get_default_model(self) -> str:
+        """Get the default Anthropic model."""
+        return DEFAULT_MODEL
+    def query(self, prompt: str, output_format: Type[BaseModel]) -> Any:
+        """
+        Standard text query without images.
+        Args:
+            prompt: Text prompt
+            output_format: Pydantic model for structured output
+        Returns:
+            Parsed response
+        """
+        try:
+            # Add JSON instruction to prompt
+            json_prompt = f"""{prompt}
+Please respond with valid JSON matching this schema:
+{output_format.model_json_schema()}"""
+            response = self.client.messages.create(
+                model=self.model,
+                max_tokens=self.max_tokens,
+                temperature=self.temperature,
+                messages=[{"role": "user", "content": json_prompt}],
+            )
+            content = response.content[0].text
+            return self.parseStringToJson(content)
+        except Exception as e:
+            raise AIEndpointRequestError(f"Anthropic query failed: {e}") from e
+    def query_with_image(
+        self,
+        prompt: str,
+        image_data: Union[ImageData, List[ImageData]],
+        output_format: Type[BaseModel]
+    ) -> Any:
+        """
+        Send a query with image(s) to Claude vision model.
+        Args:
+            prompt: Text prompt describing what to analyze
+            image_data: Single ImageData or list of ImageData
+            output_format: Pydantic model for structured output
+        Returns:
+            Parsed response according to output_format
+        Raises:
+            AIEndpointRequestError: If the request fails
+        """
+        try:
+            # Prepare images
+            images = [image_data] if isinstance(image_data, ImageData) else image_data
+            # Build content array with images first, then text
+            content = []
+            for img in images:
+                image_block = self._build_image_block(img)
+                content.append(image_block)
+            # Add JSON instruction to prompt
+            json_prompt = f"""{prompt}
+Please respond with valid JSON matching this schema:
+{output_format.model_json_schema()}
+Only return the JSON object, no other text."""
+            content.append({"type": "text", "text": json_prompt})
+            # Make request
+            response = self.client.messages.create(
+                model=self.model,
+                max_tokens=self.max_tokens,
+                temperature=self.temperature,
+                messages=[{"role": "user", "content": content}],
+            )
+            response_content = response.content[0].text
+            logger.debug(f"Anthropic vision response: {response_content[:500] if response_content else 'empty'}")
+            return self.parseStringToJson(response_content)
+        except AIEndpointRequestError:
+            raise
+        except Exception as e:
+            logger.error(f"Anthropic vision query failed: {e}", exc_info=True)
+            raise AIEndpointRequestError(f"Anthropic vision query failed: {e}") from e
+    def chat_query_with_image(
+        self,
+        messages: List[Dict[str, Any]],
+        images: Any = None,
+    ) -> str:
+        """
+        Multi-turn chat with interleaved images for vision-based agent loops.
+        Messages may have 'content' as a string (text only) or a list of
+        content blocks (text + image dicts in Anthropic format).
+        Args:
+            messages: List of message dicts with 'role' and 'content'.
+            images: Unused (images are inline in messages).
+        Returns:
+            The model's response as a plain text string.
+        """
+        try:
+            system = ""
+            api_messages = []
+            for msg in messages:
+                if msg["role"] == "system":
+                    system = msg["content"] if isinstance(msg["content"], str) else str(msg["content"])
+                else:
+                    api_messages.append({
+                        "role": msg["role"],
+                        "content": msg["content"],
+                    })
+            kwargs = {
+                "model": self.model,
+                "max_tokens": self.max_tokens,
+                "temperature": self.temperature,
+                "messages": api_messages,
+            }
+            if system:
+                kwargs["system"] = system
+            response = self.client.messages.create(**kwargs)
+            return response.content[0].text
+        except Exception as e:
+            logger.error(f"Anthropic vision chat query failed: {e}")
+            raise AIEndpointRequestError(f"Anthropic vision chat query failed: {e}") from e
+    def _build_image_block(self, image_data: ImageData) -> Dict[str, Any]:
+        """
+        Build image content block for Anthropic API.
+        Args:
+            image_data: ImageData object
+        Returns:
+            Dict with type: "image" and source content
+        """
+        if image_data.source == "url":
+            # Claude supports URL sources directly
+            return {
+                "type": "image",
+                "source": {
+                    "type": "url",
+                    "url": image_data.data
+                }
+            }
+        elif image_data.source == "base64":
+            # Determine media type
+            media_type = image_data.mime_type or "image/jpeg"
+            # Validate media type
+            if media_type not in SUPPORTED_MEDIA_TYPES:
+                logger.warning(f"Media type {media_type} may not be supported. Using image/jpeg.")
+                media_type = "image/jpeg"
+            return {
+                "type": "image",
+                "source": {
+                    "type": "base64",
+                    "media_type": media_type,
+                    "data": image_data.data
+                }
+            }
+        else:
+            raise AIEndpointRequestError(f"Unknown image source: {image_data.source}")
+    def analyze_image(
+        self,
+        image_path_or_url: str,
+        prompt: str,
+        output_format: Type[BaseModel] = None
+    ) -> Any:
+        """
+        Convenience method for analyzing a single image.
+        Args:
+            image_path_or_url: Path to image file or URL
+            prompt: Analysis prompt
+            output_format: Optional output format model
+        Returns:
+            Analysis result
+        """
+        # Prepare image data
+        if image_path_or_url.startswith(("http://", "https://")):
+            # Claude can use URLs directly
+            image_data = self.create_url_image_data(image_path_or_url)
+        else:
+            image_data = self.encode_image_to_base64(image_path_or_url)
+        # Use a generic format if not specified
+        if output_format is None:
+            from .prompt.models_module import GeneralHintFormat
+            output_format = GeneralHintFormat
+        return self.query_with_image(prompt, image_data, output_format)
+    def detect_objects(
+        self,
+        image_path_or_url: str,
+        labels: List[str] = None
+    ) -> Dict[str, Any]:
+        """
+        Detect objects in an image and return bounding boxes.
+        Args:
+            image_path_or_url: Path to image file or URL
+            labels: Optional list of labels to detect
+        Returns:
+            Dict with detections list
+        """
+        from .prompt.models_module import VisualDetectionFormat
+        labels_str = ", ".join(labels) if labels else "all visible objects"
+        prompt = f"""Analyze this image and detect objects. For each object, provide:
+1. The label (from: {labels_str})
+2. A bounding box with normalized coordinates (0-1 range)
+3. Confidence score (0-1)
+Return a JSON object with this exact structure:
+{{
+    "detections": [
+        {{
+            "label": "object_name",
+            "bbox": {{"x": 0.1, "y": 0.2, "width": 0.3, "height": 0.4}},
+            "confidence": 0.95
+        }}
+    ]
+}}
+Important:
+- Coordinates are normalized (0-1) where x,y is the top-left corner
+- x increases left to right, y increases top to bottom
+- width and height are also normalized (0-1)
+- Only include objects you can clearly identify
+- Estimate bounding boxes as accurately as possible"""
+        # Prepare image
+        if image_path_or_url.startswith(("http://", "https://")):
+            image_data = self.create_url_image_data(image_path_or_url)
+        else:
+            image_data = self.encode_image_to_base64(image_path_or_url)
+        return self.query_with_image(prompt, image_data, VisualDetectionFormat)
+    def get_annotation_hint(
+        self,
+        image_path_or_url: str,
+        task_description: str,
+        labels: List[str]
+    ) -> Dict[str, Any]:
+        """
+        Get a hint for annotating an image without revealing exact locations.
+        Args:
+            image_path_or_url: Path to image file or URL
+            task_description: Description of the annotation task
+            labels: Available labels
+        Returns:
+            Dict with hint text and optional suggested label
+        """
+        labels_str = ", ".join(labels)
+        prompt = f"""You are helping an annotator with this task: {task_description}
+Available labels: {labels_str}
+Provide a helpful hint that guides the annotator without giving away the exact answer.
+The hint should:
+1. Point out relevant features to consider
+2. Suggest what to look for
+3. Not explicitly state the answer or exact locations
+Return JSON:
+{{
+    "hint": "Your helpful hint here",
+    "suggested_focus": "What area or aspect to focus on"
+}}"""
+        # Prepare image
+        if image_path_or_url.startswith(("http://", "https://")):
+            image_data = self.create_url_image_data(image_path_or_url)
+        else:
+            image_data = self.encode_image_to_base64(image_path_or_url)
+        class HintFormat(BaseModel):
+            hint: str
+            suggested_focus: str
+        return self.query_with_image(prompt, image_data, HintFormat)
+    def health_check(self) -> bool:
+        """
+        Check if the Anthropic API is accessible.
+        Returns:
+            True if API is reachable, False otherwise
+        """
+        try:
+            # Simple test message
+            self.client.messages.create(
+                model=self.model,
+                max_tokens=10,
+                messages=[{"role": "user", "content": "Hello"}]
+            )
+            return True
+        except Exception as e:
+            logger.error(f"Anthropic health check failed: {e}")
+            return False
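The detection prompt in this file asks the model for boxes normalized to 0-1, with x,y at the top-left corner. A minimal sketch of turning one such box into pixel coordinates, assuming the image dimensions are known. `bbox_to_pixels` is a hypothetical helper for illustration and is not part of this commit.

```python
from typing import Dict, Tuple


def bbox_to_pixels(bbox: Dict[str, float], img_w: int, img_h: int) -> Tuple[int, int, int, int]:
    """Convert a normalized x/y/width/height box to (left, top, right, bottom) pixels."""

    def clamp(v: float) -> float:
        # Guard against model output that falls slightly outside [0, 1].
        return min(1.0, max(0.0, v))

    x0 = clamp(bbox["x"])
    y0 = clamp(bbox["y"])
    x1 = clamp(bbox["x"] + bbox["width"])
    y1 = clamp(bbox["y"] + bbox["height"])
    return (round(x0 * img_w), round(y0 * img_h), round(x1 * img_w), round(y1 * img_h))


# Same box as the example in the prompt, applied to a 1000x500 image.
box = {"x": 0.1, "y": 0.2, "width": 0.3, "height": 0.4}
pixels = bbox_to_pixels(box, img_w=1000, img_h=500)
```

Clamping before scaling keeps a box that overshoots the image edge inside the frame rather than producing negative or out-of-range pixel values.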

potato/ai/gemini_endpoint.py ADDED

	@@ -0,0 +1,58 @@

+"""
+Google Gemini AI endpoint implementation.
+This module provides integration with Google's Gemini API for LLM inference.
+"""
+from google import genai
+from .ai_endpoint import BaseAIEndpoint, AIEndpointRequestError
+DEFAULT_MODEL = "gemini-2.0-flash-exp"
+class GeminiEndpoint(BaseAIEndpoint):
+    """Google Gemini endpoint for cloud-based LLM inference."""
+    def _initialize_client(self) -> None:
+        """Initialize the Gemini client."""
+        api_key = self.ai_config.get("api_key", "")
+        if not api_key:
+            raise AIEndpointRequestError("Gemini API key is required")
+        # Default timeout of 30 seconds, configurable via ai_config.
+        # google-genai's HttpOptions.timeout is expressed in milliseconds.
+        timeout = self.ai_config.get("timeout", 30)
+        self.client = genai.Client(
+            api_key=api_key,
+            http_options={'timeout': int(timeout * 1000)}
+        )
+    def _get_default_model(self) -> str:
+        """Get the default Gemini model."""
+        return DEFAULT_MODEL
+    def query(self, prompt: str, prompt_format: dict) -> str:
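+        """
+        Send a query to Gemini and return the response.
+        Args:
+            prompt: The prompt to send to the model
+            prompt_format: Pydantic model class describing the expected JSON
+                response; its JSON schema is sent as the response schema
+        Returns:
+            The model's response as a string
+        Raises:
+            AIEndpointRequestError: If the request fails
+        """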
+        try:
+            # The google-genai client takes generation settings via `config`;
+            # a response schema requires a JSON response MIME type.
+            response = self.client.models.generate_content(
+                model=self.model,
+                contents=prompt,
+                config={
+                    'max_output_tokens': self.max_tokens,
+                    'temperature': self.temperature,
+                    'response_mime_type': 'application/json',
+                    'response_schema': prompt_format.model_json_schema(),
+                }
+            )
+            return response.text
+        except Exception as e:
+            raise AIEndpointRequestError(f"Gemini request failed: {e}") from e

potato/ai/huggingface_endpoint.py ADDED

	@@ -0,0 +1,62 @@

+"""
+Hugging Face AI endpoint implementation.
+This module provides integration with Hugging Face's Inference API for LLM inference.
+"""
+from huggingface_hub import InferenceClient
+from .ai_endpoint import BaseAIEndpoint, AIEndpointRequestError
+DEFAULT_MODEL = "meta-llama/Llama-3.2-3B-Instruct"
+class HuggingfaceEndpoint(BaseAIEndpoint):
+    """Hugging Face endpoint for cloud-based LLM inference."""
+    def _initialize_client(self) -> None:
+        """Initialize the Hugging Face client."""
+        api_key = self.ai_config.get("api_key", "")
+        if not api_key:
+            raise AIEndpointRequestError("Hugging Face API key is required")
+        # Default timeout of 30 seconds, configurable via ai_config
+        timeout = self.ai_config.get("timeout", 30)
+        self.client = InferenceClient(
+            model=self.model,
+            token=api_key,
+            timeout=timeout
+        )
+    def _get_default_model(self) -> str:
+        """Get the default Hugging Face model."""
+        return DEFAULT_MODEL
+    def query(self, prompt: str, output_format: dict) -> str:
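+        """
+        Send a query to Hugging Face and return the response.
+        Args:
+            prompt: The prompt to send to the model
+            output_format: Pydantic model class describing the expected JSON
+                response; its JSON schema is sent as the response format
+        Returns:
+            The model's response as a string
+        Raises:
+            AIEndpointRequestError: If the request fails
+        """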
+        try:
+            response = self.client.chat_completion(
+                messages=[{"role": "user", "content": prompt}],
+                max_tokens=self.max_tokens,
+                temperature=self.temperature,
+                response_format={
+                    "type": "json_schema",
+                    "json_schema": {
+                        "name": "output_format",
+                        "schema": output_format.model_json_schema(),
+                        "strict": True,
+                    }
+                }
+            )
+            return response.choices[0].message.content
+        except Exception as e:
+            raise AIEndpointRequestError(f"Hugging Face request failed: {e}") from e

potato/ai/icl_labeler.py ADDED

	@@ -0,0 +1,1110 @@

+"""
+In-Context Learning (ICL) Labeler Module
+This module provides AI-assisted labeling using high-confidence human annotations
+as in-context examples to prompt an LLM to label remaining data.
+Key features:
+- Identifies high-confidence examples where annotators agree
+- Uses examples as in-context demonstrations for LLM labeling
+- Tracks LLM confidence scores on predictions
+- Routes subset of LLM labels to humans for verification
+- Calculates and reports LLM accuracy based on verification
+"""
+import json
+import logging
+import os
+import random
+import threading
+import time
+from collections import Counter, defaultdict
+from dataclasses import dataclass, field, asdict
+from datetime import datetime
+from typing import Dict, List, Optional, Any, Tuple, Set
+logger = logging.getLogger(__name__)
+@dataclass
+class HighConfidenceExample:
+    """A human-annotated example suitable for in-context learning."""
+    instance_id: str
+    text: str
+    schema_name: str
+    label: str
+    agreement_score: float  # Proportion of annotators who chose this label
+    annotator_count: int
+    timestamp: datetime = field(default_factory=datetime.now)
+    def to_dict(self) -> Dict[str, Any]:
+        """Serialize to dictionary."""
+        return {
+            'instance_id': self.instance_id,
+            'text': self.text,
+            'schema_name': self.schema_name,
+            'label': self.label,
+            'agreement_score': self.agreement_score,
+            'annotator_count': self.annotator_count,
+            'timestamp': self.timestamp.isoformat()
+        }
+    @classmethod
+    def from_dict(cls, data: Dict[str, Any]) -> 'HighConfidenceExample':
+        """Deserialize from dictionary."""
+        timestamp = data.get('timestamp')
+        if isinstance(timestamp, str):
+            timestamp = datetime.fromisoformat(timestamp)
+        elif timestamp is None:
+            timestamp = datetime.now()
+        return cls(
+            instance_id=data['instance_id'],
+            text=data['text'],
+            schema_name=data['schema_name'],
+            label=data['label'],
+            agreement_score=data['agreement_score'],
+            annotator_count=data['annotator_count'],
+            timestamp=timestamp
+        )
+@dataclass
+class ICLPrediction:
+    """Record of an LLM prediction using in-context learning."""
+    instance_id: str
+    schema_name: str
+    predicted_label: str
+    confidence_score: float  # 0.0-1.0
+    timestamp: datetime = field(default_factory=datetime.now)
+    # In-context examples used
+    example_instance_ids: List[str] = field(default_factory=list)
+    # Verification tracking
+    verification_status: str = 'pending'  # 'pending', 'verified_correct', 'verified_incorrect'
+    verified_by: Optional[str] = None
+    verified_at: Optional[datetime] = None
+    human_label: Optional[str] = None  # Human's label if verified
+    # LLM metadata
+    model_name: str = ""
+    reasoning: str = ""
+    def to_dict(self) -> Dict[str, Any]:
+        """Serialize to dictionary."""
+        return {
+            'instance_id': self.instance_id,
+            'schema_name': self.schema_name,
+            'predicted_label': self.predicted_label,
+            'confidence_score': self.confidence_score,
+            'timestamp': self.timestamp.isoformat(),
+            'example_instance_ids': self.example_instance_ids,
+            'verification_status': self.verification_status,
+            'verified_by': self.verified_by,
+            'verified_at': self.verified_at.isoformat() if self.verified_at else None,
+            'human_label': self.human_label,
+            'model_name': self.model_name,
+            'reasoning': self.reasoning
+        }
+    @classmethod
+    def from_dict(cls, data: Dict[str, Any]) -> 'ICLPrediction':
+        """Deserialize from dictionary."""
+        timestamp = data.get('timestamp')
+        if isinstance(timestamp, str):
+            timestamp = datetime.fromisoformat(timestamp)
+        elif timestamp is None:
+            timestamp = datetime.now()
+        verified_at = data.get('verified_at')
+        if isinstance(verified_at, str):
+            verified_at = datetime.fromisoformat(verified_at)
+        return cls(
+            instance_id=data['instance_id'],
+            schema_name=data['schema_name'],
+            predicted_label=data['predicted_label'],
+            confidence_score=data['confidence_score'],
+            timestamp=timestamp,
+            example_instance_ids=data.get('example_instance_ids', []),
+            verification_status=data.get('verification_status', 'pending'),
+            verified_by=data.get('verified_by'),
+            verified_at=verified_at,
+            human_label=data.get('human_label'),
+            model_name=data.get('model_name', ''),
+            reasoning=data.get('reasoning', '')
+        )
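Both dataclasses above persist `datetime` fields as ISO 8601 strings and restore them in `from_dict`. A minimal sketch of that round trip through JSON, using a simplified stand-in class (`PredictionRecord` is hypothetical and carries only a few of the fields of `ICLPrediction`):

```python
import json
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Optional


@dataclass
class PredictionRecord:
    """Simplified stand-in for ICLPrediction, for illustration only."""
    instance_id: str
    predicted_label: str
    timestamp: datetime = field(default_factory=datetime.now)
    verified_at: Optional[datetime] = None

    def to_dict(self) -> Dict[str, Any]:
        # datetime is not JSON-serializable, so store ISO 8601 strings.
        return {
            "instance_id": self.instance_id,
            "predicted_label": self.predicted_label,
            "timestamp": self.timestamp.isoformat(),
            "verified_at": self.verified_at.isoformat() if self.verified_at else None,
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "PredictionRecord":
        verified_at = data.get("verified_at")
        return cls(
            instance_id=data["instance_id"],
            predicted_label=data["predicted_label"],
            timestamp=datetime.fromisoformat(data["timestamp"]),
            verified_at=datetime.fromisoformat(verified_at) if verified_at else None,
        )


original = PredictionRecord("item_1", "positive", timestamp=datetime(2024, 1, 2, 3, 4, 5))
restored = PredictionRecord.from_dict(json.loads(json.dumps(original.to_dict())))
```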
+class ICLLabeler:
+    """
+    Manages in-context learning based labeling using high-confidence human annotations.
+    Workflow:
+    1. Monitors annotation progress for high-confidence examples
+    2. Periodically refreshes pool of high-confidence examples
+    3. Uses examples to prompt LLM for labeling unlabeled instances
+    4. Routes some LLM-labeled instances for human verification (blind)
+    5. Tracks accuracy metrics
+    """
+    _instance = None
+    _lock = threading.RLock()
+    def __new__(cls, *args, **kwargs):
+        """Singleton pattern."""
+        if cls._instance is None:
+            with cls._lock:
+                if cls._instance is None:
+                    cls._instance = super().__new__(cls)
+                    cls._instance._initialized = False
+        return cls._instance
+    def __init__(self, config: Optional[Dict[str, Any]] = None):
+        """
+        Initialize the ICLLabeler.
+        Args:
+            config: Configuration dictionary with settings
+        """
+        if self._initialized:
+            return
+        self.config = config or {}
+        self._ai_endpoint = None
+        # Get ICL labeling config
+        icl_config = self.config.get('icl_labeling', {})
+        # Example selection config
+        example_config = icl_config.get('example_selection', {})
+        self.min_agreement_threshold = example_config.get('min_agreement_threshold', 0.8)
+        self.min_annotators_per_instance = example_config.get('min_annotators_per_instance', 2)
+        self.max_examples_per_schema = example_config.get('max_examples_per_schema', 10)
+        self.example_refresh_interval = example_config.get('refresh_interval_seconds', 300)
+        # LLM labeling config
+        llm_config = icl_config.get('llm_labeling', {})
+        self.batch_size = llm_config.get('batch_size', 20)
+        self.trigger_threshold = llm_config.get('trigger_threshold', 5)
+        self.confidence_threshold = llm_config.get('confidence_threshold', 0.7)
+        self.batch_interval = llm_config.get('batch_interval_seconds', 600)
+        # Limits to prevent labeling entire dataset at once
+        # This allows iterative improvement - verify accuracy before labeling more
+        self.max_total_labels = llm_config.get('max_total_labels', None)  # Max instances to label total
+        self.max_unlabeled_ratio = llm_config.get('max_unlabeled_ratio', 0.5)  # Max fraction (0-1) of unlabeled instances to label
+        self.pause_on_low_accuracy = llm_config.get('pause_on_low_accuracy', True)
+        self.min_accuracy_threshold = llm_config.get('min_accuracy_threshold', 0.7)  # Pause if accuracy below
+        # Verification config
+        verification_config = icl_config.get('verification', {})
+        self.verification_enabled = verification_config.get('enabled', True)
+        self.verification_sample_rate = verification_config.get('sample_rate', 0.2)
+        self.verification_strategy = verification_config.get('selection_strategy', 'low_confidence')
+        # Persistence config
+        persistence_config = icl_config.get('persistence', {})
+        self.predictions_file = persistence_config.get('predictions_file', 'icl_predictions.json')
+        # State
+        self.schema_to_examples: Dict[str, List[HighConfidenceExample]] = {}
+        self.predictions: Dict[str, Dict[str, ICLPrediction]] = {}  # instance_id -> schema -> prediction
+        self.verification_queue: List[Tuple[str, str]] = []  # [(instance_id, schema_name), ...]
+        self.labeled_instance_ids: Set[str] = set()  # Instances labeled by LLM
+        self.last_example_refresh: Optional[datetime] = None
+        self.last_batch_run: Optional[datetime] = None
+        # Background worker
+        self._worker_thread: Optional[threading.Thread] = None
+        self._stop_worker = threading.Event()
+        self._initialized = True
+        logger.info("ICLLabeler initialized")
+    def _get_ai_endpoint(self):
+        """Get or create AI endpoint from config (reuses ai_support config)."""
+        if self._ai_endpoint is None:
+            from potato.ai.ai_endpoint import AIEndpointFactory
+            self._ai_endpoint = AIEndpointFactory.create_endpoint(self.config)
+        return self._ai_endpoint
+    def _get_annotation_schemes(self) -> List[Dict[str, Any]]:
+        """Get annotation schemes from config."""
+        return self.config.get('annotation_schemes', [])
+    def _get_text_key(self) -> str:
+        """Get the text key from item_properties."""
+        return self.config.get('item_properties', {}).get('text_key', 'text')
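For reference, the settings read in `__init__` can be pictured as the nested dict below. This is a sketch only: the key names and default values are copied from the `.get(...)` calls above, while the variable name `sample_config` and the choice to spell out every default are illustrative assumptions.

```python
# Hypothetical config dict mirroring the keys and defaults read by ICLLabeler.__init__.
sample_config = {
    "icl_labeling": {
        "example_selection": {
            "min_agreement_threshold": 0.8,
            "min_annotators_per_instance": 2,
            "max_examples_per_schema": 10,
            "refresh_interval_seconds": 300,
        },
        "llm_labeling": {
            "batch_size": 20,
            "trigger_threshold": 5,
            "confidence_threshold": 0.7,
            "batch_interval_seconds": 600,
            "max_total_labels": None,  # None means no absolute cap
            "max_unlabeled_ratio": 0.5,
            "pause_on_low_accuracy": True,
            "min_accuracy_threshold": 0.7,
        },
        "verification": {
            "enabled": True,
            "sample_rate": 0.2,
            "selection_strategy": "low_confidence",
        },
        "persistence": {
            "predictions_file": "icl_predictions.json",
        },
    }
}

# Same lookup pattern as the constructor: missing sections fall back to defaults.
llm_cfg = sample_config.get("icl_labeling", {}).get("llm_labeling", {})
batch_size = llm_cfg.get("batch_size", 20)
```

Because every lookup goes through `.get` with a default, any section or key may be omitted from a real config.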
+    # === High-Confidence Example Collection ===
+    def refresh_high_confidence_examples(self) -> Dict[str, List[HighConfidenceExample]]:
+        """
+        Scan annotations and identify high-confidence examples.
+        Returns:
+            Dictionary mapping schema name to list of high-confidence examples
+        """
+        from potato.flask_server import get_users, get_user_state, get_item_state_manager
+        with self._lock:
+            new_examples: Dict[str, List[HighConfidenceExample]] = defaultdict(list)
+            try:
+                ism = get_item_state_manager()
+                if ism is None:
+                    logger.warning("ItemStateManager not available")
+                    return new_examples
+                text_key = self._get_text_key()
+                schemas = self._get_annotation_schemes()
+                schema_names = [s.get('name') for s in schemas if s.get('name')]
+                # Collect all annotations per instance
+                instance_annotations: Dict[str, Dict[str, List[Tuple[str, Any]]]] = defaultdict(
+                    lambda: defaultdict(list)
+                )  # instance_id -> schema_name -> [(user_id, value), ...]
+                for username in get_users():
+                    user_state = get_user_state(username)
+                    if not user_state:
+                        continue
+                    all_annotations = user_state.get_all_annotations()
+                    for instance_id, instance_data in all_annotations.items():
+                        if 'labels' not in instance_data:
+                            continue
+                        for label, value in instance_data['labels'].items():
+                            schema_name = label.get_schema() if hasattr(label, 'get_schema') else str(label)
+                            if schema_name in schema_names:
+                                instance_annotations[instance_id][schema_name].append((username, value))
+                # Find high-confidence examples
+                for instance_id, schema_data in instance_annotations.items():
+                    for schema_name, annotations in schema_data.items():
+                        annotator_count = len(annotations)
+                        if annotator_count < self.min_annotators_per_instance:
+                            continue
+                        # Count votes per label
+                        label_counts = Counter(value for _, value in annotations)
+                        most_common_label, most_common_count = label_counts.most_common(1)[0]
+                        # Calculate agreement
+                        agreement_score = most_common_count / annotator_count
+                        if agreement_score >= self.min_agreement_threshold:
+                            # Get instance text
+                            item = ism.get_item(instance_id)
+                            instance_data = item.get_data() if item else None
+                            if instance_data is None:
+                                continue
+                            text = instance_data.get(text_key, '')
+                            if not text:
+                                continue
+                            example = HighConfidenceExample(
+                                instance_id=instance_id,
+                                text=text,
+                                schema_name=schema_name,
+                                label=str(most_common_label),
+                                agreement_score=agreement_score,
+                                annotator_count=annotator_count
+                            )
+                            new_examples[schema_name].append(example)
+                # When a schema has more candidates than max_examples_per_schema,
+                # pick a diverse subset (CoverICL-inspired); otherwise keep them all,
+                # sorted by agreement score
+                for schema_name in new_examples:
+                    candidates = new_examples[schema_name]
+                    if len(candidates) > self.max_examples_per_schema:
+                        selected = self._select_diverse_examples(
+                            candidates, self.max_examples_per_schema
+                        )
+                        new_examples[schema_name] = selected
+                    else:
+                        new_examples[schema_name].sort(
+                            key=lambda x: x.agreement_score, reverse=True
+                        )
+                self.schema_to_examples = dict(new_examples)
+                self.last_example_refresh = datetime.now()
+                total_examples = sum(len(examples) for examples in new_examples.values())
+                logger.info(f"Refreshed high-confidence examples: {total_examples} examples across {len(new_examples)} schemas")
+            except Exception as e:
+                logger.error(f"Error refreshing examples: {e}")
+            return self.schema_to_examples
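The agreement test at the heart of `refresh_high_confidence_examples` is a majority vote followed by two thresholds. A standalone sketch of just that step, assuming hashable label values; `majority_agreement` is a hypothetical helper and its defaults mirror the config defaults above:

```python
from collections import Counter
from typing import Hashable, List, Optional, Tuple


def majority_agreement(
    values: List[Hashable],
    min_annotators: int = 2,
    min_agreement: float = 0.8,
) -> Optional[Tuple[Hashable, float]]:
    """Return (majority_label, agreement) if the instance qualifies, else None."""
    if len(values) < min_annotators:
        return None
    label, count = Counter(values).most_common(1)[0]
    agreement = count / len(values)  # share of annotators who chose the majority label
    return (label, agreement) if agreement >= min_agreement else None


strong = majority_agreement(["pos", "pos", "pos", "pos", "neg"])  # 4 of 5 agree
split = majority_agreement(["pos", "neg"])                        # 1 of 2 agree
single = majority_agreement(["pos"])                              # too few annotators
```

With the default `min_annotators_per_instance` of 2, a two-annotator instance only qualifies when both annotators agree, since 1/2 is below the 0.8 threshold.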
+    def get_examples_for_schema(self, schema_name: str) -> List[HighConfidenceExample]:
+        """Get high-confidence examples for a specific schema."""
+        return self.schema_to_examples.get(schema_name, [])
+    def has_enough_examples(self, schema_name: str) -> bool:
+        """Check if we have enough examples to start labeling."""
+        return len(self.get_examples_for_schema(schema_name)) >= self.trigger_threshold
+    def _select_diverse_examples(
+        self,
+        candidates: List[HighConfidenceExample],
+        k: int,
+    ) -> List[HighConfidenceExample]:
+        """CoverICL-inspired coverage-based example selection.
+        Uses greedy farthest-point selection over TF-IDF vectors of the
+        example texts: each step picks the candidate whose minimum cosine
+        distance to the already-selected set, weighted by agreement score,
+        is largest, yielding diverse and representative ICL demonstrations.
+        Inspired by Mavromatis et al. (2024) CoverICL. Falls back to
+        agreement-score sorting if vectorization fails.
+        Args:
+            candidates: Pool of high-confidence examples
+            k: Number of examples to select
+        Returns:
+            List of selected examples maximizing coverage
+        """
+        if len(candidates) <= k:
+            return candidates
+        try:
+            from sklearn.feature_extraction.text import TfidfVectorizer
+            from sklearn.metrics.pairwise import cosine_distances
+            texts = [c.text for c in candidates]
+            vectorizer = TfidfVectorizer(max_features=5000)
+            features = vectorizer.fit_transform(texts).toarray()
+            # Greedy farthest-point selection: iteratively pick the candidate that
+            # maximizes the agreement-weighted minimum distance to already-selected examples
+            n = len(candidates)
+            selected_indices = []
+            # Start with the candidate that has highest agreement score
+            # (quality-weighted seed)
+            agreements = [c.agreement_score for c in candidates]
+            first = int(max(range(n), key=lambda i: agreements[i]))
+            selected_indices.append(first)
+            dist_matrix = cosine_distances(features)
+            for _ in range(k - 1):
+                # For each unselected candidate, compute min distance to selected set
+                best_idx = -1
+                best_score = -1.0
+                for i in range(n):
+                    if i in selected_indices:
+                        continue
+                    min_dist = min(dist_matrix[i][j] for j in selected_indices)
+                    # Weight by agreement score for quality-aware selection
+                    score = min_dist * candidates[i].agreement_score
+                    if score > best_score:
+                        best_score = score
+                        best_idx = i
+                if best_idx >= 0:
+                    selected_indices.append(best_idx)
+                else:
+                    break
+            selected = [candidates[i] for i in selected_indices]
+            logger.debug(f"CoverICL selection: {len(selected)} diverse examples from {n} candidates")
+            return selected
+        except Exception as e:
+            logger.warning(f"Coverage-based selection failed, using agreement sorting: {e}")
+            candidates.sort(key=lambda x: x.agreement_score, reverse=True)
+            return candidates[:k]
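The selection loop above can be illustrated without scikit-learn. The sketch below keeps the same greedy rule (seed with the highest-weight item, then repeatedly add the item with the largest weighted minimum distance to the selected set) but swaps TF-IDF for plain bag-of-words counts; that substitution and the names `cosine_distance` and `select_diverse` are assumptions made to keep the example dependency-free.

```python
import math
from collections import Counter
from typing import List


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def select_diverse(texts: List[str], weights: List[float], k: int) -> List[int]:
    """Return indices of up to k texts chosen by weighted farthest-point selection."""
    vecs = [Counter(t.lower().split()) for t in texts]
    n = len(texts)
    # Seed with the highest-weight item (first one wins ties).
    selected = [max(range(n), key=lambda i: weights[i])]
    while len(selected) < min(k, n):
        best = max(
            (i for i in range(n) if i not in selected),
            key=lambda i: weights[i] * min(cosine_distance(vecs[i], vecs[j]) for j in selected),
        )
        selected.append(best)
    return selected


texts = [
    "the movie was great",
    "the movie was great fun",
    "terrible service at the restaurant",
    "the restaurant food was great",
]
weights = [1.0, 0.9, 0.9, 0.8]  # stand-ins for agreement scores
chosen = select_diverse(texts, weights, k=2)
```

Here the near-duplicate second text is skipped in favor of the restaurant review, which shares only one word with the seed.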
+    # === LLM Labeling ===
+    def label_instance(
+        self,
+        instance_id: str,
+        schema_name: str,
+        instance_text: str
+    ) -> Optional[ICLPrediction]:
+        """
+        Label a single instance using in-context learning.
+        Args:
+            instance_id: The instance to label
+            schema_name: The annotation schema to use
+            instance_text: The text to label
+        Returns:
+            ICLPrediction if successful, None otherwise
+        """
+        from potato.ai.icl_prompt_builder import ICLPromptBuilder
+        examples = self.get_examples_for_schema(schema_name)
+        if not examples:
+            logger.warning(f"No examples available for schema {schema_name}")
+            return None
+        # Get schema info
+        schemas = self._get_annotation_schemes()
+        schema_info = next((s for s in schemas if s.get('name') == schema_name), None)
+        if not schema_info:
+            logger.warning(f"Schema {schema_name} not found in config")
+            return None
+        endpoint = self._get_ai_endpoint()
+        if endpoint is None:
+            logger.warning("AI endpoint not available")
+            return None
+        try:
+            # Build prompt
+            prompt_builder = ICLPromptBuilder()
+            prompt = prompt_builder.build_prompt(
+                schema=schema_info,
+                examples=examples,
+                target_text=instance_text
+            )
+            # Query LLM
+            from pydantic import BaseModel
+            class ICLResponse(BaseModel):
+                label: str
+                confidence: float
+                reasoning: str = ""
+            response = endpoint.query(prompt, ICLResponse)
+            # Parse response
+            if isinstance(response, str):
+                response_data = json.loads(response)
+            elif hasattr(response, 'model_dump'):
+                response_data = response.model_dump()
+            else:
+                response_data = response
+            predicted_label = response_data.get('label', '')
+            confidence = float(response_data.get('confidence', 0.5))
+            reasoning = response_data.get('reasoning', '')
+            # Validate label against schema
+            valid_labels = self._get_valid_labels(schema_info)
+            if valid_labels and predicted_label not in valid_labels:
+                # Try fuzzy matching
+                predicted_label = self._fuzzy_match_label(predicted_label, valid_labels)
+                if predicted_label is None:
+                    logger.warning(f"LLM returned invalid label for {instance_id}")
+                    return None
+            # Create prediction
+            prediction = ICLPrediction(
+                instance_id=instance_id,
+                schema_name=schema_name,
+                predicted_label=predicted_label,
+                confidence_score=min(1.0, max(0.0, confidence)),
+                example_instance_ids=[e.instance_id for e in examples],
+                model_name=endpoint.model if hasattr(endpoint, 'model') else '',
+                reasoning=reasoning
+            )
+            # Store prediction
+            with self._lock:
+                if instance_id not in self.predictions:
+                    self.predictions[instance_id] = {}
+                self.predictions[instance_id][schema_name] = prediction
+                self.labeled_instance_ids.add(instance_id)
+                # Maybe add to verification queue
+                if self.verification_enabled and random.random() < self.verification_sample_rate:
+                    self.verification_queue.append((instance_id, schema_name))
+            logger.debug(f"Labeled {instance_id} with {predicted_label} (confidence: {prediction.confidence_score:.2f})")
+            return prediction
+        except Exception as e:
+            logger.error(f"Error labeling instance {instance_id}: {e}")
+            return None
+    def _get_valid_labels(self, schema_info: Dict[str, Any]) -> List[str]:
+        """Extract valid labels from schema info."""
+        labels = schema_info.get('labels', [])
+        valid_labels = []
+        for label in labels:
+            if isinstance(label, str):
+                valid_labels.append(label)
+            elif isinstance(label, dict):
+                valid_labels.append(label.get('name', str(label)))
+        return valid_labels
+    def _fuzzy_match_label(self, predicted: str, valid_labels: List[str]) -> Optional[str]:
+        """Match predicted label to a valid label, ignoring case and surrounding whitespace."""
+        predicted_lower = predicted.lower().strip()
+        for label in valid_labels:
+            if label.lower().strip() == predicted_lower:
+                return label
+        return None
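Taken together, `label_instance` and its helpers parse the model's JSON, reconcile the label with the schema, and clamp the confidence to 0-1. A compact standalone sketch of that post-processing, assuming the response arrives as a JSON string. `normalize_prediction` is a hypothetical name, and falling back to 0.5 for a non-numeric confidence is an assumption (the source only defaults to 0.5 when the key is missing).

```python
import json
from typing import Any, Dict, List, Optional


def normalize_prediction(raw: str, valid_labels: List[str]) -> Optional[Dict[str, Any]]:
    """Parse an LLM JSON reply into {'label', 'confidence'} or return None if unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    # Case- and whitespace-insensitive match against the schema's labels.
    wanted = str(data.get("label", "")).lower().strip()
    label = next((v for v in valid_labels if v.lower().strip() == wanted), None)
    if label is None:
        return None

    try:
        confidence = float(data.get("confidence", 0.5))
    except (TypeError, ValueError):
        confidence = 0.5  # assumption: treat unparsable confidence like a missing one
    return {"label": label, "confidence": min(1.0, max(0.0, confidence))}


labels = ["positive", "negative"]
ok = normalize_prediction('{"label": " Positive ", "confidence": 1.7}', labels)
unknown = normalize_prediction('{"label": "neutral", "confidence": 0.9}', labels)
broken = normalize_prediction("not json", labels)
```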
+    def should_pause_labeling(self) -> Tuple[bool, str]:
+        """
+        Check if labeling should be paused based on limits and accuracy.
+        Returns:
+            Tuple of (should_pause, reason)
+        """
+        # Check if max total labels reached
+        if self.max_total_labels is not None:
+            current_count = len(self.labeled_instance_ids)
+            if current_count >= self.max_total_labels:
+                return True, f"Reached max_total_labels limit ({self.max_total_labels})"
+        # Check accuracy threshold
+        if self.pause_on_low_accuracy:
+            metrics = self.get_accuracy_metrics()
+            total_verified = metrics.get('total_verified', 0)
+            accuracy = metrics.get('accuracy')
+            # Only check accuracy if we have enough verifications
+            min_verifications = 10
+            if total_verified >= min_verifications and accuracy is not None:
+                if accuracy < self.min_accuracy_threshold:
+                    return True, f"Accuracy ({accuracy:.1%}) below threshold ({self.min_accuracy_threshold:.1%})"
+        return False, ""
+    def get_remaining_label_capacity(self) -> int:
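+        """
+        Get how many more instances can be labeled.
+        Returns:
+            Number of instances that can still be labeled, bounded by
+            max_unlabeled_ratio and, if set, max_total_labels. Returns 0 when
+            the item state manager is unavailable.
+        """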
+        from potato.item_state_management import get_item_state_manager
+        from potato.user_state_management import get_user_state_manager
+        try:
+            ism = get_item_state_manager()
+        except ValueError:
+            # ISM not initialized yet
+            return 0
+        if ism is None:
+            return 0
+        # Count unlabeled instances (not labeled by humans or LLM)
+        unlabeled_count = 0
+        usm = get_user_state_manager()
+        for instance_id in ism.instance_id_ordering:
+            if instance_id in self.labeled_instance_ids:
+                continue
+            has_human_annotation = False
+            for username in usm.get_all_users():
+                user_state = usm.get_user_state(username)
+                if user_state:
+                    all_annotations = user_state.get_all_annotations()
+                    if instance_id in all_annotations:
+                        has_human_annotation = True
+                        break
+            if not has_human_annotation:
+                unlabeled_count += 1
+        current_llm_labels = len(self.labeled_instance_ids)
+        # Calculate max based on ratio
+        max_from_ratio = int(unlabeled_count * self.max_unlabeled_ratio)
+        # Calculate max based on total limit
+        if self.max_total_labels is not None:
+            max_from_total = self.max_total_labels - current_llm_labels
+            return min(max_from_ratio, max_from_total)
+        return max_from_ratio
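The capacity calculation reduces to a small piece of arithmetic once the unlabeled count is known. A standalone sketch of that final step with a worked example; `remaining_capacity` is a hypothetical function that takes the counts as plain arguments instead of reading them from the state managers:

```python
from typing import Optional


def remaining_capacity(
    unlabeled_count: int,
    max_unlabeled_ratio: float,
    current_llm_labels: int,
    max_total_labels: Optional[int] = None,
) -> int:
    """Mirror the tail of get_remaining_label_capacity using plain numbers."""
    max_from_ratio = int(unlabeled_count * max_unlabeled_ratio)
    if max_total_labels is None:
        return max_from_ratio
    # With an absolute cap, the tighter of the two limits wins.
    return min(max_from_ratio, max_total_labels - current_llm_labels)


# 200 unlabeled items at ratio 0.5 allow 100; a cap of 80 with 30 used allows 50.
capped = remaining_capacity(200, 0.5, current_llm_labels=30, max_total_labels=80)
uncapped = remaining_capacity(200, 0.5, current_llm_labels=30)
```

The result can go negative once the cap has been exceeded, which `batch_label_instances` treats as no capacity through its `<= 0` check.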
+    def batch_label_instances(self, schema_name: str) -> List[ICLPrediction]:
+        """
+        Label multiple unlabeled instances for a schema.
+        Respects configured limits to prevent labeling entire dataset at once.
+        Returns:
+            List of successful predictions
+        """
+        from potato.flask_server import get_item_state_manager, get_users, get_user_state
+        # Check if we should pause labeling
+        should_pause, reason = self.should_pause_labeling()
+        if should_pause:
+            logger.info(f"Labeling paused: {reason}")
+            return []
+        if not self.has_enough_examples(schema_name):
+            logger.info(f"Not enough examples for schema {schema_name}")
+            return []
+        ism = get_item_state_manager()
+        if ism is None:
+            return []
+        # Check remaining capacity
+        remaining_capacity = self.get_remaining_label_capacity()
+        if remaining_capacity <= 0:
+            logger.info("No remaining label capacity")
+            return []
+        # Limit batch size to remaining capacity
+        effective_batch_size = min(self.batch_size, remaining_capacity)
+        text_key = self._get_text_key()
+        predictions = []
+        # Find unlabeled instances
+        unlabeled_ids = []
+        for instance_id in ism.instance_id_ordering:
+            # Skip if already labeled by LLM
+            if instance_id in self.labeled_instance_ids:
+                continue
+            # Skip if already annotated by humans
+            has_human_annotation = False
+            for username in get_users():
+                user_state = get_user_state(username)
+                if user_state:
+                    all_annotations = user_state.get_all_annotations()
+                    if instance_id in all_annotations:
+                        has_human_annotation = True
+                        break
+            if not has_human_annotation:
+                unlabeled_ids.append(instance_id)
+            if len(unlabeled_ids) >= effective_batch_size:
+                break
+        # Label instances
+        for instance_id in unlabeled_ids:
+            item = ism.get_item(instance_id)
+            instance_data = item.get_data() if item else None
+            if instance_data is None:
+                continue
+            text = instance_data.get(text_key, '')
+            if not text:
+                continue
+            prediction = self.label_instance(instance_id, schema_name, text)
+            if prediction and prediction.confidence_score >= self.confidence_threshold:
+                predictions.append(prediction)
+        self.last_batch_run = datetime.now()
+        logger.info(f"Batch labeled {len(predictions)} instances for schema {schema_name}")
+        return predictions
+    # === Verification Workflow ===
+    def get_pending_verifications(self, count: int = 1) -> List[Tuple[str, str]]:
+        """
+        Get instances pending human verification.
+        Args:
+            count: Number of verification tasks to return
+        Returns:
+            List of (instance_id, schema_name) tuples
+        """
+        with self._lock:
+            if self.verification_strategy == 'low_confidence':
+                # Sort by confidence ascending
+                pending = [
+                    (inst_id, schema)
+                    for inst_id, schema in self.verification_queue
+                    if (inst_id in self.predictions and
+                        schema in self.predictions[inst_id] and
+                        self.predictions[inst_id][schema].verification_status == 'pending')
+                ]
+                pending.sort(
+                    key=lambda x: self.predictions[x[0]][x[1]].confidence_score
+                )
+                return pending[:count]
+            elif self.verification_strategy == 'random':
+                pending = [
+                    (inst_id, schema)
+                    for inst_id, schema in self.verification_queue
+                    if (inst_id in self.predictions and
+                        schema in self.predictions[inst_id] and
+                        self.predictions[inst_id][schema].verification_status == 'pending')
+                ]
+                random.shuffle(pending)
+                return pending[:count]
+            else:  # mixed
+                pending = [
+                    (inst_id, schema)
+                    for inst_id, schema in self.verification_queue
+                    if (inst_id in self.predictions and
+                        schema in self.predictions[inst_id] and
+                        self.predictions[inst_id][schema].verification_status == 'pending')
+                ]
+                # First count // 2 slots: lowest-confidence items; remaining slots: random draw from the rest
+                pending.sort(
+                    key=lambda x: self.predictions[x[0]][x[1]].confidence_score
+                )
+                half = count // 2
+                low_conf = pending[:half]
+                rest = pending[half:]
+                random.shuffle(rest)
+                return low_conf + rest[:count - half]
+    def record_verification(
+        self,
+        instance_id: str,
+        schema_name: str,
+        human_label: str,
+        verified_by: str
+    ) -> bool:
+        """
+        Record human verification of an LLM prediction.
+        Args:
+            instance_id: The verified instance
+            schema_name: The schema verified
+            human_label: The human's label
+            verified_by: Username of verifier
+        Returns:
+            True if verification recorded successfully
+        """
+        with self._lock:
+            if instance_id not in self.predictions:
+                logger.warning(f"No prediction found for instance {instance_id}")
+                return False
+            if schema_name not in self.predictions[instance_id]:
+                logger.warning(f"No prediction found for schema {schema_name}")
+                return False
+            prediction = self.predictions[instance_id][schema_name]
+            prediction.human_label = human_label
+            prediction.verified_by = verified_by
+            prediction.verified_at = datetime.now()
+            if prediction.predicted_label == human_label:
+                prediction.verification_status = 'verified_correct'
+            else:
+                prediction.verification_status = 'verified_incorrect'
+            # Remove from verification queue
+            try:
+                self.verification_queue.remove((instance_id, schema_name))
+            except ValueError:
+                pass
+            logger.info(
+                f"Verification recorded for {instance_id}: "
+                f"predicted={prediction.predicted_label}, human={human_label}, "
+                f"status={prediction.verification_status}"
+            )
+            return True
+    # === Accuracy Tracking ===
+    def get_accuracy_metrics(self, schema_name: Optional[str] = None) -> Dict[str, Any]:
+        """
+        Calculate accuracy metrics from verified predictions.
+        Args:
+            schema_name: Optional schema to filter by
+        Returns:
+            Dictionary with accuracy metrics
+        """
+        with self._lock:
+            verified_correct = 0
+            verified_incorrect = 0
+            pending = 0
+            total_predictions = 0
+            confidence_correct = []
+            confidence_incorrect = []
+            for inst_id, schemas in self.predictions.items():
+                for s_name, prediction in schemas.items():
+                    if schema_name and s_name != schema_name:
+                        continue
+                    total_predictions += 1
+                    if prediction.verification_status == 'verified_correct':
+                        verified_correct += 1
+                        confidence_correct.append(prediction.confidence_score)
+                    elif prediction.verification_status == 'verified_incorrect':
+                        verified_incorrect += 1
+                        confidence_incorrect.append(prediction.confidence_score)
+                    else:
+                        pending += 1
+            total_verified = verified_correct + verified_incorrect
+            accuracy = verified_correct / total_verified if total_verified > 0 else None
+            avg_confidence_correct = (
+                sum(confidence_correct) / len(confidence_correct)
+                if confidence_correct else None
+            )
+            avg_confidence_incorrect = (
+                sum(confidence_incorrect) / len(confidence_incorrect)
+                if confidence_incorrect else None
+            )
+            return {
+                'total_predictions': total_predictions,
+                'verified_correct': verified_correct,
+                'verified_incorrect': verified_incorrect,
+                'pending_verification': pending,
+                'total_verified': total_verified,
+                'accuracy': accuracy,
+                'avg_confidence_correct': avg_confidence_correct,
+                'avg_confidence_incorrect': avg_confidence_incorrect,
+                'schema_name': schema_name
+            }
+    def get_status(self) -> Dict[str, Any]:
+        """Get overall ICL labeler status."""
+        with self._lock:
+            total_examples = sum(len(ex) for ex in self.schema_to_examples.values())
+            examples_by_schema = {
+                schema: len(examples)
+                for schema, examples in self.schema_to_examples.items()
+            }
+            # Check labeling status
+            should_pause, pause_reason = self.should_pause_labeling()
+            remaining_capacity = self.get_remaining_label_capacity()
+            return {
+                'enabled': self.config.get('icl_labeling', {}).get('enabled', False),
+                'total_examples': total_examples,
+                'examples_by_schema': examples_by_schema,
+                'total_predictions': sum(
+                    len(schemas) for schemas in self.predictions.values()
+                ),
+                'labeled_instances': len(self.labeled_instance_ids),
+                'verification_queue_size': len(self.verification_queue),
+                'last_example_refresh': (
+                    self.last_example_refresh.isoformat()
+                    if self.last_example_refresh else None
+                ),
+                'last_batch_run': (
+                    self.last_batch_run.isoformat()
+                    if self.last_batch_run else None
+                ),
+                'worker_running': (
+                    self._worker_thread is not None and
+                    self._worker_thread.is_alive()
+                ),
+                'accuracy_metrics': self.get_accuracy_metrics(),
+                # Labeling limits status
+                'labeling_paused': should_pause,
+                'pause_reason': pause_reason,
+                'remaining_label_capacity': remaining_capacity,
+                'max_total_labels': self.max_total_labels,
+                'max_unlabeled_ratio': self.max_unlabeled_ratio,
+                'min_accuracy_threshold': self.min_accuracy_threshold
+            }
+    # === Background Worker ===
+    def start_background_worker(self) -> None:
+        """Start the background worker thread."""
+        if self._worker_thread is not None and self._worker_thread.is_alive():
+            logger.warning("Background worker already running")
+            return
+        self._stop_worker.clear()
+        self._worker_thread = threading.Thread(
+            target=self._worker_loop,
+            name="ICLLabelerWorker",
+            daemon=True
+        )
+        self._worker_thread.start()
+        logger.info("Started ICL labeler background worker")
+    def stop_background_worker(self) -> None:
+        """Stop the background worker thread."""
+        if self._worker_thread is None:
+            return
+        self._stop_worker.set()
+        self._worker_thread.join(timeout=5.0)
+        self._worker_thread = None
+        logger.info("Stopped ICL labeler background worker")
+    def _worker_loop(self) -> None:
+        """Main loop for the background worker."""
+        logger.info(
+            f"ICL background worker started, "
+            f"example_refresh={self.example_refresh_interval}s, "
+            f"batch_interval={self.batch_interval}s"
+        )
+        last_example_refresh = 0
+        last_batch = 0
+        while not self._stop_worker.is_set():
+            try:
+                current_time = time.time()
+                # Refresh examples periodically
+                if current_time - last_example_refresh >= self.example_refresh_interval:
+                    self.refresh_high_confidence_examples()
+                    last_example_refresh = current_time
+                # Run batch labeling periodically
+                if current_time - last_batch >= self.batch_interval:
+                    schemas = self._get_annotation_schemes()
+                    for schema in schemas:
+                        schema_name = schema.get('name')
+                        if schema_name and self.has_enough_examples(schema_name):
+                            predictions = self.batch_label_instances(schema_name)
+                            if predictions:
+                                self.save_state()
+                    last_batch = current_time
+            except Exception as e:
+                logger.error(f"ICL background worker error: {e}")
+            # Wait for next interval or stop signal
+            self._stop_worker.wait(min(self.example_refresh_interval, self.batch_interval) / 2)
+    # === Persistence ===
+    def save_state(self) -> None:
+        """Save current state to disk."""
+        task_dir = self.config.get('output_annotation_dir', '')
+        if not task_dir:
+            return
+        filepath = os.path.join(task_dir, self.predictions_file)
+        try:
+            with self._lock:
+                state = {
+                    'predictions': {
+                        inst_id: {
+                            schema: pred.to_dict()
+                            for schema, pred in schemas.items()
+                        }
+                        for inst_id, schemas in self.predictions.items()
+                    },
+                    'examples': {
+                        schema: [ex.to_dict() for ex in examples]
+                        for schema, examples in self.schema_to_examples.items()
+                    },
+                    'verification_queue': self.verification_queue,
+                    'labeled_instance_ids': list(self.labeled_instance_ids),
+                    'last_example_refresh': (
+                        self.last_example_refresh.isoformat()
+                        if self.last_example_refresh else None
+                    ),
+                    'last_batch_run': (
+                        self.last_batch_run.isoformat()
+                        if self.last_batch_run else None
+                    )
+                }
+            # Atomic write
+            temp_path = filepath + '.tmp'
+            with open(temp_path, 'w') as f:
+                json.dump(state, f, indent=2)
+            os.replace(temp_path, filepath)
+            logger.debug(f"Saved ICL state to {filepath}")
+        except Exception as e:
+            logger.error(f"Error saving ICL state: {e}")
+    def load_state(self) -> None:
+        """Load state from disk."""
+        task_dir = self.config.get('output_annotation_dir', '')
+        if not task_dir:
+            return
+        filepath = os.path.join(task_dir, self.predictions_file)
+        if not os.path.exists(filepath):
+            return
+        try:
+            with open(filepath, 'r') as f:
+                state = json.load(f)
+            with self._lock:
+                # Load predictions
+                self.predictions = {}
+                for inst_id, schemas in state.get('predictions', {}).items():
+                    self.predictions[inst_id] = {
+                        schema: ICLPrediction.from_dict(pred_data)
+                        for schema, pred_data in schemas.items()
+                    }
+                # Load examples
+                self.schema_to_examples = {}
+                for schema, examples in state.get('examples', {}).items():
+                    self.schema_to_examples[schema] = [
+                        HighConfidenceExample.from_dict(ex) for ex in examples
+                    ]
+                # Load other state
+                self.verification_queue = [
+                    tuple(item) for item in state.get('verification_queue', [])
+                ]
+                self.labeled_instance_ids = set(state.get('labeled_instance_ids', []))
+                if state.get('last_example_refresh'):
+                    self.last_example_refresh = datetime.fromisoformat(
+                        state['last_example_refresh']
+                    )
+                if state.get('last_batch_run'):
+                    self.last_batch_run = datetime.fromisoformat(
+                        state['last_batch_run']
+                    )
+            logger.info(f"Loaded ICL state from {filepath}")
+        except Exception as e:
+            logger.error(f"Error loading ICL state: {e}")
+# Module-level singleton access
+_icl_labeler: Optional[ICLLabeler] = None
+def init_icl_labeler(config: Dict[str, Any]) -> ICLLabeler:
+    """Initialize the global ICL labeler."""
+    global _icl_labeler
+    _icl_labeler = ICLLabeler(config)
+    _icl_labeler.load_state()
+    return _icl_labeler
+def get_icl_labeler() -> Optional[ICLLabeler]:
+    """Get the global ICL labeler instance."""
+    return _icl_labeler
+def clear_icl_labeler() -> None:
+    """Clear the global ICL labeler (for testing)."""
+    global _icl_labeler
+    if _icl_labeler:
+        _icl_labeler.stop_background_worker()
+    _icl_labeler = None
+    ICLLabeler._instance = None
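
A minimal standalone sketch of the `mixed` verification strategy used by `get_pending_verifications` in the file above: sort pending predictions by confidence, give the first `count // 2` slots to the least confident ones, and fill the remaining slots with a random draw from what is left. The names `pick_mixed`, `Key`, and the `rng` parameter are hypothetical and exist only for this illustration; they are not part of `potato`.

```python
import random
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (instance_id, schema_name), hypothetical alias


def pick_mixed(
    confidences: Dict[Key, float],
    count: int,
    rng: Optional[random.Random] = None,
) -> List[Key]:
    """Return up to `count` keys: lowest-confidence half first, then random picks."""
    if rng is None:
        rng = random.Random()
    # Ascending sort puts the least confident predictions first.
    ordered = sorted(confidences, key=lambda k: confidences[k])
    half = count // 2
    low_conf = ordered[:half]
    # Everything not taken by the low-confidence half is eligible for the random draw.
    rest = ordered[half:]
    rng.shuffle(rest)
    return low_conf + rest[: count - half]
```

With `count=1` the integer division gives `half = 0`, so the single returned item is a purely random pick; passing a seeded `random.Random` makes the draw reproducible in tests.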

potato/ai/icl_prompt_builder.py ADDED

	@@ -0,0 +1,315 @@

+"""
+In-Context Learning Prompt Builder
+This module builds effective prompts for in-context learning based labeling.
+It formats high-confidence examples and target instances into prompts that
+elicit accurate label predictions with confidence scores.
+"""
+import json
+import logging
+import re
+from typing import Dict, List, Any, Tuple, Optional, TYPE_CHECKING
+if TYPE_CHECKING:
+    from potato.ai.icl_labeler import HighConfidenceExample
+logger = logging.getLogger(__name__)
+class ICLPromptBuilder:
+    """
+    Builds effective prompts for in-context learning.
+    The prompt structure:
+    1. System instructions explaining the task
+    2. Schema description and available labels
+    3. High-confidence examples with their labels
+    4. Target text to label
+    5. Output format instructions (JSON with label, confidence, reasoning)
+    """
+    def __init__(self, max_example_length: int = 500, max_target_length: int = 1000):
+        """
+        Initialize the prompt builder.
+        Args:
+            max_example_length: Maximum characters per example text
+            max_target_length: Maximum characters for target text
+        """
+        self.max_example_length = max_example_length
+        self.max_target_length = max_target_length
+    def build_prompt(
+        self,
+        schema: Dict[str, Any],
+        examples: List['HighConfidenceExample'],
+        target_text: str
+    ) -> str:
+        """
+        Build a complete ICL prompt.
+        Args:
+            schema: Annotation schema dictionary with name, description, labels
+            examples: List of high-confidence examples
+            target_text: The text to be labeled
+        Returns:
+            Complete prompt string
+        """
+        parts = []
+        # System instructions
+        parts.append(self._build_system_prompt(schema))
+        # Examples section
+        if examples:
+            parts.append("\n## Examples\n")
+            parts.append("Here are examples of correctly labeled texts:\n")
+            for i, example in enumerate(examples, 1):
+                parts.append(self._format_example(example, i))
+        # Target text section
+        parts.append("\n## Your Task\n")
+        parts.append("Now label the following text:\n")
+        parts.append(f'Text: "{self._truncate_text(target_text, self.max_target_length)}"\n')
+        # Output format instructions
+        parts.append(self._build_output_instructions(schema))
+        return "\n".join(parts)
+    def _build_system_prompt(self, schema: Dict[str, Any]) -> str:
+        """Build the system/instruction portion of the prompt."""
+        schema_name = schema.get('name', 'unknown')
+        description = schema.get('description', 'Label the text according to the schema.')
+        labels = self._get_labels_from_schema(schema)
+        annotation_type = schema.get('annotation_type', 'radio')
+        prompt = f"""You are an expert annotation assistant. Your task is to label text according to a specific annotation schema.
+## Schema: {schema_name}
+**Description:** {description}
+**Available Labels:** {', '.join(labels)}
+"""
+        # Add type-specific instructions
+        if annotation_type == 'radio':
+            prompt += "\n**Task Type:** Single-choice classification. Select exactly ONE label.\n"
+        elif annotation_type == 'multiselect':
+            prompt += "\n**Task Type:** Multi-label classification. Select ALL applicable labels.\n"
+        elif annotation_type == 'likert':
+            prompt += "\n**Task Type:** Rating scale. Choose the most appropriate rating.\n"
+        return prompt
+    def _format_example(self, example: 'HighConfidenceExample', index: int) -> str:
+        """Format a single example for the prompt."""
+        truncated_text = self._truncate_text(example.text, self.max_example_length)
+        return f"""
+### Example {index}
+Text: "{truncated_text}"
+Label: **{example.label}**
+(Agreement: {example.agreement_score:.0%} from {example.annotator_count} annotators)
+"""
+    def _build_output_instructions(self, schema: Dict[str, Any]) -> str:
+        """Build instructions for the expected output format."""
+        labels = self._get_labels_from_schema(schema)
+        labels_json = json.dumps(labels)
+        return f"""
+## Output Format
+Respond with a JSON object containing:
+- `label`: Your chosen label (must be one of: {labels_json})
+- `confidence`: Your confidence score from 0.0 to 1.0
+  - 1.0 = Absolutely certain
+  - 0.7-0.9 = High confidence
+  - 0.5-0.7 = Moderate confidence
+  - 0.3-0.5 = Low confidence
+  - 0.0-0.3 = Very uncertain
+- `reasoning`: Brief explanation for your choice (1-2 sentences)
+**Important:**
+- Only use labels from the provided list
+- Be honest about your confidence - reflect your actual certainty, not a fixed value
+- Base your decision on the examples and schema description
+- Use the full 0.0–1.0 range: reserve 0.9+ for near-certain cases, use 0.4–0.6 when genuinely unsure
+Example response (the confidence value here is illustrative only — yours should reflect actual certainty):
+```json
+{{"label": "example_label", "confidence": 0.72, "reasoning": "The text shows clear indicators of..."}}
+```
+Now provide your response as JSON:
+"""
+    def _get_labels_from_schema(self, schema: Dict[str, Any]) -> List[str]:
+        """Extract label names from schema definition."""
+        labels = schema.get('labels', [])
+        result = []
+        for label in labels:
+            if isinstance(label, str):
+                result.append(label)
+            elif isinstance(label, dict):
+                result.append(label.get('name', str(label)))
+        return result
+    def _truncate_text(self, text: str, max_length: int) -> str:
+        """Truncate text to max length, preserving word boundaries."""
+        if len(text) <= max_length:
+            return text
+        truncated = text[:max_length]
+        # Try to break at word boundary
+        last_space = truncated.rfind(' ')
+        if last_space > max_length * 0.8:
+            truncated = truncated[:last_space]
+        return truncated + "..."
+    def parse_response(
+        self,
+        response: str,
+        schema: Dict[str, Any]
+    ) -> Tuple[Optional[str], float, str]:
+        """
+        Parse the LLM response to extract label, confidence, and reasoning.
+        Args:
+            response: Raw response from LLM
+            schema: Schema for validation
+        Returns:
+            Tuple of (label, confidence, reasoning) or (None, 0.0, "") on failure
+        """
+        try:
+            # Extract a JSON object: direct parse first, then fenced or embedded JSON
+            data = self._extract_json(response)
+            if data:
+                label = data.get('label', '')
+                confidence = float(data.get('confidence', 0.5))
+                reasoning = data.get('reasoning', '')
+                # Validate label
+                valid_labels = self._get_labels_from_schema(schema)
+                if label in valid_labels:
+                    return label, min(1.0, max(0.0, confidence)), reasoning
+                # Try fuzzy matching
+                matched = self._fuzzy_match_label(label, valid_labels)
+                if matched:
+                    return matched, min(1.0, max(0.0, confidence)), reasoning
+            # Fallback: try to extract label from text
+            return self._extract_label_from_text(response, schema)
+        except Exception as e:
+            logger.warning(f"Error parsing response: {e}")
+            return None, 0.0, ""
+    def _extract_json(self, text: str) -> Optional[Dict[str, Any]]:
+        """Extract JSON from text, handling markdown code blocks."""
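+        # Try direct parse; only a JSON object (dict) is usable downstream
+        try:
+            parsed = json.loads(text)
+            if isinstance(parsed, dict):
+                return parsed
+        except json.JSONDecodeError:
+            pass
+        # Try JSON inside code blocks, then a bare object containing "label"
+        json_patterns = [
+            r'```json\s*(.*?)\s*```',
+            r'```\s*(.*?)\s*```',
+            r'\{[^{}]*"label"[^{}]*\}'
+        ]
+        for pattern in json_patterns:
+            match = re.search(pattern, text, re.DOTALL)
+            if match:
+                try:
+                    json_str = match.group(1) if match.lastindex else match.group(0)
+                    parsed = json.loads(json_str)
+                    # Skip lists, strings, and numbers so callers can rely on .get()
+                    if isinstance(parsed, dict):
+                        return parsed
+                except (json.JSONDecodeError, IndexError):
+                    continue
+        return None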
+    def _fuzzy_match_label(self, label: str, valid_labels: List[str]) -> Optional[str]:
+        """Try to match label with case-insensitive comparison."""
+        label_lower = label.lower().strip()
+        for valid in valid_labels:
+            if valid.lower().strip() == label_lower:
+                return valid
+        return None
+    def _extract_label_from_text(
+        self,
+        text: str,
+        schema: Dict[str, Any]
+    ) -> Tuple[Optional[str], float, str]:
+        """Fallback: try to extract label directly from text."""
+        valid_labels = self._get_labels_from_schema(schema)
+        text_lower = text.lower()
+        for label in valid_labels:
+            # Look for label mentioned in text
+            if label.lower() in text_lower:
+                return label, 0.5, "Extracted from response text (low confidence)"
+        return None, 0.0, ""
+class MultiSelectPromptBuilder(ICLPromptBuilder):
+    """
+    Specialized prompt builder for multi-select (multi-label) tasks.
+    """
+    def _build_output_instructions(self, schema: Dict[str, Any]) -> str:
+        """Build output instructions for multi-select."""
+        labels = self._get_labels_from_schema(schema)
+        labels_json = json.dumps(labels)
+        return f"""
+## Output Format
+Respond with a JSON object containing:
+- `labels`: Array of selected labels (from: {labels_json})
+- `confidence`: Your overall confidence score from 0.0 to 1.0 — reflect actual certainty, not a fixed value
+- `reasoning`: Brief explanation for your choices
+Example response (the confidence value here is illustrative only — yours should reflect actual certainty):
+```json
+{{"labels": ["label1", "label2"], "confidence": 0.65, "reasoning": "The text exhibits both..."}}
+```
+Now provide your response as JSON:
+"""
+    def parse_response(
+        self,
+        response: str,
+        schema: Dict[str, Any]
+    ) -> Tuple[Optional[List[str]], float, str]:
+        """Parse multi-select response."""
+        try:
+            data = self._extract_json(response)
+            if data:
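+                labels = data.get('labels', [])
+                # A bare string would otherwise be iterated character by character
+                if isinstance(labels, str):
+                    labels = [labels]
+                confidence = float(data.get('confidence', 0.5))
+                reasoning = data.get('reasoning', '')
+                valid_labels = self._get_labels_from_schema(schema)
+                validated = [lab for lab in labels if lab in valid_labels]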
+                if validated:
+                    return validated, min(1.0, max(0.0, confidence)), reasoning
+            return None, 0.0, ""
+        except Exception as e:
+            logger.warning(f"Error parsing multi-select response: {e}")
+            return None, 0.0, ""
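
The JSON fallback chain in `_extract_json` above (direct parse, then a fenced block, then a bare object containing `"label"`) can be tried in isolation. The sketch below is a standalone approximation rather than the module's code: `extract_json_object`, `FENCE`, and `_PATTERNS` are hypothetical names, and it keeps only JSON objects (dicts), since that is what `parse_response` needs in order to call `.get` on the result.

```python
import json
import re
from typing import Any, Dict, Optional

FENCE = "`" * 3  # three backticks, built indirectly to keep this listing readable

# Same three patterns as the source, tried in order of specificity.
_PATTERNS = [
    FENCE + r"json\s*(.*?)\s*" + FENCE,  # fenced block tagged as json
    FENCE + r"\s*(.*?)\s*" + FENCE,      # any fenced block
    r'\{[^{}]*"label"[^{}]*\}',          # bare, non-nested object with a "label" key
]


def extract_json_object(text: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object found in an LLM reply, or None."""
    candidates = [text]  # the whole reply is the first candidate
    for pattern in _PATTERNS:
        match = re.search(pattern, text, re.DOTALL)
        if match:
            # Patterns with a capture group yield the fence contents;
            # the bare-object pattern has no group, so use the full match.
            candidates.append(match.group(1) if match.lastindex else match.group(0))
    for candidate in candidates:
        try:
            data = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict):
            return data
    return None
```

Because the third pattern looks for the exact key `"label"`, a multi-select reply that embeds `{"labels": [...]}` in prose without a fence is not recovered by it; such replies need to be valid JSON on their own or fenced.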

potato/ai/judge.py ADDED

	@@ -0,0 +1,265 @@

+"""
+LLM-as-Judge service for human-alignment.
+Produces a judge verdict (label + confidence + reasoning) for an annotation
+instance, given the schema (labels + description + an editable rubric) and,
+optionally, few-shot examples drawn from high-agreement human labels. The
+verdicts are compared against human labels elsewhere
+(``potato/server_utils/judge_alignment.py``) to measure and calibrate
+human↔judge agreement (Cohen's κ).
+This deliberately does NOT reuse ``ICLLabeler`` as the judge — ICL auto-labels
+from inter-annotator agreement, which would leak the gold labels we are trying
+to measure the judge against. We only borrow ICLLabeler's *example selection*
+for few-shot calibration, and we always exclude the instance being judged from
+its own example set.
+The judge call goes through the same ``AIEndpointFactory`` / ``BaseAIEndpoint``
+machinery as every other AI feature (mirrors ``icl_labeler.label_instance``),
+so it works with any configured provider.
+"""
+import hashlib
+import json
+import logging
+from dataclasses import dataclass, field
+from typing import Any, Dict, List, Optional
+logger = logging.getLogger(__name__)
+@dataclass
+class JudgePrediction:
+    """A single LLM-judge verdict for one instance + schema."""
+    instance_id: str
+    schema_name: str
+    predicted_label: str
+    confidence: float  # 0.0–1.0
+    reasoning: str = ""
+    model_name: str = ""
+    prompt_version: str = ""
+    examples_used: List[str] = field(default_factory=list)
+    def to_dict(self) -> Dict[str, Any]:
+        return {
+            "instance_id": self.instance_id,
+            "schema_name": self.schema_name,
+            "predicted_label": self.predicted_label,
+            "confidence": self.confidence,
+            "reasoning": self.reasoning,
+            "model_name": self.model_name,
+            "prompt_version": self.prompt_version,
+            "examples_used": self.examples_used,
+        }
+    @classmethod
+    def from_dict(cls, data: Dict[str, Any]) -> "JudgePrediction":
+        return cls(
+            instance_id=data["instance_id"],
+            schema_name=data["schema_name"],
+            predicted_label=data.get("predicted_label", ""),
+            confidence=float(data.get("confidence", 0.0)),
+            reasoning=data.get("reasoning", ""),
+            model_name=data.get("model_name", ""),
+            prompt_version=data.get("prompt_version", ""),
+            examples_used=data.get("examples_used", []),
+        )
+def extract_labels(schema_info: Dict[str, Any]) -> List[str]:
+    """Return the allowed label names for a categorical schema.
+    Supports ``radio``/``select``/``multiselect`` (``labels`` list of
+    str|dict) and ``likert`` (1..size). Returns ``[]`` for unsupported types.
+    """
+    atype = schema_info.get("annotation_type", "")
+    if atype == "likert":
+        size = int(schema_info.get("size", 5))
+        return [str(i) for i in range(1, size + 1)]
+    labels = schema_info.get("labels", [])
+    out = []
+    for lab in labels:
+        if isinstance(lab, dict):
+            out.append(str(lab.get("name", "")))
+        else:
+            out.append(str(lab))
+    return [x for x in out if x]
+def compute_prompt_version(rubric: str, schema_name: str, few_shot: bool) -> str:
+    """Stable short hash identifying this judge configuration.
+    Editing the rubric (or toggling few-shot) yields a new version so the admin
+    report can track κ across prompt versions.
+    """
+    basis = f"{schema_name}␟{int(bool(few_shot))}␟{rubric or ''}"
+    return "v_" + hashlib.sha1(basis.encode("utf-8")).hexdigest()[:10]
+class JudgeService:
+    """Builds judge prompts and queries the configured AI endpoint."""
+    def __init__(self, config: Dict[str, Any]):
+        self.config = config or {}
+        self.judge_config = self.config.get("judge_alignment", {}) or {}
+        self._endpoint = None
+        self._endpoint_initialized = False
+    # ----- endpoint -------------------------------------------------------
+    def _get_endpoint(self):
+        if not self._endpoint_initialized:
+            self._endpoint_initialized = True
+            try:
+                from potato.ai.ai_endpoint import AIEndpointFactory
+                # The judge endpoint config lives under judge_alignment, but
+                # fall back to the task's ai_support so a single endpoint can
+                # serve both. Shape mirrors what AIEndpointFactory expects.
+                ai_support = self.judge_config.get("ai_support") or self.config.get("ai_support")
+                if not ai_support:
+                    logger.warning("Judge: no ai_support / judge_alignment.ai_support configured")
+                    return None
+                self._endpoint = AIEndpointFactory.create_endpoint({"ai_support": ai_support})
+            except Exception as e:
+                logger.error(f"Judge: failed to create endpoint: {e}")
+                self._endpoint = None
+        return self._endpoint
+    # ----- prompt ---------------------------------------------------------
+    def _schema_judge_config(self, schema_name: str) -> Dict[str, Any]:
+        per_schema = self.judge_config.get("schemas", {}) or {}
+        return per_schema.get(schema_name, {}) or {}
+    def get_rubric(self, schema_info: Dict[str, Any]) -> str:
+        """Editable rubric for a schema; falls back to its description."""
+        sc = self._schema_judge_config(schema_info.get("name", ""))
+        return sc.get("rubric") or schema_info.get("description", "") or ""
+    def build_prompt(
+        self,
+        schema_info: Dict[str, Any],
+        instance_text: str,
+        few_shot_examples: Optional[List[Dict[str, str]]] = None,
+    ) -> str:
+        """Compose the judge prompt.
+        few_shot_examples: list of {"text": ..., "label": ...} gold exemplars
+        (already excluding the target instance).
+        """
+        labels = extract_labels(schema_info)
+        rubric = self.get_rubric(schema_info)
+        parts = [
+            "You are an expert evaluator acting as an impartial judge.",
+            "Assign exactly one label to the item below, following the rubric.",
+            "",
+            f"Task: {schema_info.get('description', '')}".rstrip(),
+            f"Rubric: {rubric}".rstrip(),
+            "Allowed labels: " + ", ".join(labels) if labels else "",
+        ]
+        if few_shot_examples:
+            parts.append("\nExamples (item → correct label):")
+            for ex in few_shot_examples:
+                parts.append(f"- {_truncate(ex.get('text', ''))} → {ex.get('label', '')}")
+        parts.append("\nItem to judge:")
+        parts.append(_truncate(instance_text, 4000))
+        parts.append(
+            '\nRespond as JSON: {"label": <one of the allowed labels>, '
+            '"confidence": <0.0-1.0>, "reasoning": <one sentence>}.'
+        )
+        return "\n".join(p for p in parts if p != "")
+    # ----- judging --------------------------------------------------------
+    def judge_instance(
+        self,
+        instance_id: str,
+        schema_info: Dict[str, Any],
+        instance_text: str,
+        few_shot_examples: Optional[List[Dict[str, str]]] = None,
+        prompt_version: Optional[str] = None,
+    ) -> Optional[JudgePrediction]:
+        """Query the judge for one instance. Returns None on failure."""
+        endpoint = self._get_endpoint()
+        if endpoint is None:
+            return None
+        schema_name = schema_info.get("name", "")
+        valid_labels = extract_labels(schema_info)
+        rubric = self.get_rubric(schema_info)
+        if prompt_version is None:
+            prompt_version = compute_prompt_version(
+                rubric, schema_name, bool(few_shot_examples)
+            )
+        prompt = self.build_prompt(schema_info, instance_text, few_shot_examples)
+        try:
+            from pydantic import BaseModel
+            class JudgeVerdict(BaseModel):
+                label: str
+                confidence: float = 0.5
+                reasoning: str = ""
+            response = endpoint.query(prompt, JudgeVerdict)
+            if isinstance(response, str):
+                data = json.loads(response)
+            elif hasattr(response, "model_dump"):
+                data = response.model_dump()
+            elif hasattr(response, "dict"):
+                data = response.dict()
+            else:
+                data = response or {}
+        except Exception as e:
+            logger.error(f"Judge: query/parse failed for {instance_id}/{schema_name}: {e}")
+            return None
+        predicted = str(data.get("label", "")).strip()
+        try:
+            confidence = float(data.get("confidence", 0.5))
+        except (TypeError, ValueError):
+            confidence = 0.5
+        confidence = min(1.0, max(0.0, confidence))
+        reasoning = str(data.get("reasoning", ""))
+        if valid_labels and predicted not in valid_labels:
+            matched = _fuzzy_match_label(predicted, valid_labels)
+            if matched is None:
+                logger.warning(
+                    f"Judge: invalid label '{predicted}' for {instance_id}/{schema_name}"
+                )
+                return None
+            predicted = matched
+        return JudgePrediction(
+            instance_id=instance_id,
+            schema_name=schema_name,
+            predicted_label=predicted,
+            confidence=confidence,
+            reasoning=reasoning,
+            model_name=getattr(endpoint, "model", ""),
+            prompt_version=prompt_version,
+            examples_used=[e.get("id", "") for e in (few_shot_examples or []) if e.get("id")],
+        )
+def _truncate(text: str, limit: int = 300) -> str:
+    text = str(text or "")
+    return text if len(text) <= limit else text[:limit] + "…"
+def _fuzzy_match_label(predicted: str, valid_labels: List[str]) -> Optional[str]:
+    """Match a model label to an allowed label: case-insensitive exact match first, then substring match."""
+    if not predicted:
+        return None
+    low = predicted.lower().strip()
+    for lab in valid_labels:
+        if lab.lower() == low:
+            return lab
+    for lab in valid_labels:
+        ll = lab.lower()
+        if ll in low or low in ll:
+            return lab
+    return None
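
The substring pass in `_fuzzy_match_label` returns the first allowed label that contains, or is contained in, the model's answer, so overlapping label names can collide: with allowed labels listed as `toxic`, `not toxic`, an answer of `Not toxic.` resolves to `toxic`. A minimal sketch of one possible hardening follows. `match_label_longest_first` is a hypothetical helper, not part of this commit, and it only changes the order in which labels are tried.

```python
from typing import List, Optional


def match_label_longest_first(predicted: str, valid_labels: List[str]) -> Optional[str]:
    """Hypothetical variant of the judge's label matcher (not in the commit).

    Exact case-insensitive matches win; otherwise longer labels are tried
    first so that "not toxic" is preferred over its substring "toxic".
    """
    low = (predicted or "").lower().strip()
    if not low:
        return None
    # Pass 1: exact match, ignoring case and surrounding whitespace.
    for lab in valid_labels:
        if lab.lower() == low:
            return lab
    # Pass 2: substring match, longest label first to avoid shadowing.
    for lab in sorted(valid_labels, key=len, reverse=True):
        ll = lab.lower()
        if ll in low or low in ll:
            return lab
    return None


print(match_label_longest_first("Not toxic.", ["toxic", "not toxic"]))
```

Very short labels can still match unrelated text under any substring rule, so this ordering reduces collisions rather than eliminating them.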

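The versioning behavior described in `compute_prompt_version` can be checked in isolation. The sketch below copies the hashing scheme from the diff (SHA-1 over the schema name, the few-shot flag, and the rubric, joined by the `␟` separator and truncated to 10 hex characters). The function name `prompt_version` and the sample rubrics are made up for illustration.

```python
import hashlib


def prompt_version(rubric: str, schema_name: str, few_shot: bool) -> str:
    """Standalone copy of the hashing scheme shown in the diff."""
    basis = f"{schema_name}␟{int(bool(few_shot))}␟{rubric or ''}"
    return "v_" + hashlib.sha1(basis.encode("utf-8")).hexdigest()[:10]


# Illustrative inputs only; these rubrics are not from the repository.
base = prompt_version("Label hostile text as toxic.", "toxicity", False)
same = prompt_version("Label hostile text as toxic.", "toxicity", False)
edited = prompt_version("Label hostile or demeaning text as toxic.", "toxicity", False)
with_examples = prompt_version("Label hostile text as toxic.", "toxicity", True)

print(base == same)    # identical configuration, identical tag
print(base != edited)  # editing the rubric changes the tag
```

Because the few-shot flag is part of the hashed string, toggling examples on or off also produces a different tag, which keeps agreement statistics for the two setups separate.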
potato/ai/llm_active_learning.py ADDED

	@@ -0,0 +1,733 @@

+"""
+LLM Integration for Active Learning
+This module provides LLM-based active learning capabilities using vLLM endpoints.
+It implements confidence-based instance selection and prediction using large language
+models, with support for multiple confidence elicitation methods:
+- **logprobs**: Extract token-level log probabilities from vLLM/OpenAI-compatible
+  endpoints for calibrated confidence scores.
+- **verbalized**: Ask the LLM to self-report confidence on a 1-10 scale (default).
+- **consistency**: Query the same instance N times with temperature > 0 and use
+  agreement rate as confidence (does not require logprobs support).
+References:
+    Tian et al. (2023) "Just Ask for Calibration: Strategies for Eliciting
+    Calibrated Confidence Scores from Language Models Fine-Tuned with Human
+    Feedback." EMNLP 2023.
+    Xiong et al. (2024) "Can LLMs Express Their Uncertainty? An Empirical
+    Evaluation of Confidence Elicitation in LLMs." ICLR 2024.
+"""
+import logging
+import math
+import time
+import json
+import requests
+from collections import Counter
+from typing import Dict, List, Optional, Tuple, Any
+from dataclasses import dataclass, field
+from concurrent.futures import ThreadPoolExecutor, as_completed
+import numpy as np
+from potato.active_learning_manager import TrainingMetrics
+def _loads_lenient(content: str):
+    """json.loads that tolerates markdown code fences and surrounding prose.
+    Many models (e.g. Gemma on vLLM) wrap JSON in ```json ... ``` fences even
+    when response_format=json_object is requested, which breaks a naive
+    json.loads(). Strip fences and, failing that, parse the span from the
+    first ``{`` to the last ``}``.
+    """
+    import re
+    if content is None:
+        raise json.JSONDecodeError("empty content", "", 0)
+    s = content.strip()
+    # Strip a leading ```json / ``` fence and trailing ```
+    s = re.sub(r"^```(?:json|JSON)?\s*", "", s)
+    s = re.sub(r"\s*```$", "", s).strip()
+    try:
+        return json.loads(s)
+    except json.JSONDecodeError:
+        # Fall back to the span from the first '{' to the last '}' (greedy match).
+        m = re.search(r"\{.*\}", s, re.DOTALL)
+        if m:
+            return json.loads(m.group(0))
+        raise
+@dataclass
+class LLMPrediction:
+    """Result of an LLM prediction."""
+    instance_id: str
+    predicted_label: str
+    confidence_score: float
+    raw_response: str
+    error_message: Optional[str] = None
+    confidence_method: str = "verbalized"
+@dataclass
+class LLMConfig:
+    """Configuration for LLM integration."""
+    endpoint_url: str
+    model_name: str
+    max_tokens: int = 512
+    temperature: float = 0.1
+    timeout: int = 30
+    batch_size: int = 10
+    retry_attempts: int = 3
+    retry_delay: float = 1.0
+    max_instances_per_request: int = 5
+    confidence_method: str = "verbalized"  # logprobs | verbalized | consistency
+    consistency_samples: int = 3
+class LLMActiveLearning:
+    """
+    LLM-based active learning implementation.
+    This class provides methods for:
+    - Querying LLMs for predictions and confidence scores
+    - Batch processing of instances
+    - Error handling and retry logic
+    - Integration with the active learning pipeline
+    """
+    def __init__(self, config: LLMConfig):
+        self.config = config
+        self.logger = logging.getLogger(__name__)
+        self.session = requests.Session()
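+        # Note: requests.Session has no session-wide timeout setting, so this
+        # attribute is informational only; every request below passes timeout=.
+        self.session.timeout = config.timeout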
+        # Test connection on initialization
+        self._test_connection()
+    def _test_connection(self):
+        """Test the connection to the LLM endpoint."""
+        try:
+            test_payload = {
+                "model": self.config.model_name,
+                "messages": [{"role": "user", "content": "Hello"}],
+                "max_tokens": 10,
+                "temperature": 0.1
+            }
+            response = self.session.post(
+                self.config.endpoint_url,
+                json=test_payload,
+                timeout=5
+            )
+            if response.status_code == 200:
+                self.logger.info(f"Successfully connected to LLM endpoint: {self.config.endpoint_url}")
+            else:
+                self.logger.warning(f"LLM endpoint returned status {response.status_code}: {response.text}")
+        except Exception as e:
+            self.logger.error(f"Failed to connect to LLM endpoint: {e}")
+            # Don't raise - allow fallback to traditional methods
+    def predict_instances(self, instances: List[Dict[str, Any]],
+                         annotation_instructions: str,
+                         schema_name: str,
+                         label_options: List[str]) -> List[LLMPrediction]:
+        """
+        Predict labels and confidence scores for instances using LLM.
+        Args:
+            instances: List of instances to predict
+            annotation_instructions: Instructions for the annotation task
+            schema_name: Name of the annotation schema
+            label_options: Available label options
+        Returns:
+            List of LLM predictions with confidence scores
+        """
+        if not instances:
+            return []
+        self.logger.info(f"Starting LLM prediction for {len(instances)} instances")
+        # Create prompts for each instance
+        prompts = self._create_prompts(instances, annotation_instructions, schema_name, label_options)
+        # Process in batches
+        all_predictions = []
+        for i in range(0, len(prompts), self.config.batch_size):
+            batch_prompts = prompts[i:i + self.config.batch_size]
+            batch_instances = instances[i:i + self.config.batch_size]
+            batch_predictions = self._process_batch(batch_prompts, batch_instances)
+            all_predictions.extend(batch_predictions)
+            # Small delay between batches to avoid overwhelming the endpoint
+            if i + self.config.batch_size < len(prompts):
+                time.sleep(0.1)
+        self.logger.info(f"Completed LLM prediction for {len(all_predictions)} instances")
+        return all_predictions
+    def _create_prompts(self, instances: List[Dict[str, Any]],
+                       annotation_instructions: str,
+                       schema_name: str,
+                       label_options: List[str]) -> List[str]:
+        """Create prompts for LLM prediction."""
+        prompts = []
+        # Create the base prompt template
+        base_prompt = self._create_base_prompt(annotation_instructions, schema_name, label_options)
+        for instance in instances:
+            # Extract text content
+            text_content = self._extract_text_content(instance)
+            # Create instance-specific prompt
+            prompt = f"{base_prompt}\n\nText to annotate:\n{text_content}\n\nPlease provide your prediction and confidence score."
+            prompts.append(prompt)
+        return prompts
+    def _create_base_prompt(self, annotation_instructions: str,
+                           schema_name: str,
+                           label_options: List[str]) -> str:
+        """Create the base prompt for LLM prediction."""
+        prompt = f"""You are an expert annotator for a text classification task.
+Task: {annotation_instructions}
+Schema: {schema_name}
+Available labels: {', '.join(label_options)}
+For each text, please:
+1. Analyze the text carefully
+2. Choose the most appropriate label from the available options
+3. Provide a confidence score from 1 to 10 (where 1 = very uncertain, 10 = very confident)
+Please respond in the following JSON format:
+{{
+    "label": "chosen_label",
+    "confidence": confidence_score,
+    "reasoning": "brief explanation of your choice"
+}}
+Example response:
+{{
+    "label": "{label_options[0] if label_options else 'example'}",
+    "confidence": 8,
+    "reasoning": "The text clearly expresses positive sentiment based on the language used."
+}}"""
+        return prompt
+    def _extract_text_content(self, instance: Dict[str, Any]) -> str:
+        """Extract text content from an instance."""
+        # Try common text field names
+        text_fields = ['text', 'content', 'message', 'sentence', 'document']
+        for field_name in text_fields:
+            if field_name in instance:
+                content = instance[field_name]
+                if isinstance(content, str):
+                    return content
+                elif isinstance(content, dict):
+                    # Handle nested text fields
+                    for nested_field in text_fields:
+                        if nested_field in content:
+                            return str(content[nested_field])
+        # Fallback: convert the entire instance to string
+        return str(instance)
+    def _process_batch(self, prompts: List[str], instances: List[Dict[str, Any]]) -> List[LLMPrediction]:
+        """Process a batch of prompts."""
+        predictions = []
+        # Use ThreadPoolExecutor for parallel processing within the batch.
+        # Results are collected in completion order, not input order.
+        with ThreadPoolExecutor(max_workers=min(len(prompts), 5)) as executor:
+            future_to_index = {
+                executor.submit(self._predict_single, prompt, instances[i]): i
+                for i, prompt in enumerate(prompts)
+            }
+            for future in as_completed(future_to_index):
+                index = future_to_index[future]
+                try:
+                    prediction = future.result()
+                    predictions.append(prediction)
+                except Exception as e:
+                    self.logger.error(f"Error processing instance {index}: {e}")
+                    # Create error prediction
+                    error_prediction = LLMPrediction(
+                        instance_id=instances[index].get('id', f'instance_{index}'),
+                        predicted_label='',
+                        confidence_score=0.1,
+                        raw_response='',
+                        error_message=str(e)
+                    )
+                    predictions.append(error_prediction)
+        return predictions
+    def _predict_single(self, prompt: str, instance: Dict[str, Any]) -> LLMPrediction:
+        """Make a single prediction using the LLM.
+        Dispatches to the appropriate confidence method:
+        - logprobs: Extract token-level log probabilities
+        - consistency: Query N times, use agreement rate
+        - verbalized (default): Parse self-reported confidence from JSON
+        """
+        method = self.config.confidence_method
+        if method == "consistency":
+            return self._predict_consistency(prompt, instance)
+        elif method == "logprobs":
+            return self._predict_with_logprobs(prompt, instance)
+        else:
+            return self._predict_verbalized(prompt, instance)
+    def _predict_verbalized(self, prompt: str, instance: Dict[str, Any]) -> LLMPrediction:
+        """Original verbalized confidence method (1-10 scale)."""
+        instance_id = instance.get('id', 'unknown')
+        for attempt in range(self.config.retry_attempts):
+            try:
+                payload = {
+                    "model": self.config.model_name,
+                    "messages": [
+                        {"role": "system", "content": "You are a helpful assistant that provides structured JSON responses."},
+                        {"role": "user", "content": prompt}
+                    ],
+                    "max_tokens": self.config.max_tokens,
+                    "temperature": self.config.temperature,
+                    "response_format": {"type": "json_object"}
+                }
+                response = self.session.post(
+                    self.config.endpoint_url,
+                    json=payload,
+                    timeout=self.config.timeout
+                )
+                if response.status_code == 200:
+                    result = response.json()
+                    if 'choices' in result and len(result['choices']) > 0:
+                        content = result['choices'][0]['message']['content']
+                        try:
+                            parsed_response = _loads_lenient(content)
+                            predicted_label = parsed_response.get('label', '')
+                            confidence_score = parsed_response.get('confidence', 1)
+                            if not isinstance(confidence_score, (int, float)):
+                                # Non-numeric confidence: use the lowest value on the 0-1 scale.
+                                confidence_score = 0.1
+                            else:
+                                confidence_score = max(1, min(10, confidence_score)) / 10.0
+                            return LLMPrediction(
+                                instance_id=instance_id,
+                                predicted_label=predicted_label,
+                                confidence_score=confidence_score,
+                                raw_response=content,
+                                confidence_method="verbalized"
+                            )
+                        except json.JSONDecodeError as e:
+                            self.logger.warning(f"Failed to parse JSON response for instance {instance_id}: {e}")
+                            return self._extract_from_raw_response(content, instance_id)
+                    else:
+                        raise Exception(f"Invalid response format: {result}")
+                else:
+                    raise Exception(f"HTTP {response.status_code}: {response.text}")
+            except Exception as e:
+                self.logger.warning(f"Attempt {attempt + 1} failed for instance {instance_id}: {e}")
+                if attempt < self.config.retry_attempts - 1:
+                    time.sleep(self.config.retry_delay * (attempt + 1))
+                else:
+                    return LLMPrediction(
+                        instance_id=instance_id,
+                        predicted_label='',
+                        confidence_score=0.1,
+                        raw_response='',
+                        error_message=f"All attempts failed: {e}",
+                        confidence_method="verbalized"
+                    )
+        return LLMPrediction(
+            instance_id=instance_id,
+            predicted_label='',
+            confidence_score=0.1,
+            raw_response='',
+            error_message="Unknown error",
+            confidence_method="verbalized"
+        )
+    def _predict_with_logprobs(self, prompt: str, instance: Dict[str, Any]) -> LLMPrediction:
+        """Extract confidence from token-level log probabilities.
+        Requests logprobs=True from vLLM/OpenAI-compatible endpoints and
+        computes confidence as exp(mean_logprob) over all returned tokens.
+        Falls back to verbalized confidence if logprobs are unavailable.
+        """
+        instance_id = instance.get('id', 'unknown')
+        for attempt in range(self.config.retry_attempts):
+            try:
+                payload = {
+                    "model": self.config.model_name,
+                    "messages": [
+                        {"role": "system", "content": "You are a helpful assistant that provides structured JSON responses."},
+                        {"role": "user", "content": prompt}
+                    ],
+                    "max_tokens": self.config.max_tokens,
+                    "temperature": self.config.temperature,
+                    "response_format": {"type": "json_object"},
+                    "logprobs": True,
+                    "top_logprobs": 5,
+                }
+                response = self.session.post(
+                    self.config.endpoint_url,
+                    json=payload,
+                    timeout=self.config.timeout
+                )
+                if response.status_code == 200:
+                    result = response.json()
+                    if 'choices' not in result or len(result['choices']) == 0:
+                        raise Exception(f"Invalid response format: {result}")
+                    choice = result['choices'][0]
+                    content = choice['message']['content']
+                    # Parse label from JSON content
+                    try:
+                        parsed = _loads_lenient(content)
+                    except json.JSONDecodeError:
+                        return self._extract_from_raw_response(content, instance_id)
+                    predicted_label = parsed.get('label', '')
+                    # Try to extract logprobs
+                    # Some endpoints return "logprobs": null, so guard against None.
+                    logprobs_data = choice.get('logprobs') or {}
+                    token_logprobs = logprobs_data.get('content') or []
+                    if token_logprobs:
+                        # Compute mean logprob across all tokens
+                        log_probs = [
+                            t['logprob'] for t in token_logprobs
+                            if 'logprob' in t and t['logprob'] is not None
+                        ]
+                        if log_probs:
+                            mean_logprob = sum(log_probs) / len(log_probs)
+                            confidence_score = min(1.0, max(0.0, math.exp(mean_logprob)))
+                        else:
+                            # No valid logprobs, fall back to verbalized
+                            confidence_score = parsed.get('confidence', 5)
+                            if isinstance(confidence_score, (int, float)):
+                                confidence_score = max(1, min(10, confidence_score)) / 10.0
+                            else:
+                                confidence_score = 0.5
+                    else:
+                        # Endpoint didn't return logprobs, fall back to verbalized
+                        self.logger.debug(f"No logprobs returned for {instance_id}, using verbalized")
+                        confidence_score = parsed.get('confidence', 5)
+                        if isinstance(confidence_score, (int, float)):
+                            confidence_score = max(1, min(10, confidence_score)) / 10.0
+                        else:
+                            confidence_score = 0.5
+                    return LLMPrediction(
+                        instance_id=instance_id,
+                        predicted_label=predicted_label,
+                        confidence_score=confidence_score,
+                        raw_response=content,
+                        confidence_method="logprobs" if token_logprobs else "verbalized"
+                    )
+                else:
+                    raise Exception(f"HTTP {response.status_code}: {response.text}")
+            except Exception as e:
+                self.logger.warning(f"Logprobs attempt {attempt + 1} failed for {instance_id}: {e}")
+                if attempt < self.config.retry_attempts - 1:
+                    time.sleep(self.config.retry_delay * (attempt + 1))
+                else:
+                    return LLMPrediction(
+                        instance_id=instance_id,
+                        predicted_label='',
+                        confidence_score=0.1,
+                        raw_response='',
+                        error_message=f"All logprob attempts failed: {e}",
+                        confidence_method="logprobs"
+                    )
+        return LLMPrediction(
+            instance_id=instance_id,
+            predicted_label='',
+            confidence_score=0.1,
+            raw_response='',
+            error_message="Unknown error",
+            confidence_method="logprobs"
+        )
+    def _predict_consistency(self, prompt: str, instance: Dict[str, Any]) -> LLMPrediction:
+        """Consistency-based confidence: query N times, use agreement rate.
+        Works with any OpenAI-compatible chat endpoint, including ones that
+        do not return logprobs. Confidence = fraction of samples that agree on
+        the most common label.
+        """
+        instance_id = instance.get('id', 'unknown')
+        n_samples = self.config.consistency_samples
+        labels = []
+        raw_responses = []
+        for _ in range(n_samples):
+            try:
+                payload = {
+                    "model": self.config.model_name,
+                    "messages": [
+                        {"role": "system", "content": "You are a helpful assistant that provides structured JSON responses."},
+                        {"role": "user", "content": prompt}
+                    ],
+                    "max_tokens": self.config.max_tokens,
+                    "temperature": max(0.5, self.config.temperature),  # Need some randomness
+                    "response_format": {"type": "json_object"}
+                }
+                response = self.session.post(
+                    self.config.endpoint_url,
+                    json=payload,
+                    timeout=self.config.timeout
+                )
+                if response.status_code == 200:
+                    result = response.json()
+                    if 'choices' in result and len(result['choices']) > 0:
+                        content = result['choices'][0]['message']['content']
+                        raw_responses.append(content)
+                        try:
+                            parsed = _loads_lenient(content)
+                            labels.append(parsed.get('label', ''))
+                        except json.JSONDecodeError:
+                            pass
+            except Exception as e:
+                self.logger.debug(f"Consistency sample failed for {instance_id}: {e}")
+        if not labels:
+            return LLMPrediction(
+                instance_id=instance_id,
+                predicted_label='',
+                confidence_score=0.1,
+                raw_response='',
+                error_message="All consistency samples failed",
+                confidence_method="consistency"
+            )
+        # Most common label
+        label_counts = Counter(labels)
+        predicted_label, count = label_counts.most_common(1)[0]
+        confidence_score = count / len(labels)
+        return LLMPrediction(
+            instance_id=instance_id,
+            predicted_label=predicted_label,
+            confidence_score=confidence_score,
+            raw_response=raw_responses[0] if raw_responses else '',
+            confidence_method="consistency"
+        )
+    def _extract_from_raw_response(self, raw_response: str, instance_id: str) -> LLMPrediction:
+        """Extract prediction from raw response when JSON parsing fails."""
+        try:
+            # Try to find label and confidence in the raw text
+            lines = raw_response.lower().split('\n')
+            predicted_label = ''
+            confidence_score = 0.1
+            for line in lines:
+                if 'label' in line and ':' in line:
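+                    # Drop a trailing comma and quotes so JSON-like lines such as '"label": "x",' parse.
+                    label_part = line.split(':', 1)[1].strip().rstrip(',').strip().strip('"\'')
+                    if label_part:
+                        predicted_label = label_part
+                if 'confidence' in line and ':' in line:
+                    conf_part = line.split(':', 1)[1].strip().rstrip(',').strip().strip('"\'')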
+                    try:
+                        conf_value = float(conf_part)
+                        confidence_score = max(0.1, min(1.0, conf_value / 10.0))
+                    except ValueError:
+                        pass
+            return LLMPrediction(
+                instance_id=instance_id,
+                predicted_label=predicted_label,
+                confidence_score=confidence_score,
+                raw_response=raw_response
+            )
+        except Exception as e:
+            self.logger.error(f"Failed to extract from raw response for instance {instance_id}: {e}")
+            return LLMPrediction(
+                instance_id=instance_id,
+                predicted_label='',
+                confidence_score=0.1,
+                raw_response=raw_response,
+                error_message=f"Failed to extract prediction: {e}"
+            )
+    def calculate_confidence_distribution(self, predictions: List[LLMPrediction]) -> Dict[str, float]:
+        """Calculate confidence score distribution from predictions."""
+        if not predictions:
+            return {}
+        # Filter out predictions with errors
+        valid_predictions = [p for p in predictions if p.error_message is None]
+        if not valid_predictions:
+            return {}
+        confidence_scores = [p.confidence_score for p in valid_predictions]
+        # Create histogram bins
+        bins = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
+        hist, _ = np.histogram(confidence_scores, bins=bins)
+        # Convert to percentages
+        total = len(confidence_scores)
+        distribution = {}
+        for i, count in enumerate(hist):
+            bin_label = f"{bins[i]:.1f}-{bins[i+1]:.1f}"
+            distribution[bin_label] = (count / total) * 100 if total > 0 else 0
+        return distribution
+    def get_prediction_stats(self, predictions: List[LLMPrediction]) -> Dict[str, Any]:
+        """Get statistics about the predictions."""
+        if not predictions:
+            return {
+                "total_predictions": 0,
+                "successful_predictions": 0,
+                "error_rate": 0.0,
+                "average_confidence": 0.0,
+                "confidence_distribution": {}
+            }
+        total = len(predictions)
+        successful = len([p for p in predictions if p.error_message is None])
+        error_rate = (total - successful) / total if total > 0 else 0.0
+        valid_predictions = [p for p in predictions if p.error_message is None]
+        average_confidence = np.mean([p.confidence_score for p in valid_predictions]) if valid_predictions else 0.0
+        confidence_distribution = self.calculate_confidence_distribution(predictions)
+        return {
+            "total_predictions": total,
+            "successful_predictions": successful,
+            "error_rate": error_rate,
+            "average_confidence": average_confidence,
+            "confidence_distribution": confidence_distribution
+        }
+class MockLLMActiveLearning(LLMActiveLearning):
+    """
+    Mock LLM implementation for testing and development.
+    This class provides realistic mock responses for testing active learning
+    without requiring an actual LLM endpoint.
+    """
+    def __init__(self, config: LLMConfig):
+        super().__init__(config)
+        self.logger.info("Using Mock LLM for active learning")
+        # Mock response patterns
+        self._mock_responses = [
+            {"label": "positive", "confidence": 8, "reasoning": "Clear positive sentiment"},
+            {"label": "negative", "confidence": 7, "reasoning": "Negative tone detected"},
+            {"label": "neutral", "confidence": 6, "reasoning": "Balanced perspective"},
+            {"label": "positive", "confidence": 9, "reasoning": "Very positive language"},
+            {"label": "negative", "confidence": 5, "reasoning": "Somewhat negative"},
+            {"label": "neutral", "confidence": 4, "reasoning": "Mixed signals"},
+            {"label": "positive", "confidence": 3, "reasoning": "Uncertain positive"},
+            {"label": "negative", "confidence": 8, "reasoning": "Clearly negative"},
+            {"label": "neutral", "confidence": 7, "reasoning": "Neutral stance"},
+            {"label": "positive", "confidence": 6, "reasoning": "Moderately positive"}
+        ]
+        self._response_index = 0
+    def _test_connection(self):
+        """Mock connection test."""
+        self.logger.info("Mock LLM connection test successful")
+    def _predict_single(self, prompt: str, instance: Dict[str, Any]) -> LLMPrediction:
+        """Make a mock prediction."""
+        instance_id = instance.get('id', 'unknown')
+        # Simulate processing time
+        time.sleep(0.1)
+        # Get next mock response
+        mock_response = self._mock_responses[self._response_index % len(self._mock_responses)]
+        self._response_index += 1
+        # Add some randomness to confidence scores
+        confidence_variation = np.random.normal(0, 0.1)
+        confidence_score = max(0.1, min(1.0, (mock_response['confidence'] / 10.0) + confidence_variation))
+        return LLMPrediction(
+            instance_id=instance_id,
+            predicted_label=mock_response['label'],
+            confidence_score=confidence_score,
+            raw_response=json.dumps(mock_response)
+        )
+def create_llm_active_learning(config: Dict[str, Any]) -> LLMActiveLearning:
+    """
+    Factory function to create LLM active learning instance.
+    Args:
+        config: LLM configuration dictionary
+    Returns:
+        LLMActiveLearning: Configured LLM active learning instance
+    """
+    llm_config = LLMConfig(
+        endpoint_url=config.get('endpoint_url', ''),
+        model_name=config.get('model_name', ''),
+        max_tokens=config.get('max_tokens', 512),
+        temperature=config.get('temperature', 0.1),
+        timeout=config.get('timeout', 30),
+        batch_size=config.get('batch_size', 10),
+        retry_attempts=config.get('retry_attempts', 3),
+        retry_delay=config.get('retry_delay', 1.0),
+        max_instances_per_request=config.get('max_instances_per_request', 5),
+        confidence_method=config.get('confidence_method', 'verbalized'),
+        consistency_samples=config.get('consistency_samples', 3),
+    )
+    # Use mock implementation for testing or when endpoint is not available
+    if config.get('use_mock', False) or not llm_config.endpoint_url:
+        return MockLLMActiveLearning(llm_config)
+    else:
+        return LLMActiveLearning(llm_config)
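
Review note on `calculate_confidence_distribution`: `np.histogram` treats every bin as half-open except the last, which includes its right edge, so a score of exactly 1.0 is counted in `0.8-1.0` and a score of exactly 0.6 in `0.6-0.8`. The sketch below reproduces that binning with the standard library only, as an illustration of the expected output shape. The names `BINS` and `confidence_distribution` are hypothetical and not part of this commit.

```python
from bisect import bisect_right
from typing import Dict, List

# Same bin edges as the method under review
BINS = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]


def confidence_distribution(scores: List[float]) -> Dict[str, float]:
    """Return the percentage of scores falling into each confidence bin."""
    if not scores:
        return {}
    counts = [0] * (len(BINS) - 1)
    for score in scores:
        if score < BINS[0] or score > BINS[-1]:
            continue  # out-of-range values are not counted in any bin
        # bisect_right puts a value equal to a left edge into that bin;
        # clamping keeps a score of exactly 1.0 in the last bin.
        index = min(bisect_right(BINS, score) - 1, len(counts) - 1)
        counts[index] += 1
    total = len(scores)
    return {
        f"{BINS[i]:.1f}-{BINS[i + 1]:.1f}": 100.0 * count / total
        for i, count in enumerate(counts)
    }
```

Because the percentages are divided by the total number of scores, they sum to 100 only when every score lies inside the 0.0 to 1.0 range.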

potato/ai/ollama_endpoint.py ADDED

	@@ -0,0 +1,160 @@

+"""
+Ollama AI endpoint implementation.
+This module provides integration with Ollama for local LLM inference.
+"""
+import json
+from typing import Dict, List, Optional, Type
+import ollama
+from pydantic import BaseModel
+from .ai_endpoint import BaseAIEndpoint, AIEndpointRequestError, ModelCapabilities
+import re
+DEFAULT_MODEL = "llama3.2"
+class OllamaEndpoint(BaseAIEndpoint):
+    """Ollama endpoint for local LLM inference."""
+    # Capabilities declaration for text-based Ollama models
+    CAPABILITIES = ModelCapabilities(
+        text_generation=True,
+        vision_input=False,
+        bounding_box_output=False,
+        text_classification=True,
+        image_classification=False,
+        rationale_generation=True,
+        keyword_extraction=True,
+    )
+    def _initialize_client(self) -> None:
+        """Initialize the Ollama client."""
+        # Default timeout of 60 seconds for local inference (can be slower)
+        timeout = self.ai_config.get("timeout", 60)
+        host = self.ai_config.get("base_url", "http://localhost:11434")
+        # Create client with timeout
+        self.client = ollama.Client(host=host, timeout=timeout)
+        # Check if Ollama is available
+        try:
+            self.client.list()
+        except Exception as e:
+            raise AIEndpointRequestError(f"Failed to connect to Ollama: {e}")
+    def _get_default_model(self) -> str:
+        """Get the default Ollama model."""
+        return DEFAULT_MODEL
+    def query(self, prompt: str, output_format: Type[BaseModel]) -> Dict:
+        """
+        Send a query to Ollama and return the parsed response.
+        Args:
+            prompt: The prompt to send to the model
+            output_format: Pydantic model whose JSON schema constrains the output
+        Returns:
+            The model's response, parsed from JSON
+        Raises:
+            AIEndpointRequestError: If the request fails
+        """
+        import logging
+        logger = logging.getLogger(__name__)
+        try:
+            logger.debug(f"[Ollama] Querying model: {self.model}")
+            logger.debug(f"[Ollama] Prompt (first 200 chars): {prompt[:200]}...")
+            options = {
+                'temperature': self.temperature,
+                'num_predict': self.max_tokens
+            }
+            # Think mode: configurable via ai_config['think'], defaults to False
+            # for fast structured output (thinking wastes tokens on JSON tasks)
+            think = self.ai_config.get('think', False)
+            response = self.client.chat(
+                model=self.model,
+                messages=[{'role': 'user', 'content': prompt}],
+                options=options,
+                format=output_format.model_json_schema(),
+                think=think,
+            )
+            # Log full response structure for debugging
+            logger.debug(f"[Ollama] Response type: {type(response)}")
+            logger.debug(f"[Ollama] Full response: {response}")
+            # Get the message object - handle both dict and object access
+            message = response.get('message') if hasattr(response, 'get') else getattr(response, 'message', None)
+            if message is None:
+                raise AIEndpointRequestError("No message in Ollama response")
+            # Get content - handle both dict and object access
+            content = message.get('content') if hasattr(message, 'get') else getattr(message, 'content', None)
+            # Some models put response in 'thinking' field - check that too
+            if not content and hasattr(message, 'thinking') and message.thinking:
+                logger.warning("[Ollama] Content empty but thinking field has data - model may need think=False")
+                # Try to extract JSON from thinking field as fallback
+                thinking_text = message.thinking
+                # Look for JSON in the thinking text
+                json_match = re.search(r'\{[^{}]*\}', thinking_text)
+                if json_match:
+                    content = json_match.group(0)
+                    logger.debug(f"[Ollama] Extracted JSON from thinking: {content}")
+            logger.debug(f"[Ollama] Content type: {type(content)}")
+            logger.debug(f"[Ollama] Content value: {repr(content)[:200] if content else 'EMPTY'}")
+            # If content is already a dict (structured output), return it directly
+            if isinstance(content, dict):
+                logger.debug("[Ollama] Content is already a dict, returning directly")
+                return content
+            # Parse response using the base class's robust parser
+            # (handles truncated JSON, markdown blocks, plain text, etc.)
+            if content:
+                logger.debug(f"[Ollama] Response content (first 500 chars): {str(content)[:500]}")
+                return self.parseStringToJson(content)
+            else:
+                raise AIEndpointRequestError("Empty content from Ollama - try a different model or disable thinking mode")
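+        except AIEndpointRequestError:
+            # Already a descriptive endpoint error; re-raise without re-wrapping
+            raise
+        except Exception as e:
+            logger.error(f"[Ollama] Request failed: {e}")
+            raise AIEndpointRequestError(f"Ollama request failed: {e}")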
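
Review note on the `thinking` fallback in `query`: the pattern `\{[^{}]*\}` matches only objects that contain no inner braces, so for a nested object it returns the innermost one rather than the whole payload. The following is a minimal sketch of an alternative that scans for the first decodable object with `json.JSONDecoder.raw_decode`. The helper names `extract_flat` and `extract_first_json_object` are hypothetical and not part of this commit.

```python
import json
import re
from typing import Optional

# Same pattern as the fallback under review: no nested braces allowed
FLAT_JSON = re.compile(r'\{[^{}]*\}')


def extract_flat(text: str) -> Optional[str]:
    """Return the first brace-free JSON-looking span, or None."""
    match = FLAT_JSON.search(text)
    return match.group(0) if match else None


def extract_first_json_object(text: str) -> Optional[dict]:
    """Return the first substring that decodes as a JSON object, or None."""
    decoder = json.JSONDecoder()
    start = text.find('{')
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever text follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find('{', start + 1)
    return None
```

For a flat response such as `{"label": "positive"}` both helpers find the same object; they differ only when the model nests objects inside its answer.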
+    def chat_query(self, messages: List[Dict[str, str]]) -> str:
+        """Send a multi-turn chat to Ollama using native chat API."""
+        import logging
+        logger = logging.getLogger(__name__)
+        try:
+            options = {
+                'temperature': self.temperature,
+                'num_predict': self.max_tokens,
+            }
+            think = self.ai_config.get('think', False)
+            response = self.client.chat(
+                model=self.model,
+                messages=messages,
+                options=options,
+                think=think,
+            )
+            message = response.get('message') if hasattr(response, 'get') else getattr(response, 'message', None)
+            if message is None:
+                raise AIEndpointRequestError("No message in Ollama chat response")
+            content = message.get('content') if hasattr(message, 'get') else getattr(message, 'content', None)
+            return content or ""
+        except Exception as e:
+            logger.error(f"[Ollama] Chat request failed: {e}")
+            raise AIEndpointRequestError(f"Ollama chat request failed: {e}")
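
Review note: both `query` and `chat_query` repeat the same "dict or object" access pattern for `message` and `content`. One possible way to factor it out is sketched below. `get_field` and `extract_content` are hypothetical helpers, and `ValueError` stands in for `AIEndpointRequestError` so the example runs on its own.

```python
from types import SimpleNamespace
from typing import Any


def get_field(obj: Any, name: str, default: Any = None) -> Any:
    """Read `name` from a mapping-like object or from an attribute."""
    if hasattr(obj, 'get'):
        return obj.get(name, default)
    return getattr(obj, name, default)


def extract_content(response: Any) -> str:
    """Return the message content of a chat response, or '' if it is empty."""
    message = get_field(response, 'message')
    if message is None:
        raise ValueError("No message in response")
    return get_field(message, 'content') or ""


# Both response shapes yield the same content
dict_response = {'message': {'content': 'hi'}}
obj_response = SimpleNamespace(message=SimpleNamespace(content='hi'))
```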

potato/ai/ollama_vision_endpoint.py ADDED

	@@ -0,0 +1,313 @@

+"""
+Ollama Vision AI Endpoint
+This module provides integration with Ollama vision models for local
+visual AI inference. Supports LLaVA, Llama 3.2 Vision, BakLLaVA, and Qwen-VL models.
+"""
+import base64
+import json
+import logging
+from typing import Any, Dict, List, Type, Union
+from pydantic import BaseModel
+from .ai_endpoint import AIEndpointRequestError, ImageData, ModelCapabilities
+from .visual_ai_endpoint import BaseVisualAIEndpoint
+logger = logging.getLogger(__name__)
+# Default vision model
+DEFAULT_MODEL = "llava:latest"
+# Models known to support vision (used only for warning suppression, not to restrict usage)
+# Any Ollama model can be used - this list just prevents "may not support vision" warnings
+# for models we know are vision-capable
+VISION_MODELS = [
+    # LLaVA family
+    "llava",
+    "llava-llama3",
+    "llava-phi3",
+    "bakllava",
+    # Llama Vision
+    "llama3.2-vision",
+    # Qwen Vision-Language
+    "qwen2.5-vl",
+    "qwen2-vl",
+    "qwen3-vl",
+    # Other vision models
+    "moondream",
+    "minicpm-v",
+    "gemma3",  # Gemma 3 has vision capabilities
+    "gemma4",  # Gemma 4 family (e.g. gemma4:e4b) supports vision
+]
+class OllamaVisionEndpoint(BaseVisualAIEndpoint):
+    """
+    Ollama Vision endpoint for multimodal local inference.
+    Supports vision-capable models like LLaVA, Llama 3.2 Vision, BakLLaVA.
+    Images are sent as base64 in the 'images' field.
+    Configuration options:
+    - model: Vision model to use (default: llava:latest)
+    - base_url: Ollama server URL (default: http://localhost:11434)
+    - timeout: Request timeout in seconds (default: 120)
+    - max_tokens: Maximum response tokens (default: 500)
+    - temperature: Sampling temperature (default: 0.1)
+    """
+    # Capabilities declaration for vision-capable Ollama models (LLaVA, Qwen-VL, etc.)
+    # Note: VLMs (vision-language models) can generate text about images but cannot do precise bounding box detection
+    # Keyword extraction is disabled because it doesn't apply to image content
+    CAPABILITIES = ModelCapabilities(
+        text_generation=True,
+        vision_input=True,
+        bounding_box_output=False,  # VLMs are not reliable for precise bbox coordinates
+        text_classification=True,
+        image_classification=True,
+        rationale_generation=True,
+        keyword_extraction=False,  # Keywords don't apply to images
+    )
+    def _initialize_client(self) -> None:
+        """Initialize the Ollama client."""
+        try:
+            import ollama
+        except ImportError:
+            raise AIEndpointRequestError(
+                "ollama package is required. Install it with: pip install ollama"
+            )
+        timeout = self.ai_config.get("timeout", 120)  # Vision models can be slower
+        host = self.ai_config.get("base_url", "http://localhost:11434")
+        self.client = ollama.Client(host=host, timeout=timeout)
+        # Verify that the Ollama server is reachable
+        try:
+            self.client.list()
+            logger.info(f"Connected to Ollama at {host}")
+            # Check if the specified model is a known vision model
+            # This is just informational - any model can be used
+            model_lower = self.model.lower()
+            is_known_vision_model = any(vm in model_lower for vm in VISION_MODELS)
+            if not is_known_vision_model:
+                logger.info(
+                    f"Model '{self.model}' not in known vision models list. "
+                    f"This is fine if it supports vision - proceeding anyway."
+                )
+        except Exception as e:
+            raise AIEndpointRequestError(f"Failed to connect to Ollama: {e}")
+    def _get_default_model(self) -> str:
+        """Get the default vision model."""
+        return DEFAULT_MODEL
+    def query(self, prompt: str, output_format: Type[BaseModel]) -> Any:
+        """
+        Standard text query (falls back to text-only mode).
+        For vision tasks, use query_with_image() instead.
+        """
+        try:
+            options = {
+                'temperature': self.temperature,
+                'num_predict': self.max_tokens
+            }
+            response = self.client.chat(
+                model=self.model,
+                messages=[{'role': 'user', 'content': prompt}],
+                options=options,
+                format=output_format.model_json_schema(),
+                think=False,
+            )
+            message = response.get('message') if hasattr(response, 'get') else getattr(response, 'message', None)
+            if message is None:
+                raise AIEndpointRequestError("No message in Ollama response")
+            content = message.get('content') if hasattr(message, 'get') else getattr(message, 'content', None)
+            if isinstance(content, dict):
+                return content
+            if content:
+                return self.parseStringToJson(content)
+            else:
+                raise AIEndpointRequestError("Empty content from Ollama")
+        except Exception as e:
+            raise AIEndpointRequestError(f"Ollama query failed: {e}")
+    def query_with_image(
+        self,
+        prompt: str,
+        image_data: Union[ImageData, List[ImageData]],
+        output_format: Type[BaseModel]
+    ) -> Any:
+        """
+        Send a query with image(s) to Ollama vision model.
+        Args:
+            prompt: Text prompt describing what to analyze
+            image_data: Single ImageData or list of ImageData
+            output_format: Pydantic model for structured output
+        Returns:
+            Parsed response according to output_format
+        Raises:
+            AIEndpointRequestError: If the request fails
+        """
+        try:
+            # Prepare images
+            images = [image_data] if isinstance(image_data, ImageData) else image_data
+            # Convert to base64 if needed
+            image_base64_list = []
+            for img in images:
+                b64_data = self._get_base64_image(img)
+                image_base64_list.append(b64_data)
+            # Build message with images
+            options = {
+                'temperature': self.temperature,
+                'num_predict': self.max_tokens
+            }
+            # Ollama expects images as a list of base64 strings
+            response = self.client.chat(
+                model=self.model,
+                messages=[{
+                    'role': 'user',
+                    'content': prompt,
+                    'images': image_base64_list
+                }],
+                options=options,
+                format=output_format.model_json_schema(),
+                think=False,
+            )
+            logger.debug(f"Ollama vision response type: {type(response)}")
+            # Extract content from response
+            message = response.get('message') if hasattr(response, 'get') else getattr(response, 'message', None)
+            if message is None:
+                raise AIEndpointRequestError("No message in Ollama vision response")
+            content = message.get('content') if hasattr(message, 'get') else getattr(message, 'content', None)
+            logger.debug(f"Ollama vision content type: {type(content)}")
+            # Parse response
+            if isinstance(content, dict):
+                return content
+            if content:
+                return self.parseStringToJson(content)
+            else:
+                raise AIEndpointRequestError("Empty content from Ollama vision model")
+        except AIEndpointRequestError:
+            raise
+        except Exception as e:
+            logger.error(f"Ollama vision query failed: {e}")
+            import traceback
+            logger.error(traceback.format_exc())
+            raise AIEndpointRequestError(f"Ollama vision query failed: {e}")
+    def _get_base64_image(self, image_data: ImageData) -> str:
+        """
+        Get base64-encoded image data.
+        Args:
+            image_data: ImageData object
+        Returns:
+            Base64-encoded image string (without data URL prefix)
+        """
+        if image_data.source == "base64":
+            # Already base64, just return the data
+            return image_data.data
+        elif image_data.source == "url":
+            # Download and convert to base64
+            downloaded = self.download_image_to_base64(image_data.data)
+            return downloaded.data
+        else:
+            raise AIEndpointRequestError(f"Unknown image source: {image_data.source}")
+    def analyze_image(
+        self,
+        image_path_or_url: str,
+        prompt: str,
+        output_format: Type[BaseModel] = None
+    ) -> Any:
+        """
+        Convenience method for analyzing a single image.
+        Args:
+            image_path_or_url: Path to image file or URL
+            prompt: Analysis prompt
+            output_format: Optional output format model
+        Returns:
+            Analysis result
+        """
+        # Prepare image data
+        if image_path_or_url.startswith(("http://", "https://")):
+            image_data = self.download_image_to_base64(image_path_or_url)
+        else:
+            image_data = self.encode_image_to_base64(image_path_or_url)
+        # Use a generic format if not specified
+        if output_format is None:
+            from .prompt.models_module import GeneralHintFormat
+            output_format = GeneralHintFormat
+        return self.query_with_image(prompt, image_data, output_format)
+    def describe_image(self, image_path_or_url: str) -> str:
+        """
+        Get a natural language description of an image.
+        Args:
+            image_path_or_url: Path to image file or URL
+        Returns:
+            Text description of the image
+        """
+        # Use a simple model that returns text
+        class DescriptionFormat(BaseModel):
+            description: str
+        result = self.analyze_image(
+            image_path_or_url,
+            "Describe this image in detail. What objects, people, or scenes do you see?",
+            DescriptionFormat
+        )
+        if isinstance(result, dict) and "description" in result:
+            return result["description"]
+        return str(result)
+    def health_check(self) -> bool:
+        """
+        Check whether the Ollama server is reachable.
+        This only lists models; it does not confirm that the configured
+        vision model is installed.
+        Returns:
+            True if the server responds, False otherwise
+        """
+        try:
+            # Try to list models
+            self.client.list()
+            return True
+        except Exception as e:
+            logger.error(f"Ollama vision health check failed: {e}")
+            return False
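
Review note on the known-model check in `_initialize_client`: it is a case-insensitive substring match, so tagged names such as `llava:latest` are recognized, while a text-only name such as `llama3.2` is not. A small standalone sketch of that behavior follows. `KNOWN_VISION_MODELS` is a shortened copy of the commit's list and `is_known_vision_model` is a hypothetical helper name.

```python
from typing import Iterable

# Shortened, illustrative copy of the VISION_MODELS list
KNOWN_VISION_MODELS = (
    "llava",
    "bakllava",
    "llama3.2-vision",
    "qwen2.5-vl",
    "moondream",
    "minicpm-v",
    "gemma3",
)


def is_known_vision_model(
    model: str, known: Iterable[str] = KNOWN_VISION_MODELS
) -> bool:
    """Return True if any known family name occurs inside the model name."""
    model_lower = model.lower()
    return any(name in model_lower for name in known)
```

Substring matching keeps the list short, at the cost of also matching any custom model whose name happens to contain one of the family names.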

potato/ai/openai_endpoint.py ADDED

	@@ -0,0 +1,94 @@

+"""
+OpenAI AI endpoint implementation.
+This module provides integration with OpenAI's API for LLM inference.
+"""
+import os
+from typing import Dict, List, Type
+from openai import OpenAI
+from pydantic import BaseModel
+from .ai_endpoint import BaseAIEndpoint, AIEndpointRequestError, ModelCapabilities
+DEFAULT_MODEL = "gpt-4o-mini"
+class OpenAIEndpoint(BaseAIEndpoint):
+    """OpenAI endpoint for cloud-based LLM inference."""
+    # Capabilities declaration for text-based OpenAI models
+    CAPABILITIES = ModelCapabilities(
+        text_generation=True,
+        vision_input=False,
+        bounding_box_output=False,
+        text_classification=True,
+        image_classification=False,
+        rationale_generation=True,
+        keyword_extraction=True,
+    )
+    def _initialize_client(self) -> None:
+        """Initialize the OpenAI client."""
+        # OpenAI-compatible servers (vLLM, llama.cpp, etc.) ignore the key
+        # but the SDK rejects an empty string, so accept a placeholder.
+        api_key = self.ai_config.get("api_key") or os.environ.get(
+            "OPENAI_API_KEY", ""
+        )
+        base_url = self.ai_config.get("base_url")
+        if not api_key:
+            if base_url:
+                api_key = "EMPTY"  # non-empty placeholder for local servers
+            else:
+                raise AIEndpointRequestError("OpenAI API key is required")
+        # Default timeout of 30 seconds, configurable via ai_config
+        timeout = self.ai_config.get("timeout", 30)
+        client_kwargs = {"api_key": api_key, "timeout": timeout}
+        # Honor a custom base_url so this endpoint can target any
+        # OpenAI-compatible server (previously ignored -> always hit
+        # api.openai.com even when a local base_url was configured).
+        if base_url:
+            client_kwargs["base_url"] = base_url
+        self.client = OpenAI(**client_kwargs)
+    def _get_default_model(self) -> str:
+        """Get the default OpenAI model."""
+        return DEFAULT_MODEL
+    def query(self, prompt: str, output_format: "Type[BaseModel]") -> str:
+        """
+        Send a query to OpenAI and return the response.
+        Args:
+            prompt: The prompt to send to the model
+            output_format: Pydantic model class whose JSON schema constrains the output
+        Returns:
+            The model's response as a string
+        Raises:
+            AIEndpointRequestError: If the request fails
+        """
+        try:
+            response = self.client.chat.completions.create(
+                model=self.model,
+                messages=[{"role": "user", "content": prompt}],
+                max_tokens=self.max_tokens,
+                temperature=self.temperature,
+                # Structured output: pass the Pydantic JSON schema via response_format
+                response_format={
+                    "type": "json_schema",
+                    "json_schema": {
+                        "name": output_format.__name__,
+                        "schema": output_format.model_json_schema(),
+                    },
+                },
+            )
+            return response.choices[0].message.content
+        except Exception as e:
+            raise AIEndpointRequestError(f"OpenAI request failed: {e}")
+    def chat_query(self, messages: List[Dict[str, str]]) -> str:
+        """Send a multi-turn chat to OpenAI using native messages API."""
+        try:
+            response = self.client.chat.completions.create(
+                model=self.model,
+                messages=messages,
+                max_tokens=self.max_tokens,
+                temperature=self.temperature,
+            )
+            return response.choices[0].message.content
+        except Exception as e:
+            raise AIEndpointRequestError(f"OpenAI chat request failed: {e}")
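
Review note on `_initialize_client`: the key lookup follows a fixed precedence (config value, then the `OPENAI_API_KEY` environment variable, then the `"EMPTY"` placeholder when a `base_url` is set, otherwise an error). The sketch below isolates that order so it can be checked without constructing a client. `resolve_api_key` is a hypothetical helper, and `ValueError` stands in for `AIEndpointRequestError`.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    ai_config: Mapping[str, object],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the API key using the same precedence as the endpoint."""
    env = os.environ if env is None else env
    api_key = ai_config.get("api_key") or env.get("OPENAI_API_KEY", "")
    if api_key:
        return str(api_key)
    if ai_config.get("base_url"):
        # Local OpenAI-compatible servers ignore the key, but it must be non-empty
        return "EMPTY"
    raise ValueError("OpenAI API key is required")
```

Passing `env` explicitly is only a testing convenience here; by default the helper reads the process environment.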

potato/ai/openai_vision_endpoint.py ADDED

	@@ -0,0 +1,324 @@

+"""
+OpenAI Vision AI Endpoint
+This module provides integration with OpenAI's vision models (GPT-4o, GPT-4o-mini)
+for visual analysis and annotation assistance.
+"""
+import base64
+import logging
+from typing import Any, Dict, List, Type, Union
+from pydantic import BaseModel
+from .ai_endpoint import AIEndpointRequestError, ImageData, ModelCapabilities
+from .visual_ai_endpoint import BaseVisualAIEndpoint
+logger = logging.getLogger(__name__)
+DEFAULT_MODEL = "gpt-4o"
+class OpenAIVisionEndpoint(BaseVisualAIEndpoint):
+    """
+    OpenAI Vision endpoint for GPT-4o and GPT-4o-mini vision capabilities.
+    Supports both URL and base64 image inputs using the image_url content type.
+    Configuration options:
+    - model: Model to use (gpt-4o, gpt-4o-mini) (default: gpt-4o)
+    - api_key: OpenAI API key (can also use OPENAI_API_KEY env var)
+    - max_tokens: Maximum response tokens (default: 1000)
+    - temperature: Sampling temperature (default: 0.1)
+    - detail: Image detail level - 'low', 'high', or 'auto' (default: auto)
+    """
+    # Capabilities declaration for OpenAI vision models (GPT-4o, GPT-4o-mini)
+    # These models can understand images and generate text but bounding boxes are approximate
+    CAPABILITIES = ModelCapabilities(
+        text_generation=True,
+        vision_input=True,
+        bounding_box_output=False,  # GPT-4o bboxes are approximate, not precise
+        text_classification=True,
+        image_classification=True,
+        rationale_generation=True,
+        keyword_extraction=False,  # Keywords don't apply to images
+    )
+    def _initialize_client(self) -> None:
+        """Initialize the OpenAI client."""
+        try:
+            import openai
+        except ImportError:
+            raise AIEndpointRequestError(
+                "openai package is required. Install it with: pip install openai"
+            )
+        import os
+        api_key = self.ai_config.get("api_key") or os.environ.get("OPENAI_API_KEY")
+        if not api_key:
+            raise AIEndpointRequestError(
+                "OpenAI API key is required. Set it in config or OPENAI_API_KEY env var."
+            )
+        timeout = self.ai_config.get("timeout", 60)
+        self.detail = self.ai_config.get("detail", "auto")
+        self.client = openai.OpenAI(api_key=api_key, timeout=timeout)
+        logger.info(f"OpenAI Vision client initialized with model: {self.model}")
+    def _get_default_model(self) -> str:
+        """Get the default OpenAI vision model."""
+        return DEFAULT_MODEL
+    def query(self, prompt: str, output_format: Type[BaseModel]) -> Any:
+        """
+        Standard text query without images.
+        Args:
+            prompt: Text prompt
+            output_format: Pydantic model for structured output
+        Returns:
+            Parsed response
+        """
+        try:
+            response = self.client.chat.completions.create(
+                model=self.model,
+                messages=[{"role": "user", "content": prompt}],
+                max_tokens=self.max_tokens,
+                temperature=self.temperature,
+                response_format={"type": "json_object"},
+            )
+            content = response.choices[0].message.content
+            return self.parseStringToJson(content)
+        except Exception as e:
+            raise AIEndpointRequestError(f"OpenAI query failed: {e}")
+    def query_with_image(
+        self,
+        prompt: str,
+        image_data: Union[ImageData, List[ImageData]],
+        output_format: Type[BaseModel]
+    ) -> Any:
+        """
+        Send a query with image(s) to OpenAI vision model.
+        Args:
+            prompt: Text prompt describing what to analyze
+            image_data: Single ImageData or list of ImageData
+            output_format: Pydantic model for structured output
+        Returns:
+            Parsed response according to output_format
+        Raises:
+            AIEndpointRequestError: If the request fails
+        """
+        try:
+            # Prepare images
+            images = [image_data] if isinstance(image_data, ImageData) else image_data
+            # Build content array with text and images
+            content = [{"type": "text", "text": prompt}]
+            for img in images:
+                image_content = self._build_image_content(img)
+                content.append(image_content)
+            # Make request
+            response = self.client.chat.completions.create(
+                model=self.model,
+                messages=[{"role": "user", "content": content}],
+                max_tokens=self.max_tokens,
+                temperature=self.temperature,
+                response_format={"type": "json_object"},
+            )
+            response_content = response.choices[0].message.content
+            logger.debug(f"OpenAI vision response: {response_content[:500] if response_content else 'empty'}")
+            return self.parseStringToJson(response_content)
+        except AIEndpointRequestError:
+            raise
+        except Exception as e:
+            logger.error(f"OpenAI vision query failed: {e}")
+            import traceback
+            logger.error(traceback.format_exc())
+            raise AIEndpointRequestError(f"OpenAI vision query failed: {e}")
+    def _build_image_content(self, image_data: ImageData) -> Dict[str, Any]:
+        """
+        Build image content block for OpenAI API.
+        Args:
+            image_data: ImageData object
+        Returns:
+            Dict with type: "image_url" and image_url content
+        """
+        if image_data.source == "url":
+            # Direct URL reference
+            return {
+                "type": "image_url",
+                "image_url": {
+                    "url": image_data.data,
+                    "detail": self.detail
+                }
+            }
+        elif image_data.source == "base64":
+            # Data URL format
+            mime_type = image_data.mime_type or "image/jpeg"
+            data_url = f"data:{mime_type};base64,{image_data.data}"
+            return {
+                "type": "image_url",
+                "image_url": {
+                    "url": data_url,
+                    "detail": self.detail
+                }
+            }
+        else:
+            raise AIEndpointRequestError(f"Unknown image source: {image_data.source}")
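
Review note on `_build_image_content`: for base64 input, the image travels as a data URL inside an `image_url` block. The standalone sketch below shows the resulting structure for raw bytes. `image_block_from_bytes` is a hypothetical helper that combines the encoding and block-building steps, which the commit keeps in separate methods.

```python
import base64
from typing import Any, Dict


def image_block_from_bytes(
    raw: bytes, mime_type: str = "image/jpeg", detail: str = "auto"
) -> Dict[str, Any]:
    """Build an image_url content block carrying the image as a data URL."""
    encoded = base64.b64encode(raw).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {
            "url": f"data:{mime_type};base64,{encoded}",
            "detail": detail,
        },
    }


# A user message mixes one text block with any number of image blocks
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image."},
        image_block_from_bytes(b"abc", mime_type="image/png"),
    ],
}
```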
+    def analyze_image(
+        self,
+        image_path_or_url: str,
+        prompt: str,
+        output_format: Type[BaseModel] = None
+    ) -> Any:
+        """
+        Convenience method for analyzing a single image.
+        Args:
+            image_path_or_url: Path to image file or URL
+            prompt: Analysis prompt
+            output_format: Optional output format model
+        Returns:
+            Analysis result
+        """
+        # Prepare image data - use URL directly if possible
+        if image_path_or_url.startswith(("http://", "https://")):
+            image_data = self.create_url_image_data(image_path_or_url)
+        else:
+            image_data = self.encode_image_to_base64(image_path_or_url)
+        # Use a generic format if not specified
+        if output_format is None:
+            from .prompt.models_module import GeneralHintFormat
+            output_format = GeneralHintFormat
+        return self.query_with_image(prompt, image_data, output_format)
+    def detect_objects(
+        self,
+        image_path_or_url: str,
+        labels: List[str] = None
+    ) -> Dict[str, Any]:
+        """
+        Detect objects in an image and return bounding boxes.
+        Args:
+            image_path_or_url: Path to image file or URL
+            labels: Optional list of labels to detect
+        Returns:
+            Dict with detections list
+        """
+        from .prompt.models_module import VisualDetectionFormat
+        labels_str = ", ".join(labels) if labels else "all visible objects"
+        prompt = f"""Analyze this image and detect objects. For each object, provide:
+1. The label (from: {labels_str})
+2. A bounding box with normalized coordinates (0-1 range)
+3. Confidence score (0-1)
+Return JSON with this structure:
+{{
+    "detections": [
+        {{
+            "label": "object_name",
+            "bbox": {{"x": 0.1, "y": 0.2, "width": 0.3, "height": 0.4}},
+            "confidence": 0.95
+        }}
+    ]
+}}
+Coordinates are normalized (0-1) where x,y is the top-left corner.
+Only include objects you can clearly identify with confidence > 0.5."""
+        # Prepare image
+        if image_path_or_url.startswith(("http://", "https://")):
+            image_data = self.create_url_image_data(image_path_or_url)
+        else:
+            image_data = self.encode_image_to_base64(image_path_or_url)
+        return self.query_with_image(prompt, image_data, VisualDetectionFormat)
+    def describe_region(
+        self,
+        image_path_or_url: str,
+        region: Dict[str, float],
+        labels: List[str] = None
+    ) -> Dict[str, Any]:
+        """
+        Describe or classify a specific region in an image.
+        Args:
+            image_path_or_url: Path to image file or URL
+            region: Dict with x, y, width, height (normalized 0-1)
+            labels: Optional list of possible labels
+        Returns:
+            Classification result with suggested label and confidence
+        """
+        labels_str = ", ".join(labels) if labels else "any appropriate category"
+        prompt = f"""Look at the region marked in this image:
+- Region: x={region['x']:.2f}, y={region['y']:.2f}, width={region['width']:.2f}, height={region['height']:.2f}
+(Coordinates are normalized 0-1, where 0,0 is top-left)
+Classify what you see in this region from these options: {labels_str}
+Return JSON:
+{{
+    "suggested_label": "label_name",
+    "confidence": 0.85,
+    "reasoning": "Brief explanation"
+}}"""
+        # Prepare image
+        if image_path_or_url.startswith(("http://", "https://")):
+            image_data = self.create_url_image_data(image_path_or_url)
+        else:
+            image_data = self.encode_image_to_base64(image_path_or_url)
+        class RegionClassificationFormat(BaseModel):
+            suggested_label: str
+            confidence: float
+            reasoning: str
+        return self.query_with_image(prompt, image_data, RegionClassificationFormat)
+    def health_check(self) -> bool:
+        """
+        Check if the OpenAI API is accessible.
+        Returns:
+            True if API is reachable, False otherwise
+        """
+        try:
+            # Simple models list check
+            self.client.models.list()
+            return True
+        except Exception as e:
+            logger.error(f"OpenAI health check failed: {e}")
+            return False
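The `_build_image_content` method above wraps base64 image data in a data URL before sending it as an `image_url` block. A minimal standalone sketch of that step follows; `build_image_block` is a hypothetical helper name, not part of the diff, and the `"auto"` default for `detail` is an assumption.

```python
import base64


def build_image_block(raw: bytes, mime_type: str = "image/jpeg", detail: str = "auto") -> dict:
    """Wrap raw image bytes in an image_url content block (hypothetical helper)."""
    # Base64-encode the bytes and embed them in a data URL
    encoded = base64.b64encode(raw).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {
            "url": f"data:{mime_type};base64,{encoded}",
            "detail": detail,
        },
    }


block = build_image_block(b"\x89PNG", mime_type="image/png")
```

The block has the same shape as the dict returned for the `base64` source above, so it could be placed in a message's content list alongside a text block.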

potato/ai/openrouter_endpoint.py ADDED

	@@ -0,0 +1,93 @@

+"""
+OpenRouter endpoint implementation.
+This module provides integration with OpenRouter's API for LLM inference.
+"""
+import requests
+from .ai_endpoint import BaseAIEndpoint, AIEndpointRequestError
+DEFAULT_MODEL = "openai/gpt-4o-mini"
+class OpenRouterEndpoint(BaseAIEndpoint):
+    """OpenRouter endpoint for cloud-based LLM inference."""
+    # Models that support structured output
+    STRUCTURED_OUTPUT_MODELS = {
+        "openai/gpt-4o",
+        "openai/gpt-4o-mini",
+        "openai/gpt-4-turbo",
+        "anthropic/claude-3-5-sonnet",
+        "deepseek/deepseek-r1:free"
+    }
+    def _initialize_client(self) -> None:
+        """Validate that an OpenRouter API key is configured."""
+        api_key = self.ai_config.get("api_key", "")
+        if not api_key:
+            raise AIEndpointRequestError("OpenRouter API key is required")
+    def _get_default_model(self) -> str:
+        """Get the default OpenRouter model."""
+        return DEFAULT_MODEL
+    def supports_structured_output(self) -> bool:
+        """Check if the current model supports structured output."""
+        model = self.model or DEFAULT_MODEL
+        # Match the listed models and their versioned variants (e.g. a dated suffix),
+        # rather than every model from the same provider.
+        return any(model.startswith(name) for name in self.STRUCTURED_OUTPUT_MODELS)
+    def query(self, prompt: str, output_format: type) -> str:
+        """
+        Send a query to OpenRouter and return the response.
+        Args:
+            prompt: The prompt text, sent as a single user message
+            output_format: Pydantic model class whose JSON schema is requested as structured output
+        Returns:
+            The parsed JSON response if the model supports structured output,
+            otherwise the raw message content as a string
+        Raises:
+            AIEndpointRequestError: If the request fails
+        """
+        try:
+            url = "https://openrouter.ai/api/v1/chat/completions"
+            headers = {
+                "Authorization": f"Bearer {self.ai_config.get('api_key')}",
+                "Content-Type": "application/json"
+            }
+            messages = [{"role": "user", "content": prompt}]
+            schema = output_format.model_json_schema()
+            body = {
+                "model": self.model or DEFAULT_MODEL,
+                "max_tokens": self.max_tokens,
+                "temperature": self.temperature,
+            }
+            body["messages"] = messages
+            # Request schema-constrained JSON only when the model supports it;
+            # otherwise the raw prompt is sent on its own
+            if self.supports_structured_output():
+                body["response_format"] = {
+                    "type": "json_schema",
+                    "json_schema": {
+                        "name": "response",
+                        "schema": schema,
+                        "strict": True
+                    }
+                }
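+            # Without a timeout, requests can block indefinitely; 60 seconds is an assumed default
+            r = requests.post(url, headers=headers, json=body, timeout=60)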
+            if r.status_code >= 400:
+                raise AIEndpointRequestError(f"OpenRouter error {r.status_code}: {r.text}")
+            data = r.json()
+            if self.supports_structured_output():
+                return self.parseStringToJson(data["choices"][0]["message"]["content"])
+            return data["choices"][0]["message"]["content"]
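+        except AIEndpointRequestError:
+            # Already an endpoint error (e.g. from the status check above); do not wrap it twice
+            raise
+        except Exception as e:
+            raise AIEndpointRequestError(f"OpenRouter request failed: {e}") from e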
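The `query` method above assembles a chat-completions body and attaches a `json_schema` response format only when the model supports it. The sketch below isolates that body construction so it can be inspected without a network call; `build_body` and its default parameter values are hypothetical, not part of the diff.

```python
from typing import Any, Dict, Optional


def build_body(
    model: str,
    prompt: str,
    schema: Optional[Dict[str, Any]] = None,
    max_tokens: int = 256,
    temperature: float = 0.0,
) -> Dict[str, Any]:
    """Build a chat-completions request body (hypothetical helper)."""
    body: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    # Attach the schema only when structured output is wanted
    if schema is not None:
        body["response_format"] = {
            "type": "json_schema",
            "json_schema": {"name": "response", "schema": schema, "strict": True},
        }
    return body


hint_schema = {"type": "object", "properties": {"hint": {"type": "string"}}}
structured = build_body("openai/gpt-4o-mini", "Give a hint", schema=hint_schema)
plain = build_body("openai/gpt-4o-mini", "Give a hint")
```

Keeping body construction separate from the HTTP call makes the structured and unstructured paths easy to compare in a unit test.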

potato/ai/prompt/image_annotation.json ADDED

	@@ -0,0 +1,44 @@

+{
+  "detect": {
+    "prompt": "TASK: Detect objects in this image that match the specified labels.\n\nDescription: ${description}\nLabels to detect: ${labels}\nMinimum confidence: ${confidence_threshold}\n\nFor each detected object, provide:\n1. The label (must be from the provided labels list)\n2. Bounding box with normalized coordinates (0-1 range)\n3. Confidence score (0-1)\n\nCoordinate system:\n- x,y is the top-left corner of the box\n- x increases left to right (0 = left edge, 1 = right edge)\n- y increases top to bottom (0 = top edge, 1 = bottom edge)\n- width and height are also normalized (0-1)\n\nReturn JSON:\n{\n    \"detections\": [\n        {\n            \"label\": \"object_name\",\n            \"bbox\": {\"x\": 0.1, \"y\": 0.2, \"width\": 0.3, \"height\": 0.4},\n            \"confidence\": 0.95\n        }\n    ]\n}\n\nOnly include objects you can clearly identify with confidence >= ${confidence_threshold}.",
+    "output_format": "visual_detection",
+    "img": "/static/ai_assistant_img/detect.svg",
+    "name": "Detect"
+  },
+  "detection": {
+    "prompt": "TASK: Detect objects in this image that match the specified labels.\n\nDescription: ${description}\nLabels to detect: ${labels}\nMinimum confidence: ${confidence_threshold}\n\nFor each detected object, provide:\n1. The label (must be from the provided labels list)\n2. Bounding box with normalized coordinates (0-1 range)\n3. Confidence score (0-1)\n\nCoordinate system:\n- x,y is the top-left corner of the box\n- x increases left to right (0 = left edge, 1 = right edge)\n- y increases top to bottom (0 = top edge, 1 = bottom edge)\n- width and height are also normalized (0-1)\n\nReturn JSON:\n{\n    \"detections\": [\n        {\n            \"label\": \"object_name\",\n            \"bbox\": {\"x\": 0.1, \"y\": 0.2, \"width\": 0.3, \"height\": 0.4},\n            \"confidence\": 0.95\n        }\n    ]\n}\n\nOnly include objects you can clearly identify with confidence >= ${confidence_threshold}.",
+    "output_format": "visual_detection",
+    "img": "/static/ai_assistant_img/detect.svg",
+    "name": "Detect"
+  },
+  "pre_annotate": {
+    "prompt": "TASK: Pre-annotate this image by detecting all objects that match the available labels.\n\nDescription: ${description}\nAvailable labels: ${labels}\n\nDetect ALL instances of objects matching these labels. Be thorough - it's better to include uncertain detections (the human annotator will review them).\n\nFor each detection provide:\n1. Label from the available list\n2. Bounding box (normalized 0-1 coordinates)\n3. Confidence score\n\nReturn JSON:\n{\n    \"detections\": [\n        {\n            \"label\": \"label_name\",\n            \"bbox\": {\"x\": 0.0, \"y\": 0.0, \"width\": 0.0, \"height\": 0.0},\n            \"confidence\": 0.0\n        }\n    ]\n}\n\nInclude all potential detections, even with lower confidence. The annotator will verify.",
+    "output_format": "visual_detection",
+    "img": "/static/ai_assistant_img/auto.svg",
+    "name": "Auto"
+  },
+  "classification": {
+    "prompt": "TASK: Classify the specified region in this image.\n\nDescription: ${description}\nRegion: ${region}\nAvailable labels: ${labels}\n\nLook at the indicated region and determine which label best describes what you see there.\n\nReturn JSON:\n{\n    \"suggested_label\": \"label_name\",\n    \"confidence\": 0.85,\n    \"reasoning\": \"Brief explanation of why this label fits\"\n}",
+    "output_format": "visual_classification",
+    "img": "/static/ai_assistant_img/classify.svg",
+    "name": "Classify"
+  },
+  "hint": {
+    "prompt": "TASK: Provide a helpful hint for annotating this image WITHOUT revealing exact answers.\n\nAnnotation task: ${description}\nAvailable labels: ${labels}\n\nProvide guidance that helps the annotator without giving away:\n- Exact object locations\n- Specific label assignments\n\nGood hints:\n- Point out relevant visual features\n- Suggest areas that deserve attention\n- Note potential challenges or ambiguities\n- Remind about edge cases\n\nReturn JSON:\n{\n    \"hint\": \"Your helpful guidance here\",\n    \"suggestive_choice\": \"optional_focus_area\"\n}",
+    "output_format": "default_hint",
+    "img": "/static/ai_assistant_img/blub.svg",
+    "name": "Hint"
+  },
+  "keyword": {
+    "prompt": "TASK: Identify visual keywords/features associated with each label in this image.\n\nAnnotation task: ${description}\nAvailable labels: ${labels}\n\nFor each label, identify visual cues or features that would indicate its presence.\n\nReturn JSON:\n{\n    \"label_keywords\": [\n        {\n            \"label\": \"label_name\",\n            \"keywords\": [\"visual_feature_1\", \"visual_feature_2\"]\n        }\n    ]\n}",
+    "output_format": "default_keyword",
+    "img": "/static/ai_assistant_img/highlight.svg",
+    "name": "Keywords"
+  },
+  "rationale": {
+    "prompt": "TASK: Provide rationale for how each label might apply to this image.\n\nAnnotation task: ${description}\nAvailable labels: ${labels}\n\nFor each label, explain what evidence in the image supports or contradicts its application.\n\nReturn JSON:\n{\n    \"rationales\": [\n        {\n            \"label\": \"label_name\",\n            \"reasoning\": \"Explanation of visual evidence for/against this label\"\n        }\n    ]\n}",
+    "output_format": "default_rationale",
+    "img": "/static/ai_assistant_img/question.svg",
+    "name": "Rationale"
+  }
+}
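The prompts above use `${description}`, `${labels}` and `${confidence_threshold}` placeholders. This diff does not show how they are filled in; as one possible approach, the sketch below assumes Python's `string.Template`, whose `${name}` syntax matches and which leaves the literal JSON braces in the prompt untouched. `render_prompt` is a hypothetical helper.

```python
from string import Template


def render_prompt(template: str, **values: str) -> str:
    """Fill ${name} placeholders; unknown placeholders are left as-is (hypothetical helper)."""
    return Template(template).safe_substitute(values)


# Shortened stand-in for the "detect" prompt above
prompt_template = (
    "Description: ${description}\n"
    "Labels to detect: ${labels}\n"
    "Return JSON:\n"
    '{"detections": []}'
)

rendered = render_prompt(
    prompt_template,
    description="Street scene",
    labels="car, person",
)
```

`safe_substitute` is used here so that a missing value leaves the placeholder visible in the output instead of raising `KeyError`.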

potato/ai/prompt/likert.json ADDED

	@@ -0,0 +1,20 @@

+{
+  "hint": {
+    "prompt": "TASK: Generate annotation guidance for Likert scale task.\n\nINPUT DETAILS:\n- Text to annotate: \"${text}\"\n- Annotation task: ${description}\n- Scale: ${min_label} (1) to ${max_label} (${size})\n- Scale points: 1, 2, 3, ..., ${size}\n\nINSTRUCTIONS:\n1. Analyze the text for features relevant to the annotation task\n2. Generate a helpful hint that guides thinking WITHOUT revealing the answer\n3. Suggest a scale position based on your analysis\n\nHINT REQUIREMENTS:\n- Focus on specific textual evidence (word choice, tone, structure)\n- Point out subtle indicators the annotator should notice\n- Be concrete and actionable, not generic\n- Guide analytical thinking without bias toward any scale position",
+    "output_format": "default_hint",
+    "img": "/static/ai_assistant_img/blub.svg"
+  },
+  "keyword": {
+    "prompt": "TASK: Extract key words/phrases that guide Likert scale annotation decisions.\n\nINPUT DETAILS:\n- Text: \"${text}\"\n- Annotation task: ${description}\n- Scale: ${min_label} (1) to ${max_label} (${size})\n- Scale range: 1, 2, 3, ..., ${size}\n\nOBJECTIVE: Identify 3-5 most significant words/phrases that directly indicate scale positioning. Output as JSON array format.\n\nSELECTION CRITERIA:\n- Words that signal intensity/degree (extremely, slightly, moderately)\n- Sentiment markers (positive/negative indicators)\n- Qualifying language (hedges, certainty markers)\n- Context-specific terminology relevant to the annotation task\n- Structural indicators (but, however, although)\n\nPRIORITIZE:\n1. Words with clear scale implications\n2. Phrases that distinguish between scale levels\n3. Contextual clues that affect interpretation\n4. Intensity modifiers and qualifiers",
+    "output_format": "default_keyword",
+    "img": "/static/ai_assistant_img/highlight.svg"
+  },
+  "rationale": {
+    "name": "Rationale",
+    "prompt": "TASK: Generate rationales for different positions on the Likert scale.\n\nINPUT DETAILS:\n- Text to annotate: \"${text}\"\n- Annotation task: ${description}\n- Scale: ${min_label} (1) to ${max_label} (${size})\n\nINSTRUCTIONS:\nProvide rationales for different scale positions (low, middle, high). Explain what evidence in the text could support each position. Be balanced and objective.\n\nOUTPUT FORMAT:\nReturn a JSON object with \"rationales\" array containing objects with \"label\" (scale position descriptor) and \"reasoning\" fields.\n\nEXAMPLE OUTPUT:\n{\"rationales\": [{\"label\": \"low (1-2)\", \"reasoning\": \"The negative tone and words like 'disappointing' suggest a low rating\"}, {\"label\": \"middle (3)\", \"reasoning\": \"Mixed signals with both positive and negative elements\"}, {\"label\": \"high (4-5)\", \"reasoning\": \"Strong positive language like 'excellent' supports a high rating\"}]}",
+    "output_format": "default_rationale",
+    "img": "/static/ai_assistant_img/question.svg"
+  }
+}

potato/ai/prompt/models_module.py ADDED

	@@ -0,0 +1,258 @@

+from typing import Optional, Type, Union, Dict, List
+from pydantic import BaseModel
+class GeneralHintFormat(BaseModel):
+    hint: str
+    suggestive_choice: Union[str, int]
+class LabelKeywords(BaseModel):
+    """Keywords/phrases associated with a specific label."""
+    label: str
+    keywords: List[str]
+class GeneralKeywordFormat(BaseModel):
+    """Simplified keyword format: list of label -> keywords mappings.
+    Example output:
+    {
+        "label_keywords": [
+            {"label": "positive", "keywords": ["great", "love it", "excellent"]},
+            {"label": "negative", "keywords": ["terrible", "awful"]}
+        ]
+    }
+    """
+    label_keywords: List[LabelKeywords]
+class GeneralRandomFormat(BaseModel):
+    """Deprecated: Use GeneralRationaleFormat instead."""
+    random: str
+class LabelRationale(BaseModel):
+    """Rationale/reasoning for why a specific label might apply."""
+    label: str
+    reasoning: str
+class GeneralRationaleFormat(BaseModel):
+    """Rationale format: explanations for how each label might apply to the text.
+    Example output:
+    {
+        "rationales": [
+            {"label": "positive", "reasoning": "The phrase 'excellent quality' suggests satisfaction"},
+            {"label": "negative", "reasoning": "The mention of 'delayed shipping' indicates frustration"}
+        ]
+    }
+    """
+    rationales: List[LabelRationale]
+# ============================================================================
+# Visual Annotation Output Formats
+# ============================================================================
+class BoundingBox(BaseModel):
+    """Normalized bounding box coordinates (0-1 range).
+    x, y: top-left corner position
+    width, height: box dimensions
+    All values are normalized to image dimensions (0-1).
+    """
+    x: float
+    y: float
+    width: float
+    height: float
+class Detection(BaseModel):
+    """Single object detection result.
+    Example:
+    {
+        "label": "person",
+        "bbox": {"x": 0.1, "y": 0.2, "width": 0.3, "height": 0.5},
+        "confidence": 0.95
+    }
+    """
+    label: str
+    bbox: BoundingBox
+    confidence: float
+class VisualDetectionFormat(BaseModel):
+    """Object detection results for an image.
+    Example output:
+    {
+        "detections": [
+            {"label": "car", "bbox": {"x": 0.1, "y": 0.2, "width": 0.3, "height": 0.2}, "confidence": 0.92},
+            {"label": "person", "bbox": {"x": 0.5, "y": 0.3, "width": 0.1, "height": 0.4}, "confidence": 0.87}
+        ]
+    }
+    """
+    detections: List[Detection]
+class VisualClassificationFormat(BaseModel):
+    """Classification result for an image or region.
+    Example output:
+    {
+        "suggested_label": "cat",
+        "confidence": 0.89,
+        "reasoning": "The image shows a feline with pointed ears and whiskers"
+    }
+    """
+    suggested_label: str
+    confidence: float
+    reasoning: Optional[str] = None
+class VideoSegment(BaseModel):
+    """Temporal segment in a video.
+    Times are in seconds.
+    """
+    start_time: float
+    end_time: float
+    suggested_label: str
+    confidence: float
+    description: Optional[str] = None
+class VideoSceneDetectionFormat(BaseModel):
+    """Scene/segment detection results for a video.
+    Example output:
+    {
+        "segments": [
+            {"start_time": 0.0, "end_time": 5.5, "suggested_label": "intro", "confidence": 0.9},
+            {"start_time": 5.5, "end_time": 15.0, "suggested_label": "action", "confidence": 0.85}
+        ]
+    }
+    """
+    segments: List[VideoSegment]
+class VideoKeyframe(BaseModel):
+    """Keyframe annotation for a video.
+    timestamp: Time in seconds
+    """
+    timestamp: float
+    suggested_label: str
+    confidence: float
+    reason: Optional[str] = None
+class VideoKeyframeDetectionFormat(BaseModel):
+    """Keyframe detection results for a video.
+    Example output:
+    {
+        "keyframes": [
+            {"timestamp": 2.5, "suggested_label": "scene_change", "confidence": 0.95, "reason": "Major visual transition"},
+            {"timestamp": 8.0, "suggested_label": "action_peak", "confidence": 0.82, "reason": "Key moment in action"}
+        ]
+    }
+    """
+    keyframes: List[VideoKeyframe]
+class TrackPosition(BaseModel):
+    """Object position in a single frame for tracking."""
+    frame_index: int
+    bbox: BoundingBox
+    confidence: float
+class ObjectTrack(BaseModel):
+    """Tracked object across multiple frames."""
+    track_id: int
+    label: str
+    positions: List[TrackPosition]
+class VideoTrackingSuggestionFormat(BaseModel):
+    """Object tracking suggestions for a video.
+    Example output:
+    {
+        "tracks": [
+            {
+                "track_id": 1,
+                "label": "person",
+                "positions": [
+                    {"frame_index": 0, "bbox": {"x": 0.1, "y": 0.2, "width": 0.15, "height": 0.3}, "confidence": 0.9},
+                    {"frame_index": 1, "bbox": {"x": 0.12, "y": 0.22, "width": 0.15, "height": 0.3}, "confidence": 0.88}
+                ]
+            }
+        ]
+    }
+    """
+    tracks: List[ObjectTrack]
+class FrameDetections(BaseModel):
+    """Detections for a single video frame."""
+    frame_index: int
+    detections: List[Detection]
+class MultiFrameDetectionFormat(BaseModel):
+    """Detection results across multiple video frames.
+    Used when running detection on sampled video frames.
+    """
+    frames: List[FrameDetections]
+# ============================================================================
+# Option Highlighting Output Format
+# ============================================================================
+class OptionHighlightFormat(BaseModel):
+    """LLM response for option highlighting.
+    Used to identify the most likely correct options for a discrete annotation task.
+    The highlighted options are shown at full opacity while others are dimmed.
+    Example output:
+    {
+        "highlighted_options": ["positive", "neutral"],
+        "confidence": 0.85
+    }
+    """
+    highlighted_options: List[str]  # Top-k most likely option names/values
+    confidence: Optional[float] = None  # Optional overall confidence score (0-1)
+# ============================================================================
+# Class Registry
+# ============================================================================
+CLASS_REGISTRY = {
+    # Text annotation formats
+    "default_hint": GeneralHintFormat,
+    "default_keyword": GeneralKeywordFormat,
+    "default_random": GeneralRandomFormat,  # Keep for backwards compatibility
+    "default_rationale": GeneralRationaleFormat,
+    # Option highlighting format
+    "option_highlight": OptionHighlightFormat,
+    # Visual annotation formats - Image
+    "visual_detection": VisualDetectionFormat,
+    "visual_classification": VisualClassificationFormat,
+    # Visual annotation formats - Video
+    "video_scene_detection": VideoSceneDetectionFormat,
+    "video_keyframe_detection": VideoKeyframeDetectionFormat,
+    "video_tracking_suggestion": VideoTrackingSuggestionFormat,
+    "multi_frame_detection": MultiFrameDetectionFormat,
+}
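The `BoundingBox` model above stores `x`, `y`, `width` and `height` normalized to the 0-1 range, with `x, y` as the top-left corner. A consumer that draws boxes needs pixel coordinates; the sketch below shows one way to convert, using only the standard library. `PixelBox` and `to_pixels` are hypothetical names, and clamping to the image bounds is an assumption about how out-of-range model output should be handled.

```python
from dataclasses import dataclass


@dataclass
class PixelBox:
    left: int
    top: int
    right: int
    bottom: int


def to_pixels(bbox: dict, image_width: int, image_height: int) -> PixelBox:
    """Convert a normalized bbox dict to pixel edges, clamped to the image (hypothetical helper)."""

    def clamp(value: float, upper: int) -> int:
        # Round to the nearest pixel and keep the result inside [0, upper]
        return max(0, min(upper, round(value)))

    return PixelBox(
        left=clamp(bbox["x"] * image_width, image_width),
        top=clamp(bbox["y"] * image_height, image_height),
        right=clamp((bbox["x"] + bbox["width"]) * image_width, image_width),
        bottom=clamp((bbox["y"] + bbox["height"]) * image_height, image_height),
    )


box = to_pixels({"x": 0.1, "y": 0.2, "width": 0.3, "height": 0.4}, 1000, 500)
overflow = to_pixels({"x": 0.9, "y": 0.9, "width": 0.5, "height": 0.5}, 1000, 500)
```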

potato/ai/prompt/multiselect.json ADDED

	@@ -0,0 +1,20 @@

+{
+  "hint": {
+    "prompt": "TASK: Generate guidance for multiple label selection.\n\nINPUT DETAILS:\n- Text to annotate: \"${text}\"\n- Annotation task: ${description}\n- Available labels: ${labels}\n\nINSTRUCTIONS:\n1. Analyze text for features that may correspond to multiple labels\n2. Guide toward identifying ALL applicable categories\n3. Focus on overlapping characteristics and comprehensive analysis\n\nHINT REQUIREMENTS:\n- Identify indicators for each potential label category\n- Point out that multiple selections may be appropriate\n- Guide systematic evaluation of all label options\n- Highlight overlapping or complementary features",
+    "output_format": "default_hint",
+    "img": "/static/ai_assistant_img/blub.svg"
+  },
+  "keyword": {
+    "prompt": "TASK: Extract words/phrases that indicate multiple applicable labels.\n\nINPUT DETAILS:\n- Text: \"${text}\"\n- Annotation task: ${description}\n- Available labels: ${labels}\n\nOBJECTIVE: Identify terms that support selection of multiple labels.\n\nSELECTION CRITERIA:\n- Words that indicate multiple categories simultaneously\n- Terms supporting different label aspects\n- Overlapping category indicators\n- Comprehensive feature markers\n- Multi-faceted descriptors",
+    "output_format": "default_keyword",
+    "img": "/static/ai_assistant_img/highlight.svg"
+  },
+  "rationale": {
+    "name": "Rationale",
+    "prompt": "TASK: Generate rationales explaining why each label might apply to this text.\n\nINPUT DETAILS:\n- Text to annotate: \"${text}\"\n- Annotation task: ${description}\n- Available labels: ${labels}\n\nCRITICAL REQUIREMENT:\nYou MUST provide a rationale for EVERY label listed above. Your output must have exactly one entry per label.\n\nINSTRUCTIONS:\nFor EACH available label (ALL of them), provide a brief rationale explaining what evidence in the text could support selecting that label. Since multiple labels can be selected, focus on independent evidence for each label. If a label doesn't apply, explain why.\n\nOUTPUT FORMAT:\nReturn a JSON object with \"rationales\" array containing one object per label, each with \"label\" and \"reasoning\" fields.\n\nEXAMPLE OUTPUT:\n{\"rationales\": [{\"label\": \"category1\", \"reasoning\": \"The text mentions X which relates to this category\"}, {\"label\": \"category2\", \"reasoning\": \"The phrase Y suggests this also applies\"}, {\"label\": \"category3\", \"reasoning\": \"No direct evidence for this label in the text\"}]}",
+    "output_format": "default_rationale",
+    "img": "/static/ai_assistant_img/question.svg"
+  }
+}

potato/ai/prompt/number.json ADDED

	@@ -0,0 +1,18 @@

+{
+  "hint": {
+    "prompt": "TASK: Generate guidance for numerical value extraction/estimation.\n\nINPUT DETAILS:\n- Text to annotate: \"${text}\"\n- Annotation task: ${description}\n\nINSTRUCTIONS:\n1. Analyze the text for numerical clues or quantifiable elements\n2. Guide the annotator toward identifying the correct numerical value\n3. Focus on mathematical, statistical, or countable aspects\n\nHINT REQUIREMENTS:\n- Identify numerical indicators, quantities, or measurable elements\n- Point out calculation methods or counting strategies\n- Highlight context that affects numerical interpretation\n- Guide toward systematic analysis approach",
+    "output_format": "default_hint",
+    "img": "/static/ai_assistant_img/blub.svg"
+  },
+  "keyword": {
+    "prompt": "TASK: Extract words/phrases that contain or indicate numerical information.\n\nINPUT DETAILS:\n- Text: \"${text}\"\n- Annotation task: ${description}\n\nOBJECTIVE: Identify words/phrases that directly relate to the numerical answer.\n\nSELECTION CRITERIA:\n- Explicit numbers, quantities, or measurements\n- Words indicating amount, frequency, or degree\n- Mathematical or statistical terminology\n- Comparative language (more, less, double, half)\n- Time references, percentages, ratios",
+    "output_format": "default_keyword",
+    "img": "/static/ai_assistant_img/highlight.svg"
+  },
+  "rationale": {
+    "name": "Rationale",
+    "prompt": "TASK: Generate rationales for different approaches to determining the numerical answer.\n\nINPUT DETAILS:\n- Text to annotate: \"${text}\"\n- Annotation task: ${description}\n\nINSTRUCTIONS:\nProvide different rationales or approaches for determining the numerical value. Consider different interpretations or calculation methods if applicable.\n\nOUTPUT FORMAT:\nReturn a JSON object with \"rationales\" array containing objects with \"label\" (approach name) and \"reasoning\" fields.\n\nEXAMPLE OUTPUT:\n{\"rationales\": [{\"label\": \"literal count\", \"reasoning\": \"Counting explicit mentions gives X\"}, {\"label\": \"inclusive interpretation\", \"reasoning\": \"Including implicit references increases the count to Y\"}]}",
+    "output_format": "default_rationale",
+    "img": "/static/ai_assistant_img/question.svg"
+  }
+}

potato/ai/prompt/option_highlight.json ADDED

	@@ -0,0 +1,8 @@

+{
+  "option_highlight": {
+    "prompt": "TASK: Identify the most likely correct options for an annotation task.\n\nINPUT DETAILS:\n- Content to annotate: \"${text}\"\n- Annotation task: ${description}\n- Available options: ${labels}\n- Number of options to highlight: ${top_k}\n\nINSTRUCTIONS:\n1. Analyze the content carefully in the context of the annotation task\n2. Consider which ${top_k} options are most likely to be correct based on:\n   - Direct evidence in the content\n   - Contextual clues and tone\n   - Domain knowledge relevant to the task\n3. Select exactly ${top_k} options (or fewer if fewer options exist)\n\nIMPORTANT:\n- Return ONLY the option names/values exactly as they appear in the available options list\n- Do not modify, paraphrase, or abbreviate the option names\n- If you're uncertain, choose options that seem most plausible given the content\n\nOUTPUT FORMAT:\nReturn a JSON object with:\n- \"highlighted_options\": array of ${top_k} option names from the available options\n- \"confidence\": your confidence score from 0.0 to 1.0\n\nEXAMPLE (if options are: positive, negative, neutral):\n{\"highlighted_options\": [\"positive\", \"neutral\"], \"confidence\": 0.75}",
+    "output_format": "option_highlight",
+    "img": "/static/ai_assistant_img/highlight.svg",
+    "name": "Option Highlight"
+  }
+}
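The prompt above asks the model to return option names exactly as they appear in the available list, and at most `${top_k}` of them. A model may still paraphrase or over-select, so a caller might filter the response before dimming options in the UI. The sketch below is one way to do that; `filter_highlights` is a hypothetical helper and is not shown anywhere in this diff.

```python
from typing import List


def filter_highlights(highlighted: List[str], available: List[str], top_k: int) -> List[str]:
    """Keep only exact, de-duplicated matches from the available options, at most top_k."""
    allowed = set(available)
    kept: List[str] = []
    for option in highlighted:
        # Drop paraphrased or unknown names and repeated entries
        if option in allowed and option not in kept:
            kept.append(option)
    return kept[:top_k]


options = ["positive", "negative", "neutral"]
result = filter_highlights(["positive", "Neutral", "positive", "neutral", "negative"], options, 2)
```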

potato/ai/prompt/radio.json ADDED

	@@ -0,0 +1,18 @@

+{
+  "hint": {
+    "prompt": "TASK: Generate annotation guidance for single-choice selection.\n\nINPUT DETAILS:\n- Text to annotate: \"${text}\"\n- Annotation task: ${description}\n- Available labels: ${labels}\n\nINSTRUCTIONS:\n1. Analyze the text for features that distinguish between the available labels\n2. Generate a helpful hint that guides classification WITHOUT revealing the answer\n3. Focus on decision-making criteria between options\n\nHINT REQUIREMENTS:\n- Identify key textual indicators that differentiate between label options\n- Point out distinguishing features (style, content, context)\n- Guide analytical thinking toward the classification criteria\n- Be specific about what to examine, not generic",
+    "output_format": "default_hint",
+    "img": "/static/ai_assistant_img/blub.svg"
+  },
+  "keyword": {
+    "prompt": "TASK: Identify words or short phrases in the text that relate to each label.\n\nINPUT DETAILS:\n- Text: \"${text}\"\n- Annotation task: ${description}\n- Available labels: ${labels}\n\nINSTRUCTIONS:\nFor each label, find words or short phrases from the text that indicate or relate to that label. Only include words/phrases that actually appear in the text. Return 1-5 keywords per label. If no words relate to a label, return an empty list for that label.\n\nEXAMPLE OUTPUT FORMAT:\n{\"label_keywords\": [{\"label\": \"positive\", \"keywords\": [\"great\", \"love it\"]}, {\"label\": \"negative\", \"keywords\": [\"terrible\"]}]}",
+    "output_format": "default_keyword",
+    "img": "/static/ai_assistant_img/highlight.svg"
+  },
+  "rationale": {
+    "name": "Rationale",
+    "prompt": "TASK: Generate rationales explaining why each label might apply to this text.\n\nINPUT DETAILS:\n- Text to annotate: \"${text}\"\n- Annotation task: ${description}\n- Available labels: ${labels}\n\nCRITICAL REQUIREMENT:\nYou MUST provide a rationale for EVERY label listed above. Your output must have exactly one entry per label.\n\nINSTRUCTIONS:\nFor EACH available label (ALL of them), provide a brief rationale explaining what evidence in the text could support choosing that label. Be balanced and objective - present the case for each label fairly. If a label doesn't strongly apply, explain what would need to be present.\n\nOUTPUT FORMAT:\nReturn a JSON object with \"rationales\" array containing one object per label, each with \"label\" and \"reasoning\" fields.\n\nEXAMPLE (if labels are: positive, negative, neutral, mixed):\n{\"rationales\": [{\"label\": \"positive\", \"reasoning\": \"The phrase 'excellent service' suggests satisfaction\"}, {\"label\": \"negative\", \"reasoning\": \"No clear negative indicators present\"}, {\"label\": \"neutral\", \"reasoning\": \"The tone is more emotional than neutral\"}, {\"label\": \"mixed\", \"reasoning\": \"Would need both positive and negative elements\"}]}",
+    "output_format": "default_rationale",
+    "img": "/static/ai_assistant_img/question.svg"
+  }
+}
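The rationale prompt above requires exactly one entry per label. Since that requirement is only stated in the prompt text, a caller may want to verify it on the parsed response. The sketch below reports labels that are missing or repeated; `check_rationale_coverage` is a hypothetical helper, and the response shape is taken from the example output in the prompt.

```python
from collections import Counter
from typing import Dict, List, Tuple


def check_rationale_coverage(response: Dict, labels: List[str]) -> Tuple[List[str], List[str]]:
    """Return (missing, duplicated) labels for a parsed rationale response."""
    counts = Counter(item["label"] for item in response.get("rationales", []))
    missing = [label for label in labels if counts[label] == 0]
    duplicated = [label for label in labels if counts[label] > 1]
    return missing, duplicated


parsed = {
    "rationales": [
        {"label": "positive", "reasoning": "The phrase 'excellent service' suggests satisfaction"},
        {"label": "negative", "reasoning": "No clear negative indicators present"},
        {"label": "positive", "reasoning": "Repeated entry"},
    ]
}
missing, duplicated = check_rationale_coverage(parsed, ["positive", "negative", "neutral", "mixed"])
```

A non-empty `missing` list could be used to trigger a retry or to show a placeholder for the uncovered labels.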