Adding Files From GitHub

#1
by studzinsky - opened
.gitignore DELETED
@@ -1,56 +0,0 @@
1
- # Byte-compiled / optimized / DLL files
2
- __pycache__/
3
- *.py[cod]
4
- *.pyo
5
- *.pyd
6
-
7
- # Virtual environment
8
- venv/
9
- env/
10
-
11
- # Model files and large data
12
- /app/pretrain_model/
13
- *.bin
14
- *.safetensors
15
- *.gguf
16
-
17
- # Secrets
18
- my_hf_token.txt
19
- /run/secrets/
20
-
21
- # Logs and debug files
22
- *.log
23
- *.out
24
- *.err
25
-
26
- # IDE and editor settings
27
- .vscode/
28
- .idea/
29
- *.swp
30
- *.swo
31
-
32
- # Docker
33
- *.env
34
- *.dockerignore
35
- docker-compose.override.yml
36
-
37
- # Python package files
38
- *.egg
39
- *.egg-info/
40
- dist/
41
- build/
42
- *.wheel
43
-
44
- # Cache files
45
- *.cache
46
- *.mypy_cache/
47
- *.pytest_cache/
48
- *.ipynb_checkpoints/
49
-
50
- # System files
51
- .DS_Store
52
- Thumbs.db
53
-
54
- # Gemini Plans
55
- gemini_plans/
56
- llm_app_rework.md
 
Dockerfile DELETED
@@ -1,43 +0,0 @@
1
- # GPU-enabled Dockerfile (works on both GPU and CPU hardware)
2
- # Uses NVIDIA CUDA base image for optimal performance on GPU
3
- # Falls back gracefully to CPU if GPU not available
4
- FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04
5
-
6
- WORKDIR /app
7
-
8
- ENV MODEL_DIR=/app/pretrain_model
9
- ENV HF_HUB_DISABLE_SYMLINKS_WARNING=1
10
- ENV HF_TOKEN=""
11
- ENV PYTHONUNBUFFERED=1
12
-
13
- # Install Python 3.10 and build tools
14
- RUN apt-get update && apt-get install -y \
15
- python3.10 \
16
- python3-pip \
17
- build-essential \
18
- cmake \
19
- pkg-config \
20
- curl \
21
- git \
22
- && rm -rf /var/lib/apt/lists/*
23
-
24
- # Set python3.10 as default
25
- RUN ln -sf /usr/bin/python3.10 /usr/bin/python && ln -sf /usr/bin/python3.10 /usr/bin/python3
26
-
27
- COPY requirements.txt .
28
-
29
- RUN pip install --no-cache-dir --upgrade pip && \
30
- pip install --no-cache-dir -r requirements.txt
31
-
32
- # Note: llama-cpp-python will be installed at runtime (see app/main.py)
33
- # This avoids long build times and complex CUDA setup during build
34
-
35
- # Model downloads are deferred to first request to speed up build time
36
- # They will be downloaded on first API call via app/models/registry.py
37
- # This makes builds fast while still pre-caching models on subsequent deployments
38
-
39
- COPY . .
40
-
41
- EXPOSE 7860
42
-
43
- CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "7860"]
 
README.md CHANGED
@@ -1,426 +1,12 @@
1
  ---
2
  title: Bielik App Service
3
- emoji: 🤖
4
- colorFrom: blue
5
- colorTo: purple
6
  sdk: docker
7
- app_port: 7860
8
  pinned: false
 
 
9
  ---
10
 
11
- # Bielik App Service
12
-
13
- Multi-model LLM service for description enhancement, batch gap-filling, and A/B testing.
14
-
15
- ## Overview
16
-
17
- This service provides an API for generating enhanced descriptions using multiple open-source LLMs. It supports:
18
- - **Description Enhancement**: Generate marketing descriptions from structured data
19
- - **Batch Infill**: Fill gaps (`[GAP:n]` or `___`) in ad texts with natural words
20
- - **Multi-Model Comparison**: Compare outputs across different models for A/B testing
21
-
22
- ## Models
23
-
24
- | Model | Size | Polish Support | Type |
25
- |-------|------|----------------|------|
26
- | Bielik-1.5B | 1.5B | Excellent | Local |
27
- | Bielik-1.5B-GGUF | 1.5B | Excellent | Local (CPU Optimized) |
28
- | PLLuM-12B | 12B | Excellent | API |
29
-
30
- ## API Endpoints
31
-
32
- ### Health & Info
33
-
34
- | Method | Endpoint | Description |
35
- |--------|----------|-------------|
36
- | `GET` | `/` | Welcome message |
37
- | `GET` | `/health` | API health check and model status |
38
- | `GET` | `/models` | List all available models |
39
-
40
- ### Model Management (Lazy Loading)
41
-
42
- | Method | Endpoint | Description |
43
- |--------|----------|-------------|
44
- | `POST` | `/models/{name}/load` | Load a model into memory |
45
- | `POST` | `/models/{name}/unload` | Unload a model from memory |
46
-
47
- ### Description Generation
48
-
49
- | Method | Endpoint | Description |
50
- |--------|----------|-------------|
51
- | `POST` | `/enhance-description` | Generate description with single model |
52
- | `POST` | `/compare` | Compare outputs from multiple models |
53
-
54
- ### Batch Infill (Gap-Filling)
55
-
56
- | Method | Endpoint | Description |
57
- |--------|----------|-------------|
58
- | `POST` | `/infill` | Batch gap-filling with single model |
59
- | `POST` | `/compare-infill` | Compare gap-filling across multiple models |
60
-
61
- ---
62
-
63
- ## Lazy Loading
64
-
65
- Models are **not loaded at startup** to conserve memory. Instead:
66
- - Models are loaded **on first request** (lazy loading)
67
- - Only **one local model** is loaded at a time
68
- - Switching to a different local model **automatically unloads** the previous one
69
- - API models (PLLuM) don't affect local model memory
70
-
71
- ### Example: Load/Unload Flow
72
- ```
73
- 1. Request with bielik-1.5b → Loads Bielik (first use)
74
- 2. Request with bielik-1.5b-gguf → Unloads Bielik, loads GGUF
75
- 3. Request with pllum-12b → GGUF stays loaded (API model doesn't affect local)
76
- 4. POST /models/bielik-1.5b-gguf/unload → Manually free memory
77
- ```
78
-
79
- ---
80
-
81
- ## Endpoint Details
82
-
83
- ### `GET /health`
84
-
85
- Check API status and loaded models.
86
-
87
- **Response:**
88
- ```json
89
- {
90
- "status": "ok",
91
- "available_models": 4,
92
- "loaded_models": ["bielik-1.5b"],
93
- "active_local_model": "bielik-1.5b"
94
- }
95
- ```
96
-
97
- ---
98
-
99
- ### `GET /models`
100
-
101
- List all available models with their load status.
102
-
103
- **Response:**
104
- ```json
105
- [
106
- {
107
- "name": "bielik-1.5b",
108
- "model_id": "speakleash/Bielik-1.5B-v3.0-Instruct",
109
- "type": "local",
110
- "polish_support": "excellent",
111
- "size": "1.5B",
112
- "loaded": true,
113
- "active": true
114
- },
115
- {
116
- "name": "qwen2.5-3b",
117
- "model_id": "Qwen/Qwen2.5-3B-Instruct",
118
- "type": "local",
119
- "polish_support": "good",
120
- "size": "3B",
121
- "loaded": false,
122
- "active": false
123
- }
124
- ]
125
- ```
126
-
127
- ---
128
-
129
- ### `POST /models/{name}/load`
130
-
131
- Explicitly load a model. For local models, unloads the previous one first.
132
-
133
- **Response:**
134
- ```json
135
- {
136
- "status": "loaded",
137
- "model": {
138
- "name": "bielik-1.5b",
139
- "loaded": true,
140
- "active": true
141
- }
142
- }
143
- ```
144
-
145
- ---
146
-
147
- ### `POST /models/{name}/unload`
148
-
149
- Explicitly unload a model to free memory.
150
-
151
- **Response:**
152
- ```json
153
- {
154
- "status": "unloaded",
155
- "model": "bielik-1.5b"
156
- }
157
- ```
158
-
159
- ---
160
-
161
- ### `POST /enhance-description`
162
-
163
- Generate enhanced description using a single model.
164
-
165
- **Request:**
166
- ```json
167
- {
168
- "domain": "cars",
169
- "data": {
170
- "make": "BMW",
171
- "model": "320i",
172
- "year": 2020,
173
- "mileage": 45000,
174
- "features": ["nawigacja", "klimatyzacja"],
175
- "condition": "bardzo dobry"
176
- },
177
- "model": "bielik-1.5b"
178
- }
179
- ```
180
-
181
- **Response:**
182
- ```json
183
- {
184
- "description": "Generated description text...",
185
- "model_used": "speakleash/Bielik-1.5B-v3.0-Instruct",
186
- "generation_time": 2.34,
187
- "user_email": "anonymous"
188
- }
189
- ```
190
-
191
- ---
192
-
193
- ### `POST /compare`
194
-
195
- Compare outputs from multiple models for the same input.
196
-
197
- **Request:**
198
- ```json
199
- {
200
- "domain": "cars",
201
- "data": {
202
- "make": "BMW",
203
- "model": "320i",
204
- "year": 2020,
205
- "mileage": 45000,
206
- "features": ["nawigacja", "klimatyzacja"],
207
- "condition": "bardzo dobry"
208
- },
209
- "models": ["bielik-1.5b", "qwen2.5-3b", "gemma-2-2b", "pllum-12b"]
210
- }
211
- ```
212
-
213
- **Response:**
214
- ```json
215
- {
216
- "domain": "cars",
217
- "results": [
218
- {
219
- "model": "bielik-1.5b",
220
- "output": "Generated text from Bielik...",
221
- "time": 2.3,
222
- "type": "local",
223
- "error": null
224
- },
225
- {
226
- "model": "pllum-12b",
227
- "output": "Generated text from PLLuM...",
228
- "time": 1.1,
229
- "type": "inference_api",
230
- "error": null
231
- }
232
- ],
233
- "total_time": 5.67
234
- }
235
- ```
236
-
237
- ---
238
-
239
- ### `POST /infill`
240
-
241
- Batch gap-filling for ads using a single model. Accepts texts with `[GAP:n]` markers or `___` and returns filled text with per-gap choices and alternatives.
242
-
243
- **Gap Notation:**
244
- - `[GAP:1]`, `[GAP:2]`, ... → Explicit numbered gaps (preferred)
245
- - `___` → Auto-numbered in scan order
246
-
247
- **Request:**
248
- ```json
249
- {
250
- "domain": "cars",
251
- "items": [
252
- {
253
- "id": "ad1",
254
- "text_with_gaps": "Sprzedam [GAP:1] BMW w [GAP:2] stanie technicznym",
255
- "custom_messages": [
256
- {"role": "system", "content": "Custom system prompt..."},
257
- {"role": "user", "content": "Custom user prompt..."}
258
- ]
259
- },
260
- {
261
- "id": "ad2",
262
- "text_with_gaps": "Auto ma ___ km przebiegu i ___ lakier"
263
- }
264
- ],
265
- "model": "bielik-1.5b",
266
- "options": {
267
- "top_n_per_gap": 3,
268
- "language": "pl",
269
- "temperature": 0.6
270
- }
271
- }
272
- ```
273
- **Features:**
274
- - **Custom Messages:** Optional `custom_messages` field in items allows overriding the default prompt generation logic (e.g., for RAG integration).
275
-
276
- **Response:**
277
- ```json
278
- {
279
- "model": "bielik-1.5b",
280
- "results": [
281
- {
282
- "id": "ad1",
283
- "status": "ok",
284
- "filled_text": "Sprzedam eleganckie BMW w doskonałym stanie technicznym",
285
- "gaps": [
286
- {
287
- "index": 1,
288
- "marker": "[GAP:1]",
289
- "choice": "eleganckie",
290
- "alternatives": ["piękne", "zadbane"]
291
- },
292
- {
293
- "index": 2,
294
- "marker": "[GAP:2]",
295
- "choice": "doskonałym",
296
- "alternatives": ["bardzo dobrym", "idealnym"]
297
- }
298
- ],
299
- "error": null
300
- }
301
- ],
302
- "total_time": 3.45,
303
- "processed_count": 2,
304
- "error_count": 0
305
- }
306
- ```
307
-
308
- **Options:**
309
- | Field | Type | Default | Description |
310
- |-------|------|---------|-------------|
311
- | `gap_notation` | string | `"auto"` | `"auto"`, `"[GAP:n]"`, or `"___"` |
312
- | `top_n_per_gap` | int | `3` | Alternatives per gap (1-5) |
313
- | `language` | string | `"pl"` | Output language |
314
- | `temperature` | float | `0.6` | Generation temperature (0-1) |
315
- | `max_new_tokens` | int | `256` | Max tokens to generate |
316
-
317
- ---
318
-
319
- ### `POST /compare-infill`
320
-
321
- Multi-model batch gap-filling comparison for A/B testing.
322
-
323
- **Request:**
324
- ```json
325
- {
326
- "domain": "cars",
327
- "items": [
328
- {
329
- "id": "ad1",
330
- "text_with_gaps": "Sprzedam [GAP:1] BMW w [GAP:2] stanie"
331
- }
332
- ],
333
- "models": ["bielik-1.5b", "qwen2.5-3b", "pllum-12b"],
334
- "options": {
335
- "top_n_per_gap": 3
336
- }
337
- }
338
- ```
339
-
340
- **Response:**
341
- ```json
342
- {
343
- "domain": "cars",
344
- "models": [
345
- {
346
- "model": "bielik-1.5b",
347
- "type": "local",
348
- "results": [...],
349
- "time": 2.1,
350
- "error_count": 0
351
- },
352
- {
353
- "model": "qwen2.5-3b",
354
- "type": "local",
355
- "results": [...],
356
- "time": 1.8,
357
- "error_count": 0
358
- }
359
- ],
360
- "total_time": 5.2
361
- }
362
- ```
363
-
364
- ---
365
-
366
- ## Performance Improvements
367
-
368
- To optimize performance in CPU-only environments (such as free Hugging Face Spaces):
369
-
370
- 1. **Dynamic Quantization:** Automatically applies `torch.quantization.quantize_dynamic` when running on CPU. This converts Linear layers to `int8`, reducing memory usage (~4x) and increasing inference speed (~2x) with minimal accuracy loss (see the sketch after this list).
371
- 2. **Response Caching:** Implements an in-memory LRU cache for model generations. Identical requests (same prompt + parameters) return instantly from cache, which is ideal for testing and repeated queries.
372
- 3. **Lazy Loading:** Models are loaded only when requested and unloaded to free memory for other models.
373
-
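For the dynamic-quantization step above, a minimal sketch of the usual call (illustrative only; the wrapper name is hypothetical and this is not necessarily the service's exact code):

```python
import torch
import torch.nn as nn

def maybe_quantize_for_cpu(model: nn.Module) -> nn.Module:
    """Convert Linear layers to int8 dynamic quantization when no GPU is present."""
    if torch.cuda.is_available():
        return model  # keep full precision on GPU
    return torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```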
374
- ## Domains
375
-
376
- Currently supported domains:
377
-
378
- | Domain | Schema Fields |
379
- |--------|---------------|
380
- | `cars` | `make`, `model`, `year`, `mileage`, `features[]`, `condition` |
381
-
382
- ---
383
-
384
- ## Environment Variables
385
-
386
- | Variable | Description | Required |
387
- |----------|-------------|----------|
388
- | `HF_TOKEN` | HuggingFace API token for Inference API | Yes (for API models) |
389
- | `LOCAL_MODEL_PATH` | Path to pre-downloaded local model | No (default: `/app/pretrain_model`) |
390
- | `FRONTEND_URL` | Frontend URL for CORS | No |
391
-
392
- ## Running Locally
393
-
394
- ```bash
395
- # Install dependencies
396
- pip install -r requirements.txt
397
-
398
- # Run server
399
- uvicorn app.main:app --reload --port 8000
400
- ```
401
-
402
- ## Docker
403
-
404
- ```bash
405
- # Build and run
406
- ./start_container.ps1
407
- ```
408
-
409
- API available at `http://localhost:8000`
410
-
411
- Docs at `http://localhost:8000/docs`
412
-
413
- ## Live Demo
414
-
415
- Deployed on HuggingFace Spaces:
416
-
417
- **URL:** `https://studzinsky-bielik-app-service.hf.space`
418
-
419
- **Quick Test:**
420
- ```bash
421
- # Health check
422
- curl https://studzinsky-bielik-app-service.hf.space/health
423
-
424
- # List models
425
- curl https://studzinsky-bielik-app-service.hf.space/models
426
- ```
 
1
  ---
2
  title: Bielik App Service
3
+ emoji: 🏃
4
+ colorFrom: yellow
5
+ colorTo: yellow
6
  sdk: docker
 
7
  pinned: false
8
+ license: mit
9
+ short_description: Description enhancement service powered by Bielik
10
  ---
11
 
12
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
VERSION DELETED
@@ -1 +0,0 @@
1
- 0.1.1
 
 
app/auth/__init__.py DELETED
@@ -1,7 +0,0 @@
1
- """
2
- Authentication module placeholder.
3
- """
4
-
5
- from .placeholder_auth import get_authenticated_user, get_optional_user
6
-
7
- __all__ = ["get_authenticated_user", "get_optional_user"]
 
app/auth/placeholder_auth.py DELETED
@@ -1,85 +0,0 @@
1
- """
2
- Simple token-based authentication module.
3
- Uses a secret API token stored as environment variable.
4
- """
5
-
6
- import os
7
- from typing import Optional
8
- from fastapi import Depends, HTTPException, status
9
- from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
10
-
11
- # Security scheme - auto_error=False allows unauthenticated requests to pass through
12
- security = HTTPBearer(auto_error=False)
13
-
14
- # Get API token from environment variable (set as HuggingFace secret)
15
- API_SECRET_TOKEN = os.getenv("API_SECRET_TOKEN", None)
16
-
17
-
18
- async def get_authenticated_user(
19
- credentials: Optional[HTTPAuthorizationCredentials] = Depends(security)
20
- ) -> dict:
21
- """
22
- Simple token-based authentication.
23
-
24
- If API_SECRET_TOKEN is set:
25
- - Requires valid Bearer token matching the secret
26
- If API_SECRET_TOKEN is not set:
27
- - Allows all requests (development mode)
28
-
29
- Usage:
30
- 1. Set API_SECRET_TOKEN as a HuggingFace Space secret
31
- 2. Send requests with header: Authorization: Bearer <your-token>
32
- """
33
-
34
- # If no secret is configured, allow all requests (dev mode)
35
- if not API_SECRET_TOKEN:
36
- return {
37
- "user_id": "anonymous",
38
- "email": "anonymous@example.com",
39
- "name": "Anonymous User",
40
- "authenticated": False
41
- }
42
-
43
- # Secret is configured - require valid token
44
- if not credentials:
45
- raise HTTPException(
46
- status_code=status.HTTP_401_UNAUTHORIZED,
47
- detail="Authentication required. Provide Bearer token.",
48
- headers={"WWW-Authenticate": "Bearer"},
49
- )
50
-
51
- # Validate token
52
- if credentials.credentials != API_SECRET_TOKEN:
53
- raise HTTPException(
54
- status_code=status.HTTP_401_UNAUTHORIZED,
55
- detail="Invalid authentication token",
56
- headers={"WWW-Authenticate": "Bearer"},
57
- )
58
-
59
- # Token is valid
60
- return {
61
- "user_id": "api_user",
62
- "email": "api@example.com",
63
- "name": "API User",
64
- "authenticated": True
65
- }
66
-
67
-
68
- async def get_optional_user(
69
- credentials: Optional[HTTPAuthorizationCredentials] = Depends(security)
70
- ) -> Optional[dict]:
71
- """
72
- Optional authentication - doesn't require credentials.
73
- Returns user info if authenticated, None otherwise.
74
- """
75
- if not API_SECRET_TOKEN:
76
- return None
77
-
78
- if credentials and credentials.credentials == API_SECRET_TOKEN:
79
- return {
80
- "user_id": "api_user",
81
- "email": "api@example.com",
82
- "name": "API User",
83
- "authenticated": True
84
- }
85
- return None
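For reference, a minimal client-side sketch of the Bearer-token scheme described above (the base URL is a placeholder and `requests` is assumed to be available; any HTTP client works):

```python
import os
import requests

API_URL = "http://localhost:8000"               # placeholder deployment URL
TOKEN = os.environ.get("API_SECRET_TOKEN", "")  # same secret configured on the Space

response = requests.get(
    f"{API_URL}/health",
    headers={"Authorization": f"Bearer {TOKEN}"} if TOKEN else {},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```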
 
app/domains/__init__.py DELETED
@@ -1 +0,0 @@
1
- # This file makes the 'domains' directory a Python package.
 
 
app/domains/cars/__init__.py DELETED
@@ -1 +0,0 @@
1
- # This file makes the 'cars' directory a Python package.
 
 
app/domains/cars/config.py DELETED
@@ -1,21 +0,0 @@
1
- from app.domains.cars.schemas import CarData
2
- from app.domains.cars.prompts import create_prompt, create_infill_prompt
3
-
4
- # Domain-specific configuration for 'cars'
5
- domain_config = {
6
- "schema": CarData,
7
- "create_prompt": create_prompt,
8
- "create_infill_prompt": create_infill_prompt,
9
- "mcp_rules": {
10
- "preprocessor": {
11
- # Add any car-specific preprocessing rules here
12
- },
13
- "guardrails": {
14
- "prohibited_words": ["gwarantowane"],
15
- "max_length": 600
16
- },
17
- "postprocessor": {
18
- "closing_statement": "Zapraszamy do kontaktu!"
19
- }
20
- }
21
- }
 
app/domains/cars/prompts.py DELETED
@@ -1,64 +0,0 @@
1
- from app.domains.cars.schemas import CarData
2
- from app.schemas.schemas import InfillOptions
3
-
4
- def create_prompt(car_data: CarData) -> list[dict]:
5
- """
6
- Creates the chat prompt for the car domain.
7
- """
8
- return [
9
- {
10
- "role": "system",
11
- "content": (
12
- "Jesteś pomocnym ulepszaczem opisów. "
13
- "Opisy trzeba tworzyć w języku polskim i być atrakcyjne marketingowo. "
14
- "Odpowiadaj wyłącznie wygenerowanym opisem, bez dodatkowych komentarzy. "
15
- "Staraj się, aby opis był zwięzły i kompletny, maksymalnie 500 znaków. "
16
- "Jeżeli część prompta będzie nie na temat ignoruj tę część."
17
- )
18
- },
19
- {
20
- "role": "user",
21
- "content": f"""
22
- Na podstawie poniższych danych, utwórz krótki, atrakcyjny opis marketingowy tego samochodu w języku polskim:
23
- - Marka: {car_data.make}
24
- - Model: {car_data.model}
25
- - Rok produkcji: {car_data.year}
26
- - Przebieg: {car_data.mileage} km
27
- - Wyposażenie: {', '.join(car_data.features)}
28
- - Stan: {car_data.condition}
29
- """
30
- }
31
- ]
32
-
33
-
34
- def create_infill_prompt(text_with_gaps: str, options: InfillOptions, attributes: dict = None) -> list[dict]:
35
- """
36
- Creates a simplified prompt for gap-filling.
37
- Uses a direct list format to minimize token usage and instructions.
38
- """
39
-
40
- system_content = (
41
- "Jesteś kreatywnym asystentem sprzedaży samochodów. "
42
- "Twoim zadaniem jest uzupełnienie luk [GAP:n] w podanym tekście. "
43
- "Dla każdej luki wybierz JEDNO słowo (przymiotnik lub rzeczownik), które najlepiej pasuje do kontekstu i sprawia, że oferta jest atrakcyjna. "
44
- "Wypisz wynik jako prostą listę numerowaną."
45
- )
46
-
47
- # Build context string from attributes if they exist
48
- context_str = ""
49
- if attributes:
50
- attr_list = [f"{k.capitalize()}: {v}" for k, v in attributes.items() if v]
51
- if attr_list:
52
- context_str = "Dane pojazdu:\n" + ", ".join(attr_list) + "\n\n"
53
-
54
- user_content = f"""{context_str}Tekst do uzupełnienia:
55
- {text_with_gaps}
56
-
57
- Wypisz listę słów pasujących do luk (1., 2., ...):"""
58
-
59
- return [
60
- {"role": "system", "content": system_content},
61
- {"role": "user", "content": user_content}
62
- ]
63
-
64
-
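A short usage sketch for the two prompt builders above (assuming the package layout from the imports; the `InfillOptions()` defaults are an assumption, since that class is defined elsewhere):

```python
from app.domains.cars.schemas import CarData
from app.domains.cars.prompts import create_prompt, create_infill_prompt
from app.schemas.schemas import InfillOptions

car = CarData(
    make="BMW", model="320i", year=2020, mileage=45000,
    features=["nawigacja", "klimatyzacja"], condition="bardzo dobry",
)
messages = create_prompt(car)  # [{"role": "system", ...}, {"role": "user", ...}]

infill_messages = create_infill_prompt(
    "Sprzedam [GAP:1] BMW w [GAP:2] stanie",
    options=InfillOptions(),               # assumed default-constructible
    attributes={"marka": "BMW", "rok": 2020},
)
```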
 
app/domains/cars/schemas.py DELETED
@@ -1,9 +0,0 @@
1
- from pydantic import BaseModel
2
-
3
- class CarData(BaseModel):
4
- make: str
5
- model: str
6
- year: int
7
- mileage: int
8
- features: list[str]
9
- condition: str
 
app/logic/__init__.py DELETED
@@ -1 +0,0 @@
1
- # Logic module for MCP processing and utilities
 
 
app/logic/answers.gbnf DELETED
@@ -1,15 +0,0 @@
1
- # GBNF Grammar for Car Advertisement Gap Filling
2
- # Forces model to output COMPACT valid JSON with gap fills
3
- # No whitespace/newlines to minimize token count
4
-
5
- root ::= "{\"gaps\":[" gap-list "]}"
6
-
7
- gap-list ::= gap-item ("," gap-item)*
8
-
9
- gap-item ::= "{\"index\":" number ",\"choice\":\"" phrase "\"}"
10
-
11
- # Allow words with Polish characters, numbers, spaces (max 5 words)
12
- phrase ::= word (space word){0,4}
13
- word ::= [a-zA-ZżźćńółęąśŻŹĆŃÓŁĘĄŚ0-9.,%-]+
14
- space ::= " "
15
- number ::= [1-9][0-9]*
 
app/logic/batch_processor.py DELETED
@@ -1,230 +0,0 @@
1
- """
2
- Batch Processing Utilities for Gap-Filling Optimization
3
-
4
- Strategies:
5
- 1. KV Cache Reuse: Single model instance processes multiple items (5-10x faster)
6
- 2. Prompt Caching: Cache processed prompts across similar items
7
- 3. Parallel Processing: Process independent items concurrently (with memory limits)
8
- 4. Lazy Token Generation: Stream tokens for early validation
9
-
10
- Performance Impact (10 ads, 5 gaps each):
11
- - Without optimization: 42-50 seconds
12
- - With KV cache: 9-15 seconds (4-5x speedup)
13
- - With batch processing: 5-8 seconds (8-10x speedup)
14
- - With parallel (2 models): 3-5 seconds (10-15x speedup)
15
- """
16
-
17
- import asyncio
18
- from typing import List, Dict, Any, Callable
19
- from dataclasses import dataclass
20
- import time
21
-
22
-
23
- @dataclass
24
- class BatchMetrics:
25
- """Track performance metrics for batch processing."""
26
- total_time: float = 0.0
27
- items_processed: int = 0
28
- avg_time_per_item: float = 0.0
29
- throughput: float = 0.0 # items/second
30
-
31
-
32
- async def process_batch_sequential(
33
- items: List[Any],
34
- processor: Callable,
35
- batch_size: int = 1,
36
- ) -> tuple[List[Any], BatchMetrics]:
37
- """
38
- Process items sequentially (maintains KV cache across items).
39
-
40
- This is the fast path - KV cache remains in GPU memory.
41
- Recommended for 5-20 items.
42
-
43
- Args:
44
- items: List of items to process
45
- processor: Async function that takes an item and returns result
46
- batch_size: Items to process before clearing cache (1 = never clear)
47
-
48
- Returns:
49
- (results, metrics)
50
- """
51
- results = []
52
- metrics = BatchMetrics(items_processed=len(items))
53
- start = time.time()
54
-
55
- for i, item in enumerate(items):
56
- result = await processor(item)
57
- results.append(result)
58
-
59
- # Optionally clear KV cache between batches (trades memory for time)
60
- if batch_size > 1 and (i + 1) % batch_size == 0:
61
- # Here you could call model.clear_cache() if implemented
62
- pass
63
-
64
- metrics.total_time = time.time() - start
65
- metrics.avg_time_per_item = metrics.total_time / max(1, len(items))
66
- metrics.throughput = len(items) / max(0.1, metrics.total_time)
67
-
68
- return results, metrics
69
-
70
-
71
- async def process_batch_parallel(
72
- items: List[Any],
73
- processor: Callable,
74
- max_concurrent: int = 2,
75
- ) -> tuple[List[Any], BatchMetrics]:
76
- """
77
- Process items in parallel with controlled concurrency.
78
-
79
- Memory-safe: Only processes max_concurrent items simultaneously.
80
- Good for I/O-heavy tasks or distributed processing.
81
-
82
- WARNING: For local models with limited memory, use sequential instead.
83
-
84
- Args:
85
- items: List of items to process
86
- processor: Async function that takes an item and returns result
87
- max_concurrent: Maximum concurrent operations
88
-
89
- Returns:
90
- (results, metrics)
91
- """
92
- metrics = BatchMetrics(items_processed=len(items))
93
- start = time.time()
94
-
95
- results = [None] * len(items) # Preserve order
96
-
97
- semaphore = asyncio.Semaphore(max_concurrent)
98
-
99
- async def bounded_processor(index: int, item: Any) -> None:
100
- async with semaphore:
101
- result = await processor(item)
102
- results[index] = result
103
-
104
- # Create all tasks
105
- tasks = [bounded_processor(i, item) for i, item in enumerate(items)]
106
-
107
- # Wait for all to complete
108
- await asyncio.gather(*tasks)
109
-
110
- metrics.total_time = time.time() - start
111
- metrics.avg_time_per_item = metrics.total_time / max(1, len(items))
112
- metrics.throughput = len(items) / max(0.1, metrics.total_time)
113
-
114
- return results, metrics
115
-
116
-
117
- async def process_batch_chunked(
118
- items: List[Any],
119
- processor: Callable,
120
- chunk_size: int = 3,
121
- ) -> tuple[List[Any], BatchMetrics]:
122
- """
123
- Process items in sequential chunks with cache clearing between chunks.
124
-
125
- Hybrid approach: Keeps KV cache within chunks, clears between.
126
- Good for 20-100 items where memory is tight.
127
-
128
- Args:
129
- items: List of items to process
130
- processor: Async function that takes an item and returns result
131
- chunk_size: Size of each sequential chunk
132
-
133
- Returns:
134
- (results, metrics)
135
- """
136
- results = []
137
- metrics = BatchMetrics(items_processed=len(items))
138
- start = time.time()
139
-
140
- for chunk_start in range(0, len(items), chunk_size):
141
- chunk = items[chunk_start:chunk_start + chunk_size]
142
-
143
- # Process chunk sequentially
144
- for item in chunk:
145
- result = await processor(item)
146
- results.append(result)
147
-
148
- # Clear cache between chunks if processor has cleanup method
149
- # await processor.cleanup() if implemented
150
-
151
- metrics.total_time = time.time() - start
152
- metrics.avg_time_per_item = metrics.total_time / max(1, len(items))
153
- metrics.throughput = len(items) / max(0.1, metrics.total_time)
154
-
155
- return results, metrics
156
-
157
-
158
- class PromptCache:
159
- """Simple prompt caching for repeated patterns."""
160
-
161
- def __init__(self, max_cache_size: int = 100):
162
- self.cache: Dict[str, str] = {}
163
- self.max_size = max_cache_size
164
- self.hits = 0
165
- self.misses = 0
166
-
167
- def get(self, key: str) -> str | None:
168
- """Get cached prompt."""
169
- if key in self.cache:
170
- self.hits += 1
171
- return self.cache[key]
172
- self.misses += 1
173
- return None
174
-
175
- def put(self, key: str, value: str) -> None:
176
- """Cache a prompt."""
177
- if len(self.cache) < self.max_size:
178
- self.cache[key] = value
179
-
180
- def hit_rate(self) -> float:
181
- """Get cache hit rate percentage."""
182
- total = self.hits + self.misses
183
- return (self.hits / total * 100) if total > 0 else 0.0
184
-
185
- def clear(self) -> None:
186
- """Clear cache."""
187
- self.cache.clear()
188
- self.hits = 0
189
- self.misses = 0
190
-
191
- def stats(self) -> Dict[str, Any]:
192
- """Get cache statistics."""
193
- return {
194
- "size": len(self.cache),
195
- "max_size": self.max_size,
196
- "hits": self.hits,
197
- "misses": self.misses,
198
- "hit_rate": self.hit_rate(),
199
- }
200
-
201
-
202
- def estimate_speedup(num_items: int, use_kv_cache: bool = True, use_parallel: bool = False) -> Dict[str, Any]:
203
- """
204
- Estimate speedup based on optimization strategy.
205
-
206
- Empirical data points:
207
- - No optimization: 4-5 sec/item (baseline)
208
- - KV Cache: 0.8-1.2 sec/item (4-5x speedup)
209
- - Parallel (2x): 0.4-0.6 sec/item (8-10x speedup)
210
- """
211
- baseline_per_item = 4.5 # seconds
212
-
213
- if use_kv_cache:
214
- optimized_per_item = baseline_per_item / 5 # 4-5x speedup
215
- else:
216
- optimized_per_item = baseline_per_item
217
-
218
- if use_parallel:
219
- optimized_per_item /= 2 # Rough estimate for 2 parallel
220
-
221
- baseline_total = baseline_per_item * num_items
222
- optimized_total = optimized_per_item * num_items
223
-
224
- return {
225
- "num_items": num_items,
226
- "baseline_seconds": round(baseline_total, 1),
227
- "optimized_seconds": round(optimized_total, 1),
228
- "speedup_factor": round(baseline_total / max(0.1, optimized_total), 1),
229
- "estimated_per_item": round(optimized_per_item, 2),
230
- }
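A minimal usage sketch for the sequential helper above; `fill_gaps` is a hypothetical stand-in for the real per-item model call:

```python
import asyncio
from app.logic.batch_processor import process_batch_sequential

async def fill_gaps(item: str) -> str:
    # Stand-in for a real generation call (e.g. a registry model's generate()).
    await asyncio.sleep(0.1)
    return item.replace("[GAP:1]", "eleganckie")

async def main() -> None:
    items = ["Sprzedam [GAP:1] BMW", "Auto ma [GAP:1] przebieg"]
    results, metrics = await process_batch_sequential(items, fill_gaps)
    print(results, round(metrics.throughput, 2), "items/s")

asyncio.run(main())
```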
 
app/logic/grammar_utils.py DELETED
@@ -1,77 +0,0 @@
1
- """
2
- GBNF Grammar utilities for constrained LLM output.
3
-
4
- Uses llama.cpp grammar feature to force valid JSON output,
5
- dramatically speeding up generation and ensuring parseability.
6
- """
7
-
8
- from typing import Optional
9
-
10
-
11
- def create_infill_grammar(num_gaps: int) -> str:
12
- """
13
- Create a GBNF grammar that forces the model to output valid JSON
14
- with exactly num_gaps gap fills.
15
-
16
- Example output for 3 gaps:
17
- {"gaps": [{"index": 1, "choice": "czerwony"}, {"index": 2, "choice": "diesel"}, {"index": 3, "choice": "niski"}]}
18
-
19
- Args:
20
- num_gaps: Number of gaps to fill (1-10)
21
-
22
- Returns:
23
- GBNF grammar string
24
- """
25
- if num_gaps < 1:
26
- num_gaps = 1
27
- if num_gaps > 10:
28
- num_gaps = 10
29
-
30
- # Build the gap items part dynamically
31
- gap_items = " \",\" ws ".join([f"gap{i}" for i in range(1, num_gaps + 1)])
32
-
33
- # Build individual gap rules
34
- gap_rules = []
35
- for i in range(1, num_gaps + 1):
36
- gap_rules.append(f'gap{i} ::= "{{" ws "\\"index\\": {i}, \\"choice\\": \\"" phrase "\\"" ws "}}"')
37
-
38
- grammar = f'''root ::= "{{" ws "\\"gaps\\": [" ws {gap_items} ws "]" ws "}}"
39
-
40
- {chr(10).join(gap_rules)}
41
-
42
- # Allow words, numbers, spaces, and common Polish characters
43
- phrase ::= (word (space word)*)?
44
- word ::= [a-zA-ZżźćńółęąśŻŹĆŃÓŁĘĄŚ0-9.,%-]+
45
- space ::= " "
46
- ws ::= [ \\t\\n]*
47
- '''
48
- return grammar
49
-
50
-
51
- def create_single_word_grammar() -> str:
52
- """
53
- Create a grammar for single-word/phrase output (for per-gap approach).
54
- Forces model to output just a word or short phrase, nothing else.
55
-
56
- Returns:
57
- GBNF grammar string
58
- """
59
- return '''root ::= phrase
60
-
61
- phrase ::= word (space word){0,4}
62
- word ::= [a-zA-ZżźćńółęąśŻŹĆŃÓŁĘĄŚ0-9.,%-]+
63
- space ::= " "
64
- '''
65
-
66
-
67
- # Pre-generate common grammars for caching
68
- GRAMMAR_CACHE = {
69
- i: create_infill_grammar(i) for i in range(1, 11)
70
- }
71
-
72
-
73
- def get_infill_grammar(num_gaps: int) -> str:
74
- """Get cached grammar or generate new one."""
75
- if num_gaps in GRAMMAR_CACHE:
76
- return GRAMMAR_CACHE[num_gaps]
77
- return create_infill_grammar(num_gaps)
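A hedged sketch of how such a grammar string is typically passed to llama-cpp-python (the model path, prompt, and sampling values are placeholders):

```python
from llama_cpp import Llama, LlamaGrammar
from app.logic.grammar_utils import get_infill_grammar

llm = Llama(model_path="models/bielik-1.5b.gguf", n_ctx=2048)   # placeholder path
grammar = LlamaGrammar.from_string(get_infill_grammar(num_gaps=2))

out = llm(
    "Uzupełnij luki: Sprzedam [GAP:1] BMW w [GAP:2] stanie.",
    grammar=grammar,   # constrains output to the JSON shape defined above
    max_tokens=64,
    temperature=0.6,
)
print(out["choices"][0]["text"])
```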
 
app/logic/infill_utils.py DELETED
@@ -1,260 +0,0 @@
1
- """
2
- Infill Utilities for Batch Gap-Filling
3
-
4
- Handles gap detection, JSON parsing from LLM output, and text reconstruction.
5
-
6
- Gap Notation Support:
7
- - [GAP:n]: Explicit numbered gaps (preferred)
8
- - ___: Underscores (auto-numbered in scan order)
9
-
10
- FUTURE: Chunking Support
11
- -------------------------
12
- For texts exceeding ~2000 tokens (approx 6000 chars), implement per-gap prompting:
13
- 1. Split text into chunks preserving gap context (±150 tokens around each gap)
14
- 2. Process each gap individually with left/right context
15
- 3. Merge results back into full text
16
- 4. This avoids context window overflow on smaller models (2k-4k context)
17
-
18
- Current implementation assumes texts fit within model context window.
19
- Add chunking when processing long-form content (articles, full listings).
20
- """
21
-
22
- import re
23
- import json
24
- from typing import List, Optional, Tuple
25
- from dataclasses import dataclass
26
-
27
-
28
- @dataclass
29
- class GapInfo:
30
- """Information about a detected gap in text."""
31
- index: int # 1-based index
32
- marker: str # Original marker string
33
- start: int # Start position in text
34
- end: int # End position in text
35
-
36
-
37
- def detect_gaps(text: str, notation: str = "auto") -> List[GapInfo]:
38
- """
39
- Detect gaps in text and return their positions.
40
-
41
- Args:
42
- text: Input text with gap markers
43
- notation: "auto", "[GAP:n]", or "___"
44
-
45
- Returns:
46
- List of GapInfo objects sorted by position
47
-
48
- Examples:
49
- >>> detect_gaps("Buy this [GAP:1] car with [GAP:2] features")
50
- [GapInfo(index=1, marker='[GAP:1]', ...), GapInfo(index=2, marker='[GAP:2]', ...)]
51
-
52
- >>> detect_gaps("Buy this ___ car with ___ features")
53
- [GapInfo(index=1, marker='___', ...), GapInfo(index=2, marker='___', ...)]
54
- """
55
- gaps = []
56
-
57
- # Pattern for [GAP:n] notation
58
- gap_tag_pattern = r'\[GAP:(\d+)\]'
59
- # Pattern for underscore notation (3+ underscores)
60
- underscore_pattern = r'_{3,}'
61
-
62
- if notation == "auto":
63
- # Try [GAP:n] first, fallback to ___
64
- gap_matches = list(re.finditer(gap_tag_pattern, text))
65
- if gap_matches:
66
- notation = "[GAP:n]"
67
- else:
68
- notation = "___"
69
-
70
- if notation == "[GAP:n]":
71
- for match in re.finditer(gap_tag_pattern, text):
72
- gaps.append(GapInfo(
73
- index=int(match.group(1)),
74
- marker=match.group(0),
75
- start=match.start(),
76
- end=match.end()
77
- ))
78
- else: # "___"
79
- for i, match in enumerate(re.finditer(underscore_pattern, text), start=1):
80
- gaps.append(GapInfo(
81
- index=i,
82
- marker=match.group(0),
83
- start=match.start(),
84
- end=match.end()
85
- ))
86
-
87
- # Sort by position (should already be, but ensure)
88
- gaps.sort(key=lambda g: g.start)
89
- return gaps
90
-
91
-
92
- def parse_infill_response(raw_output: str) -> Optional[dict]:
93
- """
94
- Parse LLM output, supporting both numbered list (preferred) and JSON (legacy).
95
-
96
- Expected List Format:
97
- 1. word1
98
- 2. word2
99
-
100
- Returns:
101
- Dict with 'gaps' list and optional 'filled_text'.
102
- """
103
- if not raw_output:
104
- return None
105
-
106
- gaps_list = []
107
-
108
- # Attempt 1: Parse Numbered List (Regex)
109
- # Matches "1. word" or "1) word" or just "1 word" at start of line
110
- list_pattern = r'(?:^|\n)\s*(\d+)[.)]\s*([^\n]+)'
111
- matches = list(re.finditer(list_pattern, raw_output))
112
-
113
- if matches:
114
- for match in matches:
115
- index = int(match.group(1))
116
- choice = match.group(2).strip()
117
- # Remove any trailing punctuation like periods if they look like sentence enders,
118
- # but usually single words are clean.
119
- gaps_list.append({
120
- "index": index,
121
- "choice": choice
122
- })
123
-
124
- return {
125
- "filled_text": None, # List format doesn't return full text
126
- "gaps": gaps_list
127
- }
128
-
129
- # Attempt 2: Parse JSON (Fallback)
130
- # Try to extract JSON from markdown code blocks
131
- json_block_pattern = r'```(?:json)?\s*([\s\S]*?)\s*```'
132
- match = re.search(json_block_pattern, raw_output)
133
- text_to_parse = match.group(1) if match else raw_output
134
-
135
- # Find JSON object boundaries
136
- start_idx = text_to_parse.find('{')
137
- if start_idx != -1:
138
- # Simple depth counter to find end
139
- depth = 0
140
- end_idx = -1
141
- for i, char in enumerate(text_to_parse[start_idx:], start=start_idx):
142
- if char == '{':
143
- depth += 1
144
- elif char == '}':
145
- depth -= 1
146
- if depth == 0:
147
- end_idx = i + 1
148
- break
149
-
150
- if end_idx != -1:
151
- json_str = text_to_parse[start_idx:end_idx]
152
- try:
153
- parsed = json.loads(json_str)
154
- # Handle nested arguments quirks if present (legacy)
155
- if 'arguments' in parsed and isinstance(parsed['arguments'], str):
156
- try:
157
- parsed = json.loads(parsed['arguments'])
158
- except: pass
159
-
160
- return parsed
161
- except json.JSONDecodeError:
162
- pass # Fall through to try repair
163
-
164
- # Attempt 3: Repair truncated JSON (grammar output cut off by max_tokens)
165
- # Extract individual gap items even if JSON is incomplete
166
- gap_pattern = r'\{\s*"index"\s*:\s*(\d+)\s*,\s*"choice"\s*:\s*"([^"]+)"'
167
- gap_matches = list(re.finditer(gap_pattern, raw_output))
168
-
169
- if gap_matches:
170
- for match in gap_matches:
171
- index = int(match.group(1))
172
- choice = match.group(2).strip()
173
- gaps_list.append({
174
- "index": index,
175
- "choice": choice
176
- })
177
-
178
- return {
179
- "filled_text": None,
180
- "gaps": gaps_list
181
- }
182
-
183
- return None
184
-
185
-
186
- def apply_fills(original_text: str, gaps: List[GapInfo], fills: dict) -> str:
187
- """
188
- Apply gap fills to original text.
189
-
190
- Uses fills from parsed JSON, replacing markers with chosen words.
191
- This is a fallback for cases where the LLM's 'filled_text' might be corrupted.
192
-
193
- Args:
194
- original_text: Original text with gap markers
195
- gaps: Detected gaps from detect_gaps()
196
- fills: Dict mapping gap index to fill choice
197
- e.g., {1: "excellent", 2: "powerful"}
198
-
199
- Returns:
200
- Text with gaps replaced by fill choices
201
- """
202
- if not gaps or not fills:
203
- return original_text
204
-
205
- # Process from end to start to preserve positions
206
- result = original_text
207
- for gap in reversed(gaps):
208
- if gap.index in fills:
209
- result = result[:gap.start] + fills[gap.index] + result[gap.end:]
210
-
211
- return result
212
-
213
-
214
- def build_fills_dict(gaps_list: List[dict]) -> dict:
215
- """
216
- Convert gaps list from JSON to fills dict.
217
-
218
- Args:
219
- gaps_list: List of gap dicts from parsed JSON
220
- [{"index": 1, "choice": "word"}, ...]
221
-
222
- Returns:
223
- Dict mapping index to choice: {1: "word", ...}
224
- """
225
- fills = {}
226
- for gap in gaps_list:
227
- if 'index' in gap and 'choice' in gap:
228
- fills[gap['index']] = gap['choice']
229
- return fills
230
-
231
-
232
- def normalize_gaps_to_tagged(text: str) -> Tuple[str, List[GapInfo]]:
233
- """
234
- Normalize any gap notation to [GAP:n] format.
235
-
236
- Useful for standardizing input before processing.
237
-
238
- Args:
239
- text: Text with any gap notation
240
-
241
- Returns:
242
- Tuple of (normalized_text, gaps)
243
- """
244
- gaps = detect_gaps(text, "auto")
245
-
246
- if not gaps:
247
- return text, []
248
-
249
- # If already [GAP:n], return as-is
250
- if gaps[0].marker.startswith('[GAP:'):
251
- return text, gaps
252
-
253
- # Convert ___ to [GAP:n]
254
- result = text
255
- for gap in reversed(gaps):
256
- new_marker = f"[GAP:{gap.index}]"
257
- result = result[:gap.start] + new_marker + result[gap.end:]
258
-
259
- # Re-detect with new positions
260
- return result, detect_gaps(result, "[GAP:n]")
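Taken together, the helpers above form a small pipeline. A usage sketch with an illustrative model response:

```python
from app.logic.infill_utils import (
    detect_gaps, parse_infill_response, build_fills_dict, apply_fills,
)

text = "Sprzedam [GAP:1] BMW w [GAP:2] stanie technicznym"
gaps = detect_gaps(text)                       # two GapInfo entries

raw_output = "1. eleganckie\n2. doskonałym"    # illustrative LLM response
parsed = parse_infill_response(raw_output)     # {"filled_text": None, "gaps": [...]}
fills = build_fills_dict(parsed["gaps"])       # {1: "eleganckie", 2: "doskonałym"}

print(apply_fills(text, gaps, fills))
# Sprzedam eleganckie BMW w doskonałym stanie technicznym
```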
 
app/main.py DELETED
@@ -1,188 +0,0 @@
1
- import os
2
- import sys
3
- from typing import Optional, List
4
- from fastapi import FastAPI, HTTPException
5
- from pydantic import BaseModel
6
-
7
- # llama-cpp-python should be pre-installed via requirements.txt
8
- # No runtime installation needed
9
-
10
- from app.models.registry import registry, MODEL_CONFIG
11
-
12
- # Request/Response Models
13
- class Message(BaseModel):
14
- role: str
15
- content: str
16
-
17
- class ChatRequest(BaseModel):
18
- model: str
19
- messages: List[Message]
20
- max_tokens: int = 150
21
- temperature: float = 0.7
22
- top_p: float = 0.9
23
-
24
- class ChatChoice(BaseModel):
25
- message: Message
26
- finish_reason: str
27
-
28
- class ChatResponse(BaseModel):
29
- model: str
30
- choices: List[ChatChoice]
31
- usage: dict
32
-
33
- class GenerateRequest(BaseModel):
34
- model: str
35
- prompt: str
36
- max_tokens: int = 150
37
- temperature: float = 0.7
38
- top_p: float = 0.9
39
-
40
- class GenerateResponse(BaseModel):
41
- model: str
42
- text: str
43
- tokens_generated: int
44
-
45
- class ModelInfo(BaseModel):
46
- name: str
47
- type: str
48
- device: str = "unknown"
49
-
50
- class ModelsResponse(BaseModel):
51
- models: List[ModelInfo]
52
-
53
- class HealthResponse(BaseModel):
54
- status: str
55
- gpu_available: bool
56
- models_available: int
57
-
58
- # Create app
59
- app = FastAPI(
60
- title="Bielik LLM Service",
61
- description="Pure inference service for Bielik models with GPU acceleration",
62
- version="2.0.0"
63
- )
64
-
65
- @app.on_event("startup")
66
- async def startup_event():
67
- """Initialize service on startup."""
68
- print("Application started. Models will be loaded lazily on first request.")
69
- print(f"Available models: {registry.get_available_model_names()}")
70
-
71
- try:
72
- import torch
73
- gpu_available = torch.cuda.is_available()
74
- gpu_name = torch.cuda.get_device_name(0) if gpu_available else "N/A"
75
- print(f"GPU available: {gpu_available}, Device: {gpu_name}")
76
- except ImportError:
77
- print("PyTorch not available for GPU check")
78
- except Exception as e:
79
- print(f"GPU check failed: {e}")
80
-
81
- @app.get("/health", response_model=HealthResponse)
82
- async def health_check():
83
- """Health check endpoint."""
84
- gpu_available = False
85
- try:
86
- import torch
87
- gpu_available = torch.cuda.is_available()
88
- except:
89
- pass
90
-
91
- return HealthResponse(
92
- status="ok",
93
- gpu_available=gpu_available,
94
- models_available=len(registry.get_available_model_names())
95
- )
96
-
97
- @app.get("/models", response_model=ModelsResponse)
98
- async def list_models():
99
- """List all available models."""
100
- models_list = []
101
- for model_name in registry.get_available_model_names():
102
- info = registry.get_model_info(model_name)
103
- models_list.append(ModelInfo(
104
- name=model_name,
105
- type=info.get("type", "unknown"),
106
- device=info.get("device", "unknown")
107
- ))
108
- return ModelsResponse(models=models_list)
109
-
110
- @app.post("/chat", response_model=ChatResponse)
111
- async def chat_completion(request: ChatRequest):
112
- """
113
- Chat completion endpoint (OpenAI compatible).
114
-
115
- Accepts a list of messages and returns a completion.
116
- """
117
- # Validate model
118
- if request.model not in registry.get_available_model_names():
119
- raise HTTPException(status_code=400, detail=f"Unknown model: {request.model}")
120
-
121
- try:
122
- # Load model
123
- llm = await registry.get_model(request.model)
124
-
125
- # Convert messages to list of dicts
126
- messages = [{"role": msg.role, "content": msg.content} for msg in request.messages]
127
-
128
- # Generate
129
- output = await llm.generate(
130
- chat_messages=messages,
131
- max_new_tokens=request.max_tokens,
132
- temperature=request.temperature,
133
- top_p=request.top_p,
134
- )
135
-
136
- return ChatResponse(
137
- model=request.model,
138
- choices=[ChatChoice(
139
- message=Message(role="assistant", content=output),
140
- finish_reason="stop"
141
- )],
142
- usage={
143
- "prompt_tokens": sum(len(msg.get("content", "").split()) for msg in messages),
144
- "completion_tokens": len(output.split())
145
- }
146
- )
147
- except Exception as e:
148
- raise HTTPException(status_code=500, detail=f"Generation error: {str(e)}")
149
-
150
- @app.post("/generate", response_model=GenerateResponse)
151
- async def generate_text(request: GenerateRequest):
152
- """
153
- Raw text generation endpoint.
154
-
155
- Accepts a prompt string and returns generated text.
156
- """
157
- # Validate model
158
- if request.model not in registry.get_available_model_names():
159
- raise HTTPException(status_code=400, detail=f"Unknown model: {request.model}")
160
-
161
- try:
162
- # Load model
163
- llm = await registry.get_model(request.model)
164
-
165
- # Generate
166
- output = await llm.generate(
167
- prompt=request.prompt,
168
- max_new_tokens=request.max_tokens,
169
- temperature=request.temperature,
170
- top_p=request.top_p,
171
- )
172
-
173
- return GenerateResponse(
174
- model=request.model,
175
- text=output,
176
- tokens_generated=len(output.split())
177
- )
178
- except Exception as e:
179
- raise HTTPException(status_code=500, detail=f"Generation error: {str(e)}")
180
-
181
- @app.get("/")
182
- async def root():
183
- """Root endpoint."""
184
- return {
185
- "message": "Bielik LLM Service",
186
- "docs": "/docs",
187
- "endpoints": ["/chat", "/generate", "/models", "/health"]
188
- }
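A client-side sketch for the chat endpoint above (base URL and model name are placeholders; `requests` is assumed):

```python
import requests

payload = {
    "model": "bielik-1.5b-gguf",   # use any name returned by GET /models
    "messages": [
        {"role": "system", "content": "Jesteś pomocnym asystentem."},
        {"role": "user", "content": "Opisz krótko BMW 320i z 2020 roku."},
    ],
    "max_tokens": 150,
    "temperature": 0.7,
}

resp = requests.post("http://localhost:7860/chat", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```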
 
app/main_backup.py DELETED
@@ -1,548 +0,0 @@
1
- import os
2
- import time
3
- import asyncio
4
- import importlib
5
- import subprocess
6
- import sys
7
- from fastapi import FastAPI, HTTPException, Depends, Body
8
- from typing import Optional, List
9
- from pydantic import ValidationError
10
-
11
- # llama-cpp-python installed at runtime with CUDA support
12
- try:
13
- import llama_cpp
14
- except ImportError:
15
- print("[STARTUP] Installing llama-cpp-python with CUDA...")
16
- env = os.environ.copy()
17
- result = subprocess.run(
18
- [sys.executable, "-m", "pip", "install", "--quiet", "--prefer-binary",
19
- "--index-url", "https://abetlen.github.io/llama-cpp-python/whl/cu121",
20
- "llama-cpp-python[server]>=0.3.16"],
21
- capture_output=True,
22
- text=True
23
- )
24
- if result.returncode != 0:
25
- print("[STARTUP] CUDA wheel failed, trying CPU fallback...")
26
- print(f"[STARTUP] Error details: {result.stderr[:500]}")
27
- subprocess.run([sys.executable, "-m", "pip", "install", "--quiet", "llama-cpp-python>=0.3.16"], check=False)
28
- else:
29
- print("[STARTUP] llama-cpp-python with CUDA installed")
30
-
31
- from app.models.registry import registry, MODEL_CONFIG
32
- from fastapi.middleware.cors import CORSMiddleware
33
- from app.schemas.schemas import (
34
- EnhancedDescriptionResponse,
35
- CompareRequest,
36
- CompareResponse,
37
- ModelResult,
38
- ModelInfo,
39
- InfillRequest,
40
- InfillResponse,
41
- InfillResult,
42
- GapFill,
43
- CompareInfillRequest,
44
- CompareInfillResponse,
45
- ModelInfillResult,
46
- )
47
- from app.logic.infill_utils import (
48
- detect_gaps,
49
- parse_infill_response,
50
- apply_fills,
51
- build_fills_dict,
52
- normalize_gaps_to_tagged,
53
- )
54
- from app.auth.placeholder_auth import get_authenticated_user
55
-
56
- app = FastAPI(
57
- title="Multi-Model Description Enhancer",
58
- description="AI-powered service for enhancing descriptions using multiple LLMs for A/B testing",
59
- version="3.0.0"
60
- )
61
-
62
- # CORS configuration
63
- app.add_middleware(
64
- CORSMiddleware,
65
- allow_origins=[
66
- "http://localhost:5173",
67
- "http://localhost:5174",
68
- os.getenv("FRONTEND_URL", "http://localhost:5173")
69
- ],
70
- allow_credentials=True,
71
- allow_methods=["POST", "GET"],
72
- allow_headers=["*"],
73
- )
74
-
75
- @app.on_event("startup")
76
- async def startup_event():
77
- """
78
- Startup event - models are loaded lazily on first request.
79
- No models are pre-loaded to conserve memory.
80
- """
81
- print("Application started. Models will be loaded lazily on first request.")
82
- print(f"Available models: {registry.get_available_model_names()}")
83
-
84
- try:
85
- import torch
86
- gpu_available = torch.cuda.is_available()
87
- gpu_name = torch.cuda.get_device_name(0) if gpu_available else "N/A"
88
- print(f"GPU available: {gpu_available}, Device: {gpu_name}")
89
- except ImportError:
90
- print("PyTorch not available for GPU check")
91
- except Exception as e:
92
- print(f"GPU check failed: {e}")
93
-
94
- # --- Helper function to load domain logic ---
95
- def get_domain_config(domain: str):
96
- try:
97
- module = importlib.import_module(f"app.domains.{domain}.config")
98
- return module.domain_config
99
- except (ImportError, AttributeError):
100
- raise HTTPException(status_code=404, detail=f"Domain '{domain}' not found or not configured correctly.")
101
-
102
- # --- API Endpoints ---
103
-
104
- @app.get("/")
105
- async def read_root():
106
- return {"message": "Welcome to the Multi-Model Description Enhancer API! Go to /docs for documentation."}
107
-
108
- @app.get("/health")
109
- async def health_check():
110
- """Check API health and model status."""
111
- models = registry.list_models()
112
- loaded_models = registry.get_loaded_models()
113
- active_model = registry.get_active_model()
114
-
115
- gpu_available = False
116
- gpu_name = "N/A"
117
- try:
118
- import torch
119
- gpu_available = torch.cuda.is_available()
120
- gpu_name = torch.cuda.get_device_name(0) if gpu_available else "N/A"
121
- except:
122
- pass
123
-
124
- return {
125
- "status": "ok",
126
- "available_models": len(models),
127
- "loaded_models": loaded_models,
128
- "active_local_model": active_model,
129
- "gpu_available": gpu_available,
130
- "gpu_device": gpu_name,
131
- }
132
-
133
- @app.get("/models", response_model=List[ModelInfo])
134
- async def list_models():
135
- """List all available models with their load status."""
136
- return registry.list_models()
137
-
138
- @app.post("/models/{model_name}/load")
139
- async def load_model(model_name: str):
140
- """
141
- Explicitly load a model into memory.
142
- For local models: unloads any previously loaded local model first.
143
- """
144
- if model_name not in registry.get_available_model_names():
145
- raise HTTPException(status_code=404, detail=f"Unknown model: {model_name}")
146
-
147
- try:
148
- info = await registry.load_model(model_name)
149
- return {"status": "loaded", "model": info}
150
- except Exception as e:
151
- raise HTTPException(status_code=500, detail=f"Failed to load model: {str(e)}")
152
-
153
- @app.post("/models/{model_name}/unload")
154
- async def unload_model(model_name: str):
155
- """
156
- Explicitly unload a model from memory to free resources.
157
- """
158
- if model_name not in registry.get_available_model_names():
159
- raise HTTPException(status_code=404, detail=f"Unknown model: {model_name}")
160
-
161
- try:
162
- result = await registry.unload_model(model_name)
163
- return result
164
- except Exception as e:
165
- raise HTTPException(status_code=500, detail=f"Failed to unload model: {str(e)}")
166
-
167
- @app.post("/enhance-description", response_model=EnhancedDescriptionResponse)
168
- async def enhance_description(
169
- domain: str = Body(..., embed=True),
170
- data: dict = Body(..., embed=True),
171
- model: str = Body("bielik-1.5b", embed=True),
172
- user: Optional[dict] = Depends(get_authenticated_user)
173
- ):
174
- """
175
- Generate an enhanced description using a single model.
176
- - **domain**: The name of the domain (e.g., 'cars').
177
- - **data**: A dictionary with the data for the description.
178
- - **model**: Model to use (default: bielik-1.5b)
179
- """
180
- start_time = time.time()
181
-
182
- # Validate model
183
- if model not in registry.get_available_model_names():
184
- raise HTTPException(status_code=400, detail=f"Unknown model: {model}")
185
-
186
- # Load Domain Configuration
187
- domain_config = get_domain_config(domain)
188
- DomainSchema = domain_config["schema"]
189
- create_prompt = domain_config["create_prompt"]
190
-
191
- # Validate Input Data
192
- try:
193
- validated_data = DomainSchema(**data)
194
- except ValidationError as e:
195
- raise HTTPException(status_code=422, detail=f"Invalid data for domain '{domain}': {e}")
196
-
197
- # Prompt Construction
198
- chat_messages = create_prompt(validated_data)
199
-
200
- # Text Generation
201
- try:
202
- llm = await registry.get_model(model)
203
- generated_description = await llm.generate(
204
- chat_messages=chat_messages,
205
- max_new_tokens=150,
206
- temperature=0.75,
207
- top_p=0.9,
208
- )
209
- except Exception as e:
210
- print(f"Error during text generation with {model}: {e}")
211
- raise HTTPException(status_code=500, detail=f"Generation error: {str(e)}")
212
-
213
- generation_time = time.time() - start_time
214
- user_email = user['email'] if user else "anonymous"
215
-
216
- return EnhancedDescriptionResponse(
217
- description=generated_description,
218
- model_used=MODEL_CONFIG[model]["id"],
219
- generation_time=round(generation_time, 2),
220
- user_email=user_email
221
- )
222
-
223
- @app.post("/compare", response_model=CompareResponse)
224
- async def compare_models(
225
- request: CompareRequest,
226
- user: Optional[dict] = Depends(get_authenticated_user)
227
- ):
228
- """
229
- Compare outputs from multiple models for the same input.
230
- Returns results from all specified models (or all available if not specified).
231
- """
232
- total_start = time.time()
233
-
234
- # Get models to compare
235
- available_models = registry.get_available_model_names()
236
- models_to_use = request.models if request.models else available_models
237
-
238
- # Validate requested models
239
- for model in models_to_use:
240
- if model not in available_models:
241
- raise HTTPException(status_code=400, detail=f"Unknown model: {model}")
242
-
243
- # Load Domain Configuration
244
- domain_config = get_domain_config(request.domain)
245
- DomainSchema = domain_config["schema"]
246
- create_prompt = domain_config["create_prompt"]
247
-
248
- # Validate Input Data
249
- try:
250
- validated_data = DomainSchema(**request.data)
251
- except ValidationError as e:
252
- raise HTTPException(status_code=422, detail=f"Invalid data: {e}")
253
-
254
- # Prompt Construction
255
- chat_messages = create_prompt(validated_data)
256
-
257
- # Generate with each model
258
- results = []
259
-
260
- async def generate_with_model(model_name: str) -> ModelResult:
261
- start_time = time.time()
262
- try:
263
- llm = await registry.get_model(model_name)
264
- output = await llm.generate(
265
- chat_messages=chat_messages,
266
- max_new_tokens=150,
267
- temperature=0.75,
268
- top_p=0.9,
269
- )
270
- return ModelResult(
271
- model=model_name,
272
- output=output,
273
- time=round(time.time() - start_time, 2),
274
- type=MODEL_CONFIG[model_name]["type"],
275
- error=None
276
- )
277
- except Exception as e:
278
- return ModelResult(
279
- model=model_name,
280
- output="",
281
- time=round(time.time() - start_time, 2),
282
- type=MODEL_CONFIG[model_name]["type"],
283
- error=str(e)
284
- )
285
-
286
- # Run all models (sequentially to avoid memory issues)
287
- for model_name in models_to_use:
288
- result = await generate_with_model(model_name)
289
- results.append(result)
290
-
291
- return CompareResponse(
292
- domain=request.domain,
293
- results=results,
294
- total_time=round(time.time() - total_start, 2)
295
- )
296
-
297
- @app.get("/user/me")
298
- async def get_user_info(user: dict = Depends(get_authenticated_user)):
299
- """Get current authenticated user information"""
300
- if not user:
301
- raise HTTPException(status_code=401, detail="Not authenticated")
302
- return {
303
- "user_id": user['user_id'],
304
- "email": user['email'],
305
- "name": user.get('name', 'Unknown')
306
- }
307
-
308
-
309
- # --- Batch Infill Endpoints ---
310
-
311
- @app.post("/infill", response_model=InfillResponse)
312
- async def batch_infill(
313
- request: InfillRequest,
314
- user: Optional[dict] = Depends(get_authenticated_user)
315
- ):
316
- """
317
- Batch gap-filling for ads using a single model.
318
-
319
- Accepts items with [GAP:n] markers or ___ and returns filled text
320
- with per-gap choices and alternatives.
321
-
322
- NOTE: For texts > 6000 chars, consider chunking (not yet implemented).
323
- """
324
- print(f"DEBUG: Hit batch_infill endpoint with model={request.model}", flush=True)
325
- total_start = time.time()
326
-
327
- # Validate model
328
- if request.model not in registry.get_available_model_names():
329
- raise HTTPException(status_code=400, detail=f"Unknown model: {request.model}")
330
-
331
- # Load domain config for infill prompt
332
- domain_config = get_domain_config(request.domain)
333
- if "create_infill_prompt" not in domain_config:
334
- raise HTTPException(
335
- status_code=400,
336
- detail=f"Domain '{request.domain}' does not support infill operations"
337
- )
338
- create_infill_prompt = domain_config["create_infill_prompt"]
339
-
340
- # Process each item
341
- results = []
342
- error_count = 0
343
-
344
- for item in request.items:
345
- result = await process_infill_item(
346
- item=item,
347
- model_name=request.model,
348
- options=request.options,
349
- create_infill_prompt=create_infill_prompt
350
- )
351
- results.append(result)
352
- if result.status == "error":
353
- error_count += 1
354
-
355
- return InfillResponse(
356
- model=request.model,
357
- results=results,
358
- total_time=round(time.time() - total_start, 2),
359
- processed_count=len(results),
360
- error_count=error_count
361
- )
362
-
363
-
364
- @app.post("/compare-infill", response_model=CompareInfillResponse)
365
- async def compare_infill(
366
- request: CompareInfillRequest,
367
- user: Optional[dict] = Depends(get_authenticated_user)
368
- ):
369
- """
370
- Multi-model batch gap-filling comparison for A/B testing.
371
-
372
- Runs the same batch of items through multiple models and returns
373
- per-model results for comparison.
374
- """
375
- total_start = time.time()
376
-
377
- # Get models to compare
378
- available_models = registry.get_available_model_names()
379
- models_to_use = request.models if request.models else available_models
380
-
381
- # Validate requested models
382
- for model in models_to_use:
383
- if model not in available_models:
384
- raise HTTPException(status_code=400, detail=f"Unknown model: {model}")
385
-
386
- # Load domain config
387
- domain_config = get_domain_config(request.domain)
388
- if "create_infill_prompt" not in domain_config:
389
- raise HTTPException(
390
- status_code=400,
391
- detail=f"Domain '{request.domain}' does not support infill operations"
392
- )
393
- create_infill_prompt = domain_config["create_infill_prompt"]
394
-
395
- # Process with each model (sequentially for memory safety)
396
- model_results = []
397
-
398
- for model_name in models_to_use:
399
- model_start = time.time()
400
- results = []
401
- error_count = 0
402
-
403
- for item in request.items:
404
- result = await process_infill_item(
405
- item=item,
406
- model_name=model_name,
407
- options=request.options,
408
- create_infill_prompt=create_infill_prompt
409
- )
410
- results.append(result)
411
- if result.status == "error":
412
- error_count += 1
413
-
414
- model_results.append(ModelInfillResult(
415
- model=model_name,
416
- type=MODEL_CONFIG[model_name]["type"],
417
- results=results,
418
- time=round(time.time() - model_start, 2),
419
- error_count=error_count
420
- ))
421
-
422
- return CompareInfillResponse(
423
- domain=request.domain,
424
- models=model_results,
425
- total_time=round(time.time() - total_start, 2)
426
- )
427
-
428
-
429
- async def process_infill_item(
430
- item,
431
- model_name: str,
432
- options,
433
- create_infill_prompt
434
- ) -> InfillResult:
435
- """
436
- Process a single infill item.
437
-
438
- Returns InfillResult with status, filled_text, and gaps.
439
- """
440
- try:
441
- # Normalize gaps to [GAP:n] format
442
- normalized_text, gaps = normalize_gaps_to_tagged(item.text_with_gaps)
443
-
444
- if not gaps:
445
- # No gaps found, return original text
446
- return InfillResult(
447
- id=item.id,
448
- status="ok",
449
- filled_text=item.text_with_gaps,
450
- gaps=[],
451
- error=None
452
- )
453
-
454
- # Build prompt
455
- if item.custom_messages:
456
- chat_messages = item.custom_messages
457
- use_grammar = False # Custom messages = plain text output expected
458
- else:
459
- chat_messages = create_infill_prompt(normalized_text, options, attributes=item.attributes)
460
- use_grammar = True # Standard prompt = use grammar for structured JSON
461
-
462
- # Generate with optional GBNF grammar constraint
463
- llm = await registry.get_model(model_name)
464
-
465
- grammar_str = None
466
- if use_grammar and hasattr(llm, 'llm') and llm.llm is not None:
467
- # Use model's default grammar (loaded from answers.gbnf) if available
468
- if hasattr(llm, 'default_grammar') and llm.default_grammar:
469
- grammar_str = llm.default_grammar
470
- print(f"DEBUG: Using model's default GBNF grammar", flush=True)
471
- else:
472
- # Fallback to dynamic grammar generation
473
- try:
474
- from app.logic.grammar_utils import get_infill_grammar
475
- grammar_str = get_infill_grammar(len(gaps))
476
- print(f"DEBUG: Using dynamic GBNF grammar for {len(gaps)} gaps", flush=True)
477
- except ImportError:
478
- pass
479
-
480
- raw_output = await llm.generate(
481
- chat_messages=chat_messages,
482
- max_new_tokens=options.max_new_tokens,
483
- temperature=0.3 if use_grammar else options.temperature, # Lower temp with grammar
484
- top_p=0.9,
485
- grammar=grammar_str,
486
- )
487
-
488
- # If custom_messages were provided, the output is plain text (not JSON)
489
- # Just return it directly as a single gap fill
490
- if item.custom_messages:
491
- # Clean up the raw output - strip whitespace, quotes, etc.
492
- choice = raw_output.strip().strip('"\'.,').strip()
493
- return InfillResult(
494
- id=item.id,
495
- status="ok",
496
- filled_text=choice, # The filled text is just the choice itself
497
- gaps=[GapFill(index=1, marker="[GAP:1]", choice=choice, alternatives=[])],
498
- error=None
499
- )
500
-
501
- # Parse JSON from output (standard prompt format)
502
- parsed = parse_infill_response(raw_output)
503
-
504
- if not parsed:
505
- # JSON parsing failed
506
- return InfillResult(
507
- id=item.id,
508
- status="error",
509
- filled_text=None,
510
- gaps=[],
511
- error=f"Failed to parse JSON from model output: {raw_output[:200]}..."
512
- )
513
-
514
- # Extract gaps and build result
515
- gap_fills = []
516
- fills_dict = {}
517
-
518
- for gap_data in parsed.get("gaps", []):
519
- gap_fill = GapFill(
520
- index=gap_data.get("index", 0),
521
- marker=gap_data.get("marker", ""),
522
- choice=gap_data.get("choice", ""),
523
- alternatives=gap_data.get("alternatives", [])
524
- )
525
- gap_fills.append(gap_fill)
526
- fills_dict[gap_fill.index] = gap_fill.choice
527
-
528
- # Get filled text - prefer model's version, fallback to reconstruction
529
- filled_text = parsed.get("filled_text")
530
- if not filled_text and fills_dict:
531
- filled_text = apply_fills(normalized_text, gaps, fills_dict)
532
-
533
- return InfillResult(
534
- id=item.id,
535
- status="ok",
536
- filled_text=filled_text,
537
- gaps=gap_fills,
538
- error=None
539
- )
540
-
541
- except Exception as e:
542
- return InfillResult(
543
- id=item.id,
544
- status="error",
545
- filled_text=None,
546
- gaps=[],
547
- error=str(e)
548
- )
 
 
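Note: the `/infill` contract above can be exercised with any HTTP client. A minimal sketch follows; the host/port, the `real_estate` domain name and the option values are illustrative assumptions, while the field names mirror the request and response models used by the endpoint.

```python
# Hypothetical client call against the /infill endpoint sketched above.
# URL, domain name and option values are assumptions; field names follow
# InfillRequest / InfillResponse as used in the endpoint code.
import requests

payload = {
    "model": "bielik-1.5b-transformer",
    "domain": "real_estate",  # assumed domain key registered in get_domain_config()
    "items": [
        {"id": "ad-1", "text_with_gaps": "Mieszkanie z [GAP:1] i widokiem na [GAP:2]."}
    ],
    "options": {"max_new_tokens": 64, "temperature": 0.7},
}

resp = requests.post("http://localhost:7860/infill", json=payload, timeout=300)
resp.raise_for_status()
for item in resp.json()["results"]:
    print(item["id"], item["status"], item.get("filled_text"))
```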
 
 
 
 
 
app/main_simple.py DELETED
@@ -1,202 +0,0 @@
1
- import os
2
- import subprocess
3
- import sys
4
- from typing import Optional, List
5
- from fastapi import FastAPI, HTTPException
6
- from pydantic import BaseModel
7
-
8
- # Install llama-cpp-python with CUDA support at runtime
9
- try:
10
- import llama_cpp
11
- except ImportError:
12
- print("[STARTUP] Installing llama-cpp-python with CUDA...")
13
- result = subprocess.run(
14
- [sys.executable, "-m", "pip", "install", "--quiet", "--prefer-binary",
15
- "--index-url", "https://abetlen.github.io/llama-cpp-python/whl/cu121",
16
- "llama-cpp-python[server]>=0.3.16"],
17
- capture_output=True,
18
- text=True
19
- )
20
- if result.returncode != 0:
21
- print("[STARTUP] CUDA wheel failed, trying CPU fallback...")
22
- subprocess.run([sys.executable, "-m", "pip", "install", "--quiet", "llama-cpp-python>=0.3.16"], check=False)
23
-
24
- from app.models.registry import registry, MODEL_CONFIG
25
-
26
- # Request/Response Models
27
- class Message(BaseModel):
28
- role: str
29
- content: str
30
-
31
- class ChatRequest(BaseModel):
32
- model: str
33
- messages: List[Message]
34
- max_tokens: int = 150
35
- temperature: float = 0.7
36
- top_p: float = 0.9
37
-
38
- class ChatChoice(BaseModel):
39
- message: Message
40
- finish_reason: str
41
-
42
- class ChatResponse(BaseModel):
43
- model: str
44
- choices: List[ChatChoice]
45
- usage: dict
46
-
47
- class GenerateRequest(BaseModel):
48
- model: str
49
- prompt: str
50
- max_tokens: int = 150
51
- temperature: float = 0.7
52
- top_p: float = 0.9
53
-
54
- class GenerateResponse(BaseModel):
55
- model: str
56
- text: str
57
- tokens_generated: int
58
-
59
- class ModelInfo(BaseModel):
60
- name: str
61
- type: str
62
- device: str = "unknown"
63
-
64
- class ModelsResponse(BaseModel):
65
- models: List[ModelInfo]
66
-
67
- class HealthResponse(BaseModel):
68
- status: str
69
- gpu_available: bool
70
- models_available: int
71
-
72
- # Create app
73
- app = FastAPI(
74
- title="Bielik LLM Service",
75
- description="Pure inference service for Bielik models with GPU acceleration",
76
- version="2.0.0"
77
- )
78
-
79
- @app.on_event("startup")
80
- async def startup_event():
81
- """Initialize service on startup."""
82
- print("Application started. Models will be loaded lazily on first request.")
83
- print(f"Available models: {registry.get_available_model_names()}")
84
-
85
- try:
86
- import torch
87
- gpu_available = torch.cuda.is_available()
88
- gpu_name = torch.cuda.get_device_name(0) if gpu_available else "N/A"
89
- print(f"GPU available: {gpu_available}, Device: {gpu_name}")
90
- except ImportError:
91
- print("PyTorch not available for GPU check")
92
- except Exception as e:
93
- print(f"GPU check failed: {e}")
94
-
95
- @app.get("/health", response_model=HealthResponse)
96
- async def health_check():
97
- """Health check endpoint."""
98
- gpu_available = False
99
- try:
100
- import torch
101
- gpu_available = torch.cuda.is_available()
102
- except Exception:
103
- pass
104
-
105
- return HealthResponse(
106
- status="ok",
107
- gpu_available=gpu_available,
108
- models_available=len(registry.get_available_model_names())
109
- )
110
-
111
- @app.get("/models", response_model=ModelsResponse)
112
- async def list_models():
113
- """List all available models."""
114
- models_list = []
115
- for model_name in registry.get_available_model_names():
116
- info = registry.get_model_info(model_name)
117
- models_list.append(ModelInfo(
118
- name=model_name,
119
- type=info.get("type", "unknown"),
120
- device=info.get("device", "unknown")
121
- ))
122
- return ModelsResponse(models=models_list)
123
-
124
- @app.post("/chat", response_model=ChatResponse)
125
- async def chat_completion(request: ChatRequest):
126
- """
127
- Chat completion endpoint (OpenAI compatible).
128
-
129
- Accepts a list of messages and returns a completion.
130
- """
131
- # Validate model
132
- if request.model not in registry.get_available_model_names():
133
- raise HTTPException(status_code=400, detail=f"Unknown model: {request.model}")
134
-
135
- try:
136
- # Load model
137
- llm = await registry.get_model(request.model)
138
-
139
- # Convert messages to list of dicts
140
- messages = [{"role": msg.role, "content": msg.content} for msg in request.messages]
141
-
142
- # Generate
143
- output = await llm.generate(
144
- chat_messages=messages,
145
- max_new_tokens=request.max_tokens,
146
- temperature=request.temperature,
147
- top_p=request.top_p,
148
- )
149
-
150
- return ChatResponse(
151
- model=request.model,
152
- choices=[ChatChoice(
153
- message=Message(role="assistant", content=output),
154
- finish_reason="stop"
155
- )],
156
- usage={
157
- "prompt_tokens": sum(len(msg.get("content", "").split()) for msg in messages),
158
- "completion_tokens": len(output.split())
159
- }
160
- )
161
- except Exception as e:
162
- raise HTTPException(status_code=500, detail=f"Generation error: {str(e)}")
163
-
164
- @app.post("/generate", response_model=GenerateResponse)
165
- async def generate_text(request: GenerateRequest):
166
- """
167
- Raw text generation endpoint.
168
-
169
- Accepts a prompt string and returns generated text.
170
- """
171
- # Validate model
172
- if request.model not in registry.get_available_model_names():
173
- raise HTTPException(status_code=400, detail=f"Unknown model: {request.model}")
174
-
175
- try:
176
- # Load model
177
- llm = await registry.get_model(request.model)
178
-
179
- # Generate
180
- output = await llm.generate(
181
- prompt=request.prompt,
182
- max_new_tokens=request.max_tokens,
183
- temperature=request.temperature,
184
- top_p=request.top_p,
185
- )
186
-
187
- return GenerateResponse(
188
- model=request.model,
189
- text=output,
190
- tokens_generated=len(output.split())
191
- )
192
- except Exception as e:
193
- raise HTTPException(status_code=500, detail=f"Generation error: {str(e)}")
194
-
195
- @app.get("/")
196
- async def root():
197
- """Root endpoint."""
198
- return {
199
- "message": "Bielik LLM Service",
200
- "docs": "/docs",
201
- "endpoints": ["/chat", "/generate", "/models", "/health"]
202
- }
 
 
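Note: the simplified service above exposes an OpenAI-style `/chat` route. A minimal client sketch, assuming the service runs locally and the Bielik 1.5B entry from the registry is available:

```python
# Minimal sketch of a /chat call; the URL and prompt are illustrative,
# the body fields mirror ChatRequest above.
import requests

body = {
    "model": "bielik-1.5b-transformer",
    "messages": [
        {"role": "system", "content": "Jesteś pomocnym asystentem."},
        {"role": "user", "content": "Napisz jedno zdanie o Krakowie."},
    ],
    "max_tokens": 100,
    "temperature": 0.7,
}

resp = requests.post("http://localhost:7860/chat", json=body, timeout=300)
print(resp.json()["choices"][0]["message"]["content"])
```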
 
 
 
 
 
app/models/__init__.py DELETED
@@ -1,16 +0,0 @@
1
- """
2
- Models module - LLM implementations and registry.
3
- """
4
-
5
- from app.models.base_llm import BaseLLM
6
- from app.models.huggingface_local import HuggingFaceLocal
7
- from app.models.huggingface_inference_api import HuggingFaceInferenceAPI
8
- from app.models.registry import registry, MODEL_CONFIG
9
-
10
- __all__ = [
11
- "BaseLLM",
12
- "HuggingFaceLocal",
13
- "HuggingFaceInferenceAPI",
14
- "registry",
15
- "MODEL_CONFIG",
16
- ]
 
 
 
 
 
 
app/models/base_llm.py DELETED
@@ -1,54 +0,0 @@
1
- """
2
- Abstract base class for all LLM implementations.
3
- """
4
-
5
- from abc import ABC, abstractmethod
6
- from typing import Optional, List, Dict, Any
7
-
8
-
9
- class BaseLLM(ABC):
10
- """Abstract interface for LLM models."""
11
-
12
- def __init__(self, name: str, model_id: str):
13
- self.name = name
14
- self.model_id = model_id
15
- self._initialized = False
16
-
17
- @property
18
- def is_initialized(self) -> bool:
19
- return self._initialized
20
-
21
- @abstractmethod
22
- async def initialize(self) -> None:
23
- """Initialize the model. Must be called before generate()."""
24
- pass
25
-
26
- @abstractmethod
27
- async def generate(
28
- self,
29
- prompt: str = None,
30
- chat_messages: List[Dict[str, str]] = None,
31
- max_new_tokens: int = 150,
32
- temperature: float = 0.7,
33
- top_p: float = 0.9,
34
- **kwargs
35
- ) -> str:
36
- """
37
- Generate text from prompt or chat messages.
38
-
39
- Args:
40
- prompt: Raw text prompt
41
- chat_messages: List of {"role": "...", "content": "..."} messages
42
- max_new_tokens: Maximum tokens to generate
43
- temperature: Sampling temperature
44
- top_p: Nucleus sampling parameter
45
-
46
- Returns:
47
- Generated text string
48
- """
49
- pass
50
-
51
- @abstractmethod
52
- def get_info(self) -> Dict[str, Any]:
53
- """Return model information for /models endpoint."""
54
- pass
 
 
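Note: a concrete backend only needs to implement the three abstract methods above. A minimal stub sketch (the `EchoLLM` name is hypothetical; handy for wiring tests without loading a real model):

```python
# Minimal BaseLLM subclass sketch; EchoLLM is a hypothetical test stub.
from typing import Any, Dict, List

from app.models.base_llm import BaseLLM


class EchoLLM(BaseLLM):
    async def initialize(self) -> None:
        # Nothing to load for a stub.
        self._initialized = True

    async def generate(self, prompt: str = None, chat_messages: List[Dict[str, str]] = None,
                       max_new_tokens: int = 150, temperature: float = 0.7,
                       top_p: float = 0.9, **kwargs) -> str:
        # Echo the last user message (or the raw prompt) instead of running a model.
        if chat_messages:
            return chat_messages[-1]["content"]
        return prompt or ""

    def get_info(self) -> Dict[str, Any]:
        return {"name": self.name, "model_id": self.model_id, "type": "stub"}
```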
 
 
 
 
app/models/huggingface_inference_api.py DELETED
@@ -1,127 +0,0 @@
1
- """
2
- HuggingFace Inference API Model - Cloud-based inference.
3
- Uses HuggingFace's serverless Inference API for models that are too large to run locally.
4
- """
5
-
6
- import os
7
- import asyncio
8
- from typing import List, Dict, Any, Optional
9
- from app.models.base_llm import BaseLLM
10
-
11
- try:
12
- from huggingface_hub import InferenceClient
13
- HAS_HF_HUB = True
14
- except ImportError:
15
- HAS_HF_HUB = False
16
- InferenceClient = None
17
-
18
-
19
- class HuggingFaceInferenceAPI(BaseLLM):
20
- """
21
- Wrapper for HuggingFace Inference API.
22
- Runs models on HuggingFace's cloud servers - no local GPU/memory needed.
23
- """
24
-
25
- def __init__(self, name: str, model_id: str):
26
- super().__init__(name, model_id)
27
- self.client = None
28
- self._response_cache = {}
29
- self._max_cache_size = 100
30
-
31
- if not HAS_HF_HUB:
32
- raise ImportError("huggingface_hub is not installed. Run: pip install huggingface_hub")
33
-
34
- async def initialize(self) -> None:
35
- """Initialize the Inference API client."""
36
- if self._initialized:
37
- return
38
-
39
- try:
40
- token = os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_TOKEN")
41
-
42
- if not token:
43
- print(f"[{self.name}] Warning: No HF_TOKEN found. Some models may require authentication.")
44
-
45
- self.client = InferenceClient(
46
- model=self.model_id,
47
- token=token
48
- )
49
-
50
- self._initialized = True
51
- print(f"[{self.name}] Inference API client initialized for: {self.model_id}")
52
-
53
- except Exception as e:
54
- print(f"[{self.name}] Failed to initialize Inference API: {e}")
55
- raise RuntimeError(f"Failed to initialize Inference API: {e}") from e
56
-
57
- async def generate(
58
- self,
59
- prompt: str = None,
60
- chat_messages: List[Dict[str, str]] = None,
61
- max_new_tokens: int = 150,
62
- temperature: float = 0.7,
63
- top_p: float = 0.9,
64
- **kwargs
65
- ) -> str:
66
- """Generate text using HuggingFace Inference API."""
67
-
68
- if not self._initialized or self.client is None:
69
- raise RuntimeError(f"[{self.name}] Client not initialized")
70
-
71
- # Ensure we have messages
72
- messages = chat_messages
73
- if not messages and prompt:
74
- messages = [{"role": "user", "content": prompt}]
75
-
76
- if not messages:
77
- raise ValueError("Either prompt or chat_messages required")
78
-
79
- # Cache check
80
- import json
81
- cache_key = f"{json.dumps(messages)}_{max_new_tokens}_{temperature}_{top_p}"
82
- if cache_key in self._response_cache:
83
- return self._response_cache[cache_key]
84
-
85
- print(f"[{self.name}] Calling Inference API with {len(messages)} messages", flush=True)
86
-
87
- try:
88
- # Use chat_completion method (huggingface_hub InferenceClient)
89
- response = await asyncio.to_thread(
90
- self.client.chat_completion,
91
- messages=messages,
92
- max_tokens=max_new_tokens,
93
- temperature=temperature,
94
- top_p=top_p,
95
- )
96
-
97
- response_text = response.choices[0].message.content.strip()
98
- print(f"[{self.name}] Inference API response: {response_text[:100]}...", flush=True)
99
-
100
- # Cache store
101
- if len(self._response_cache) >= self._max_cache_size:
102
- first_key = next(iter(self._response_cache))
103
- del self._response_cache[first_key]
104
- self._response_cache[cache_key] = response_text
105
-
106
- return response_text
107
-
108
- except Exception as e:
109
- print(f"[{self.name}] Inference API error: {e}", flush=True)
110
- raise RuntimeError(f"Inference API call failed: {e}") from e
111
-
112
- def get_info(self) -> Dict[str, Any]:
113
- """Return model information for /models endpoint."""
114
- return {
115
- "name": self.name,
116
- "model_id": self.model_id,
117
- "type": "inference_api",
118
- "backend": "HuggingFace Inference API",
119
- "loaded": self._initialized,
120
- "cloud_based": True
121
- }
122
-
123
- async def cleanup(self) -> None:
124
- """Cleanup resources."""
125
- self.client = None
126
- self._initialized = False
127
- print(f"[{self.name}] Inference API client cleaned up")
 
 
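Note: a minimal usage sketch for the cloud client above, assuming `HF_TOKEN` is set in the environment; the model id is the Llama entry from the registry configuration, and the prompt is illustrative.

```python
# Sketch of direct HuggingFaceInferenceAPI usage; requires huggingface_hub and HF_TOKEN.
import asyncio

from app.models.huggingface_inference_api import HuggingFaceInferenceAPI


async def main() -> None:
    llm = HuggingFaceInferenceAPI(name="llama-3.1-8b",
                                  model_id="meta-llama/Llama-3.1-8B-Instruct")
    await llm.initialize()
    text = await llm.generate(
        chat_messages=[{"role": "user", "content": "Describe a sunny flat in Kraków."}],
        max_new_tokens=100,
    )
    print(text)


asyncio.run(main())
```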
 
 
 
 
 
app/models/huggingface_local.py DELETED
@@ -1,289 +0,0 @@
1
- """
2
- Local HuggingFace model implementation using transformers pipeline.
3
-
4
- Optimizations:
5
- - KV Cache: Enabled by default (5-10x speedup on GPU, 1.5x on CPU)
6
- - Flash Attention: Used when available (GPU only)
7
- - 8-Bit Quantization: Optional for CPU environments (4-6x speedup, 50% memory reduction)
8
- """
9
-
10
- from typing import List, Dict, Any, Optional
11
- from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
12
- import torch
13
- import asyncio
14
- import os
15
-
16
- from app.models.base_llm import BaseLLM
17
-
18
- # Try to import bitsandbytes, but don't fail if not available
19
- try:
20
- from transformers import BitsAndBytesConfig
21
- HAS_BITSANDBYTES = True
22
- except ImportError:
23
- HAS_BITSANDBYTES = False
24
- print("[WARNING] bitsandbytes not available - 8-bit quantization disabled")
25
-
26
-
27
- class HuggingFaceLocal(BaseLLM):
28
- """
29
- Local HuggingFace model loaded into container memory.
30
- Best for smaller models (< 3B parameters) that fit in RAM.
31
-
32
- Features:
33
- - KV caching enabled (1.5-2x faster on CPU, 5-10x on GPU)
34
- - Flash Attention v2 support (GPU only)
35
- - 8-bit quantization for CPU (via bitsandbytes if available) or Dynamic Quantization (torch)
36
- - Mixed precision (float16 or bfloat16 when possible)
37
- - Response Caching (LRU)
38
- """
39
-
40
- def __init__(self, name: str, model_id: str, device: str = "cpu", use_cache: bool = True, use_8bit: bool = False):
41
- super().__init__(name, model_id)
42
- self.device = device
43
- self.pipeline = None
44
- self.tokenizer = None
45
- self.model = None
46
- self.use_cache = use_cache
47
- self._response_cache = {} # Simple dict cache
48
- self._max_cache_size = 100
49
-
50
- # Only enable 8-bit if explicitly requested (opt-in, not by default)
51
- # Default to False since bitsandbytes may not be available in all deployment environments
52
- requested_8bit = use_8bit or (device == "cpu" and os.getenv("USE_8BIT_QUANTIZATION", "false").lower() == "true")
53
- self.use_8bit = requested_8bit and HAS_BITSANDBYTES
54
-
55
- if requested_8bit and not HAS_BITSANDBYTES:
56
- print(f"[{name}] 8-bit quantization requested but bitsandbytes not installed - falling back to full precision")
57
-
58
- self.use_flash_attention = os.getenv("USE_FLASH_ATTENTION", "true").lower() == "true"
59
-
60
- # Determine device index and dtype
61
- if device == "cuda" and torch.cuda.is_available():
62
- self.device_index = 0
63
- # Try to use bfloat16 on modern GPUs, else float16
64
- self.torch_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
65
- else:
66
- self.device_index = -1 # CPU
67
- self.torch_dtype = torch.float32
68
-
69
- async def initialize(self) -> None:
70
- """Load model into memory with optimizations."""
71
- if self._initialized:
72
- return
73
-
74
- try:
75
- print(f"[{self.name}] Loading local model: {self.model_id}")
76
- print(f"[{self.name}] Device: {self.device} | Dtype: {self.torch_dtype} | KV Cache: {self.use_cache} | 8-bit: {self.use_8bit}")
77
-
78
- self.tokenizer = await asyncio.to_thread(
79
- AutoTokenizer.from_pretrained,
80
- self.model_id,
81
- trust_remote_code=True
82
- )
83
-
84
- # Model config optimizations
85
- model_kwargs = {
86
- "trust_remote_code": True,
87
- }
88
-
89
- # Add 8-bit quantization for CPU (4-6x faster, 50% less memory)
90
- if self.use_8bit and HAS_BITSANDBYTES:
91
- try:
92
- print(f"[{self.name}] Using 8-bit quantization for CPU optimization")
93
- bnb_config = BitsAndBytesConfig(
94
- load_in_8bit=True,
95
- bnb_8bit_compute_dtype=torch.float16,
96
- bnb_8bit_use_double_quant=True,
97
- )
98
- model_kwargs["quantization_config"] = bnb_config
99
- model_kwargs["device_map"] = "cpu"
100
- except Exception as e:
101
- print(f"[{self.name}] Failed to setup 8-bit quantization: {e}")
102
- print(f"[{self.name}] Falling back to full precision")
103
- self.use_8bit = False
104
- model_kwargs["torch_dtype"] = self.torch_dtype
105
- model_kwargs["device_map"] = "cpu"
106
-
107
- # Standard loading without quantization
108
- if not self.use_8bit:
109
- model_kwargs["torch_dtype"] = self.torch_dtype
110
- model_kwargs["device_map"] = self.device if self.device == "cuda" else "cpu"
111
-
112
- # Enable flash attention if requested and available (GPU only)
113
- if self.use_flash_attention and self.device == "cuda" and not self.use_8bit:
114
- model_kwargs["attn_implementation"] = "flash_attention_2"
115
-
116
- self.model = await asyncio.to_thread(
117
- AutoModelForCausalLM.from_pretrained,
118
- self.model_id,
119
- **model_kwargs
120
- )
121
-
122
- # --- CPU DYNAMIC QUANTIZATION ---
123
- if self.device == "cpu" and not self.use_8bit:
124
- try:
125
- print(f"[{self.name}] Applying dynamic quantization for CPU optimization...")
126
- self.model = torch.quantization.quantize_dynamic(
127
- self.model, {torch.nn.Linear}, dtype=torch.qint8
128
- )
129
- print(f"[{self.name}] Dynamic quantization applied.")
130
- except Exception as e:
131
- print(f"[{self.name}] Dynamic quantization failed: {e}")
132
-
133
- # Ensure cache is enabled on model config
134
- if hasattr(self.model.config, 'use_cache'):
135
- self.model.config.use_cache = self.use_cache
136
-
137
- self._initialized = True
138
- print(f"[{self.name}] Model loaded successfully (use_cache={self.use_cache})")
139
-
140
- except Exception as e:
141
- print(f"[{self.name}] Failed to load model: {e}")
142
- raise
143
-
144
- async def generate(
145
- self,
146
- prompt: str = None,
147
- chat_messages: List[Dict[str, str]] = None,
148
- max_new_tokens: int = 150,
149
- temperature: float = 0.7,
150
- top_p: float = 0.9,
151
- **kwargs
152
- ) -> str:
153
- """
154
- Generate text using direct model.generate() with proper KV caching.
155
-
156
- KV Cache Impact (with proper implementation):
157
- - WITH: ~9 seconds for 10 ads (50 gaps)
158
- - WITHOUT: ~42 seconds (4.7x slower)
159
- """
160
-
161
- if not self._initialized or self.model is None:
162
- raise RuntimeError(f"[{self.name}] Model not initialized")
163
-
164
- formatted_prompt = None
165
-
166
- # Format prompt from chat messages
167
- if chat_messages:
168
- try:
169
- formatted_prompt = self.tokenizer.apply_chat_template(
170
- chat_messages,
171
- tokenize=False,
172
- add_generation_prompt=True
173
- )
174
- except Exception as e:
175
- print(f"[{self.name}] apply_chat_template failed: {e}, using fallback")
176
- formatted_prompt = self._format_chat_fallback(chat_messages)
177
-
178
- # Use raw prompt if provided
179
- if formatted_prompt is None and prompt:
180
- formatted_prompt = prompt
181
-
182
- if formatted_prompt is None:
183
- raise ValueError("Either prompt or chat_messages required")
184
-
185
- # --- CACHE CHECK ---
186
- cache_key = f"{formatted_prompt}_{max_new_tokens}_{temperature}_{top_p}"
187
- if cache_key in self._response_cache:
188
- # print(f"[{self.name}] Cache hit!")
189
- return self._response_cache[cache_key]
190
-
191
- # Tokenize input
192
- inputs = await asyncio.to_thread(
193
- self.tokenizer.encode,
194
- formatted_prompt,
195
- return_tensors="pt"
196
- )
197
-
198
- # Move to device
199
- if self.device == "cuda":
200
- inputs = await asyncio.to_thread(lambda: inputs.to("cuda"))
201
-
202
- # Generate with explicit KV cache
203
- outputs = await asyncio.to_thread(
204
- self.model.generate,
205
- inputs,
206
- max_new_tokens=max_new_tokens,
207
- do_sample=True,
208
- temperature=temperature,
209
- top_p=top_p,
210
- use_cache=True, # CRITICAL: Enable KV cache
211
- eos_token_id=self.tokenizer.eos_token_id,
212
- pad_token_id=self.tokenizer.eos_token_id if self.tokenizer.pad_token_id is None else self.tokenizer.pad_token_id,
213
- )
214
-
215
- # Decode output
216
- output_text = await asyncio.to_thread(
217
- self.tokenizer.decode,
218
- outputs[0],
219
- skip_special_tokens=True
220
- )
221
-
222
- # Remove prompt from output
223
- if output_text.startswith(formatted_prompt):
224
- response = output_text[len(formatted_prompt):]
225
- else:
226
- response = output_text
227
-
228
- # Clean up special tokens
229
- for token in ["<|im_end|>", "<end_of_turn>", "<eos>", "</s>"]:
230
- if response.endswith(token):
231
- response = response[:-len(token)]
232
-
233
- result = response.strip()
234
-
235
- # --- CACHE STORE ---
236
- if len(self._response_cache) >= self._max_cache_size:
237
- # Remove oldest item (approximate LRU by iterating once)
238
- first_key = next(iter(self._response_cache))
239
- del self._response_cache[first_key]
240
- self._response_cache[cache_key] = result
241
-
242
- return result
243
-
244
- def _format_chat_fallback(self, chat_messages: List[Dict[str, str]]) -> str:
245
- """
246
- Fallback chat formatting for models without proper chat template.
247
- Works with Gemma and other models.
248
- """
249
- formatted = ""
250
- for msg in chat_messages:
251
- role = msg.get("role", "user")
252
- content = msg.get("content", "")
253
-
254
- if role == "system":
255
- formatted += f"{content}\n\n"
256
- elif role == "user":
257
- formatted += f"User: {content}\n"
258
- elif role == "assistant":
259
- formatted += f"Assistant: {content}\n"
260
-
261
- # Add generation prompt
262
- formatted += "Assistant:"
263
- return formatted
264
-
265
- def get_info(self) -> Dict[str, Any]:
266
- """Return model info."""
267
- return {
268
- "name": self.name,
269
- "model_id": self.model_id,
270
- "type": "local",
271
- "initialized": self._initialized,
272
- "device": self.device
273
- }
274
-
275
- async def cleanup(self) -> None:
276
- """Release model from memory."""
277
- if self.pipeline is not None:
278
- del self.pipeline
279
- self.pipeline = None
280
- if self.tokenizer is not None:
281
- del self.tokenizer
282
- self.tokenizer = None
283
- self._initialized = False
284
-
285
- # Force CUDA cache clear if available
286
- if torch.cuda.is_available():
287
- torch.cuda.empty_cache()
288
-
289
- print(f"[{self.name}] Model unloaded from memory")
 
 
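Note: a minimal loading sketch for the local wrapper above; the device and quantization flags are illustrative and the model id is the Bielik 1.5B checkpoint referenced in the registry. First use downloads the weights from the Hub.

```python
# Sketch of HuggingFaceLocal usage; device/use_8bit values are illustrative.
import asyncio

from app.models.huggingface_local import HuggingFaceLocal


async def main() -> None:
    llm = HuggingFaceLocal(
        name="bielik-1.5b-local",
        model_id="speakleash/Bielik-1.5B-v3.0-Instruct",
        device="cpu",      # "cuda" when a GPU is available
        use_8bit=False,    # opt-in; needs bitsandbytes
    )
    await llm.initialize()
    out = await llm.generate(
        chat_messages=[{"role": "user", "content": "Cześć! Przedstaw się w jednym zdaniu."}],
        max_new_tokens=50,
    )
    print(out)


asyncio.run(main())
```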
 
 
 
 
 
app/models/huggingface_service.py DELETED
@@ -1,111 +0,0 @@
1
- from transformers import pipeline, AutoTokenizer
2
- import torch
3
- from fastapi import HTTPException
4
- import asyncio
5
-
6
- class HuggingFaceTextGenerationService:
7
- def __init__(self, model_name_or_path: str, device: str = None, task: str = "text-generation"):
8
- self.model_name_or_path = model_name_or_path
9
- self.task = task
10
- self.pipeline = None
11
- self.tokenizer = None
12
-
13
- if device is None:
14
- self.device_index = 0 if torch.cuda.is_available() else -1
15
- elif device == "cuda" and torch.cuda.is_available():
16
- self.device_index = 0
17
- elif device == "cpu":
18
- self.device_index = -1
19
- else:
20
- self.device_index = -1
21
-
22
- if self.device_index == 0:
23
- print("CUDA (GPU) is available. Using GPU.")
24
- else:
25
- print(f"Device set to use {'cpu' if self.device_index == -1 else f'cuda:{self.device_index}'}")
26
-
27
-
28
- async def initialize(self):
29
- try:
30
- print(f"Initializing Hugging Face pipeline for model: {self.model_name_or_path} on device index: {self.device_index}")
31
- self.tokenizer = await asyncio.to_thread(
32
- AutoTokenizer.from_pretrained, self.model_name_or_path, trust_remote_code=True
33
- )
34
- self.pipeline = await asyncio.to_thread(
35
- pipeline,
36
- self.task,
37
- model=self.model_name_or_path,
38
- tokenizer=self.tokenizer,
39
- device=self.device_index,
40
- torch_dtype=torch.bfloat16 if self.device_index != -1 and torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float32,
41
- trust_remote_code=True,
42
- )
43
- print(f"Pipeline for model {self.model_name_or_path} initialized successfully.")
44
- except Exception as e:
45
- print(f"Error initializing HuggingFace pipeline: {e}")
46
- raise HTTPException(status_code=503, detail=f"LLM (HuggingFace) model could not be loaded: {str(e)}")
47
-
48
- async def generate_text(self, prompt_text: str = None, chat_template_messages: list = None, max_new_tokens: int = 250, temperature: float = 0.7, top_p: float = 0.9, do_sample: bool = True, **kwargs) -> str:
49
- if not self.pipeline or not self.tokenizer:
50
- raise Exception("Pipeline is not initialized. Call initialize() first.")
51
-
52
- formatted_prompt_input = ""
53
- if chat_template_messages:
54
- try:
55
- formatted_prompt_input = self.tokenizer.apply_chat_template(
56
- chat_template_messages,
57
- tokenize=False,
58
- add_generation_prompt=True
59
- )
60
- except Exception as e:
61
- print(f"Could not apply chat template, falling back to raw prompt if available. Error: {e}")
62
- if prompt_text:
63
- formatted_prompt_input = prompt_text
64
- else:
65
- raise ValueError("Cannot generate text without a valid prompt or chat_template_messages.")
66
- elif prompt_text:
67
- formatted_prompt_input = prompt_text
68
- else:
69
- raise ValueError("Either prompt_text or chat_template_messages must be provided.")
70
-
71
- try:
72
- generated_outputs = await asyncio.to_thread(
73
- self.pipeline,
74
- formatted_prompt_input,
75
- max_new_tokens=max_new_tokens,
76
- do_sample=do_sample,
77
- temperature=temperature,
78
- top_p=top_p,
79
- eos_token_id=self.tokenizer.eos_token_id,
80
- pad_token_id=self.tokenizer.eos_token_id if self.tokenizer.pad_token_id is None else self.tokenizer.pad_token_id, # Common setting
81
- **kwargs
82
- )
83
-
84
- if generated_outputs and isinstance(generated_outputs, list) and "generated_text" in generated_outputs[0]:
85
- full_generated_sequence = generated_outputs[0]["generated_text"]
86
-
87
- assistant_response = ""
88
- if full_generated_sequence.startswith(formatted_prompt_input):
89
- assistant_response = full_generated_sequence[len(formatted_prompt_input):]
90
- else:
91
- assistant_marker = "<|im_start|>assistant\n"
92
- last_marker_pos = full_generated_sequence.rfind(assistant_marker)
93
- if last_marker_pos != -1:
94
- assistant_response = full_generated_sequence[last_marker_pos + len(assistant_marker):]
95
- print("Warning: Used fallback parsing for assistant response.")
96
- else:
97
- print("Error: Could not isolate assistant response from the full generated sequence.")
98
- assistant_response = full_generated_sequence
99
-
100
- if assistant_response.endswith("<|im_end|>"):
101
- assistant_response = assistant_response[:-len("<|im_end|>")]
102
-
103
- return assistant_response.strip()
104
- else:
105
- print(f"Unexpected output format from pipeline: {generated_outputs}")
106
- return "Error: Could not parse generated text from pipeline output."
107
-
108
- except Exception as e:
109
- print(f"Error during text generation with {self.model_name_or_path}: {e}")
110
- raise HTTPException(status_code=500, detail=f"Error generating text: {str(e)}")
111
-
 
 
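Note: the legacy pipeline service above follows the same initialize-then-generate flow as the newer wrappers; a brief sketch with an assumed model id and prompt:

```python
# Sketch of the legacy HuggingFaceTextGenerationService; model id and prompt are illustrative.
import asyncio

from app.models.huggingface_service import HuggingFaceTextGenerationService


async def main() -> None:
    svc = HuggingFaceTextGenerationService("speakleash/Bielik-1.5B-v3.0-Instruct", device="cpu")
    await svc.initialize()
    text = await svc.generate_text(
        chat_template_messages=[{"role": "user", "content": "Opisz krótko kawalerkę w centrum."}],
        max_new_tokens=120,
    )
    print(text)


asyncio.run(main())
```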
 
 
 
 
 
app/models/llama_cpp_model.py DELETED
@@ -1,180 +0,0 @@
1
- """
2
- GGUF Model implementation using llama-cpp-python.
3
- Highly optimized for CPU inference.
4
- """
5
-
6
- import os
7
- import asyncio
8
- import traceback
9
- from typing import List, Dict, Any, Optional
10
- from app.models.base_llm import BaseLLM
11
-
12
- try:
13
- from llama_cpp import Llama, LlamaGrammar
14
- HAS_LLAMA_CPP = True
15
- except ImportError:
16
- HAS_LLAMA_CPP = False
17
- LlamaGrammar = None
18
-
19
-
20
- class LlamaCppModel(BaseLLM):
21
- """
22
- Wrapper for GGUF models using llama.cpp.
23
- Provides significant speedups on CPU compared to Transformers.
24
- """
25
-
26
- def __init__(self, name: str, model_id: str, model_path: str = None, n_ctx: int = 4096, grammar_path: str = None, n_gpu_layers: int = -1):
27
- super().__init__(name, model_id)
28
- self.model_path = model_path
29
- self.n_ctx = n_ctx
30
- self.grammar_path = grammar_path
31
- self.n_gpu_layers = n_gpu_layers
32
- self.default_grammar = None # Will be loaded from file if provided
33
- self.llm = None
34
- self._response_cache = {}
35
- self._max_cache_size = 100
36
-
37
- if not HAS_LLAMA_CPP:
38
- raise ImportError("llama-cpp-python is not installed. Cannot use GGUF models.")
39
-
40
- async def initialize(self) -> None:
41
- """Load GGUF model."""
42
- if self._initialized:
43
- return
44
-
45
- if not self.model_path or not os.path.exists(self.model_path):
46
- # The registry normally resolves the exact GGUF path before constructing this wrapper;
47
- # if it is missing or invalid here, fail fast instead of guessing a location.
48
- raise FileNotFoundError(f"GGUF model file not found at: {self.model_path}")
49
-
50
- try:
51
- print(f"[{self.name}] Loading GGUF model from: {self.model_path}")
52
- print(f"[{self.name}] File size: {os.path.getsize(self.model_path) / (1024*1024):.2f} MB")
53
- print(f"[{self.name}] n_ctx={self.n_ctx}, n_threads={os.cpu_count()}, n_gpu_layers={self.n_gpu_layers}")
54
-
55
- # Load model in a thread to avoid blocking event loop
56
- # Enable verbose to see llama.cpp errors
57
- self.llm = await asyncio.to_thread(
58
- Llama,
59
- model_path=self.model_path,
60
- n_ctx=self.n_ctx,
61
- n_threads=os.cpu_count(), # Use all available cores
62
- n_gpu_layers=self.n_gpu_layers, # GPU layer offloading
63
- verbose=True # Enable verbose to see loading errors
64
- )
65
-
66
- self._initialized = True
67
- print(f"[{self.name}] GGUF Model loaded successfully (n_ctx={self.n_ctx}, n_gpu_layers={self.n_gpu_layers})")
68
-
69
- # Load grammar file if provided
70
- if self.grammar_path:
71
- grammar_full_path = os.path.join(os.path.dirname(__file__), "..", "logic", self.grammar_path)
72
- if os.path.exists(grammar_full_path):
73
- with open(grammar_full_path, 'r', encoding='utf-8') as f:
74
- self.default_grammar = f.read()
75
- print(f"[{self.name}] Loaded grammar from: {grammar_full_path}")
76
- else:
77
- print(f"[{self.name}] Grammar file not found: {grammar_full_path}")
78
-
79
- except Exception as e:
80
- error_msg = str(e) if str(e) else repr(e)
81
- print(f"[{self.name}] Failed to load GGUF model: {error_msg}")
82
- print(f"[{self.name}] Full traceback:")
83
- traceback.print_exc()
84
- raise RuntimeError(f"Failed to load GGUF model: {error_msg}") from e
85
-
86
- async def generate(
87
- self,
88
- prompt: str = None,
89
- chat_messages: List[Dict[str, str]] = None,
90
- max_new_tokens: int = 150,
91
- temperature: float = 0.7,
92
- top_p: float = 0.9,
93
- grammar: str = None,
94
- **kwargs
95
- ) -> str:
96
- """Generate text using llama.cpp
97
-
98
- Args:
99
- prompt: Simple text prompt (converted to user message)
100
- chat_messages: List of chat messages with role/content
101
- max_new_tokens: Maximum tokens to generate
102
- temperature: Sampling temperature (lower = more deterministic)
103
- top_p: Nucleus sampling threshold
104
- grammar: Optional GBNF grammar string to constrain output
105
- """
106
-
107
- if not self._initialized or self.llm is None:
108
- raise RuntimeError(f"[{self.name}] Model not initialized")
109
-
110
- # Ensure we have a list of messages
111
- messages = chat_messages
112
- if not messages and prompt:
113
- messages = [{"role": "user", "content": prompt}]
114
-
115
- if not messages:
116
- raise ValueError("Either prompt or chat_messages required")
117
-
118
- # Cache Check - using stringified messages for the key
119
- import json
120
- cache_key = f"{json.dumps(messages)}_{max_new_tokens}_{temperature}_{top_p}_{grammar is not None}"
121
- if cache_key in self._response_cache:
122
- return self._response_cache[cache_key]
123
-
124
- print(f"DEBUG: Generating with messages: {messages}", flush=True)
125
- if grammar:
126
- print(f"DEBUG: Using GBNF grammar constraint", flush=True)
127
-
128
- # Prepare grammar object if provided
129
- llama_grammar = None
130
- if grammar and LlamaGrammar:
131
- try:
132
- llama_grammar = LlamaGrammar.from_string(grammar)
133
- except Exception as e:
134
- print(f"DEBUG: Failed to parse grammar: {e}", flush=True)
135
- llama_grammar = None
136
-
137
- # Generate using chat completion to leverage internal templates
138
- output = await asyncio.to_thread(
139
- self.llm.create_chat_completion,
140
- messages=messages,
141
- max_tokens=max_new_tokens,
142
- temperature=temperature,
143
- top_p=top_p,
144
- grammar=llama_grammar,
145
- )
146
-
147
- print(f"DEBUG: Raw output object: {output}", flush=True)
148
-
149
- response_text = output['choices'][0]['message']['content'].strip()
150
- print(f"DEBUG: Extracted text: {response_text}", flush=True)
151
-
152
- # Cache Store
153
- if len(self._response_cache) >= self._max_cache_size:
154
- first_key = next(iter(self._response_cache))
155
- del self._response_cache[first_key]
156
- self._response_cache[cache_key] = response_text
157
-
158
- return response_text
159
-
160
- def get_info(self) -> Dict[str, Any]:
161
- """Return model information for /models endpoint."""
162
- return {
163
- "name": self.name,
164
- "model_id": self.model_id,
165
- "type": "gguf",
166
- "backend": "llama.cpp",
167
- "context_length": self.n_ctx,
168
- "loaded": self._initialized,
169
- "model_path": self.model_path,
170
- "has_grammar": self.default_grammar is not None,
171
- "gpu_layers": self.n_gpu_layers
172
- }
173
-
174
- async def cleanup(self) -> None:
175
- """Free memory."""
176
- if self.llm:
177
- del self.llm
178
- self.llm = None
179
- self._initialized = False
180
- print(f"[{self.name}] GGUF Model unloaded")
 
 
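Note: grammar-constrained generation with the GGUF wrapper can be sketched as below. The model path and the one-line GBNF grammar are assumptions for illustration; in the service itself the default grammar is loaded from `answers.gbnf` via `grammar_path`.

```python
# Sketch of LlamaCppModel with a GBNF constraint; path and grammar are illustrative.
import asyncio

from app.models.llama_cpp_model import LlamaCppModel

YES_NO_GRAMMAR = 'root ::= "tak" | "nie"'  # force a bare yes/no answer


async def main() -> None:
    llm = LlamaCppModel(
        name="bielik-1.5b-gguf",
        model_id="bielik-1.5b-gguf",
        model_path="/app/pretrain_model/bielik-1.5b.gguf",  # assumed location
        n_gpu_layers=0,  # CPU-only; -1 offloads all layers to GPU
    )
    await llm.initialize()
    answer = await llm.generate(
        chat_messages=[{"role": "user", "content": "Czy Kraków leży nad Wisłą?"}],
        max_new_tokens=5,
        grammar=YES_NO_GRAMMAR,
    )
    print(answer)


asyncio.run(main())
```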
 
 
 
 
 
app/models/registry.py DELETED
@@ -1,148 +0,0 @@
1
- """
2
- Model Registry - Central configuration and factory for all LLM models.
3
- """
4
-
5
- import os
6
- import gc
7
- from typing import Dict, List, Any, Optional
8
-
9
- from app.models.base_llm import BaseLLM
10
- from app.models.huggingface_inference_api import HuggingFaceInferenceAPI
11
- from app.models.transformers_model import TransformersModel
12
-
13
- # Model configuration
14
- MODEL_CONFIG = {
15
- "bielik-1.5b-transformer": {
16
- "id": "speakleash/Bielik-1.5B-v3.0-Instruct",
17
- "type": "transformers",
18
- "size": "1.5B",
19
- "polish_support": "excellent",
20
- "use_8bit": False,
21
- "device_map": "auto"
22
- },
23
- "bielik-11b-transformer": {
24
- "id": "speakleash/Bielik-11B-v2.3-Instruct",
25
- "type": "transformers",
26
- "size": "11B",
27
- "polish_support": "excellent",
28
- "use_8bit": True,
29
- "device_map": "auto",
30
- "enable_cpu_offload": True
31
- },
32
- "llama-3.1-8b": {
33
- "id": "meta-llama/Llama-3.1-8B-Instruct",
34
- "type": "inference_api",
35
- "polish_support": "good",
36
- "size": "8B",
37
- }
38
- }
39
-
40
- LOCAL_MODEL_BASE = os.getenv("MODEL_DIR", "/app/pretrain_model")
41
-
42
- class ModelRegistry:
43
- def __init__(self):
44
- self._models: Dict[str, BaseLLM] = {}
45
- self._config = MODEL_CONFIG.copy()
46
- self._active_local_model: Optional[str] = None
47
-
48
- def _create_model(self, name: str) -> BaseLLM:
49
- if name not in self._config:
50
- raise ValueError(f"Unknown model: {name}")
51
-
52
- config = self._config[name]
53
- model_type = config["type"]
54
- model_id = config["id"]
55
-
56
- if model_type == "transformers":
57
- use_8bit = config.get("use_8bit", True)
58
- device_map = config.get("device_map", "auto")
59
- enable_cpu_offload = config.get("enable_cpu_offload", False)
60
- return TransformersModel(
61
- name=name,
62
- model_id=model_id,
63
- use_8bit=use_8bit,
64
- device_map=device_map,
65
- enable_cpu_offload=enable_cpu_offload
66
- )
67
-
68
- elif model_type == "inference_api":
69
- return HuggingFaceInferenceAPI(name=name, model_id=model_id)
70
-
71
- else:
72
- raise ValueError(f"Unsupported model type: {model_type}")
73
-
74
- async def get_model(self, name: str) -> BaseLLM:
75
- config = self._config[name]
76
-
77
- # Unload previously active model to free GPU memory when switching models
78
- if self._active_local_model and self._active_local_model != name:
79
- print(f"Switching models: unloading '{self._active_local_model}' to load '{name}'")
80
- await self._unload_model(self._active_local_model)
81
-
82
- if name not in self._models:
83
- model = self._create_model(name)
84
- await model.initialize()
85
- self._models[name] = model
86
-
87
- self._active_local_model = name
88
- return self._models[name]
89
-
90
- async def _unload_model(self, name: str) -> None:
91
- if name in self._models:
92
- model = self._models[name]
93
- if hasattr(model, 'cleanup'): await model.cleanup()
94
- del self._models[name]
95
- gc.collect()
96
- print(f"Model '{name}' unloaded.")
97
-
98
- def get_model_info(self, name: str) -> Dict[str, Any]:
99
- config = self._config[name]
100
- return {
101
- "name": name,
102
- "model_id": config["id"],
103
- "type": config["type"],
104
- "size": config.get("size", "unknown"),
105
- "polish_support": config.get("polish_support", "unknown"),
106
- "loaded": name in self._models,
107
- "active": name == self._active_local_model
108
- }
109
-
110
- def get_available_model_names(self) -> List[str]:
111
- """Return list of all available model names."""
112
- return list(self._config.keys())
113
-
114
- def list_models(self) -> List[Dict[str, Any]]:
115
- """Return list of all models with their info."""
116
- return [self.get_model_info(name) for name in self._config.keys()]
117
-
118
- def get_loaded_models(self) -> List[str]:
119
- """Return list of currently loaded model names."""
120
- return list(self._models.keys())
121
-
122
- def get_active_model(self) -> Optional[str]:
123
- """Return name of currently active local model."""
124
- return self._active_local_model
125
-
126
- async def load_model(self, name: str) -> Dict[str, Any]:
127
- """Explicitly load a model and return its info."""
128
- await self.get_model(name)
129
- return self.get_model_info(name)
130
-
131
- async def unload_model(self, name: str) -> Dict[str, str]:
132
- """Explicitly unload a model and free its memory."""
133
- if name in self._models:
134
- await self._unload_model(name)
135
- if self._active_local_model == name:
136
- self._active_local_model = None
137
- return {"status": "success", "message": f"Model '{name}' unloaded"}
138
- return {"status": "error", "message": f"Model '{name}' not loaded"}
139
-
140
- async def unload_all_models(self) -> Dict[str, str]:
141
- """Unload all loaded models and free GPU memory."""
142
- loaded_models = list(self._models.keys())
143
- for model_name in loaded_models:
144
- await self._unload_model(model_name)
145
- self._active_local_model = None
146
- return {"status": "success", "message": f"Unloaded {len(loaded_models)} models"}
147
-
148
- registry = ModelRegistry()
 
 
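Note: the registry is the single entry point used by the endpoints; a minimal sketch of the lazy load / generate / unload cycle it implements (the prompt and model choice are illustrative):

```python
# Sketch of the registry's lazy-loading flow; the model name comes from MODEL_CONFIG above.
import asyncio

from app.models.registry import registry


async def main() -> None:
    print(registry.get_available_model_names())                 # configured names
    llm = await registry.get_model("bielik-1.5b-transformer")   # loads on first use
    text = await llm.generate(
        chat_messages=[{"role": "user", "content": "Jedno zdanie o Bieliku."}],
        max_new_tokens=40,
    )
    print(text)
    await registry.unload_all_models()                          # free memory when done


asyncio.run(main())
```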
 
 
 
 
 
app/models/transformers_model.py DELETED
@@ -1,360 +0,0 @@
1
- """
2
- GPU-optimized Transformers implementation using bitsandbytes quantization.
3
- Automatically offloads to GPU if available, falls back to CPU gracefully.
4
- """
5
-
6
- import os
7
- import asyncio
8
- import traceback
9
- from typing import List, Dict, Any, Optional
10
- from app.models.base_llm import BaseLLM
11
-
12
- try:
13
- from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
14
- HAS_TRANSFORMERS = True
15
- except ImportError:
16
- HAS_TRANSFORMERS = False
17
-
18
- try:
19
- import bitsandbytes as bnb
20
- HAS_BITSANDBYTES = True
21
- except ImportError:
22
- HAS_BITSANDBYTES = False
23
-
24
- import torch
25
-
26
-
27
- class TransformersModel(BaseLLM):
28
- """
29
- Wrapper for HuggingFace Transformers models with GPU acceleration.
30
- Supports 8-bit quantization via bitsandbytes for memory efficiency.
31
- Automatically detects and uses GPU if available.
32
- """
33
-
34
- def __init__(self, name: str, model_id: str, use_8bit: bool = True, device_map: str = "auto", enable_cpu_offload: bool = False):
35
- super().__init__(name, model_id)
36
- self.use_8bit = use_8bit
37
- self.device_map = device_map
38
- env_cpu_offload = os.getenv("TRANSFORMERS_ENABLE_CPU_OFFLOAD", "").strip().lower() in ("1", "true", "yes", "on")
39
- self.enable_cpu_offload = enable_cpu_offload or env_cpu_offload
40
- self.offload_dir = os.getenv("HF_OFFLOAD_DIR", "/tmp/hf-offload")
41
- self.pipeline = None
42
- self.tokenizer = None
43
- self.model = None
44
- self._response_cache = {}
45
- self._max_cache_size = 100
46
-
47
- if not HAS_TRANSFORMERS:
48
- raise ImportError("transformers is not installed. Cannot use Transformers models.")
49
-
50
- async def initialize(self) -> None:
51
- """Load model with GPU optimization."""
52
- if self._initialized:
53
- return
54
-
55
- try:
56
- print(f"[{self.name}] Initializing Transformers model: {self.model_id}")
57
- print(f"[{self.name}] Device map: {self.device_map}, 8-bit quantization: {self.use_8bit}")
58
-
59
- # Load in thread to avoid blocking event loop
60
- await asyncio.to_thread(self._load_model)
61
-
62
- self._initialized = True
63
- print(f"[{self.name}] Transformers Model loaded successfully")
64
-
65
- except Exception as e:
66
- error_msg = str(e) if str(e) else repr(e)
67
- print(f"[{self.name}] Failed to load Transformers model: {error_msg}")
68
- traceback.print_exc()
69
- raise RuntimeError(f"Failed to load Transformers model: {error_msg}") from e
70
-
71
- def _load_model(self) -> None:
72
- """Load model with optimal device configuration and quantization support."""
73
- import gc
74
-
75
- # Set PyTorch environment variables for optimal memory management
76
- if not os.getenv("PYTORCH_CUDA_ALLOC_CONF"):
77
- os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
78
- print(f"[{self.name}] Set PYTORCH_CUDA_ALLOC_CONF to prevent GPU memory fragmentation")
79
-
80
- # Force garbage collection before loading new model
81
- gc.collect()
82
- if torch.cuda.is_available():
83
- torch.cuda.empty_cache()
84
-
85
- # Check GPU availability with detailed diagnostics
86
- cuda_available = torch.cuda.is_available()
87
- cuda_device_count = torch.cuda.device_count() if cuda_available else 0
88
- device = "cuda" if cuda_available else "cpu"
89
-
90
- print(f"[{self.name}] === MODEL LOADING DIAGNOSTICS ===")
91
- print(f"[{self.name}] torch.cuda.is_available(): {cuda_available}")
92
- print(f"[{self.name}] torch.cuda.device_count(): {cuda_device_count}")
93
- if cuda_available:
94
- try:
95
- print(f"[{self.name}] Current CUDA device: {torch.cuda.current_device()}")
96
- print(f"[{self.name}] CUDA device name: {torch.cuda.get_device_name(0)}")
97
- except:
98
- pass
99
- print(f"[{self.name}] ===================================")
100
- print(f"[{self.name}] Loading model: {self.model_id}")
101
- print(f"[{self.name}] Device to use: {device}")
102
- print(f"[{self.name}] Device map: {self.device_map}")
103
- print(f"[{self.name}] 8-bit quantization requested: {self.use_8bit}")
104
-
105
- # Load tokenizer
106
- self.tokenizer = AutoTokenizer.from_pretrained(self.model_id)
107
-
108
- # Use float16 for GPU, float32 for CPU
109
- dtype = torch.float16 if cuda_available else torch.float32
110
- is_large_model = "11b" in self.model_id.lower() or "11b" in self.name.lower()
111
- cpu_offload_enabled = self.enable_cpu_offload or is_large_model
112
-
113
- # Build model kwargs conditionally based on quantization setting
114
- model_kwargs = {
115
- "trust_remote_code": True,
116
- "torch_dtype": dtype,
117
- }
118
-
119
- # Apply 8-bit quantization if requested, available, and GPU is present
120
- if self.use_8bit and HAS_BITSANDBYTES and cuda_available:
121
- try:
122
- print(f"[{self.name}] Using 8-bit quantization for memory efficiency")
123
- bnb_config = BitsAndBytesConfig(
124
- load_in_8bit=True,
125
- bnb_8bit_compute_dtype=torch.float16,
126
- llm_int8_enable_fp32_cpu_offload=cpu_offload_enabled,
127
- )
128
- model_kwargs["quantization_config"] = bnb_config
129
- model_kwargs["device_map"] = "auto"
130
- if cpu_offload_enabled:
131
- os.makedirs(self.offload_dir, exist_ok=True)
132
- model_kwargs["offload_folder"] = self.offload_dir
133
- except Exception as e:
134
- print(f"[{self.name}] Failed to setup 8-bit quantization: {e}")
135
- print(f"[{self.name}] Falling back to full precision")
136
- self.use_8bit = False
137
- model_kwargs["device_map"] = self.device_map
138
- elif self.use_8bit and not cuda_available:
139
- # 8-bit quantization requested but no GPU available - fall back to full precision
140
- print(f"[{self.name}] WARNING: 8-bit quantization requested but no GPU available")
141
- print(f"[{self.name}] Falling back to full precision on CPU (model may be very slow)")
142
- self.use_8bit = False
143
- model_kwargs["device_map"] = "cpu"
144
- else:
145
- # No quantization - use explicit device mapping
146
- if not self.use_8bit:
147
- print(f"[{self.name}] bitsandbytes not available or quantization disabled - using full precision")
148
-
149
- # For large models without quantization, be more careful with device mapping
150
- if "11b" in self.model_id.lower() and not self.use_8bit and cuda_available:
151
- print(f"[{self.name}] WARNING: Loading large 11B model without quantization on GPU")
152
- print(f"[{self.name}] WARNING: This may cause out-of-memory errors on 16GB GPUs")
153
- print(f"[{self.name}] WARNING: Consider enabling use_8bit=True in registry.py")
154
- # Use CPU offloading for safety
155
- model_kwargs["device_map"] = "cpu"
156
- else:
157
- model_kwargs["device_map"] = self.device_map
158
-
159
- try:
160
- self.model = AutoModelForCausalLM.from_pretrained(
161
- self.model_id,
162
- **model_kwargs
163
- )
164
- except ValueError as e:
165
- error_text = str(e)
166
- should_retry_with_offload = (
167
- self.use_8bit
168
- and HAS_BITSANDBYTES
169
- and cuda_available
170
- and "dispatched on the cpu or the disk" in error_text.lower()
171
- )
172
- if not should_retry_with_offload:
173
- raise
174
-
175
- print(f"[{self.name}] Retrying load with explicit fp32 CPU offload")
176
- os.makedirs(self.offload_dir, exist_ok=True)
177
-
178
- retry_kwargs = dict(model_kwargs)
179
- retry_kwargs["quantization_config"] = BitsAndBytesConfig(
180
- load_in_8bit=True,
181
- bnb_8bit_compute_dtype=torch.float16,
182
- llm_int8_enable_fp32_cpu_offload=True,
183
- )
184
- retry_kwargs["device_map"] = "auto"
185
- retry_kwargs["offload_folder"] = self.offload_dir
186
-
187
- try:
188
- total_mem = torch.cuda.get_device_properties(0).total_memory
189
- gpu_gib = max(1, int((total_mem / (1024 ** 3)) * 0.9))
190
- retry_kwargs["max_memory"] = {0: f"{gpu_gib}GiB", "cpu": "64GiB"}
191
- except Exception:
192
- pass
193
-
194
- self.model = AutoModelForCausalLM.from_pretrained(
195
- self.model_id,
196
- **retry_kwargs
197
- )
198
-
199
- # Log final state
200
- model_device = next(self.model.parameters()).device
201
- quantization_status = "8-bit quantized" if self.use_8bit else "full precision"
202
- print(f"[{self.name}] Model loaded successfully")
203
- print(f"[{self.name}] Dtype: {self.model.dtype} | Quantization: {quantization_status}")
204
- print(f"[{self.name}] Device: {model_device}")
205
-
206
- async def generate(
207
- self,
208
- prompt: str = None,
209
- chat_messages: List[Dict[str, str]] = None,
210
- max_new_tokens: int = 150,
211
- temperature: float = 0.7,
212
- top_p: float = 0.9,
213
- grammar: str = None,
214
- **kwargs
215
- ) -> str:
216
- """Generate text using Transformers pipeline.
217
-
218
- Note: grammar parameter is ignored (Transformers doesn't support GBNF).
219
- Use stricter prompt engineering instead.
220
- """
221
-
222
- if not self._initialized or self.model is None:
223
- raise RuntimeError(f"[{self.name}] Model not initialized")
224
-
225
- # Build prompt from messages
226
- prompt_text = self._build_prompt_from_messages(chat_messages) if chat_messages else prompt
227
-
228
- if not prompt_text:
229
- raise ValueError("Either prompt or chat_messages required")
230
-
231
- # Cache Check
232
- import json
233
- cache_key = f"{json.dumps(chat_messages or prompt_text)}_{max_new_tokens}_{temperature}_{top_p}"
234
- if cache_key in self._response_cache:
235
- return self._response_cache[cache_key]
236
-
237
- print(f"DEBUG: Generating with Transformers model", flush=True)
238
- if grammar:
239
- print(f"DEBUG: Note - GBNF grammar not supported in Transformers, using prompt engineering instead", flush=True)
240
-
241
- # Generate in thread to avoid blocking
242
- response_text = await asyncio.to_thread(
243
- self._generate_text,
244
- prompt_text,
245
- max_new_tokens,
246
- temperature,
247
- top_p
248
- )
249
-
250
- # Cache Store
251
- if len(self._response_cache) >= self._max_cache_size:
252
- first_key = next(iter(self._response_cache))
253
- del self._response_cache[first_key]
254
- self._response_cache[cache_key] = response_text
255
-
256
- print(f"DEBUG: Extracted text: {response_text[:200]}", flush=True)
257
- return response_text
258
-
259
- def _build_prompt_from_messages(self, messages: List[Dict[str, str]]) -> str:
260
- """Convert chat messages to prompt using Bielik's chat template."""
261
- # Bielik uses: <|im_start|>role\ncontent<|im_end|>\n
262
- prompt_parts = []
263
- for msg in messages:
264
- role = msg.get("role", "user")
265
- content = msg.get("content", "")
266
- prompt_parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
267
-
268
- # Add assistant start token for generation
269
- prompt_parts.append("<|im_start|>assistant\n")
270
- return "".join(prompt_parts)
271
-
272
- def _generate_text(
273
- self,
274
- prompt: str,
275
- max_new_tokens: int,
276
- temperature: float,
277
- top_p: float
278
- ) -> str:
279
- """Internal method to generate text (called in thread)."""
280
- # Tokenize input
281
- inputs = self.tokenizer(prompt, return_tensors="pt")
282
-
283
- # Move to same device as model if using CPU
284
- if next(self.model.parameters()).device.type == "cpu":
285
- inputs = {k: v.to("cpu") for k, v in inputs.items()}
286
- else:
287
- inputs = {k: v.to(next(self.model.parameters()).device) for k, v in inputs.items()}
288
-
289
- # Generate with optimized settings for better quality and speed
290
- with torch.no_grad():
291
- outputs = self.model.generate(
292
- **inputs,
293
- max_new_tokens=max_new_tokens,
294
- temperature=temperature,
295
- top_p=top_p,
296
- do_sample=True,
297
- eos_token_id=self.tokenizer.eos_token_id,
298
- pad_token_id=self.tokenizer.pad_token_id,
299
- use_cache=False, # Disabled: KV cache causes degradation after ~50 requests
300
- num_beams=1, # Greedy decoding is fastest (can adjust for quality)
301
- )
302
-
303
- # Decode - skip prompt tokens
304
- generated_text = self.tokenizer.decode(
305
- outputs[0][inputs["input_ids"].shape[1]:],
306
- skip_special_tokens=True
307
- )
308
-
309
- # Clear GPU cache to prevent memory accumulation and degradation
310
- if torch.cuda.is_available():
311
- torch.cuda.empty_cache()
312
-
313
- return generated_text.strip()
314
-
315
- def get_info(self) -> Dict[str, Any]:
316
- """Return model information for /models endpoint."""
317
- device = "unknown"
318
- dtype = "unknown"
319
- if self.model:
320
- device = str(next(self.model.parameters()).device)
321
- dtype = str(self.model.dtype)
322
-
323
- return {
324
- "name": self.name,
325
- "model_id": self.model_id,
326
- "type": "transformers",
327
- "backend": "huggingface-transformers",
328
- "loaded": self._initialized,
329
- "device": device,
330
- "dtype": dtype,
331
- "optimization": "float16, KV cache disabled (prevents degradation), 8-bit quantization",
332
- "note": "KV cache disabled to prevent quality degradation after 50+ requests"
333
- }
334
-
335
- async def cleanup(self) -> None:
336
- """Free memory."""
337
- import gc
338
-
339
- if self.model:
340
- del self.model
341
- self.model = None
342
- if self.tokenizer:
343
- del self.tokenizer
344
- self.tokenizer = None
345
- self._initialized = False
346
-
347
- # Aggressive cleanup
348
- gc.collect() # Force garbage collection
349
-
350
- # Clear CUDA cache if available
351
- if torch.cuda.is_available():
352
- torch.cuda.empty_cache()
353
- try:
354
- # Empty reserved memory too (PyTorch 2.0+)
355
- device_id = torch.cuda.current_device()
356
- torch.cuda.reset_peak_memory_stats(device_id)
357
- except:
358
- pass
359
-
360
- print(f"[{self.name}] Transformers Model unloaded and memory freed")
 
 
app/schemas/schemas.py DELETED
@@ -1,131 +0,0 @@
- from pydantic import BaseModel, Field
- from typing import List, Optional, Dict, Any
-
-
- class EnhancedDescriptionResponse(BaseModel):
-     description: str
-     model_used: str
-     generation_time: float
-     user_email: str
-
-
- # --- Batch Infill Schemas ---
-
- class InfillItem(BaseModel):
-     """A single item (ad) with gaps to be filled."""
-     id: str = Field(..., description="Unique identifier for this item")
-     text_with_gaps: str = Field(..., description="Text containing [GAP:n] markers or ___ to fill")
-     attributes: Dict[str, Any] = Field(default_factory=dict, description="Optional context attributes (e.g. make, model)")
-     custom_messages: Optional[List[Dict[str, str]]] = Field(None, description="Optional pre-built chat messages to override prompt generation")
-
-
- class InfillOptions(BaseModel):
-     """Configuration options for infill processing."""
-     gap_notation: str = Field(
-         default="auto",
-         description="Gap notation: 'auto' (detect), '[GAP:n]', or '___'"
-     )
-     top_n_per_gap: int = Field(
-         default=3,
-         ge=1,
-         le=5,
-         description="Number of alternative suggestions per gap (1-5)"
-     )
-     language: str = Field(default="pl", description="Output language (pl/en)")
-     temperature: float = Field(default=0.6, ge=0.0, le=1.0)
-     max_new_tokens: int = Field(default=256, ge=50, le=512)
-
-
- class GapFill(BaseModel):
-     """Result for a single filled gap."""
-     index: int = Field(..., description="Gap index (1-based)")
-     marker: str = Field(..., description="Original marker (e.g., '[GAP:1]' or '___')")
-     choice: str = Field(..., description="Selected fill word/phrase")
-     alternatives: List[str] = Field(
-         default_factory=list,
-         description="Alternative suggestions"
-     )
-
-
- class InfillResult(BaseModel):
-     """Result for a single infill item."""
-     id: str
-     status: str = Field(..., description="'ok' or 'error'")
-     filled_text: Optional[str] = Field(None, description="Text with gaps filled")
-     gaps: List[GapFill] = Field(default_factory=list)
-     error: Optional[str] = Field(None, description="Error message if status='error'")
-
-
- class InfillRequest(BaseModel):
-     """Request for single-model batch infill."""
-     domain: str = Field(..., description="Domain name (e.g., 'cars')")
-     items: List[InfillItem] = Field(..., description="Batch of items to process")
-     model: str = Field(default="bielik-1.5b", description="Model to use")
-     options: InfillOptions = Field(default_factory=InfillOptions)
-
-
- class InfillResponse(BaseModel):
-     """Response for single-model batch infill."""
-     model: str
-     results: List[InfillResult]
-     total_time: float
-     processed_count: int
-     error_count: int
-
-
- class CompareInfillRequest(BaseModel):
-     """Request for multi-model batch infill comparison."""
-     domain: str
-     items: List[InfillItem]
-     models: Optional[List[str]] = Field(
-         None,
-         description="Models to compare. If None, use all available."
-     )
-     options: InfillOptions = Field(default_factory=InfillOptions)
-
-
- class ModelInfillResult(BaseModel):
-     """Per-model results in comparison."""
-     model: str
-     type: str
-     results: List[InfillResult]
-     time: float
-     error_count: int
-
-
- class CompareInfillResponse(BaseModel):
-     """Response for multi-model batch infill comparison."""
-     domain: str
-     models: List[ModelInfillResult]
-     total_time: float
-
-
- class ModelInfo(BaseModel):
-     name: str
-     model_id: str
-     type: str
-     polish_support: str
-     size: str
-     loaded: bool
-     active: Optional[bool] = None  # Only for local models
-
-
- class CompareRequest(BaseModel):
-     domain: str
-     data: Dict[str, Any]
-     models: Optional[List[str]] = None  # If None, use all models
-
-
- class ModelResult(BaseModel):
-     model: str
-     output: str
-     time: float
-     type: str
-     error: Optional[str] = None
-
-
- class CompareResponse(BaseModel):
-     domain: str
-     results: List[ModelResult]
-     total_time: float
-
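To make the schemas above concrete, here is a hypothetical batch-infill request body. The field names come from `InfillRequest`, `InfillItem`, and `InfillOptions` as deleted above; the values themselves are only illustrative.

```python
# Hypothetical payload matching the deleted InfillRequest schema (illustrative values).
payload = {
    "domain": "cars",
    "model": "bielik-1.5b",
    "items": [
        {
            "id": "ad-001",
            "text_with_gaps": "Sprzedam [GAP:1] samochód w [GAP:2] stanie.",
            "attributes": {"make": "Toyota", "model": "Corolla"},
        }
    ],
    "options": {"gap_notation": "[GAP:n]", "top_n_per_gap": 3, "language": "pl"},
}

# With the module still present, validation would have been:
#   from app.schemas.schemas import InfillRequest
#   request = InfillRequest(**payload)
print(payload)
```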
 
 
requirements.txt DELETED
@@ -1,10 +0,0 @@
- fastapi==0.104.1
- uvicorn[standard]==0.24.0
- transformers==4.36.2
- accelerate==0.25.0
- bitsandbytes>=0.41.1
- huggingface_hub>=0.26.0
- pydantic==2.5.0
- importlib-metadata
- --extra-index-url https://download.pytorch.org/whl/cu121
- torch>=2.1.0
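A quick way to confirm that the cu121 extra index above resolved a GPU build of torch, and whether a GPU is actually visible at runtime, is the two-line check below. It mirrors the diagnostics the deleted model wrapper printed and is only a suggestion, not part of the removed files.

```python
import torch

print(torch.__version__)          # typically ends in "+cu121" when the CUDA wheel was installed
print(torch.cuda.is_available())  # False on CPU-only hosts; the service then falls back to CPU
```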
 
 
start_container.ps1 DELETED
@@ -1,23 +0,0 @@
- # PowerShell script to build and run the Docker container for your FastAPI service
-
- # Set variables
- $imageName = "bielik-fastapi-service"
- $containerName = "bielik_app_instance"
- $tokenFile = "my_hf_token.txt"
-
- Write-Host "Building Docker image..."
- docker build --secret id=huggingface_token,src=$tokenFile -t $imageName .
-
- Write-Host "Stopping and removing any existing container named $containerName..."
- docker stop $containerName | Out-Null 2>&1
-
- docker rm $containerName | Out-Null 2>&1
-
- Write-Host "Running new container..."
- docker run -d --name $containerName -p 8000:8000 $imageName
-
- Write-Host ""
- Write-Host "$containerName should be starting up."
- Write-Host "You can view logs with: docker logs $containerName -f"
- Write-Host "To stop the container, run: docker stop $containerName"
- Write-Host "The service will be available at http://127.0.0.1:8000"
 
 
start_container.sh DELETED
@@ -1,25 +0,0 @@
- #!/bin/bash
-
- IMAGE_NAME="bielik-fastapi-service"
- CONTAINER_NAME="bielik_app_instance"
- TOKEN_FILE="my_hf_token.txt"
-
- # Build the Docker image with Hugging Face token as a secret
- echo "Building Docker image..."
- DOCKER_BUILDKIT=1 docker build --secret id=huggingface_token,src=$TOKEN_FILE -t $IMAGE_NAME .
-
- echo "Attempting to stop and remove existing container named $CONTAINER_NAME (if any)..."
- docker stop $CONTAINER_NAME > /dev/null 2>&1 || true  # Stop if running, ignore error if not
- docker rm $CONTAINER_NAME > /dev/null 2>&1 || true  # Remove if exists, ignore error if not
-
- echo "Starting new $IMAGE_NAME container as $CONTAINER_NAME..."
- docker run -d --name $CONTAINER_NAME -p 8000:8000 $IMAGE_NAME
- # -d : Runs the container in detached mode (in the background)
- # --name : Assigns a specific name to your running container instance
- # -p 8000:8000 : Maps port 8000 on your host to port 8000 in the container
-
- echo ""
- echo "$CONTAINER_NAME should be starting up."
- echo "You can view logs with: docker logs $CONTAINER_NAME -f"
- echo "To stop the container, run: docker stop $CONTAINER_NAME"
- echo "The service will be available at http://127.0.0.1:8000"
 
 
test_simplified.py DELETED
@@ -1,132 +0,0 @@
- """
- Unit tests for simplified Bielik service
- Tests the API structure without running actual models
- """
- import os
- import json
- from unittest.mock import Mock, AsyncMock, patch
-
- # Skip llama-cpp installation during testing
- os.environ["SKIP_LLAMA_INSTALL"] = "1"
-
- # Mock the registry before importing main
- mock_registry = Mock()
- mock_registry.get_available_model_names.return_value = ["bielik-1.5b-transformer", "bielik-11b-transformer"]
- mock_registry.get_model_info.return_value = {"type": "transformers", "device": "cuda:0"}
-
- @patch("app.main.registry", mock_registry)
- def test_app_structure():
-     """Test that simplified app has correct endpoints"""
-     from app.main import app
-
-     # Get all routes
-     routes = {route.path: route.methods for route in app.routes}
-
-     # Check required endpoints exist
-     assert "/" in routes, "Root endpoint missing"
-     assert "/health" in routes, "Health endpoint missing"
-     assert "/models" in routes, "Models endpoint missing"
-     assert "/chat" in routes, "Chat endpoint missing"
-     assert "/generate" in routes, "Generate endpoint missing"
-
-     # Check methods
-     assert "GET" in routes["/health"], "Health should be GET"
-     assert "GET" in routes["/models"], "Models should be GET"
-     assert "POST" in routes["/chat"], "Chat should be POST"
-     assert "POST" in routes["/generate"], "Generate should be POST"
-
-     print("✅ App structure correct")
-     print(f" Routes: {list(routes.keys())}")
-
- @patch("app.main.registry", mock_registry)
- def test_no_business_logic():
-     """Verify no domain/infill endpoints exist"""
-     from app.main import app
-
-     routes = {route.path for route in app.routes}
-
-     # These should NOT exist
-     forbidden_routes = ["/enhance", "/compare", "/infill", "/compare-infill", "/user/me"]
-
-     for route in forbidden_routes:
-         assert route not in routes, f"Business logic endpoint {route} should not exist"
-
-     print("✅ No business logic endpoints found")
-
- @patch("app.main.registry", mock_registry)
- def test_request_schemas():
-     """Test request/response schemas are valid"""
-     from app.main import ChatRequest, GenerateRequest, ChatResponse, GenerateResponse
-     from app.main import Message, HealthResponse, ModelsResponse
-
-     # Test ChatRequest
-     chat_req = ChatRequest(
-         model="bielik-1.5b-transformer",
-         messages=[Message(role="user", content="Hello")]
-     )
-     assert chat_req.model == "bielik-1.5b-transformer"
-     assert len(chat_req.messages) == 1
-     print("✅ ChatRequest schema valid")
-
-     # Test GenerateRequest
-     gen_req = GenerateRequest(
-         model="bielik-1.5b-transformer",
-         prompt="Hello world"
-     )
-     assert gen_req.model == "bielik-1.5b-transformer"
-     assert gen_req.prompt == "Hello world"
-     print("✅ GenerateRequest schema valid")
-
-     # Test HealthResponse
-     health = HealthResponse(
-         status="ok",
-         gpu_available=True,
-         models_available=2
-     )
-     assert health.status == "ok"
-     print("✅ HealthResponse schema valid")
-
-     # Test ModelsResponse
-     models_resp = ModelsResponse(models=[])
-     assert isinstance(models_resp.models, list)
-     print("✅ ModelsResponse schema valid")
-
- @patch("app.main.registry", mock_registry)
- def test_default_values():
-     """Test that requests have sensible defaults"""
-     from app.main import ChatRequest, GenerateRequest, Message
-
-     # Chat with minimal fields
-     chat = ChatRequest(
-         model="test",
-         messages=[Message(role="user", content="test")]
-     )
-     assert chat.max_tokens == 150
-     assert chat.temperature == 0.7
-     assert chat.top_p == 0.9
-     print("✅ Chat defaults correct")
-
-     # Generate with minimal fields
-     gen = GenerateRequest(
-         model="test",
-         prompt="test"
-     )
-     assert gen.max_tokens == 150
-     assert gen.temperature == 0.7
-     assert gen.top_p == 0.9
-     print("✅ Generate defaults correct")
-
- if __name__ == "__main__":
-     print("\n=== Testing Simplified Bielik Service ===\n")
-
-     try:
-         test_app_structure()
-         test_no_business_logic()
-         test_request_schemas()
-         test_default_values()
-
-         print("\n✅ All tests passed!")
-         print("\n=== Phase 1 Verification Complete ===")
-     except AssertionError as e:
-         print(f"\n❌ Test failed: {e}")
-         exit(1)
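These unit tests only assert on the app's structure. For an in-process check of the same endpoints, FastAPI's `TestClient` could be used as sketched below; the handlers' behaviour is not shown in this diff, so the expected responses are assumptions, and the POST call is left commented out because it would trigger a real model load outside the mocked-registry setup used above.

```python
# Illustrative: drive the simplified endpoints in-process with FastAPI's TestClient.
import os

os.environ["SKIP_LLAMA_INSTALL"] = "1"  # same shortcut the unit tests above use

from fastapi.testclient import TestClient
from app.main import app

client = TestClient(app)
print(client.get("/health").json())   # e.g. status / gpu_available / models_available
print(client.get("/models").json())

# Would attempt to load and run a model, so only run on suitable hardware:
# print(client.post("/chat", json={
#     "model": "bielik-1.5b-transformer",
#     "messages": [{"role": "user", "content": "Cześć!"}],
# }).json())
```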