jeanbaptdzd committed
Commit e3878fa · 1 Parent(s): 8c38d11

Add vLLM deployment for Koyeb with CUDA optimizations


- Add Dockerfile.koyeb: vLLM-optimized image with native OpenAI API
- Add start-vllm.sh: vLLM server startup script with optimizations
- Add start.sh: HF Spaces startup script
- Update README.md: Document both HF Spaces and Koyeb deployments
- Add KOYEB_VLLM_DEPLOYMENT.md: Detailed Koyeb setup guide
- Remove redundant status/setup docs
- Remove ad-hoc test scripts

Docker Hub public images:
- jeanbapt/dragon-llm-inference:vllm (Koyeb - vLLM)
- jeanbapt/dragon-llm-inference:latest (HF Spaces - Transformers)
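
For reference, either image can be pulled directly (standard `docker pull`; no authentication is needed for public images):

```bash
docker pull jeanbapt/dragon-llm-inference:vllm    # Koyeb - vLLM
docker pull jeanbapt/dragon-llm-inference:latest  # HF Spaces - Transformers
```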

Files changed (7)
  1. Dockerfile +12 -3
  2. Dockerfile.koyeb +57 -0
  3. KOYEB_VLLM_DEPLOYMENT.md +101 -0
  4. README.md +80 -99
  5. start-vllm.sh +68 -0
  6. start.sh +10 -0
  7. test_deployment.sh +0 -101
Dockerfile CHANGED
@@ -68,14 +68,23 @@ RUN test -f /app/app/providers/transformers_provider.py && \
     grep -q "def initialize_model" /app/app/providers/transformers_provider.py || \
     (echo "ERROR: transformers_provider.py not found or invalid!" && exit 1)
 
+# Copy startup script
+COPY start.sh /app/start.sh
+
 # Create non-root user and cache directories in single layer
 # Use ${HF_HOME} variable (defaults to /tmp/huggingface if not set)
 RUN useradd -m -u 1000 user && \
     mkdir -p ${HF_HOME:-/tmp/huggingface} /tmp/torch/inductor /tmp/triton && \
-    chown -R user:user /app ${HF_HOME:-/tmp/huggingface} /tmp/torch /tmp/triton
+    chmod +x /app/start.sh && \
+    chown -R user:user /app ${HF_HOME:-/tmp/huggingface} /tmp/torch /tmp/triton && \
+    # Verify startup script is executable and has correct shebang
+    test -x /app/start.sh && head -1 /app/start.sh | grep -q "^#!/bin/bash" || (echo "ERROR: start.sh not executable or wrong shebang!" && exit 1)
 
 USER user
 
-EXPOSE 7860
+# Expose ports for both HF Spaces (7860) and Koyeb (8000)
+# PORT environment variable controls which port the app actually uses
+EXPOSE 7860 8000
 
-CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "7860"]
+# Use startup script for more reliable execution
+CMD ["/app/start.sh"]
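
Since the image now exposes both ports and `start.sh` reads `PORT` at runtime, one build can serve either platform. A quick local sanity check might look like this (a sketch; the host-port mappings and the `latest` tag are assumptions based on the commit message):

```bash
# HF Spaces behavior: start.sh defaults PORT to 7860
docker run --rm -p 7860:7860 jeanbapt/dragon-llm-inference:latest

# Koyeb behavior: same image, PORT overridden to 8000
docker run --rm -p 8000:8000 -e PORT=8000 jeanbapt/dragon-llm-inference:latest
```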
Dockerfile.koyeb ADDED
@@ -0,0 +1,57 @@
+# Koyeb-optimized Dockerfile using vLLM's native OpenAI API server
+# This leverages vLLM's built-in optimizations: continuous batching, PagedAttention, CUDA graphs
+
+FROM nvidia/cuda:12.4.0-devel-ubuntu22.04
+
+# Build argument for cache control
+ARG CACHE_BUST=20250125_vllm
+
+ENV PYTHONUNBUFFERED=1 \
+    DEBIAN_FRONTEND=noninteractive \
+    HF_HOME=/tmp/huggingface \
+    VLLM_ATTENTION_BACKEND=FLASH_ATTN \
+    CUDA_VISIBLE_DEVICES=0
+
+# Install system dependencies
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        python3.11 \
+        python3.11-dev \
+        python3-pip \
+        git \
+        curl && \
+    rm -rf /var/lib/apt/lists/* && \
+    update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 && \
+    update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1 && \
+    python3 -m pip install --upgrade pip
+
+WORKDIR /app
+
+# Install PyTorch with CUDA 12.4 (version specifier quoted so ">=" is not a shell redirect)
+RUN pip install --no-cache-dir \
+    "torch>=2.5.0" \
+    --index-url https://download.pytorch.org/whl/cu124
+
+# Install vLLM with all CUDA optimizations
+# vLLM includes: Flash Attention, PagedAttention, continuous batching, CUDA graphs
+RUN pip install --no-cache-dir \
+    "vllm>=0.6.0" \
+    "huggingface-hub>=0.20.0"
+
+# Create non-root user and cache directories
+RUN useradd -m -u 1000 user && \
+    mkdir -p /tmp/huggingface /tmp/vllm && \
+    chown -R user:user /app /tmp/huggingface /tmp/vllm
+
+# Copy startup script
+COPY start-vllm.sh /app/start-vllm.sh
+RUN chmod +x /app/start-vllm.sh && chown user:user /app/start-vllm.sh
+
+USER user
+
+# vLLM OpenAI server default port
+EXPOSE 8000
+
+# Use vLLM's native OpenAI-compatible server
+CMD ["/app/start-vllm.sh"]
+
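
Before pushing, the image can be exercised locally; a sketch assuming a CUDA host with the NVIDIA container toolkit installed (`hf_xxx` is a placeholder token and `dragon-vllm-test` a throwaway tag):

```bash
docker build -f Dockerfile.koyeb -t dragon-vllm-test .
docker run --rm --gpus all -p 8000:8000 \
  -e HF_TOKEN_LC2=hf_xxx \
  dragon-vllm-test
```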
KOYEB_VLLM_DEPLOYMENT.md ADDED
@@ -0,0 +1,101 @@
+# Koyeb vLLM Deployment
+
+## Overview
+
+The Koyeb deployment uses **vLLM's native OpenAI-compatible API server** with full CUDA optimizations for maximum inference performance.
+
+## Docker Image
+
+**Public image on Docker Hub:**
+```
+jeanbapt/dragon-llm-inference:vllm
+```
+
+Built from `Dockerfile.koyeb` with:
+- NVIDIA CUDA 12.4 base
+- vLLM 0.6.0+ with all optimizations
+- Native OpenAI-compatible server
+
+## vLLM Optimizations
+
+| Feature | Benefit |
+|---------|---------|
+| **Flash Attention 2** | Faster attention computation |
+| **PagedAttention** | Efficient KV cache management |
+| **Continuous Batching** | Handles multiple requests simultaneously |
+| **Prefix Caching** | Reuses KV cache for common prefixes |
+| **Chunked Prefill** | Better memory utilization |
+| **CUDA Graphs** | Reduced kernel launch overhead |
+
+## Koyeb Configuration
+
+### Environment Variables
+
+| Variable | Value | Description |
+|----------|-------|-------------|
+| `MODEL` | `DragonLLM/Qwen-Open-Finance-R-8B` | Model to serve |
+| `HF_TOKEN_LC2` | (secret) | Hugging Face token |
+| `PORT` | `8000` | Server port |
+| `MAX_MODEL_LEN` | `8192` | Maximum context length |
+| `GPU_MEMORY_UTILIZATION` | `0.90` | GPU memory usage (90%) |
+
+### Instance Type
+
+- **Recommended**: `gpu-nvidia-l40s` (48GB VRAM)
+- **Alternative**: `gpu-nvidia-rtx-4000-sff-ada` (20GB VRAM)
+
+### Health Check
+
+- **Path**: `/health`
+- **Port**: 8000
+- **Grace Period**: 300s (model loading time)
+- **Interval**: 60s
+
+## API Endpoints
+
+vLLM's native OpenAI-compatible server provides:
+
+```
+POST /v1/chat/completions  - Chat completions
+POST /v1/completions       - Text completions
+GET  /v1/models            - List models
+GET  /health               - Health check
+```
+
+## Usage Example
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="https://dragon-llm-dealexmachina.koyeb.app/v1",
+    api_key="not-needed"
+)
+
+response = client.chat.completions.create(
+    model="DragonLLM/Qwen-Open-Finance-R-8B",
+    messages=[
+        {"role": "user", "content": "Analyze the impact of rising interest rates on bond portfolios"}
+    ],
+    temperature=0.7,
+    max_tokens=1024
+)
+
+print(response.choices[0].message.content)
+```
+
+## Build & Push (Development)
+
+```bash
+# Build vLLM image
+docker build -f Dockerfile.koyeb -t jeanbapt/dragon-llm-inference:vllm .
+
+# Push to Docker Hub
+docker push jeanbapt/dragon-llm-inference:vllm
+```
+
+## Performance Notes
+
+- **First request**: Slower due to model loading + CUDA warmup
+- **Subsequent requests**: Benefit from batching, KV cache reuse, CUDA graphs
+- **L40s GPU**: 48GB VRAM provides ample room for the 8B model with long context
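
A quick curl smoke test of the endpoints listed in the guide (the Koyeb URL is the one from the usage example; substitute your own deployment):

```bash
# Health check: succeeds once the model is loaded
curl https://dragon-llm-dealexmachina.koyeb.app/health

# Minimal chat completion against the OpenAI-compatible route
curl -X POST https://dragon-llm-dealexmachina.koyeb.app/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DragonLLM/Qwen-Open-Finance-R-8B",
    "messages": [{"role": "user", "content": "What is duration risk?"}],
    "max_tokens": 300
  }'
```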
README.md CHANGED
@@ -11,33 +11,38 @@ suggested_hardware: l4x1
 
 # Open Finance LLM 8B
 
-OpenAI-compatible API powered by DragonLLM/Qwen-Open-Finance-R-8B using Transformers.
+OpenAI-compatible API powered by DragonLLM/Qwen-Open-Finance-R-8B.
 
-## Overview
+## Deployment Options
 
-This service provides an OpenAI-compatible API for the DragonLLM Qwen3-8B finance-specialized language model. The model supports both English and French financial terminology and includes chain-of-thought reasoning.
+| Platform | Backend | Docker Image | Port |
+|----------|---------|--------------|------|
+| **HF Spaces** | Transformers | Default (builds from `Dockerfile`) | 7860 |
+| **Koyeb** | vLLM (optimized) | `jeanbapt/dragon-llm-inference:vllm` | 8000 |
+
+### Docker Hub Public Images
+
+```
+jeanbapt/dragon-llm-inference:vllm    # Koyeb - vLLM with CUDA optimizations
+jeanbapt/dragon-llm-inference:latest  # HF Spaces - Transformers backend
+```
 
 ## Features
 
-- OpenAI-compatible API - Drop-in replacement for OpenAI API
-- French and English support - Automatic language detection
-- Rate limiting - Built-in protection (30 req/min, 500 req/hour)
-- Statistics tracking - Token usage and request metrics via `/v1/stats`
-- Health monitoring - Model readiness status in `/health` endpoint
-- Streaming support - Real-time response streaming
-- Tool calls support - OpenAI-compatible tool/function calling
-- Structured outputs - JSON format support via response_format
+- **OpenAI-compatible API** - Drop-in replacement for OpenAI SDK
+- **French and English support** - Automatic language detection
+- **Rate limiting** - Built-in protection (30 req/min, 500 req/hour)
+- **Statistics tracking** - Token usage and request metrics via `/v1/stats`
+- **Health monitoring** - Model readiness status in `/health` endpoint
+- **Streaming support** - Real-time response streaming
+- **Tool calls support** - OpenAI-compatible tool/function calling
+- **Structured outputs** - JSON format support via `response_format`
 
 ## API Endpoints
 
-### List Models
-```bash
-curl -X GET "https://jeanbaptdzd-open-finance-llm-8b.hf.space/v1/models"
-```
-
 ### Chat Completions
 ```bash
-curl -X POST "https://jeanbaptdzd-open-finance-llm-8b.hf.space/v1/chat/completions" \
+curl -X POST "https://your-endpoint/v1/chat/completions" \
   -H "Content-Type: application/json" \
   -d '{
     "model": "DragonLLM/Qwen-Open-Finance-R-8B",
@@ -47,9 +52,14 @@ curl -X POST "https://jeanbaptdzd-open-finance-llm-8b.hf.space/v1/chat/completio
   }'
 ```
 
+### List Models
+```bash
+curl -X GET "https://your-endpoint/v1/models"
+```
+
 ### Streaming
 ```bash
-curl -X POST "https://jeanbaptdzd-open-finance-llm-8b.hf.space/v1/chat/completions" \
+curl -X POST "https://your-endpoint/v1/chat/completions" \
   -H "Content-Type: application/json" \
   -d '{
     "model": "DragonLLM/Qwen-Open-Finance-R-8B",
@@ -58,25 +68,11 @@ curl -X POST "https://jeanbaptdzd-open-finance-llm-8b.hf.space/v1/chat/completio
   }'
 ```
 
-### Statistics
-```bash
-curl -X GET "https://jeanbaptdzd-open-finance-llm-8b.hf.space/v1/stats"
-```
-
 ### Health Check
 ```bash
-curl -X GET "https://jeanbaptdzd-open-finance-llm-8b.hf.space/health"
+curl -X GET "https://your-endpoint/health"
```
 
-## Response Format
-
-Responses include chain-of-thought reasoning in `<think>` tags followed by the answer. Reasoning typically consumes 40-60% of tokens.
-
-**Recommended `max_tokens`:**
-- Simple queries: 300-400
-- Complex queries: 500-800
-- Detailed analysis: 800-1200
-
 ## Configuration
 
 ### Environment Variables
@@ -85,11 +81,10 @@ Responses include chain-of-thought reasoning in `<think>` tags followed by the a
 - `HF_TOKEN_LC2` - Hugging Face token with access to DragonLLM models
 
 **Optional:**
-- `MODEL` - Model name (default: DragonLLM/Qwen-Open-Finance-R-8B)
+- `MODEL` - Model name (default: `DragonLLM/Qwen-Open-Finance-R-8B`)
+- `PORT` - Server port (default: 7860 for HF, 8000 for Koyeb)
 - `SERVICE_API_KEY` - API key for authentication
-- `LOG_LEVEL` - Logging level (default: info)
-- `HF_HOME` - Hugging Face cache directory (default: /tmp/huggingface)
-- `FORCE_MODEL_RELOAD` - Force reload model from Hub on startup (default: false)
+- `LOG_LEVEL` - Logging level (default: `info`)
 
 Token priority: `HF_TOKEN_LC2` > `HF_TOKEN_LC` > `HF_TOKEN` > `HUGGING_FACE_HUB_TOKEN`
 
@@ -103,8 +98,8 @@ Token priority: `HF_TOKEN_LC2` > `HF_TOKEN_LC` > `HF_TOKEN` > `HUGGING_FACE_HUB_
 from openai import OpenAI
 
 client = OpenAI(
-    base_url="https://jeanbaptdzd-open-finance-llm-8b.hf.space/v1",
-    api_key="not-needed"
+    base_url="https://your-endpoint/v1",
+    api_key="not-needed"  # or your SERVICE_API_KEY
 )
 
 response = client.chat.completions.create(
@@ -114,6 +109,27 @@ response = client.chat.completions.create(
 )
 ```
 
+## Koyeb Deployment (vLLM)
+
+The Koyeb deployment uses vLLM's native OpenAI-compatible server with full CUDA optimizations:
+
+- **Flash Attention 2** - Faster attention computation
+- **PagedAttention** - Efficient GPU memory management
+- **Continuous batching** - High-throughput inference
+- **Prefix caching** - Reuse of KV cache for common prefixes
+
+See [KOYEB_VLLM_DEPLOYMENT.md](KOYEB_VLLM_DEPLOYMENT.md) for detailed setup.
+
+### Quick Deploy to Koyeb
+
+1. Create an app in the Koyeb dashboard
+2. Set the Docker image: `jeanbapt/dragon-llm-inference:vllm`
+3. Add environment variables:
+   - `MODEL`: `DragonLLM/Qwen-Open-Finance-R-8B`
+   - `HF_TOKEN_LC2`: (your HF token as a secret)
+   - `PORT`: `8000`
+4. Select a GPU instance (L40s recommended)
+5. Set the health check: `GET /health` on port 8000
 
 ## Technical Specifications
 
@@ -122,36 +138,36 @@ response = client.chat.completions.create(
 - Fine-tuned on financial data
 - English and French support
 
-**Backend:**
+**HF Spaces Backend:**
 - Transformers 4.45.0+
 - PyTorch 2.5.0+ (CUDA 12.4)
-- Accelerate 0.30.0+
 
-**Performance:**
-- Inference: ~15 tokens/second (L4 GPU)
-- Response time: 3-27 seconds
-- Minimum VRAM: 20GB
+**Koyeb Backend:**
+- vLLM 0.6.0+
+- Flash Attention 2
+- CUDA 12.4
 
 **Hardware:**
-- Development: L4x1 GPU (24GB VRAM)
-- Production: L40s GPU (48GB VRAM)
-
-## Recent Improvements
-
-### Code Quality & Hugging Face Best Practices Alignment
-
-This codebase has been optimized to align with Hugging Face inference best practices:
-
-- **Simplified Memory Management**: Removed redundant manual GPU memory cleanup - `device_map="auto"` handles this automatically
-- **Streamlined Token Management**: Hugging Face Hub now auto-detects tokens from environment variables
-- **Auto-Loading Chat Templates**: Leverages transformers 4.45.0+ automatic chat template loading
-- **Automatic Device Placement**: Removed manual device management - `device_map="auto"` handles GPU/CPU placement
-- **Improved Thread Safety**: Enhanced model access checks with thread-safe helpers
-- **Centralized Version Management**: Single source of truth for API version
-
-### Deprecated Functions
-
-- `clear_gpu_memory(model, tokenizer)` - Parameters deprecated, use `clear_gpu_memory()` without arguments
+- Minimum: L4 GPU (24GB VRAM)
+- Recommended: L40s GPU (48GB VRAM)
+
+## Project Structure
+
+```
+.
+├── app/               # Main API application
+│   ├── main.py        # FastAPI app (HF Spaces)
+│   ├── routers/       # API routes
+│   ├── providers/     # Model providers (Transformers)
+│   ├── middleware/    # Rate limiting, auth
+│   └── utils/         # Utilities, stats tracking
+├── Dockerfile         # HF Spaces (Transformers)
+├── Dockerfile.koyeb   # Koyeb (vLLM)
+├── start.sh           # HF Spaces startup
+├── start-vllm.sh      # Koyeb vLLM startup
+├── docs/              # Technical documentation
+└── tests/             # Test suite
+```
 
 ## Development
 
@@ -164,50 +180,15 @@ uvicorn app.main:app --reload --port 8080
 
 ### Testing
 
-**Unit Tests:**
 ```bash
+# Unit tests
 pytest tests/ -v
-```
-
-**Integration Tests:**
-The integration tests evaluate the model's ability to produce valid JSON outputs and execute tool calls, which are critical requirements for financial applications.
 
-```bash
-# Basic API functionality
+# Integration tests
 python tests/integration/test_space_basic.py
-
-# Tool calls and JSON format
-python tests/integration/test_space_with_tools.py
-
-# Detailed tool call validation
 python tests/integration/test_tool_calls.py
 ```
 
-**Test Coverage:**
-- API endpoints (health, models, chat completions)
-- Tool calls with `tool_choice` parameter
-- Structured JSON outputs via `response_format`
-- Model response parsing and validation
-
-These tests verify that the small 8B model can reliably produce valid JSON and execute tool calls, which is mandatory for financial workflows requiring structured data and function execution.
-
-## Project Structure
-
-```
-.
-├── app/               # Main API application
-│   ├── main.py        # FastAPI app
-│   ├── routers/       # API routes
-│   ├── providers/     # Model providers
-│   ├── middleware/    # Rate limiting, auth
-│   └── utils/         # Utilities, stats tracking
-├── docs/              # Documentation
-├── tests/             # Test suite
-│   ├── integration/   # Integration tests (API, tool calls, JSON)
-│   └── performance/   # Performance benchmarks
-└── scripts/           # Utility scripts
-```
-
 ## License
 
 MIT License - see [LICENSE](LICENSE) file.
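
The streaming request body is truncated in the diff above; a complete call following the README's conventions (placeholder endpoint; `"stream": true` is the standard OpenAI-compatible flag) would look like:

```bash
curl -X POST "https://your-endpoint/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DragonLLM/Qwen-Open-Finance-R-8B",
    "messages": [{"role": "user", "content": "Summarize Basel III in one paragraph"}],
    "stream": true
  }'
```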
start-vllm.sh ADDED
@@ -0,0 +1,68 @@
+#!/bin/bash
+# vLLM OpenAI-compatible API server startup script for Koyeb
+# Uses vLLM's native server with all CUDA optimizations
+
+set -e
+
+# Configuration from environment
+MODEL=${MODEL:-"DragonLLM/Qwen-Open-Finance-R-8B"}
+PORT=${PORT:-8000}
+MAX_MODEL_LEN=${MAX_MODEL_LEN:-8192}
+GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION:-0.90}
+
+# HF token (try multiple env var names)
+HF_TOKEN="${HF_TOKEN_LC2:-${HF_TOKEN:-${HUGGING_FACE_HUB_TOKEN:-}}}"
+
+echo "=========================================="
+echo "vLLM OpenAI Server - Koyeb Deployment"
+echo "=========================================="
+echo "Model: $MODEL"
+echo "Port: $PORT"
+echo "Max Model Length: $MAX_MODEL_LEN"
+echo "GPU Memory Utilization: $GPU_MEMORY_UTILIZATION"
+echo "HF Token: ${HF_TOKEN:+set (${#HF_TOKEN} chars)}"
+echo "CUDA Devices: ${CUDA_VISIBLE_DEVICES:-auto}"
+echo "=========================================="
+
+# Check for GPU
+if command -v nvidia-smi &> /dev/null; then
+    echo "GPU Info:"
+    nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader
+    echo "=========================================="
+fi
+
+# Build vLLM serve command with optimizations
+VLLM_ARGS=(
+    "--model" "$MODEL"
+    "--port" "$PORT"
+    "--host" "0.0.0.0"
+    "--dtype" "bfloat16"
+    "--max-model-len" "$MAX_MODEL_LEN"
+    "--gpu-memory-utilization" "$GPU_MEMORY_UTILIZATION"
+    "--trust-remote-code"
+    # Optimization flags
+    "--enable-prefix-caching"          # Cache common prefixes for faster inference
+    "--enable-chunked-prefill"         # Better memory management
+    "--max-num-batched-tokens" "8192"  # Batch optimization
+    "--max-num-seqs" "256"             # Concurrent request handling
+    # Disable logging overhead in production
+    "--disable-log-requests"
+)
+
+# Add HF token if available
+if [ -n "$HF_TOKEN" ]; then
+    export HF_TOKEN
+    export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
+fi
+
+echo "Starting vLLM OpenAI server..."
+echo "Endpoints available:"
+echo "  - POST /v1/chat/completions"
+echo "  - POST /v1/completions"
+echo "  - GET  /v1/models"
+echo "  - GET  /health"
+echo "=========================================="
+
+# Start vLLM server
+exec python -m vllm.entrypoints.openai.api_server "${VLLM_ARGS[@]}"
+
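
Because every tunable is read from the environment, per-deployment overrides need no image rebuild; for example (illustrative values):

```bash
# Smaller context window and more conservative GPU memory headroom
MAX_MODEL_LEN=4096 GPU_MEMORY_UTILIZATION=0.85 ./start-vllm.sh
```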
start.sh ADDED
@@ -0,0 +1,10 @@
+#!/bin/bash
+# Get port from environment variable, default to 7860
+PORT=${PORT:-7860}
+
+# Redirect all output to stderr so it shows in logs
+exec >&2
+
+# Start uvicorn with the specified port
+exec python -m uvicorn app.main:app --host 0.0.0.0 --port "$PORT"
+
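
Run directly, the script mirrors the two deployment targets (a sketch; run from the repo root with the app's dependencies installed):

```bash
./start.sh            # HF Spaces default: port 7860
PORT=8000 ./start.sh  # Koyeb-style override: port 8000
```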
test_deployment.sh DELETED
@@ -1,101 +0,0 @@
-#!/bin/bash
-# Quick deployment test script
-# Tests the new features without requiring the full model to be loaded
-
-set -e
-
-echo "=========================================="
-echo "Testing New Features"
-echo "=========================================="
-echo ""
-
-# Check if server is running
-if ! curl -s http://localhost:8080/health > /dev/null 2>&1; then
-    echo "⚠️  Server not running on localhost:8080"
-    echo "   Start server with: uvicorn app.main:app --host 0.0.0.0 --port 8080"
-    echo ""
-    echo "Or test against deployed instance by setting API_URL:"
-    echo "  export API_URL=https://your-space.hf.space"
-    echo "  ./test_deployment.sh"
-    exit 1
-fi
-
-API_URL="${API_URL:-http://localhost:8080}"
-echo "Testing against: $API_URL"
-echo ""
-
-# Test 1: Health endpoint
-echo "1. Testing /health endpoint..."
-HEALTH=$(curl -s "$API_URL/health")
-if echo "$HEALTH" | grep -q "model_ready"; then
-    echo "   ✓ Health endpoint includes model_ready field"
-    echo "   Response: $HEALTH"
-else
-    echo "   ✗ Health endpoint missing model_ready field"
-    exit 1
-fi
-echo ""
-
-# Test 2: Stats endpoint
-echo "2. Testing /v1/stats endpoint..."
-STATS=$(curl -s "$API_URL/v1/stats")
-if echo "$STATS" | grep -q "total_requests"; then
-    echo "   ✓ Stats endpoint working"
-    echo "   Response preview: $(echo "$STATS" | head -c 200)..."
-else
-    echo "   ✗ Stats endpoint not working"
-    exit 1
-fi
-echo ""
-
-# Test 3: Rate limiting headers
-echo "3. Testing rate limiting headers..."
-HEADERS=$(curl -s -I "$API_URL/v1/models")
-if echo "$HEADERS" | grep -q "X-RateLimit-Limit-Minute"; then
-    echo "   ✓ Rate limit headers present"
-    echo "$HEADERS" | grep "X-RateLimit"
-else
-    echo "   ✗ Rate limit headers missing"
-    exit 1
-fi
-echo ""
-
-# Test 4: Error sanitization
-echo "4. Testing error sanitization..."
-ERROR_RESPONSE=$(curl -s -w "\n%{http_code}" -X POST "$API_URL/v1/chat/completions" \
-    -H "Content-Type: application/json" \
-    -d '{"model":"test","messages":[]}')
-HTTP_CODE=$(echo "$ERROR_RESPONSE" | tail -n1)
-ERROR_BODY=$(echo "$ERROR_RESPONSE" | head -n-1)
-
-if [ "$HTTP_CODE" = "400" ]; then
-    if echo "$ERROR_BODY" | grep -q "messages list cannot be empty"; then
-        echo "   ✓ Error properly formatted (400 with clear message)"
-    else
-        echo "   ⚠️  Got 400 but error message format unexpected"
-    fi
-else
-    echo "   ⚠️  Expected 400, got $HTTP_CODE"
-fi
-echo ""
-
-# Test 5: Root endpoint
-echo "5. Testing / endpoint..."
-ROOT=$(curl -s "$API_URL/")
-if echo "$ROOT" | grep -q "status"; then
-    echo "   ✓ Root endpoint working"
-else
-    echo "   ✗ Root endpoint not working"
-    exit 1
-fi
-echo ""
-
-echo "=========================================="
-echo "✅ All basic tests passed!"
-echo "=========================================="
-echo ""
-echo "Next steps:"
-echo "1. Test with actual model requests (requires model to be loaded)"
-echo "2. Test rate limiting by making 31 requests in a minute"
-echo "3. Check stats endpoint after making some requests"
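
The deleted script's smoke checks largely overlap with the retained test suite; the equivalent manual sequence (commands from the README's Testing section) is:

```bash
pytest tests/ -v
python tests/integration/test_space_basic.py
python tests/integration/test_tool_calls.py
```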