Commit 0d30de3
Parent(s): e3878fa

Update Dockerfile.koyeb to use official vLLM base image

- Based on vllm/vllm-openai:latest (Koyeb's proven approach)
- Includes all CUDA/vLLM optimizations out of the box
- Flash Attention 2, PagedAttention, continuous batching
- Model args embedded in CMD

Files changed:
- Dockerfile.koyeb +19 -49
- KOYEB_VLLM_DEPLOYMENT.md +43 -49
- start-vllm.sh +51 -35
Dockerfile.koyeb CHANGED

@@ -1,57 +1,27 @@
-# Koyeb-optimized Dockerfile using vLLM
-#
-FROM
-#
-    VLLM_ATTENTION_BACKEND=FLASH_ATTN \
-    CUDA_VISIBLE_DEVICES=0
-#
-    apt-get install -y --no-install-recommends \
-    python3.11 \
-    python3.11-dev \
-    python3-pip \
-    git \
-    curl && \
-    rm -rf /var/lib/apt/lists/* && \
-    update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 && \
-    update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1 && \
-    python3 -m pip install --upgrade pip
-# Install PyTorch with CUDA 12.4
-RUN pip install --no-cache-dir \
-    torch>=2.5.0 \
-    --index-url https://download.pytorch.org/whl/cu124
-# Install vLLM with all CUDA optimizations
-# vLLM includes: Flash Attention, PagedAttention, continuous batching, CUDA graphs
-RUN pip install --no-cache-dir \
-    vllm>=0.6.0 \
-    huggingface-hub>=0.20.0
-# Create non-root user and cache directories
-RUN useradd -m -u 1000 user && \
-    mkdir -p /tmp/huggingface /tmp/vllm && \
-    chown -R user:user /app /tmp/huggingface /tmp/vllm
-# Copy startup script
-COPY start-vllm.sh /app/start-vllm.sh
-RUN chmod +x /app/start-vllm.sh && chown user:user /app/start-vllm.sh
-USER user
-# vLLM OpenAI server default port
+# Koyeb-optimized Dockerfile using official vLLM OpenAI image
+# Based on Koyeb's proven vLLM deployment approach
+
+FROM vllm/vllm-openai:latest
+
+# Environment variables
+ENV HF_HOME=/tmp/huggingface \
+    VLLM_ATTENTION_BACKEND=FLASH_ATTN
+
+# Create cache directories with proper permissions
+USER root
+RUN mkdir -p /tmp/huggingface && chmod 777 /tmp/huggingface
+
+# Switch back to default user
+USER 1000
+
+# Expose vLLM default port
 EXPOSE 8000
+
+# Default model and settings - can be overridden via Koyeb env/args
+ENV MODEL="DragonLLM/Qwen-Open-Finance-R-8B"
+ENV MAX_MODEL_LEN="8192"
+ENV DTYPE="bfloat16"
+
+# Use vLLM's native OpenAI server entrypoint
+# Model is specified via environment or command args
+CMD ["--model", "DragonLLM/Qwen-Open-Finance-R-8B", "--trust-remote-code", "--dtype", "bfloat16", "--max-model-len", "8192", "--gpu-memory-utilization", "0.90"]
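The rebuilt image can be smoke-tested outside Koyeb before deploying. A minimal sketch, assuming a local NVIDIA GPU with the NVIDIA Container Toolkit installed and reusing the `jeanbapt/dragon-llm-inference:vllm` tag from the deployment docs:

```bash
# Build the image from the repository root
docker build -f Dockerfile.koyeb -t jeanbapt/dragon-llm-inference:vllm .

# Run it locally; the base image's entrypoint is vLLM's OpenAI server,
# so the CMD arguments above are passed straight through to it.
# HF_TOKEN must grant access to the gated model.
docker run --rm --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  jeanbapt/dragon-llm-inference:vllm
```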
KOYEB_VLLM_DEPLOYMENT.md CHANGED

@@ -2,61 +2,51 @@
 
 ## Overview
 
-The Koyeb deployment uses **vLLM's
+The Koyeb deployment uses **vLLM's official Docker image** (`vllm/vllm-openai`) for maximum compatibility and performance.
 
-**Public image on Docker Hub:**
-```
-jeanbapt/dragon-llm-inference:vllm
-```
-
-Built from `Dockerfile.koyeb` with:
-- NVIDIA CUDA 12.4 base
-- vLLM 0.6.0+ with all optimizations
-- Native OpenAI-compatible server
-
-|---------|---------|
-| **Flash Attention 2** | Faster attention computation |
-| **PagedAttention** | Efficient KV cache management |
-| **Continuous Batching** | Handle multiple requests simultaneously |
-| **Prefix Caching** | Reuse KV cache for common prefixes |
-| **Chunked Prefill** | Better memory utilization |
-| **CUDA Graphs** | Reduced kernel launch overhead |
+## Koyeb Configuration
+
+### Using Official vLLM Image (Recommended)
+
+**Docker Image:** `vllm/vllm-openai:latest`
+
+**Command args:**
+```
+--model DragonLLM/Qwen-Open-Finance-R-8B --trust-remote-code --dtype bfloat16 --max-model-len 8192
+```
 
 ### Environment Variables
 
 | Variable | Value | Description |
 |----------|-------|-------------|
-| `PORT` | `8000` | Server port |
-| `MAX_MODEL_LEN` | `8192` | Maximum context length |
-| `GPU_MEMORY_UTILIZATION` | `0.90` | GPU memory usage (90%) |
+| `HF_TOKEN` | (secret) | Hugging Face token for gated model |
+| `VLLM_API_KEY` | (optional) | API key to protect the endpoint |
 
 ### Instance Type
 
 - **Recommended**: `gpu-nvidia-l40s` (48GB VRAM)
+- **Region**: `na` (North America) - where L40s is most available
 
 ### Health Check
 
+- **Type**: TCP
 - **Port**: 8000
-- **Grace Period**:
-- **Interval**: 60s
+- **Grace Period**: 900 seconds (15 minutes for model loading)
+
+## Koyeb Dashboard Setup
+
+1. **Create new service** in `dragon-llm` app
+2. **Docker image**: `vllm/vllm-openai:latest`
+3. **Command args**: `--model DragonLLM/Qwen-Open-Finance-R-8B --trust-remote-code --dtype bfloat16 --max-model-len 8192`
+4. **Environment**: Add `HF_TOKEN` secret (your HuggingFace token)
+5. **Instance**: `gpu-nvidia-l40s` in `na` region
+6. **Port**: 8000 (HTTP)
+7. **Health check**: TCP on port 8000, grace period 900s
+
+## API Endpoints (vLLM Native)
 
 ```
-POST /v1/chat/completions - Chat completions
+POST /v1/chat/completions - Chat completions (OpenAI compatible)
 POST /v1/completions - Text completions
 GET /v1/models - List models
 GET /health - Health check
@@ -69,33 +59,37 @@ from openai import OpenAI
 
 client = OpenAI(
     base_url="https://dragon-llm-dealexmachina.koyeb.app/v1",
-    api_key="not-needed"
+    api_key="not-needed"  # or your VLLM_API_KEY
 )
 
 response = client.chat.completions.create(
     model="DragonLLM/Qwen-Open-Finance-R-8B",
     messages=[
-        {"role": "user", "content": "Analyze the impact of rising interest rates
+        {"role": "user", "content": "Analyze the impact of rising interest rates"}
     ],
     temperature=0.7,
     max_tokens=1024
 )
-
-print(response.choices[0].message.content)
 ```
 
-# Build vLLM image
-docker build -f Dockerfile.koyeb -t jeanbapt/dragon-llm-inference:vllm .
-
-- **Subsequent requests**: Benefit from batching, KV cache reuse, CUDA graphs
-- **L40s GPU**: 48GB VRAM provides ample room for 8B model with long context
+## Troubleshooting
+
+### "Application exited with code 8" with no logs
+
+This usually means GPU allocation failed at the hypervisor level. Try:
+1. Different region (try `na` for L40s availability)
+2. Different GPU type (`gpu-nvidia-a100`)
+3. Wait and retry later (GPU availability varies)
+
+### Model download issues
+
+Ensure `HF_TOKEN` is set and the token has access to the gated model.
+
+## Custom Image (Alternative)
+
+If you prefer a custom image, use:
+```
+jeanbapt/dragon-llm-inference:vllm
+```
+
+Built from `Dockerfile.koyeb` in this repository.
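The deployed endpoint can also be exercised without the Python client. A curl sketch against the same Koyeb URL used above (the `Authorization` header is only needed if `VLLM_API_KEY` is set on the service):

```bash
curl -s https://dragon-llm-dealexmachina.koyeb.app/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d '{
    "model": "DragonLLM/Qwen-Open-Finance-R-8B",
    "messages": [{"role": "user", "content": "Analyze the impact of rising interest rates"}],
    "temperature": 0.7,
    "max_tokens": 1024
  }'
```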
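Because the TCP health check allows a 900-second grace period for model download and load, it can help to wait for `/health` before sending traffic. A small polling sketch against the same deployment URL:

```bash
# Poll until vLLM reports healthy; first start-up includes the model download.
until curl -sf https://dragon-llm-dealexmachina.koyeb.app/health > /dev/null; do
  echo "Waiting for vLLM to finish loading the model..."
  sleep 15
done
echo "vLLM is ready"
```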
start-vllm.sh CHANGED

@@ -2,7 +2,15 @@
 # vLLM OpenAI-compatible API server startup script for Koyeb
 # Uses vLLM's native server with all CUDA optimizations
 
+# Redirect all output to stderr for Koyeb logs
+exec 2>&1
+
+echo "=========================================="
+echo "vLLM OpenAI Server - Starting"
+echo "=========================================="
+echo "Date: $(date)"
+echo "User: $(whoami)"
+echo "PWD: $(pwd)"
 
 # Configuration from environment
 MODEL=${MODEL:-"DragonLLM/Qwen-Open-Finance-R-8B"}
@@ -14,55 +22,63 @@ GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION:-0.90}
 HF_TOKEN="${HF_TOKEN_LC2:-${HF_TOKEN:-${HUGGING_FACE_HUB_TOKEN:-}}}"
 
 echo "=========================================="
-echo "HF Token: ${HF_TOKEN:+set (${#HF_TOKEN} chars)}"
-echo "CUDA Devices: ${CUDA_VISIBLE_DEVICES:-auto}"
+echo "Configuration:"
+echo "  Model: $MODEL"
+echo "  Port: $PORT"
+echo "  Max Model Length: $MAX_MODEL_LEN"
+echo "  GPU Memory Utilization: $GPU_MEMORY_UTILIZATION"
+echo "  HF Token: ${HF_TOKEN:+set (${#HF_TOKEN} chars)}"
 echo "=========================================="
 
+# Check Python
+echo "Checking Python..."
+which python || { echo "ERROR: python not found!"; exit 1; }
+python --version
+
+# Check vLLM
+echo "Checking vLLM installation..."
+python -c "import vllm; print(f'vLLM version: {vllm.__version__}')" || {
+    echo "ERROR: vLLM not installed correctly!"
+    exit 1
+}
+
 # Check for GPU
+echo "=========================================="
+echo "GPU Information:"
 if command -v nvidia-smi &> /dev/null; then
-    nvidia-smi
+    nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader || echo "nvidia-smi failed"
+    nvidia-smi || echo "nvidia-smi full output failed"
+else
+    echo "WARNING: nvidia-smi not found - GPU may not be available!"
 fi
+echo "=========================================="
 
-VLLM_ARGS=(
-    "--model" "$MODEL"
-    "--port" "$PORT"
-    "--host" "0.0.0.0"
-    "--dtype" "bfloat16"
-    "--max-model-len" "$MAX_MODEL_LEN"
-    "--gpu-memory-utilization" "$GPU_MEMORY_UTILIZATION"
-    "--trust-remote-code"
-    # Optimization flags
-    "--enable-prefix-caching"          # Cache common prefixes for faster inference
-    "--enable-chunked-prefill"         # Better memory management
-    "--max-num-batched-tokens" "8192"  # Batch optimization
-    "--max-num-seqs" "256"             # Concurrent request handling
-    # Disable logging overhead in production
-    "--disable-log-requests"
-)
-
-# Add HF token if available
+# Set HF token for model download
 if [ -n "$HF_TOKEN" ]; then
     export HF_TOKEN
     export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
+    echo "HF Token exported for model download"
+else
+    echo "WARNING: No HF token set - model download may fail for gated models!"
 fi
 
+echo "=========================================="
 echo "Starting vLLM OpenAI server..."
-echo "Endpoints
+echo "Endpoints:"
 echo " - POST /v1/chat/completions"
 echo " - POST /v1/completions"
 echo " - GET /v1/models"
 echo " - GET /health"
 echo "=========================================="
 
-exec python -m vllm.entrypoints.openai.api_server
+# Build vLLM serve command
+exec python -m vllm.entrypoints.openai.api_server \
+    --model "$MODEL" \
+    --port "$PORT" \
+    --host "0.0.0.0" \
+    --dtype "bfloat16" \
+    --max-model-len "$MAX_MODEL_LEN" \
+    --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" \
+    --trust-remote-code \
+    --enable-prefix-caching \
+    --disable-log-requests
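Every setting in the script falls back to a default, so a deployment can override behavior purely through the environment. A usage sketch with the variable names defined in the script:

```bash
# Run the startup script with explicit overrides (all optional)
MODEL="DragonLLM/Qwen-Open-Finance-R-8B" \
PORT=8000 \
MAX_MODEL_LEN=8192 \
GPU_MEMORY_UTILIZATION=0.90 \
./start-vllm.sh
```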