Andrew McCracken, Claude committed on
Commit bfa102d · 1 Parent(s): 457c9e1

Add GPU support


Added GPU-enabled Docker configuration:
- Dockerfile.base.gpu: CUDA 12.1 base with llama-cpp-python GPU support
- Dockerfile.gpu: HF Spaces GPU deployment dockerfile
- build-and-push-gpu.sh: Script to build and push GPU image
- Updated llm_handler.py to use N_GPU_LAYERS env variable

To use GPU:
1. Build: ./build-and-push-gpu.sh
2. Switch HF Space to GPU hardware
3. Use Dockerfile.gpu for deployment

Expected speedup: ~15s → 1-3s per response
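
In shell terms, a minimal sketch of those three steps (these mirror the next-steps commands that build-and-push-gpu.sh prints below, and assume the HF Space repo is the current checkout):

    ./build-and-push-gpu.sh        # build and push techdaskalos/cybersecchatbot:gpu
    cp Dockerfile.gpu Dockerfile   # point the Space at the GPU image
    git add Dockerfile
    git commit -m "Switch to GPU-enabled image"
    git push                       # Space rebuilds; select GPU hardware in Space settings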

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Files changed (4)
  1. Dockerfile.base.gpu +52 -0
  2. Dockerfile.gpu +26 -0
  3. build-and-push-gpu.sh +64 -0
  4. llm_handler.py +11 -2
Dockerfile.base.gpu ADDED
@@ -0,0 +1,52 @@
+ FROM nvidia/cuda:12.1.0-devel-ubuntu22.04
+
+ WORKDIR /app
+
+ # Install Python and system dependencies
+ RUN apt-get update && apt-get install -y \
+     python3.11 \
+     python3.11-dev \
+     python3-pip \
+     build-essential \
+     cmake \
+     git \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Set Python 3.11 as default
+ RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1 && \
+     update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1
+
+ # Upgrade pip
+ RUN python -m pip install --upgrade pip
+
+ # Copy requirements and install
+ COPY requirements.txt .
+
+ # Install llama-cpp-python with CUDA support
+ RUN CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python --no-cache-dir
+
+ # Install remaining dependencies
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy application code
+ COPY . .
+
+ # Create data directory for persistence
+ RUN mkdir -p /data
+
+ # Set environment variables
+ ENV PYTHONUNBUFFERED=1
+ ENV MODEL_REPO=daskalos-apps/phi4-cybersec-Q4_K_M
+ ENV MODEL_FILENAME=phi4-mini-instruct-Q4_K_M.gguf
+ ENV USE_RAG=false
+ ENV CACHE_ENABLED=true
+
+ # Expose port
+ EXPOSE 8000
+
+ # Health check
+ HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
+     CMD python -c "import requests; requests.get('http://localhost:8000/health')"
+
+ # Run the application
+ CMD ["python", "main.py"]
Dockerfile.gpu ADDED
@@ -0,0 +1,26 @@
+ # Use pre-built GPU image from Docker Hub
+ # Build this image locally with: docker buildx build --platform linux/amd64 -f Dockerfile.base.gpu -t techdaskalos/cybersecchatbot:gpu . --push
+ FROM techdaskalos/cybersecchatbot:gpu
+
+ # Environment variables (already set in base image, but can override)
+ ENV PYTHONUNBUFFERED=1
+ ENV MODEL_REPO=daskalos-apps/phi4-cybersec-Q4_K_M
+ ENV MODEL_FILENAME=phi4-mini-instruct-Q4_K_M.gguf
+ ENV USE_RAG=false
+ ENV CACHE_ENABLED=true
+
+ # GPU configuration - offload all layers to GPU
+ ENV N_GPU_LAYERS=35
+
+ # Set Hugging Face cache to /data for persistence and write permissions
+ ENV HF_HOME=/data/huggingface
+
+ # Ensure all required directories exist and are writable
+ RUN mkdir -p /data /app/models /app/knowledge_db /data/huggingface/hub /data/huggingface/transformers && \
+     chmod -R 777 /data /app/models /app/knowledge_db
+
+ # Copy test interface (needed for /test endpoint)
+ COPY test_interface.html /app/
+
+ EXPOSE 8000
+ CMD ["python", "main.py"]
build-and-push-gpu.sh ADDED
@@ -0,0 +1,64 @@
+ #!/bin/bash
+ set -e
+
+ # Configuration
+ DOCKER_USERNAME="techdaskalos"
+ IMAGE_NAME="cybersecchatbot"
+ VERSION="${1:-gpu}"
+ FULL_IMAGE="$DOCKER_USERNAME/$IMAGE_NAME:$VERSION"
+
+ echo "🏗️ Building GPU Docker image: $FULL_IMAGE"
+ echo "================================"
+
+ # Build the image for GPU
+ docker buildx build --platform linux/amd64 -f Dockerfile.base.gpu -t "$FULL_IMAGE" .
+
+ echo ""
+ echo "✅ Build complete!"
+ echo ""
+ echo "🧪 Testing the image locally (requires NVIDIA GPU)..."
+ echo "   Run: docker run --gpus all -p 8000:8000 $FULL_IMAGE"
+ echo "   Then visit: http://localhost:8000/test"
+ echo ""
+
+ read -p "Would you like to test locally before pushing? (y/n) " -n 1 -r
+ echo
+ if [[ $REPLY =~ ^[Yy]$ ]]; then
+     echo "Starting local test server with GPU support..."
+     echo "Press Ctrl+C to stop when done testing"
+     docker run --gpus all -p 8000:8000 "$FULL_IMAGE"
+ fi
+
+ echo ""
+ read -p "Push to Docker Hub? (y/n) " -n 1 -r
+ echo
+ if [[ $REPLY =~ ^[Yy]$ ]]; then
+     echo "📤 Pushing to Docker Hub..."
+
+     # Check if logged in
+     if ! docker info | grep -q "Username: $DOCKER_USERNAME"; then
+         echo "Please login to Docker Hub:"
+         docker login
+     fi
+
+     docker push "$FULL_IMAGE"
+
+     echo ""
+     echo "✅ Successfully pushed: $FULL_IMAGE"
+     echo ""
+     echo "📝 Next steps:"
+     echo "   1. Update your HF Space Dockerfile to:"
+     echo "      FROM $FULL_IMAGE"
+     echo ""
+     echo "   2. Update HF Space to use GPU hardware"
+     echo ""
+     echo "   3. Commit and push to HF Spaces:"
+     echo "      cp Dockerfile.gpu Dockerfile"
+     echo "      git add Dockerfile"
+     echo "      git commit -m \"Switch to GPU-enabled image: $FULL_IMAGE\""
+     echo "      git push"
+     echo ""
+     echo "   Your HF Space will deploy with GPU acceleration!"
+ else
+     echo "Skipped push. Image is ready locally as: $FULL_IMAGE"
+ fi
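
Usage: the single optional argument becomes the image tag (VERSION defaults to gpu):

    ./build-and-push-gpu.sh            # builds techdaskalos/cybersecchatbot:gpu
    ./build-and-push-gpu.sh gpu-v2     # builds techdaskalos/cybersecchatbot:gpu-v2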
llm_handler.py CHANGED
@@ -49,12 +49,21 @@ class CybersecurityLLM:

          # Initialize llama.cpp with the model
          logger.info("Initializing model...")
+
+         # Check for GPU support via environment variable
+         n_gpu_layers = int(os.getenv("N_GPU_LAYERS", "0"))
+
+         if n_gpu_layers > 0:
+             logger.info(f"GPU acceleration enabled: {n_gpu_layers} layers")
+         else:
+             logger.info("Running in CPU-only mode")
+
          self.llm = Llama(
              model_path=model_path,
              n_ctx=4096,         # Context window
              n_batch=512,        # Batch size for prompt processing
-             n_threads=6,        # Use 6 of 8 vCPUs (leave 2 for system)
-             n_gpu_layers=0,     # CPU only
+             n_threads=6 if n_gpu_layers == 0 else 4,   # Fewer threads needed with GPU
+             n_gpu_layers=n_gpu_layers,                 # GPU layers (0 for CPU-only)
              seed=-1,            # Random seed
              f16_kv=True,        # Use f16 for key/value cache (saves memory)
              logits_all=False,   # Only compute logits for last token
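
Because the layer count is now read from the environment at startup (the hunk assumes os is imported at the top of llm_handler.py, which this diff does not show), one image serves both modes at run time. For example, with the image tag from this commit:

    # GPU mode: offload 35 layers, use 4 CPU threads
    docker run --gpus all -e N_GPU_LAYERS=35 -p 8000:8000 techdaskalos/cybersecchatbot:gpu

    # CPU fallback: N_GPU_LAYERS=0 restores the old 6-thread, CPU-only behaviour
    docker run -e N_GPU_LAYERS=0 -p 8000:8000 techdaskalos/cybersecchatbot:gpu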