# HuggingFace Spaces GPU Setup Guide πŸš€
This guide will help you enable GPU acceleration for GRDN AI on HuggingFace Spaces with your Nvidia T4 grant.
## Prerequisites
- HuggingFace Space with GPU enabled (Nvidia T4 small: 4 vCPU, 15GB RAM, 16GB VRAM)
- Model files uploaded to your Space
## Setup Steps
### 1. Enable GPU in Space Settings
1. Go to your Space settings on HuggingFace
2. Navigate to "Hardware" section
3. Select "T4 small" (or your granted GPU tier)
4. Save changes
### 2. Upload Model Files
Your Space needs the GGUF model files in the `src/models/` directory:
- `llama-2-7b-chat.Q4_K_M.gguf` (for Llama 2)
- `decilm-7b-uniform-gqa-q8_0.gguf` (for DeciLM)
You can upload these via:
- HuggingFace web interface (Files tab)
- Git LFS (recommended for large files)
- HuggingFace Hub CLI or the `huggingface_hub` Python client (see the example below)
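If you'd rather script the upload, the `huggingface_hub` Python client can push a model file straight into a Space. This is a minimal sketch; the `repo_id` is a placeholder you must replace with your own Space:

```python
from huggingface_hub import HfApi

# Assumes you are authenticated (e.g. via `huggingface-cli login` or an
# HF_TOKEN environment variable).
api = HfApi()
api.upload_file(
    path_or_fileobj="llama-2-7b-chat.Q4_K_M.gguf",          # local file
    path_in_repo="src/models/llama-2-7b-chat.Q4_K_M.gguf",  # destination in the Space
    repo_id="your-username/your-space",                     # placeholder
    repo_type="space",
)
```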
### 3. Install Dependencies
Make sure your Space has the updated `requirements.txt`, which includes:
```
torch>=2.0.0
```
### 4. Verify GPU Detection
Once your Space restarts, check the sidebar in the app for:
- πŸš€ **GPU Acceleration: ENABLED** - GPU is working!
- ⚠️ **GPU Acceleration: DISABLED** - the app has fallen back to CPU; see Troubleshooting below
You should also see in the logs:
```
πŸ€— Running on HuggingFace Spaces
πŸš€ GPU detected: Tesla T4 with 15.xx GB memory
πŸš€ Will offload all layers to GPU (n_gpu_layers=-1)
βœ… GPU acceleration ENABLED with -1 layers
```
## How It Works
The app now automatically:
1. **Detects HuggingFace Spaces environment** via `SPACE_ID` or `SPACE_AUTHOR_NAME` env variables
2. **Checks for GPU availability** using PyTorch's `torch.cuda.is_available()`
3. **Configures LlamaCPP** to use GPU with `n_gpu_layers=-1` (all layers on GPU)
4. **Shows status** in the sidebar UI
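A minimal sketch of that flow, assuming the behavior described above (the function name mirrors `detect_gpu_and_environment()` from the Code Changes Summary below, but this body is illustrative, not the app's exact code):

```python
import os

def detect_gpu_and_environment():
    """Illustrative sketch of the detection flow described above."""
    # 1. HuggingFace Spaces sets SPACE_ID / SPACE_AUTHOR_NAME in the environment.
    on_spaces = bool(os.environ.get("SPACE_ID") or os.environ.get("SPACE_AUTHOR_NAME"))

    # 2. GPU availability is checked through PyTorch.
    try:
        import torch
        gpu_available = torch.cuda.is_available()
    except ImportError:
        gpu_available = False

    # 3. LlamaCPP offloading: -1 = all layers on GPU, 0 = CPU only.
    n_gpu_layers = -1 if gpu_available else 0
    return on_spaces, gpu_available, n_gpu_layers
```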
### GPU Configuration
- **CPU Mode**: `n_gpu_layers=0` - All computation on CPU (slow)
- **GPU Mode**: `n_gpu_layers=-1` - All model layers offloaded to GPU (fast)
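In llama-cpp-python terms (assuming that library is the backend behind the app's LlamaCPP wrapper), the two modes differ by a single constructor argument:

```python
from llama_cpp import Llama

# GPU mode: offload every model layer; pass n_gpu_layers=0 for CPU mode instead.
llm = Llama(
    model_path="src/models/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=-1,
)
```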
## Performance Expectations
With GPU acceleration on Nvidia T4:
- **Response time**: ~2-5 seconds (vs 30-60+ seconds on CPU)
- **Token generation**: ~20-50 tokens/sec (vs 1-3 tokens/sec on CPU)
- **Memory**: Model fits comfortably in 16GB VRAM
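To measure throughput yourself, a rough tokens-per-second check might look like this (a sketch assuming the llama-cpp-python backend; the prompt is just an example):

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="src/models/llama-2-7b-chat.Q4_K_M.gguf", n_gpu_layers=-1)

start = time.perf_counter()
result = llm("Name three companion plants for tomatoes.", max_tokens=128)
elapsed = time.perf_counter() - start

n_tokens = result["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```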
## Troubleshooting
### GPU Not Detected
1. **Check Space hardware**: Ensure T4 is selected in settings
2. **Check logs**: Look for GPU detection messages
3. **Verify torch installation**: `torch.cuda.is_available()` should return `True`
4. **Try restarting**: A Space restart is sometimes required after a hardware change
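Item 3's check can be run from a Python console on the Space:

```python
import torch

print(torch.cuda.is_available())  # should print True on a working T4 Space
print(torch.version.cuda)         # CUDA version torch was built with (None on CPU-only builds)
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4"
```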
### Model File Not Found
If you see: `⚠️ Model not found at src/models/...`
- Upload the model files to the correct path
- Check file names match exactly
- Ensure the files weren't corrupted or truncated during upload (see the sanity check below)
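A quick sanity check covers both path and size (a Q4_K_M 7B model is roughly 4 GB; a file of only a few hundred bytes is likely a Git LFS pointer rather than the actual weights):

```python
from pathlib import Path

model_path = Path("src/models/llama-2-7b-chat.Q4_K_M.gguf")
if model_path.exists():
    print(f"Found: {model_path.stat().st_size / 1e9:.2f} GB")
else:
    print(f"Missing: {model_path}")
```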
### Out of Memory Errors
If GPU runs out of memory:
- The quantized models (Q4_K_M, q8_0) are designed to fit in 16GB
- Try restarting the Space
- Check if other processes are using GPU memory
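PyTorch can report how much VRAM is actually free:

```python
import torch

free_b, total_b = torch.cuda.mem_get_info()  # bytes, for the current device
print(f"Free: {free_b / 1e9:.1f} GB of {total_b / 1e9:.1f} GB")
```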
### Still Slow After GPU Setup
1. Verify GPU is actually being used (check logs)
2. Ensure `n_gpu_layers=-1` is set (check initialization logs)
3. Check HuggingFace Space isn't in "Sleeping" mode
4. Verify model is fully loaded before making requests
## Code Changes Summary
The following changes enable automatic GPU detection:
1. **`src/backend/chatbot.py`**:
- Added `detect_gpu_and_environment()` function
- Modified `init_llm()` to use dynamic GPU configuration
- Automatic path detection for HF Spaces vs local
2. **`app.py`**:
- Added GPU status indicator in sidebar
   - Shows real-time GPU availability (sketched after this list)
3. **`src/requirements.txt`**:
- Added `torch>=2.0.0` for GPU detection
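For reference, the sidebar indicator from item 2 can be as simple as this Streamlit sketch (illustrative; the actual wording in `app.py` may differ):

```python
import streamlit as st
import torch

if torch.cuda.is_available():
    st.sidebar.success("🚀 GPU Acceleration: ENABLED")
else:
    st.sidebar.warning("⚠️ GPU Acceleration: DISABLED")
```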
## Testing Locally
To test GPU detection locally (if you have an Nvidia GPU):
```bash
# Install CUDA-enabled PyTorch
pip install torch --index-url https://download.pytorch.org/whl/cu118
# Run the app
streamlit run app.py
```
Without GPU locally, you'll see:
```
⚠️ No GPU detected via torch.cuda
⚠️ Running on CPU (no GPU detected)
```
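Because environment detection keys off the `SPACE_ID`/`SPACE_AUTHOR_NAME` variables (see "How It Works"), you can also exercise the Spaces code path locally by setting one of them before the detection code runs, e.g. at the top of a local test script. The import path here is a guess based on the file layout above:

```python
import os

# Hypothetical local test: any non-empty value should work if detection
# only checks that the variable is present.
os.environ["SPACE_ID"] = "local-test"

from src.backend.chatbot import detect_gpu_and_environment  # noqa: E402
print(detect_gpu_and_environment())
```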
## Additional Resources
- [HuggingFace Spaces Hardware Documentation](https://huggingface.co/docs/hub/spaces-gpus)
- [LlamaCPP GPU Acceleration Guide](https://github.com/ggerganov/llama.cpp#cublas)
- [PyTorch CUDA Setup](https://pytorch.org/get-started/locally/)
---
**Note**: This GPU setup is backward compatible - the app will still work on CPU if no GPU is available!