# HuggingFace Spaces GPU Setup Guide πŸš€
This guide will help you enable GPU acceleration for GRDN AI on HuggingFace Spaces with your Nvidia T4 grant.
## Prerequisites
- HuggingFace Space with GPU enabled (Nvidia T4 small: 4 vCPU, 15GB RAM, 16GB VRAM)
- Model files uploaded to your Space
## Setup Steps
### 1. Enable GPU in Space Settings
1. Go to your Space settings on HuggingFace
2. Navigate to "Hardware" section
3. Select "T4 small" (or your granted GPU tier)
4. Save changes
### 2. Upload Model Files
Your Space needs the GGUF model files in the `src/models/` directory:
- `llama-2-7b-chat.Q4_K_M.gguf` (for Llama 2)
- `decilm-7b-uniform-gqa-q8_0.gguf` (for DeciLM)
You can upload these via:
- HuggingFace web interface (Files tab)
- Git LFS (recommended for large files)
- HuggingFace Hub CLI or the `huggingface_hub` Python client (see the example below)
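If you'd rather script the upload, the `huggingface_hub` Python client can push a model file straight into a Space. This is a minimal sketch; the `repo_id` is a placeholder you must replace with your own Space:

```python
from huggingface_hub import HfApi

# Assumes you are authenticated (e.g. via `huggingface-cli login` or an
# HF_TOKEN environment variable).
api = HfApi()
api.upload_file(
    path_or_fileobj="llama-2-7b-chat.Q4_K_M.gguf",          # local file
    path_in_repo="src/models/llama-2-7b-chat.Q4_K_M.gguf",  # destination in the Space
    repo_id="your-username/your-space",                     # placeholder
    repo_type="space",
)
```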
### 3. Install Dependencies
Make sure your Space has the updated `requirements.txt`, which includes:
```
torch>=2.0.0
```
### 4. Verify GPU Detection
Once your Space restarts, check the sidebar in the app for:
- πŸš€ **GPU Acceleration: ENABLED** - GPU is working!
- ⚠️ **GPU Acceleration: DISABLED** - the app has fallen back to CPU; see Troubleshooting below
You should also see in the logs:
```
πŸ€— Running on HuggingFace Spaces
πŸš€ GPU detected: Tesla T4 with 15.xx GB memory
πŸš€ Will offload all layers to GPU (n_gpu_layers=-1)
βœ… GPU acceleration ENABLED with -1 layers
```
## How It Works
The app now automatically:
1. **Detects HuggingFace Spaces environment** via `SPACE_ID` or `SPACE_AUTHOR_NAME` env variables
2. **Checks for GPU availability** using PyTorch's `torch.cuda.is_available()`
3. **Configures LlamaCPP** to use GPU with `n_gpu_layers=-1` (all layers on GPU)
4. **Shows status** in the sidebar UI
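A minimal sketch of that flow, assuming the behavior described above (the function name mirrors `detect_gpu_and_environment()` from the Code Changes Summary below, but this body is illustrative, not the app's exact code):

```python
import os

def detect_gpu_and_environment():
    """Illustrative sketch of the detection flow described above."""
    # 1. HuggingFace Spaces sets SPACE_ID / SPACE_AUTHOR_NAME in the environment.
    on_spaces = bool(os.environ.get("SPACE_ID") or os.environ.get("SPACE_AUTHOR_NAME"))

    # 2. GPU availability is checked through PyTorch.
    try:
        import torch
        gpu_available = torch.cuda.is_available()
    except ImportError:
        gpu_available = False

    # 3. LlamaCPP offloading: -1 = all layers on GPU, 0 = CPU only.
    n_gpu_layers = -1 if gpu_available else 0
    return on_spaces, gpu_available, n_gpu_layers
```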
### GPU Configuration
- **CPU Mode**: `n_gpu_layers=0` - All computation on CPU (slow)
- **GPU Mode**: `n_gpu_layers=-1` - All model layers offloaded to GPU (fast)
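In llama-cpp-python terms (assuming that library is the backend behind the app's LlamaCPP wrapper), the two modes differ by a single constructor argument:

```python
from llama_cpp import Llama

# GPU mode: offload every model layer; pass n_gpu_layers=0 for CPU mode instead.
llm = Llama(
    model_path="src/models/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=-1,
)
```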
## Performance Expectations
With GPU acceleration on Nvidia T4:
- **Response time**: ~2-5 seconds (vs 30-60+ seconds on CPU)
- **Token generation**: ~20-50 tokens/sec (vs 1-3 tokens/sec on CPU)
- **Memory**: Model fits comfortably in 16GB VRAM
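To measure throughput yourself, a rough tokens-per-second check might look like this (a sketch assuming the llama-cpp-python backend; the prompt is just an example):

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="src/models/llama-2-7b-chat.Q4_K_M.gguf", n_gpu_layers=-1)

start = time.perf_counter()
result = llm("Name three companion plants for tomatoes.", max_tokens=128)
elapsed = time.perf_counter() - start

n_tokens = result["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```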
## Troubleshooting
### GPU Not Detected
1. **Check Space hardware**: Ensure T4 is selected in settings
2. **Check logs**: Look for GPU detection messages
3. **Verify torch installation**: `torch.cuda.is_available()` should return `True`
4. **Try restarting**: A Space restart is sometimes required after a hardware change
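Item 3's check can be run from a Python console on the Space:

```python
import torch

print(torch.cuda.is_available())  # should print True on a working T4 Space
print(torch.version.cuda)         # CUDA version torch was built with (None on CPU-only builds)
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4"
```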
### Model File Not Found
If you see: `⚠️ Model not found at src/models/...`
- Upload the model files to the correct path
- Check file names match exactly
- Ensure the files weren't corrupted or truncated during upload (see the sanity check below)
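A quick sanity check covers both path and size (a Q4_K_M 7B model is roughly 4 GB; a file of only a few hundred bytes is likely a Git LFS pointer rather than the actual weights):

```python
from pathlib import Path

model_path = Path("src/models/llama-2-7b-chat.Q4_K_M.gguf")
if model_path.exists():
    print(f"Found: {model_path.stat().st_size / 1e9:.2f} GB")
else:
    print(f"Missing: {model_path}")
```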
### Out of Memory Errors
If GPU runs out of memory:
- The quantized models (Q4_K_M, q8_0) are designed to fit in 16GB
- Try restarting the Space
- Check if other processes are using GPU memory
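PyTorch can report how much VRAM is actually free:

```python
import torch

free_b, total_b = torch.cuda.mem_get_info()  # bytes, for the current device
print(f"Free: {free_b / 1e9:.1f} GB of {total_b / 1e9:.1f} GB")
```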
### Still Slow After GPU Setup
1. Verify GPU is actually being used (check logs)
2. Ensure `n_gpu_layers=-1` is set (check initialization logs)
3. Check HuggingFace Space isn't in "Sleeping" mode
4. Verify model is fully loaded before making requests
## Code Changes Summary
The following changes enable automatic GPU detection:
1. **`src/backend/chatbot.py`**:
- Added `detect_gpu_and_environment()` function
- Modified `init_llm()` to use dynamic GPU configuration
- Automatic path detection for HF Spaces vs local
2. **`app.py`**:
- Added GPU status indicator in sidebar
   - Shows real-time GPU availability (sketched after this list)
3. **`src/requirements.txt`**:
- Added `torch>=2.0.0` for GPU detection
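For reference, the sidebar indicator from item 2 can be as simple as this Streamlit sketch (illustrative; the actual wording in `app.py` may differ):

```python
import streamlit as st
import torch

if torch.cuda.is_available():
    st.sidebar.success("🚀 GPU Acceleration: ENABLED")
else:
    st.sidebar.warning("⚠️ GPU Acceleration: DISABLED")
```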
## Testing Locally
To test GPU detection locally (if you have an Nvidia GPU):
```bash
# Install CUDA-enabled PyTorch
pip install torch --index-url https://download.pytorch.org/whl/cu118
# Run the app
streamlit run app.py
```
Without GPU locally, you'll see:
```
⚠️ No GPU detected via torch.cuda
⚠️ Running on CPU (no GPU detected)
```
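Because environment detection keys off the `SPACE_ID`/`SPACE_AUTHOR_NAME` variables (see "How It Works"), you can also exercise the Spaces code path locally by setting one of them before the detection code runs, e.g. at the top of a local test script. The import path here is a guess based on the file layout above:

```python
import os

# Hypothetical local test: any non-empty value should work if detection
# only checks that the variable is present.
os.environ["SPACE_ID"] = "local-test"

from src.backend.chatbot import detect_gpu_and_environment  # noqa: E402
print(detect_gpu_and_environment())
```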
## Additional Resources
- [HuggingFace Spaces Hardware Documentation](https://huggingface.co/docs/hub/spaces-gpus)
- [LlamaCPP GPU Acceleration Guide](https://github.com/ggerganov/llama.cpp#cublas)
- [PyTorch CUDA Setup](https://pytorch.org/get-started/locally/)
---
**Note**: This GPU setup is backward compatible - the app will still work on CPU if no GPU is available!