================================================================================
LLAMA.CPP SETUP & USAGE GUIDE FOR NVIDIA A100 80GB (CUDA 13)
================================================================================

1. COMPILATION & INSTALLATION (Optimized for Ampere A100)
--------------------------------------------------------------------------------
# Step 1: Set Environment Variables
export PATH=/usr/local/cuda-13.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH
export CUDACXX=/usr/local/cuda-13.0/bin/nvcc

# Step 2: Clean and Configure
# We target compute capability 8.0 (Ampere/A100) and disable CUDA binary
# compression, which can trigger GCC errors with some host toolchains.
cd ~/llama.cpp
rm -rf build && mkdir build && cd build
cmake .. -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES=80 \
    -DGGML_CUDA_COMPRESS_OPTIM_SIZE=OFF \
    -DCMAKE_BUILD_TYPE=Release

# Step 3: Build
cmake --build . --config Release -j $(nproc)
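
# Optional sanity check (assuming the build above finished without errors):
# confirm the main binaries were produced and that the driver sees the A100.
ls bin/llama-cli bin/llama-server bin/llama-bench
./bin/llama-cli --version
nvidia-smi --query-gpu=name,memory.total --format=csv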

2. ENVIRONMENT PERMANENCE
--------------------------------------------------------------------------------
# Add binaries and CUDA paths to your .bashrc to skip manual exports next time:
echo 'export PATH="/usr/local/cuda-13.0/bin:$HOME/llama.cpp/build/bin:$PATH"' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH="/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH"' >> ~/.bashrc
source ~/.bashrc
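
# Quick check that the new PATH is picked up (illustrative, not required):
which llama-cli
llama-cli --version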

3. RUNNING INFERENCE (llama-cli)
--------------------------------------------------------------------------------
# Key flags for A100 80GB:
#   -ngl 99 : Offload all layers to GPU VRAM
#   -fa     : Enable Flash Attention (massive speedup for A100)
#   -c 8192 : Set context size (you can go much higher on 80GB)
llama-cli -m /path/to/your/model.gguf \
    -ngl 99 \
    -fa \
    -c 8192 \
    -n 512 \
    -p "You are a helpful assistant. Explain the benefits of A100 GPUs."

4. SERVING AN API (llama-server)
--------------------------------------------------------------------------------
# Starts an OpenAI-compatible API server
llama-server -m /path/to/your/model.gguf \
    -ngl 99 \
    -fa \
    --port 8080 \
    --host 0.0.0.0
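
# Example request against the OpenAI-compatible chat endpoint (the prompt and
# max_tokens value are placeholders; the port must match the command above):
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Why are A100 GPUs good for LLM inference?"}
          ],
          "max_tokens": 256
        }'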

5. PERFORMANCE BENCHMARKING
--------------------------------------------------------------------------------
# Test the tokens-per-second (t/s) capability of your hardware:
llama-bench -m /path/to/your/model.gguf -ngl 99 -fa 1
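
# llama-bench accepts comma-separated values and sweeps them in a single run;
# for example, compare prompt processing and generation speed across a few
# batch sizes (a sketch -- the prompt/gen lengths and batch sizes are arbitrary):
llama-bench -m /path/to/your/model.gguf -ngl 99 -fa 1 -p 512 -n 128 -b 512,1024,2048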

6. OPTIMIZATION TIPS FOR A100 80GB
--------------------------------------------------------------------------------
* QUANTIZATION: With 80GB VRAM, use Q8_0 or Q6_K quantizations for near-native
  precision. Use Q4_K_M only if running massive 100B+ parameter models.
* FLASH ATTENTION: Always use the -fa flag. The CUDA Flash Attention kernels
  make good use of the tensor cores on Ampere GPUs such as the A100 and avoid
  materializing the full attention matrix, which matters most at long contexts.
* BATCHING: If serving multiple concurrent requests, raise '-b' (logical batch
  size) and '-ub' (physical batch size) to 2048 or higher to keep the A100
  saturated during prompt processing; see the example invocation below.
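
# A sketch of a multi-request serving setup combining the tips above (the
# context size, batch sizes and slot count are illustrative, not tuned values):
llama-server -m /path/to/your/model.gguf \
    -ngl 99 \
    -fa \
    -c 32768 \
    -b 2048 \
    -ub 2048 \
    --parallel 4 \
    --host 0.0.0.0 --port 8080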

================================================================================