================================================================================
LLAMA.CPP SETUP & USAGE GUIDE FOR NVIDIA A100 80GB (CUDA 13)
================================================================================

1. COMPILATION & INSTALLATION (Optimized for Ampere A100)
--------------------------------------------------------------------------------
# Step 1: Set Environment Variables
export PATH=/usr/local/cuda-13.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH
export CUDACXX=/usr/local/cuda-13.0/bin/nvcc

# Step 2: Clean and Configure
# We force compute architecture 80 (A100) and disable CUDA compression to
# avoid GCC errors.
cd ~/llama.cpp
rm -rf build && mkdir build && cd build
cmake .. -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES=80 \
    -DGGML_CUDA_COMPRESS_OPTIM_SIZE=OFF \
    -DCMAKE_BUILD_TYPE=Release

# Step 3: Build
cmake --build . --config Release -j $(nproc)

2. ENVIRONMENT PERMANENCE
--------------------------------------------------------------------------------
# Add the binaries and CUDA paths to your .bashrc to skip the manual exports
# next time:
echo 'export PATH="/usr/local/cuda-13.0/bin:$HOME/llama.cpp/build/bin:$PATH"' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH="/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH"' >> ~/.bashrc
source ~/.bashrc

3. RUNNING INFERENCE (llama-cli)
--------------------------------------------------------------------------------
# Key flags for an A100 80GB:
#   -ngl 99 : Offload all layers to GPU VRAM
#   -fa     : Enable Flash Attention (a large speedup on Ampere)
#   -c 8192 : Set the context size (you can go much higher with 80GB)
llama-cli -m /path/to/your/model.gguf \
    -ngl 99 \
    -fa \
    -c 8192 \
    -n 512 \
    -p "You are a helpful assistant. Explain the benefits of A100 GPUs."

4. SERVING AN API (llama-server)
--------------------------------------------------------------------------------
# Starts an OpenAI-compatible API server
llama-server -m /path/to/your/model.gguf \
    -ngl 99 \
    -fa \
    --port 8080 \
    --host 0.0.0.0

5. PERFORMANCE BENCHMARKING
--------------------------------------------------------------------------------
# Test the tokens-per-second (t/s) capability of your hardware
# (note: llama-bench's flash-attention flag takes a value):
llama-bench -m /path/to/your/model.gguf -ngl 99 -fa 1

6. OPTIMIZATION TIPS FOR A100 80GB
--------------------------------------------------------------------------------
* QUANTIZATION: With 80GB of VRAM, use Q8_0 or Q6_K quantizations for
  near-native precision. Drop to Q4_K_M only if running massive 100B+
  parameter models.
* FLASH ATTENTION: Always use the -fa flag. It is well supported on the
  A100's Ampere architecture and significantly improves throughput.
* BATCHING: If serving multiple concurrent requests, increase '-b' (logical
  batch size) and '-ub' (physical batch size) to 2048 or higher to saturate
  the A100's compute units.

================================================================================
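As a rough aid to the quantization tip above, the weight footprint of a model can be estimated from its parameter count and the quantization's approximate bits per weight. The bits-per-weight figures below are approximate (GGUF quantizations mix tensor types), the 70B parameter count is just an example, and KV cache / activation overhead is not counted, so treat the results as ballpark numbers only:

```shell
#!/bin/sh
# Back-of-the-envelope GGUF size estimator.
# Approximate bits per weight: Q8_0 ~8.5, Q6_K ~6.56, Q4_K_M ~4.8.
params_b=70   # example: a 70B-parameter model

for entry in "Q8_0 8.5" "Q6_K 6.56" "Q4_K_M 4.8"; do
    set -- $entry
    quant=$1
    bpw=$2
    # size in GB ~= params (billions) * bits-per-weight / 8
    est=$(awk -v p="$params_b" -v b="$bpw" 'BEGIN { printf "%.1f", p * b / 8 }')
    echo "$quant: ~${est} GB for a ${params_b}B model"
done
```

By this estimate a 70B model fits comfortably in 80GB at Q6_K and is tight at Q8_0 once context overhead is added, which matches the tip's suggestion to reserve Q4_K_M for much larger models.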