================================================================================
LLAMA.CPP SETUP & USAGE GUIDE FOR NVIDIA A100 80GB (CUDA 13)
================================================================================
1. COMPILATION & INSTALLATION (Optimized for Ampere A100)
--------------------------------------------------------------------------------
# Step 1: Set Environment Variables
export PATH=/usr/local/cuda-13.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH
export CUDACXX=/usr/local/cuda-13.0/bin/nvcc
# Step 2: Clean and Configure
# We target compute capability 8.0 (the A100's Ampere architecture) and disable
# CUDA binary compression to avoid GCC errors.
cd ~/llama.cpp
rm -rf build && mkdir build && cd build
cmake .. -DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES=80 \
-DGGML_CUDA_COMPRESS_OPTIM_SIZE=OFF \
-DCMAKE_BUILD_TYPE=Release
# Step 3: Build
cmake --build . --config Release -j $(nproc)
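Before configuring, it is worth confirming that the CUDACXX export actually points at the CUDA 13 compiler, since a stale nvcc on the PATH is a common source of build failures. A minimal sketch (the helper name `cuda_release` is ours, and the banner line in the comment is illustrative of nvcc's usual output):

```shell
# Extract the release number from nvcc's version banner.
# nvcc typically prints a line like:
#   Cuda compilation tools, release 13.0, V13.0.48
cuda_release() {
    grep -o 'release [0-9.]*' | awk '{ print $2 }'
}

# Check the compiler that the CUDACXX export points at:
if command -v "$CUDACXX" >/dev/null 2>&1; then
    "$CUDACXX" --version | cuda_release   # expect: 13.0
fi
```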
2. ENVIRONMENT PERMANENCE
--------------------------------------------------------------------------------
# Add binaries and CUDA paths to your .bashrc to skip manual exports next time:
echo 'export PATH="/usr/local/cuda-13.0/bin:$HOME/llama.cpp/build/bin:$PATH"' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH="/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH"' >> ~/.bashrc
source ~/.bashrc
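Note that re-running the echo commands above stacks duplicate lines in .bashrc. A small guard (a sketch; the helper name `append_once` is ours) appends a line only if it is not already present:

```shell
# Append a line to a file only if that exact line is not already there.
append_once() {
    local line="$1" file="$2"
    grep -qxF "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

append_once 'export PATH="/usr/local/cuda-13.0/bin:$HOME/llama.cpp/build/bin:$PATH"' ~/.bashrc
append_once 'export LD_LIBRARY_PATH="/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH"' ~/.bashrc
```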
3. RUNNING INFERENCE (llama-cli)
--------------------------------------------------------------------------------
# Key flags for A100 80GB:
# -ngl 99 : Offload all layers to GPU VRAM
# -fa : Enable Flash Attention (massive speedup for A100)
# -c 8192 : Set context size (you can go much higher on 80GB)
llama-cli -m /path/to/your/model.gguf \
-ngl 99 \
-fa \
-c 8192 \
-n 512 \
-p "You are a helpful assistant. Explain the benefits of A100 GPUs."
4. SERVING AN API (llama-server)
--------------------------------------------------------------------------------
# Starts an OpenAI-compatible API server
llama-server -m /path/to/your/model.gguf \
-ngl 99 \
-fa \
--port 8080 \
--host 0.0.0.0
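Once the server is up, you can exercise the OpenAI-compatible chat endpoint with plain curl. A minimal sketch (assumes the server above is running on localhost:8080; the prompt and max_tokens value are illustrative):

```shell
# Query the OpenAI-compatible chat endpoint of a running llama-server.
PAYLOAD='{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "Explain the benefits of A100 GPUs."}
  ],
  "max_tokens": 256
}'

curl -s http://localhost:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d "$PAYLOAD" || echo "request failed -- is llama-server running?"
```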
5. PERFORMANCE BENCHMARKING
--------------------------------------------------------------------------------
# Test the tokens-per-second (t/s) capability of your hardware:
llama-bench -m /path/to/your/model.gguf -ngl 99 -fa
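If you want to compare runs programmatically, the t/s column can be pulled out of llama-bench's markdown table with awk. A sketch, assuming a row layout like the sample below (the exact columns vary between llama.cpp versions, so treat this as illustrative):

```shell
# Pull the t/s figure out of a llama-bench table row.
# A typical row (format varies by build):
# | llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | pp512 | 5023.41 ± 12.30 |
tps_of_row() {
    awk -F'|' '{ n = split($(NF-1), a, " "); print a[1] }'
}
```

Usage: `llama-bench -m model.gguf -ngl 99 -fa | grep 'pp512' | tps_of_row`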
6. OPTIMIZATION TIPS FOR A100 80GB
--------------------------------------------------------------------------------
* QUANTIZATION: With 80GB of VRAM, use Q8_0 or Q6_K quantizations for
  near-lossless quality. Reserve Q4_K_M for massive 100B+ parameter models
  that would not otherwise fit.
* FLASH ATTENTION: Always use the -fa flag. It is specifically optimized for
the A100's architecture.
* BATCHING: If serving multiple concurrent requests, increase '-b' (logical
  batch size) and '-ub' (physical batch size) to 2048 or higher to saturate
  the A100 cores.
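To sanity-check the quantization advice above: weight memory is roughly parameters x bits-per-weight / 8, before KV cache and runtime overhead. A quick sketch (the bits-per-weight figures are approximations; the helper name is ours):

```shell
# Rough GGUF weight size in GB: params (billions) * bits-per-weight / 8.
# Approximate bpw: Q4_K_M ~4.8, Q6_K ~6.6, Q8_0 8.5. KV cache not included.
est_gb() {
    awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 }'
}

est_gb 70 8.5    # ~74 GB: a 70B model at Q8_0 just fits in 80GB
est_gb 70 4.8    # ~42 GB: the same model at Q4_K_M leaves ample headroom
```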
================================================================================