================================================================================
LLAMA.CPP SETUP & USAGE GUIDE FOR NVIDIA A100 80GB (CUDA 13)
================================================================================

1. COMPILATION & INSTALLATION (Optimized for Ampere A100)
--------------------------------------------------------------------------------
# Step 1: Set Environment Variables
export PATH=/usr/local/cuda-13.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH
export CUDACXX=/usr/local/cuda-13.0/bin/nvcc

# Step 2: Clean and Configure
# We force compute architecture 80 (A100) and disable CUDA compression to
# avoid GCC errors.
cd ~/llama.cpp
rm -rf build && mkdir build && cd build
cmake .. -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES=80 \
    -DGGML_CUDA_COMPRESS_OPTIM_SIZE=OFF \
    -DCMAKE_BUILD_TYPE=Release

# Step 3: Build
cmake --build . --config Release -j $(nproc)

2. ENVIRONMENT PERMANENCE
--------------------------------------------------------------------------------
# Add the binaries and CUDA paths to your .bashrc to skip the manual exports
# next time:
echo 'export PATH="/usr/local/cuda-13.0/bin:$HOME/llama.cpp/build/bin:$PATH"' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH="/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH"' >> ~/.bashrc
source ~/.bashrc

3. RUNNING INFERENCE (llama-cli)
--------------------------------------------------------------------------------
# Key flags for an A100 80GB:
#   -ngl 99 : Offload all layers to GPU VRAM
#   -fa     : Enable Flash Attention (a large speedup on Ampere)
#   -c 8192 : Set the context size (you can go much higher with 80GB)
llama-cli -m /path/to/your/model.gguf \
    -ngl 99 \
    -fa \
    -c 8192 \
    -n 512 \
    -p "You are a helpful assistant. Explain the benefits of A100 GPUs."

4. SERVING AN API (llama-server)
--------------------------------------------------------------------------------
# Starts an OpenAI-compatible API server
llama-server -m /path/to/your/model.gguf \
    -ngl 99 \
    -fa \
    --port 8080 \
    --host 0.0.0.0

5. PERFORMANCE BENCHMARKING
--------------------------------------------------------------------------------
# Test the tokens-per-second (t/s) capability of your hardware
# (note: llama-bench's flash-attention flag takes a value):
llama-bench -m /path/to/your/model.gguf -ngl 99 -fa 1

6. OPTIMIZATION TIPS FOR A100 80GB
--------------------------------------------------------------------------------
* QUANTIZATION: With 80GB of VRAM, use Q8_0 or Q6_K quantizations for
  near-native precision. Drop to Q4_K_M only if running massive 100B+
  parameter models.
* FLASH ATTENTION: Always use the -fa flag. It is well supported on the
  A100's Ampere architecture and significantly improves throughput.
* BATCHING: If serving multiple concurrent requests, increase '-b' (logical
  batch size) and '-ub' (physical batch size) to 2048 or higher to saturate
  the A100's compute units.

================================================================================
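As a rough aid to the quantization tip above, the weight footprint of a model can be estimated from its parameter count and the quantization's approximate bits per weight. The bits-per-weight figures below are approximate (GGUF quantizations mix tensor types), the 70B parameter count is just an example, and KV cache / activation overhead is not counted, so treat the results as ballpark numbers only:

```shell
#!/bin/sh
# Back-of-the-envelope GGUF size estimator.
# Approximate bits per weight: Q8_0 ~8.5, Q6_K ~6.56, Q4_K_M ~4.8.
params_b=70   # example: a 70B-parameter model

for entry in "Q8_0 8.5" "Q6_K 6.56" "Q4_K_M 4.8"; do
    set -- $entry
    quant=$1
    bpw=$2
    # size in GB ~= params (billions) * bits-per-weight / 8
    est=$(awk -v p="$params_b" -v b="$bpw" 'BEGIN { printf "%.1f", p * b / 8 }')
    echo "$quant: ~${est} GB for a ${params_b}B model"
done
```

By this estimate a 70B model fits comfortably in 80GB at Q6_K and is tight at Q8_0 once context overhead is added, which matches the tip's suggestion to reserve Q4_K_M for much larger models.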