# llama.cpp for AMD ZenDNN

> [!WARNING]
> **Note:** ZenDNN is **not** the same as zDNN.
> - **ZenDNN** (this page): AMD's deep learning library for AMD EPYC CPUs
> - **zDNN**: IBM's Deep Neural Network acceleration library for IBM Z & LinuxONE Mainframes ([see zDNN documentation](zDNN.md))
- [Background](#background)
- [OS](#os)
- [Hardware](#hardware)
- [Supported Operations](#supported-operations)
- [DataType Supports](#datatype-supports)
- [Linux](#linux)
- [Environment Variables](#environment-variables)
- [Performance Optimization](#performance-optimization)
- [Known Issues](#known-issues)
- [Q&A](#qa)
- [TODO](#todo)
## Background

**ZenDNN** (Zen Deep Neural Network Library) is AMD's high-performance deep learning inference library optimized for AMD EPYC™ CPUs. It provides optimized implementations of key deep learning primitives and operations, delivering significant performance improvements for neural network workloads on AMD Zen-based processor architectures.

**Llama.cpp + ZenDNN**

The llama.cpp ZenDNN backend leverages AMD's optimized matrix multiplication primitives to accelerate inference on AMD CPUs. It utilizes ZenDNN's **LowOHA (Low Overhead Hardware Accelerated)** MatMul operator for efficient GEMM operations with minimal execution overhead, built-in weight caching, and direct access to backend libraries (AOCL DLP, LibXSMM, oneDNN).

For more information about ZenDNN, visit: https://www.amd.com/en/developer/zendnn.html
## OS

| OS | Status | Verified |
|:-------:|:-------:|:----------------------------------------------:|
| Linux | Support | Ubuntu 20.04, 22.04, 24.04 |

For the latest list of supported operating systems, see the [ZenDNN Supported OS](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/README.md#15-supported-os).
## Hardware

### AMD CPUs

**Recommended Processors**

ZenDNN is optimized for AMD EPYC™ processors and AMD Ryzen™ processors based on the "Zen" microarchitecture and newer.

| CPU Family | Status | Notes |
|:-----------------------------:|:-------:|:----------------------------------:|
| AMD EPYC™ 9005 Series (Turin) | Support | 5th Gen - Zen 5 architecture |
| AMD EPYC™ 9004 Series (Genoa) | Support | 4th Gen - Zen 4 architecture |
| AMD EPYC™ 7003 Series (Milan) | Support | 3rd Gen - Zen 3 architecture |
| AMD Ryzen™ AI MAX (Strix Halo)| Support | High-performance mobile processors |

*Notes:*
- Best performance is achieved on AMD EPYC™ processors with high core counts (e.g., EPYC 9005 series).
- ZenDNN leverages AMD's advanced CPU features, including the AVX2 and AVX-512 instruction sets (see the quick check below).
- For optimal performance, ensure your system has sufficient memory bandwidth.
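The snippet below is an optional sanity check using standard Linux tools only (nothing ZenDNN-specific) to confirm that the CPU exposes the instruction-set extensions mentioned above; `avx512_bf16` in particular indicates hardware BF16 support on Zen 4/Zen 5 parts.

```sh
# Show the CPU model and list the relevant ISA extensions reported by the kernel
lscpu | grep "Model name"
lscpu | grep -o -w -e avx2 -e avx512f -e avx512_bf16 | sort -u
```

If `avx512_bf16` is not listed, BF16 models will still run, but without the Zen 4/Zen 5 BF16 fast path.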
## Supported Operations

The ZenDNN backend currently accelerates **matrix multiplication (MUL_MAT)** operations only. Other operations are handled by the standard CPU backend.

| Operation | Status | Notes |
|:-------------|:-------:|:----------------------------------------------:|
| MUL_MAT | Support | Accelerated via ZenDNN LowOHA MatMul |

*Note:* Since only MUL_MAT is accelerated, models will benefit most from ZenDNN when matrix multiplications dominate the computational workload (which is typical for transformer-based LLMs).
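If the test binaries are enabled in your build (the default for a standalone llama.cpp checkout), llama.cpp's `test-backend-ops` tool can run the operator tests for a single op; the exact filter syntax may vary between llama.cpp versions, so treat this as a sketch.

```sh
# Run the backend operator tests for MUL_MAT only;
# cases handled by the ZenDNN backend are checked against the reference CPU backend.
./build/bin/test-backend-ops test -o MUL_MAT
```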
## DataType Supports

| DataType | Status | Notes |
|:----------------------:|:-------:|:---------------------------------------------:|
| FP32 | Support | Full precision floating point |
| BF16 | Support | BFloat16 (best performance on Zen 4/Zen 5) |

*Notes:*
- **BF16** provides best performance on Zen 4 and Zen 5 EPYC™ processors (Genoa, Turin).
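If your GGUF model is not already in BF16, it can be converted with llama.cpp's `llama-quantize` tool, which includes BF16 among its output types. The file names below are placeholders; substitute your own paths.

```sh
# Convert an FP16/FP32 GGUF to BF16 (paths are examples, not files shipped with llama.cpp)
./build/bin/llama-quantize models/my-model-F16.gguf models/my-model-BF16.gguf BF16
```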
## Linux

### I. Setup Environment

You have two options to set up ZenDNN:

#### Option 1: Automatic Download and Build (Recommended)

CMake will automatically download and build ZenDNN for you:

```sh
# Build llama.cpp - ZenDNN will be automatically downloaded and built
cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
```

No manual ZenDNN installation is required; CMake will handle everything automatically.
#### Option 2: Use a Custom ZenDNN Installation

If you want to build ZenDNN yourself or use a specific version:

**Step 1: Build ZenDNN from source**

```sh
# Clone the ZenDNN repository
git clone https://github.com/amd/ZenDNN.git
cd ZenDNN

# Build and install (requires CMake >= 3.25)
mkdir build && cd build
cmake ..
cmake --build . --target all
```

Default installation path: `ZenDNN/build/install`

**For detailed build instructions**, refer to the [ZenDNN README](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/README.md).
**Step 2: Build llama.cpp with the custom ZenDNN path**

```sh
# Using environment variable
export ZENDNN_ROOT=/path/to/ZenDNN/build/install
cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)

# OR specify path directly in CMake
cmake -B build -DGGML_ZENDNN=ON -DZENDNN_ROOT=/path/to/ZenDNN/build/install -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
```
### II. Run the Server

#### 1. Download Model

Download the Llama 3.1 8B Instruct BF16 model:

```sh
# Download from Hugging Face
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct-GGUF --local-dir models/
```
#### 2. Start Server

Run the llama.cpp server with ZenDNN acceleration:

```sh
# Set optimal configuration
export ZENDNNL_MATMUL_ALGO=1  # Blocked AOCL DLP algo for best performance

# Start server
./build/bin/llama-server \
    -m models/Llama-3.1-8B-Instruct.BF16.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -t 64
```

Access the server at `http://localhost:8080`.
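Once the server is up, it can be exercised through llama-server's standard HTTP API; the prompt and generation parameters below are arbitrary examples.

```sh
# Quick liveness check
curl http://localhost:8080/health

# OpenAI-compatible chat completion request
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [{"role": "user", "content": "Hello, what can you do?"}],
          "max_tokens": 64
        }'
```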
**Performance tips**:
- Use `ZENDNNL_MATMUL_ALGO=1` for optimal performance.
- On NUMA systems, pin the server to one node, e.g. `numactl --cpunodebind=0 --membind=0 ./build/bin/llama-server ...` (see the topology check below).
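Before choosing NUMA binding values, it helps to look at the machine's topology; both commands below are standard Linux utilities and independent of ZenDNN.

```sh
# Show NUMA node count and the CPUs/memory attached to each node
numactl --hardware
lscpu | grep -i numa
```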
## Environment Variables

For environment variables related to ZenDNN, refer to the [ZenDNN Environment Variables Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/runtime_env.md).
### Performance Optimization

ZenDNN's LowOHA MatMul supports multiple backend algorithms. For **best performance**, use the **Blocked AOCL DLP** algorithm:

```sh
export ZENDNNL_MATMUL_ALGO=1  # Blocked AOCL DLP algo (recommended)
```

For more details on available algorithms, see the [ZenDNN MatMul Algorithm Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/runtime_env.md#algorithm-details).
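If you want to verify which algorithm works best for a particular model and core count, a simple sweep with `llama-bench` is one option; the algorithm values other than `1` below are placeholders, so check the linked ZenDNN documentation for the values that are actually valid in your ZenDNN version.

```sh
# Hypothetical sweep over a few ZENDNNL_MATMUL_ALGO values (valid values: see ZenDNN docs)
for algo in 1 2 3; do
    echo "ZENDNNL_MATMUL_ALGO=$algo"
    ZENDNNL_MATMUL_ALGO=$algo ./build/bin/llama-bench \
        -m models/Llama-3.1-8B-Instruct.BF16.gguf -t 64 -p 512 -n 128
done
```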
### Profiling and Debugging

For detailed profiling and logging options, refer to the [ZenDNN Logging Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/logging.md).
## Known Issues

- **Limited operation support**: Currently only matrix multiplication (MUL_MAT) is accelerated via ZenDNN. Other operations fall back to the standard CPU backend.
- **BF16 support**: BF16 operations require AMD Zen 4 or Zen 5 architecture (EPYC 9004/9005 series). On older CPUs, operations will use FP32.
- **NUMA awareness**: For multi-socket systems, manual NUMA binding may be required for optimal performance.
## Q&A

**Q: How do I verify that the ZenDNN backend is being used?**

A: Check the log output when running llama.cpp. You should see messages indicating the ZenDNN backend is initialized. You can also check the backend name in the output.
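A minimal way to do that from the command line, assuming a GGUF model is available locally (the path below is a placeholder), is to run a short generation and filter the startup log for backend mentions:

```sh
# Run a tiny generation and show any log lines mentioning ZenDNN
./build/bin/llama-cli -m models/Llama-3.1-8B-Instruct.BF16.gguf -p "Hello" -n 8 2>&1 | grep -i zendnn
```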
**Q: What performance improvement can I expect?**

A: Performance gains vary depending on the model size, batch size, and CPU architecture. On AMD EPYC processors, you can typically expect 1.1x-2x speedup compared to standard CPU inference for matrix multiplication operations.
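To measure the effect on your own hardware, you can benchmark the same model on the ZenDNN-enabled build and on a plain CPU build (for example, one configured without `-DGGML_ZENDNN=ON` in a separate directory such as `build-cpu`, an illustrative name) and compare the reported prompt-processing and token-generation rates.

```sh
# ZenDNN-enabled build
./build/bin/llama-bench -m models/Llama-3.1-8B-Instruct.BF16.gguf -t 64

# Plain CPU build for comparison (assumes it was configured in ./build-cpu)
./build-cpu/bin/llama-bench -m models/Llama-3.1-8B-Instruct.BF16.gguf -t 64
```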
**Q: Can I use ZenDNN on non-AMD processors?**

A: ZenDNN is optimized specifically for AMD processors. While it may work on other x86-64 CPUs, performance benefits are only guaranteed on AMD Zen-based architectures.

**Q: Does ZenDNN support quantized models?**

A: Currently, ZenDNN primarily supports the FP32 and BF16 data types. Quantized model support is not available at this time.

**Q: Why is my inference not faster with ZenDNN?**

A: Check that:
1. You're using an AMD EPYC or Ryzen processor (Zen 2 or newer)
2. `ZENDNNL_MATMUL_ALGO=1` is set for best performance (Blocked AOCL DLP)
3. You're using a sufficiently large model (small models may not benefit as much)
4. Profiling (see above) confirms the ZenDNN MatMul is actually being called
### **GitHub Contribution**:

Please add the **[ZenDNN]** prefix/tag to issue and PR titles so the ZenDNN team can check and address them without delay.
## TODO

- Expand operation support beyond MUL_MAT (attention operations, activations, etc.)