# llama.cpp for AMD ZenDNN

> [!WARNING]
> ZenDNN is **not** the same as zDNN.
> - **ZenDNN** (this page): AMD's deep learning library for AMD EPYC CPUs
> - **zDNN**: IBM's Deep Neural Network acceleration library for IBM Z & LinuxONE mainframes ([see zDNN documentation](zDNN.md))

- [Background](#background)
- [OS](#os)
- [Hardware](#hardware)
- [Supported Operations](#supported-operations)
- [DataType Supports](#datatype-supports)
- [Linux](#linux)
- [Environment Variable](#environment-variable)
- [Performance Optimization](#performance-optimization)
- [Known Issues](#known-issues)
- [Q&A](#qa)
- [TODO](#todo)

## Background

**ZenDNN** (Zen Deep Neural Network Library) is AMD's high-performance deep learning inference library optimized for AMD EPYC™ CPUs. It provides optimized implementations of key deep learning primitives and operations, delivering significant performance improvements for neural network workloads on AMD Zen-based processor architectures.

**Llama.cpp + ZenDNN**

The llama.cpp ZenDNN backend leverages AMD's optimized matrix multiplication primitives to accelerate inference on AMD CPUs. It utilizes ZenDNN's **LowOHA (Low Overhead Hardware Accelerated)** MatMul operator for efficient GEMM operations with minimal execution overhead, built-in weight caching, and direct access to backend libraries (AOCL DLP, LibXSMM, OneDNN).

For more information about ZenDNN, visit: https://www.amd.com/en/developer/zendnn.html

## OS

| OS      | Status  | Verified                                       |
|:-------:|:-------:|:----------------------------------------------:|
| Linux   | Support | Ubuntu 20.04, 22.04, 24.04                     |

For the latest list of supported operating systems, see the [ZenDNN Supported OS](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/README.md#15-supported-os).

## Hardware

### AMD CPUs

**Recommended Processors**

ZenDNN is optimized for AMD EPYC™ processors and AMD Ryzen™ processors based on "Zen" microarchitecture and newer.

| CPU Family                    | Status  | Notes                              |
|:-----------------------------:|:-------:|:----------------------------------:|
| AMD EPYC™ 9005 Series (Turin) | Support | 5th Gen - Zen 5 architecture       |
| AMD EPYC™ 9004 Series (Genoa) | Support | 4th Gen - Zen 4 architecture       |
| AMD EPYC™ 7003 Series (Milan) | Support | 3rd Gen - Zen 3 architecture       |
| AMD Ryzen™ AI MAX (Strix Halo)| Support | High-performance mobile processors |

*Notes:*

- Best performance is achieved on AMD EPYC™ processors with high core counts (e.g., EPYC 9005 series).
- ZenDNN leverages AMD's advanced CPU features including AVX2 and AVX-512 instruction sets.
- For optimal performance, ensure your system has sufficient memory bandwidth.

## Supported Operations

The ZenDNN backend currently accelerates **matrix multiplication (MUL_MAT)** operations only. Other operations are handled by the standard CPU backend.

| Operation    | Status  | Notes                                |
|:-------------|:-------:|:------------------------------------:|
| MUL_MAT      | Support | Accelerated via ZenDNN LowOHA MatMul |

*Note:* Since only MUL_MAT is accelerated, models will benefit most from ZenDNN when matrix multiplications dominate the computational workload (which is typical for transformer-based LLMs).

## DataType Supports

| DataType | Status  | Notes                                      |
|:--------:|:-------:|:------------------------------------------:|
| FP32     | Support | Full precision floating point              |
| BF16     | Support | BFloat16 (best performance on Zen 4/Zen 5) |

*Note:*

- **BF16** provides the best performance on Zen 4 and Zen 5 EPYC™ processors (Genoa, Turin).



## Linux

### I. Setup Environment

You have two options to set up ZenDNN:

#### Option 1: Automatic Download and Build (Recommended)

CMake will automatically download and build ZenDNN for you:

```sh
# Build llama.cpp - ZenDNN will be automatically downloaded and built
cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
```

No manual ZenDNN installation required. CMake will handle everything automatically.
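To confirm that the option actually took effect, you can inspect the CMake cache after configuring (this assumes the `build` directory created by the commands above):

```sh
# Confirm the ZenDNN backend option was picked up by CMake
grep -i "GGML_ZENDNN" build/CMakeCache.txt || echo "GGML_ZENDNN not set - reconfigure with -DGGML_ZENDNN=ON"
```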



#### Option 2: Use Custom ZenDNN Installation

If you want to build ZenDNN yourself or use a specific version:

**Step 1: Build ZenDNN from source**

```sh
# Clone ZenDNN repository
git clone https://github.com/amd/ZenDNN.git
cd ZenDNN

# Build and install (requires CMake >= 3.25)
mkdir build && cd build
cmake ..
cmake --build . --target all
```

Default installation path: `ZenDNN/build/install`

**For detailed build instructions**, refer to the [ZenDNN README](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/README.md).

**Step 2: Build llama.cpp with custom ZenDNN path**

```sh
# Using environment variable
export ZENDNN_ROOT=/path/to/ZenDNN/build/install
cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)

# OR specify path directly in CMake
cmake -B build -DGGML_ZENDNN=ON -DZENDNN_ROOT=/path/to/ZenDNN/build/install -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
```
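If CMake fails to find your custom build, a quick sanity check is to verify that the install prefix contains the expected subdirectories. This sketch assumes a conventional `include`/`lib` layout under the install prefix; the path is a placeholder to adjust for your system:

```sh
# Sanity-check the ZenDNN install prefix (placeholder path - adjust to yours)
ZENDNN_ROOT="${ZENDNN_ROOT:-/path/to/ZenDNN/build/install}"
for d in include lib; do
    [ -d "$ZENDNN_ROOT/$d" ] && echo "found: $ZENDNN_ROOT/$d" || echo "missing: $ZENDNN_ROOT/$d"
done
```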

### II. Run the Server

#### 1. Download Model

Download LLaMA 3.1 8B Instruct BF16 model:

```sh
# Download from Hugging Face
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct-GGUF --local-dir models/
```

#### 2. Start Server

Run llama.cpp server with ZenDNN acceleration:

```sh
# Set optimal configuration
export ZENDNNL_MATMUL_ALGO=1    # Blocked AOCL DLP algo for best performance

# Start server
./build/bin/llama-server \
    -m models/Llama-3.1-8B-Instruct.BF16.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -t 64
```

Access the server at `http://localhost:8080`.

**Performance tips**:
- Use `ZENDNNL_MATMUL_ALGO=1` for optimal performance
- For NUMA systems: `numactl --cpunodebind=0 --membind=0 ./build/bin/llama-server ...`
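Putting the tips above together, a minimal launch wrapper might look like the following sketch. The model path, NUMA node index, and thread count are placeholders to adapt to your system; the actual server invocation is left commented out:

```sh
# Sketch of a launch wrapper; model path and NUMA node are placeholders
export ZENDNNL_MATMUL_ALGO=1          # Blocked AOCL DLP algo (recommended)
THREADS="${THREADS:-$(nproc)}"        # default: one thread per logical CPU
echo "launching llama-server with $THREADS threads"
# Uncomment on a real system:
# numactl --cpunodebind=0 --membind=0 \
#     ./build/bin/llama-server -m models/Llama-3.1-8B-Instruct.BF16.gguf \
#     --host 0.0.0.0 --port 8080 -t "$THREADS"
```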

## Environment Variable

For environment variables related to ZenDNN, refer to the [ZenDNN Environment Variables Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/runtime_env.md).

### Performance Optimization

ZenDNN's LowOHA MatMul supports multiple backend algorithms. For **best performance**, use the **Blocked AOCL DLP** algorithm:

```sh
export ZENDNNL_MATMUL_ALGO=1    # Blocked AOCL DLP algo (recommended)
```

For more details on available algorithms, see the [ZenDNN MatMul Algorithm Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/runtime_env.md#algorithm-details).

### Profiling and Debugging

For detailed profiling and logging options, refer to the [ZenDNN Logging Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/logging.md).

## Known Issues

- **Limited operation support**: Currently only matrix multiplication (MUL_MAT) is accelerated via ZenDNN. Other operations fall back to the standard CPU backend.

- **BF16 support**: BF16 operations require AMD Zen 4 or Zen 5 architecture (EPYC 9004/9005 series). On older CPUs, operations will use FP32.

- **NUMA awareness**: For multi-socket systems, manual NUMA binding may be required for optimal performance.
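On Linux you can check whether your CPU exposes the AVX-512 BF16 extension, which appears as the `avx512_bf16` flag in `/proc/cpuinfo` on Zen 4 and Zen 5 parts:

```sh
# Check for AVX-512 BF16 support; an absent flag means the FP32 path is used
if grep -qm1 avx512_bf16 /proc/cpuinfo; then
    echo "BF16 supported"
else
    echo "BF16 not supported - FP32 will be used"
fi
```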



## Q&A

**Q: How do I verify that the ZenDNN backend is being used?**

A: Check the log output when running llama.cpp. You should see messages indicating that the ZenDNN backend was initialized. You can also check the backend name in the output.

**Q: What performance improvement can I expect?**

A: Performance gains vary with model size, batch size, and CPU architecture. On AMD EPYC processors, you can typically expect a 1.1x-2x speedup over standard CPU inference for matrix multiplication operations.

**Q: Can I use ZenDNN on non-AMD processors?**

A: ZenDNN is optimized specifically for AMD processors. While it may work on other x86-64 CPUs, performance benefits are only guaranteed on AMD Zen-based architectures.

**Q: Does ZenDNN support quantized models?**

A: Currently, ZenDNN primarily supports the FP32 and BF16 data types. Quantized model support is not available at this time.

**Q: Why is my inference not faster with ZenDNN?**

A: Ensure that:

1. You are using an AMD EPYC or Ryzen processor (Zen 2 or newer)
2. `ZENDNNL_MATMUL_ALGO=1` is set for best performance (Blocked AOCL DLP)
3. You are using a sufficiently large model (small models may not benefit as much)
4. Profiling is enabled so you can verify that the ZenDNN MatMul is being called

### GitHub Contribution

Please add the **[ZenDNN]** prefix/tag to issue and PR titles to help the ZenDNN team check and address them without delay.



## TODO

- Expand operation support beyond MUL_MAT (attention operations, activations, etc.)