Shangming Cai committed
Commit · 575c4e9
Parent(s): 08c8530
Update README of branch dev_triton.
Files changed: README.md (+27 -0), triton_kernels.py (+10 -0)
README.md CHANGED

@@ -67,6 +67,14 @@ cd flash-attention && pip install .
 # pip install csrc/layer_norm
 # pip install csrc/rotary
 ```
+
+如果您有更高推理性能方面的需求，但上述可选加速项`layer_norm`及`rotary`未能安装成功，或是您所使用的GPU不满足`flash-attention`库所要求的NVIDIA Ampere/Ada/Hopper架构，您可以尝试使用该分支下基于Triton进行实现的推理加速方案。该方案适用于更宽范围的GPU产品，且无需安装。您可以通过将config.json里的`use_triton`设置为true来进行启用。
+
+**(在dev_triton分支下`use_triton`默认设置为auto，由于pytorch 2.0及以上版本已默认安装了Triton，因此上述优化方案无需其它安装与配置操作即可直接启用。如果您不想开启该优化，请将config.json里的`use_triton`设置为false)**
+
+If you require higher inference performance but the optional acceleration features above (i.e., `layer_norm` and `rotary`) fail to install, or if the GPU you are using does not meet the NVIDIA Ampere/Ada/Hopper architecture required by the `flash-attention` library, you may try the inference acceleration solution implemented with Triton in this branch. This solution works on a wider range of GPU products and requires no installation. You can enable it by setting the `use_triton` option to true in the config.json file.
+
+**(In the dev_triton branch, `use_triton` is set to 'auto' by default. Since Triton comes pre-installed with PyTorch 2.0 and above, this acceleration solution can be enabled directly without any additional installation or configuration. If you prefer not to activate this optimization, set `use_triton` to false in the config.json file.)**
 <br>
 
 
@@ -140,6 +148,25 @@ In detail, the setting of profiling is generating 8192 new tokens with 1 context
 
 Note: The generation speed of the Int4/Int8 models mentioned above is provided by the autogptq library. The current speed of the model loaded using "AutoModelForCausalLM.from_pretrained" will be approximately 20% slower. We have reported this issue to the HuggingFace team and will update it promptly if a solution is available.
 
+另外，我们也测算了在使用不同GPU及推理加速方法时Qwen-7B-Chat-Int4模型生成2048和8192个token的平均推理速度。所有评测均使用PyTorch 2.1.0和CUDA 11.8。
+
+In addition, we also measured the average inference speed (tokens/s) of generating 2048 and 8192 tokens with the Qwen-7B-Chat-Int4 model on different GPU devices and with different acceleration methods. All results were run with PyTorch 2.1.0 and CUDA 11.8.
+
+| GPU Device | Method       | Speed (2048 tokens) | Speed (8192 tokens) |
+| :--------: | :----------: | :-----------------: | :-----------------: |
+| A10        | FlashAttn v2 | 41.28               | 30.78               |
+| A10        | Triton       | 49.04               | 29.17               |
+| A10        | Disabled     | 39.26               | 26.81               |
+| V100       | FlashAttn v2 | N/A                 | N/A                 |
+| V100       | Triton       | 37.01               | 27.66               |
+| V100       | Disabled     | 24.47               | 20.40               |
+| P100       | FlashAttn v2 | N/A                 | N/A                 |
+| P100       | Triton       | 29.03               | 13.85               |
+| P100       | Disabled     | 20.50               | 12.73               |
+| T4         | FlashAttn v2 | N/A                 | N/A                 |
+| T4         | Triton       | 27.98               | 15.22               |
+| T4         | Disabled     | 23.11               | 14.55               |
+
 ### 显存使用 (GPU Memory Usage)
 
 我们还测算了不同模型精度编码2048个token及生成8192个token的峰值显存占用情况。(显存消耗在是否使用FlashAttn的情况下均类似。)结果如下所示:
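The `use_triton` switch described in the README additions above lives in the checkpoint's config.json. A minimal sketch of flipping it programmatically, assuming a locally downloaded checkpoint; the helper name `set_use_triton` and the temporary stand-in path are illustrative, not part of the repository:

```python
import json
import pathlib
import tempfile

def set_use_triton(config_path, value):
    """Set the `use_triton` field (true / false / "auto") in a config.json."""
    path = pathlib.Path(config_path)
    cfg = json.loads(path.read_text())
    cfg["use_triton"] = value
    path.write_text(json.dumps(cfg, ensure_ascii=False, indent=2))
    return cfg

# Demo against a temporary stand-in for e.g. Qwen-7B-Chat/config.json.
with tempfile.TemporaryDirectory() as d:
    p = pathlib.Path(d) / "config.json"
    p.write_text(json.dumps({"model_type": "qwen", "use_triton": "auto"}))
    cfg = set_use_triton(p, True)  # force-enable the Triton kernels
    print(cfg["use_triton"])
```

Per the README, leaving the field at "auto" enables the Triton path automatically when PyTorch ≥ 2.0 (which bundles Triton) is detected.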
triton_kernels.py CHANGED

@@ -1,3 +1,13 @@
+# Copyright (c) Alibaba Cloud.
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+# This module provides ApplyRoPE and RMSNorm kernels written in OpenAI Triton.
+# Feel free to contact the contributors if you have any questions or issues regarding this code.
+# Contributors: Shangming Cai, Zihan Wang
+# Contacts: csmthu@gmail.com, wzh1999_frog@126.com
+
 from typing import Any, Callable, Dict, Hashable, Tuple
 
 import torch
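For reference, the RMSNorm operation that one of the new Triton kernels in this module implements can be sketched in plain NumPy. This is only a reference computation for what the GPU kernel produces per row, not the Triton code itself; the `eps` default is an assumption, not taken from this repository:

```python
import numpy as np

def rmsnorm_ref(x, weight, eps=1e-6):
    """Reference RMSNorm: y = x / sqrt(mean(x**2) + eps) * weight,
    with the mean taken over the last (hidden) dimension."""
    rms = np.sqrt(np.mean(x.astype(np.float32) ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

# One row of hidden states, identity weight.
x = np.array([[1.0, 2.0, 3.0, 4.0]])
w = np.ones(4)
print(rmsnorm_ref(x, w))
```

A Triton kernel computes the same quantity one row per program instance, with the row-wise reduction done in on-chip registers instead of a NumPy mean.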