<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Efficient Inference on a Single GPU
In addition to this guide, relevant information can be found in the [guide for training on a single GPU](perf_train_gpu_one) and the [guide for inference on CPUs](perf_infer_cpu).
## Flash Attention 2
<Tip>
This feature is experimental and might change considerably in future versions. For instance, the Flash Attention 2 API might migrate to the `BetterTransformer` API in the near future.
</Tip>
Flash Attention 2 can considerably speed up the training and inference of transformer-based models. Flash Attention 2 was introduced by Tri Dao in the [official Flash Attention repository](https://github.com/Dao-AILab/flash-attention). The scientific paper on Flash Attention can be found [here](https://arxiv.org/abs/2205.14135).

Make sure to follow the installation guide in the repository mentioned above to properly install Flash Attention 2.
We natively support Flash Attention 2 for the following models:
- Llama
- Falcon
You can request Flash Attention 2 support for more models by opening a GitHub issue, and even open a Pull Request to integrate the changes. The supported models can be used for inference and training, including training with padding tokens (which is currently not supported by the `BetterTransformer` API).
<Tip>
Flash Attention 2 can only be used when the model's dtype is `fp16` or `bf16`, and it runs only on NVIDIA GPU devices. Make sure to cast your model to the appropriate dtype and load it on a supported device before using this feature.
</Tip>
### Quick usage
To enable Flash Attention 2 for a model, add `attn_implementation="flash_attention_2"` to the arguments of `from_pretrained`:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM
model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```
This model can then be used for generation or fine-tuning.
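For example, here is a minimal generation sketch with the model loaded above (the prompt and the number of generated tokens are illustrative assumptions):

```python
# Move the model to GPU (Flash Attention 2 only runs on NVIDIA GPUs) and generate.
model.to("cuda")

inputs = tokenizer("The best thing about open-source AI is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```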
### Expected speedups
You can benefit from considerable speedups for fine-tuning and inference, especially for long sequences. However, since Flash Attention does not support computing attention scores with padding tokens, the attention scores have to be manually padded/unpadded for batched inference when the sequence contains padding tokens, which leads to a significant slowdown for batched generation with padding tokens.

To overcome this, Flash Attention should be used without padding tokens in the sequence during training (e.g. by packing a dataset, i.e. concatenating sequences until reaching the maximum sequence length). An example is provided [here](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py#L516).
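As a rough illustration, such packing can look like the sketch below (the helper name and `block_size` are illustrative assumptions; the linked `run_clm.py` example implements the full version):

```python
def pack_sequences(tokenized_texts, block_size=4096):
    """Concatenate tokenized examples and cut them into fixed-size blocks,
    so that training batches contain no padding tokens."""
    concatenated = [token for text in tokenized_texts for token in text]
    total_length = (len(concatenated) // block_size) * block_size  # drop the ragged tail
    return [concatenated[i : i + block_size] for i in range(0, total_length, block_size)]

# tokenized_texts = [tokenizer(text)["input_ids"] for text in raw_texts]
# blocks = pack_sequences(tokenized_texts)
```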
Below is the expected speedup you can get for a simple forward pass on [tiiuae/falcon-7b](https://hf.co/tiiuae/falcon-7b) with a sequence length of 4096 and various batch sizes, without padding tokens:
<div style="text-align: center">
<img src="https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/falcon-7b-inference-large-seqlen.png">
</div>
Below is the expected speedup you can get for a simple forward pass on [`meta-llama/Llama-7b-hf`](https://hf.co/meta-llama/Llama-7b-hf) with a sequence length of 4096 and various batch sizes, without padding tokens:
<div style="text-align: center">
<img src="https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/llama-7b-inference-large-seqlen.png">
</div>
For sequences with padding tokens (training with padding tokens or generating with padding tokens), the input sequences need to be unpadded/padded to correctly compute the attention scores. With a relatively small sequence length, a pure forward pass creates overhead, leading to only a small speedup (in the benchmark below, 30% of the input is filled with padding tokens):
<div style="text-align: center">
<img src="https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/llama-2-small-seqlen-padding.png">
</div>
But for large sequence lengths, you can benefit from interesting speedups for pure inference (and training as well).

Note that Flash Attention also makes the attention computation more memory efficient, meaning you can train or infer on much larger sequence lengths without running into CUDA OOM issues; it can lead to memory savings of up to 20x for large sequence lengths. Check out the [official Flash Attention repository](https://github.com/Dao-AILab/flash-attention) for more details.
<div style="text-align: center">
<img src="https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/llama-2-large-seqlen-padding.png">
</div>
### Advanced usage
You can combine this feature with many existing features for model optimization. Check out a few examples below:
### Combining Flash Attention 2 and 8-bit models
You can combine this feature with 8-bit quantization:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM
model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    attn_implementation="flash_attention_2",
)
```
### Combining Flash Attention 2 and 4-bit models
You can combine this feature with 4-bit quantization:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM
model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    attn_implementation="flash_attention_2",
)
```
### Combining Flash Attention 2 and PEFT
You can combine this feature with PEFT to train adapters with Flash Attention 2 as the base model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM
from peft import LoraConfig
model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    attn_implementation="flash_attention_2",
)
lora_config = LoraConfig(
    r=8,
    task_type="CAUSAL_LM"
)
model.add_adapter(lora_config)
... # train your model
```
## BetterTransformer
[BetterTransformer](https://huggingface.co/docs/optimum/bettertransformer/overview) converts 🤗 Transformers models to use the PyTorch-native fastpath execution, which calls optimized kernels such as Flash Attention under the hood.

BetterTransformer also supports faster inference on single and multi-GPU for text, image, and audio models.
<Tip>
Flash Attention can only be used for models using the fp16 or bf16 dtype. Make sure to cast your model to the appropriate dtype before using BetterTransformer.
</Tip>
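For example, here is a minimal sketch of loading a model directly in half precision (the checkpoint is one already used later in this guide):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the checkpoint directly in fp16 so the fastpath can dispatch to Flash Attention kernels.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.float16).to("cuda")
```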
### Encoder models
The PyTorch-native [`nn.MultiHeadAttention`](https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference/) attention fastpath, called BetterTransformer, can be used with Transformers through the integration in the [🤗 Optimum library](https://huggingface.co/docs/optimum/bettertransformer/overview).

PyTorch's attention fastpath allows speeding up inference through kernel fusions and the use of [nested tensors](https://pytorch.org/docs/stable/nested.html). Detailed benchmarks can be found in [this blog post](https://medium.com/pytorch/bettertransformer-out-of-the-box-performance-for-huggingface-transformers-3fbe27d50ab2).

After installing the [`optimum`](https://github.com/huggingface/optimum) package, the relevant internal modules are replaced to use Better Transformer during inference by calling [`~PreTrainedModel.to_bettertransformer`]:
```python
model = model.to_bettertransformer()
```
The method [`~PreTrainedModel.reverse_bettertransformer`] allows going back to the canonical transformers modeling and should be used before saving the model:
```python
model = model.reverse_bettertransformer()
model.save_pretrained("saved_model")
```
Have a look at [this blog post](https://medium.com/pytorch/bettertransformer-out-of-the-box-performance-for-huggingface-transformers-3fbe27d50ab2) to learn more about what is possible with the BetterTransformer API for encoder models.
### Decoder models
For text models, especially decoder-based models (GPT, T5, Llama, etc.), the BetterTransformer API converts all attention operations to use the [`torch.nn.functional.scaled_dot_product_attention` operator](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) (SDPA), which is only available in PyTorch 2.0 and onwards.

To convert a model to BetterTransformer:
```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
# convert the model to BetterTransformer
model.to_bettertransformer()
# Use it for training or inference
```
SDPA can also call [Flash Attention](https://arxiv.org/abs/2205.14135) kernels under the hood, depending on the hardware and the problem size. To force the use of Flash Attention, or to check that it is available in a given setting (hardware, problem size), use [`torch.nn.attention.sdpa_kernel`](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html) as a context manager:
```diff
import torch
+ from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.float16).to("cuda")
# convert the model to BetterTransformer
model.to_bettertransformer()
input_text = "Hello my dog is cute and"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
+ with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
In case you see a traceback with the following error:
```bash
RuntimeError: No available kernel. Aborting execution.
```
try using the PyTorch nightly version, which may have broader coverage for Flash Attention:
```bash
pip3 install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
```
Also, make sure your model is correctly cast to float16 or bfloat16.
Have a look at [this detailed blog post](https://pytorch.org/blog/out-of-the-box-acceleration/) to read more about what is possible with the `BetterTransformer` + SDPA API.
## `bitsandbytes` integration for FP4 mixed-precision inference
You can install `bitsandbytes` and benefit from easy model compression on GPUs. Using FP4 quantization you can expect to reduce up to 8x the model size compared to its native full precision version. Check out below how to get started.
<Tip>
Note that this feature can also be used in a multi GPU setup.
</Tip>
### Requirements [[requirements-for-fp4-mixedprecision-inference]]
- Latest `bitsandbytes` library
`pip install bitsandbytes>=0.39.0`
- Install latest `accelerate` from source
`pip install git+https://github.com/huggingface/accelerate.git`
- Install latest `transformers` from source
`pip install git+https://github.com/huggingface/transformers.git`
### Running FP4 models - single GPU setup - Quickstart
You can run an FP4 model on a single GPU simply by running the following code:
```py
from transformers import AutoModelForCausalLM
model_name = "bigscience/bloom-2b5"
model_4bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)
```
泚æ: `device_map`ã¯ãªãã·ã§ã³ã§ãããæšè«æã« `device_map = 'auto'` ãèšå®ããããšãæšå¥šãããŠããŸããããã«ãããå©çšå¯èœãªãªãœãŒã¹ã«å¹ççã«ã¢ãã«ããã£ã¹ããããããŸãã
### Running FP4 models - multi GPU setup
The way to load your mixed 4-bit model on multiple GPUs is the same as in the single GPU setup (it is the same command):
```py
model_name = "bigscience/bloom-2b5"
model_4bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)
```
But you can control the GPU RAM you want to allocate on each GPU using `accelerate`. Use the `max_memory` argument as follows:
```py
max_memory_mapping = {0: "600MB", 1: "1GB"}
model_name = "bigscience/bloom-3b"
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_4bit=True, max_memory=max_memory_mapping
)
```
In this example, the first GPU will use 600MB of memory and the second one 1GB.
### Advanced usage
For more advanced usage of this method, please have a look at the [quantization](main_classes/quantization) documentation page.
## `bitsandbytes` integration for Int8 mixed-precision matrix decomposition
<Tip>
Note that this feature can also be used in a multi-GPU setup.
</Tip>
From the paper [`LLM.int8() : 8-bit Matrix Multiplication for Transformers at Scale`](https://arxiv.org/abs/2208.07339), we support Hugging Face integration for all models in the Hub with a few lines of code. The method reduces the `nn.Linear` size by 2 for half-precision (`float16` and `bfloat16`) weights and by 4 for full-precision (`float32`) weights, by operating on the outliers in half-precision with almost no impact on the predictions.

Int8 mixed-precision matrix decomposition works by separating a matrix multiplication into two streams: (1) a systematic feature outlier stream matrix-multiplied in fp16 (0.01%), (2) a regular stream of int8 matrix multiplication (99.9%). With this method, int8 inference with no predictive degradation is possible for very large models.

For more details about the method, check out the [paper](https://arxiv.org/abs/2208.07339) or our [blog post about the integration](https://huggingface.co/blog/hf-bitsandbytes-integration).
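The decomposition itself can be illustrated with a toy sketch (this is only a conceptual illustration of the outlier/regular split, not the actual bitsandbytes kernels):

```py
import torch


def llm_int8_matmul_sketch(x, weight, threshold=6.0):
    """Toy illustration of LLM.int8(): outlier feature dimensions are kept in
    higher precision, the remaining dimensions are quantized to int8."""
    # 1. Find outlier feature dimensions (columns of x with large activations).
    outlier_cols = x.abs().max(dim=0).values > threshold

    # 2. Outlier stream: kept in higher precision (fp16 in the real method).
    out_outlier = x[:, outlier_cols] @ weight[outlier_cols, :]

    # 3. Regular stream: symmetric 8-bit quantization, matmul, then dequantization.
    x_reg, w_reg = x[:, ~outlier_cols], weight[~outlier_cols, :]
    x_scale = (x_reg.abs().max(dim=1, keepdim=True).values / 127).clamp(min=1e-8)  # per-row scale
    w_scale = (w_reg.abs().max(dim=0, keepdim=True).values / 127).clamp(min=1e-8)  # per-column scale
    x_q = torch.round(x_reg / x_scale).to(torch.int8)
    w_q = torch.round(w_reg / w_scale).to(torch.int8)
    out_int8 = (x_q.float() @ w_q.float()) * x_scale * w_scale

    # The sum approximates the full-precision product x @ weight.
    return out_outlier + out_int8
```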

Note that a GPU is required to use this feature, and the kernels must be compiled for GPU. Before using the feature, make sure you have enough GPU memory to store a quarter of the model (or half of it if the model weights are in half precision).

Below are some notes to help you use this module, or follow the demos on [Google Colab](#colab-demos).
### Requirements [[requirements-for-int8-mixedprecision-matrix-decomposition]]
- If you have `bitsandbytes<0.37.0`, make sure you run on NVIDIA GPUs that support 8-bit tensor cores (Turing, Ampere or newer architectures, e.g. T4, RTX20s, RTX30s, A40-A100). For `bitsandbytes>=0.37.0`, all GPUs should be supported; you can check your GPU's compute capability with the sketch after this list.
- Install the correct version of `bitsandbytes` by running:
`pip install bitsandbytes>=0.31.5`
- Install `accelerate`:
`pip install accelerate>=0.12.0`
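A minimal sketch for checking the compute capability of your GPU (Turing corresponds to compute capability 7.5, Ampere to 8.x):

```py
import torch

# 8-bit tensor cores require Turing (7.5) or newer architectures.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
print("8-bit tensor core support:", (major, minor) >= (7, 5))
```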
### Running mixed-Int8 models - single GPU setup
After installing the required libraries, the way to load your mixed 8-bit model is as follows:
```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
model_name = "bigscience/bloom-2b5"
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=BitsAndBytesConfig(load_in_8bit=True))
```
For text generation, we recommend:

* using the model's `generate()` method instead of the `pipeline()` function. Although inference is possible with the `pipeline()` function, it is not optimized for mixed-8bit models and will be slower than using the `generate()` method. Moreover, some sampling strategies, like nucleus sampling, are not supported by the `pipeline()` function for mixed-8bit models.
* placing all inputs on the same device as the model.

Here is a simple example:
```py
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_name = "bigscience/bloom-2b5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=BitsAndBytesConfig(load_in_8bit=True))
prompt = "Hello, my llama is cute"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
generated_ids = model_8bit.generate(**inputs)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
```
### Running mixed-int8 models - multi GPU setup
The way to load your mixed 8-bit model on multiple GPUs is as follows (it is the same command as in the single GPU setup):
```py
model_name = "bigscience/bloom-2b5"
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=BitsAndBytesConfig(load_in_8bit=True))
```
But you can control the GPU RAM you want to allocate on each GPU using `accelerate`. Use the `max_memory` argument as follows:
```py
max_memory_mapping = {0: "1GB", 1: "2GB"}
model_name = "bigscience/bloom-3b"
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping
)
```
In this example, the first GPU will use 1GB of memory and the second 2GB.
### Colab demos
With this method you can run inference on models that could not previously be inferred on Google Colab. Check out the demo for running T5-11b (42GB in fp32!) with 8-bit quantization on Google Colab:
[![Open In Colab: T5-11b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1YORPWx4okIHXnjW7MSAidXN29mPVNT7F?usp=sharing)
Or this demo for BLOOM-3B:
[![Open In Colab: BLOOM-3b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4?usp=sharing)
## Advanced usage: mixing FP4 (or Int8) and BetterTransformer
Different methods can be combined together to get the best performance for your models. For example, you can use BetterTransformer with FP4 mixed-precision inference + flash attention:
```py
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", quantization_config=quantization_config)
input_text = "Hello my dog is cute and"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```