Student0809 committed on
Commit b50b784 · verified · 1 Parent(s): 1100969

Add files using upload-large-folder tool

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. docs/transformers/docs/source/en/fsdp.md +145 -0
  2. docs/transformers/docs/source/en/generation_features.md +82 -0
  3. docs/transformers/docs/source/en/gguf.md +53 -0
  4. docs/transformers/docs/source/en/glossary.md +522 -0
  5. docs/transformers/docs/source/en/gpu_selection.md +94 -0
  6. docs/transformers/docs/source/en/how_to_hack_models.md +156 -0
  7. docs/transformers/docs/source/en/hpo_train.md +167 -0
  8. docs/transformers/docs/source/en/image_processors.md +222 -0
  9. docs/transformers/docs/source/en/index.md +45 -0
  10. docs/transformers/docs/source/en/installation.md +223 -0
  11. docs/transformers/docs/source/en/internal/audio_utils.md +39 -0
  12. docs/transformers/docs/source/en/internal/file_utils.md +50 -0
  13. docs/transformers/docs/source/en/internal/generation_utils.md +446 -0
  14. docs/transformers/docs/source/en/internal/image_processing_utils.md +48 -0
  15. docs/transformers/docs/source/en/internal/import_utils.md +91 -0
  16. docs/transformers/docs/source/en/internal/model_debugging_utils.md +213 -0
  17. docs/transformers/docs/source/en/internal/modeling_utils.md +78 -0
  18. docs/transformers/docs/source/en/internal/pipelines_utils.md +44 -0
  19. docs/transformers/docs/source/en/internal/time_series_utils.md +29 -0
  20. docs/transformers/docs/source/en/internal/tokenization_utils.md +42 -0
  21. docs/transformers/docs/source/en/internal/trainer_utils.md +49 -0
  22. docs/transformers/docs/source/en/kv_cache.md +359 -0
  23. docs/transformers/docs/source/en/llm_optims.md +420 -0
  24. docs/transformers/docs/source/en/llm_tutorial.md +289 -0
  25. docs/transformers/docs/source/en/llm_tutorial_optimization.md +782 -0
  26. docs/transformers/docs/source/en/main_classes/backbones.md +60 -0
  27. docs/transformers/docs/source/en/main_classes/callback.md +137 -0
  28. docs/transformers/docs/source/en/main_classes/configuration.md +32 -0
  29. docs/transformers/docs/source/en/main_classes/data_collator.md +76 -0
  30. docs/transformers/docs/source/en/main_classes/deepspeed.md +32 -0
  31. docs/transformers/docs/source/en/main_classes/executorch.md +33 -0
  32. docs/transformers/docs/source/en/main_classes/feature_extractor.md +39 -0
  33. docs/transformers/docs/source/en/main_classes/image_processor.md +79 -0
  34. docs/transformers/docs/source/en/main_classes/keras_callbacks.md +28 -0
  35. docs/transformers/docs/source/en/main_classes/logging.md +119 -0
  36. docs/transformers/docs/source/en/main_classes/model.md +73 -0
  37. docs/transformers/docs/source/en/main_classes/onnx.md +54 -0
  38. docs/transformers/docs/source/en/main_classes/optimizer_schedules.md +76 -0
  39. docs/transformers/docs/source/en/main_classes/output.md +321 -0
  40. docs/transformers/docs/source/en/main_classes/peft.md +23 -0
  41. docs/transformers/docs/source/en/main_classes/pipelines.md +501 -0
  42. docs/transformers/docs/source/en/main_classes/processors.md +163 -0
  43. docs/transformers/docs/source/en/main_classes/quantization.md +98 -0
  44. docs/transformers/docs/source/en/main_classes/text_generation.md +59 -0
  45. docs/transformers/docs/source/en/main_classes/tokenizer.md +104 -0
  46. docs/transformers/docs/source/en/main_classes/trainer.md +54 -0
  47. docs/transformers/docs/source/en/model_doc/albert.md +307 -0
  48. docs/transformers/docs/source/en/model_doc/align.md +108 -0
  49. docs/transformers/docs/source/en/model_doc/altclip.md +116 -0
  50. docs/transformers/docs/source/en/model_doc/aria.md +112 -0
docs/transformers/docs/source/en/fsdp.md ADDED
@@ -0,0 +1,145 @@
1
+ <!--Copyright 2024 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # FullyShardedDataParallel
18
+
19
+ [Fully Sharded Data Parallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) is a [parallelism](./perf_train_gpu_many) method that combines the advantages of data and model parallelism for distributed training.
20
+
21
+ Unlike [DistributedDataParallel (DDP)](./perf_train_gpu_many#distributeddataparallel), FSDP saves more memory because it doesn't replicate a model on each GPU. It shards the model's parameters, gradients, and optimizer states across GPUs. Each model shard processes a portion of the data and the results are synchronized to speed up training.
22
+
23
+ This guide covers how to set up training a model with FSDP and [Accelerate](https://hf.co/docs/accelerate/index), a library for managing distributed training.
24
+
25
+ ```bash
26
+ pip install accelerate
27
+ ```
28
+
29
+ ## Configuration options
30
+
31
+ Always start by running the [accelerate config](https://hf.co/docs/accelerate/package_reference/cli#accelerate-config) command to help Accelerate set up the correct distributed training environment.
32
+
33
+ ```bash
34
+ accelerate config
35
+ ```
36
+
37
+ The section below discusses some of the more important FSDP configuration options. Learn more about other available options in the [fsdp_config](https://hf.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config) parameter.
38
+
39
+ ### Sharding strategy
40
+
41
+ FSDP offers several sharding strategies to distribute a model. Refer to the table below to help you choose the best strategy for your setup. Specify a strategy with the `fsdp_sharding_strategy` parameter in the configuration file.
42
+
43
+ | sharding strategy | description | parameter value |
44
+ |---|---|---|
45
+ | `FULL_SHARD` | shards model parameters, gradients, and optimizer states | `1` |
46
+ | `SHARD_GRAD_OP` | shards gradients and optimizer states | `2` |
47
+ | `NO_SHARD` | don't shard the model | `3` |
48
+ | `HYBRID_SHARD` | shards model parameters, gradients, and optimizer states within each node while each node has a full copy | `4` |
49
+ | `HYBRID_SHARD_ZERO2` | shards gradients and optimizer states within each node while each node has a full copy | `5` |
50
+
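+ These values map onto PyTorch's `ShardingStrategy` enum. As a rough illustration, a minimal sketch of wrapping a model with FSDP directly in PyTorch (assuming a distributed process group is already initialized) looks like this:
+
+ ```py
+ import torch.nn as nn
+ from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
+
+ model = nn.Linear(1024, 1024).cuda()
+
+ # FULL_SHARD corresponds to `fsdp_sharding_strategy: 1` in the Accelerate configuration file
+ sharded_model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)
+ ```
+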
51
+ ### CPU offload
52
+
53
+ Offload model parameters and gradients to the CPU when they aren't being used to save additional GPU memory. This is useful for scenarios where a model is too large even with FSDP.
54
+
55
+ Specify `fsdp_offload_params: true` in the configuration file to enable offloading.
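+
+ Under the hood this corresponds to PyTorch's `CPUOffload` setting. A minimal sketch of enabling it when constructing the FSDP module directly (assuming `model` is defined and a process group is initialized):
+
+ ```py
+ from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP
+
+ # offload_params=True moves parameters (and their gradients) to CPU when not in use,
+ # matching `fsdp_offload_params: true` in the Accelerate configuration file
+ sharded_model = FSDP(model, cpu_offload=CPUOffload(offload_params=True))
+ ```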
56
+
57
+ ### Wrapping policy
58
+
59
+ FSDP is applied by wrapping each layer in the network. The wrapping is usually applied in a nested way where the full weights are discarded after each forward pass to save memory for the next layer.
60
+
61
+ There are several wrapping policies available, but the *auto wrapping* policy is the simplest and doesn't require any changes to your code. Specify `fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP` to wrap a Transformer layer and `fsdp_transformer_layer_cls_to_wrap` to determine which layer to wrap (for example, `BertLayer`).
62
+
63
+ Size-based wrapping is also available. If a layer exceeds a certain number of parameters, it is wrapped. Specify `fsdp_wrap_policy: SIZE_BASED_WRAP` and `min_num_param` to set the minimum number of parameters for a layer to be wrapped.
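+
+ For reference, these two policies roughly correspond to the wrapping helpers in `torch.distributed.fsdp.wrap`. A minimal sketch (the layer class and parameter threshold are only examples):
+
+ ```py
+ import functools
+ from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy, transformer_auto_wrap_policy
+ from transformers.models.bert.modeling_bert import BertLayer
+
+ # TRANSFORMER_BASED_WRAP with fsdp_transformer_layer_cls_to_wrap: BertLayer
+ transformer_policy = functools.partial(transformer_auto_wrap_policy, transformer_layer_cls={BertLayer})
+
+ # SIZE_BASED_WRAP: wrap any submodule with more than 1M parameters
+ size_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)
+ # either policy can then be passed to FSDP(model, auto_wrap_policy=...)
+ ```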
64
+
65
+ ### Checkpoints
66
+
67
+ Intermediate checkpoints should be saved as a sharded state dict because saving the full state dict - even with CPU offloading - is time consuming and can cause `NCCL Timeout` errors due to indefinite hanging during broadcasting.
68
+
69
+ Specify `fsdp_state_dict_type: SHARDED_STATE_DICT` in the configuration file to save the sharded state dict. Now you can resume training from the sharded state dict with [`~accelerate.Accelerator.load_state`].
70
+
71
+ ```py
72
+ accelerator.load_state("directory/containing/checkpoints")
73
+ ```
74
+
75
+ Once training is complete though, you should save the full state dict because the sharded state dict is only compatible with FSDP.
76
+
77
+ ```py
78
+ if trainer.is_fsdp_enabled:
79
+ trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")
80
+
81
+ trainer.save_model(script_args.output_dir)
82
+ ```
83
+
84
+ ### TPU
85
+
86
+ [PyTorch XLA](https://pytorch.org/xla/release/2.1/index.html), a package for running PyTorch on XLA devices, enables FSDP on TPUs. Modify the configuration file to include the parameters below. Refer to the [xla_fsdp_settings](https://github.com/pytorch/xla/blob/2e6e183e0724818f137c8135b34ef273dea33318/torch_xla/distributed/fsdp/xla_fully_sharded_data_parallel.py#L128) parameter for additional XLA-specific parameters you can configure for FSDP.
87
+
88
+ ```yaml
89
+ xla: True # must be set to True to enable PyTorch/XLA
90
+ xla_fsdp_settings: # XLA specific FSDP parameters
91
+ xla_fsdp_grad_ckpt: True # enable gradient checkpointing
92
+ ```
93
+
94
+ ## Training
95
+
96
+ After running [accelerate config](https://hf.co/docs/accelerate/package_reference/cli#accelerate-config), your configuration file should be ready. An example configuration file is shown below that fully shards the parameter, gradient and optimizer states on two GPUs. Your file may look different depending on how you set up your configuration.
97
+
98
+ ```yaml
99
+ compute_environment: LOCAL_MACHINE
100
+ debug: false
101
+ distributed_type: FSDP
102
+ downcast_bf16: 'no'
103
+ fsdp_config:
104
+ fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
105
+ fsdp_backward_prefetch_policy: BACKWARD_PRE
106
+ fsdp_cpu_ram_efficient_loading: true
107
+ fsdp_forward_prefetch: false
108
+ fsdp_offload_params: true
109
+ fsdp_sharding_strategy: 1
110
+ fsdp_state_dict_type: SHARDED_STATE_DICT
111
+ fsdp_sync_module_states: true
112
+ fsdp_transformer_layer_cls_to_wrap: BertLayer
113
+ fsdp_use_orig_params: true
114
+ machine_rank: 0
115
+ main_training_function: main
116
+ mixed_precision: bf16
117
+ num_machines: 1
118
+ num_processes: 2
119
+ rdzv_backend: static
120
+ same_network: true
121
+ tpu_env: []
122
+ tpu_use_cluster: false
123
+ tpu_use_sudo: false
124
+ use_cpu: false
125
+ ```
126
+
127
+ Run the [accelerate launch](https://hf.co/docs/accelerate/package_reference/cli#accelerate-launch) command to launch a training script with the FSDP configurations you chose in the configuration file.
128
+
129
+ ```bash
130
+ accelerate launch my-training-script.py
131
+ ```
132
+
133
+ It is also possible to directly specify some of the FSDP arguments in the command line.
134
+
135
+ ```bash
136
+ accelerate launch --fsdp="full shard" --fsdp_config="path/to/fsdp_config/" my-training-script.py
137
+ ```
138
+
139
+ ## Resources
140
+
141
+ FSDP is a powerful tool for training large models with fewer GPUs compared to other parallelism strategies. Refer to the resources below to learn even more about FSDP.
142
+
143
+ - Follow along with the more in-depth Accelerate guide for [FSDP](https://hf.co/docs/accelerate/usage_guides/fsdp).
144
+ - Read the [Introducing PyTorch Fully Sharded Data Parallel (FSDP) API](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) blog post.
145
+ - Read the [Scaling PyTorch models on Cloud TPUs with FSDP](https://pytorch.org/blog/scaling-pytorch-models-on-cloud-tpus-with-fsdp/) blog post.
docs/transformers/docs/source/en/generation_features.md ADDED
@@ -0,0 +1,82 @@
1
+ <!--Copyright 2025 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Generation features
18
+
19
+ The [`~GenerationMixin.generate`] API supports a couple of features for building applications on top of it.
20
+
21
+ This guide will show you how to use these features.
22
+
23
+ ## Streaming
24
+
25
+ Streaming starts returning text as soon as it is generated so you don't have to wait to see the entire generated response all at once. It is important in user-facing applications because it reduces perceived latency and allows users to see the generation progression.
26
+
27
+ <div class="flex justify-center">
28
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/streaming-generation-visual-dark_360.gif"/>
29
+ </div>
30
+
31
+ > [!TIP]
32
+ > Learn more about streaming in the [Text Generation Inference](https://huggingface.co/docs/text-generation-inference/en/conceptual/streaming) docs.
33
+
34
+ Create an instance of [`TextStreamer`] with the tokenizer. Pass [`TextStreamer`] to the `streamer` parameter in [`~GenerationMixin.generate`] to stream the output one word at a time.
35
+
36
+ ```py
37
+ from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
38
+
39
+ tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
40
+ model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
41
+ inputs = tokenizer(["The secret to baking a good cake is "], return_tensors="pt")
42
+ streamer = TextStreamer(tokenizer)
43
+
44
+ _ = model.generate(**inputs, streamer=streamer, max_new_tokens=20)
45
+ ```
46
+
47
+ The `streamer` parameter is compatible with any class with a [`~TextStreamer.put`] and [`~TextStreamer.end`] method. [`~TextStreamer.put`] pushes new tokens and [`~TextStreamer.end`] flags the end of generation. You can create your own streamer class as long as they include these two methods, or you can use Transformers' basic streamer classes.
48
+
49
+ ## Watermarking
50
+
51
+ Watermarking is useful for detecting whether text is generated. The [watermarking strategy](https://hf.co/papers/2306.04634) in Transformers randomly "colors" a subset of the tokens green. When green tokens are generated, they have a small bias added to their logits, and a higher probability of being generated. You can detect generated text by comparing the proportion of green tokens to the proportion of green tokens typically found in human-generated text.
52
+
53
+ Watermarking is supported for any generative model in Transformers and doesn't require an extra classification model to detect the watermarked text.
54
+
55
+ Create a [`WatermarkingConfig`] with the bias value to add to the logits and the watermarking algorithm to use. The example below uses the `"selfhash"` algorithm, where the green token selection only depends on the current token. Pass the [`WatermarkingConfig`] to [`~GenerationMixin.generate`].
56
+
57
+ > [!TIP]
58
+ > The [`WatermarkDetector`] class detects the proportion of green tokens in generated text, which is why it is recommended to strip the prompt text, if it is much longer than the generated text. Padding can also have an effect on [`WatermarkDetector`].
59
+
60
+ ```py
61
+ from transformers import AutoTokenizer, AutoModelForCausalLM, WatermarkDetector, WatermarkingConfig
62
+
63
+ model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
64
+ tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
65
+ tokenizer.pad_token_id = tokenizer.eos_token_id
66
+ tokenizer.padding_side = "left"
67
+
68
+ inputs = tokenizer(["This is the beginning of a long story", "Alice and Bob are"], padding=True, return_tensors="pt")
69
+ input_len = inputs["input_ids"].shape[-1]
70
+
71
+ watermarking_config = WatermarkingConfig(bias=2.5, seeding_scheme="selfhash")
72
+ out = model.generate(**inputs, watermarking_config=watermarking_config, do_sample=False, max_length=20)
73
+ ```
74
+
75
+ Create an instance of [`WatermarkDetector`] and pass the model output to it to detect whether the text is machine-generated. The [`WatermarkDetector`] must have the same [`WatermarkingConfig`] used during generation.
76
+
77
+ ```py
78
+ detector = WatermarkDetector(model_config=model.config, device="cpu", watermarking_config=watermarking_config)
79
+ detection_out = detector(out, return_dict=True)
80
+ detection_out.prediction
81
+ array([True, True])
82
+ ```
docs/transformers/docs/source/en/gguf.md ADDED
@@ -0,0 +1,53 @@
1
+ <!--Copyright 2024 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # GGUF
18
+
19
+ [GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) is a file format used to store models for inference with [GGML](https://github.com/ggerganov/ggml), a fast and lightweight inference framework written in C and C++. GGUF is a single-file format containing the model metadata and tensors.
20
+
21
+ <div class="flex justify-center">
22
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/gguf-spec.png"/>
23
+ </div>
24
+
25
+ The GGUF format also supports many quantized data types (refer to [quantization type table](https://hf.co/docs/hub/en/gguf#quantization-types) for a complete list of supported quantization types) which saves a significant amount of memory, making inference with large models like Whisper and Llama feasible on local and edge devices.
26
+
27
+ Transformers supports loading models stored in the GGUF format for further training or finetuning. The GGUF checkpoint is **dequantized to fp32** where the full model weights are available and compatible with PyTorch.
28
+
29
+ > [!TIP]
30
+ > Models that support GGUF include Llama, Mistral, Qwen2, Qwen2Moe, Phi3, Bloom, Falcon, StableLM, GPT2, Starcoder2, and [more](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/ggml.py)
31
+
32
+ Add the `gguf_file` parameter to [`~PreTrainedModel.from_pretrained`] to specify the GGUF file to load.
33
+
34
+ ```py
35
+ # pip install gguf
36
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
37
+
38
+ model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
39
+ filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"
40
+
41
+ torch_dtype = torch.float32 # could be torch.float16 or torch.bfloat16 too
42
+ tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
43
+ model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename, torch_dtype=torch_dtype)
44
+ ```
45
+
46
+ Once you're done tinkering with the model, save and convert it back to the GGUF format with the [convert_hf_to_gguf.py](https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py) script.
47
+
48
+ ```py
49
+ tokenizer.save_pretrained("directory")
50
+ model.save_pretrained("directory")
51
+
52
+ !python ${path_to_llama_cpp}/convert_hf_to_gguf.py ${directory}
53
+ ```
docs/transformers/docs/source/en/glossary.md ADDED
@@ -0,0 +1,522 @@
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Glossary
18
+
19
+ This glossary defines general machine learning and 🤗 Transformers terms to help you better understand the
20
+ documentation.
21
+
22
+ ## A
23
+
24
+ ### attention mask
25
+
26
+ The attention mask is an optional argument used when batching sequences together.
27
+
28
+ <Youtube id="M6adb1j2jPI"/>
29
+
30
+ This argument indicates to the model which tokens should be attended to, and which should not.
31
+
32
+ For example, consider these two sequences:
33
+
34
+ ```python
35
+ >>> from transformers import BertTokenizer
36
+
37
+ >>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-cased")
38
+
39
+ >>> sequence_a = "This is a short sequence."
40
+ >>> sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."
41
+
42
+ >>> encoded_sequence_a = tokenizer(sequence_a)["input_ids"]
43
+ >>> encoded_sequence_b = tokenizer(sequence_b)["input_ids"]
44
+ ```
45
+
46
+ The encoded versions have different lengths:
47
+
48
+ ```python
49
+ >>> len(encoded_sequence_a), len(encoded_sequence_b)
50
+ (8, 19)
51
+ ```
52
+
53
+ Therefore, we can't put them together in the same tensor as-is. The first sequence needs to be padded up to the length
54
+ of the second one, or the second one needs to be truncated down to the length of the first one.
55
+
56
+ In the first case, the list of IDs will be extended by the padding indices. We can pass a list to the tokenizer and ask
57
+ it to pad like this:
58
+
59
+ ```python
60
+ >>> padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)
61
+ ```
62
+
63
+ We can see that 0s have been added on the right of the first sentence to make it the same length as the second one:
64
+
65
+ ```python
66
+ >>> padded_sequences["input_ids"]
67
+ [[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]
68
+ ```
69
+
70
+ This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating the
71
+ position of the padded indices so that the model does not attend to them. For the [`BertTokenizer`], `1` indicates a
72
+ value that should be attended to, while `0` indicates a padded value. This attention mask is in the dictionary returned
73
+ by the tokenizer under the key "attention_mask":
74
+
75
+ ```python
76
+ >>> padded_sequences["attention_mask"]
77
+ [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
78
+ ```
79
+
80
+ ### autoencoding models
81
+
82
+ See [encoder models](#encoder-models) and [masked language modeling](#masked-language-modeling-mlm)
83
+
84
+ ### autoregressive models
85
+
86
+ See [causal language modeling](#causal-language-modeling) and [decoder models](#decoder-models)
87
+
88
+ ## B
89
+
90
+ ### backbone
91
+
92
+ The backbone is the network (embeddings and layers) that outputs the raw hidden states or features. It is usually connected to a [head](#head) which accepts the features as its input to make a prediction. For example, [`ViTModel`] is a backbone without a specific head on top. Other models, such as [DPT](model_doc/dpt), can also use [`ViTModel`] as a backbone.
93
+
94
+ ## C
95
+
96
+ ### causal language modeling
97
+
98
+ A pretraining task where the model reads the texts in order and has to predict the next word. It's usually done by
99
+ reading the whole sentence but using a mask inside the model to hide the future tokens at a certain timestep.
100
+
101
+ ### channel
102
+
103
+ Color images are made up of some combination of values in three channels: red, green, and blue (RGB), while grayscale images only have one channel. In 🤗 Transformers, the channel can be the first or last dimension of an image's tensor: [`n_channels`, `height`, `width`] or [`height`, `width`, `n_channels`].
104
+
105
+ ### connectionist temporal classification (CTC)
106
+
107
+ An algorithm which allows a model to learn without knowing exactly how the input and output are aligned; CTC calculates the distribution of all possible outputs for a given input and chooses the most likely output from it. CTC is commonly used in speech recognition tasks because speech doesn't always cleanly align with the transcript for a variety of reasons such as a speaker's different speech rates.
108
+
109
+ ### convolution
110
+
111
+ A type of layer in a neural network where the input matrix is multiplied element-wise by a smaller matrix (kernel or filter) and the values are summed up in a new matrix. This is known as a convolutional operation which is repeated over the entire input matrix. Each operation is applied to a different segment of the input matrix. Convolutional neural networks (CNNs) are commonly used in computer vision.
112
+
113
+ ## D
114
+
115
+ ### DataParallel (DP)
116
+
117
+ Parallelism technique for training on multiple GPUs where the same setup is replicated multiple times, with each instance
118
+ receiving a distinct data slice. The processing is done in parallel and all setups are synchronized at the end of each training step.
119
+
120
+ Learn more about how DataParallel works [here](perf_train_gpu_many#dataparallel-vs-distributeddataparallel).
121
+
122
+ ### decoder input IDs
123
+
124
+ This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. These
125
+ inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually built in a
126
+ way specific to each model.
127
+
128
+ Most encoder-decoder models (BART, T5) create their `decoder_input_ids` on their own from the `labels`. In such models,
129
+ passing the `labels` is the preferred way to handle training.
130
+
131
+ Please check each model's docs to see how they handle these input IDs for sequence to sequence training.
132
+
133
+ ### decoder models
134
+
135
+ Also referred to as autoregressive models, decoder models involve a pretraining task (called causal language modeling) where the model reads the texts in order and has to predict the next word. It's usually done by
136
+ reading the whole sentence with a mask to hide future tokens at a certain timestep.
137
+
138
+ <Youtube id="d_ixlCubqQw"/>
139
+
140
+ ### deep learning (DL)
141
+
142
+ Machine learning algorithms which use neural networks with several layers.
143
+
144
+ ## E
145
+
146
+ ### encoder models
147
+
148
+ Also known as autoencoding models, encoder models take an input (such as text or images) and transform them into a condensed numerical representation called an embedding. Oftentimes, encoder models are pretrained using techniques like [masked language modeling](#masked-language-modeling-mlm), which masks parts of the input sequence and forces the model to create more meaningful representations.
149
+
150
+ <Youtube id="H39Z_720T5s"/>
151
+
152
+ ## F
153
+
154
+ ### feature extraction
155
+
156
+ The process of selecting and transforming raw data into a set of features that are more informative and useful for machine learning algorithms. Some examples of feature extraction include transforming raw text into word embeddings and extracting important features such as edges or shapes from image/video data.
157
+
158
+ ### feed forward chunking
159
+
160
+ In each residual attention block in transformers the self-attention layer is usually followed by 2 feed forward layers.
161
+ The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g., for
162
+ `google-bert/bert-base-uncased`).
163
+
164
+ For an input of size `[batch_size, sequence_length]`, the memory required to store the intermediate feed forward
165
+ embeddings `[batch_size, sequence_length, config.intermediate_size]` can account for a large fraction of the memory
166
+ use. The authors of [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) noticed that since the
167
+ computation is independent of the `sequence_length` dimension, it is mathematically equivalent to compute the output
168
+ embeddings of both feed forward layers `[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n`
169
+ individually and concat them afterward to `[batch_size, sequence_length, config.hidden_size]` with `n = sequence_length`, which trades increased computation time against reduced memory use, but yields a mathematically
170
+ **equivalent** result.
171
+
172
+ For models employing the function [`apply_chunking_to_forward`], the `chunk_size` defines the number of output
173
+ embeddings that are computed in parallel and thus defines the trade-off between memory and time complexity. If
174
+ `chunk_size` is set to 0, no feed forward chunking is done.
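+
+ A minimal sketch of how [`apply_chunking_to_forward`] is used (the feed forward module and shapes are only illustrative):
+
+ ```python
+ >>> import torch
+ >>> from transformers.pytorch_utils import apply_chunking_to_forward
+
+ >>> hidden_states = torch.randn(2, 128, 768)  # [batch_size, sequence_length, hidden_size]
+ >>> feed_forward = torch.nn.Linear(768, 768)
+
+ >>> # apply `feed_forward` to chunks of 32 positions along the sequence dimension (dim=1)
+ >>> output = apply_chunking_to_forward(feed_forward, 32, 1, hidden_states)
+ >>> output.shape
+ torch.Size([2, 128, 768])
+ ```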
175
+
176
+ ### finetuned models
177
+
178
+ Finetuning is a form of transfer learning which involves taking a pretrained model, freezing its weights, and replacing the output layer with a newly added [model head](#head). The model head is trained on your target dataset.
179
+
180
+ See the [Fine-tune a pretrained model](https://huggingface.co/docs/transformers/training) tutorial for more details, and learn how to fine-tune models with 🤗 Transformers.
181
+
182
+ ## H
183
+
184
+ ### head
185
+
186
+ The model head refers to the last layer of a neural network that accepts the raw hidden states and projects them onto a different dimension. There is a different model head for each task. For example:
187
+
188
+ * [`GPT2ForSequenceClassification`] is a sequence classification head - a linear layer - on top of the base [`GPT2Model`].
189
+ * [`ViTForImageClassification`] is an image classification head - a linear layer on top of the final hidden state of the `CLS` token - on top of the base [`ViTModel`].
190
+ * [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-ctc) on top of the base [`Wav2Vec2Model`].
191
+
192
+ ## I
193
+
194
+ ### image patch
195
+
196
+ Vision-based Transformers models split an image into smaller patches which are linearly embedded, and then passed as a sequence to the model. You can find the `patch_size` - or resolution - of the model in its configuration.
197
+
198
+ ### inference
199
+
200
+ Inference is the process of evaluating a model on new data after training is complete. See the [Pipeline for inference](https://huggingface.co/docs/transformers/pipeline_tutorial) tutorial to learn how to perform inference with 🤗 Transformers.
201
+
202
+ ### input IDs
203
+
204
+ The input ids are often the only required parameters to be passed to the model as input. They are token indices,
205
+ numerical representations of tokens building the sequences that will be used as input by the model.
206
+
207
+ <Youtube id="VFp38yj8h3A"/>
208
+
209
+ Each tokenizer works differently but the underlying mechanism remains the same. Here's an example using the BERT
210
+ tokenizer, which is a [WordPiece](https://arxiv.org/pdf/1609.08144.pdf) tokenizer:
211
+
212
+ ```python
213
+ >>> from transformers import BertTokenizer
214
+
215
+ >>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-cased")
216
+
217
+ >>> sequence = "A Titan RTX has 24GB of VRAM"
218
+ ```
219
+
220
+ The tokenizer takes care of splitting the sequence into tokens available in the tokenizer vocabulary.
221
+
222
+ ```python
223
+ >>> tokenized_sequence = tokenizer.tokenize(sequence)
224
+ ```
225
+
226
+ The tokens are either words or subwords. Here for instance, "VRAM" wasn't in the model vocabulary, so it's been split
227
+ in "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash prefix
228
+ is added for "RA" and "M":
229
+
230
+ ```python
231
+ >>> print(tokenized_sequence)
232
+ ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
233
+ ```
234
+
235
+ These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding the sentence to the tokenizer, which leverages the Rust implementation of [🤗 Tokenizers](https://github.com/huggingface/tokenizers) for peak performance.
236
+
237
+ ```python
238
+ >>> inputs = tokenizer(sequence)
239
+ ```
240
+
241
+ The tokenizer returns a dictionary with all the arguments necessary for its corresponding model to work properly. The
242
+ token indices are under the key `input_ids`:
243
+
244
+ ```python
245
+ >>> encoded_sequence = inputs["input_ids"]
246
+ >>> print(encoded_sequence)
247
+ [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]
248
+ ```
249
+
250
+ Note that the tokenizer automatically adds "special tokens" (if the associated model relies on them) which are special
251
+ IDs the model sometimes uses.
252
+
253
+ If we decode the previous sequence of ids,
254
+
255
+ ```python
256
+ >>> decoded_sequence = tokenizer.decode(encoded_sequence)
257
+ ```
258
+
259
+ we will see
260
+
261
+ ```python
262
+ >>> print(decoded_sequence)
263
+ [CLS] A Titan RTX has 24GB of VRAM [SEP]
264
+ ```
265
+
266
+ because this is the way a [`BertModel`] is going to expect its inputs.
267
+
268
+ ## L
269
+
270
+ ### labels
271
+
272
+ The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels
273
+ should be the expected prediction of the model: it will use the standard loss in order to compute the loss between its
274
+ predictions and the expected value (the label).
275
+
276
+ These labels are different according to the model head, for example:
277
+
278
+ - For sequence classification models, ([`BertForSequenceClassification`]), the model expects a tensor of dimension
279
+ `(batch_size)` with each value of the batch corresponding to the expected label of the entire sequence.
280
+ - For token classification models, ([`BertForTokenClassification`]), the model expects a tensor of dimension
281
+ `(batch_size, seq_length)` with each value corresponding to the expected label of each individual token.
282
+ - For masked language modeling, ([`BertForMaskedLM`]), the model expects a tensor of dimension `(batch_size,
283
+ seq_length)` with each value corresponding to the expected label of each individual token: the labels being the token
284
+ ID for the masked token, and values to be ignored for the rest (usually -100).
285
+ - For sequence to sequence tasks, ([`BartForConditionalGeneration`], [`MBartForConditionalGeneration`]), the model
286
+ expects a tensor of dimension `(batch_size, tgt_seq_length)` with each value corresponding to the target sequences
287
+ associated with each input sequence. During training, both BART and T5 will make the appropriate
288
+ `decoder_input_ids` and decoder attention masks internally. They usually do not need to be supplied. This does not
289
+ apply to models leveraging the Encoder-Decoder framework.
290
+ - For image classification models, ([`ViTForImageClassification`]), the model expects a tensor of dimension
291
+ `(batch_size)` with each value of the batch corresponding to the expected label of each individual image.
292
+ - For semantic segmentation models, ([`SegformerForSemanticSegmentation`]), the model expects a tensor of dimension
293
+ `(batch_size, height, width)` with each value of the batch corresponding to the expected label of each individual pixel.
294
+ - For object detection models, ([`DetrForObjectDetection`]), the model expects a list of dictionaries with a
295
+ `class_labels` and `boxes` key where each value of the batch corresponds to the expected label and number of bounding boxes of each individual image.
296
+ - For automatic speech recognition models, ([`Wav2Vec2ForCTC`]), the model expects a tensor of dimension `(batch_size,
297
+ target_length)` with each value corresponding to the expected label of each individual token.
298
+
299
+ <Tip>
300
+
301
+ Each model's labels may be different, so be sure to always check the documentation of each model for more information
302
+ about their specific labels!
303
+
304
+ </Tip>
305
+
306
+ The base models ([`BertModel`]) do not accept labels, as these are the base transformer models, simply outputting
307
+ features.
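+
+ For instance, a minimal sketch of passing labels to a sequence classification head so the loss is computed for you (the checkpoint and label value are only examples):
+
+ ```python
+ >>> import torch
+ >>> from transformers import AutoTokenizer, BertForSequenceClassification
+
+ >>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
+ >>> model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=2)
+
+ >>> inputs = tokenizer("This film was great!", return_tensors="pt")
+ >>> labels = torch.tensor([1])  # one label per sequence in the batch
+ >>> outputs = model(**inputs, labels=labels)
+ >>> outputs.loss  # returned because `labels` was provided
+ ```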
308
+
309
+ ### large language models (LLM)
310
+
311
+ A generic term that refers to transformer language models (GPT-3, BLOOM, OPT) that were trained on a large quantity of data. These models also tend to have a large number of learnable parameters (e.g. 175 billion for GPT-3).
312
+
313
+ ## M
314
+
315
+ ### masked language modeling (MLM)
316
+
317
+ A pretraining task where the model sees a corrupted version of the texts, usually done by
318
+ masking some tokens randomly, and has to predict the original text.
319
+
320
+ ### multimodal
321
+
322
+ A task that combines texts with another kind of inputs (for instance images).
323
+
324
+ ## N
325
+
326
+ ### Natural language generation (NLG)
327
+
328
+ All tasks related to generating text (for instance, [Write With Transformers](https://transformer.huggingface.co/), translation).
329
+
330
+ ### Natural language processing (NLP)
331
+
332
+ A generic way to say "deal with texts".
333
+
334
+ ### Natural language understanding (NLU)
335
+
336
+ All tasks related to understanding what is in a text (for instance classifying the
337
+ whole text, individual words).
338
+
339
+ ## P
340
+
341
+ ### pipeline
342
+
343
+ A pipeline in 🤗 Transformers is an abstraction referring to a series of steps that are executed in a specific order to preprocess and transform data and return a prediction from a model. Some example stages found in a pipeline might be data preprocessing, feature extraction, and normalization.
344
+
345
+ For more details, see [Pipelines for inference](https://huggingface.co/docs/transformers/pipeline_tutorial).
346
+
347
+ ### PipelineParallel (PP)
348
+
349
+ Parallelism technique in which the model is split up vertically (layer-level) across multiple GPUs, so that only one or
350
+ several layers of the model are placed on a single GPU. Each GPU processes in parallel different stages of the pipeline
351
+ and works on a small chunk of the batch. Learn more about how PipelineParallel works [here](perf_train_gpu_many#from-naive-model-parallelism-to-pipeline-parallelism).
352
+
353
+ ### pixel values
354
+
355
+ A tensor of the numerical representations of an image that is passed to a model. The pixel values have a shape of [`batch_size`, `num_channels`, `height`, `width`], and are generated from an image processor.
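+
+ A minimal sketch of producing pixel values with an image processor (the checkpoint and image size are only examples):
+
+ ```python
+ >>> from PIL import Image
+ >>> from transformers import AutoImageProcessor
+
+ >>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
+ >>> image = Image.new("RGB", (640, 480))  # stand-in for a real image
+ >>> pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
+ >>> pixel_values.shape
+ torch.Size([1, 3, 224, 224])
+ ```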
356
+
357
+ ### pooling
358
+
359
+ An operation that reduces a matrix into a smaller matrix, either by taking the maximum or average of the pooled dimension(s). Pooling layers are commonly found between convolutional layers to downsample the feature representation.
360
+
361
+ ### position IDs
362
+
363
+ Contrary to RNNs that have the position of each token embedded within them, transformers are unaware of the position of
364
+ each token. Therefore, the position IDs (`position_ids`) are used by the model to identify each token's position in the
365
+ list of tokens.
366
+
367
+ They are an optional parameter. If no `position_ids` are passed to the model, the IDs are automatically created as
368
+ absolute positional embeddings.
369
+
370
+ Absolute positional embeddings are selected in the range `[0, config.max_position_embeddings - 1]`. Some models use
371
+ other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
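+
+ A minimal sketch of passing explicit `position_ids` (equivalent to what a model like BERT builds by default when none are provided):
+
+ ```python
+ >>> import torch
+ >>> from transformers import BertModel, BertTokenizer
+
+ >>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-cased")
+ >>> model = BertModel.from_pretrained("google-bert/bert-base-cased")
+ >>> inputs = tokenizer("A Titan RTX has 24GB of VRAM", return_tensors="pt")
+
+ >>> position_ids = torch.arange(inputs["input_ids"].shape[-1]).unsqueeze(0)
+ >>> outputs = model(**inputs, position_ids=position_ids)
+ ```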
372
+
373
+ ### preprocessing
374
+
375
+ The task of preparing raw data into a format that can be easily consumed by machine learning models. For example, text is typically preprocessed by tokenization. To gain a better idea of what preprocessing looks like for other input types, check out the [Preprocess](https://huggingface.co/docs/transformers/preprocessing) tutorial.
376
+
377
+ ### pretrained model
378
+
379
+ A model that has been pretrained on some data (for instance all of Wikipedia). Pretraining methods involve a
380
+ self-supervised objective, which can be reading the text and trying to predict the next word (see [causal language
381
+ modeling](#causal-language-modeling)) or masking some words and trying to predict them (see [masked language
382
+ modeling](#masked-language-modeling-mlm)).
383
+
384
+ Speech and vision models have their own pretraining objectives. For example, Wav2Vec2 is a speech model pretrained on a contrastive task which requires the model to identify the "true" speech representation from a set of "false" speech representations. On the other hand, BEiT is a vision model pretrained on a masked image modeling task which masks some of the image patches and requires the model to predict the masked patches (similar to the masked language modeling objective).
385
+
386
+ ## R
387
+
388
+ ### recurrent neural network (RNN)
389
+
390
+ A type of model that uses a loop over a layer to process texts.
391
+
392
+ ### representation learning
393
+
394
+ A subfield of machine learning which focuses on learning meaningful representations of raw data. Some examples of representation learning techniques include word embeddings, autoencoders, and Generative Adversarial Networks (GANs).
395
+
396
+ ## S
397
+
398
+ ### sampling rate
399
+
400
+ A measurement in hertz of the number of samples (the audio signal) taken per second. The sampling rate is a result of discretizing a continuous signal such as speech.
401
+
402
+ ### self-attention
403
+
404
+ Each element of the input finds out which other elements of the input it should attend to.
405
+
406
+ ### self-supervised learning
407
+
408
+ A category of machine learning techniques in which a model creates its own learning objective from unlabeled data. It differs from [unsupervised learning](#unsupervised-learning) and [supervised learning](#supervised-learning) in that the learning process is supervised, but not explicitly from the user.
409
+
410
+ One example of self-supervised learning is [masked language modeling](#masked-language-modeling-mlm), where a model is passed sentences with a proportion of its tokens removed and learns to predict the missing tokens.
411
+
412
+ ### semi-supervised learning
413
+
414
+ A broad category of machine learning training techniques that leverages a small amount of labeled data with a larger quantity of unlabeled data to improve the accuracy of a model, unlike [supervised learning](#supervised-learning) and [unsupervised learning](#unsupervised-learning).
415
+
416
+ An example of a semi-supervised learning approach is "self-training", in which a model is trained on labeled data, and then used to make predictions on the unlabeled data. The portion of the unlabeled data that the model predicts with the most confidence gets added to the labeled dataset and used to retrain the model.
417
+
418
+ ### sequence-to-sequence (seq2seq)
419
+
420
+ Models that generate a new sequence from an input, like translation models, or summarization models (such as
421
+ [Bart](model_doc/bart) or [T5](model_doc/t5)).
422
+
423
+ ### Sharded DDP
424
+
425
+ Another name for the foundational [ZeRO](#zero-redundancy-optimizer-zero) concept as used by various other implementations of ZeRO.
426
+
427
+ ### stride
428
+
429
+ In [convolution](#convolution) or [pooling](#pooling), the stride refers to the distance the kernel is moved over a matrix. A stride of 1 means the kernel is moved one pixel over at a time, and a stride of 2 means the kernel is moved two pixels over at a time.
430
+
431
+ ### supervised learning
432
+
433
+ A form of model training that directly uses labeled data to correct and instruct model performance. Data is fed into the model being trained, and its predictions are compared to the known labels. The model updates its weights based on how incorrect its predictions were, and the process is repeated to optimize model performance.
434
+
435
+ ## T
436
+
437
+ ### Tensor Parallelism (TP)
438
+
439
+ Parallelism technique for training on multiple GPUs in which each tensor is split up into multiple chunks, so instead of
440
+ having the whole tensor reside on a single GPU, each shard of the tensor resides on its designated GPU. Shards gets
441
+ processed separately and in parallel on different GPUs and the results are synced at the end of the processing step.
442
+ This is what is sometimes called horizontal parallelism, as the splitting happens on horizontal level.
443
+ Learn more about Tensor Parallelism [here](perf_train_gpu_many#tensor-parallelism).
444
+
445
+ ### token
446
+
447
+ A part of a sentence, usually a word, but can also be a subword (non-common words are often split in subwords) or a
448
+ punctuation symbol.
449
+
450
+ ### token Type IDs
451
+
452
+ Some models' purpose is to do classification on pairs of sentences or question answering.
453
+
454
+ <Youtube id="0u3ioSwev3s"/>
455
+
456
+ These require two different sequences to be joined in a single "input_ids" entry, which usually is performed with the
457
+ help of special tokens, such as the classifier (`[CLS]`) and separator (`[SEP]`) tokens. For example, the BERT model
458
+ builds its two sequence input as such:
459
+
460
+ ```python
461
+ >>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
462
+ ```
463
+
464
+ We can use our tokenizer to automatically generate such a sentence by passing the two sequences to `tokenizer` as two
465
+ arguments (and not a list, like before) like this:
466
+
467
+ ```python
468
+ >>> from transformers import BertTokenizer
469
+
470
+ >>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-cased")
471
+ >>> sequence_a = "HuggingFace is based in NYC"
472
+ >>> sequence_b = "Where is HuggingFace based?"
473
+
474
+ >>> encoded_dict = tokenizer(sequence_a, sequence_b)
475
+ >>> decoded = tokenizer.decode(encoded_dict["input_ids"])
476
+ ```
477
+
478
+ which will return:
479
+
480
+ ```python
481
+ >>> print(decoded)
482
+ [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
483
+ ```
484
+
485
+ This is enough for some models to understand where one sequence ends and where another begins. However, other models,
486
+ such as BERT, also deploy token type IDs (also called segment IDs). They are represented as a binary mask identifying
487
+ the two types of sequence in the model.
488
+
489
+ The tokenizer returns this mask as the "token_type_ids" entry:
490
+
491
+ ```python
492
+ >>> encoded_dict["token_type_ids"]
493
+ [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
494
+ ```
495
+
496
+ The first sequence, the "context" used for the question, has all its tokens represented by a `0`, whereas the second
497
+ sequence, corresponding to the "question", has all its tokens represented by a `1`.
498
+
499
+ Some models, like [`XLNetModel`] use an additional token represented by a `2`.
500
+
501
+ ### transfer learning
502
+
503
+ A technique that involves taking a pretrained model and adapting it to a dataset specific to your task. Instead of training a model from scratch, you can leverage knowledge obtained from an existing model as a starting point. This speeds up the learning process and reduces the amount of training data needed.
504
+
505
+ ### transformer
506
+
507
+ Self-attention based deep learning model architecture.
508
+
509
+ ## U
510
+
511
+ ### unsupervised learning
512
+
513
+ A form of model training in which data provided to the model is not labeled. Unsupervised learning techniques leverage statistical information of the data distribution to find patterns useful for the task at hand.
514
+
515
+ ## Z
516
+
517
+ ### Zero Redundancy Optimizer (ZeRO)
518
+
519
+ Parallelism technique which performs sharding of the tensors somewhat similar to [TensorParallel](#tensor-parallelism-tp),
520
+ except the whole tensor gets reconstructed in time for a forward or backward computation, therefore the model doesn't need
521
+ to be modified. This method also supports various offloading techniques to compensate for limited GPU memory.
522
+ Learn more about ZeRO [here](perf_train_gpu_many#zero-data-parallelism).
docs/transformers/docs/source/en/gpu_selection.md ADDED
@@ -0,0 +1,94 @@
1
+ <!--Copyright 2025 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # GPU selection
18
+
19
+ During distributed training, you can specify the number of GPUs to use and in what order. This can be useful when you have GPUs with different computing power and you want to use the faster GPU first. Or you could only use a subset of the available GPUs. The selection process works for both [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) and [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html). You don't need Accelerate or [DeepSpeed integration](./main_classes/deepspeed).
20
+
21
+ This guide will show you how to select the number of GPUs to use and the order to use them in.
22
+
23
+ ## Number of GPUs
24
+
25
+ For example, if there are 4 GPUs and you only want to use the first 2, run the command below.
26
+
27
+ <hfoptions id="select-gpu">
28
+ <hfoption id="torchrun">
29
+
30
+ Use `--nproc_per_node` to select how many GPUs to use.
31
+
32
+ ```bash
33
+ torchrun --nproc_per_node=2 trainer-program.py ...
34
+ ```
35
+
36
+ </hfoption>
37
+ <hfoption id="Accelerate">
38
+
39
+ Use `--num_processes` to select how many GPUs to use.
40
+
41
+ ```bash
42
+ accelerate launch --num_processes 2 trainer-program.py ...
43
+ ```
44
+
45
+ </hfoption>
46
+ <hfoption id="DeepSpeed">
47
+
48
+ Use `--num_gpus` to select how many GPUs to use.
49
+
50
+ ```bash
51
+ deepspeed --num_gpus 2 trainer-program.py ...
52
+ ```
53
+
54
+ </hfoption>
55
+ </hfoptions>
56
+
57
+ ### Order of GPUs
58
+
59
+ To select specific GPUs to use and their order, configure the `CUDA_VISIBLE_DEVICES` environment variable. It is easiest to set the environment variable in `~/.bashrc` or another startup config file. `CUDA_VISIBLE_DEVICES` is used to map which GPUs are used. For example, if there are 4 GPUs (0, 1, 2, 3) and you only want to use GPUs 0 and 2:
60
+
61
+ ```bash
62
+ CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ...
63
+ ```
64
+
65
+ Only the 2 physical GPUs (0 and 2) are "visible" to PyTorch and these are mapped to `cuda:0` and `cuda:1` respectively. You can also reverse the order of the GPUs to use 2 first. The mapping becomes `cuda:1` for GPU 0 and `cuda:0` for GPU 2.
66
+
67
+ ```bash
68
+ CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...
69
+ ```
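+
+ You can verify the mapping from inside your program. A minimal sketch:
+
+ ```py
+ import torch
+
+ # with CUDA_VISIBLE_DEVICES=2,0, index 0 here is physical GPU 2 and index 1 is physical GPU 0
+ print(torch.cuda.device_count())
+ for i in range(torch.cuda.device_count()):
+     print(i, torch.cuda.get_device_name(i))
+ ```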
70
+
71
+ You can also set the `CUDA_VISIBLE_DEVICES` environment variable to an empty value to create an environment without GPUs.
72
+
73
+ ```bash
74
+ CUDA_VISIBLE_DEVICES= python trainer-program.py ...
75
+ ```
76
+
77
+ > [!WARNING]
78
+ > As with any environment variable, it can be exported instead of being added to the command line. However, this is not recommended because it can be confusing if you forget how the environment variable was set up and you end up using the wrong GPUs. Instead, it is common practice to set the environment variable for a specific training run on the same command line.
79
+
80
+ `CUDA_DEVICE_ORDER` is an alternative environment variable you can use to control how the GPUs are ordered. You can order according to the following.
81
+
82
+ 1. PCIe bus IDs that match the order of [`nvidia-smi`](https://developer.nvidia.com/nvidia-system-management-interface) and [`rocm-smi`](https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/.doxygen/docBin/html/index.html) for NVIDIA and AMD GPUs respectively.
83
+
84
+ ```bash
85
+ export CUDA_DEVICE_ORDER=PCI_BUS_ID
86
+ ```
87
+
88
+ 2. GPU compute ability.
89
+
90
+ ```bash
91
+ export CUDA_DEVICE_ORDER=FASTEST_FIRST
92
+ ```
93
+
94
+ The `CUDA_DEVICE_ORDER` is especially useful if your training setup consists of an older and newer GPU, where the older GPU appears first, but you cannot physically swap the cards to make the newer GPU appear first. In this case, set `CUDA_DEVICE_ORDER=FASTEST_FIRST` to always use the newer and faster GPU first (`nvidia-smi` or `rocm-smi` still reports the GPUs in their PCIe order). Or you could also set `export CUDA_VISIBLE_DEVICES=1,0`.
docs/transformers/docs/source/en/how_to_hack_models.md ADDED
@@ -0,0 +1,156 @@
1
+ <!--Copyright 2024 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
12
+ rendered properly in your Markdown viewer.
13
+
14
+ -->
15
+
16
+ # Customizing model components
17
+
18
+ Another way to customize a model is to modify its components, rather than writing a new model entirely, allowing you to tailor a model to your specific use case. For example, you can add new layers or optimize the attention mechanism of an architecture. Customizations are applied directly to a Transformers model so that you can continue to use features such as [`Trainer`], [`PreTrainedModel`], and the [PEFT](https://huggingface.co/docs/peft/en/index) library.
19
+
20
+ This guide will show you how to customize a model's attention mechanism in order to apply [Low-Rank Adaptation (LoRA)](https://huggingface.co/docs/peft/conceptual_guides/adapter#low-rank-adaptation-lora) to it.
21
+
22
+ > [!TIP]
23
+ > The [clear_import_cache](https://github.com/huggingface/transformers/blob/9985d06add07a4cc691dc54a7e34f54205c04d40/src/transformers/utils/import_utils.py#L2286) utility is very useful when you're iteratively modifying and developing model code. It removes all cached Transformers modules and allows Python to reload the modified code without constantly restarting your environment.
24
+ >
25
+ > ```py
26
+ > from transformers import AutoModel
27
+ > from transformers.utils.import_utils import clear_import_cache
28
+ >
29
+ > model = AutoModel.from_pretrained("bert-base-uncased")
30
+ > # modifications to model code
31
+ > # clear cache to reload modified code
32
+ > clear_import_cache()
33
+ > # re-import to use updated code
34
+ > model = AutoModel.from_pretrained("bert-base-uncased")
35
+ > ```
36
+
37
+ ## Attention class
38
+
39
+ [Segment Anything](./model_doc/sam) is an image segmentation model that combines the query, key, and value (`qkv`) projections into a single projection in its attention mechanism. To reduce the number of trainable parameters and computational overhead, you can apply LoRA to the `qkv` projection. This requires splitting the `qkv` projection so that you can separately target `q` and `v` with LoRA.
40
+
41
+ 1. Create a custom attention class, `SamVisionAttentionSplit`, by subclassing the original `SamVisionAttention` class. In the `__init__`, delete the combined `qkv` and create a separate linear layer for `q`, `k` and `v`.
42
+
43
+ ```py
44
+ import torch
45
+ import torch.nn as nn
46
+ from transformers.models.sam.modeling_sam import SamVisionAttention
47
+
48
+ class SamVisionAttentionSplit(SamVisionAttention, nn.Module):
49
+ def __init__(self, config, window_size):
50
+ super().__init__(config, window_size)
51
+ # remove combined qkv
52
+ del self.qkv
53
+ # separate q, k, v projections
54
+ self.q = nn.Linear(config.hidden_size, config.hidden_size, bias=config.qkv_bias)
55
+ self.k = nn.Linear(config.hidden_size, config.hidden_size, bias=config.qkv_bias)
56
+ self.v = nn.Linear(config.hidden_size, config.hidden_size, bias=config.qkv_bias)
57
+ self._register_load_state_dict_pre_hook(self.split_q_k_v_load_hook)
58
+ ```
59
+
60
+ 2. The `split_q_k_v_load_hook` function splits the pretrained `qkv` weights into separate `q`, `k`, and `v` weights when the model is loaded to ensure compatibility with any pretrained model.
61
+
62
+ ```py
63
+ def split_q_k_v_load_hook(self, state_dict, prefix, *args):
64
+ keys_to_delete = []
65
+ for key in list(state_dict.keys()):
66
+ if "qkv." in key:
67
+ # split q, k, v from the combined projection
68
+ q, k, v = state_dict[key].chunk(3, dim=0)
69
+ # replace with individual q, k, v projections
70
+ state_dict[key.replace("qkv.", "q.")] = q
71
+ state_dict[key.replace("qkv.", "k.")] = k
72
+ state_dict[key.replace("qkv.", "v.")] = v
73
+ # mark the old qkv key for deletion
74
+ keys_to_delete.append(key)
75
+
76
+ # remove old qkv keys
77
+ for key in keys_to_delete:
78
+ del state_dict[key]
79
+ ```
80
+
81
+ 3. In the `forward` pass, `q`, `k`, and `v` are computed separately while the rest of the attention mechanism remains the same.
82
+
83
+ ```py
84
+ def forward(self, hidden_states: torch.Tensor, output_attentions=False) -> torch.Tensor:
85
+ batch_size, height, width, _ = hidden_states.shape
86
+ qkv_shapes = (batch_size * self.num_attention_heads, height * width, -1)
87
+ query = self.q(hidden_states).reshape((batch_size, height * width,self.num_attention_heads, -1)).permute(0,2,1,3).reshape(qkv_shapes)
88
+ key = self.k(hidden_states).reshape((batch_size, height * width,self.num_attention_heads, -1)).permute(0,2,1,3).reshape(qkv_shapes)
89
+ value = self.v(hidden_states).reshape((batch_size, height * width,self.num_attention_heads, -1)).permute(0,2,1,3).reshape(qkv_shapes)
90
+
91
+ attn_weights = (query * self.scale) @ key.transpose(-2, -1)
92
+
93
+ if self.use_rel_pos:
94
+ attn_weights = self.add_decomposed_rel_pos(
95
+ attn_weights, query, self.rel_pos_h, self.rel_pos_w, (height, width), (height, width)
96
+ )
97
+
98
+ attn_weights = torch.nn.functional.softmax(attn_weights, dtype=torch.float32, dim=-1).to(query.dtype)
99
+ attn_probs = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)
100
+ attn_output = (attn_probs @ value).reshape(batch_size, self.num_attention_heads, height, width, -1)
101
+ attn_output = attn_output.permute(0, 2, 3, 1, 4).reshape(batch_size, height, width, -1)
102
+ attn_output = self.proj(attn_output)
103
+
104
+ if output_attentions:
105
+ outputs = (attn_output, attn_weights)
106
+ else:
107
+ outputs = (attn_output, None)
108
+ return outputs
109
+ ```
110
+
111
+ Assign the custom `SamVisionAttentionSplit` class to the original model's `SamVisionAttention` module to replace it. All instances of `SamVisionAttention` in the model are replaced with the split attention version.
112
+
113
+ Load the model with [`~PreTrainedModel.from_pretrained`].
114
+
115
+ ```py
116
+ from transformers import SamModel
117
+ from transformers.models.sam import modeling_sam
118
+
119
+ # replace the attention class in the modeling_sam module
120
+ modeling_sam.SamVisionAttention = SamVisionAttentionSplit
121
+
122
+ # load the pretrained SAM model
123
+ model = SamModel.from_pretrained("facebook/sam-vit-base")
124
+ ```
125
+
126
+ ## LoRA
127
+
128
+ With separate `q`, `k`, and `v` projections, apply LoRA to `q` and `v`.
129
+
130
+ Create a [LoraConfig](https://huggingface.co/docs/peft/package_reference/config#peft.PeftConfig) and specify the rank `r`, `lora_alpha`, `lora_dropout`, `task_type`, and most importantly, the modules to target.
131
+
132
+ ```py
133
+ from peft import LoraConfig, get_peft_model
134
+
135
+ config = LoraConfig(
136
+ r=16,
137
+ lora_alpha=32,
138
+ # apply LoRA to q and v
139
+ target_modules=["q", "v"],
140
+ lora_dropout=0.1,
141
+ task_type="mask-generation"
142
+ )
143
+ ```
144
+
145
+ Pass the model and [LoraConfig](https://huggingface.co/docs/peft/package_reference/config#peft.PeftConfig) to [get_peft_model](https://huggingface.co/docs/peft/package_reference/peft_model#peft.get_peft_model) to apply LoRA to the model.
146
+
147
+ ```py
148
+ model = get_peft_model(model, config)
149
+ ```
150
+
151
+ Call [print_trainable_parameters](https://huggingface.co/docs/peft/package_reference/peft_model#peft.PeftMixedModel.print_trainable_parameters) to view the number of trainable parameters versus the total number of parameters.
152
+
153
+ ```py
154
+ model.print_trainable_parameters()
155
+ "trainable params: 608,256 || all params: 94,343,728 || trainable%: 0.6447"
156
+ ```
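+
+ From here, the adapted model can be fine-tuned as usual. As a minimal sketch (the directory name is only an example, and it assumes the `SamVisionAttentionSplit` patch from above is still in place), the LoRA adapter weights can be saved separately and reattached to the base model with PEFT:
+
+ ```py
+ from peft import PeftModel
+
+ # save only the small LoRA adapter weights rather than the full model
+ model.save_pretrained("./sam-lora-adapter")
+
+ # later, reload the patched base model and attach the saved adapter
+ base_model = SamModel.from_pretrained("facebook/sam-vit-base")
+ model = PeftModel.from_pretrained(base_model, "./sam-lora-adapter")
+ ```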
docs/transformers/docs/source/en/hpo_train.md ADDED
@@ -0,0 +1,167 @@
1
+ <!--Copyright 2022 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
12
+ rendered properly in your Markdown viewer.
13
+
14
+ -->
15
+
16
+ # Hyperparameter search
17
+
18
+ Hyperparameter search discovers an optimal set of hyperparameters that produces the best model performance. [`Trainer`] supports several hyperparameter search backends - [Optuna](https://optuna.readthedocs.io/en/stable/index.html), [SigOpt](https://docs.sigopt.com/), [Weights & Biases](https://docs.wandb.ai/), [Ray Tune](https://docs.ray.io/en/latest/tune/index.html) - through [`~Trainer.hyperparameter_search`] to optimize an objective or even multiple objectives.
19
+
20
+ This guide will go over how to set up a hyperparameter search for each of the backends. Install the backend you plan to use first.
21
+
22
+ ```bash
23
+ pip install optuna  # or sigopt, wandb, ray[tune]
24
+ ```
25
+
26
+ To use [`~Trainer.hyperparameter_search`], you need to create a `model_init` function. This function includes basic model information (arguments and configuration) because the model needs to be reinitialized for each search trial in the run.
27
+
28
+ > [!WARNING]
29
+ > The `model_init` function is incompatible with the [optimizers](./main_classes/trainer#transformers.Trainer.optimizers) parameter. Subclass [`Trainer`] and override the [`~Trainer.create_optimizer_and_scheduler`] method to create a custom optimizer and scheduler.
30
+
31
+ An example `model_init` function is shown below.
32
+
33
+ ```py
34
+ def model_init(trial):
35
+ return AutoModelForSequenceClassification.from_pretrained(
36
+ model_args.model_name_or_path,
37
+ from_tf=bool(".ckpt" in model_args.model_name_or_path),
38
+ config=config,
39
+ cache_dir=model_args.cache_dir,
40
+ revision=model_args.model_revision,
41
+ token=True if model_args.use_auth_token else None,
42
+ )
43
+ ```
44
+
45
+ Pass `model_init` to [`Trainer`] along with everything else you need for training. Then you can call [`~Trainer.hyperparameter_search`] to start the search.
46
+
47
+ [`~Trainer.hyperparameter_search`] accepts a [direction](./main_classes/trainer#transformers.Trainer.hyperparameter_search.direction) parameter to specify whether to minimize, maximize, or minimize and maximize multiple objectives. You'll also need to set the [backend](./main_classes/trainer#transformers.Trainer.hyperparameter_search.backend) you're using, an [object](./main_classes/trainer#transformers.Trainer.hyperparameter_search.hp_space) containing the hyperparameters to optimize for, the [number of trials](./main_classes/trainer#transformers.Trainer.hyperparameter_search.n_trials) to run, and a [compute_objective](./main_classes/trainer#transformers.Trainer.hyperparameter_search.compute_objective) to return the objective values.
48
+
49
+ > [!TIP]
50
+ > If [compute_objective](./main_classes/trainer#transformers.Trainer.hyperparameter_search.compute_objective) isn't defined, the default [compute_objective](./main_classes/trainer#transformers.Trainer.hyperparameter_search.compute_objective) is called, which returns the sum of the evaluation metrics (such as F1).
51
+
52
+ ```py
53
+ from transformers import Trainer
54
+
55
+ trainer = Trainer(
56
+ model=None,
57
+ args=training_args,
58
+ train_dataset=small_train_dataset,
59
+ eval_dataset=small_eval_dataset,
60
+ compute_metrics=compute_metrics,
61
+ processing_class=tokenizer,
62
+ model_init=model_init,
63
+ data_collator=data_collator,
64
+ )
65
+ trainer.hyperparameter_search(...)
66
+ ```
67
+
68
+ The following examples demonstrate how to perform a hyperparameter search for the learning rate and training batch size using the different backends.
69
+
70
+ <hfoptions id="backends">
71
+ <hfoption id="Optuna">
72
+
73
+ [Optuna](https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/002_configurations.html#sphx-glr-tutorial-10-key-features-002-configurations-py) optimizes categories, integers, and floats.
74
+
75
+ ```py
76
+ def optuna_hp_space(trial):
77
+ return {
78
+ "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
79
+ "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [16, 32, 64, 128]),
80
+ }
81
+
82
+ best_trials = trainer.hyperparameter_search(
83
+ direction=["minimize", "maximize"],
84
+ backend="optuna",
85
+ hp_space=optuna_hp_space,
86
+ n_trials=20,
87
+ compute_objective=compute_objective,
88
+ )
89
+ ```
90
+
91
+ </hfoption>
92
+ <hfoption id="Ray Tune">
93
+
94
+ [Ray Tune](https://docs.ray.io/en/latest/tune/api/search_space.html) optimizes floats, integers, and categorical parameters. It also offers multiple sampling distributions for each parameter such as uniform and log-uniform.
95
+
96
+ ```py
97
+ def ray_hp_space(trial):
98
+ return {
99
+ "learning_rate": tune.loguniform(1e-6, 1e-4),
100
+ "per_device_train_batch_size": tune.choice([16, 32, 64, 128]),
101
+ }
102
+
103
+ best_trials = trainer.hyperparameter_search(
104
+ direction=["minimize", "maximize"],
105
+ backend="ray",
106
+ hp_space=ray_hp_space,
107
+ n_trials=20,
108
+ compute_objective=compute_objective,
109
+ )
110
+ ```
111
+
112
+ </hfoption>
113
+ <hfoption id="SigOpt">
114
+
115
+ [SigOpt](https://docs.sigopt.com/ai-module-api-references/api_reference/objects/object_parameter) optimizes double, integer, and categorical parameters.
116
+
117
+ ```py
118
+ def sigopt_hp_space(trial):
119
+ return [
120
+ {"bounds": {"min": 1e-6, "max": 1e-4}, "name": "learning_rate", "type": "double"},
121
+ {
122
+ "categorical_values": ["16", "32", "64", "128"],
123
+ "name": "per_device_train_batch_size",
124
+ "type": "categorical",
125
+ },
126
+ ]
127
+
128
+ best_trials = trainer.hyperparameter_search(
129
+ direction=["minimize", "maximize"],
130
+ backend="sigopt",
131
+ hp_space=sigopt_hp_space,
132
+ n_trials=20,
133
+ compute_objective=compute_objective,
134
+ )
135
+ ```
136
+
137
+ </hfoption>
138
+ <hfoption id="Weights & Biases">
139
+
140
+ [Weights & Biases](https://docs.wandb.ai/guides/sweeps/sweep-config-keys) also optimizes integers, floats, and categorical parameters. It also includes support for different search strategies and distribution options.
141
+
142
+ ```py
143
+ def wandb_hp_space(trial):
144
+ return {
145
+ "method": "random",
146
+ "metric": {"name": "objective", "goal": "minimize"},
147
+ "parameters": {
148
+ "learning_rate": {"distribution": "uniform", "min": 1e-6, "max": 1e-4},
149
+ "per_device_train_batch_size": {"values": [16, 32, 64, 128]},
150
+ },
151
+ }
152
+
153
+ best_trials = trainer.hyperparameter_search(
154
+ direction=["minimize", "maximize"],
155
+ backend="wandb",
156
+ hp_space=wandb_hp_space,
157
+ n_trials=20,
158
+ compute_objective=compute_objective,
159
+ )
160
+ ```
161
+
162
+ </hfoption>
163
+ </hfoptions>
164
+
165
+ ## Distributed Data Parallel
166
+
167
+ [`Trainer`] only supports hyperparameter search for distributed data parallel (DDP) on the Optuna and SigOpt backends. Only the rank-zero process is used to generate the search trial, and the resulting parameters are passed along to the other ranks.
docs/transformers/docs/source/en/image_processors.md ADDED
@@ -0,0 +1,222 @@
1
+ <!--Copyright 2024 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Image processors
18
+
19
+ Image processors convert images into pixel values, the tensors that represent image colors and size. The pixel values are inputs to a vision or video model. To ensure a pretrained model receives the correct input, an image processor can perform the following operations to make sure an image is exactly like the images a model was pretrained on.
20
+
21
+ - [`~BaseImageProcessor.center_crop`] to crop and resize an image
22
+ - [`~BaseImageProcessor.normalize`] or [`~BaseImageProcessor.rescale`] pixel values
23
+
24
+ Use [`~ImageProcessingMixin.from_pretrained`] to load an image processor's configuration (image size, whether to normalize and rescale, etc.) from a vision model on the Hugging Face [Hub](https://hf.co) or a local directory. The configuration for each pretrained model is saved in a [preprocessor_config.json](https://huggingface.co/google/vit-base-patch16-224/blob/main/preprocessor_config.json) file.
25
+
26
+ ```py
27
+ from transformers import AutoImageProcessor
28
+
29
+ image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
30
+ ```
31
+
32
+ Pass an image to the image processor to transform it into pixel values, and set `return_tensors="pt"` to return PyTorch tensors. Feel free to print out the inputs to see what the image looks like as a tensor.
33
+
34
+ ```py
35
+ from PIL import Image
36
+ import requests
37
+
38
+ url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/image_processor_example.png"
39
+ image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
40
+ inputs = image_processor(image, return_tensors="pt")
41
+ ```
42
+
43
+ This guide covers the image processor class and how to preprocess images for vision models.
44
+
45
+ ## Image processor classes
46
+
47
+ Image processors inherit from the [`BaseImageProcessor`] class which provides the [`~BaseImageProcessor.center_crop`], [`~BaseImageProcessor.normalize`], and [`~BaseImageProcessor.rescale`] functions. There are two types of image processors.
48
+
49
+ - [`BaseImageProcessor`] is a Python implementation.
50
+ - [`BaseImageProcessorFast`] is a faster [torchvision-backed](https://pytorch.org/vision/stable/index.html) version. For a batch of [torch.Tensor](https://pytorch.org/docs/stable/tensors.html) inputs, this can be up to 33x faster. [`BaseImageProcessorFast`] is not available for all vision models at the moment. Refer to a model's API documentation to check if it is supported.
51
+
52
+ Each image processor subclasses the [`ImageProcessingMixin`] class which provides the [`~ImageProcessingMixin.from_pretrained`] and [`~ImageProcessingMixin.save_pretrained`] methods for loading and saving image processors.
53
+
54
+ There are two ways you can load an image processor, with [`AutoImageProcessor`] or a model-specific image processor.
55
+
56
+ <hfoptions id="image-processor-classes">
57
+ <hfoption id="AutoImageProcessor">
58
+
59
+ The [AutoClass](./model_doc/auto) API provides a convenient method to load an image processor without directly specifying the model the image processor is associated with.
60
+
61
+ Use [`~AutoImageProcessor.from_pretrained`] to load an image processor, and set `use_fast=True` to load a fast image processor if it's supported.
62
+
63
+ ```py
64
+ from transformers import AutoImageProcessor
65
+
66
+ image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224", use_fast=True)
67
+ ```
68
+
69
+ </hfoption>
70
+ <hfoption id="model-specific image processor">
71
+
72
+ Each image processor is associated with a specific pretrained vision model, and the image processor's configuration contains the model's expected image size and whether to normalize and resize images.
73
+
74
+ The image processor can be loaded directly from the model-specific class. Check a model's API documentation to see whether it supports a fast image processor.
75
+
76
+ ```py
77
+ from transformers import ViTImageProcessor
78
+
79
+ image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
80
+ ```
81
+
82
+ To load a fast image processor, use the fast implementation class.
83
+
84
+ ```py
85
+ from transformers import ViTImageProcessorFast
86
+
87
+ image_processor = ViTImageProcessorFast.from_pretrained("google/vit-base-patch16-224")
88
+ ```
89
+
90
+ </hfoption>
91
+ </hfoptions>
92
+
93
+ ## Fast image processors
94
+
95
+ [`BaseImageProcessorFast`] is based on [torchvision](https://pytorch.org/vision/stable/index.html) and is significantly faster, especially when processing on a GPU. This class can be used as a drop-in replacement for [`BaseImageProcessor`] if it's available for a model because it has the same design. Make sure [torchvision](https://pytorch.org/get-started/locally/#mac-installation) is installed, and set the `use_fast` parameter to `True`.
96
+
97
+ ```py
98
+ from transformers import AutoImageProcessor
99
+
100
+ processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)
101
+ ```
102
+
103
+ Control which device processing is performed on with the `device` parameter. Processing is performed on the same device as the input by default if the inputs are tensors, otherwise they are processed on the CPU. The example below places the fast processor on a GPU.
104
+
105
+ ```py
106
+ from torchvision.io import read_image
107
+ from transformers import DetrImageProcessorFast
108
+
109
+ images = read_image("image.jpg")
110
+ processor = DetrImageProcessorFast.from_pretrained("facebook/detr-resnet-50")
111
+ images_processed = processor(images, return_tensors="pt", device="cuda")
112
+ ```
113
+
114
+ <details>
115
+ <summary>Benchmarks</summary>
116
+
117
+ The benchmarks are obtained from an [AWS EC2 g5.2xlarge](https://aws.amazon.com/ec2/instance-types/g5/) instance with an NVIDIA A10G Tensor Core GPU.
118
+
119
+ <div class="flex">
120
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/benchmark_results_full_pipeline_detr_fast_padded.png" />
121
+ </div>
122
+ <div class="flex">
123
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/benchmark_results_full_pipeline_detr_fast_batched_compiled.png" />
124
+ </div>
125
+ <div class="flex">
126
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/benchmark_results_full_pipeline_rt_detr_fast_single.png" />
127
+ </div>
128
+ <div class="flex">
129
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/benchmark_results_full_pipeline_rt_detr_fast_batched.png" />
130
+ </div>
131
+ </details>
132
+
133
+ ## Preprocess
134
+
135
+ Transformers' vision models expect the input as PyTorch tensors of pixel values. An image processor handles the conversion of images to pixel values, which are represented by the batch size, number of channels, height, and width. To achieve this, an image is resized (center cropped) and its pixel values are normalized and rescaled to the model's expected values.
136
+
137
+ Image preprocessing is not the same as *image augmentation*. Image augmentation makes changes (brightness, colors, rotation, etc.) to an image for the purpose of either creating new training examples or preventing overfitting. Image preprocessing makes changes to an image for the purpose of matching a pretrained model's expected input format.
138
+
139
+ Typically, images are augmented (to increase performance) and then preprocessed before being passed to a model. You can use any library ([Albumentations](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb), [Kornia](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb)) for augmentation and an image processor for preprocessing.
140
+
141
+ This guide uses the torchvision [transforms](https://pytorch.org/vision/stable/transforms.html) module for augmentation.
142
+
143
+ Start by loading a small sample of the [food101](https://hf.co/datasets/food101) dataset.
144
+
145
+ ```py
146
+ from datasets import load_dataset
147
+
148
+ dataset = load_dataset("food101", split="train[:100]")
149
+ ```
150
+
151
+ From the [transforms](https://pytorch.org/vision/stable/transforms.html) module, use the [Compose](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) API to chain together [RandomResizedCrop](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [ColorJitter](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html). These transforms randomly crop and resize an image, and randomly adjust an image's colors.
152
+
153
+ The image size to randomly crop to can be retrieved from the image processor. For some models, an exact height and width are expected while for others, only the `shortest_edge` is required.
154
+
155
+ ```py
156
+ from torchvision.transforms import RandomResizedCrop, ColorJitter, Compose
157
+
158
+ size = (
159
+ image_processor.size["shortest_edge"]
160
+ if "shortest_edge" in image_processor.size
161
+ else (image_processor.size["height"], image_processor.size["width"])
162
+ )
163
+ _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5)])
164
+ ```
165
+
166
+ Apply the transforms to the images and convert them to the RGB format. Then pass the augmented images to the image processor to return the pixel values.
167
+
168
+ The `do_resize` parameter is set to `False` because the images have already been resized in the augmentation step by [RandomResizedCrop](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html). If you don't augment the images, then the image processor automatically resizes and normalizes the images with the `image_mean` and `image_std` values. These values are found in the preprocessor configuration file.
169
+
170
+ ```py
171
+ def transforms(examples):
172
+ images = [_transforms(img.convert("RGB")) for img in examples["image"]]
173
+ examples["pixel_values"] = image_processor(images, do_resize=False, return_tensors="pt")["pixel_values"]
174
+ return examples
175
+ ```
176
+
177
+ Apply the combined augmentation and preprocessing function to the entire dataset on the fly with [`~datasets.Dataset.set_transform`].
178
+
179
+ ```py
180
+ dataset.set_transform(transforms)
181
+ ```
182
+
183
+ Convert the pixel values back into an image to see how the image has been augmented and preprocessed.
184
+
185
+ ```py
186
+ import numpy as np
187
+ import matplotlib.pyplot as plt
188
+
189
+ img = dataset[0]["pixel_values"]
190
+ plt.imshow(img.permute(1, 2, 0))
191
+ ```
192
+
193
+ <div class="flex gap-4">
194
+ <div>
195
+ <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vision-preprocess-tutorial.png" />
196
+ <figcaption class="mt-2 text-center text-sm text-gray-500">before</figcaption>
197
+ </div>
198
+ <div>
199
+ <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/preprocessed_image.png" />
200
+ <figcaption class="mt-2 text-center text-sm text-gray-500">after</figcaption>
201
+ </div>
202
+ </div>
203
+
204
+ For other vision tasks like object detection or segmentation, the image processor includes post-processing methods to convert a model's raw output into meaningful predictions like bounding boxes or segmentation maps.
205
+
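+ As a minimal sketch (the checkpoint, image URL, and threshold are only examples), object detection post-processing typically looks like this:
+
+ ```py
+ import torch
+ import requests
+ from PIL import Image
+ from transformers import AutoImageProcessor, AutoModelForObjectDetection
+
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+
+ processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50")
+ model = AutoModelForObjectDetection.from_pretrained("facebook/detr-resnet-50")
+
+ inputs = processor(image, return_tensors="pt")
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # convert raw logits and predicted boxes into labeled boxes at the original image size
+ target_sizes = torch.tensor([image.size[::-1]])
+ results = processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[0]
+ for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
+     print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
+ ```
+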
206
+ ### Padding
207
+
208
+ Some models, like [DETR](./model_doc/detr), apply [scale augmentation](https://paperswithcode.com/method/image-scale-augmentation) during training, which can cause images in a batch to have different sizes. Images with different sizes can't be batched together.
209
+
210
+ To fix this, pad the images with the special padding value `0`. Use the [pad](https://github.com/huggingface/transformers/blob/9578c2597e2d88b6f0b304b5a05864fd613ddcc1/src/transformers/models/detr/image_processing_detr.py#L1151) method to pad the images, and define a custom collate function to batch them together.
211
+
212
+ ```py
213
+ def collate_fn(batch):
214
+ pixel_values = [item["pixel_values"] for item in batch]
215
+ encoding = image_processor.pad(pixel_values, return_tensors="pt")
216
+ labels = [item["labels"] for item in batch]
217
+ batch = {}
218
+ batch["pixel_values"] = encoding["pixel_values"]
219
+ batch["pixel_mask"] = encoding["pixel_mask"]
220
+ batch["labels"] = labels
221
+ return batch
222
+ ```
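+
+ As a usage sketch (assuming the dataset yields dicts with `pixel_values` and `labels` keys as above), the collate function can then be passed to a PyTorch `DataLoader`.
+
+ ```py
+ from torch.utils.data import DataLoader
+
+ dataloader = DataLoader(dataset, batch_size=4, collate_fn=collate_fn)
+ batch = next(iter(dataloader))
+ # the padded pixel values share one shape, and pixel_mask marks the real (non-padded) pixels
+ print(batch["pixel_values"].shape, batch["pixel_mask"].shape)
+ ```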
docs/transformers/docs/source/en/index.md ADDED
@@ -0,0 +1,45 @@
1
+ <!--Copyright 2024 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+ -->
15
+
16
+ # Transformers
17
+
18
+ Transformers is a library of pretrained natural language processing, computer vision, audio, and multimodal models for inference and training. Use Transformers to train models on your data, build inference applications, and generate text with large language models.
19
+
20
+ Explore the [Hugging Face Hub](https://huggingface.co) today to find a model and use Transformers to help you get started right away.
21
+
22
+ ## Features
23
+
24
+ Transformers provides everything you need for inference or training with state-of-the-art pretrained models. Some of the main features include:
25
+
26
+ - [Pipeline](./pipeline_tutorial): Simple and optimized inference class for many machine learning tasks like text generation, image segmentation, automatic speech recognition, document question answering, and more.
27
+ - [Trainer](./trainer): A comprehensive trainer for PyTorch models that supports features such as mixed precision, torch.compile, FlashAttention, and distributed training.
28
+ - [generate](./llm_tutorial): Fast text generation with large language models (LLMs) and vision language models (VLMs), including support for streaming and multiple decoding strategies.
29
+
30
+ ## Design
31
+
32
+ > [!TIP]
33
+ > Read our [Philosophy](./philosophy) to learn more about Transformers' design principles.
34
+
35
+ Transformers is designed for developers, machine learning engineers, and researchers. Its main design principles are:
36
+
37
+ 1. Fast and easy to use: Every model is implemented from only three main classes (configuration, model, and preprocessor) and can be quickly used for inference or training with [`Pipeline`] or [`Trainer`].
38
+ 2. Pretrained models: Reduce your carbon footprint, compute cost and time by using a pretrained model instead of training an entirely new one. Each pretrained model is reproduced as closely as possible to the original model and offers state-of-the-art performance.
39
+
40
+ <div class="flex justify-center">
41
+ <a target="_blank" href="https://huggingface.co/support">
42
+ <img alt="HuggingFace Expert Acceleration Program" src="https://hf.co/datasets/huggingface/documentation-images/resolve/81d7d9201fd4ceb537fc4cebc22c29c37a2ed216/transformers/transformers-index.png" style="width: 100%; max-width: 600px; border: 1px solid #eee; border-radius: 4px; box-shadow: 0 1px 2px 0 rgba(0, 0, 0, 0.05);">
43
+ </a>
44
+ </div>
45
+
docs/transformers/docs/source/en/installation.md ADDED
@@ -0,0 +1,223 @@
1
+ <!---
2
+ Copyright 2024 The HuggingFace Team. All rights reserved.
3
+
4
+ Licensed under the Apache License, Version 2.0 (the "License");
5
+ you may not use this file except in compliance with the License.
6
+ You may obtain a copy of the License at
7
+
8
+ http://www.apache.org/licenses/LICENSE-2.0
9
+
10
+ Unless required by applicable law or agreed to in writing, software
11
+ distributed under the License is distributed on an "AS IS" BASIS,
12
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ See the License for the specific language governing permissions and
14
+ limitations under the License.
15
+
16
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
17
+ rendered properly in your Markdown viewer.
18
+
19
+ -->
20
+
21
+ # Installation
22
+
23
+ Transformers works with [PyTorch](https://pytorch.org/get-started/locally/), [TensorFlow 2.0](https://www.tensorflow.org/install/pip), and [Flax](https://flax.readthedocs.io/en/latest/). It has been tested on Python 3.9+, PyTorch 2.1+, TensorFlow 2.6+, and Flax 0.4.1+.
24
+
25
+ ## Virtual environment
26
+
27
+ A virtual environment helps manage different projects and avoids compatibility issues between dependencies. Take a look at the [Install packages in a virtual environment using pip and venv](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/) guide if you're unfamiliar with Python virtual environments.
28
+
29
+ <hfoptions id="virtual">
30
+ <hfoption id="venv">
31
+
32
+ Create and activate a virtual environment in your project directory with [venv](https://docs.python.org/3/library/venv.html).
33
+
34
+ ```bash
35
+ python -m venv .env
36
+ source .env/bin/activate
37
+ ```
38
+
39
+ </hfoption>
40
+ <hfoption id="uv">
41
+
42
+ [uv](https://docs.astral.sh/uv/) is a fast Rust-based Python package and project manager.
43
+
44
+ ```bash
45
+ uv venv .env
46
+ source .env/bin/activate
47
+ ```
48
+
49
+ </hfoption>
50
+ </hfoptions>
51
+
52
+ ## Python
53
+
54
+ You can install Transformers with pip or uv.
55
+
56
+ <hfoptions id="install">
57
+ <hfoption id="pip">
58
+
59
+ [pip](https://pip.pypa.io/en/stable/) is a package installer for Python. Install Transformers with pip in your newly created virtual environment.
60
+
61
+ ```bash
62
+ pip install transformers
63
+ ```
64
+
65
+ </hfoption>
66
+ <hfoption id="uv">
67
+
68
+ [uv](https://docs.astral.sh/uv/) is a fast Rust-based Python package and project manager.
69
+
70
+ ```bash
71
+ uv pip install transformers
72
+ ```
73
+
74
+ </hfoption>
75
+ </hfoptions>
76
+
77
+ For GPU acceleration, install the appropriate CUDA drivers for [PyTorch](https://pytorch.org/get-started/locally) and [TensorFlow](https://www.tensorflow.org/install/pip).
78
+
79
+ Run the command below to check if your system detects an NVIDIA GPU.
80
+
81
+ ```bash
82
+ nvidia-smi
83
+ ```
84
+
85
+ To install a CPU-only version of Transformers and a machine learning framework, run the following command.
86
+
87
+ <hfoptions id="cpu-only">
88
+ <hfoption id="PyTorch">
89
+
90
+ ```bash
91
+ pip install 'transformers[torch]'
92
+ uv pip install 'transformers[torch]'
93
+ ```
94
+
95
+ </hfoption>
96
+ <hfoption id="TensorFlow">
97
+
98
+ For Apple M1 hardware, you need to install CMake and pkg-config first.
99
+
100
+ ```bash
101
+ brew install cmake
102
+ brew install pkg-config
103
+ ```
104
+
105
+ Install TensorFlow 2.0.
106
+
107
+ ```bash
108
+ pip install 'transformers[tf-cpu]'
109
+ uv pip install 'transformers[tf-cpu]'
110
+ ```
111
+
112
+ </hfoption>
113
+ <hfoption id="Flax">
114
+
115
+ ```bash
116
+ pip install 'transformers[flax]'
117
+ uv pip install 'transformers[flax]'
118
+ ```
119
+
120
+ </hfoption>
121
+ </hfoptions>
122
+
123
+ Test whether the install was successful with the following command. It should return a label and score for the provided text.
124
+
125
+ ```bash
126
+ python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('hugging face is the best'))"
127
+ [{'label': 'POSITIVE', 'score': 0.9998704791069031}]
128
+ ```
129
+
130
+ ### Source install
131
+
132
+ Installing from source installs the *latest* version rather than the *stable* version of the library. It ensures you have the most up-to-date changes in Transformers and it's useful for experimenting with the latest features or for picking up a bug fix that hasn't been officially released in the stable version yet.
133
+
134
+ The downside is that the latest version may not always be stable. If you encounter any problems, please open a [GitHub Issue](https://github.com/huggingface/transformers/issues) so we can fix it as soon as possible.
135
+
136
+ Install from source with the following command.
137
+
138
+ ```bash
139
+ pip install git+https://github.com/huggingface/transformers
140
+ ```
141
+
142
+ Check if the install was successful with the command below. It should return a label and score for the provided text.
143
+
144
+ ```bash
145
+ python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('hugging face is the best'))"
146
+ [{'label': 'POSITIVE', 'score': 0.9998704791069031}]
147
+ ```
148
+
149
+ ### Editable install
150
+
151
+ An [editable install](https://pip.pypa.io/en/stable/topics/local-project-installs/#editable-installs) is useful if you're developing locally with Transformers. It links your local copy of Transformers to the Transformers [repository](https://github.com/huggingface/transformers) instead of copying the files. The files are added to Python's import path.
152
+
153
+ ```bash
154
+ git clone https://github.com/huggingface/transformers.git
155
+ cd transformers
156
+ pip install -e .
157
+ ```
158
+
159
+ > [!WARNING]
160
+ > You must keep the local Transformers folder to keep using it.
161
+
162
+ Update your local version of Transformers with the latest changes in the main repository with the following command.
163
+
164
+ ```bash
165
+ cd ~/transformers/
166
+ git pull
167
+ ```
168
+
169
+ ## conda
170
+
171
+ [conda](https://docs.conda.io/projects/conda/en/stable/#) is a language-agnostic package manager. Install Transformers from the [conda-forge](https://anaconda.org/conda-forge/transformers) channel in your newly created virtual environment.
172
+
173
+ ```bash
174
+ conda install conda-forge::transformers
175
+ ```
176
+
177
+ ## Set up
178
+
179
+ After installation, you can configure the Transformers cache location or set up the library for offline usage.
180
+
181
+ ### Cache directory
182
+
183
+ When you load a pretrained model with [`~PreTrainedModel.from_pretrained`], the model is downloaded from the Hub and locally cached.
184
+
185
+ Every time you load a model, it checks whether the cached model is up-to-date. If it's the same, then the local model is loaded. If it's not the same, the newer model is downloaded and cached.
186
+
187
+ The default directory given by the shell environment variable `TRANSFORMERS_CACHE` is `~/.cache/huggingface/hub`. On Windows, the default directory is `C:\Users\username\.cache\huggingface\hub`.
188
+
189
+ Cache a model in a different directory by changing the path in the following shell environment variables (listed by priority).
190
+
191
+ 1. [HF_HUB_CACHE](https://hf.co/docs/huggingface_hub/package_reference/environment_variables#hfhubcache) or `TRANSFORMERS_CACHE` (default)
192
+ 2. [HF_HOME](https://hf.co/docs/huggingface_hub/package_reference/environment_variables#hfhome)
193
+ 3. [XDG_CACHE_HOME](https://hf.co/docs/huggingface_hub/package_reference/environment_variables#xdgcachehome) + `/huggingface` (only if `HF_HOME` is not set)
194
+
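+ For example, a minimal sketch of redirecting the cache from Python (the path is only a placeholder) is to set `HF_HOME` before importing Transformers.
+
+ ```py
+ import os
+
+ os.environ["HF_HOME"] = "/path/to/your/cache"  # placeholder path
+
+ from transformers import AutoModel
+
+ # downloads and caches the model under the custom HF_HOME location
+ model = AutoModel.from_pretrained("google-bert/bert-base-uncased")
+ ```
+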
195
+ Older versions of Transformers use the shell environment variables `PYTORCH_TRANSFORMERS_CACHE` or `PYTORCH_PRETRAINED_BERT_CACHE`. You should keep these unless you specify the newer shell environment variable `TRANSFORMERS_CACHE`.
196
+
197
+ ### Offline mode
198
+
199
+ Using Transformers in an offline or firewalled environment requires the files to be downloaded and cached ahead of time. Download a model repository from the Hub with the [`~huggingface_hub.snapshot_download`] method.
200
+
201
+ > [!TIP]
202
+ > Refer to the [Download files from the Hub](https://hf.co/docs/huggingface_hub/guides/download) guide for more options for downloading files from the Hub. You can download files from specific revisions, download from the CLI, and even filter which files to download from a repository.
203
+
204
+ ```py
205
+ from huggingface_hub import snapshot_download
206
+
207
+ snapshot_download(repo_id="meta-llama/Llama-2-7b-hf", repo_type="model")
208
+ ```
209
+
210
+ Set the environment variable `HF_HUB_OFFLINE=1` to prevent HTTP calls to the Hub when loading a model.
211
+
212
+ ```bash
213
+ HF_HUB_OFFLINE=1 \
214
+ python examples/pytorch/language-modeling/run_clm.py --model_name_or_path meta-llama/Llama-2-7b-hf --dataset_name wikitext ...
215
+ ```
216
+
217
+ Another option for only loading cached files is to set `local_files_only=True` in [`~PreTrainedModel.from_pretrained`].
218
+
219
+ ```py
220
+ from transformers import LlamaForCausalLM
221
+
222
+ model = LlamaForCausalLM.from_pretrained("./path/to/local/directory", local_files_only=True)
223
+ ```
docs/transformers/docs/source/en/internal/audio_utils.md ADDED
@@ -0,0 +1,39 @@
1
+ <!--Copyright 2023 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Utilities for `FeatureExtractors`
18
+
19
+ This page lists all the utility functions that can be used by the audio [`FeatureExtractor`] in order to compute special features from raw audio using common algorithms such as the *Short Time Fourier Transform* or a *log mel spectrogram*.
20
+
21
+ Most of those are only useful if you are studying the code of the audio processors in the library.
22
+
23
+ ## Audio Transformations
24
+
25
+ [[autodoc]] audio_utils.hertz_to_mel
26
+
27
+ [[autodoc]] audio_utils.mel_to_hertz
28
+
29
+ [[autodoc]] audio_utils.mel_filter_bank
30
+
31
+ [[autodoc]] audio_utils.optimal_fft_length
32
+
33
+ [[autodoc]] audio_utils.window_function
34
+
35
+ [[autodoc]] audio_utils.spectrogram
36
+
37
+ [[autodoc]] audio_utils.power_to_db
38
+
39
+ [[autodoc]] audio_utils.amplitude_to_db
docs/transformers/docs/source/en/internal/file_utils.md ADDED
@@ -0,0 +1,50 @@
1
+ <!--Copyright 2021 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # General Utilities
18
+
19
+ This page lists all of Transformers general utility functions that are found in the file `utils.py`.
20
+
21
+ Most of those are only useful if you are studying the general code in the library.
22
+
23
+
24
+ ## Enums and namedtuples
25
+
26
+ [[autodoc]] utils.ExplicitEnum
27
+
28
+ [[autodoc]] utils.PaddingStrategy
29
+
30
+ [[autodoc]] utils.TensorType
31
+
32
+ ## Special Decorators
33
+
34
+ [[autodoc]] utils.add_start_docstrings
35
+
36
+ [[autodoc]] utils.add_start_docstrings_to_model_forward
37
+
38
+ [[autodoc]] utils.add_end_docstrings
39
+
40
+ [[autodoc]] utils.add_code_sample_docstrings
41
+
42
+ [[autodoc]] utils.replace_return_docstrings
43
+
44
+ ## Special Properties
45
+
46
+ [[autodoc]] utils.cached_property
47
+
48
+ ## Other Utilities
49
+
50
+ [[autodoc]] utils._LazyModule
docs/transformers/docs/source/en/internal/generation_utils.md ADDED
@@ -0,0 +1,446 @@
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Utilities for Generation
18
+
19
+ This page lists all the utility functions used by [`~generation.GenerationMixin.generate`].
20
+
21
+ ## Generate Outputs
22
+
23
+ The output of [`~generation.GenerationMixin.generate`] is an instance of a subclass of
24
+ [`~utils.ModelOutput`]. This output is a data structure containing all the information returned
25
+ by [`~generation.GenerationMixin.generate`], but that can also be used as tuple or dictionary.
26
+
27
+ Here's an example:
28
+
29
+ ```python
30
+ from transformers import GPT2Tokenizer, GPT2LMHeadModel
31
+
32
+ tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
33
+ model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")
34
+
35
+ inputs = tokenizer("Hello, my dog is cute and ", return_tensors="pt")
36
+ generation_output = model.generate(**inputs, return_dict_in_generate=True, output_scores=True)
37
+ ```
38
+
39
+ The `generation_output` object is a [`~generation.GenerateDecoderOnlyOutput`]. As we can
41
+ see in the documentation of that class below, it has the following attributes:
41
+
42
+ - `sequences`: the generated sequences of tokens
43
+ - `scores` (optional): the prediction scores of the language modelling head, for each generation step
44
+ - `hidden_states` (optional): the hidden states of the model, for each generation step
45
+ - `attentions` (optional): the attention weights of the model, for each generation step
46
+
47
+ Here we have the `scores` since we passed along `output_scores=True`, but we don't have `hidden_states` and
48
+ `attentions` because we didn't pass `output_hidden_states=True` or `output_attentions=True`.
49
+
50
+ You can access each attribute as you would usually do, and if that attribute has not been returned by the model, you
51
+ will get `None`. Here for instance `generation_output.scores` are all the generated prediction scores of the
52
+ language modeling head, and `generation_output.attentions` is `None`.
53
+
54
+ When using our `generation_output` object as a tuple, it only keeps the attributes that don't have `None` values.
55
+ Here, for instance, it has two elements, `sequences` then `scores`, so
56
+
57
+ ```python
58
+ generation_output[:2]
59
+ ```
60
+
61
+ will return the tuple `(generation_output.sequences, generation_output.scores)` for instance.
62
+
63
+ When using our `generation_output` object as a dictionary, it only keeps the attributes that don't have `None`
64
+ values. Here, for instance, it has two keys that are `sequences` and `scores`.
65
+
66
+ We document here all output types.
67
+
68
+
69
+ ### PyTorch
70
+
71
+ [[autodoc]] generation.GenerateDecoderOnlyOutput
72
+
73
+ [[autodoc]] generation.GenerateEncoderDecoderOutput
74
+
75
+ [[autodoc]] generation.GenerateBeamDecoderOnlyOutput
76
+
77
+ [[autodoc]] generation.GenerateBeamEncoderDecoderOutput
78
+
79
+ ### TensorFlow
80
+
81
+ [[autodoc]] generation.TFGreedySearchEncoderDecoderOutput
82
+
83
+ [[autodoc]] generation.TFGreedySearchDecoderOnlyOutput
84
+
85
+ [[autodoc]] generation.TFSampleEncoderDecoderOutput
86
+
87
+ [[autodoc]] generation.TFSampleDecoderOnlyOutput
88
+
89
+ [[autodoc]] generation.TFBeamSearchEncoderDecoderOutput
90
+
91
+ [[autodoc]] generation.TFBeamSearchDecoderOnlyOutput
92
+
93
+ [[autodoc]] generation.TFBeamSampleEncoderDecoderOutput
94
+
95
+ [[autodoc]] generation.TFBeamSampleDecoderOnlyOutput
96
+
97
+ [[autodoc]] generation.TFContrastiveSearchEncoderDecoderOutput
98
+
99
+ [[autodoc]] generation.TFContrastiveSearchDecoderOnlyOutput
100
+
101
+ ### FLAX
102
+
103
+ [[autodoc]] generation.FlaxSampleOutput
104
+
105
+ [[autodoc]] generation.FlaxGreedySearchOutput
106
+
107
+ [[autodoc]] generation.FlaxBeamSearchOutput
108
+
109
+ ## LogitsProcessor
110
+
111
+ A [`LogitsProcessor`] can be used to modify the prediction scores of a language model head for
112
+ generation.
113
+
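+ As a minimal usage sketch (the checkpoint and the minimum length of 20 tokens are only examples), a logits processor can be passed directly to [`~generation.GenerationMixin.generate`]:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, LogitsProcessorList, MinLengthLogitsProcessor
+
+ tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
+ model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
+ inputs = tokenizer("Hello, my dog is cute and ", return_tensors="pt")
+
+ # force at least 20 tokens before the EOS token can be generated
+ logits_processor = LogitsProcessorList([MinLengthLogitsProcessor(20, eos_token_id=model.config.eos_token_id)])
+ outputs = model.generate(**inputs, logits_processor=logits_processor, max_new_tokens=30)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+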
114
+ ### PyTorch
115
+
116
+ [[autodoc]] AlternatingCodebooksLogitsProcessor
117
+ - __call__
118
+
119
+ [[autodoc]] ClassifierFreeGuidanceLogitsProcessor
120
+ - __call__
121
+
122
+ [[autodoc]] EncoderNoRepeatNGramLogitsProcessor
123
+ - __call__
124
+
125
+ [[autodoc]] EncoderRepetitionPenaltyLogitsProcessor
126
+ - __call__
127
+
128
+ [[autodoc]] EpsilonLogitsWarper
129
+ - __call__
130
+
131
+ [[autodoc]] EtaLogitsWarper
132
+ - __call__
133
+
134
+ [[autodoc]] ExponentialDecayLengthPenalty
135
+ - __call__
136
+
137
+ [[autodoc]] ForcedBOSTokenLogitsProcessor
138
+ - __call__
139
+
140
+ [[autodoc]] ForcedEOSTokenLogitsProcessor
141
+ - __call__
142
+
143
+ [[autodoc]] HammingDiversityLogitsProcessor
144
+ - __call__
145
+
146
+ [[autodoc]] InfNanRemoveLogitsProcessor
147
+ - __call__
148
+
149
+ [[autodoc]] LogitNormalization
150
+ - __call__
151
+
152
+ [[autodoc]] LogitsProcessor
153
+ - __call__
154
+
155
+ [[autodoc]] LogitsProcessorList
156
+ - __call__
157
+
158
+ [[autodoc]] MinLengthLogitsProcessor
159
+ - __call__
160
+
161
+ [[autodoc]] MinNewTokensLengthLogitsProcessor
162
+ - __call__
163
+
164
+ [[autodoc]] MinPLogitsWarper
165
+ - __call__
166
+
167
+ [[autodoc]] NoBadWordsLogitsProcessor
168
+ - __call__
169
+
170
+ [[autodoc]] NoRepeatNGramLogitsProcessor
171
+ - __call__
172
+
173
+ [[autodoc]] PrefixConstrainedLogitsProcessor
174
+ - __call__
175
+
176
+ [[autodoc]] RepetitionPenaltyLogitsProcessor
177
+ - __call__
178
+
179
+ [[autodoc]] SequenceBiasLogitsProcessor
180
+ - __call__
181
+
182
+ [[autodoc]] SuppressTokensAtBeginLogitsProcessor
183
+ - __call__
184
+
185
+ [[autodoc]] SuppressTokensLogitsProcessor
186
+ - __call__
187
+
188
+ [[autodoc]] SynthIDTextWatermarkLogitsProcessor
189
+ - __call__
190
+
191
+ [[autodoc]] TemperatureLogitsWarper
192
+ - __call__
193
+
194
+ [[autodoc]] TopKLogitsWarper
195
+ - __call__
196
+
197
+ [[autodoc]] TopPLogitsWarper
198
+ - __call__
199
+
200
+ [[autodoc]] TypicalLogitsWarper
201
+ - __call__
202
+
203
+ [[autodoc]] UnbatchedClassifierFreeGuidanceLogitsProcessor
204
+ - __call__
205
+
206
+ [[autodoc]] WhisperTimeStampLogitsProcessor
207
+ - __call__
208
+
209
+ [[autodoc]] WatermarkLogitsProcessor
210
+ - __call__
211
+
212
+
213
+ ### TensorFlow
214
+
215
+ [[autodoc]] TFForcedBOSTokenLogitsProcessor
216
+ - __call__
217
+
218
+ [[autodoc]] TFForcedEOSTokenLogitsProcessor
219
+ - __call__
220
+
221
+ [[autodoc]] TFForceTokensLogitsProcessor
222
+ - __call__
223
+
224
+ [[autodoc]] TFLogitsProcessor
225
+ - __call__
226
+
227
+ [[autodoc]] TFLogitsProcessorList
228
+ - __call__
229
+
230
+ [[autodoc]] TFLogitsWarper
231
+ - __call__
232
+
233
+ [[autodoc]] TFMinLengthLogitsProcessor
234
+ - __call__
235
+
236
+ [[autodoc]] TFNoBadWordsLogitsProcessor
237
+ - __call__
238
+
239
+ [[autodoc]] TFNoRepeatNGramLogitsProcessor
240
+ - __call__
241
+
242
+ [[autodoc]] TFRepetitionPenaltyLogitsProcessor
243
+ - __call__
244
+
245
+ [[autodoc]] TFSuppressTokensAtBeginLogitsProcessor
246
+ - __call__
247
+
248
+ [[autodoc]] TFSuppressTokensLogitsProcessor
249
+ - __call__
250
+
251
+ [[autodoc]] TFTemperatureLogitsWarper
252
+ - __call__
253
+
254
+ [[autodoc]] TFTopKLogitsWarper
255
+ - __call__
256
+
257
+ [[autodoc]] TFTopPLogitsWarper
258
+ - __call__
259
+
260
+ ### FLAX
261
+
262
+ [[autodoc]] FlaxForcedBOSTokenLogitsProcessor
263
+ - __call__
264
+
265
+ [[autodoc]] FlaxForcedEOSTokenLogitsProcessor
266
+ - __call__
267
+
268
+ [[autodoc]] FlaxForceTokensLogitsProcessor
269
+ - __call__
270
+
271
+ [[autodoc]] FlaxLogitsProcessor
272
+ - __call__
273
+
274
+ [[autodoc]] FlaxLogitsProcessorList
275
+ - __call__
276
+
277
+ [[autodoc]] FlaxLogitsWarper
278
+ - __call__
279
+
280
+ [[autodoc]] FlaxMinLengthLogitsProcessor
281
+ - __call__
282
+
283
+ [[autodoc]] FlaxSuppressTokensAtBeginLogitsProcessor
284
+ - __call__
285
+
286
+ [[autodoc]] FlaxSuppressTokensLogitsProcessor
287
+ - __call__
288
+
289
+ [[autodoc]] FlaxTemperatureLogitsWarper
290
+ - __call__
291
+
292
+ [[autodoc]] FlaxTopKLogitsWarper
293
+ - __call__
294
+
295
+ [[autodoc]] FlaxTopPLogitsWarper
296
+ - __call__
297
+
298
+ [[autodoc]] FlaxWhisperTimeStampLogitsProcessor
299
+ - __call__
300
+
301
+ ## StoppingCriteria
302
+
303
+ A [`StoppingCriteria`] can be used to change when to stop generation (other than the EOS token). Please note that this is exclusively available to our PyTorch implementations.
304
+
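+ As a minimal usage sketch (the checkpoint and the 5 second limit are only examples), stopping criteria can be passed to [`~generation.GenerationMixin.generate`]:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteriaList, MaxTimeCriteria
+
+ tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
+ model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
+ inputs = tokenizer("The quick brown fox", return_tensors="pt")
+
+ # stop generating after roughly 5 seconds, even if max_new_tokens has not been reached
+ stopping_criteria = StoppingCriteriaList([MaxTimeCriteria(max_time=5.0)])
+ outputs = model.generate(**inputs, stopping_criteria=stopping_criteria, max_new_tokens=200)
+ ```
+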
305
+ [[autodoc]] StoppingCriteria
306
+ - __call__
307
+
308
+ [[autodoc]] StoppingCriteriaList
309
+ - __call__
310
+
311
+ [[autodoc]] MaxLengthCriteria
312
+ - __call__
313
+
314
+ [[autodoc]] MaxTimeCriteria
315
+ - __call__
316
+
317
+ [[autodoc]] StopStringCriteria
318
+ - __call__
319
+
320
+ [[autodoc]] EosTokenCriteria
321
+ - __call__
322
+
323
+ ## Constraints
324
+
325
+ A [`Constraint`] can be used to force the generation to include specific tokens or sequences in the output. Please note that this is exclusively available to our PyTorch implementations.
326
+
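+ As a hedged sketch (the checkpoint and forced phrase below are arbitrary), a [`PhrasalConstraint`] can be passed to [`~GenerationMixin.generate`] through its `constraints` argument; constrained decoding requires beam search.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, PhrasalConstraint
+
+ tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
+ model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
+ inputs = tokenizer("The weather forecast for tomorrow is", return_tensors="pt")
+
+ # Force the phrase "heavy rain" to appear somewhere in the generated sequence
+ force_ids = tokenizer("heavy rain", add_special_tokens=False).input_ids
+ constraints = [PhrasalConstraint(force_ids)]
+
+ # Constrained decoding only works with beam search (num_beams > 1)
+ out = model.generate(**inputs, constraints=constraints, num_beams=5, max_new_tokens=30)
+ print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
+ ```
+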
327
+ [[autodoc]] Constraint
328
+
329
+ [[autodoc]] PhrasalConstraint
330
+
331
+ [[autodoc]] DisjunctiveConstraint
332
+
333
+ [[autodoc]] ConstraintListState
334
+
335
+ ## BeamSearch
336
+
337
+ [[autodoc]] BeamScorer
338
+ - process
339
+ - finalize
340
+
341
+ [[autodoc]] BeamSearchScorer
342
+ - process
343
+ - finalize
344
+
345
+ [[autodoc]] ConstrainedBeamSearchScorer
346
+ - process
347
+ - finalize
348
+
349
+ ## Streamers
350
+
351
+ [[autodoc]] TextStreamer
352
+
353
+ [[autodoc]] TextIteratorStreamer
354
+
355
+ [[autodoc]] AsyncTextIteratorStreamer
356
+
357
+ ## Caches
358
+
359
+ [[autodoc]] Cache
360
+ - update
361
+
362
+ [[autodoc]] CacheConfig
363
+ - update
364
+
365
+ [[autodoc]] QuantizedCacheConfig
366
+ - validate
367
+
368
+ [[autodoc]] DynamicCache
369
+ - update
370
+ - get_seq_length
371
+ - reorder_cache
372
+ - to_legacy_cache
373
+ - from_legacy_cache
374
+
375
+ [[autodoc]] QuantizedCache
376
+ - update
377
+ - get_seq_length
378
+
379
+ [[autodoc]] QuantoQuantizedCache
380
+
381
+ [[autodoc]] HQQQuantizedCache
382
+
383
+ [[autodoc]] SinkCache
384
+ - update
385
+ - get_seq_length
386
+ - reorder_cache
387
+
388
+ [[autodoc]] OffloadedCache
389
+ - update
390
+ - prefetch_layer
391
+ - evict_previous_layer
392
+
393
+ [[autodoc]] StaticCache
394
+ - update
395
+ - get_seq_length
396
+ - reset
397
+
398
+ [[autodoc]] OffloadedStaticCache
399
+ - update
400
+ - get_seq_length
401
+ - reset
402
+
403
+ [[autodoc]] HybridCache
404
+ - update
405
+ - get_seq_length
406
+ - reset
407
+
408
+ [[autodoc]] SlidingWindowCache
409
+ - update
410
+ - reset
411
+
412
+ [[autodoc]] EncoderDecoderCache
413
+ - get_seq_length
414
+ - to_legacy_cache
415
+ - from_legacy_cache
416
+ - reset
417
+ - reorder_cache
418
+
419
+ [[autodoc]] MambaCache
420
+ - update_conv_state
421
+ - update_ssm_state
422
+ - reset
423
+
424
+ ## Watermark Utils
425
+
426
+ [[autodoc]] WatermarkingConfig
427
+ - __call__
428
+
429
+ [[autodoc]] WatermarkDetector
430
+ - __call__
431
+
432
+ [[autodoc]] BayesianDetectorConfig
433
+
434
+ [[autodoc]] BayesianDetectorModel
435
+ - forward
436
+
437
+ [[autodoc]] SynthIDTextWatermarkingConfig
438
+
439
+ [[autodoc]] SynthIDTextWatermarkDetector
440
+ - __call__
441
+
442
+ ## Compile Utils
443
+
444
+ [[autodoc]] CompileConfig
445
+ - __call__
446
+
docs/transformers/docs/source/en/internal/image_processing_utils.md ADDED
@@ -0,0 +1,48 @@
1
+ <!--Copyright 2022 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Utilities for Image Processors
18
+
19
+ This page lists all the utility functions used by the image processors, mainly the functional
20
+ transformations used to process the images.
21
+
22
+ Most of those are only useful if you are studying the code of the image processors in the library.
23
+
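+ As a quick, hedged sketch of how a couple of the transforms documented below are typically called (exact defaults may vary between versions), they operate on plain NumPy arrays:
+
+ ```python
+ import numpy as np
+ from transformers.image_transforms import center_crop, resize, to_pil_image
+
+ # A dummy channels-last RGB image standing in for a real picture
+ image = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
+
+ resized = resize(image, size=(256, 256))
+ cropped = center_crop(resized, size=(224, 224))
+ print(to_pil_image(cropped).size)  # (224, 224)
+ ```
+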
24
+ ## Image Transformations
25
+
26
+ [[autodoc]] image_transforms.center_crop
27
+
28
+ [[autodoc]] image_transforms.center_to_corners_format
29
+
30
+ [[autodoc]] image_transforms.corners_to_center_format
31
+
32
+ [[autodoc]] image_transforms.id_to_rgb
33
+
34
+ [[autodoc]] image_transforms.normalize
35
+
36
+ [[autodoc]] image_transforms.pad
37
+
38
+ [[autodoc]] image_transforms.rgb_to_id
39
+
40
+ [[autodoc]] image_transforms.rescale
41
+
42
+ [[autodoc]] image_transforms.resize
43
+
44
+ [[autodoc]] image_transforms.to_pil_image
45
+
46
+ ## ImageProcessingMixin
47
+
48
+ [[autodoc]] image_processing_utils.ImageProcessingMixin
docs/transformers/docs/source/en/internal/import_utils.md ADDED
@@ -0,0 +1,91 @@
1
+ <!--Copyright 2025 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Import Utilities
18
+
19
+ This page goes through the transformers utilities to enable lazy and fast object import.
20
+ While we strive for minimal dependencies, some models have specific dependency requirements that cannot be
21
+ worked around. We don't want all users of `transformers` to have to install those dependencies just to use other models,
22
+ so we mark them as soft dependencies rather than hard dependencies.
23
+
24
+ The transformers toolkit is not designed to error out when importing a model that has a specific dependency; instead, an
25
+ object for which you are missing a dependency will only error out when you call a method on it. As an example, if
26
+ `torchvision` isn't installed, the fast image processors will not be available.
27
+
28
+ This object is still importable:
29
+
30
+ ```python
31
+ >>> from transformers import DetrImageProcessorFast
32
+ >>> print(DetrImageProcessorFast)
33
+ <class 'DetrImageProcessorFast'>
34
+ ```
35
+
36
+ However, no method can be called on that object:
37
+
38
+ ```python
39
+ >>> DetrImageProcessorFast.from_pretrained()
40
+ ImportError:
41
+ DetrImageProcessorFast requires the Torchvision library but it was not found in your environment. Checkout the instructions on the
42
+ installation page: https://pytorch.org/get-started/locally/ and follow the ones that match your environment.
43
+ Please note that you may need to restart your runtime after installation.
44
+ ```
45
+
46
+ Let's see how to specify object dependencies.
47
+
48
+ ## Specifying Object Dependencies
49
+
50
+ ### Filename-based
51
+
52
+ All objects under a given filename have an automatic dependency on the tool linked to that filename.
53
+
54
+ **TensorFlow**: All files starting with `modeling_tf_` have an automatic TensorFlow dependency.
55
+
56
+ **Flax**: All files starting with `modeling_flax_` have an automatic Flax dependency.
57
+
58
+ **PyTorch**: All files starting with `modeling_` that don't match the above (TensorFlow and Flax) have an automatic
59
+ PyTorch dependency.
60
+
61
+ **Tokenizers**: All files starting with `tokenization_` and ending with `_fast` have an automatic `tokenizers` dependency.
62
+
63
+ **Vision**: All files starting with `image_processing_` have an automatic dependency on the `vision` dependency group;
64
+ at the time of writing, this only contains the `pillow` dependency.
65
+
66
+ **Vision + Torch + Torchvision**: All files starting with `image_processing_` and ending with `_fast` have an automatic
67
+ dependency on `vision`, `torch`, and `torchvision`.
68
+
69
+ All of these automatic dependencies are added on top of the explicit dependencies that are detailed below.
70
+
71
+ ### Explicit Object Dependencies
72
+
73
+ We provide a decorator called `requires` that is used to explicitly specify the dependencies of a given object. As an
74
+ example, the `Trainer` class has two hard dependencies: `torch` and `accelerate`. Here is how we specify these
75
+ required dependencies:
76
+
77
+ ```python
78
+ from .utils.import_utils import requires
79
+
80
+ @requires(backends=("torch", "accelerate"))
81
+ class Trainer:
82
+ ...
83
+ ```
84
+
85
+ The backends that can be added here are all the backends available in the `import_utils.py` module.
86
+
87
+ ## Methods
88
+
89
+ [[autodoc]] utils.import_utils.define_import_structure
90
+
91
+ [[autodoc]] utils.import_utils.requires
docs/transformers/docs/source/en/internal/model_debugging_utils.md ADDED
@@ -0,0 +1,213 @@
1
+ <!--Copyright 2025 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Model debugging toolboxes
18
+
19
+ This page lists all the debugging and model adding tools used by the library, as well as the utility functions it provides for it.
20
+
21
+ Most of those are only useful if you are adding new models in the library.
22
+
23
+
24
+ ## Model addition debuggers
25
+
26
+
27
+ ### Model addition debugger - context manager for model adders
28
+
29
+ This context manager is a power user tool intended for model adders.
30
+ It tracks all forward calls within a model's forward pass and logs a slice of each input and output to a nested JSON file.
31
+ Note that this context manager enforces `torch.no_grad()`.
32
+
33
+ ### Rationale
34
+
35
+ When porting models to transformers, even from Python to Python, model adders often have to do a lot of manual work, such as saving and loading tensors and comparing dtypes. This small tool can hopefully shave off some of that time.
36
+
37
+ ### Usage
38
+
39
+ Add this context manager as follows to debug a model:
40
+
41
+ ```python
42
+ import torch
43
+ from PIL import Image
44
+ import requests
45
+ from transformers import LlavaProcessor, LlavaForConditionalGeneration
46
+ from transformers.model_debugging_utils import model_addition_debugger_context
47
+ torch.random.manual_seed(673)
48
+
49
+ # load pretrained model and processor
50
+ model_id = "llava-hf/llava-1.5-7b-hf"
51
+ processor = LlavaProcessor.from_pretrained(model_id)
52
+ model = LlavaForConditionalGeneration.from_pretrained(model_id, low_cpu_mem_usage=True)
53
+
54
+ # create random image input
55
+ random_image = Image.fromarray(torch.randint(0, 256, (224, 224, 3), dtype=torch.uint8).numpy())
56
+
57
+ # prompt
58
+ prompt = "<image>Describe this image."
59
+
60
+ # process inputs
61
+ inputs = processor(text=prompt, images=random_image, return_tensors="pt")
62
+
63
+ # call forward method (not .generate!)
64
+ with model_addition_debugger_context(
65
+ model,
66
+ debug_path="optional_path_to_your_directory",
67
+ do_prune_layers=False # This will output ALL the layers of a model.
68
+ ):
69
+ output = model.forward(**inputs)
70
+
71
+ ```
72
+
73
+
74
+ ### Reading results
75
+
76
+ The debugger generates two files from the forward call, both with the same base name,
77
+ but ending either with `_SUMMARY.json` or with `_FULL_TENSORS.json`.
78
+
79
+ The first one will contain a summary of each module's _input_ and _output_ tensor values and shapes.
80
+
81
+ ```json
82
+ {
83
+ "module_path": "MolmoForConditionalGeneration",
84
+ "inputs": {
85
+ "args": [],
86
+ "kwargs": {
87
+ "input_ids": {
88
+ "shape": "torch.Size([1, 589])",
89
+ "dtype": "torch.int64"
90
+ },
91
+ "attention_mask": {
92
+ "shape": "torch.Size([1, 589])",
93
+ "dtype": "torch.int64"
94
+ },
95
+ "pixel_values": {
96
+ "shape": "torch.Size([1, 5, 576, 588])",
97
+ "dtype": "torch.float32",
98
+ "mean": "tensor(-8.9514e-01, device='cuda:0')",
99
+ "std": "tensor(9.2586e-01, device='cuda:0')",
100
+ "min": "tensor(-1.7923e+00, device='cuda:0')",
101
+ "max": "tensor(1.8899e+00, device='cuda:0')"
102
+ }
103
+ },
104
+ "children": [
105
+ {
106
+ "module_path": "MolmoForConditionalGeneration.language_model.model.embed_tokens",
107
+ "inputs": {
108
+ "args": [
109
+ {
110
+ "shape": "torch.Size([1, 589])",
111
+ "dtype": "torch.int64"
112
+ }
113
+ ]
114
+ },
115
+ "outputs": {
116
+ "shape": "torch.Size([1, 589, 3584])",
117
+ "dtype": "torch.float32",
118
+ "mean": "tensor(6.5460e-06, device='cuda:0')",
119
+ "std": "tensor(2.3807e-02, device='cuda:0')",
120
+ "min": "tensor(-3.3398e-01, device='cuda:0')",
121
+ "max": "tensor(3.9453e-01, device='cuda:0')"
122
+ }
123
+ },
124
+ {
125
+ "module_path": "MolmoForConditionalGeneration.vision_tower",
126
+ "inputs": {
127
+ "args": [
128
+ {
129
+ "shape": "torch.Size([5, 1, 576, 588])",
130
+ "dtype": "torch.float32",
131
+ "mean": "tensor(-8.9514e-01, device='cuda:0')",
132
+ "std": "tensor(9.2586e-01, device='cuda:0')",
133
+ "min": "tensor(-1.7923e+00, device='cuda:0')",
134
+ "max": "tensor(1.8899e+00, device='cuda:0')"
135
+ }
136
+ ],
137
+ "kwargs": {
138
+ "output_hidden_states": "True"
139
+ }
140
+ },
141
+ "children": [
142
+ { ... and so on
143
+ ```
144
+
145
+ The `_FULL_TENSORS.json` file will display a full view of all tensors, which is useful
146
+ for comparing two files.
147
+ ```json
148
+ "pixel_values": {
149
+ "shape": "torch.Size([1, 5, 576, 588])",
150
+ "dtype": "torch.float32",
151
+ "value": [
152
+ "tensor([[[[-1.7923e+00, -1.7521e+00, -1.4802e+00, ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
153
+ " [-1.7923e+00, -1.7521e+00, -1.4802e+00, ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
154
+ " [-1.7923e+00, -1.7521e+00, -1.4802e+00, ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
155
+ " ...,",
156
+ " [-1.7923e+00, -1.7521e+00, -1.4802e+00, ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
157
+ " [-1.7923e+00, -1.7521e+00, -1.4802e+00, ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
158
+ " [-1.7923e+00, -1.7521e+00, -1.4802e+00, ..., -1.7923e+00, -1.7521e+00, -1.4802e+00]],",
159
+ "",
160
+ " [[-1.7923e+00, -1.7521e+00, -1.4802e+00, ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
161
+ " [-1.7923e+00, -1.7521e+00, -1.4802e+00, ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
162
+ " [-1.7923e+00, -1.7521e+00, -1.4802e+00, ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
163
+ " ...,",
164
+ " [-1.4857e+00, -1.4820e+00, -1.2100e+00, ..., -6.0979e-01, -5.9650e-01, -3.8527e-01],",
165
+ " [-1.6755e+00, -1.7221e+00, -1.4518e+00, ..., -7.5577e-01, -7.4658e-01, -5.5592e-01],",
166
+ " [-7.9957e-01, -8.2162e-01, -5.7014e-01, ..., -1.3689e+00, -1.3169e+00, -1.0678e+00]],",
167
+ "",
168
+ " [[-1.7923e+00, -1.7521e+00, -1.4802e+00, ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
169
+ " [-1.7923e+00, -1.7521e+00, -1.4802e+00, ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
170
+ " [-1.7923e+00, -1.7521e+00, -1.4802e+00, ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
171
+ " ...,",
172
+ " [-3.0322e-01, -5.0645e-01, -5.8436e-01, ..., -6.2439e-01, -7.9160e-01, -8.1188e-01],",
173
+ " [-4.4921e-01, -6.5653e-01, -7.2656e-01, ..., -3.4702e-01, -5.2146e-01, -5.1326e-01],",
174
+ " [-3.4702e-01, -5.3647e-01, -5.4170e-01, ..., -1.0915e+00, -1.1968e+00, -1.0252e+00]],",
175
+ "",
176
+ " [[-1.1207e+00, -1.2718e+00, -1.0678e+00, ..., 1.2013e-01, -1.3126e-01, -1.7197e-01],",
177
+ " [-6.9738e-01, -9.1166e-01, -8.5454e-01, ..., -5.5050e-02, -2.8134e-01, -4.2793e-01],",
178
+ " [-3.4702e-01, -5.5148e-01, -5.8436e-01, ..., 1.9312e-01, -8.6235e-02, -2.1463e-01],",
179
+ " ...,",
180
+ " [-1.7923e+00, -1.7521e+00, -1.4802e+00, ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
181
+ " [-1.7923e+00, -1.7521e+00, -1.4802e+00, ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
182
+ " [-1.7923e+00, -1.7521e+00, -1.4802e+00, ..., -1.7923e+00, -1.7521e+00, -1.4802e+00]],",
183
+ "",
184
+ " [[-1.0039e+00, -9.5669e-01, -6.5546e-01, ..., -1.4711e+00, -1.4219e+00, -1.1389e+00],",
185
+ " [-1.0039e+00, -9.5669e-01, -6.5546e-01, ..., -1.7193e+00, -1.6771e+00, -1.4091e+00],",
186
+ " [-1.6317e+00, -1.6020e+00, -1.2669e+00, ..., -1.2667e+00, -1.2268e+00, -8.9720e-01],",
187
+ " ...,",
188
+ " [-1.7923e+00, -1.7521e+00, -1.4802e+00, ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
189
+ " [-1.7923e+00, -1.7521e+00, -1.4802e+00, ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
190
+ " [-1.7923e+00, -1.7521e+00, -1.4802e+00, ..., -1.7923e+00, -1.7521e+00, -1.4802e+00]]]], device='cuda:0')"
191
+ ],
192
+ "mean": "tensor(-8.9514e-01, device='cuda:0')",
193
+ "std": "tensor(9.2586e-01, device='cuda:0')",
194
+ "min": "tensor(-1.7923e+00, device='cuda:0')",
195
+ "max": "tensor(1.8899e+00, device='cuda:0')"
196
+ },
197
+ ```
198
+
199
+ ### Comparing between implementations
200
+
201
+ Once the forward passes of two models have been traced by the debugger, one can compare the `json` output files. In the example below, we can see slight differences between the two implementations' key projection layers. Inputs are mostly identical, but not quite. Looking through the file differences makes it easier to pinpoint which layer is wrong.
202
+
203
+
204
+ ![download-icon](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/files_difference_debugging.png)
205
+
206
+
207
+ ### Limitations and scope
208
+
209
+ This feature will only work for torch-based models, and would require more work and a case-by-case approach for, say, `jax`-based models that are usually compiled. Models relying heavily on external kernel calls may work, but the trace will probably miss some things. Regardless, any Python implementation that aims at mimicking another implementation can be traced once instead of rerun N times with breakpoints.
210
+
211
+ If you pass `do_prune_layers=False` to your model debugger, ALL the layers will be output to `json`. Otherwise, only the first and last layers will be shown. This is useful when some layers (typically cross-attention) appear only after N layers.
212
+
213
+ [[autodoc]] model_addition_debugger_context
docs/transformers/docs/source/en/internal/modeling_utils.md ADDED
@@ -0,0 +1,78 @@
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Custom Layers and Utilities
18
+
19
+ This page lists all the custom layers used by the library, as well as the utility functions and classes it provides for modeling.
20
+
21
+ Most of those are only useful if you are studying the code of the models in the library.
22
+
23
+ ## Layers
24
+
25
+ [[autodoc]] GradientCheckpointingLayer
26
+
27
+ ## Attention Functions
28
+
29
+ [[autodoc]] AttentionInterface
30
+ - register
31
+
32
+ ## Rotary Position Embedding Functions
33
+
34
+ [[autodoc]] dynamic_rope_update
35
+
36
+ ## PyTorch custom modules
37
+
38
+ [[autodoc]] pytorch_utils.Conv1D
39
+
40
+ ## PyTorch Helper Functions
41
+
42
+ [[autodoc]] pytorch_utils.apply_chunking_to_forward
43
+
44
+ [[autodoc]] pytorch_utils.find_pruneable_heads_and_indices
45
+
46
+ [[autodoc]] pytorch_utils.prune_layer
47
+
48
+ [[autodoc]] pytorch_utils.prune_conv1d_layer
49
+
50
+ [[autodoc]] pytorch_utils.prune_linear_layer
51
+
52
+ ## TensorFlow custom layers
53
+
54
+ [[autodoc]] modeling_tf_utils.TFConv1D
55
+
56
+ [[autodoc]] modeling_tf_utils.TFSequenceSummary
57
+
58
+ ## TensorFlow loss functions
59
+
60
+ [[autodoc]] modeling_tf_utils.TFCausalLanguageModelingLoss
61
+
62
+ [[autodoc]] modeling_tf_utils.TFMaskedLanguageModelingLoss
63
+
64
+ [[autodoc]] modeling_tf_utils.TFMultipleChoiceLoss
65
+
66
+ [[autodoc]] modeling_tf_utils.TFQuestionAnsweringLoss
67
+
68
+ [[autodoc]] modeling_tf_utils.TFSequenceClassificationLoss
69
+
70
+ [[autodoc]] modeling_tf_utils.TFTokenClassificationLoss
71
+
72
+ ## TensorFlow Helper Functions
73
+
74
+ [[autodoc]] modeling_tf_utils.get_initializer
75
+
76
+ [[autodoc]] modeling_tf_utils.keras_serializable
77
+
78
+ [[autodoc]] modeling_tf_utils.shape_list
docs/transformers/docs/source/en/internal/pipelines_utils.md ADDED
@@ -0,0 +1,44 @@
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Utilities for pipelines
18
+
19
+ This page lists all the utility functions the library provides for pipelines.
20
+
21
+ Most of those are only useful if you are studying the code of the pipelines in the library.
22
+
23
+
24
+ ## Argument handling
25
+
26
+ [[autodoc]] pipelines.ArgumentHandler
27
+
28
+ [[autodoc]] pipelines.ZeroShotClassificationArgumentHandler
29
+
30
+ [[autodoc]] pipelines.QuestionAnsweringArgumentHandler
31
+
32
+ ## Data format
33
+
34
+ [[autodoc]] pipelines.PipelineDataFormat
35
+
36
+ [[autodoc]] pipelines.CsvPipelineDataFormat
37
+
38
+ [[autodoc]] pipelines.JsonPipelineDataFormat
39
+
40
+ [[autodoc]] pipelines.PipedPipelineDataFormat
41
+
42
+ ## Utilities
43
+
44
+ [[autodoc]] pipelines.PipelineException
docs/transformers/docs/source/en/internal/time_series_utils.md ADDED
@@ -0,0 +1,29 @@
1
+ <!--Copyright 2023 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Time Series Utilities
18
+
19
+ This page lists all the utility functions and classes that can be used for Time Series based models.
20
+
21
+ Most of those are only useful if you are studying the code of the time series models or you wish to add to the collection of distributional output classes.
22
+
23
+ ## Distributional Output
24
+
25
+ [[autodoc]] time_series_utils.NormalOutput
26
+
27
+ [[autodoc]] time_series_utils.StudentTOutput
28
+
29
+ [[autodoc]] time_series_utils.NegativeBinomialOutput
docs/transformers/docs/source/en/internal/tokenization_utils.md ADDED
@@ -0,0 +1,42 @@
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Utilities for Tokenizers
18
+
19
+ This page lists all the utility functions used by the tokenizers, mainly the class
20
+ [`~tokenization_utils_base.PreTrainedTokenizerBase`] that implements the common methods between
21
+ [`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`] and the mixin
22
+ [`~tokenization_utils_base.SpecialTokensMixin`].
23
+
24
+ Most of those are only useful if you are studying the code of the tokenizers in the library.
25
+
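+ As a small, hedged illustration of how these internals surface in the public API (the sentences below are arbitrary), the values of [`~tokenization_utils_base.TruncationStrategy`] correspond to the `truncation` argument of a tokenizer's `__call__`:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
+
+ # "only_second" is one of the TruncationStrategy values: only the second sentence of the pair is truncated
+ encoded = tokenizer(
+     "A short question?",
+     "A much longer context paragraph that will be truncated if the pair exceeds the maximum length.",
+     truncation="only_second",
+     max_length=16,
+ )
+ print(len(encoded.input_ids))  # at most 16
+ ```
+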
26
+ ## PreTrainedTokenizerBase
27
+
28
+ [[autodoc]] tokenization_utils_base.PreTrainedTokenizerBase
29
+ - __call__
30
+ - all
31
+
32
+ ## SpecialTokensMixin
33
+
34
+ [[autodoc]] tokenization_utils_base.SpecialTokensMixin
35
+
36
+ ## Enums and namedtuples
37
+
38
+ [[autodoc]] tokenization_utils_base.TruncationStrategy
39
+
40
+ [[autodoc]] tokenization_utils_base.CharSpan
41
+
42
+ [[autodoc]] tokenization_utils_base.TokenSpan
docs/transformers/docs/source/en/internal/trainer_utils.md ADDED
@@ -0,0 +1,49 @@
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Utilities for Trainer
18
+
19
+ This page lists all the utility functions used by [`Trainer`].
20
+
21
+ Most of those are only useful if you are studying the code of the Trainer in the library.
22
+
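+ As a brief, hedged example of two of the helpers documented below in action (the dataclass and flags are made up for illustration):
+
+ ```python
+ from dataclasses import dataclass, field
+ from transformers import HfArgumentParser, set_seed
+
+ @dataclass
+ class MyArguments:
+     learning_rate: float = field(default=5e-5)
+     seed: int = field(default=42)
+
+ # Parses command-line flags such as `--learning_rate 3e-5 --seed 7` into the dataclass
+ parser = HfArgumentParser(MyArguments)
+ (args,) = parser.parse_args_into_dataclasses()
+
+ # Seeds the Python, NumPy and PyTorch RNGs for reproducible runs
+ set_seed(args.seed)
+ ```
+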
23
+ ## Utilities
24
+
25
+ [[autodoc]] EvalPrediction
26
+
27
+ [[autodoc]] IntervalStrategy
28
+
29
+ [[autodoc]] enable_full_determinism
30
+
31
+ [[autodoc]] set_seed
32
+
33
+ [[autodoc]] torch_distributed_zero_first
34
+
35
+ ## Callbacks internals
36
+
37
+ [[autodoc]] trainer_callback.CallbackHandler
38
+
39
+ ## Distributed Evaluation
40
+
41
+ [[autodoc]] trainer_pt_utils.DistributedTensorGatherer
42
+
43
+ ## Trainer Argument Parser
44
+
45
+ [[autodoc]] HfArgumentParser
46
+
47
+ ## Debug Utilities
48
+
49
+ [[autodoc]] debug_utils.DebugUnderflowOverflow
docs/transformers/docs/source/en/kv_cache.md ADDED
@@ -0,0 +1,359 @@
1
+ <!--Copyright 2024 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # KV cache strategies
18
+
19
+ The key-value (KV) vectors are used to calculate attention scores. For autoregressive models, the KV values are recalculated *every* time because the model predicts one token at a time. Each prediction depends on the previous tokens, which means the model repeats the same computations each time.
20
+
21
+ A KV *cache* stores these calculations so they can be reused without recomputing them. Efficient caching is crucial for optimizing model performance because it reduces computation time and improves response rates. Refer to the [Caching](./cache_explanation) doc for a more detailed explanation about how a cache works.
22
+
23
+ Transformers offers several [`Cache`] classes that implement different caching mechanisms. Some of these [`Cache`] classes are optimized to save memory while others are designed to maximize generation speed. Refer to the table below to compare cache types and use it to help you select the best cache for your use case.
24
+
25
+ | Cache Type | Memory Efficient | Supports torch.compile() | Initialization Recommended | Latency | Long Context Generation |
26
+ |------------------------|------------------|--------------------------|----------------------------|---------|-------------------------|
27
+ | Dynamic Cache | No | No | No | Mid | No |
28
+ | Static Cache | No | Yes | Yes | High | No |
29
+ | Offloaded Cache | Yes | No | No | Low | Yes |
30
+ | Offloaded Static Cache | No | Yes | Yes | High | Yes |
31
+ | Quantized Cache | Yes | No | No | Low | Yes |
32
+ | Sliding Window Cache | No | Yes | Yes | High | No |
33
+ | Sink Cache | Yes | No | Yes | Mid | Yes |
34
+
35
+ This guide introduces you to the different [`Cache`] classes and shows you how to use them for generation.
36
+
37
+ ## Default cache
38
+
39
+ The [`DynamicCache`] is the default cache class for most models. It allows the cache size to grow dynamically in order to store an increasing number of keys and values as generation progresses.
40
+
41
+ Disable the cache by configuring `use_cache=False` in [`~GenerationMixin.generate`].
42
+
43
+ ```py
44
+ import torch
45
+ from transformers import AutoTokenizer, AutoModelForCausalLM
46
+
47
+ tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
48
+ model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16).to("cuda:0")
49
+ inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)
50
+
51
+ model.generate(**inputs, do_sample=False, max_new_tokens=20, use_cache=False)
52
+ ```
53
+
54
+ Cache classes can also be initialized first and then passed to the model's [past_key_values](https://hf.co/docs/transformers/internal/generation_utils#transformers.generation.GenerateDecoderOnlyOutput.past_key_values) parameter. This cache initialization strategy is only recommended for some cache types.
55
+
56
+ In most other cases, it's easier to define the cache strategy in the [cache_implementation](https://hf.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.cache_implementation) parameter.
57
+
58
+ ```py
59
+ import torch
60
+ from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache
61
+
62
+ tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
63
+ model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16).to("cuda:0")
64
+ inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)
65
+
66
+ past_key_values = DynamicCache()
67
+ out = model.generate(**inputs, do_sample=False, max_new_tokens=20, past_key_values=past_key_values)
68
+ ```
69
+
70
+ ## Memory efficient caches
71
+
72
+ The KV cache can occupy a significant portion of memory and become a [bottleneck](https://hf.co/blog/llama31#inference-memory-requirements) for long-context generation. Memory efficient caches focus on trading off speed for reduced memory usage. This is especially important for large language models (LLMs) and if your hardware is memory constrained.
73
+
74
+ ### Offloaded cache
75
+
76
+ The [`OffloadedCache`] saves GPU memory by moving the KV cache for most model layers to the CPU. Only the current layer cache is maintained on the GPU during a model's `forward` iteration over the layers. [`OffloadedCache`] asynchronously prefetches the next layer cache and sends the previous layer cache back to the CPU.
77
+
78
+ This cache strategy always generates the same result as [`DynamicCache`] and works as a drop-in replacement or fallback. You may want to use [`OffloadedCache`] if you have a GPU and you're getting out-of-memory (OOM) errors.
79
+
80
+ > [!WARNING]
81
+ > You may notice a small degradation in generation throughput compared to [`DynamicCache`] depending on your model and generation choices (context size, number of generated tokens, number of beams, etc.).
82
+
83
+ Enable [`OffloadedCache`] by configuring `cache_implementation="offloaded"` in either [`GenerationConfig`] or [`~GenerationMixin.generate`].
84
+
85
+ ```py
86
+ import torch
87
+ from transformers import AutoTokenizer, AutoModelForCausalLM
88
+
89
+ ckpt = "microsoft/Phi-3-mini-4k-instruct"
90
+ tokenizer = AutoTokenizer.from_pretrained(ckpt)
91
+ model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda:0")
92
+ inputs = tokenizer("Fun fact: The shortest", return_tensors="pt").to(model.device)
93
+
94
+ out = model.generate(**inputs, do_sample=False, max_new_tokens=23, cache_implementation="offloaded")
95
+ print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
96
+ Fun fact: The shortest war in history was between Britain and Zanzibar on August 27, 1896.
97
+ ```
98
+
99
+ The example below shows how you can fall back on [`OffloadedCache`] if you run out of memory.
100
+
101
+ ```py
102
+ import torch
103
+ from transformers import AutoTokenizer, AutoModelForCausalLM
104
+
105
+ def resilient_generate(model, *args, **kwargs):
106
+ oom = False
107
+ try:
108
+ return model.generate(*args, **kwargs)
109
+ except torch.cuda.OutOfMemoryError as e:
110
+ print(e)
111
+ print("retrying with cache_implementation='offloaded'")
112
+ oom = True
113
+ if oom:
114
+ torch.cuda.empty_cache()
115
+ kwargs["cache_implementation"] = "offloaded"
116
+ return model.generate(*args, **kwargs)
117
+
118
+ ckpt = "microsoft/Phi-3-mini-4k-instruct"
119
+ tokenizer = AutoTokenizer.from_pretrained(ckpt)
120
+ model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda:0")
121
+ prompt = ["okay "*1000 + "Fun fact: The most"]
122
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
123
+ beams = { "num_beams": 40, "num_beam_groups": 40, "num_return_sequences": 40, "diversity_penalty": 1.0, "max_new_tokens": 23, "early_stopping": True, }
124
+ out = resilient_generate(model, **inputs, **beams)
125
+ responses = tokenizer.batch_decode(out[:,-28:], skip_special_tokens=True)
126
+ ```
127
+
128
+ ### Quantized cache
129
+
130
+ The [`QuantizedCache`] reduces memory requirements by quantizing the KV values to a lower precision. [`QuantizedCache`] currently supports two quantization backends.
131
+
132
+ - [`HQQQuantizedCache`] supports int2, int4, and int8 datatypes.
133
+ - [`QuantoQuantizedCache`] supports int2 and int4 datatypes. This is the default quantization backend.
134
+
135
+ > [!WARNING]
136
+ > Quantizing the cache can harm latency if the context length is short and there is enough GPU memory available for generation without enabling cache quantization. Try to find a balance between memory efficiency and latency.
137
+
138
+ Enable [`QuantizedCache`] by configuring `cache_implementation="quantized"` in [`GenerationConfig`], and indicate the quantization backend in [`QuantizedCacheConfig`]. Any additional quantization-related parameters should also be passed either as a dict or an instance of [`QuantizedCacheConfig`]. You should use the default values for these additional parameters unless you're running out of memory. In that case, consider decreasing the residual length.
139
+
140
+ <hfoptions id="quantized-cache">
141
+ <hfoption id="HQQQuantizedCache">
142
+
143
+ For [`HQQQuantizedCache`], we recommend setting the `axis-key` and `axis-value` parameters to `1`.
144
+
145
+ ```py
146
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM, HQQQuantizedCache, QuantizedCacheConfig
147
+
148
+ tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
149
+ model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16).to("cuda:0")
150
+ inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)
151
+
152
+ out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="quantized", cache_config={"axis-key": 1, "axis-value": 1, "backend": "hqq"})
153
+ print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
154
+ I like rock music because it's loud and energetic. It's a great way to express myself and rel
155
+ ```
156
+
157
+ </hfoption>
158
+ <hfoption id="Quanto">
159
+
160
+ For [`QuantoQuantizedCache`], we recommend setting the `axis-key` and `axis-value` parameters to `0`.
161
+
162
+ ```py
163
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM, QuantoQuantizedCache, QuantizedCacheConfig
164
+
165
+ tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
166
+ model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16).to("cuda:0")
167
+ inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)
168
+
169
+ out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="quantized", cache_config={"nbits": 4, "axis-key": 0, "axis-value": 0, "backend": "quanto"})
170
+ print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
171
+ I like rock music because it's loud and energetic. It's a great way to express myself and rel
172
+ ```
173
+
174
+ </hfoption>
175
+ </hfoptions>
176
+
177
+ ### Sink cache
178
+
179
+ [`SinkCache`] is capable of generating very long sequences ("infinite length" according to the paper) by only retaining a few initial tokens from the sequence. These are called the *sink tokens* because they account for a significant portion of the attention scores during generation. Subsequent tokens are discarded on a sliding window basis, and only the latest `window_size` tokens are kept. This means most of the previous knowledge is discarded.
180
+
181
+ The sink tokens allow a model to maintain stable performance even when it's dealing with very long text sequences.
182
+
183
+ Enable [`SinkCache`] by initializing it first with the [window_length](https://hf.co/docs/transformers/main/en/internal/generation_utils#transformers.SinkCache.window_length) and [num_sink_tokens](https://hf.co/docs/transformers/main/en/internal/generation_utils#transformers.SinkCache.num_sink_tokens) parameters before passing it to [past_key_values](https://hf.co/docs/transformers/internal/generation_utils#transformers.generation.GenerateDecoderOnlyOutput.past_key_values) in [`~GenerationMixin.generate`].
184
+
185
+ ```py
186
+ import torch
187
+ from transformers import AutoTokenizer, AutoModelForCausalLM, SinkCache
188
+
189
+ tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
190
+ model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16).to("cuda:0")
191
+ inputs = tokenizer("This is a long story about unicorns, fairies and magic.", return_tensors="pt").to(model.device)
192
+
193
+ past_key_values = SinkCache(window_length=256, num_sink_tokens=4)
194
+ out = model.generate(**inputs, do_sample=False, max_new_tokens=30, past_key_values=past_key_values)
195
+ tokenizer.batch_decode(out, skip_special_tokens=True)[0]
196
+ "This is a long story about unicorns, fairies and magic. It is a fantasy world where unicorns and fairies live together in harmony. The story follows a young girl named Lily"
197
+ ```
198
+
199
+ ## Speed optimized caches
200
+
201
+ The default [`DynamicCache`] prevents you from taking advantage of just-in-time (JIT) optimizations because the cache size isn't fixed. JIT optimizations let you minimize latency at the expense of memory usage. All of the following cache types are compatible with JIT optimizations like [torch.compile](./llm_optims#static-kv-cache-and-torchcompile) to accelerate generation.
202
+
203
+ ### Static cache
204
+
205
+ A [`StaticCache`] pre-allocates a specific maximum cache size for the kv pairs. You can generate up to the maximum cache size without needing to modify it.
206
+
207
+ Enable [`StaticCache`] by configuring `cache_implementation="static"` in [`~GenerationMixin.generate`].
208
+
209
+ ```py
210
+ import torch
211
+ from transformers import AutoTokenizer, AutoModelForCausalLM
212
+
213
+ tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
214
+ model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto")
215
+ inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
216
+
217
+ out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="static")
218
+ tokenizer.batch_decode(out, skip_special_tokens=True)[0]
219
+ "Hello, my name is [Your Name], and I am a [Your Profession] with [Number of Years] of"
220
+ ```
221
+
222
+ ### Offloaded static cache
223
+
224
+ The [`OffloadedStaticCache`] is very similar to the [OffloadedCache](#offloaded-cache) except that the cache size is fixed to a maximum value. Like [`OffloadedCache`], it only keeps the current layer cache on the GPU and moves the rest to the CPU.
225
+
226
+ Enable [`OffloadedStaticCache`] by configuring `cache_implementation="offloaded_static"` in [`~GenerationMixin.generate`].
227
+
228
+ ```py
229
+ import torch
230
+ from transformers import AutoTokenizer, AutoModelForCausalLM
231
+
232
+ tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
233
+ model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto")
234
+ inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
235
+
236
+ out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="offloaded_static")
237
+ tokenizer.batch_decode(out, skip_special_tokens=True)[0]
238
+ "Hello, my name is [Your Name], and I am a [Your Profession] with [Number of Years] of"
239
+ ```
240
+ Cache offloading requires a CUDA GPU.
241
+
242
+ ### Sliding window cache
243
+
244
+ [`SlidingWindowCache`] implements a sliding window over the previous kv pairs, and only keeps the last `sliding_window` tokens. This cache type is designed to only work with models that support *sliding window attention*, such as [Mistral](./model_doc/mistral). Older kv states are discarded and replaced by new kv states.
245
+
246
+ Enable [`SlidingWindowCache`] by configuring `cache_implementation="sliding_window"` in [`~GenerationMixin.generate`].
247
+
248
+ ```py
249
+ import torch
250
+ from transformers import AutoTokenizer, AutoModelForCausalLM
251
+
252
+ tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
253
+ model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16).to("cuda:0")
254
+ inputs = tokenizer("Yesterday I was on a rock concert and.", return_tensors="pt").to(model.device)
255
+
256
+ out = model.generate(**inputs, do_sample=False, max_new_tokens=30, cache_implementation="sliding_window")
257
+ tokenizer.batch_decode(out, skip_special_tokens=True)[0]
258
+ ```
259
+
260
+ ## Model caches
261
+
262
+ Some model types, like encoder-decoder models or [Gemma2](./model_doc/gemma2) and [Mamba](./model_doc/mamba), have dedicated cache classes.
263
+
264
+ ### Encoder-decoder cache
265
+
266
+ [`EncoderDecoderCache`] is designed for encoder-decoder models. It manages both the self-attention and cross-attention caches to ensure storage and retrieval of previous kv pairs. It is possible to individually set a different cache type for the encoder and decoder.
267
+
268
+ This cache type doesn't require any setup. It can be used when calling [`~GenerationMixin.generate`] or a model's `forward` method.
269
+
270
+ > [!TIP]
271
+ > The [`EncoderDecoderCache`] currently only supports [Whisper](./model_doc/whisper).
272
+
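+ The sketch below is only an illustration, not upstream documentation; it assumes the [`EncoderDecoderCache`] constructor accepts a self-attention cache and a cross-attention cache, and it uses random log-mel features in place of real audio.
+
+ ```py
+ import torch
+ from transformers import AutoProcessor, WhisperForConditionalGeneration, DynamicCache, EncoderDecoderCache
+
+ processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
+ model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
+
+ # Dummy log-mel features standing in for a real audio clip (80 mel bins x 3000 frames for whisper-tiny)
+ input_features = torch.randn(1, 80, 3000)
+
+ # One cache for decoder self-attention, one for cross-attention over the encoder states
+ past_key_values = EncoderDecoderCache(DynamicCache(), DynamicCache())
+ out = model.generate(input_features, past_key_values=past_key_values, max_new_tokens=20)
+ print(processor.batch_decode(out, skip_special_tokens=True)[0])
+ ```
+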
273
+ ### Model-specific caches
274
+
275
+ Some models have a unique way of storing past kv pairs or states that is not compatible with any other cache classes.
276
+
277
+ [Gemma2](./model_doc/gemma2) requires [`HybridCache`], which uses a combination of [`SlidingWindowCache`] for sliding window attention and [`StaticCache`] for global attention under the hood.
278
+
279
+ [Mamba](./model_doc/mamba) requires [`MambaCache`] because the model doesn't have an attention mechanism or kv states.
280
+
281
+ ## Iterative generation
282
+
283
+ A cache can also work in iterative generation settings where there is back-and-forth interaction with a model (chatbots). Like regular generation, iterative generation with a cache allows a model to efficiently handle ongoing conversations without recomputing the entire context at each step.
284
+
285
+ For iterative generation with a cache, start by initializing an empty cache class and then you can feed in your new prompts. Keep track of dialogue history with a [chat template](./chat_templating).
286
+
287
+ If you're using [`SinkCache`], the inputs need to be truncated to the maximum length because [`SinkCache`] can generate text that exceeds its maximum window size. However, the first input shouldn't exceed the maximum cache length.
288
+
289
+ The example below demonstrates how to use a cache for iterative generation.
290
+
291
+ ```py
292
+ import torch
293
+ from transformers import AutoTokenizer,AutoModelForCausalLM
294
+ from transformers.cache_utils import (
295
+ DynamicCache,
296
+ SinkCache,
297
+ StaticCache,
298
+ SlidingWindowCache,
299
+ QuantoQuantizedCache,
300
+ QuantizedCacheConfig,
301
+ )
302
+
303
+ model_id = "meta-llama/Llama-2-7b-chat-hf"
304
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map='auto')
305
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
306
+
307
+ user_prompts = ["Hello, what's your name?", "Btw, yesterday I was on a rock concert."]
308
+
309
+ past_key_values = DynamicCache()
310
+ max_cache_length = past_key_values.get_max_length()
311
+
312
+ messages = []
313
+ for prompt in user_prompts:
314
+ messages.append({"role": "user", "content": prompt})
315
+ inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
316
+ if isinstance(past_key_values, SinkCache):
317
+ inputs = {k: v[:, -max_cache_length:] for k, v in inputs.items()}
318
+ input_length = inputs["input_ids"].shape[1]
319
+ outputs = model.generate(**inputs, do_sample=False, max_new_tokens=256, past_key_values=past_key_values)
320
+ completion = tokenizer.decode(outputs[0, input_length: ], skip_special_tokens=True)
321
+ messages.append({"role": "assistant", "content": completion})
322
+ ```
323
+
324
+ ## Prefill a cache
325
+
326
+ In some situations, you may want to fill a [`Cache`] with kv pairs for a certain prefix prompt and reuse it to generate different sequences.
327
+
328
+ The example below initializes a [`StaticCache`], and then caches an initial prompt. Now you can generate several sequences from the prefilled prompt.
329
+
330
+ ```py
331
+ import copy
332
+ import torch
333
+ from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache, StaticCache
334
+
335
+ model_id = "meta-llama/Llama-2-7b-chat-hf"
336
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda")
337
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
338
+
339
+ # Init StaticCache with big enough max-length (1024 tokens for the below example)
340
+ # You can also init a DynamicCache, if that suits you better
341
+ prompt_cache = StaticCache(config=model.config, max_batch_size=1, max_cache_len=1024, device="cuda", dtype=torch.bfloat16)
342
+
343
+ INITIAL_PROMPT = "You are a helpful assistant. "
344
+ inputs_initial_prompt = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda")
345
+ # This is the common prompt cached, we need to run forward without grad to be able to copy
346
+ with torch.no_grad():
347
+ prompt_cache = model(**inputs_initial_prompt, past_key_values = prompt_cache).past_key_values
348
+
349
+ prompts = ["Help me to write a blogpost about travelling.", "What is the capital of France?"]
350
+ responses = []
351
+ for prompt in prompts:
352
+ new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
353
+ past_key_values = copy.deepcopy(prompt_cache)
354
+ outputs = model.generate(**new_inputs, past_key_values=past_key_values, max_new_tokens=20)
355
+ response = tokenizer.batch_decode(outputs)[0]
356
+ responses.append(response)
357
+
358
+ print(responses)
359
+ ```
docs/transformers/docs/source/en/llm_optims.md ADDED
@@ -0,0 +1,420 @@
1
+ <!--Copyright 2024 The HuggingFace Team. All rights reserved.
2
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
3
+ the License. You may obtain a copy of the License at
4
+ http://www.apache.org/licenses/LICENSE-2.0
5
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
6
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
7
+ specific language governing permissions and limitations under the License.
8
+ ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
9
+ rendered properly in your Markdown viewer.
10
+ -->
11
+
12
+ # Optimizing inference
13
+
14
+ Inference with large language models (LLMs) can be challenging because they have to store and handle billions of parameters. Loading a 70B parameter [Llama 2](https://hf.co/meta-llama/Llama-2-70b-hf) model requires 256GB of memory for full-precision weights and 128GB of memory for half-precision weights. The most powerful GPUs today - the A100 and H100 - only have 80GB of memory.
15
+
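+ As a back-of-the-envelope check of those figures (weights only, ignoring activations and the kv-cache), memory use is roughly the parameter count multiplied by the bytes per parameter:
+
+ ```py
+ # Rough weight-memory estimate: parameters x bytes per parameter
+ num_params = 70e9  # Llama 2 70B
+
+ for dtype, bytes_per_param in [("float32", 4), ("float16/bfloat16", 2)]:
+     gib = num_params * bytes_per_param / 1024**3
+     print(f"{dtype}: ~{gib:.0f} GiB")
+
+ # float32: ~261 GiB, float16/bfloat16: ~130 GiB -- on the order of the 256GB and 128GB quoted above
+ ```
+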
16
+ On top of the memory requirements, inference is slow because LLMs are called repeatedly to generate the next token. The input sequence grows as generation progresses, which takes longer and longer to process.
17
+
18
+ This guide will show you how to optimize LLM inference to accelerate generation and reduce memory usage.
19
+
20
+ > [!TIP]
21
+ > Try out [Text Generation Inference (TGI)](https://hf.co/docs/text-generation-inference), a Hugging Face library dedicated to deploying and serving highly optimized LLMs for inference.
22
+
23
+ ## Static kv-cache and torch.compile
24
+
25
+ LLMs compute key-value (kv) values for each input token, and they perform the same kv computation each time because the generated output becomes part of the input. Recomputing the same kv values every time is not very efficient.
26
+
27
+ A *kv-cache* stores the past keys and values instead of recomputing them each time. However, the kv-cache is dynamic and grows with each generation step, which prevents you from taking advantage of [torch.compile](./perf_torch_compile), a powerful optimization method that fuses PyTorch code into optimized kernels.
28
+
29
+ The *static kv-cache* solves this issue by pre-allocating the kv-cache size to a maximum value, so you can combine it with [torch.compile](./perf_torch_compile) for up to a 4x speed up. Your speed up may vary depending on the model size (larger models have a smaller speed up) and hardware.
30
+
31
+ > [!WARNING]
32
+ > Follow this [issue](https://github.com/huggingface/transformers/issues/28981) to track which models (Llama, Gemma, Mistral, etc.) support a static kv-cache and torch.compile.
33
+
34
+ Depending on your task, there are several ways you can use the static kv-cache.
35
+
36
+ 1. For basic use cases, set [cache_implementation](https://hf.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.cache_implementation) to `"static"` (recommended).
37
+ 2. For multi-turn generation or a custom generation loop, initialize and handle [`StaticCache`] directly.
38
+ 3. For more unique hardware or use cases, it may be better to compile the entire [`~GenerationMixin.generate`] function into a single graph.
39
+
40
+ > [!TIP]
41
+ > Regardless of how you use the static kv-cache and torch.compile, left-pad your inputs with [pad_to_multiple_of](https://hf.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__.pad_to_multiple_of) to a limited set of values to avoid shape-related recompilations.
42
+
43
+ <hfoptions id="static-kv">
44
+ <hfoption id="1. cache_implementation">
45
+
46
+ 1. Set the [cache_implementation](https://hf.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.cache_implementation) to `"static"` in a model's [`GenerationConfig`].
47
+ 2. Call [torch.compile](./perf_torch_compile) to compile the forward pass with the static kv-cache.
48
+
49
+ ```py
50
+ from transformers import AutoTokenizer, AutoModelForCausalLM
51
+ import torch
52
+ import os
53
+ os.environ["TOKENIZERS_PARALLELISM"] = "false" # To prevent long warnings :)
54
+
55
+ tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
56
+ model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", torch_dtype="auto", device_map="auto")
57
+
58
+ model.generation_config.cache_implementation = "static"
59
+
60
+ model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
61
+ input_text = "The theory of special relativity states "
62
+ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device.type)
63
+
64
+ outputs = model.generate(**input_ids)
65
+ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
66
+ ['The theory of special relativity states 1. The speed of light is constant in all inertial reference']
67
+ ```
68
+
69
+ Under the hood, [`~GenerationMixin.generate`] attempts to reuse the same cache object to avoid recompilation at each call, which is critical to get the most out of [torch.compile](./perf_torch_compile). Be aware of the following points, which can trigger recompilation or make generation slower than expected.
70
+
71
+ 1. If the batch size changes or the maximum output length increases between calls, the cache is reinitialized and recompiled.
72
+ 2. The first several calls of the compiled function are slower because it is being compiled.
73
+
74
+ </hfoption>
75
+ <hfoption id="2. StaticCache">
76
+
77
+ Directly initialize a [`StaticCache`] object and pass it to the `past_key_values` parameter in [`~GenerationMixin.generate`]. The [`StaticCache`] keeps the cache contents, so you can pass it to a new [`~GenerationMixin.generate`] call to continue generation, similar to a dynamic cache.
78
+
79
+ ```py
80
+ from transformers import AutoTokenizer, AutoModelForCausalLM, StaticCache
81
+ import torch
82
+ import os
83
+ os.environ["TOKENIZERS_PARALLELISM"] = "false" # To prevent long warnings :)
84
+
85
+ tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
86
+ model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", torch_dtype="auto", device_map="auto")
87
+
88
+ model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
89
+ input_text = "The theory of special relativity states "
90
+ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device.type)
91
+ prompt_length = input_ids.input_ids.shape[1]
92
+ model.generation_config.max_new_tokens = 16
93
+
94
+ past_key_values = StaticCache(
95
+ config=model.config,
96
+ max_batch_size=1,
97
+ # If you plan to reuse the cache, make sure the cache length is large enough for all cases
98
+ max_cache_len=prompt_length+(model.generation_config.max_new_tokens*2),
99
+ device=model.device,
100
+ dtype=model.dtype
101
+ )
102
+ outputs = model.generate(**input_ids, past_key_values=past_key_values)
103
+ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
104
+ ['The theory of special relativity states 1. The speed of light is constant in all inertial reference frames. 2']
105
+
106
+ # pass in the generated text and the same cache object to continue generation from where it left off. Optionally, in a
107
+ # multi-turn conversation, append the new user input to the generated text.
108
+ new_input_ids = outputs
109
+ outputs = model.generate(new_input_ids, past_key_values=past_key_values)
110
+ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
111
+ ['The theory of special relativity states 1. The speed of light is constant in all inertial reference frames. 2. The speed of light is constant in all inertial reference frames. 3.']
112
+ ```
113
+
114
+ > [!TIP]
115
+ > To reuse [`StaticCache`] on a new prompt, use [`~StaticCache.reset`] to reset the cache contents between calls.
116
+
117
+ Another option for using [`StaticCache`] is to pass it to a model's forward pass using the same `past_key_values` argument. This allows you to write your own custom decoding function to decode the next token given the current token, position, and cache position of previously generated tokens.
118
+
119
+ ```py
120
+ from transformers import LlamaTokenizer, LlamaForCausalLM, StaticCache, logging
121
+ from transformers.testing_utils import CaptureLogger
122
+ import torch
123
+ from accelerate.test_utils.testing import get_backend
124
+
125
+ prompts = [
126
+ "Simply put, the theory of relativity states that ",
127
+ "My favorite all time favorite condiment is ketchup.",
128
+ ]
129
+
130
+ NUM_TOKENS_TO_GENERATE = 40
131
+ torch_device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
132
+
133
+ tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", pad_token="</s>", padding_side="right")
134
+ model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="sequential")
135
+ inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
136
+
137
+ def decode_one_tokens(model, cur_token, input_pos, cache_position, past_key_values):
138
+ logits = model(
139
+ cur_token,
140
+ position_ids=input_pos,
141
+ cache_position=cache_position,
142
+ past_key_values=past_key_values,
143
+ return_dict=False,
144
+ use_cache=True
145
+ )[0]
146
+ new_token = torch.argmax(logits[:, -1], dim=-1)[:, None]
147
+ return new_token
148
+ ```
149
+
150
+ To enable static kv-cache and [torch.compile](./perf_torch_compile) with [`StaticCache`], follow the steps below.
151
+
152
+ 1. Initialize [`StaticCache`] before using the model for inference to configure parameters like the maximum batch size and sequence length.
153
+ 2. Call [torch.compile](./perf_torch_compile) on the model to compile the forward pass with the static kv-cache.
154
+ 3. Use `SDPBackend.MATH` in the [torch.nn.attention.sdpa_kernel](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html) context manager to enable the native PyTorch C++ implementation of scaled dot product attention to speed up inference even more.
155
+
156
+ ```py
157
+ from torch.nn.attention import SDPBackend, sdpa_kernel
158
+
159
+ batch_size, seq_length = inputs["input_ids"].shape
160
+ with torch.no_grad():
161
+ past_key_values = StaticCache(
162
+ config=model.config, max_batch_size=2, max_cache_len=4096, device=torch_device, dtype=model.dtype
163
+ )
164
+ cache_position = torch.arange(seq_length, device=torch_device)
165
+ generated_ids = torch.zeros(
166
+ batch_size, seq_length + NUM_TOKENS_TO_GENERATE + 1, dtype=torch.int, device=torch_device
167
+ )
168
+ generated_ids[:, cache_position] = inputs["input_ids"].to(torch_device).to(torch.int)
169
+
170
+ logits = model(
171
+ **inputs, cache_position=cache_position, past_key_values=past_key_values, return_dict=False, use_cache=True
172
+ )[0]
173
+ next_token = torch.argmax(logits[:, -1], dim=-1)[:, None]
174
+ generated_ids[:, seq_length] = next_token[:, 0]
175
+
176
+ decode_one_tokens = torch.compile(decode_one_tokens, mode="reduce-overhead", fullgraph=True)
177
+ cache_position = torch.tensor([seq_length + 1], device=torch_device)
178
+ for _ in range(1, NUM_TOKENS_TO_GENERATE):
179
+ with sdpa_kernel(SDPBackend.MATH):
180
+ next_token = decode_one_tokens(model, next_token.clone(), None, cache_position, past_key_values)
181
+ generated_ids[:, cache_position] = next_token.int()
182
+ cache_position += 1
183
+
184
+ text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
185
+ text
186
+ ['Simply put, the theory of relativity states that 1) the speed of light is constant, 2) the speed of light is the same for all observers, and 3) the laws of physics are the same for all observers.',
187
+ 'My favorite all time favorite condiment is ketchup. I love it on everything. I love it on my eggs, my fries, my chicken, my burgers, my hot dogs, my sandwiches, my salads, my p']
188
+ ```
189
+
190
+ </hfoption>
191
+ <hfoption id="3. compile entire generate function">
192
+
193
+ Compiling the entire [`~GenerationMixin.generate`] function compiles the input preparation, the logits processor operations, and more, in addition to the forward pass. With this approach, you don't need to initialize [`StaticCache`] or set the [cache_implementation](https://hf.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.cache_implementation) parameter.
194
+
195
+ ```py
196
+ from transformers import AutoTokenizer, AutoModelForCausalLM
197
+ import torch
198
+ import os
199
+ os.environ["TOKENIZERS_PARALLELISM"] = "false" # To prevent long warnings :)
200
+
201
+ tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
202
+ model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", torch_dtype="auto", device_map="auto")
203
+
204
+ model.generate = torch.compile(model.generate, mode="reduce-overhead", fullgraph=True)
205
+ input_text = "The theory of special relativity states "
206
+ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device.type)
207
+
208
+ outputs = model.generate(**input_ids)
209
+ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
210
+ ['The theory of special relativity states 1. The speed of light is constant in all inertial reference']
211
+ ```
212
+
213
+ This usage pattern is more appropriate for unique hardware or use cases, but there are several drawbacks to consider.
214
+
215
+ 1. Compilation is much slower.
216
+ 2. Parameters must be configured through [`GenerationConfig`].
217
+ 3. Many warnings and exceptions are suppressed. We recommend testing the uncompiled model first.
218
+ 4. Many features are unavailable at the moment. For example, generation does not stop if an `EOS` token is selected.
219
+
220
+ </hfoption>
221
+ </hfoptions>
222
+
223
+ ## Decoding strategies
224
+
225
+ Decoding can also be optimized to accelerate generation. You can use a lightweight assistant model to generate candidate tokens faster than the LLM itself or you can use a variant of this decoding strategy that works especially well for input-grounded tasks.
226
+
227
+ ### Speculative decoding
228
+
229
+ > [!TIP]
230
+ > For a more in-depth explanation, take a look at the [Assisted Generation: a new direction toward low-latency text generation](https://hf.co/blog/assisted-generation) blog post!
231
+
232
+ For each input token, the model weights are loaded each time during the forward pass, which is slow and cumbersome when a model has billions of parameters. Speculative decoding alleviates this slowdown by using a second smaller and faster assistant model to generate candidate tokens that are verified by the larger model in a single forward pass. If the verified tokens are correct, the LLM essentially gets them for "free" without having to generate them itself. There is no degradation in accuracy because the verification forward pass ensures the same outputs are generated as if the LLM had generated them on its own.
233
+
234
+ To get the largest speed up, the assistant model should be a lot smaller than the LLM so that it can generate tokens quickly. The assistant and LLM model must also share the same tokenizer to avoid re-encoding and decoding tokens.
235
+
236
+ > [!WARNING]
237
+ > Speculative decoding is only supported for the greedy search and sampling decoding strategies, and it doesn't support batched inputs.
238
+
239
+ Enable speculative decoding by loading an assistant model and passing it to [`~GenerationMixin.generate`].
240
+
241
+ <hfoptions id="spec-decoding">
242
+ <hfoption id="greedy search">
243
+
244
+ ```py
245
+ from transformers import AutoModelForCausalLM, AutoTokenizer
246
+ import torch
247
+ from accelerate.test_utils.testing import get_backend
248
+
249
+ device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
250
+
251
+ tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
252
+ inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(device)
253
+
254
+ model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype="auto").to(device)
255
+ assistant_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").to(device)
256
+ outputs = model.generate(**inputs, assistant_model=assistant_model)
257
+ tokenizer.batch_decode(outputs, skip_special_tokens=True)
258
+ ["Einstein's theory of relativity states that the speed of light is constant. "]
259
+ ```
260
+
261
+ </hfoption>
262
+ <hfoption id="sampling">
263
+
264
+ For speculative sampling decoding, add the [do_sample](https://hf.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig.do_sample) and [temperature](https://hf.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig.temperature) parameters to [`~GenerationMixin.generate`].
265
+
266
+ ```py
267
+ from transformers import AutoModelForCausalLM, AutoTokenizer
268
+ import torch
269
+ from accelerate.test_utils.testing import get_backend
270
+
271
+ device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
272
+
273
+ tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
274
+ inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(device)
275
+
276
+ model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype="auto").to(device)
277
+ assistant_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").to(device)
278
+ outputs = model.generate(**inputs, assistant_model=assistant_model, do_sample=True, temperature=0.7)
279
+ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
280
+ ["Einstein's theory of relativity states that motion in the universe is not a straight line.\n"]
281
+ ```
282
+
283
+ </hfoption>
284
+ </hfoptions>
285
+
286
+ ### Prompt lookup decoding
287
+
288
+ Prompt lookup decoding is a variant of speculative decoding that is also compatible with greedy search and sampling. Prompt lookup works especially well for input-grounded tasks - such as summarization - where there are often overlapping words between the prompt and the output. These overlapping n-grams are used as the LLM candidate tokens.
289
+
290
+ To enable prompt lookup decoding, specify the number of tokens that should be overlapping in the [prompt_lookup_num_tokens](https://hf.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig.prompt_lookup_num_tokens) parameter. Then pass this parameter to [`~GenerationMixin.generate`].
291
+
292
+ <hfoptions id="pld">
293
+ <hfoption id="greedy decoding">
294
+
295
+ ```py
296
+ from transformers import AutoModelForCausalLM, AutoTokenizer
297
+ import torch
298
+ from accelerate.test_utils.testing import get_backend
299
+
300
+ device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
301
+
302
+ tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
303
+ inputs = tokenizer("The second law of thermodynamics states", return_tensors="pt").to(device)
304
+
305
+ model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype="auto").to(device)
306
307
+ outputs = model.generate(**inputs, prompt_lookup_num_tokens=3)
308
+ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
309
+ ['The second law of thermodynamics states that entropy increases with temperature. ']
310
+ ```
311
+
312
+ </hfoption>
313
+ <hfoption id="sampling">
314
+
315
+ For prompt lookup decoding with sampling, add the [do_sample](https://hf.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig.do_sample) and [temperature](https://hf.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig.temperature) parameters to [`~GenerationMixin.generate`].
316
+
317
+ ```py
318
+ from transformers import AutoModelForCausalLM, AutoTokenizer
319
+ import torch
320
+ from accelerate.test_utils.testing import get_backend
321
+
322
+ device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
323
+
324
+ tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
325
+ inputs = tokenizer("The second law of thermodynamics states", return_tensors="pt").to(device)
326
+
327
+ model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype="auto").to(device)
328
+ outputs = model.generate(**inputs, prompt_lookup_num_tokens=3, do_sample=True, temperature=0.7)
329
+ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
330
+ ["The second law of thermodynamics states that energy cannot be created nor destroyed. It's not a"]
331
+ ```
332
+
333
+ </hfoption>
334
+ </hfoptions>
335
+
336
+ ## Attention
337
+
338
+ A known issue with transformer models is that the self-attention mechanism grows quadratically in compute and memory with the number of input tokens. This limitation is only magnified in LLMs, which handle much longer sequences. To address this, try FlashAttention-2 or PyTorch's scaled dot product attention (SDPA), which are more memory-efficient attention implementations.
339
+
340
+ ### FlashAttention-2
341
+
342
+ FlashAttention and [FlashAttention-2](./perf_infer_gpu_one#flashattention-2) break up the attention computation into smaller chunks and reduce the number of intermediate read/write operations to GPU memory to speed up inference. FlashAttention-2 improves on the original FlashAttention algorithm by also parallelizing over the sequence length dimension and better partitioning work on the hardware to reduce synchronization and communication overhead.
343
+
344
+ To use FlashAttention-2, set [attn_implementation](https://hf.co/docs/transformers/main/en/main_classes/text_generation#transformers.PreTrainedModel.from_pretrained.attn_implementation) to `"flash_attention_2"` in [`~PreTrainedModel.from_pretrained`].
345
+
346
+ ```py
347
+ from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+ import torch
348
+
349
+ quant_config = BitsAndBytesConfig(load_in_8bit=True)
350
+ model = AutoModelForCausalLM.from_pretrained(
351
+ "google/gemma-2b",
352
+ quantization_config=quant_config,
353
+ torch_dtype=torch.bfloat16,
354
+ attn_implementation="flash_attention_2",
355
+ )
356
+ ```
357
+
358
+ ### PyTorch scaled dot product attention
359
+
360
+ Scaled dot product attention (SDPA) is automatically enabled in PyTorch 2.0 and it supports FlashAttention, xFormers, and PyTorch's C++ implementation. SDPA chooses the most performant attention algorithm if you're using a CUDA backend. For other backends, SDPA defaults to the PyTorch C++ implementation.
361
+
362
+ > [!TIP]
363
+ > SDPA automatically supports FlashAttention-2 as long as you have the latest PyTorch version installed.
364
+
365
+ Use the [torch.nn.attention.sdpa_kernel](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html) context manager to explicitly enable or disable any of the four attention algorithms. For example, use `SDPBackend.FLASH_ATTENTION` to enable FlashAttention.
366
+
367
+ ```py
368
+ import torch
369
+ from torch.nn.attention import SDPBackend, sdpa_kernel
370
+ from transformers import AutoModelForCausalLM, AutoTokenizer
371
+
372
+ model = AutoModelForCausalLM.from_pretrained(
373
+ "google/gemma-2b",
374
+ torch_dtype=torch.bfloat16,
375
+ )
+ tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
+ inputs = tokenizer("The theory of special relativity states ", return_tensors="pt").to(model.device)
376
+
377
+ with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
378
+ outputs = model.generate(**inputs)
379
+ ```
380
+
381
+ ## Quantization
382
+
383
+ Quantization reduces the size of model weights by storing them in a lower precision. This translates to lower memory usage and makes loading LLMs for inference more accessible if you're constrained by GPU memory.
384
+
385
+ If you aren't constrained by GPU memory, you don't necessarily need to quantize your model because quantization can slightly increase latency (except for AWQ and fused AWQ modules) due to the extra step required to quantize and dequantize the weights.
386
+
387
+ > [!TIP]
388
+ > There are many quantization libraries (see the [Quantization](./quantization) guide for more details) available, such as Quanto, AQLM, VPTQ, AWQ, and AutoGPTQ. Feel free to try them out and see which one works best for your use case. We also recommend reading the [Overview of natively supported quantization schemes in 🤗 Transformers](https://hf.co/blog/overview-quantization-transformers) blog post which compares AutoGPTQ and bitsandbytes.
389
+
390
+ Use the Model Memory Calculator below to estimate and compare how much memory is required to load a model. For example, try estimating the memory required to load [Mistral-7B-v0.1](https://hf.co/mistralai/Mistral-7B-v0.1).
391
+
392
+ <iframe
393
+ src="https://hf-accelerate-model-memory-usage.hf.space"
394
+ frameborder="0"
395
+ width="850"
396
+ height="450"
397
+ ></iframe>
398
+
399
+ To load a model in half-precision, set the [torch_dtype](https://hf.co/docs/transformers/main/en/main_classes/text_generation#transformers.PreTrainedModel.from_pretrained.torch_dtype) parameter in [`~transformers.AutoModelForCausalLM.from_pretrained`] to `torch.bfloat16`. This requires 13.74GB of memory.
400
+
401
+ ```py
402
+ from transformers import AutoTokenizer, AutoModelForCausalLM
403
+ import torch
404
+
405
+ model = AutoModelForCausalLM.from_pretrained(
406
+ "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16, device_map="auto",
407
+ )
408
+ ```
409
+
410
+ To load a quantized model (8-bit or 4-bit), try [bitsandbytes](https://hf.co/docs/bitsandbytes) and set the [load_in_4bit](https://hf.co/docs/transformers/main/en/main_classes/text_generation#transformers.BitsAndBytesConfig.load_in_4bit) or [load_in_8bit](https://hf.co/docs/transformers/main/en/main_classes/text_generation#transformers.BitsAndBytesConfig.load_in_8bit) parameters to `True`. Loading the model in 8-bits only requires 6.87 GB of memory.
411
+
412
+ ```py
413
+ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
414
+ import torch
415
+
416
+ quant_config = BitsAndBytesConfig(load_in_8bit=True)
417
+ model = AutoModelForCausalLM.from_pretrained(
418
+ "mistralai/Mistral-7B-v0.1", quantization_config=quant_config, device_map="auto"
419
+ )
420
+ ```
docs/transformers/docs/source/en/llm_tutorial.md ADDED
@@ -0,0 +1,289 @@
1
+ <!--Copyright 2024 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Text generation
18
+
19
+ [[open-in-colab]]
20
+
21
+ Text generation is the most popular application for large language models (LLMs). An LLM is trained to generate the next word (token) given some initial text (prompt) along with its own previously generated outputs, and it keeps generating until it reaches a predefined length or an end-of-sequence (`EOS`) token.
22
+
23
+ In Transformers, the [`~GenerationMixin.generate`] API handles text generation, and it is available for all models with generative capabilities.
24
+
25
+ This guide will show you the basics of text generation with [`~GenerationMixin.generate`] and some common pitfalls to avoid.
26
+
27
+ ## Default generate
28
+
29
+ Before you begin, it's helpful to install [bitsandbytes](https://hf.co/docs/bitsandbytes/index) to quantize really large models to reduce their memory usage.
30
+
31
+ ```bash
32
+ !pip install -U transformers bitsandbytes
33
+ ```
34
+ Bitsandbytes supports multiple backends in addition to CUDA-based GPUs. Refer to the multi-backend installation [guide](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend) to learn more.
35
+
36
+ Load an LLM with [`~PreTrainedModel.from_pretrained`] and add the following two parameters to reduce the memory requirements.
37
+
38
+ - `device_map="auto"` enables Accelerate's [Big Model Inference](./models#big-model-inference) feature, which automatically initializes the model skeleton and loads and dispatches the model weights across all available devices, starting with the fastest device (GPU).
39
+ - `quantization_config` is a configuration object that defines the quantization settings. This example uses bitsandbytes as the quantization backend (see the [Quantization](./quantization/overview) section for more available backends) and it loads the model in [4-bit](./quantization/bitsandbytes).
40
+
41
+ ```py
42
+ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
43
+
44
+ quantization_config = BitsAndBytesConfig(load_in_4bit=True)
45
+ model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", device_map="auto", quantization_config=quantization_config)
46
+ ```
47
+
48
+ Tokenize your input, and set the [`~PreTrainedTokenizer.padding_side`] parameter to `"left"` because an LLM is not trained to continue generation from padding tokens. The tokenizer returns the input ids and attention mask.
49
+
50
+ > [!TIP]
51
+ > Process more than one prompt at a time by passing a list of strings to the tokenizer. Batch the inputs to improve throughput at a small cost to latency and memory.
52
+
53
+ ```py
54
+ tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left")
55
+ model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to("cuda")
56
+ ```
57
+
58
+ Pass the inputs to [`~GenerationMixin.generate`] to generate tokens, and [`~PreTrainedTokenizer.batch_decode`] the generated tokens back to text.
59
+
60
+ ```py
61
+ generated_ids = model.generate(**model_inputs)
62
+ tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
63
+ "A list of colors: red, blue, green, yellow, orange, purple, pink,"
64
+ ```
65
+
66
+ ## Generation configuration
67
+
68
+ All generation settings are contained in [`GenerationConfig`]. In the example above, the generation settings are derived from the `generation_config.json` file of [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1). A default decoding strategy is used when no configuration is saved with a model.
69
+
70
+ Inspect the configuration through the `generation_config` attribute. It only shows values that are different from the default configuration, in this case, the `bos_token_id` and `eos_token_id`.
71
+
72
+ ```py
73
+ from transformers import AutoModelForCausalLM
74
+
75
+ model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", device_map="auto")
76
+ model.generation_config
77
+ GenerationConfig {
78
+ "bos_token_id": 1,
79
+ "eos_token_id": 2
80
+ }
81
+ ```
82
+
83
+ You can customize [`~GenerationMixin.generate`] by overriding the parameters and values in [`GenerationConfig`]. Some of the most commonly adjusted parameters are [max_new_tokens](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.max_new_tokens), [num_beams](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.num_beams), [do_sample](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.do_sample), and [num_return_sequences](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.num_return_sequences).
84
+
85
+ ```py
86
+ # enable beam search sampling strategy
87
+ model.generate(**model_inputs, num_beams=4, do_sample=True)
88
+ ```
89
+
90
+ [`~GenerationMixin.generate`] can also be extended with external libraries or custom code. The `logits_processor` parameter accepts custom [`LogitsProcessor`] instances for manipulating the next token probability distribution. `stopping_criteria` supports custom [`StoppingCriteria`] to stop text generation. Check out the [logits-processor-zoo](https://github.com/NVIDIA/logits-processor-zoo) for more examples of external [`~GenerationMixin.generate`]-compatible extensions.
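+
+ As a minimal sketch of these extension points, reusing the `model`, `tokenizer`, and `model_inputs` from the examples above, you could ban a token id and cap generation time. The `BanTokenProcessor` class below is a hypothetical example written for this guide, while [`LogitsProcessorList`], [`StoppingCriteriaList`], and [`MaxTimeCriteria`] are existing Transformers classes.
+
+ ```py
+ from transformers import LogitsProcessor, LogitsProcessorList, StoppingCriteriaList, MaxTimeCriteria
+
+ class BanTokenProcessor(LogitsProcessor):
+     """Hypothetical processor that prevents a single token id from ever being generated."""
+     def __init__(self, token_id):
+         self.token_id = token_id
+
+     def __call__(self, input_ids, scores):
+         scores[:, self.token_id] = -float("inf")
+         return scores
+
+ generated_ids = model.generate(
+     **model_inputs,
+     logits_processor=LogitsProcessorList([BanTokenProcessor(tokenizer.eos_token_id)]),  # never emit EOS
+     stopping_criteria=StoppingCriteriaList([MaxTimeCriteria(max_time=5.0)]),  # stop after ~5 seconds
+ )
+ ```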
91
+
92
+ Refer to the [Generation strategies](./generation_strategies) guide to learn more about search, sampling, and decoding strategies.
93
+
94
+ ### Saving
95
+
96
+ Create an instance of [`GenerationConfig`] and specify the decoding parameters you want.
97
+
98
+ ```py
99
+ from transformers import AutoModelForCausalLM, GenerationConfig
100
+
101
+ model = AutoModelForCausalLM.from_pretrained("my_account/my_model")
102
+ generation_config = GenerationConfig(
103
+ max_new_tokens=50, do_sample=True, top_k=50, eos_token_id=model.config.eos_token_id
104
+ )
105
+ ```
106
+
107
+ Use [`~GenerationConfig.save_pretrained`] to save a specific generation configuration and set the `push_to_hub` parameter to `True` to upload it to the Hub.
108
+
109
+ ```py
110
+ generation_config.save_pretrained("my_account/my_model", push_to_hub=True)
111
+ ```
112
+
113
+ Leave the `config_file_name` parameter empty. This parameter should be used when storing multiple generation configurations in a single directory. It gives you a way to specify which generation configuration to load. You can create different configurations for different generative tasks (creative text generation with sampling, summarization with beam search) for use with a single model.
114
+
115
+ ```py
116
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig
117
+
118
+ tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
119
+ model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")
120
+
121
+ translation_generation_config = GenerationConfig(
122
+ num_beams=4,
123
+ early_stopping=True,
124
+ decoder_start_token_id=0,
125
+ eos_token_id=model.config.eos_token_id,
126
+ pad_token_id=model.config.pad_token_id,
127
+ )
128
+
129
+ translation_generation_config.save_pretrained("/tmp", config_file_name="translation_generation_config.json", push_to_hub=True)
130
+
131
+ generation_config = GenerationConfig.from_pretrained("/tmp", config_file_name="translation_generation_config.json")
132
+ inputs = tokenizer("translate English to French: Configuration files are easy to use!", return_tensors="pt")
133
+ outputs = model.generate(**inputs, generation_config=generation_config)
134
+ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
135
+ ```
136
+
137
+ ## Pitfalls
138
+
139
+ The section below covers some common issues you may encounter during text generation and how to solve them.
140
+
141
+ ### Output length
142
+
143
+ [`~GenerationMixin.generate`] returns up to 20 tokens by default unless otherwise specified in a model's [`GenerationConfig`]. It is highly recommended to manually set the number of generated tokens with the `max_new_tokens` parameter to control the output length. [Decoder-only](https://hf.co/learn/nlp-course/chapter1/6?fw=pt) models return the initial prompt along with the generated tokens.
144
+
145
+ ```py
146
+ model_inputs = tokenizer(["A sequence of numbers: 1, 2"], return_tensors="pt").to("cuda")
147
+ ```
148
+
149
+ <hfoptions id="output-length">
150
+ <hfoption id="default length">
151
+
152
+ ```py
153
+ generated_ids = model.generate(**model_inputs)
154
+ tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
155
+ 'A sequence of numbers: 1, 2, 3, 4, 5'
156
+ ```
157
+
158
+ </hfoption>
159
+ <hfoption id="max_new_tokens">
160
+
161
+ ```py
162
+ generated_ids = model.generate(**model_inputs, max_new_tokens=50)
163
+ tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
164
+ 'A sequence of numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,'
165
+ ```
166
+
167
+ </hfoption>
168
+ </hfoptions>
169
+
170
+ ### Decoding strategy
171
+
172
+ The default decoding strategy in [`~GenerationMixin.generate`] is *greedy search*, which selects the next most likely token, unless otherwise specified in a model's [`GenerationConfig`]. While this decoding strategy works well for input-grounded tasks (transcription, translation), it is not optimal for more creative use cases (story writing, chat applications).
173
+
174
+ For example, enable a [multinomial sampling](./generation_strategies#multinomial-sampling) strategy to generate more diverse outputs. Refer to the [Generation strategy](./generation_strategies) guide for more decoding strategies.
175
+
176
+ ```py
177
+ model_inputs = tokenizer(["I am a cat."], return_tensors="pt").to("cuda")
178
+ ```
179
+
180
+ <hfoptions id="decoding">
181
+ <hfoption id="greedy search">
182
+
183
+ ```py
184
+ generated_ids = model.generate(**model_inputs)
185
+ tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
186
+ ```
187
+
188
+ </hfoption>
189
+ <hfoption id="multinomial sampling">
190
+
191
+ ```py
192
+ generated_ids = model.generate(**model_inputs, do_sample=True)
193
+ tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
194
+ ```
195
+
196
+ </hfoption>
197
+ </hfoptions>
198
+
199
+ ### Padding side
200
+
201
+ Inputs need to be padded if they don't have the same length. But LLMs aren't trained to continue generation from padding tokens, which means the [`~PreTrainedTokenizer.padding_side`] parameter needs to be set to `"left"` so padding is added to the left of the input.
202
+
203
+ <hfoptions id="padding">
204
+ <hfoption id="right pad">
205
+
206
+ ```py
207
+ model_inputs = tokenizer(
208
+ ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt"
209
+ ).to("cuda")
210
+ generated_ids = model.generate(**model_inputs)
211
+ tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
212
+ '1, 2, 33333333333'
213
+ ```
214
+
215
+ </hfoption>
216
+ <hfoption id="left pad">
217
+
218
+ ```py
219
+ tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left")
220
+ tokenizer.pad_token = tokenizer.eos_token
221
+ model_inputs = tokenizer(
222
+ ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt"
223
+ ).to("cuda")
224
+ generated_ids = model.generate(**model_inputs)
225
+ tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
226
+ '1, 2, 3, 4, 5, 6,'
227
+ ```
228
+
229
+ </hfoption>
230
+ </hfoptions>
231
+
232
+ ### Prompt format
233
+
234
+ Some models and tasks expect a certain input prompt format, and if the format is incorrect, the model returns a suboptimal output. You can learn more about prompting in the [prompt engineering](./tasks/prompting) guide.
235
+
236
+ For example, a chat model expects the input as a [chat template](./chat_templating). Your prompt should include a `role` and `content` to indicate who is participating in the conversation. If you try to pass your prompt as a single string, the model doesn't always return the expected output.
237
+
238
+ ```py
239
+ from transformers import AutoTokenizer, AutoModelForCausalLM
240
+
241
+ tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha")
242
+ model = AutoModelForCausalLM.from_pretrained(
243
+ "HuggingFaceH4/zephyr-7b-alpha", device_map="auto", load_in_4bit=True
244
+ )
245
+ ```
246
+
247
+ <hfoptions id="format">
248
+ <hfoption id="no format">
249
+
250
+ ```py
251
+ prompt = """How many cats does it take to change a light bulb? Reply as a pirate."""
252
+ model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
253
+ input_length = model_inputs.input_ids.shape[1]
254
+ generated_ids = model.generate(**model_inputs, max_new_tokens=50)
255
+ print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0])
256
+ "Aye, matey! 'Tis a simple task for a cat with a keen eye and nimble paws. First, the cat will climb up the ladder, carefully avoiding the rickety rungs. Then, with"
257
+ ```
258
+
259
+ </hfoption>
260
+ <hfoption id="chat template">
261
+
262
+ ```py
263
+ messages = [
264
+ {
265
+ "role": "system",
266
+ "content": "You are a friendly chatbot who always responds in the style of a pirate",
267
+ },
268
+ {"role": "user", "content": "How many cats does it take to change a light bulb?"},
269
+ ]
270
+ model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")
271
+ input_length = model_inputs.shape[1]
272
+ generated_ids = model.generate(model_inputs, do_sample=True, max_new_tokens=50)
273
+ print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0])
274
+ "Arr, matey! According to me beliefs, 'twas always one cat to hold the ladder and another to climb up it an’ change the light bulb, but if yer looking to save some catnip, maybe yer can
275
+ ```
276
+
277
+ </hfoption>
278
+ </hfoptions>
279
+
280
+ ## Resources
281
+
282
+ Take a look below for some more specific and specialized text generation libraries.
283
+
284
+ - [Optimum](https://github.com/huggingface/optimum): an extension of Transformers focused on optimizing training and inference on specific hardware devices
285
+ - [Outlines](https://github.com/dottxt-ai/outlines): a library for constrained text generation (generate JSON files for example).
286
+ - [SynCode](https://github.com/uiuc-focal-lab/syncode): a library for context-free grammar guided generation (JSON, SQL, Python).
287
+ - [Text Generation Inference](https://github.com/huggingface/text-generation-inference): a production-ready server for LLMs.
288
+ - [Text generation web UI](https://github.com/oobabooga/text-generation-webui): a Gradio web UI for text generation.
289
+ - [logits-processor-zoo](https://github.com/NVIDIA/logits-processor-zoo): additional logits processors for controlling text generation.
docs/transformers/docs/source/en/llm_tutorial_optimization.md ADDED
@@ -0,0 +1,782 @@
1
+ <!--Copyright 2023 The HuggingFace Team. All rights reserved.
2
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
3
+ the License. You may obtain a copy of the License at
4
+ http://www.apache.org/licenses/LICENSE-2.0
5
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
6
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
7
+ specific language governing permissions and limitations under the License.
8
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
9
+ rendered properly in your Markdown viewer.
10
+ -->
11
+
12
+ # Optimizing LLMs for Speed and Memory
13
+
14
+ [[open-in-colab]]
15
+
16
+ Large Language Models (LLMs) such as GPT3/4, [Falcon](https://huggingface.co/tiiuae/falcon-40b), and [Llama](https://huggingface.co/meta-llama/Llama-2-70b-hf) are rapidly advancing in their ability to tackle human-centric tasks, establishing themselves as essential tools in modern knowledge-based industries.
17
+ Deploying these models in real-world tasks remains challenging, however:
18
+
19
+ - To exhibit near-human text understanding and generation capabilities, LLMs currently need to consist of billions of parameters (see [Kaplan et al.](https://arxiv.org/abs/2001.08361), [Wei et al.](https://arxiv.org/abs/2206.07682)). This consequently amplifies the memory demands for inference.
20
+ - In many real-world tasks, LLMs need to be given extensive contextual information. This necessitates the model's capability to manage very long input sequences during inference.
21
+
22
+ The crux of these challenges lies in augmenting the computational and memory capabilities of LLMs, especially when handling expansive input sequences.
23
+
24
+ In this guide, we will go over the effective techniques for efficient LLM deployment:
25
+
26
+ 1. **Lower Precision:** Research has shown that operating at reduced numerical precision, namely [8-bit and 4-bit](./main_classes/quantization.md), can achieve computational advantages without a considerable decline in model performance.
27
+
28
+ 2. **Flash Attention:** Flash Attention is a variation of the attention algorithm that not only provides a more memory-efficient approach but also realizes increased efficiency due to optimized GPU memory utilization.
29
+
30
+ 3. **Architectural Innovations:** Considering that LLMs are always deployed in the same way during inference, namely autoregressive text generation with a long input context, specialized model architectures have been proposed that allow for more efficient inference. The most important advancements in model architectures here are [Alibi](https://arxiv.org/abs/2108.12409), [Rotary embeddings](https://arxiv.org/abs/2104.09864), [Multi-Query Attention (MQA)](https://arxiv.org/abs/1911.02150), and [Grouped-Query-Attention (GQA)](https://arxiv.org/abs/2305.13245).
31
+
32
+ Throughout this guide, we will offer an analysis of auto-regressive generation from a tensor's perspective. We delve into the pros and cons of adopting lower precision, provide a comprehensive exploration of the latest attention algorithms, and discuss improved LLM architectures. While doing so, we run practical examples showcasing each of the feature improvements.
33
+
34
+ ## 1. Lower Precision
35
+
36
+ Memory requirements of LLMs can be best understood by seeing the LLM as a set of weight matrices and vectors and the text inputs as a sequence of vectors. In the following, the term *weights* will be used to refer to all model weight matrices and vectors.
37
+
38
+ At the time of writing this guide, LLMs consist of at least a couple billion parameters. Each parameter is a decimal number, e.g. `4.5689`, which is usually stored in either [float32](https://en.wikipedia.org/wiki/Single-precision_floating-point_format), [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format), or [float16](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) format. This allows us to easily compute the memory requirement to load the LLM into memory:
39
+
40
+ > *Loading the weights of a model having X billion parameters requires roughly 4 * X GB of VRAM in float32 precision*
41
+
42
+ Nowadays, however, models are rarely trained in full float32 precision, but usually in bfloat16 or, less frequently, float16 precision. Therefore the rule of thumb becomes:
43
+
44
+ > *Loading the weights of a model having X billion parameters requires roughly 2 * X GB of VRAM in bfloat16/float16 precision*
45
+
46
+ For shorter text inputs (less than 1024 tokens), the memory requirement for inference is very much dominated by the memory requirement to load the weights. Therefore, for now, let's assume that the memory requirement for inference is equal to the memory requirement to load the model into the GPU VRAM.
47
+
48
+ To give some examples of how much VRAM it roughly takes to load a model in bfloat16:
49
+
50
+ - **GPT3** requires 2 \* 175 GB = **350 GB** VRAM
51
+ - [**Bloom**](https://huggingface.co/bigscience/bloom) requires 2 \* 176 GB = **352 GB** VRAM
52
+ - [**Llama-2-70b**](https://huggingface.co/meta-llama/Llama-2-70b-hf) requires 2 \* 70 GB = **140 GB** VRAM
53
+ - [**Falcon-40b**](https://huggingface.co/tiiuae/falcon-40b) requires 2 \* 40 GB = **80 GB** VRAM
54
+ - [**MPT-30b**](https://huggingface.co/mosaicml/mpt-30b) requires 2 \* 30 GB = **60 GB** VRAM
55
+ - [**bigcode/starcoder**](https://huggingface.co/bigcode/starcoder) requires 2 \* 15.5 = **31 GB** VRAM
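+
+ The rule of thumb above is easy to turn into a small helper; the sketch below just multiplies the parameter count by the bytes per parameter, and the numbers it prints are the same rough estimates as in the list.
+
+ ```python
+ def required_vram_gb(num_params_in_billions, bytes_per_param=2):
+     # 4 bytes/param for float32, 2 for bfloat16/float16, 1 for 8-bit, 0.5 for 4-bit
+     return num_params_in_billions * bytes_per_param
+
+ print(required_vram_gb(70))       # Llama-2-70b in bfloat16 -> ~140 GB
+ print(required_vram_gb(70, 4))    # Llama-2-70b in float32  -> ~280 GB
+ print(required_vram_gb(15.5, 1))  # bigcode/starcoder in 8-bit -> ~15.5 GB
+ ```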
56
+
57
+ As of writing this document, the largest GPU chips on the market are the A100 and H100, each offering 80GB of VRAM. Most of the models listed above require more than 80GB just to be loaded and therefore necessarily require [tensor parallelism](https://huggingface.co/docs/transformers/perf_train_gpu_many#tensor-parallelism) and/or [pipeline parallelism](https://huggingface.co/docs/transformers/perf_train_gpu_many#naive-model-parallelism-vertical-and-pipeline-parallelism).
58
+
59
+ 🤗 Transformers now supports tensor parallelism for supported models having `base_tp_plan` in their respective config classes. Learn more about Tensor Parallelism [here](perf_train_gpu_many#tensor-parallelism). Furthermore, if you're interested in writing models in a tensor-parallelism-friendly way, feel free to have a look at [the text-generation-inference library](https://github.com/huggingface/text-generation-inference/tree/main/server/text_generation_server/models/custom_modeling).
60
+
61
+ Naive pipeline parallelism is supported out of the box. For this, simply load the model with `device_map="auto"`, which will automatically place the different layers on the available GPUs as explained [here](https://huggingface.co/docs/accelerate/v0.22.0/en/concept_guides/big_model_inference).
62
+ Note, however, that while very effective, this naive pipeline parallelism does not tackle the issue of GPU idling. For this, more advanced pipeline parallelism is required, as explained [here](https://huggingface.co/docs/transformers/en/perf_train_gpu_many#naive-model-parallelism-vertical-and-pipeline-parallelism).
63
+
64
+ If you have access to an 8 x 80GB A100 node, you could load BLOOM as follows
65
+
66
+ ```bash
67
+ !pip install transformers accelerate bitsandbytes optimum
68
+ ```
69
+ ```python
70
+ from transformers import AutoModelForCausalLM
71
+
72
+ model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", device_map="auto", pad_token_id=0)
73
+ ```
74
+
75
+ By using `device_map="auto"` the attention layers would be equally distributed over all available GPUs.
76
+
77
+ In this guide, we will use [bigcode/octocoder](https://huggingface.co/bigcode/octocoder) as it can be run on a single 40 GB A100 GPU. Note that all memory and speed optimizations that we will apply going forward are equally applicable to models that require model or tensor parallelism.
78
+
79
+ Since the model is loaded in bfloat16 precision, using our rule of thumb above, we would expect the memory requirement to run inference with `bigcode/octocoder` to be around 31 GB VRAM. Let's give it a try.
80
+
81
+ We first load the model and tokenizer and then pass both to Transformers' [pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines) object.
82
+
83
+ ```python
84
+ from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
85
+ import torch
86
+
87
+ model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", torch_dtype=torch.bfloat16, device_map="auto", pad_token_id=0)
88
+ tokenizer = AutoTokenizer.from_pretrained("bigcode/octocoder")
89
+
90
+ pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
91
+ ```
92
+
93
+ ```python
94
+ prompt = "Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer:"
95
+
96
+ result = pipe(prompt, max_new_tokens=60)[0]["generated_text"][len(prompt):]
97
+ result
98
+ ```
99
+
100
+ **Output**:
101
+ ```
102
+ Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single
103
+ ```
104
+
105
+ Nice, we can now directly use the result to convert bytes into Gigabytes.
106
+
107
+ ```python
108
+ def bytes_to_giga_bytes(bytes):
109
+ return bytes / 1024 / 1024 / 1024
110
+ ```
111
+
112
+ Let's call [`torch.cuda.max_memory_allocated`](https://pytorch.org/docs/stable/generated/torch.cuda.max_memory_allocated.html) to measure the peak GPU memory allocation.
113
+
114
+ ```python
115
+ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
116
+ ```
117
+
118
+ **Output**:
119
+ ```bash
120
+ 29.0260648727417
121
+ ```
122
+
123
+ Close enough to our back-of-the-envelope computation! We can see the number is not exactly correct as going from bytes to kilobytes requires a multiplication of 1024 instead of 1000. Therefore the back-of-the-envelope formula can also be understood as an "at most X GB" computation.
124
+ Note that if we had tried to run the model in full float32 precision, a whopping 64 GB of VRAM would have been required.
125
+
126
+ > Almost all models are trained in bfloat16 nowadays; there is no reason to run the model in full float32 precision if [your GPU supports bfloat16](https://discuss.pytorch.org/t/bfloat16-native-support/117155/5). Float32 won't give better inference results than the precision that was used to train the model.
127
+
128
+ If you are unsure in which format the model weights are stored on the Hub, you can always look into the checkpoint's config under `"torch_dtype"`, *e.g.* [here](https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9/config.json#L21). It is recommended to set the model to the same precision type as written in the config when loading with `from_pretrained(..., torch_dtype=...)`, except when the original type is float32, in which case one can use either `float16` or `bfloat16` for inference.
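+
+ You can also read the stored dtype programmatically instead of opening the config file in a browser. A small sketch (the checkpoint is the same Llama 2 repository linked above, so access may require accepting its license):
+
+ ```python
+ from transformers import AutoConfig
+
+ config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
+ print(config.torch_dtype)  # torch.float16 for this checkpoint
+ ```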
129
+
130
+
131
+ Let's define a `flush(...)` function to free all allocated memory so that we can accurately measure the peak allocated GPU memory.
132
+
133
+ ```python
134
+ del pipe
135
+ del model
136
+
137
+ import gc
138
+ import torch
139
+
140
+ def flush():
141
+ gc.collect()
142
+ torch.cuda.empty_cache()
143
+ torch.cuda.reset_peak_memory_stats()
144
+ ```
145
+
146
+ Let's call it now for the next experiment.
147
+
148
+ ```python
149
+ flush()
150
+ ```
151
+ From the Accelerate library, you can also use a device-agnostic utility method called [release_memory](https://github.com/huggingface/accelerate/blob/29be4788629b772a3b722076e433b5b3b5c85da3/src/accelerate/utils/memory.py#L63), which takes various hardware backends like XPU, MLU, NPU, MPS, and more into account.
152
+
153
+ ```python
154
+ from accelerate.utils import release_memory
155
+ # ...
156
+
157
+ release_memory(model)
158
+ ```
159
+
160
+ Now what if your GPU does not have 32 GB of VRAM? It has been found that model weights can be quantized to 8-bit or 4-bit without a significant loss in performance (see [Dettmers et al.](https://arxiv.org/abs/2208.07339)).
161
+ Models can even be quantized to 3 or 2 bits with an acceptable loss in performance, as shown in the recent [GPTQ paper](https://arxiv.org/abs/2210.17323) 🤯.
162
+
163
+ Without going into too many details, quantization schemes aim at reducing the precision of weights while trying to keep the model's inference results as accurate as possible (*a.k.a* as close as possible to bfloat16).
164
+ Note that quantization works especially well for text generation since all we care about is choosing the *set of most likely next tokens* and we don't really care about the exact values of the next token *logit* distribution.
165
+ All that matters is that the next token *logit* distribution stays roughly the same so that an `argmax` or `topk` operation gives the same results.
166
+
167
+ There are various quantization techniques, which we won't discuss in detail here, but in general, all quantization techniques work as follows:
168
+
169
+ - 1. Quantize all weights to the target precision
170
+ - 2. Load the quantized weights, and pass the input sequence of vectors in bfloat16 precision
171
+ - 3. Dynamically dequantize weights to bfloat16 to perform the computation with their input vectors in bfloat16 precision
172
+
173
+ In a nutshell, this means that *inputs-weight matrix* multiplications, with \\( X \\) being the *inputs*, \\( W \\) being a weight matrix and \\( Y \\) being the output:
174
+
175
+ $$ Y = X * W $$
176
+
177
+ are changed to
178
+
179
+ $$ Y = X * \text{dequantize}(W) $$
180
+
181
+ for every matrix multiplication. Dequantization and re-quantization are performed sequentially for all weight matrices as the inputs run through the network graph.
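+
+ To make the dequantize step concrete, here is a toy sketch of symmetric 8-bit weight quantization and the corresponding dequantized matmul. Real backends such as bitsandbytes use more sophisticated schemes and fused kernels, so this is only meant to illustrate the idea.
+
+ ```python
+ import torch
+
+ torch.manual_seed(0)
+ X = torch.randn(4, 16, dtype=torch.bfloat16)  # inputs stay in bfloat16
+ W = torch.randn(16, 8, dtype=torch.bfloat16)  # original weight matrix
+
+ # 1. quantize: store W as int8 plus a single bfloat16 scale
+ scale = W.abs().max() / 127
+ W_int8 = torch.round(W / scale).to(torch.int8)
+
+ # 2. + 3. at inference time, dequantize back to bfloat16 and run the matmul
+ W_dequant = W_int8.to(torch.bfloat16) * scale
+ Y = X @ W_dequant
+
+ print((Y - X @ W).abs().max())  # small quantization error vs. the full-precision result
+ ```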
182
+
183
+ Therefore, inference time is often **not** reduced when using quantized weights, but rather increases.
184
+ Enough theory, let's give it a try! To quantize the weights with Transformers, you need to make sure that
185
+ the [`bitsandbytes`](https://github.com/bitsandbytes-foundation/bitsandbytes) library is installed.
186
+
187
+ ```bash
188
+ !pip install bitsandbytes
189
+ ```
190
+
191
+ We can then load models in 8-bit quantization by simply adding a `load_in_8bit=True` flag to `from_pretrained`.
192
+
193
+ ```python
194
+ model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", load_in_8bit=True, pad_token_id=0)
195
+ ```
196
+
197
+ Now, let's run our example again and measure the memory usage.
198
+
199
+ ```python
200
+ pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
201
+
202
+ result = pipe(prompt, max_new_tokens=60)[0]["generated_text"][len(prompt):]
203
+ result
204
+ ```
205
+
206
+ **Output**:
207
+ ```
208
+ Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single
209
+ ```
210
+
211
+ Nice, we're getting the same result as before, so no loss in accuracy! Let's look at how much memory was used this time.
212
+
213
+ ```python
214
+ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
215
+ ```
216
+
217
+ **Output**:
218
+ ```
219
+ 15.219234466552734
220
+ ```
221
+
222
+ Significantly less! We're down to just a bit over 15 GBs and could therefore run this model on consumer GPUs like the 4090.
223
+ We're seeing a very nice gain in memory efficiency and more or less no degradation to the model's output. However, we can also notice a slight slow-down during inference.
224
+
225
+
226
+ We delete the models and flush the memory again.
227
+ ```python
228
+ del model
229
+ del pipe
230
+ ```
231
+
232
+ ```python
233
+ flush()
234
+ ```
235
+
236
+ Let's see what peak GPU memory consumption 4-bit quantization gives. Quantizing the model to 4-bit can be done with the same API as before - this time by passing `load_in_4bit=True` instead of `load_in_8bit=True`.
237
+
238
+ ```python
239
+ model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", load_in_4bit=True, low_cpu_mem_usage=True, pad_token_id=0)
240
+
241
+ pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
242
+
243
+ result = pipe(prompt, max_new_tokens=60)[0]["generated_text"][len(prompt):]
244
+ result
245
+ ```
246
+
247
+ **Output**:
248
+ ```
249
+ Here is a Python function that transforms bytes to Giga bytes:\n\n```\ndef bytes_to_gigabytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single argument
250
+ ```
251
+
252
+ We're seeing almost the same output text as before - only the `python` just before the code snippet is missing. Let's see how much memory was required.
253
+
254
+ ```python
255
+ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
256
+ ```
257
+
258
+ **Output**:
259
+ ```
260
+ 9.543574333190918
261
+ ```
262
+
263
+ Just 9.5GB! That's really not a lot for a >15 billion parameter model.
264
+
265
+ While we see very little degradation in accuracy for our model here, 4-bit quantization can in practice often lead to different results compared to 8-bit quantization or full `bfloat16` inference. It is up to the user to try it out.
266
+
267
+ Also note that inference here was again a bit slower compared to 8-bit quantization which is due to the more aggressive quantization method used for 4-bit quantization leading to \\( \text{quantize} \\) and \\( \text{dequantize} \\) taking longer during inference.
268
+
269
+ ```python
270
+ del model
271
+ del pipe
272
+ ```
273
+ ```python
274
+ flush()
275
+ ```
276
+
277
+ Overall, we saw that running OctoCoder in 8-bit precision reduced the required GPU VRAM from 32GB to only 15GB, and running the model in 4-bit precision further reduces the required GPU VRAM to just a bit over 9GB.
278
+
279
+ 4-bit quantization allows the model to be run on GPUs such as RTX3090, V100, and T4 which are quite accessible for most people.
280
+
281
+ For more information on quantization and to see how one can quantize models to require even less GPU VRAM than 4-bit, we recommend looking into the [`AutoGPTQ`](https://huggingface.co/docs/transformers/main/en/main_classes/quantization#autogptq-integration) implementation.
282
+
283
+ > As a conclusion, it is important to remember that model quantization trades improved memory efficiency against accuracy and in some cases inference time.
284
+
285
+ If GPU memory is not a constraint for your use case, there is often no need to look into quantization. However, many GPUs simply can't run LLMs without quantization methods and, in this case, 4-bit and 8-bit quantization schemes are extremely useful tools.
286
+
287
+ For more in-detail usage information, we strongly recommend taking a look at the [Transformers Quantization Docs](https://huggingface.co/docs/transformers/main_classes/quantization#general-usage).
288
+ Next, let's look into how we can improve computational and memory efficiency by using better algorithms and an improved model architecture.
289
+
290
+ ## 2. Flash Attention
291
+
292
+ Today's top-performing LLMs share more or less the same fundamental architecture that consists of feed-forward layers, activation layers, layer normalization layers, and most crucially, self-attention layers.
293
+
294
+ Self-attention layers are central to Large Language Models (LLMs) in that they enable the model to understand the contextual relationships between input tokens.
295
+ However, self-attention layers grow *quadratically* both in compute and peak GPU memory consumption with the number of input tokens (also called *sequence length*), which we denote in the following by \\( N \\) .
296
+ While this is not really noticeable for shorter input sequences (of up to 1000 input tokens), it becomes a serious problem for longer input sequences (at around 16000 input tokens).
297
+
298
+ Let's take a closer look. The formula to compute the output \\( \mathbf{O} \\) of a self-attention layer for an input \\( \mathbf{X} \\) of length \\( N \\) is:
299
+
300
+ $$ \textbf{O} = \text{Attn}(\mathbf{X}) = \mathbf{V} \times \text{Softmax}(\mathbf{QK}^T) \text{ with } \mathbf{Q} = \mathbf{W}_q \mathbf{X}, \mathbf{V} = \mathbf{W}_v \mathbf{X}, \mathbf{K} = \mathbf{W}_k \mathbf{X} $$
301
+
302
+ \\( \mathbf{X} = (\mathbf{x}_1, ... \mathbf{x}_{N}) \\) is thereby the input sequence to the attention layer. The projections \\( \mathbf{Q} \\) and \\( \mathbf{K} \\) will each consist of \\( N \\) vectors resulting in the \\( \mathbf{QK}^T \\) being of size \\( N^2 \\) .
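+
+ For reference, a minimal sketch of this default computation for a single head (using the row-vector convention and omitting the usual \\( 1/\sqrt{d} \\) scaling and causal mask) shows where the memory problem comes from: the full \\( N \times N \\) score matrix is materialized explicitly.
+
+ ```python
+ import torch
+
+ def vanilla_self_attention(X, W_q, W_k, W_v):
+     # X: (N, d) input sequence, W_q/W_k/W_v: (d, d) projection weights
+     Q, K, V = X @ W_q, X @ W_k, X @ W_v
+     scores = Q @ K.T                        # (N, N): grows quadratically with N
+     return torch.softmax(scores, dim=-1) @ V
+ ```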
303
+
304
+ LLMs usually have multiple attention heads, thus doing multiple self-attention computations in parallel.
305
+ Assuming the LLM has 40 attention heads and runs in bfloat16 precision, we can calculate the memory requirement to store the \\( \mathbf{QK^T} \\) matrices to be \\( 40 * 2 * N^2 \\) bytes. For \\( N=1000 \\) only around 80 MB of VRAM are needed, however, for \\( N=16000 \\) we would need 19 GB of VRAM, and for \\( N=100,000 \\) we would need almost 1TB just to store the \\( \mathbf{QK}^T \\) matrices.
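+
+ These back-of-the-envelope numbers can be reproduced directly, assuming 40 heads and 2 bytes per bfloat16 value:
+
+ ```python
+ n_head, bytes_per_value = 40, 2
+
+ for N in (1_000, 16_000, 100_000):
+     qk_bytes = n_head * bytes_per_value * N**2  # one (N, N) score matrix per head
+     print(f"N={N:>7}: {qk_bytes / 1024**3:.2f} GB")  # ~0.07 GB, ~19 GB, ~745 GB
+ ```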
306
+
307
+ Long story short, the default self-attention algorithm quickly becomes prohibitively memory-expensive for large input contexts.
308
+
309
+ As LLMs improve in text comprehension and generation, they are applied to increasingly complex tasks. While models once handled the translation or summarization of a few sentences, they now manage entire pages, demanding the capability to process extensive input lengths.
310
+
311
+ How can we get rid of the exorbitant memory requirements for large input lengths? We need a new way to compute the self-attention mechanism that gets rid of the \\( QK^T \\) matrix. [Tri Dao et al.](https://arxiv.org/abs/2205.14135) developed exactly such a new algorithm and called it **Flash Attention**.
312
+
313
+ In a nutshell, Flash Attention breaks the \\( \mathbf{V} \times \text{Softmax}(\mathbf{QK}^T) \\) computation apart and instead computes smaller chunks of the output by iterating over multiple softmax computation steps:
314
+
315
+ $$ \textbf{O}_i \leftarrow s^a_{ij} * \textbf{O}_i + s^b_{ij} * \mathbf{V}_{j} \times \text{Softmax}(\mathbf{QK}^T_{i,j}) \text{ for multiple } i, j \text{ iterations} $$
316
+
317
+ with \\( s^a_{ij} \\) and \\( s^b_{ij} \\) being some softmax normalization statistics that need to be recomputed for every \\( i \\) and \\( j \\) .
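+
+ The following toy sketch illustrates the underlying idea for a single output row: iterate over key-value chunks while keeping running softmax statistics (a running maximum `m` and normalizer `l`). It is a plain PyTorch illustration of the principle, not the fused CUDA kernel that makes Flash Attention fast in practice.
+
+ ```python
+ import torch
+
+ def attention_one_row(q, K, V, chunk_size=128):
+     # q: (d,) one query vector, K/V: (N, d); iterate over key-value chunks
+     m = torch.tensor(float("-inf"))  # running max of the scores seen so far
+     l = torch.tensor(0.0)            # running softmax normalizer
+     o = torch.zeros(V.shape[-1])     # running (unnormalized) output row
+     for j in range(0, K.shape[0], chunk_size):
+         scores = K[j:j + chunk_size] @ q
+         m_new = torch.maximum(m, scores.max())
+         alpha = torch.exp(m - m_new)             # rescales the old accumulators
+         p = torch.exp(scores - m_new)
+         l = alpha * l + p.sum()
+         o = alpha * o + p @ V[j:j + chunk_size]
+         m = m_new
+     return o / l  # matches torch.softmax(K @ q, dim=0) @ V up to floating point error
+ ```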
318
+
319
+ Please note that the whole Flash Attention is a bit more complex and is greatly simplified here as going in too much depth is out of scope for this guide. The reader is invited to take a look at the well-written [Flash Attention paper](https://arxiv.org/abs/2205.14135) for more details.
320
+
321
+ The main takeaway here is:
322
+
323
+ > By keeping track of softmax normalization statistics and by using some smart mathematics, Flash Attention gives **numerically identical** outputs compared to the default self-attention layer at a memory cost that only increases linearly with \\( N \\) .
324
+
325
+ Looking at the formula, one would intuitively say that Flash Attention must be much slower compared to the default self-attention formula as more computation needs to be done. Indeed, Flash Attention requires more FLOPs compared to normal attention as the softmax normalization statistics have to constantly be recomputed (see the [paper](https://arxiv.org/abs/2205.14135) for more details if interested).
326
+
327
+ > However, Flash Attention is much faster in inference compared to default attention which comes from its ability to significantly reduce the demands on the slower, high-bandwidth memory of the GPU (VRAM), focusing instead on the faster on-chip memory (SRAM).
328
+
329
+ Essentially, Flash Attention makes sure that all intermediate write and read operations can be done using the fast *on-chip* SRAM memory instead of having to access the slower VRAM memory to compute the output vector \\( \mathbf{O} \\) .
330
+
331
+ In practice, there is currently absolutely no reason to **not** use Flash Attention if available. The algorithm gives mathematically the same outputs, and is both faster and more memory-efficient.
332
+
333
+ Let's look at a practical example.
334
+
335
+ Our OctoCoder model now gets a significantly longer input prompt which includes a so-called *system prompt*. System prompts are used to steer the LLM into a better assistant that is tailored to the users' task.
336
+ In the following, we use a system prompt that will make OctoCoder a better coding assistant.
337
+
338
+ ```python
339
+ system_prompt = """Below are a series of dialogues between various people and an AI technical assistant.
340
+ The assistant tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble but knowledgeable.
341
+ The assistant is happy to help with code questions and will do their best to understand exactly what is needed.
342
+ It also tries to avoid giving false or misleading information, and it caveats when it isn't entirely sure about the right answer.
343
+ That said, the assistant is practical really does its best, and doesn't let caution get too much in the way of being useful.
344
+
345
+ The Starcoder models are a series of 15.5B parameter models trained on 80+ programming languages from The Stack (v1.2) (excluding opt-out requests).
346
+ The model uses Multi Query Attention, was trained using the Fill-in-the-Middle objective, and with 8,192 tokens context window for a trillion tokens of heavily deduplicated data.
347
+
348
+ -----
349
+
350
+ Question: Write a function that takes two lists and returns a list that has alternating elements from each input list.
351
+
352
+ Answer: Sure. Here is a function that does that.
353
+
354
+ def alternating(list1, list2):
355
+     results = []
356
+     for i in range(len(list1)):
357
+         results.append(list1[i])
358
+         results.append(list2[i])
359
+     return results
360
+
361
+ Question: Can you write some test cases for this function?
362
+
363
+ Answer: Sure, here are some tests.
364
+
365
+ assert alternating([10, 20, 30], [1, 2, 3]) == [10, 1, 20, 2, 30, 3]
366
+ assert alternating([True, False], [4, 5]) == [True, 4, False, 5]
367
+ assert alternating([], []) == []
368
+
369
+ Question: Modify the function so that it returns all input elements when the lists have uneven length. The elements from the longer list should be at the end.
370
+
371
+ Answer: Here is the modified function.
372
+
373
+ def alternating(list1, list2):
374
+     results = []
375
+     for i in range(min(len(list1), len(list2))):
376
+         results.append(list1[i])
377
+         results.append(list2[i])
378
+     if len(list1) > len(list2):
379
+         results.extend(list1[i+1:])
380
+     else:
381
+         results.extend(list2[i+1:])
382
+     return results
383
+
384
+ -----
385
+ """
386
+ ```
387
+ For demonstration purposes, we duplicate the system prompt ten times so that the input length is long enough to observe Flash Attention's memory savings.
388
+ We then append the original text prompt `"Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer: Here"`.
389
+
390
+ ```python
391
+ long_prompt = 10 * system_prompt + prompt
392
+ ```
393
+
394
+ We instantiate our model again in bfloat16 precision.
395
+
396
+ ```python
397
+ model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", torch_dtype=torch.bfloat16, device_map="auto")
398
+ tokenizer = AutoTokenizer.from_pretrained("bigcode/octocoder")
399
+
400
+ pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
401
+ ```
402
+
403
+ Let's now run the model just like before *without Flash Attention* and measure the peak GPU memory requirement and inference time.
404
+
405
+ ```python
406
+ import time
407
+
408
+ start_time = time.time()
409
+ result = pipe(long_prompt, max_new_tokens=60)[0]["generated_text"][len(long_prompt):]
410
+
411
+ print(f"Generated in {time.time() - start_time} seconds.")
412
+ result
413
+ ```
414
+
415
+ **Output**:
416
+ ```
417
+ Generated in 10.96854019165039 seconds.
418
+ Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef
419
+ ```
420
+
421
+ We're getting the same output as before, however this time the model repeats the answer multiple times until it hits the 60-token cut-off. This is not surprising as we've repeated the system prompt ten times for demonstration purposes and thus cued the model to repeat itself.
422
+
423
+ **Note** that the system prompt should not be repeated ten times in real-world applications - one time is enough!
424
+
425
+ Let's measure the peak GPU memory requirement.
426
+
427
+ ```python
428
+ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
429
+ ```
430
+
431
+ **Output**:
432
+ ```
433
+ 37.668193340301514
434
+ ```
435
+
436
+ As we can see the peak GPU memory requirement is now significantly higher than in the beginning, which is largely due to the longer input sequence. Generation also takes roughly eleven seconds now.
437
+
438
+ We call `flush()` to free GPU memory for our next experiment.
439
+
440
+ ```python
441
+ flush()
442
+ ```
443
+
444
+ For comparison, let's run the same function, but enable Flash Attention instead.
445
+ To do so, we convert the model to [BetterTransformer](https://huggingface.co/docs/optimum/bettertransformer/overview), thereby enabling PyTorch's [SDPA self-attention](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention), which in turn is able to use Flash Attention.
446
+
447
+ ```python
448
+ model.to_bettertransformer()
449
+ ```
450
+
451
+ Now we run the exact same code snippet as before and under the hood Transformers will make use of Flash Attention.
452
+
453
+ ```py
454
+ start_time = time.time()
455
+ with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
456
+     result = pipe(long_prompt, max_new_tokens=60)[0]["generated_text"][len(long_prompt):]
457
+
458
+ print(f"Generated in {time.time() - start_time} seconds.")
459
+ result
460
+ ```
461
+
462
+ **Output**:
463
+ ```
464
+ Generated in 3.0211617946624756 seconds.
465
+ Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef
466
+ ```
467
+
468
+ We're getting the exact same result as before, but can observe a very significant speed-up thanks to Flash Attention.
469
+
470
+ Let's measure the memory consumption one last time.
471
+
472
+ ```python
473
+ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
474
+ ```
475
+
476
+ **Output**:
477
+ ```
478
+ 32.617331981658936
479
+ ```
480
+
481
+ And we're almost back to our original 29GB peak GPU memory from the beginning.
482
+
483
+ We can observe that we only use roughly 100MB more GPU memory when passing a very long input sequence with Flash Attention compared to passing a short input sequence as done in the beginning.
484
+
485
+ ```py
486
+ flush()
487
+ ```
488
+
489
+ For more information on how to use Flash Attention, please have a look at [this doc page](https://huggingface.co/docs/transformers/en/perf_infer_gpu_one#flashattention-2).
490
+
491
+ ## 3. Architectural Innovations
492
+
493
+ So far we have looked into improving computational and memory efficiency by:
494
+
495
+ - Casting the weights to a lower precision format
496
+ - Replacing the self-attention algorithm with a more memory- and compute efficient version
497
+
498
+ Let's now look into how we can change the architecture of an LLM so that it is most effective and efficient for tasks that require long text inputs, *e.g.*:
499
+ - Retrieval-augmented question answering,
500
+ - Summarization,
501
+ - Chat
502
+
503
+ Note that *chat* not only requires the LLM to handle long text inputs, but it also necessitates that the LLM is able to efficiently handle the back-and-forth dialogue between user and assistant (such as ChatGPT).
504
+
505
+ Once trained, the fundamental LLM architecture is difficult to change, so it is important to consider the LLM's intended tasks beforehand and optimize the model's architecture accordingly.
506
+ There are two important components of the model architecture that quickly become memory and/or performance bottlenecks for large input sequences.
507
+
508
+ - The positional embeddings
509
+ - The key-value cache
510
+
511
+ Let's go over each component in more detail.
512
+
513
+ ### 3.1 Improving positional embeddings of LLMs
514
+
515
+ Self-attention puts each token in relation to every other token.
516
+ As an example, the \\( \text{Softmax}(\mathbf{QK}^T) \\) matrix of the text input sequence *"Hello", "I", "love", "you"* could look as follows:
517
+
518
+ ![](/blog/assets/163_optimize_llm/self_attn_tokens.png)
519
+
520
+ Each word token is given a probability mass with which it attends to all other word tokens and is therefore put into relation with them. E.g. the word *"love"* attends to the word *"Hello"* with 5%, to *"I"* with 30%, and to itself with 65%.
521
+
522
+ An LLM based on self-attention, but without position embeddings, would have great difficulties in understanding the positions of the text inputs relative to each other.
523
+ This is because the probability score computed by \\( \mathbf{QK}^T \\) relates each word token to each other word token in \\( O(1) \\) computations regardless of their relative positional distance to each other.
524
+ Therefore, for the LLM without position embeddings each token appears to have the same distance to all other tokens, *e.g.* differentiating between *"Hello I love you"* and *"You love I hello"* would be very challenging.
525
+
526
+ For the LLM to understand sentence order, an additional *cue* is needed and is usually applied in the form of *positional encodings* (also called *positional embeddings*).
527
+ Positional encodings encode the position of each token into a numerical representation that the LLM can leverage to better understand sentence order.
528
+
529
+ The authors of the [*Attention Is All You Need*](https://arxiv.org/abs/1706.03762) paper introduced sinusoidal positional embeddings \\( \mathbf{P} = \mathbf{p}_1, \ldots, \mathbf{p}_N \\),
530
+ where each vector \\( \mathbf{p}_i \\) is computed as a sinusoidal function of its position \\( i \\) .
531
+ The positional encodings are then simply added to the input sequence vectors \\( \mathbf{\hat{X}} = \mathbf{\hat{x}}_1, \ldots, \mathbf{\hat{x}}_N = \mathbf{x}_1 + \mathbf{p}_1, \ldots, \mathbf{x}_N + \mathbf{p}_N \\) thereby cueing the model to better learn sentence order.
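+
+ For illustration, a minimal sketch of such sinusoidal embeddings (the classic sine/cosine interleaving at geometrically spaced frequencies) could look as follows; the helper name is made up for this example.
+
+ ```python
+ import math
+ import torch
+
+ def sinusoidal_embeddings(num_positions, dim):
+     # each p_i alternates sin/cos of the position i at different frequencies
+     positions = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
+     freqs = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
+     p = torch.zeros(num_positions, dim)
+     p[:, 0::2] = torch.sin(positions * freqs)
+     p[:, 1::2] = torch.cos(positions * freqs)
+     return p
+
+ # the encodings are simply added to the input embeddings: x_hat_i = x_i + p_i
+ X_hat = torch.randn(128, 512) + sinusoidal_embeddings(128, 512)
+ ```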
532
+
533
+ Instead of using fixed position embeddings, others (such as [Devlin et al.](https://arxiv.org/abs/1810.04805)) used learned positional encodings for which the positional embeddings
534
+ \\( \mathbf{P} \\) are learned during training.
535
+
536
+ Sinusoidal and learned position embeddings used to be the predominant methods to encode sentence order into LLMs, but a couple of problems related to these positional encodings were found:
537
+
538
+ 1. Sinusoidal and learned position embeddings are both absolute positional embeddings, *i.e.* encoding a unique embedding for each position id: \\( 0, \ldots, N \\) . As shown by [Huang et al.](https://arxiv.org/abs/2009.13658) and [Su et al.](https://arxiv.org/abs/2104.09864), absolute positional embeddings lead to poor LLM performance for long text inputs. For long text inputs, it is advantageous if the model learns the relative positional distance input tokens have to each other instead of their absolute position.
539
+ 2. When using learned position embeddings, the LLM has to be trained on a fixed input length \\( N \\), which makes it difficult to extrapolate to an input length longer than what it was trained on.
540
+
541
+ Recently, relative positional embeddings that can tackle the above mentioned problems have become more popular, most notably:
542
+
543
+ - [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864)
544
+ - [ALiBi](https://arxiv.org/abs/2108.12409)
545
+
546
+ Both *RoPE* and *ALiBi* argue that it's best to cue the LLM about sentence order directly in the self-attention algorithm as it's there that word tokens are put into relation with each other. More specifically, sentence order should be cued by modifying the \\( \mathbf{QK}^T \\) computation.
547
+
548
+ Without going into too many details, *RoPE* notes that positional information can be encoded into query-key pairs, *e.g.* \\( \mathbf{q}_i \\) and \\( \mathbf{x}_j \\), by rotating each vector by an angle \\( \theta * i \\) and \\( \theta * j \\) respectively, with \\( i, j \\) describing each vector's sentence position:
549
+
550
+ $$ \mathbf{\hat{q}}_i^T \mathbf{\hat{x}}_j = \mathbf{{q}}_i^T \mathbf{R}_{\theta, i -j} \mathbf{{x}}_j. $$
551
+
552
+ \\( \mathbf{R}_{\theta, i - j} \\) thereby represents a rotational matrix. \\( \theta \\) is *not* learned during training, but instead set to a pre-defined value that depends on the maximum input sequence length during training.
553
+
554
+ > By doing so, the probability score between \\( \mathbf{q}_i \\) and \\( \mathbf{q}_j \\) is only affected if \\( i \ne j \\) and solely depends on the relative distance \\( i - j \\) regardless of each vector's specific positions \\( i \\) and \\( j \\) .
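+
+ A minimal sketch of this rotation in the common "rotate-half" formulation (simplified to a single unbatched vector; the helper names are illustrative, not the Transformers implementation) could look like this:
+
+ ```python
+ import torch
+
+ def rotate_half(x):
+     x1, x2 = x.chunk(2, dim=-1)
+     return torch.cat((-x2, x1), dim=-1)
+
+ def apply_rope(x, position, base=10000.0):
+     # x: (dim,) query or key vector, position: its index i in the sequence
+     dim = x.shape[-1]
+     inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
+     angles = position * inv_freq  # theta * i, one angle per frequency
+     cos = torch.cat((angles.cos(), angles.cos()))
+     sin = torch.cat((angles.sin(), angles.sin()))
+     return x * cos + rotate_half(x) * sin
+
+ q, k = torch.randn(64), torch.randn(64)
+ # the dot product of the rotated vectors depends only on the relative distance 7 - 3
+ score = apply_rope(q, position=7) @ apply_rope(k, position=3)
+ ```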
555
+
556
+ *RoPE* is used in multiple of today's most important LLMs, such as:
557
+
558
+ - [**Falcon**](https://huggingface.co/tiiuae/falcon-40b)
559
+ - [**Llama**](https://arxiv.org/abs/2302.13971)
560
+ - [**PaLM**](https://arxiv.org/abs/2204.02311)
561
+
562
+ As an alternative, *ALiBi* proposes a much simpler relative position encoding scheme. The relative distance that input tokens have to each other is added as a negative integer scaled by a pre-defined value `m` to each query-key entry of the \\( \mathbf{QK}^T \\) matrix right before the softmax computation.
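+
+ A minimal sketch of this bias for a single attention head, with `m` being the pre-defined (not learned) head-specific slope, could look like this; the resulting matrix is simply added to \\( \mathbf{QK}^T \\) right before the softmax.
+
+ ```python
+ import torch
+
+ def alibi_bias(seq_len, m):
+     positions = torch.arange(seq_len)
+     distance = positions[None, :] - positions[:, None]  # j - i, negative for past tokens
+     bias = m * distance.float()                         # further away -> more negative
+     causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
+     return bias.masked_fill(causal_mask, float("-inf"))
+ ```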
563
+
564
+ ![](/blog/assets/163_optimize_llm/alibi.png)
565
+
566
+ As shown in the [ALiBi](https://arxiv.org/abs/2108.12409) paper, this simple relative positional encoding allows the model to retain a high performance even at very long text input sequences.
567
+
568
+ *ALiBi* is used in multiple of today's most important LLMs, such as:
569
+
570
+ - [**MPT**](https://huggingface.co/mosaicml/mpt-30b)
571
+ - [**BLOOM**](https://huggingface.co/bigscience/bloom)
572
+
573
+ Both *RoPE* and *ALiBi* position encodings can extrapolate to input lengths not seen during training whereas it has been shown that extrapolation works much better out-of-the-box for *ALiBi* as compared to *RoPE*.
574
+ For ALiBi, one simply increases the values of the lower triangular position matrix to match the length of the input sequence.
575
+ For *RoPE*, keeping the same \\( \theta \\) that was used during training leads to poor results when passing text inputs much longer than those seen during training, *c.f* [Press et al.](https://arxiv.org/abs/2108.12409). However, the community has found a couple of effective tricks that adapt \\( \theta \\), thereby allowing *RoPE* position embeddings to work well for extrapolated text input sequences (see [here](https://github.com/huggingface/transformers/pull/24653)).
576
+
577
+ > Both RoPE and ALiBi are relative positional embeddings that are *not* learned during training, but instead are based on the following intuitions:
578
+ - Positional cues about the text inputs should be given directly to the \\( QK^T \\) matrix of the self-attention layer
579
+ - The LLM should be incentivized to learn positional cues based on the constant *relative* distance that input tokens have to each other
580
+ - The further text input tokens are from each other, the lower their query-key attention probability should be. Both RoPE and ALiBi lower the query-key probability of tokens far away from each other: RoPE by increasing the angle between the query-key vectors, thereby decreasing their vector product; ALiBi by adding large negative numbers to the vector product
581
+
582
+ In conclusion, LLMs that are intended to be deployed in tasks that require handling large text inputs are better trained with relative positional embeddings, such as RoPE and ALiBi. Also note that even if an LLM with RoPE and ALiBi has been trained only on a fixed length of say \\( N_1 = 2048 \\) it can still be used in practice with text inputs much larger than \\( N_1 \\), like \\( N_2 = 8192 > N_1 \\) by extrapolating the positional embeddings.
583
+
584
+ ### 3.2 The key-value cache
585
+
586
+ Auto-regressive text generation with LLMs works by iteratively putting in an input sequence, sampling the next token, appending the next token to the input sequence, and continuing to do so until the LLM produces a token that signifies that the generation has finished.
587
+
588
+ Please have a look at [Transformer's Generate Text Tutorial](https://huggingface.co/docs/transformers/llm_tutorial#generate-text) to get a more visual explanation of how auto-regressive generation works.
589
+
590
+ Let's run a quick code snippet to show how auto-regressive generation works in practice. We will simply take the most likely next token via `torch.argmax`.
591
+
592
+ ```python
593
+ input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to("cuda")
594
+
595
+ for _ in range(5):
596
+     next_logits = model(input_ids)["logits"][:, -1:]
597
+     next_token_id = torch.argmax(next_logits, dim=-1)
598
+
599
+     input_ids = torch.cat([input_ids, next_token_id], dim=-1)
600
+     print("shape of input_ids", input_ids.shape)
601
+
602
+ generated_text = tokenizer.batch_decode(input_ids[:, -5:])
603
+ generated_text
604
+ ```
605
+
606
+ **Output**:
607
+ ```
608
+ shape of input_ids torch.Size([1, 21])
609
+ shape of input_ids torch.Size([1, 22])
610
+ shape of input_ids torch.Size([1, 23])
611
+ shape of input_ids torch.Size([1, 24])
612
+ shape of input_ids torch.Size([1, 25])
613
+ [' Here is a Python function']
614
+ ```
615
+
616
+ As we can see, at every step the text input tokens are extended by the token that was just sampled.
617
+
618
+ With very few exceptions, LLMs are trained using the [causal language modeling objective](https://huggingface.co/docs/transformers/tasks/language_modeling#causal-language-modeling) and therefore mask the upper triangle matrix of the attention score - this is why in the two diagrams above the attention scores are left blank (*a.k.a* have 0 probability). For a quick recap on causal language modeling you can refer to the [*Illustrated Self Attention blog*](https://jalammar.github.io/illustrated-gpt2/#part-2-illustrated-self-attention).
619
+
620
+ As a consequence, tokens *never* depend on subsequent tokens, more specifically the \\( \mathbf{q}_i \\) vector is never put in relation with any key and value vectors \\( \mathbf{k}_j, \mathbf{v}_j \\) if \\( j > i \\) . Instead \\( \mathbf{q}_i \\) only attends to previous key-value vectors \\( \mathbf{k}_{m < i}, \mathbf{v}_{m < i} \text{ , for } m \in \{0, \ldots i - 1\} \\). In order to reduce unnecessary computation, one can therefore cache each layer's key-value vectors for all previous timesteps.
621
+
622
+ In the following, we will tell the LLM to make use of the key-value cache by retrieving and forwarding it for each forward pass.
623
+ In Transformers, we can retrieve the key-value cache by passing the `use_cache` flag to the `forward` call and can then pass it with the current token.
624
+
625
+ ```python
626
+ past_key_values = None # past_key_values is the key-value cache
627
+ generated_tokens = []
628
+ next_token_id = tokenizer(prompt, return_tensors="pt")["input_ids"].to("cuda")
629
+
630
+ for _ in range(5):
631
+     next_logits, past_key_values = model(next_token_id, past_key_values=past_key_values, use_cache=True).to_tuple()
632
+     next_logits = next_logits[:, -1:]
633
+     next_token_id = torch.argmax(next_logits, dim=-1)
634
+
635
+     print("shape of input_ids", next_token_id.shape)
636
+     print("length of key-value cache", len(past_key_values[0][0]))  # past_key_values are of shape [num_layers, 0 for k, 1 for v, batch_size, length, hidden_dim]
637
+     generated_tokens.append(next_token_id.item())
638
+
639
+ generated_text = tokenizer.batch_decode(generated_tokens)
640
+ generated_text
641
+ ```
642
+
643
+ **Output**:
644
+ ```
645
+ shape of input_ids torch.Size([1, 1])
646
+ length of key-value cache 20
647
+ shape of input_ids torch.Size([1, 1])
648
+ length of key-value cache 21
649
+ shape of input_ids torch.Size([1, 1])
650
+ length of key-value cache 22
651
+ shape of input_ids torch.Size([1, 1])
652
+ length of key-value cache 23
653
+ shape of input_ids torch.Size([1, 1])
654
+ length of key-value cache 24
655
+ [' Here', ' is', ' a', ' Python', ' function']
656
+ ```
657
+
658
+ As one can see, when using the key-value cache the text input is *not* increased in length, but remains a single input vector. The length of the key-value cache on the other hand is increased by one at every decoding step.
659
+
660
+ > Making use of the key-value cache means that the \\( \mathbf{QK}^T \\) is essentially reduced to \\( \mathbf{q}_c\mathbf{K}^T \\) with \\( \mathbf{q}_c \\) being the query projection of the currently passed input token which is *always* just a single vector.
661
+
662
+ Using the key-value cache has two advantages:
663
+ - Significant increase in computational efficiency as fewer computations are performed compared to computing the full \\( \mathbf{QK}^T \\) matrix. This leads to an increase in inference speed
664
+ - The maximum required memory is not increased quadratically with the number of generated tokens, but only increases linearly.
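+
+ Conceptually, a single decoding step with a cache for one attention head boils down to something like the following minimal sketch (illustrative only, not how Transformers stores its cache internally):
+
+ ```python
+ import torch
+
+ def cached_attention_step(q_c, k_c, v_c, k_cache, v_cache):
+     # q_c, k_c, v_c: (1, head_dim) projections of the *current* token only
+     k_cache = torch.cat([k_cache, k_c], dim=0)  # the cache grows by one entry per step
+     v_cache = torch.cat([v_cache, v_c], dim=0)
+     scores = q_c @ k_cache.T / k_cache.shape[-1] ** 0.5  # q_c K^T: (1, cache_len), never N x N
+     out = torch.softmax(scores, dim=-1) @ v_cache
+     return out, k_cache, v_cache
+ ```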
665
+
666
+ > One should *always* make use of the key-value cache as it leads to identical results and a significant speed-up for longer input sequences. Transformers has the key-value cache enabled by default when making use of the text pipeline or the [`generate` method](https://huggingface.co/docs/transformers/main_classes/text_generation). We have an entire guide dedicated to caches [here](./kv_cache).
667
+
668
+ <Tip warning={true}>
669
+
670
+ Note that, despite our advice to use key-value caches, your LLM output may be slightly different when you use them. This is a property of the matrix multiplication kernels themselves -- you can read more about it [here](https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535).
671
+
672
+ </Tip>
673
+
674
+ #### 3.2.1 Multi-round conversation
675
+
676
+ The key-value cache is especially useful for applications such as chat where multiple passes of auto-regressive decoding are required. Let's look at an example.
677
+
678
+ ```
679
+ User: How many people live in France?
680
+ Assistant: Roughly 75 million people live in France
681
+ User: And how many are in Germany?
682
+ Assistant: Germany has ca. 81 million inhabitants
683
+ ```
684
+
685
+ In this chat, the LLM runs auto-regressive decoding twice:
686
+ 1. The first time, the key-value cache is empty and the input prompt is `"User: How many people live in France?"` and the model auto-regressively generates the text `"Roughly 75 million people live in France"` while increasing the key-value cache at every decoding step.
687
+ 2. The second time the input prompt is `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many are in Germany?"`. Thanks to the cache, all key-value vectors for the first two sentences are already computed. Therefore the input prompt only consists of `"User: And how many are in Germany?"`. While processing the shortened input prompt, its computed key-value vectors are concatenated to the key-value cache of the first decoding. The second Assistant's answer `"Germany has ca. 81 million inhabitants"` is then auto-regressively generated with the key-value cache consisting of encoded key-value vectors of `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many are in Germany?"`.
688
+
689
+ Two things should be noted here:
690
+ 1. Keeping all the context is crucial for LLMs deployed in chat so that the LLM understands all the previous context of the conversation. E.g. for the example above the LLM needs to understand that the user refers to the population when asking `"And how many are in Germany"`.
691
+ 2. The key-value cache is extremely useful for chat as it allows us to continuously grow the encoded chat history instead of having to re-encode the chat history again from scratch (as e.g. would be the case when using an encoder-decoder architecture).
692
+
693
+ In `transformers`, a `generate` call will return `past_key_values` when `return_dict_in_generate=True` is passed, in addition to the default `use_cache=True`. Note that it is not yet available through the `pipeline` interface.
694
+
695
+ ```python
696
+ # Generation as usual
697
+ prompt = system_prompt + "Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer: Here"
698
+ model_inputs = tokenizer(prompt, return_tensors='pt')
699
+ generation_output = model.generate(**model_inputs, max_new_tokens=60, return_dict_in_generate=True)
700
+ decoded_output = tokenizer.batch_decode(generation_output.sequences)[0]
701
+
702
+ # Piping the returned `past_key_values` to speed up the next conversation round
703
+ prompt = decoded_output + "\nQuestion: How can I modify the function above to return Mega bytes instead?\n\nAnswer: Here"
704
+ model_inputs = tokenizer(prompt, return_tensors='pt')
705
+ generation_output = model.generate(
706
+     **model_inputs,
707
+     past_key_values=generation_output.past_key_values,
708
+     max_new_tokens=60,
709
+     return_dict_in_generate=True
710
+ )
711
+ tokenizer.batch_decode(generation_output.sequences)[0][len(prompt):]
712
+ ```
713
+
714
+ **Output**:
715
+ ```
716
+ is a modified version of the function that returns Mega bytes instead.
717
+
718
+ def bytes_to_megabytes(bytes):
719
+ return bytes / 1024 / 1024
720
+
721
+ Answer: The function takes a number of bytes as input and returns the number of
722
+ ```
723
+
724
+ Great, no additional time is spent recomputing the same key and values for the attention layer! There is however one catch. While the required peak memory for the \\( \mathbf{QK}^T \\) matrix is significantly reduced, holding the key-value cache in memory can become very memory expensive for long input sequences or multi-turn chat. Remember that the key-value cache needs to store the key-value vectors for all previous input vectors \\( \mathbf{x}_i \text{, for } i \in \{1, \ldots, c - 1\} \\) for all self-attention layers and for all attention heads.
725
+
726
+ Let's compute the number of float values that need to be stored in the key-value cache for the LLM `bigcode/octocoder` that we used before.
727
+ The number of float values amounts to two times the sequence length times the number of attention heads times the attention head dimension and times the number of layers.
728
+ Computing this for our LLM at a hypothetical input sequence length of 16000 gives:
729
+
730
+ ```python
731
+ config = model.config
732
+ 2 * 16_000 * config.n_layer * config.n_head * config.n_embd // config.n_head
733
+ ```
734
+
735
+ **Output**:
736
+ ```
737
+ 7864320000
738
+ ```
739
+
740
+ Roughly 8 billion float values! Storing 8 billion float values in `float16` precision requires around 15 GB of RAM which is circa half as much as the model weights themselves!
741
+ Researchers have proposed two methods that make it possible to significantly reduce the memory cost of storing the key-value cache, which are explored in the next subsections.
742
+
743
+ #### 3.2.2 Multi-Query-Attention (MQA)
744
+
745
+ [Multi-Query-Attention](https://arxiv.org/abs/1911.02150) was proposed in Noam Shazeer's *Fast Transformer Decoding: One Write-Head is All You Need* paper. As the title says, Noam found out that instead of using `n_head` key-value projection weights, one can use a single key-value projection weight pair that is shared across all attention heads without the model's performance degrading significantly.
746
+
747
+ > By using a single key-value projection weight pair, the key-value vectors \\( \mathbf{k}_i, \mathbf{v}_i \\) have to be identical across all attention heads which in turn means that we only need to store 1 key-value projection pair in the cache instead of `n_head` ones.
748
+
749
+ As most LLMs use between 20 and 100 attention heads, MQA significantly reduces the memory consumption of the key-value cache. For the LLM used in this notebook we could therefore reduce the required memory consumption from 15 GB to less than 400 MB at an input sequence length of 16000.
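+
+ The back-of-the-envelope math behind that claim is straightforward, using a 40-layer, 48-head, 6144-hidden-size configuration (consistent with the 7864320000 float values computed above) and 2 bytes per float16 value:
+
+ ```python
+ seq_len, n_layer, n_head, n_embd = 16_000, 40, 48, 6144
+ head_dim = n_embd // n_head
+ bytes_per_value = 2  # float16
+
+ mha_cache = 2 * seq_len * n_layer * n_head * head_dim * bytes_per_value  # keys + values, all heads
+ mqa_cache = 2 * seq_len * n_layer * 1 * head_dim * bytes_per_value       # one shared key-value head
+
+ print(f"MHA cache: {mha_cache / 1024**3:.1f} GB")  # ~14.6 GB
+ print(f"MQA cache: {mqa_cache / 1024**2:.0f} MB")  # ~312 MB
+ ```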
750
+
751
+ In addition to memory savings, MQA also leads to improved computational efficiency as explained in the following.
752
+ In auto-regressive decoding, large key-value vectors need to be reloaded, concatenated with the current key-value vector pair to be then fed into the \\( \mathbf{q}_c\mathbf{K}^T \\) computation at every step. For auto-regressive decoding, the required memory bandwidth for the constant reloading can become a serious time bottleneck. By reducing the size of the key-value vectors less memory needs to be accessed, thus reducing the memory bandwidth bottleneck. For more detail, please have a look at [Noam's paper](https://arxiv.org/abs/1911.02150).
753
+
754
+ The important part to understand here is that reducing the number of key-value attention heads to 1 only makes sense if a key-value cache is used. The peak memory consumption of the model for a single forward pass without key-value cache stays unchanged as every attention head still has a unique query vector so that each attention head still has a different \\( \mathbf{QK}^T \\) matrix.
755
+
756
+ MQA has seen wide adoption by the community and is now used by many of the most popular LLMs:
757
+
758
+ - [**Falcon**](https://huggingface.co/tiiuae/falcon-40b)
759
+ - [**PaLM**](https://arxiv.org/abs/2204.02311)
760
+ - [**MPT**](https://huggingface.co/mosaicml/mpt-30b)
761
+ - [**BLOOM**](https://huggingface.co/bigscience/bloom)
762
+
763
+ Also, the checkpoint used in this notebook - `bigcode/octocoder` - makes use of MQA.
764
+
765
+ #### 3.2.3 Grouped-Query-Attention (GQA)
766
+
767
+ [Grouped-Query-Attention](https://arxiv.org/abs/2305.13245), as proposed by Ainslie et al. from Google, found that using MQA can often lead to quality degradation compared to using vanilla multi-key-value head projections. The paper argues that more model performance can be kept by less drastically reducing the number of key-value head projection weights. Instead of using just a single key-value projection weight, `n < n_head` key-value projection weights should be used. By setting `n` to a value significantly smaller than `n_head`, such as 2, 4, or 8, almost all of the memory and speed gains from MQA can be kept while sacrificing less model capacity and thus arguably less performance.
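+
+ A minimal sketch of what this grouping means in practice (query heads are divided into groups that each share one key-value head; this only illustrates the layout, not the Transformers implementation):
+
+ ```python
+ import torch
+
+ def expand_kv_heads(k, v, n_head):
+     # k, v: (n_kv_head, seq_len, head_dim) with n_kv_head < n_head, e.g. 8 instead of 48
+     n_kv_head = k.shape[0]
+     group_size = n_head // n_kv_head
+     # every group of `group_size` query heads attends to the same key-value head
+     k = k.repeat_interleave(group_size, dim=0)  # (n_head, seq_len, head_dim)
+     v = v.repeat_interleave(group_size, dim=0)
+     return k, v
+ ```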
768
+
769
+ Moreover, the authors of GQA found out that existing model checkpoints can be *uptrained* to have a GQA architecture with as little as 5% of the original pre-training compute. While 5% of the original pre-training compute can still be a massive amount, GQA *uptraining* allows existing checkpoints to be useful for longer input sequences.
770
+
771
+ GQA was only recently proposed which is why there is less adoption at the time of writing this notebook.
772
+ The most notable application of GQA is [Llama-v2](https://huggingface.co/meta-llama/Llama-2-70b-hf).
773
+
774
+ > As a conclusion, it is strongly recommended to make use of either GQA or MQA if the LLM is deployed with auto-regressive decoding and is required to handle large input sequences as is the case for example for chat.
775
+
776
+
777
+ ## Conclusion
778
+
779
+ The research community is constantly coming up with new, nifty ways to speed up inference time for ever-larger LLMs. As an example, one such promising research direction is [speculative decoding](https://arxiv.org/abs/2211.17192) where "easy tokens" are generated by smaller, faster language models and only "hard tokens" are generated by the LLM itself. Going into more detail is out of the scope of this notebook, but you can read more about it in this [nice blog post](https://huggingface.co/blog/assisted-generation).
780
+
781
+ The reason massive LLMs such as GPT3/4, Llama-2-70b, Claude, PaLM can run so quickly in chat-interfaces such as [Hugging Face Chat](https://huggingface.co/chat/) or ChatGPT is in large part thanks to the above-mentioned improvements in precision, algorithms, and architecture.
782
+ Going forward, accelerators such as GPUs, TPUs, etc... will only get faster and allow for more memory, but one should nevertheless always make sure to use the best available algorithms and architectures to get the most bang for your buck 🤗
docs/transformers/docs/source/en/main_classes/backbones.md ADDED
@@ -0,0 +1,60 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2023 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Backbone
18
+
19
+ A backbone is a model used for feature extraction for higher level computer vision tasks such as object detection and image classification. Transformers provides an [`AutoBackbone`] class for initializing a Transformers backbone from pretrained model weights, and two utility classes:
20
+
21
+ * [`~utils.BackboneMixin`] enables initializing a backbone from Transformers or [timm](https://hf.co/docs/timm/index) and includes functions for returning the output features and indices.
22
+ * [`~utils.BackboneConfigMixin`] sets the output features and indices of the backbone configuration.
23
+
24
+ [timm](https://hf.co/docs/timm/index) models are loaded with the [`TimmBackbone`] and [`TimmBackboneConfig`] classes.
25
+
26
+ Backbones are supported for the following models:
27
+
28
+ * [BEiT](../model_doc/beit)
29
+ * [BiT](../model_doc/bit)
30
+ * [ConvNext](../model_doc/convnext)
31
+ * [ConvNextV2](../model_doc/convnextv2)
32
+ * [DiNAT](../model_doc/dinat)
33
+ * [DINOV2](../model_doc/dinov2)
34
+ * [FocalNet](../model_doc/focalnet)
35
+ * [MaskFormer](../model_doc/maskformer)
36
+ * [NAT](../model_doc/nat)
37
+ * [ResNet](../model_doc/resnet)
38
+ * [Swin Transformer](../model_doc/swin)
39
+ * [Swin Transformer v2](../model_doc/swinv2)
40
+ * [ViTDet](../model_doc/vitdet)
41
+
42
+ ## AutoBackbone
43
+
44
+ [[autodoc]] AutoBackbone
45
+
46
+ ## BackboneMixin
47
+
48
+ [[autodoc]] utils.BackboneMixin
49
+
50
+ ## BackboneConfigMixin
51
+
52
+ [[autodoc]] utils.BackboneConfigMixin
53
+
54
+ ## TimmBackbone
55
+
56
+ [[autodoc]] models.timm_backbone.TimmBackbone
57
+
58
+ ## TimmBackboneConfig
59
+
60
+ [[autodoc]] models.timm_backbone.TimmBackboneConfig
docs/transformers/docs/source/en/main_classes/callback.md ADDED
@@ -0,0 +1,137 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Callbacks
18
+
19
+ Callbacks are objects that can customize the behavior of the training loop in the PyTorch
20
+ [`Trainer`] (this feature is not yet implemented in TensorFlow) that can inspect the training loop
21
+ state (for progress reporting, logging on TensorBoard or other ML platforms...) and take decisions (like early
22
+ stopping).
23
+
24
+ Callbacks are "read only" pieces of code, apart from the [`TrainerControl`] object they return, they
25
+ cannot change anything in the training loop. For customizations that require changes in the training loop, you should
26
+ subclass [`Trainer`] and override the methods you need (see [trainer](trainer) for examples).
27
+
28
+ By default, `TrainingArguments.report_to` is set to `"all"`, so a [`Trainer`] will use the following callbacks.
29
+
30
+ - [`DefaultFlowCallback`] which handles the default behavior for logging, saving and evaluation.
31
+ - [`PrinterCallback`] or [`ProgressCallback`] to display progress and print the
32
+ logs (the first one is used if you deactivate tqdm through the [`TrainingArguments`], otherwise
33
+ it's the second one).
34
+ - [`~integrations.TensorBoardCallback`] if tensorboard is accessible (either through PyTorch >= 1.4
35
+ or tensorboardX).
36
+ - [`~integrations.WandbCallback`] if [wandb](https://www.wandb.com/) is installed.
37
+ - [`~integrations.CometCallback`] if [comet_ml](https://www.comet.com/site/) is installed.
38
+ - [`~integrations.MLflowCallback`] if [mlflow](https://www.mlflow.org/) is installed.
39
+ - [`~integrations.NeptuneCallback`] if [neptune](https://neptune.ai/) is installed.
40
+ - [`~integrations.AzureMLCallback`] if [azureml-sdk](https://pypi.org/project/azureml-sdk/) is
41
+ installed.
42
+ - [`~integrations.CodeCarbonCallback`] if [codecarbon](https://pypi.org/project/codecarbon/) is
43
+ installed.
44
+ - [`~integrations.ClearMLCallback`] if [clearml](https://github.com/allegroai/clearml) is installed.
45
+ - [`~integrations.DagsHubCallback`] if [dagshub](https://dagshub.com/) is installed.
46
+ - [`~integrations.FlyteCallback`] if [flyte](https://flyte.org/) is installed.
47
+ - [`~integrations.DVCLiveCallback`] if [dvclive](https://dvc.org/doc/dvclive) is installed.
48
+ - [`~integrations.SwanLabCallback`] if [swanlab](http://swanlab.cn/) is installed.
49
+
50
+ If a package is installed but you don't wish to use the accompanying integration, you can change `TrainingArguments.report_to` to a list of just those integrations you want to use (e.g. `["azure_ml", "wandb"]`).
51
+
52
+ The main class that implements callbacks is [`TrainerCallback`]. It gets the
53
+ [`TrainingArguments`] used to instantiate the [`Trainer`], can access that
54
+ Trainer's internal state via [`TrainerState`], and can take some actions on the training loop via
55
+ [`TrainerControl`].
56
+
57
+
58
+ ## Available Callbacks
59
+
60
+ Here is the list of the available [`TrainerCallback`] in the library:
61
+
62
+ [[autodoc]] integrations.CometCallback
63
+ - setup
64
+
65
+ [[autodoc]] DefaultFlowCallback
66
+
67
+ [[autodoc]] PrinterCallback
68
+
69
+ [[autodoc]] ProgressCallback
70
+
71
+ [[autodoc]] EarlyStoppingCallback
72
+
73
+ [[autodoc]] integrations.TensorBoardCallback
74
+
75
+ [[autodoc]] integrations.WandbCallback
76
+ - setup
77
+
78
+ [[autodoc]] integrations.MLflowCallback
79
+ - setup
80
+
81
+ [[autodoc]] integrations.AzureMLCallback
82
+
83
+ [[autodoc]] integrations.CodeCarbonCallback
84
+
85
+ [[autodoc]] integrations.NeptuneCallback
86
+
87
+ [[autodoc]] integrations.ClearMLCallback
88
+
89
+ [[autodoc]] integrations.DagsHubCallback
90
+
91
+ [[autodoc]] integrations.FlyteCallback
92
+
93
+ [[autodoc]] integrations.DVCLiveCallback
94
+ - setup
95
+
96
+ [[autodoc]] integrations.SwanLabCallback
97
+ - setup
98
+
99
+ ## TrainerCallback
100
+
101
+ [[autodoc]] TrainerCallback
102
+
103
+ Here is an example of how to register a custom callback with the PyTorch [`Trainer`]:
104
+
105
+ ```python
106
+ class MyCallback(TrainerCallback):
107
+ "A callback that prints a message at the beginning of training"
108
+
109
+ def on_train_begin(self, args, state, control, **kwargs):
110
+ print("Starting training")
111
+
112
+
113
+ trainer = Trainer(
114
+ model,
115
+ args,
116
+ train_dataset=train_dataset,
117
+ eval_dataset=eval_dataset,
118
+ callbacks=[MyCallback], # We can either pass the callback class this way or an instance of it (MyCallback())
119
+ )
120
+ ```
121
+
122
+ Another way to register a callback is to call `trainer.add_callback()` as follows:
123
+
124
+ ```python
125
+ trainer = Trainer(...)
126
+ trainer.add_callback(MyCallback)
127
+ # Alternatively, we can pass an instance of the callback class
128
+ trainer.add_callback(MyCallback())
129
+ ```
130
+
131
+ ## TrainerState
132
+
133
+ [[autodoc]] TrainerState
134
+
135
+ ## TrainerControl
136
+
137
+ [[autodoc]] TrainerControl
docs/transformers/docs/source/en/main_classes/configuration.md ADDED
@@ -0,0 +1,32 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Configuration
18
+
19
+ The base class [`PretrainedConfig`] implements the common methods for loading/saving a configuration
20
+ either from a local file or directory, or from a pretrained model configuration provided by the library (downloaded
21
+ from HuggingFace's AWS S3 repository).
22
+
23
+ Each derived config class implements model specific attributes. Common attributes present in all config classes are:
24
+ `hidden_size`, `num_attention_heads`, and `num_hidden_layers`. Text models further implement:
25
+ `vocab_size`.
26
+
27
+
28
+ ## PretrainedConfig
29
+
30
+ [[autodoc]] PretrainedConfig
31
+ - push_to_hub
32
+ - all
docs/transformers/docs/source/en/main_classes/data_collator.md ADDED
@@ -0,0 +1,76 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Data Collator
18
+
19
+ Data collators are objects that will form a batch by using a list of dataset elements as input. These elements are of
20
+ the same type as the elements of `train_dataset` or `eval_dataset`.
21
+
22
+ To be able to build batches, data collators may apply some processing (like padding). Some of them (like
23
+ [`DataCollatorForLanguageModeling`]) also apply some random data augmentation (like random masking)
24
+ on the formed batch.
25
+
26
+ Examples of use can be found in the [example scripts](../examples) or [example notebooks](../notebooks).
27
+
28
+
29
+ ## Default data collator
30
+
31
+ [[autodoc]] data.data_collator.default_data_collator
32
+
33
+ ## DefaultDataCollator
34
+
35
+ [[autodoc]] data.data_collator.DefaultDataCollator
36
+
37
+ ## DataCollatorWithPadding
38
+
39
+ [[autodoc]] data.data_collator.DataCollatorWithPadding
40
+
41
+ ## DataCollatorForTokenClassification
42
+
43
+ [[autodoc]] data.data_collator.DataCollatorForTokenClassification
44
+
45
+ ## DataCollatorForSeq2Seq
46
+
47
+ [[autodoc]] data.data_collator.DataCollatorForSeq2Seq
48
+
49
+ ## DataCollatorForLanguageModeling
50
+
51
+ [[autodoc]] data.data_collator.DataCollatorForLanguageModeling
52
+ - numpy_mask_tokens
53
+ - tf_mask_tokens
54
+ - torch_mask_tokens
55
+
56
+ ## DataCollatorForWholeWordMask
57
+
58
+ [[autodoc]] data.data_collator.DataCollatorForWholeWordMask
59
+ - numpy_mask_tokens
60
+ - tf_mask_tokens
61
+ - torch_mask_tokens
62
+
63
+ ## DataCollatorForPermutationLanguageModeling
64
+
65
+ [[autodoc]] data.data_collator.DataCollatorForPermutationLanguageModeling
66
+ - numpy_mask_tokens
67
+ - tf_mask_tokens
68
+ - torch_mask_tokens
69
+
70
+ ## DataCollatorWithFlattening
71
+
72
+ [[autodoc]] data.data_collator.DataCollatorWithFlattening
73
+
74
+ ## DataCollatorForMultipleChoice
75
+
76
+ [[autodoc]] data.data_collator.DataCollatorForMultipleChoice
docs/transformers/docs/source/en/main_classes/deepspeed.md ADDED
@@ -0,0 +1,32 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # DeepSpeed
18
+
19
+ [DeepSpeed](https://github.com/deepspeedai/DeepSpeed), powered by Zero Redundancy Optimizer (ZeRO), is an optimization library for training and fitting very large models onto a GPU. It is available in several ZeRO stages, where each stage progressively saves more GPU memory by partitioning the optimizer state, gradients, parameters, and enabling offloading to a CPU or NVMe. DeepSpeed is integrated with the [`Trainer`] class and most of the setup is automatically taken care of for you.
20
+
21
+ However, if you want to use DeepSpeed without the [`Trainer`], Transformers provides a [`HfDeepSpeedConfig`] class.
22
+
23
+ <Tip>
24
+
25
+ Learn more about using DeepSpeed with [`Trainer`] in the [DeepSpeed](../deepspeed) guide.
26
+
27
+ </Tip>
28
+
29
+ ## HfDeepSpeedConfig
30
+
31
+ [[autodoc]] integrations.HfDeepSpeedConfig
32
+ - all
docs/transformers/docs/source/en/main_classes/executorch.md ADDED
@@ -0,0 +1,33 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ All rights reserved.
3
+
4
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
5
+ the License. You may obtain a copy of the License at
6
+
7
+ http://www.apache.org/licenses/LICENSE-2.0
8
+
9
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
10
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
11
+ specific language governing permissions and limitations under the License.
12
+
13
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
14
+ rendered properly in your Markdown viewer.
15
+
16
+ -->
17
+
18
+
19
+ # ExecuTorch
20
+
21
+ [`ExecuTorch`](https://github.com/pytorch/executorch) is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch ecosystem and supports the deployment of PyTorch models with a focus on portability, productivity, and performance.
22
+
23
+ ExecuTorch introduces well defined entry points to perform model, device, and/or use-case specific optimizations such as backend delegation, user-defined compiler transformations, memory planning, and more. The first step in preparing a PyTorch model for execution on an edge device using ExecuTorch is to export the model. This is achieved through the use of a PyTorch API called [`torch.export`](https://pytorch.org/docs/stable/export.html).
24
+
25
+
26
+ ## ExecuTorch Integration
27
+
28
+ An integration point is being developed to ensure that 🤗 Transformers can be exported using `torch.export`. The goal of this integration is not only to enable export but also to ensure that the exported artifact can be further lowered and optimized to run efficiently in `ExecuTorch`, particularly for mobile and edge use cases.
29
+
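+ As a rough, self-contained illustration of what the `torch.export` step looks like (on a toy module; exporting a full Transformers model typically goes through the integration helpers documented below):
+
+ ```python
+ import torch
+
+
+ class ToyModel(torch.nn.Module):
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         return torch.nn.functional.relu(x) * 2
+
+
+ # torch.export captures an ahead-of-time, backend-agnostic graph of the module.
+ exported_program = torch.export.export(ToyModel(), args=(torch.randn(2, 8),))
+ print(exported_program)
+ ```
+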
30
+ [[autodoc]] TorchExportableModuleWithStaticCache
31
+ - forward
32
+
33
+ [[autodoc]] convert_and_export_with_cache
docs/transformers/docs/source/en/main_classes/feature_extractor.md ADDED
@@ -0,0 +1,39 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2021 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Feature Extractor
18
+
19
+ A feature extractor is in charge of preparing input features for audio or vision models. This includes feature extraction from sequences (e.g., pre-processing audio files to generate log-Mel spectrogram features), feature extraction from images (e.g., cropping image files), as well as padding, normalization, and conversion to NumPy, PyTorch, and TensorFlow tensors.
20
+
21
+
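+ For example, a padded batch of raw audio can be prepared like this (the checkpoint and dummy waveforms are illustrative):
+
+ ```python
+ import numpy as np
+ from transformers import AutoFeatureExtractor
+
+ feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
+ # Two dummy mono waveforms of different lengths, sampled at 16 kHz.
+ raw_audio = [np.zeros(16000, dtype=np.float32), np.zeros(8000, dtype=np.float32)]
+ inputs = feature_extractor(raw_audio, sampling_rate=16000, padding=True, return_tensors="pt")
+ print(inputs["input_values"].shape)  # torch.Size([2, 16000]) after padding
+ ```
+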
22
+ ## FeatureExtractionMixin
23
+
24
+ [[autodoc]] feature_extraction_utils.FeatureExtractionMixin
25
+ - from_pretrained
26
+ - save_pretrained
27
+
28
+ ## SequenceFeatureExtractor
29
+
30
+ [[autodoc]] SequenceFeatureExtractor
31
+ - pad
32
+
33
+ ## BatchFeature
34
+
35
+ [[autodoc]] BatchFeature
36
+
37
+ ## ImageFeatureExtractionMixin
38
+
39
+ [[autodoc]] image_utils.ImageFeatureExtractionMixin
docs/transformers/docs/source/en/main_classes/image_processor.md ADDED
@@ -0,0 +1,79 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2022 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Image Processor
18
+
19
+ An image processor is in charge of preparing input features for vision models and post-processing their outputs. This includes transformations such as resizing, normalization, and conversion to PyTorch, TensorFlow, Flax, and NumPy tensors. It may also include model-specific post-processing such as converting logits to segmentation masks.
20
+
21
+ Fast image processors are available for a few models and more will be added in the future. They are based on the [torchvision](https://pytorch.org/vision/stable/index.html) library and provide a significant speed-up, especially when processing on GPU.
22
+ They have the same API as the base image processors and can be used as drop-in replacements.
23
+ To use a fast image processor, you need to install the `torchvision` library, and set the `use_fast` argument to `True` when instantiating the image processor:
24
+
25
+ ```python
26
+ from transformers import AutoImageProcessor
27
+
28
+ processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)
29
+ ```
30
+ Note that `use_fast` will be set to `True` by default in a future release.
31
+
32
+ When using a fast image processor, you can also set the `device` argument to specify the device on which the processing should be done. By default, the processing is done on the same device as the inputs if the inputs are tensors, or on the CPU otherwise.
33
+
34
+ ```python
35
+ from torchvision.io import read_image
36
+ from transformers import DetrImageProcessorFast
37
+
38
+ images = read_image("image.jpg")
39
+ processor = DetrImageProcessorFast.from_pretrained("facebook/detr-resnet-50")
40
+ images_processed = processor(images, return_tensors="pt", device="cuda")
41
+ ```
42
+
43
+ Here are some speed comparisons between the base and fast image processors for the `DETR` and `RT-DETR` models, and how they impact overall inference time:
44
+
45
+ <div class="flex">
46
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/benchmark_results_full_pipeline_detr_fast_padded.png" />
47
+ </div>
48
+ <div class="flex">
49
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/benchmark_results_full_pipeline_detr_fast_batched_compiled.png" />
50
+ </div>
51
+
52
+ <div class="flex">
53
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/benchmark_results_full_pipeline_rt_detr_fast_single.png" />
54
+ </div>
55
+ <div class="flex">
56
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/benchmark_results_full_pipeline_rt_detr_fast_batched.png" />
57
+ </div>
58
+
59
+ These benchmarks were run on an [AWS EC2 g5.2xlarge instance](https://aws.amazon.com/ec2/instance-types/g5/), utilizing an NVIDIA A10G Tensor Core GPU.
60
+
61
+
62
+ ## ImageProcessingMixin
63
+
64
+ [[autodoc]] image_processing_utils.ImageProcessingMixin
65
+ - from_pretrained
66
+ - save_pretrained
67
+
68
+ ## BatchFeature
69
+
70
+ [[autodoc]] BatchFeature
71
+
72
+ ## BaseImageProcessor
73
+
74
+ [[autodoc]] image_processing_utils.BaseImageProcessor
75
+
76
+
77
+ ## BaseImageProcessorFast
78
+
79
+ [[autodoc]] image_processing_utils_fast.BaseImageProcessorFast
docs/transformers/docs/source/en/main_classes/keras_callbacks.md ADDED
@@ -0,0 +1,28 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2021 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Keras callbacks
18
+
19
+ When training a Transformers model with Keras, there are some library-specific callbacks available to automate common
20
+ tasks:
21
+
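+ Both of the callbacks below are passed to `model.fit` like any other Keras callback. A hedged sketch, where `model`, `tokenizer`, `compute_metrics` and the `tf.data` datasets are assumed to be defined elsewhere in your training script:
+
+ ```python
+ from transformers.keras_callbacks import KerasMetricCallback, PushToHubCallback
+
+ # `compute_metrics`, `tokenizer`, `model`, `tf_train_dataset` and `tf_eval_dataset`
+ # are placeholders defined elsewhere in your script.
+ metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_eval_dataset)
+ push_callback = PushToHubCallback(output_dir="./model_checkpoints", tokenizer=tokenizer)
+
+ model.fit(
+     tf_train_dataset,
+     validation_data=tf_eval_dataset,
+     epochs=3,
+     callbacks=[metric_callback, push_callback],
+ )
+ ```
+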
22
+ ## KerasMetricCallback
23
+
24
+ [[autodoc]] KerasMetricCallback
25
+
26
+ ## PushToHubCallback
27
+
28
+ [[autodoc]] PushToHubCallback
docs/transformers/docs/source/en/main_classes/logging.md ADDED
@@ -0,0 +1,119 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Logging
18
+
19
+ 🤗 Transformers has a centralized logging system, so that you can set up the verbosity of the library easily.
20
+
21
+ Currently the default verbosity of the library is `WARNING`.
22
+
23
+ To change the level of verbosity, just use one of the direct setters. For instance, here is how to change the verbosity
24
+ to the INFO level.
25
+
26
+ ```python
27
+ import transformers
28
+
29
+ transformers.logging.set_verbosity_info()
30
+ ```
31
+
32
+ You can also use the environment variable `TRANSFORMERS_VERBOSITY` to override the default verbosity. You can set it
33
+ to one of the following: `debug`, `info`, `warning`, `error`, `critical`, `fatal`. For example:
34
+
35
+ ```bash
36
+ TRANSFORMERS_VERBOSITY=error ./myprogram.py
37
+ ```
38
+
39
+ Additionally, some `warnings` can be disabled by setting the environment variable
40
+ `TRANSFORMERS_NO_ADVISORY_WARNINGS` to a true value, like *1*. This will disable any warning that is logged using
41
+ [`logger.warning_advice`]. For example:
42
+
43
+ ```bash
44
+ TRANSFORMERS_NO_ADVISORY_WARNINGS=1 ./myprogram.py
45
+ ```
46
+
47
+ Here is an example of how to use the same logger as the library in your own module or script:
48
+
49
+ ```python
50
+ from transformers.utils import logging
51
+
52
+ logging.set_verbosity_info()
53
+ logger = logging.get_logger("transformers")
54
+ logger.info("INFO")
55
+ logger.warning("WARN")
56
+ ```
57
+
58
+
59
+ All the methods of this logging module are documented below; the main ones are
60
+ [`logging.get_verbosity`] to get the current level of verbosity in the logger and
61
+ [`logging.set_verbosity`] to set the verbosity to the level of your choice. In order (from the least
62
+ verbose to the most verbose), those levels (with their corresponding int values in parentheses) are:
63
+
64
+ - `transformers.logging.CRITICAL` or `transformers.logging.FATAL` (int value, 50): only reports the most
65
+ critical errors.
66
+ - `transformers.logging.ERROR` (int value, 40): only reports errors.
67
+ - `transformers.logging.WARNING` or `transformers.logging.WARN` (int value, 30): only reports errors and
68
+ warnings. This is the default level used by the library.
69
+ - `transformers.logging.INFO` (int value, 20): reports errors, warnings, and basic information.
70
+ - `transformers.logging.DEBUG` (int value, 10): reports all information.
71
+
72
+ By default, `tqdm` progress bars are displayed during model download. [`logging.disable_progress_bar`] and [`logging.enable_progress_bar`] can be used to hide them or show them again.
73
+
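+ For example:
+
+ ```python
+ from transformers.utils import logging
+
+ logging.disable_progress_bar()  # hide tqdm bars, e.g. in non-interactive jobs
+ logging.enable_progress_bar()  # turn them back on
+ ```
+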
74
+ ## `logging` vs `warnings`
75
+
76
+ Python has two logging systems that are often used in conjunction: `logging`, which is explained above, and `warnings`,
77
+ which allows further classification of warnings in specific buckets, e.g., `FutureWarning` for a feature or path
78
+ that has already been deprecated and `DeprecationWarning` to indicate an upcoming deprecation.
79
+
80
+ We use both in the `transformers` library. We leverage and adapt `logging`'s `captureWarnings` method to allow
81
+ management of these warning messages by the verbosity setters above.
82
+
83
+ What does that mean for developers of the library? We should respect the following heuristics:
84
+ - `warnings` should be favored for developers of the library and libraries dependent on `transformers`
85
+ - `logging` should be used for end-users of the library using it in every-day projects
86
+
87
+ See reference of the `captureWarnings` method below.
88
+
89
+ [[autodoc]] logging.captureWarnings
90
+
91
+ ## Base setters
92
+
93
+ [[autodoc]] logging.set_verbosity_error
94
+
95
+ [[autodoc]] logging.set_verbosity_warning
96
+
97
+ [[autodoc]] logging.set_verbosity_info
98
+
99
+ [[autodoc]] logging.set_verbosity_debug
100
+
101
+ ## Other functions
102
+
103
+ [[autodoc]] logging.get_verbosity
104
+
105
+ [[autodoc]] logging.set_verbosity
106
+
107
+ [[autodoc]] logging.get_logger
108
+
109
+ [[autodoc]] logging.enable_default_handler
110
+
111
+ [[autodoc]] logging.disable_default_handler
112
+
113
+ [[autodoc]] logging.enable_explicit_format
114
+
115
+ [[autodoc]] logging.reset_format
116
+
117
+ [[autodoc]] logging.enable_progress_bar
118
+
119
+ [[autodoc]] logging.disable_progress_bar
docs/transformers/docs/source/en/main_classes/model.md ADDED
@@ -0,0 +1,73 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Models
18
+
19
+ The base classes [`PreTrainedModel`], [`TFPreTrainedModel`], and
20
+ [`FlaxPreTrainedModel`] implement the common methods for loading/saving a model either from a local
21
+ file or directory, or from a pretrained model configuration provided by the library (downloaded from the Hugging
22
+ Face Hub).
23
+
24
+ [`PreTrainedModel`] and [`TFPreTrainedModel`] also implement a few methods which
25
+ are common among all the models to:
26
+
27
+ - resize the input token embeddings when new tokens are added to the vocabulary
28
+ - prune the attention heads of the model.
29
+
30
+ The other methods that are common to each model are defined in [`~modeling_utils.ModuleUtilsMixin`]
31
+ (for the PyTorch models) and [`~modeling_tf_utils.TFModelUtilsMixin`] (for the TensorFlow models), or
32
+ for text generation, [`~generation.GenerationMixin`] (for the PyTorch models),
33
+ [`~generation.TFGenerationMixin`] (for the TensorFlow models) and
34
+ [`~generation.FlaxGenerationMixin`] (for the Flax/JAX models).
35
+
36
+
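+ As a quick illustration of the embedding-resizing and head-pruning utilities mentioned above (the checkpoint and pruned heads are arbitrary):
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
+ model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
+
+ # Add new tokens and resize the input embeddings accordingly.
+ tokenizer.add_tokens(["<new_token>"])
+ model.resize_token_embeddings(len(tokenizer))
+
+ # Prune attention heads 0 and 1 of layer 2 (an arbitrary choice for illustration).
+ model.prune_heads({2: [0, 1]})
+ ```
+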
37
+ ## PreTrainedModel
38
+
39
+ [[autodoc]] PreTrainedModel
40
+ - push_to_hub
41
+ - all
42
+
43
+ Custom models should also include a `_supports_assign_param_buffer` attribute, which determines whether superfast init can be applied
44
+ to the particular model. A sign that your model needs this is a failing `test_save_and_load_from_pretrained` test. If so,
45
+ set this attribute to `False`.
46
+
47
+ ## ModuleUtilsMixin
48
+
49
+ [[autodoc]] modeling_utils.ModuleUtilsMixin
50
+
51
+ ## TFPreTrainedModel
52
+
53
+ [[autodoc]] TFPreTrainedModel
54
+ - push_to_hub
55
+ - all
56
+
57
+ ## TFModelUtilsMixin
58
+
59
+ [[autodoc]] modeling_tf_utils.TFModelUtilsMixin
60
+
61
+ ## FlaxPreTrainedModel
62
+
63
+ [[autodoc]] FlaxPreTrainedModel
64
+ - push_to_hub
65
+ - all
66
+
67
+ ## Pushing to the Hub
68
+
69
+ [[autodoc]] utils.PushToHubMixin
70
+
71
+ ## Sharded checkpoints
72
+
73
+ [[autodoc]] modeling_utils.load_sharded_checkpoint
docs/transformers/docs/source/en/main_classes/onnx.md ADDED
@@ -0,0 +1,54 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Exporting 🤗 Transformers models to ONNX
18
+
19
+ 🤗 Transformers provides a `transformers.onnx` package that enables you to
20
+ convert model checkpoints to an ONNX graph by leveraging configuration objects.
21
+
22
+ See the [guide](../serialization) on exporting 🤗 Transformers models for more
23
+ details.
24
+
25
+ ## ONNX Configurations
26
+
27
+ We provide three abstract classes that you should inherit from, depending on the
28
+ type of model architecture you wish to export:
29
+
30
+ * Encoder-based models inherit from [`~onnx.config.OnnxConfig`]
31
+ * Decoder-based models inherit from [`~onnx.config.OnnxConfigWithPast`]
32
+ * Encoder-decoder models inherit from [`~onnx.config.OnnxSeq2SeqConfigWithPast`]
33
+
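+ For instance, a custom configuration mostly has to describe the input axes of the model. A minimal sketch (the class name is hypothetical; the axis names follow the convention used by the built-in configurations):
+
+ ```python
+ from collections import OrderedDict
+
+ from transformers.onnx import OnnxConfig
+
+
+ class CustomOnnxConfig(OnnxConfig):
+     @property
+     def inputs(self):
+         # Dynamic axes: the batch dimension and the sequence length.
+         return OrderedDict(
+             [
+                 ("input_ids", {0: "batch", 1: "sequence"}),
+                 ("attention_mask", {0: "batch", 1: "sequence"}),
+             ]
+         )
+ ```
+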
34
+ ### OnnxConfig
35
+
36
+ [[autodoc]] onnx.config.OnnxConfig
37
+
38
+ ### OnnxConfigWithPast
39
+
40
+ [[autodoc]] onnx.config.OnnxConfigWithPast
41
+
42
+ ### OnnxSeq2SeqConfigWithPast
43
+
44
+ [[autodoc]] onnx.config.OnnxSeq2SeqConfigWithPast
45
+
46
+ ## ONNX Features
47
+
48
+ Each ONNX configuration is associated with a set of _features_ that enable you
49
+ to export models for different types of topologies or tasks.
50
+
51
+ ### FeaturesManager
52
+
53
+ [[autodoc]] onnx.features.FeaturesManager
54
+
docs/transformers/docs/source/en/main_classes/optimizer_schedules.md ADDED
@@ -0,0 +1,76 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Optimization
18
+
19
+ The `.optimization` module provides:
20
+
21
+ - an optimizer with fixed weight decay that can be used to fine-tune models,
22
+ - several schedules in the form of schedule objects that inherit from `_LRSchedule`, and
23
+ - a gradient accumulation class to accumulate the gradients of multiple batches.
24
+
25
+
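+ For example, the schedules below attach to any PyTorch optimizer (the model, warmup and total step counts are illustrative):
+
+ ```python
+ import torch
+
+ from transformers import get_linear_schedule_with_warmup
+
+ model = torch.nn.Linear(10, 2)  # stand-in for a real model
+ optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
+ scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=100, num_training_steps=1000)
+
+ for step in range(1000):
+     # ... forward/backward pass goes here ...
+     optimizer.step()
+     scheduler.step()
+     optimizer.zero_grad()
+ ```
+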
26
+ ## AdaFactor (PyTorch)
27
+
28
+ [[autodoc]] Adafactor
29
+
30
+ ## AdamWeightDecay (TensorFlow)
31
+
32
+ [[autodoc]] AdamWeightDecay
33
+
34
+ [[autodoc]] create_optimizer
35
+
36
+ ## Schedules
37
+
38
+ ### Learning Rate Schedules (PyTorch)
39
+
40
+ [[autodoc]] SchedulerType
41
+
42
+ [[autodoc]] get_scheduler
43
+
44
+ [[autodoc]] get_constant_schedule
45
+
46
+ [[autodoc]] get_constant_schedule_with_warmup
47
+
48
+ <img alt="" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/warmup_constant_schedule.png"/>
49
+
50
+ [[autodoc]] get_cosine_schedule_with_warmup
51
+
52
+ <img alt="" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/warmup_cosine_schedule.png"/>
53
+
54
+ [[autodoc]] get_cosine_with_hard_restarts_schedule_with_warmup
55
+
56
+ <img alt="" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/warmup_cosine_hard_restarts_schedule.png"/>
57
+
58
+ [[autodoc]] get_linear_schedule_with_warmup
59
+
60
+ <img alt="" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/warmup_linear_schedule.png"/>
61
+
62
+ [[autodoc]] get_polynomial_decay_schedule_with_warmup
63
+
64
+ [[autodoc]] get_inverse_sqrt_schedule
65
+
66
+ [[autodoc]] get_wsd_schedule
67
+
68
+ ### Warmup (TensorFlow)
69
+
70
+ [[autodoc]] WarmUp
71
+
72
+ ## Gradient Strategies
73
+
74
+ ### GradientAccumulator (TensorFlow)
75
+
76
+ [[autodoc]] GradientAccumulator
docs/transformers/docs/source/en/main_classes/output.md ADDED
@@ -0,0 +1,321 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Model outputs
18
+
19
+ All models have outputs that are instances of subclasses of [`~utils.ModelOutput`]. Those are
20
+ data structures containing all the information returned by the model, but that can also be used as tuples or
21
+ dictionaries.
22
+
23
+ Let's see how this looks in an example:
24
+
25
+ ```python
26
+ from transformers import BertTokenizer, BertForSequenceClassification
27
+ import torch
28
+
29
+ tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
30
+ model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-uncased")
31
+
32
+ inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
33
+ labels = torch.tensor([1]).unsqueeze(0) # Batch size 1
34
+ outputs = model(**inputs, labels=labels)
35
+ ```
36
+
37
+ The `outputs` object is a [`~modeling_outputs.SequenceClassifierOutput`]. As we can see in the
38
+ documentation of that class below, it has an optional `loss`, a `logits`, an optional `hidden_states`, and
39
+ an optional `attentions` attribute. Here we have the `loss` since we passed along `labels`, but we don't have
40
+ `hidden_states` and `attentions` because we didn't pass `output_hidden_states=True` or
41
+ `output_attentions=True`.
42
+
43
+ <Tip>
44
+
45
+ When passing `output_hidden_states=True` you may expect the `outputs.hidden_states[-1]` to match `outputs.last_hidden_state` exactly.
46
+ However, this is not always the case. Some models apply normalization or subsequent processing to the last hidden state when it's returned.
47
+
48
+ </Tip>
49
+
50
+
51
+ You can access each attribute as you would usually do, and if that attribute has not been returned by the model, you
52
+ will get `None`. Here for instance `outputs.loss` is the loss computed by the model, and `outputs.attentions` is
53
+ `None`.
54
+
55
+ When considering our `outputs` object as a tuple, it only considers the attributes that don't have `None` values.
56
+ Here for instance, it has two elements, `loss` then `logits`, so
57
+
58
+ ```python
59
+ outputs[:2]
60
+ ```
61
+
62
+ will return the tuple `(outputs.loss, outputs.logits)` for instance.
63
+
64
+ When considering our `outputs` object as a dictionary, it only considers the attributes that don't have `None`
65
+ values. Here for instance, it has two keys that are `loss` and `logits`.
66
+
67
+ We document here the generic model outputs that are used by more than one model type. Specific output types are
68
+ documented on their corresponding model page.
69
+
70
+ ## ModelOutput
71
+
72
+ [[autodoc]] utils.ModelOutput
73
+ - to_tuple
74
+
75
+ ## BaseModelOutput
76
+
77
+ [[autodoc]] modeling_outputs.BaseModelOutput
78
+
79
+ ## BaseModelOutputWithPooling
80
+
81
+ [[autodoc]] modeling_outputs.BaseModelOutputWithPooling
82
+
83
+ ## BaseModelOutputWithCrossAttentions
84
+
85
+ [[autodoc]] modeling_outputs.BaseModelOutputWithCrossAttentions
86
+
87
+ ## BaseModelOutputWithPoolingAndCrossAttentions
88
+
89
+ [[autodoc]] modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions
90
+
91
+ ## BaseModelOutputWithPast
92
+
93
+ [[autodoc]] modeling_outputs.BaseModelOutputWithPast
94
+
95
+ ## BaseModelOutputWithPastAndCrossAttentions
96
+
97
+ [[autodoc]] modeling_outputs.BaseModelOutputWithPastAndCrossAttentions
98
+
99
+ ## Seq2SeqModelOutput
100
+
101
+ [[autodoc]] modeling_outputs.Seq2SeqModelOutput
102
+
103
+ ## CausalLMOutput
104
+
105
+ [[autodoc]] modeling_outputs.CausalLMOutput
106
+
107
+ ## CausalLMOutputWithCrossAttentions
108
+
109
+ [[autodoc]] modeling_outputs.CausalLMOutputWithCrossAttentions
110
+
111
+ ## CausalLMOutputWithPast
112
+
113
+ [[autodoc]] modeling_outputs.CausalLMOutputWithPast
114
+
115
+ ## MaskedLMOutput
116
+
117
+ [[autodoc]] modeling_outputs.MaskedLMOutput
118
+
119
+ ## Seq2SeqLMOutput
120
+
121
+ [[autodoc]] modeling_outputs.Seq2SeqLMOutput
122
+
123
+ ## NextSentencePredictorOutput
124
+
125
+ [[autodoc]] modeling_outputs.NextSentencePredictorOutput
126
+
127
+ ## SequenceClassifierOutput
128
+
129
+ [[autodoc]] modeling_outputs.SequenceClassifierOutput
130
+
131
+ ## Seq2SeqSequenceClassifierOutput
132
+
133
+ [[autodoc]] modeling_outputs.Seq2SeqSequenceClassifierOutput
134
+
135
+ ## MultipleChoiceModelOutput
136
+
137
+ [[autodoc]] modeling_outputs.MultipleChoiceModelOutput
138
+
139
+ ## TokenClassifierOutput
140
+
141
+ [[autodoc]] modeling_outputs.TokenClassifierOutput
142
+
143
+ ## QuestionAnsweringModelOutput
144
+
145
+ [[autodoc]] modeling_outputs.QuestionAnsweringModelOutput
146
+
147
+ ## Seq2SeqQuestionAnsweringModelOutput
148
+
149
+ [[autodoc]] modeling_outputs.Seq2SeqQuestionAnsweringModelOutput
150
+
151
+ ## Seq2SeqSpectrogramOutput
152
+
153
+ [[autodoc]] modeling_outputs.Seq2SeqSpectrogramOutput
154
+
155
+ ## SemanticSegmenterOutput
156
+
157
+ [[autodoc]] modeling_outputs.SemanticSegmenterOutput
158
+
159
+ ## ImageClassifierOutput
160
+
161
+ [[autodoc]] modeling_outputs.ImageClassifierOutput
162
+
163
+ ## ImageClassifierOutputWithNoAttention
164
+
165
+ [[autodoc]] modeling_outputs.ImageClassifierOutputWithNoAttention
166
+
167
+ ## DepthEstimatorOutput
168
+
169
+ [[autodoc]] modeling_outputs.DepthEstimatorOutput
170
+
171
+ ## Wav2Vec2BaseModelOutput
172
+
173
+ [[autodoc]] modeling_outputs.Wav2Vec2BaseModelOutput
174
+
175
+ ## XVectorOutput
176
+
177
+ [[autodoc]] modeling_outputs.XVectorOutput
178
+
179
+ ## Seq2SeqTSModelOutput
180
+
181
+ [[autodoc]] modeling_outputs.Seq2SeqTSModelOutput
182
+
183
+ ## Seq2SeqTSPredictionOutput
184
+
185
+ [[autodoc]] modeling_outputs.Seq2SeqTSPredictionOutput
186
+
187
+ ## SampleTSPredictionOutput
188
+
189
+ [[autodoc]] modeling_outputs.SampleTSPredictionOutput
190
+
191
+ ## TFBaseModelOutput
192
+
193
+ [[autodoc]] modeling_tf_outputs.TFBaseModelOutput
194
+
195
+ ## TFBaseModelOutputWithPooling
196
+
197
+ [[autodoc]] modeling_tf_outputs.TFBaseModelOutputWithPooling
198
+
199
+ ## TFBaseModelOutputWithPoolingAndCrossAttentions
200
+
201
+ [[autodoc]] modeling_tf_outputs.TFBaseModelOutputWithPoolingAndCrossAttentions
202
+
203
+ ## TFBaseModelOutputWithPast
204
+
205
+ [[autodoc]] modeling_tf_outputs.TFBaseModelOutputWithPast
206
+
207
+ ## TFBaseModelOutputWithPastAndCrossAttentions
208
+
209
+ [[autodoc]] modeling_tf_outputs.TFBaseModelOutputWithPastAndCrossAttentions
210
+
211
+ ## TFSeq2SeqModelOutput
212
+
213
+ [[autodoc]] modeling_tf_outputs.TFSeq2SeqModelOutput
214
+
215
+ ## TFCausalLMOutput
216
+
217
+ [[autodoc]] modeling_tf_outputs.TFCausalLMOutput
218
+
219
+ ## TFCausalLMOutputWithCrossAttentions
220
+
221
+ [[autodoc]] modeling_tf_outputs.TFCausalLMOutputWithCrossAttentions
222
+
223
+ ## TFCausalLMOutputWithPast
224
+
225
+ [[autodoc]] modeling_tf_outputs.TFCausalLMOutputWithPast
226
+
227
+ ## TFMaskedLMOutput
228
+
229
+ [[autodoc]] modeling_tf_outputs.TFMaskedLMOutput
230
+
231
+ ## TFSeq2SeqLMOutput
232
+
233
+ [[autodoc]] modeling_tf_outputs.TFSeq2SeqLMOutput
234
+
235
+ ## TFNextSentencePredictorOutput
236
+
237
+ [[autodoc]] modeling_tf_outputs.TFNextSentencePredictorOutput
238
+
239
+ ## TFSequenceClassifierOutput
240
+
241
+ [[autodoc]] modeling_tf_outputs.TFSequenceClassifierOutput
242
+
243
+ ## TFSeq2SeqSequenceClassifierOutput
244
+
245
+ [[autodoc]] modeling_tf_outputs.TFSeq2SeqSequenceClassifierOutput
246
+
247
+ ## TFMultipleChoiceModelOutput
248
+
249
+ [[autodoc]] modeling_tf_outputs.TFMultipleChoiceModelOutput
250
+
251
+ ## TFTokenClassifierOutput
252
+
253
+ [[autodoc]] modeling_tf_outputs.TFTokenClassifierOutput
254
+
255
+ ## TFQuestionAnsweringModelOutput
256
+
257
+ [[autodoc]] modeling_tf_outputs.TFQuestionAnsweringModelOutput
258
+
259
+ ## TFSeq2SeqQuestionAnsweringModelOutput
260
+
261
+ [[autodoc]] modeling_tf_outputs.TFSeq2SeqQuestionAnsweringModelOutput
262
+
263
+ ## FlaxBaseModelOutput
264
+
265
+ [[autodoc]] modeling_flax_outputs.FlaxBaseModelOutput
266
+
267
+ ## FlaxBaseModelOutputWithPast
268
+
269
+ [[autodoc]] modeling_flax_outputs.FlaxBaseModelOutputWithPast
270
+
271
+ ## FlaxBaseModelOutputWithPooling
272
+
273
+ [[autodoc]] modeling_flax_outputs.FlaxBaseModelOutputWithPooling
274
+
275
+ ## FlaxBaseModelOutputWithPastAndCrossAttentions
276
+
277
+ [[autodoc]] modeling_flax_outputs.FlaxBaseModelOutputWithPastAndCrossAttentions
278
+
279
+ ## FlaxSeq2SeqModelOutput
280
+
281
+ [[autodoc]] modeling_flax_outputs.FlaxSeq2SeqModelOutput
282
+
283
+ ## FlaxCausalLMOutputWithCrossAttentions
284
+
285
+ [[autodoc]] modeling_flax_outputs.FlaxCausalLMOutputWithCrossAttentions
286
+
287
+ ## FlaxMaskedLMOutput
288
+
289
+ [[autodoc]] modeling_flax_outputs.FlaxMaskedLMOutput
290
+
291
+ ## FlaxSeq2SeqLMOutput
292
+
293
+ [[autodoc]] modeling_flax_outputs.FlaxSeq2SeqLMOutput
294
+
295
+ ## FlaxNextSentencePredictorOutput
296
+
297
+ [[autodoc]] modeling_flax_outputs.FlaxNextSentencePredictorOutput
298
+
299
+ ## FlaxSequenceClassifierOutput
300
+
301
+ [[autodoc]] modeling_flax_outputs.FlaxSequenceClassifierOutput
302
+
303
+ ## FlaxSeq2SeqSequenceClassifierOutput
304
+
305
+ [[autodoc]] modeling_flax_outputs.FlaxSeq2SeqSequenceClassifierOutput
306
+
307
+ ## FlaxMultipleChoiceModelOutput
308
+
309
+ [[autodoc]] modeling_flax_outputs.FlaxMultipleChoiceModelOutput
310
+
311
+ ## FlaxTokenClassifierOutput
312
+
313
+ [[autodoc]] modeling_flax_outputs.FlaxTokenClassifierOutput
314
+
315
+ ## FlaxQuestionAnsweringModelOutput
316
+
317
+ [[autodoc]] modeling_flax_outputs.FlaxQuestionAnsweringModelOutput
318
+
319
+ ## FlaxSeq2SeqQuestionAnsweringModelOutput
320
+
321
+ [[autodoc]] modeling_flax_outputs.FlaxSeq2SeqQuestionAnsweringModelOutput
docs/transformers/docs/source/en/main_classes/peft.md ADDED
@@ -0,0 +1,23 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2024 The HuggingFace Team. All rights reserved.
2
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
3
+ the License. You may obtain a copy of the License at
4
+ http://www.apache.org/licenses/LICENSE-2.0
5
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
6
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
7
+ specific language governing permissions and limitations under the License.
8
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
9
+ rendered properly in your Markdown viewer.
10
+ -->
11
+
12
+ # PEFT
13
+
14
+ The [`~integrations.PeftAdapterMixin`] provides functions from the [PEFT](https://huggingface.co/docs/peft/index) library for managing adapters with Transformers. This mixin currently supports LoRA, IA3, and AdaLora. Prefix tuning methods (prompt tuning, prompt learning) aren't supported because they can't be injected into a torch module.
15
+
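+ A short sketch of loading and toggling a LoRA adapter on a base model (the base checkpoint and adapter repository names are illustrative):
+
+ ```python
+ from transformers import AutoModelForCausalLM
+
+ model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
+
+ # Load a LoRA adapter from the Hub into the base model and activate it.
+ model.load_adapter("ybelkada/opt-350m-lora", adapter_name="lora_1")
+ model.set_adapter("lora_1")
+
+ # Temporarily fall back to the base model, then re-enable the adapter.
+ model.disable_adapters()
+ model.enable_adapters()
+ ```
+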
16
+ [[autodoc]] integrations.PeftAdapterMixin
17
+ - load_adapter
18
+ - add_adapter
19
+ - set_adapter
20
+ - disable_adapters
21
+ - enable_adapters
22
+ - active_adapters
23
+ - get_adapter_state_dict
docs/transformers/docs/source/en/main_classes/pipelines.md ADDED
@@ -0,0 +1,501 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Pipelines
18
+
19
+ The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of
20
+ the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity
21
+ Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the
22
+ [task summary](../task_summary) for examples of use.
23
+
24
+ There are two categories of pipeline abstractions to be aware of:
25
+
26
+ - The [`pipeline`] which is the most powerful object encapsulating all other pipelines.
27
+ - Task-specific pipelines are available for [audio](#audio), [computer vision](#computer-vision), [natural language processing](#natural-language-processing), and [multimodal](#multimodal) tasks.
28
+
29
+ ## The pipeline abstraction
30
+
31
+ The *pipeline* abstraction is a wrapper around all the other available pipelines. It is instantiated as any other
32
+ pipeline but can provide additional quality-of-life features.
33
+
34
+ Simple call on one item:
35
+
36
+ ```python
37
+ >>> pipe = pipeline("text-classification")
38
+ >>> pipe("This restaurant is awesome")
39
+ [{'label': 'POSITIVE', 'score': 0.9998743534088135}]
40
+ ```
41
+
42
+ If you want to use a specific model from the [hub](https://huggingface.co) you can ignore the task if the model on
43
+ the hub already defines it:
44
+
45
+ ```python
46
+ >>> pipe = pipeline(model="FacebookAI/roberta-large-mnli")
47
+ >>> pipe("This restaurant is awesome")
48
+ [{'label': 'NEUTRAL', 'score': 0.7313136458396912}]
49
+ ```
50
+
51
+ To call a pipeline on many items, you can call it with a *list*.
52
+
53
+ ```python
54
+ >>> pipe = pipeline("text-classification")
55
+ >>> pipe(["This restaurant is awesome", "This restaurant is awful"])
56
+ [{'label': 'POSITIVE', 'score': 0.9998743534088135},
57
+ {'label': 'NEGATIVE', 'score': 0.9996669292449951}]
58
+ ```
59
+
60
+ To iterate over full datasets, it is recommended to use a `dataset` directly. This means you don't need to allocate
61
+ the whole dataset at once, nor do you need to do batching yourself. This should work just as fast as custom loops on
62
+ GPU. If it doesn't, don't hesitate to create an issue.
63
+
64
+ ```python
65
+ import datasets
66
+ from transformers import pipeline
67
+ from transformers.pipelines.pt_utils import KeyDataset
68
+ from tqdm.auto import tqdm
69
+
70
+ pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0)
71
+ dataset = datasets.load_dataset("superb", name="asr", split="test")
72
+
73
+ # KeyDataset (only *pt*) will simply return the item in the dict returned by the dataset item
74
+ # as we're not interested in the *target* part of the dataset. For sentence pair use KeyPairDataset
75
+ for out in tqdm(pipe(KeyDataset(dataset, "file"))):
76
+     print(out)
77
+     # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"}
78
+     # {"text": ....}
79
+     # ....
80
+ ```
81
+
82
+ For ease of use, a generator is also possible:
83
+
84
+
85
+ ```python
86
+ from transformers import pipeline
87
+
88
+ pipe = pipeline("text-classification")
89
+
90
+
91
+ def data():
92
+     while True:
93
+         # This could come from a dataset, a database, a queue or an HTTP request
94
+         # in a server.
95
+         # Caveat: because this is iterative, you cannot use `num_workers > 1`
96
+         # to use multiple threads to preprocess data. You can still have 1 thread that
97
+         # does the preprocessing while the main thread runs the big inference
98
+         yield "This is a test"
99
+
100
+
101
+ for out in pipe(data()):
102
+     print(out)
103
+     # [{'label': 'POSITIVE', 'score': ...}]
104
+     # [{'label': ....}]
105
+     # ....
106
+ ```
107
+
108
+ [[autodoc]] pipeline
109
+
110
+ ## Pipeline batching
111
+
112
+ All pipelines can use batching. This will work
113
+ whenever the pipeline uses its streaming ability (so when passing lists or `Dataset` or `generator`).
114
+
115
+ ```python
116
+ from transformers import pipeline
117
+ from transformers.pipelines.pt_utils import KeyDataset
118
+ import datasets
119
+
120
+ dataset = datasets.load_dataset("imdb", name="plain_text", split="unsupervised")
121
+ pipe = pipeline("text-classification", device=0)
122
+ for out in pipe(KeyDataset(dataset, "text"), batch_size=8, truncation="only_first"):
123
+     print(out)
124
+     # [{'label': 'POSITIVE', 'score': 0.9998743534088135}]
125
+     # Exactly the same output as before, but the contents are passed
126
+     # as batches to the model
127
+ ```
128
+
129
+ <Tip warning={true}>
130
+
131
+ However, this is not automatically a win for performance. It can be either a 10x speedup or 5x slowdown depending
132
+ on hardware, data and the actual model being used.
133
+
134
+ Example where it's mostly a speedup:
135
+
136
+ </Tip>
137
+
138
+ ```python
139
+ from transformers import pipeline
140
+ from torch.utils.data import Dataset
141
+ from tqdm.auto import tqdm
142
+
143
+ pipe = pipeline("text-classification", device=0)
144
+
145
+
146
+ class MyDataset(Dataset):
147
+     def __len__(self):
148
+         return 5000
149
+
150
+     def __getitem__(self, i):
151
+         return "This is a test"
152
+
153
+
154
+ dataset = MyDataset()
155
+
156
+ for batch_size in [1, 8, 64, 256]:
157
+ print("-" * 30)
158
+ print(f"Streaming batch_size={batch_size}")
159
+ for out in tqdm(pipe(dataset, batch_size=batch_size), total=len(dataset)):
160
+ pass
161
+ ```
162
+
163
+ ```
164
+ # On GTX 970
165
+ ------------------------------
166
+ Streaming no batching
167
+ 100%|██████████████████████████████████████████████████████████████████████| 5000/5000 [00:26<00:00, 187.52it/s]
168
+ ------------------------------
169
+ Streaming batch_size=8
170
+ 100%|█████████████████████████████████████████████████████████████████████| 5000/5000 [00:04<00:00, 1205.95it/s]
171
+ ------------------------------
172
+ Streaming batch_size=64
173
+ 100%|█████████████████████████████████████████████████████████████████████| 5000/5000 [00:02<00:00, 2478.24it/s]
174
+ ------------------------------
175
+ Streaming batch_size=256
176
+ 100%|█████████████████████████████████████████████████████████████████████| 5000/5000 [00:01<00:00, 2554.43it/s]
177
+ (diminishing returns, saturated the GPU)
178
+ ```
179
+
180
+ Example where it's mostly a slowdown:
181
+
182
+ ```python
183
+ class MyDataset(Dataset):
184
+     def __len__(self):
185
+         return 5000
186
+
187
+     def __getitem__(self, i):
188
+         if i % 64 == 0:
189
+             n = 100
190
+         else:
191
+             n = 1
192
+         return "This is a test" * n
193
+ ```
194
+
195
+ Here, an occasional sentence is very long compared to the others. In that case, the **whole** batch will need to be 400
196
+ tokens long, so the whole batch will be [64, 400] instead of [64, 4], leading to the high slowdown. Even worse, on
197
+ bigger batches, the program simply crashes.
198
+
199
+
200
+ ```
201
+ ------------------------------
202
+ Streaming no batching
203
+ 100%|█████████████████████████████████████████████████████████████████████| 1000/1000 [00:05<00:00, 183.69it/s]
204
+ ------------------------------
205
+ Streaming batch_size=8
206
+ 100%|█████████████████████████████████████████████████████████████████████| 1000/1000 [00:03<00:00, 265.74it/s]
207
+ ------------------------------
208
+ Streaming batch_size=64
209
+ 100%|██████████████████████████████████████████████████████████████████████| 1000/1000 [00:26<00:00, 37.80it/s]
210
+ ------------------------------
211
+ Streaming batch_size=256
212
+ 0%| | 0/1000 [00:00<?, ?it/s]
213
+ Traceback (most recent call last):
214
+ File "/home/nicolas/src/transformers/test.py", line 42, in <module>
215
+ for out in tqdm(pipe(dataset, batch_size=256), total=len(dataset)):
216
+ ....
217
+ q = q / math.sqrt(dim_per_head) # (bs, n_heads, q_length, dim_per_head)
218
+ RuntimeError: CUDA out of memory. Tried to allocate 376.00 MiB (GPU 0; 3.95 GiB total capacity; 1.72 GiB already allocated; 354.88 MiB free; 2.46 GiB reserved in total by PyTorch)
219
+ ```
220
+
221
+ There are no good (general) solutions for this problem, and your mileage may vary depending on your use case.
222
+
223
+
224
+ For users, a rule of thumb is:
225
+
226
+ - **Measure performance on your load, with your hardware. Measure, measure, and keep measuring. Real numbers are the
227
+ only way to go.**
228
+ - If you are latency constrained (live product doing inference), don't batch.
229
+ - If you are using CPU, don't batch.
230
+ - If you are using throughput (you want to run your model on a bunch of static data), on GPU, then:
231
+
232
+ - If you have no clue about the size of the sequence_length ("natural" data), by default don't batch, measure and
233
+ try tentatively to add it, add OOM checks to recover when it will fail (and it will at some point if you don't
234
+ control the sequence_length.)
235
+ - If your sequence_length is super regular, then batching is more likely to be VERY interesting, measure and push
236
+ it until you get OOMs.
237
+ - The larger the GPU, the more likely batching is going to be interesting
238
+ - As soon as you enable batching, make sure you can handle OOMs nicely.
239
+
240
+ ## Pipeline chunk batching
241
+
242
+ `zero-shot-classification` and `question-answering` are slightly specific in the sense that a single input might yield
243
+ multiple forward passes of a model. Under normal circumstances, this would cause issues with the `batch_size` argument.
244
+
245
+ In order to circumvent this issue, both of these pipelines are a bit specific: they are `ChunkPipeline` instead of
246
+ regular `Pipeline`. In short:
247
+
248
+
249
+ ```python
250
+ preprocessed = pipe.preprocess(inputs)
251
+ model_outputs = pipe.forward(preprocessed)
252
+ outputs = pipe.postprocess(model_outputs)
253
+ ```
254
+
255
+ Now becomes:
256
+
257
+
258
+ ```python
259
+ all_model_outputs = []
260
+ for preprocessed in pipe.preprocess(inputs):
261
+ model_outputs = pipe.forward(preprocessed)
262
+ all_model_outputs.append(model_outputs)
263
+ outputs = pipe.postprocess(all_model_outputs)
264
+ ```
265
+
266
+ This should be very transparent to your code because the pipelines are used in
267
+ the same way.
268
+
269
+ This is a simplified view, since the pipeline can handle the batching automatically! Meaning you don't have to care
270
+ about how many forward passes your inputs are actually going to trigger; you can optimize the `batch_size`
271
+ independently of the inputs. The caveats from the previous section still apply.
272
+
273
+ ## Pipeline FP16 inference
274
+ Models can be run in FP16 which can be significantly faster on GPU while saving memory. Most models will not suffer noticeable performance loss from this. The larger the model, the less likely that it will.
275
+
276
+ To enable FP16 inference, you can simply pass `torch_dtype=torch.float16` or `torch_dtype='float16'` to the pipeline constructor. Note that this only works for models with a PyTorch backend. Your inputs will be converted to FP16 internally.
277
+
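+ For example:
+
+ ```python
+ import torch
+ from transformers import pipeline
+
+ pipe = pipeline(model="openai-community/gpt2", torch_dtype=torch.float16, device=0)
+ out = pipe("This is a test")
+ ```
+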
278
+ ## Pipeline custom code
279
+
280
+ If you want to override a specific pipeline, don't hesitate to create an issue for your task at hand; the goal of the
281
+ pipelines is to be easy to use and to support most cases, so `transformers` could maybe support your use case.
282
+
283
+
284
+
285
+
286
+ If you simply want to try it out, you can:
287
+
288
+ - Subclass your pipeline of choice
289
+
290
+ ```python
291
+ class MyPipeline(TextClassificationPipeline):
292
+     def postprocess(self, model_outputs, **kwargs):
293
+         # Your code goes here
294
+         scores = scores * 100
295
+         # And here
296
+
297
+
298
+ my_pipeline = MyPipeline(model=model, tokenizer=tokenizer, ...)
299
+ # or if you use *pipeline* function, then:
300
+ my_pipeline = pipeline(model="xxxx", pipeline_class=MyPipeline)
301
+ ```
302
+
303
+ That should enable you to do all the custom code you want.
304
+
305
+
306
+ ## Implementing a pipeline
307
+
308
+ [Implementing a new pipeline](../add_new_pipeline)
309
+
310
+ ## Audio
311
+
312
+ Pipelines available for audio tasks include the following.
313
+
314
+ ### AudioClassificationPipeline
315
+
316
+ [[autodoc]] AudioClassificationPipeline
317
+ - __call__
318
+ - all
319
+
320
+ ### AutomaticSpeechRecognitionPipeline
321
+
322
+ [[autodoc]] AutomaticSpeechRecognitionPipeline
323
+ - __call__
324
+ - all
325
+
326
+ ### TextToAudioPipeline
327
+
328
+ [[autodoc]] TextToAudioPipeline
329
+ - __call__
330
+ - all
331
+
332
+
333
+ ### ZeroShotAudioClassificationPipeline
334
+
335
+ [[autodoc]] ZeroShotAudioClassificationPipeline
336
+ - __call__
337
+ - all
338
+
339
+ ## Computer vision
340
+
341
+ Pipelines available for computer vision tasks include the following.
342
+
343
+ ### DepthEstimationPipeline
344
+ [[autodoc]] DepthEstimationPipeline
345
+ - __call__
346
+ - all
347
+
348
+ ### ImageClassificationPipeline
349
+
350
+ [[autodoc]] ImageClassificationPipeline
351
+ - __call__
352
+ - all
353
+
354
+ ### ImageSegmentationPipeline
355
+
356
+ [[autodoc]] ImageSegmentationPipeline
357
+ - __call__
358
+ - all
359
+
360
+ ### ImageToImagePipeline
361
+
362
+ [[autodoc]] ImageToImagePipeline
363
+ - __call__
364
+ - all
365
+
366
+ ### ObjectDetectionPipeline
367
+
368
+ [[autodoc]] ObjectDetectionPipeline
369
+ - __call__
370
+ - all
371
+
372
+ ### VideoClassificationPipeline
373
+
374
+ [[autodoc]] VideoClassificationPipeline
375
+ - __call__
376
+ - all
377
+
378
+ ### ZeroShotImageClassificationPipeline
379
+
380
+ [[autodoc]] ZeroShotImageClassificationPipeline
381
+ - __call__
382
+ - all
383
+
384
+ ### ZeroShotObjectDetectionPipeline
385
+
386
+ [[autodoc]] ZeroShotObjectDetectionPipeline
387
+ - __call__
388
+ - all
389
+
390
+ ## Natural Language Processing
391
+
392
+ Pipelines available for natural language processing tasks include the following.
393
+
394
+ ### FillMaskPipeline
395
+
396
+ [[autodoc]] FillMaskPipeline
397
+ - __call__
398
+ - all
399
+
400
+ ### QuestionAnsweringPipeline
401
+
402
+ [[autodoc]] QuestionAnsweringPipeline
403
+ - __call__
404
+ - all
405
+
406
+ ### SummarizationPipeline
407
+
408
+ [[autodoc]] SummarizationPipeline
409
+ - __call__
410
+ - all
411
+
412
+ ### TableQuestionAnsweringPipeline
413
+
414
+ [[autodoc]] TableQuestionAnsweringPipeline
415
+ - __call__
416
+
417
+ ### TextClassificationPipeline
418
+
419
+ [[autodoc]] TextClassificationPipeline
420
+ - __call__
421
+ - all
422
+
423
+ ### TextGenerationPipeline
424
+
425
+ [[autodoc]] TextGenerationPipeline
426
+ - __call__
427
+ - all
428
+
429
+ ### Text2TextGenerationPipeline
430
+
431
+ [[autodoc]] Text2TextGenerationPipeline
432
+ - __call__
433
+ - all
434
+
435
+ ### TokenClassificationPipeline
436
+
437
+ [[autodoc]] TokenClassificationPipeline
438
+ - __call__
439
+ - all
440
+
441
+ ### TranslationPipeline
442
+
443
+ [[autodoc]] TranslationPipeline
444
+ - __call__
445
+ - all
446
+
447
+ ### ZeroShotClassificationPipeline
448
+
449
+ [[autodoc]] ZeroShotClassificationPipeline
450
+ - __call__
451
+ - all
452
+
453
+ ## Multimodal
454
+
455
+ Pipelines available for multimodal tasks include the following.
456
+
457
+ ### DocumentQuestionAnsweringPipeline
458
+
459
+ [[autodoc]] DocumentQuestionAnsweringPipeline
460
+ - __call__
461
+ - all
462
+
463
+ ### FeatureExtractionPipeline
464
+
465
+ [[autodoc]] FeatureExtractionPipeline
466
+ - __call__
467
+ - all
468
+
469
+ ### ImageFeatureExtractionPipeline
470
+
471
+ [[autodoc]] ImageFeatureExtractionPipeline
472
+ - __call__
473
+ - all
474
+
475
+ ### ImageToTextPipeline
476
+
477
+ [[autodoc]] ImageToTextPipeline
478
+ - __call__
479
+ - all
480
+
481
+ ### ImageTextToTextPipeline
482
+
483
+ [[autodoc]] ImageTextToTextPipeline
484
+ - __call__
485
+ - all
486
+
487
+ ### MaskGenerationPipeline
488
+
489
+ [[autodoc]] MaskGenerationPipeline
490
+ - __call__
491
+ - all
492
+
493
+ ### VisualQuestionAnsweringPipeline
494
+
495
+ [[autodoc]] VisualQuestionAnsweringPipeline
496
+ - __call__
497
+ - all
498
+
499
+ ## Parent class: `Pipeline`
500
+
501
+ [[autodoc]] Pipeline
docs/transformers/docs/source/en/main_classes/processors.md ADDED
@@ -0,0 +1,163 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Processors
18
+
19
+ Processors can mean two different things in the Transformers library:
20
+ - the objects that pre-process inputs for multi-modal models such as [Wav2Vec2](../model_doc/wav2vec2) (speech and text)
21
+ or [CLIP](../model_doc/clip) (text and vision)
22
+ - deprecated objects that were used in older versions of the library to preprocess data for GLUE or SQuAD.
23
+
24
+ ## Multi-modal processors
25
+
26
+ Any multi-modal model will require an object to encode or decode the data that groups several modalities (among text,
27
+ vision and audio). This is handled by objects called processors, which group together two or more processing objects
28
+ such as tokenizers (for the text modality), image processors (for vision) and feature extractors (for audio).
29
+
30
+ Those processors inherit from the following base class that implements the saving and loading functionality:
31
+
32
+ [[autodoc]] ProcessorMixin
33
+
34
+ ## Deprecated processors
35
+
36
+ All processors follow the same architecture which is that of the
37
+ [`~data.processors.utils.DataProcessor`]. The processor returns a list of
38
+ [`~data.processors.utils.InputExample`]. These
39
+ [`~data.processors.utils.InputExample`] can be converted to
40
+ [`~data.processors.utils.InputFeatures`] in order to be fed to the model.
41
+
42
+ [[autodoc]] data.processors.utils.DataProcessor
43
+
44
+ [[autodoc]] data.processors.utils.InputExample
45
+
46
+ [[autodoc]] data.processors.utils.InputFeatures
47
+
48
+ ## GLUE
49
+
50
+ [General Language Understanding Evaluation (GLUE)](https://gluebenchmark.com/) is a benchmark that evaluates the
51
+ performance of models across a diverse set of existing NLU tasks. It was released together with the paper [GLUE: A
52
+ multi-task benchmark and analysis platform for natural language understanding](https://openreview.net/pdf?id=rJ4km2R5t7)
53
+
54
+ This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB,
55
+ QQP, QNLI, RTE and WNLI.
56
+
57
+ Those processors are:
58
+
59
+ - [`~data.processors.utils.MrpcProcessor`]
60
+ - [`~data.processors.utils.MnliProcessor`]
61
+ - [`~data.processors.utils.MnliMismatchedProcessor`]
62
+ - [`~data.processors.utils.Sst2Processor`]
63
+ - [`~data.processors.utils.StsbProcessor`]
64
+ - [`~data.processors.utils.QqpProcessor`]
65
+ - [`~data.processors.utils.QnliProcessor`]
66
+ - [`~data.processors.utils.RteProcessor`]
67
+ - [`~data.processors.utils.WnliProcessor`]
68
+
69
+ Additionally, the following method can be used to load values from a data file and convert them to a list of
70
+ [`~data.processors.utils.InputExample`].
71
+
72
+ [[autodoc]] data.processors.glue.glue_convert_examples_to_features
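+
+ As a minimal sketch of how a GLUE processor and this conversion method fit together (the data directory is a placeholder for a folder containing the MRPC files, and the checkpoint is just an example):
+
+ ```python
+ from transformers import AutoTokenizer, glue_convert_examples_to_features
+ from transformers.data.processors.glue import MrpcProcessor
+
+ tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+
+ processor = MrpcProcessor()
+ examples = processor.get_train_examples(mrpc_data_dir)  # placeholder directory
+ features = glue_convert_examples_to_features(examples, tokenizer, max_length=128, task="mrpc")
+ ```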
73
+
74
+
75
+ ## XNLI
76
+
77
+ [The Cross-Lingual NLI Corpus (XNLI)](https://www.nyu.edu/projects/bowman/xnli/) is a benchmark that evaluates the
78
+ quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on [*MultiNLI*](http://www.nyu.edu/projects/bowman/multinli/): pairs of text are labeled with textual entailment annotations for 15
79
+ different languages (including both high-resource languages such as English and low-resource languages such as Swahili).
80
+
81
+ It was released together with the paper [XNLI: Evaluating Cross-lingual Sentence Representations](https://arxiv.org/abs/1809.05053).
82
+
83
+ This library hosts the processor to load the XNLI data:
84
+
85
+ - [`~data.processors.utils.XnliProcessor`]
86
+
87
+ Please note that since the gold labels are available on the test set, evaluation is performed on the test set.
88
+
89
+ An example using these processors is given in the [run_xnli.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification/run_xnli.py) script.
90
+
91
+
92
+ ## SQuAD
93
+
94
+ [The Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer//) is a benchmark that
95
+ evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version
96
+ (v1.1) was released together with the paper [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://arxiv.org/abs/1606.05250). The second version (v2.0) was released alongside the paper [Know What You Don't
97
+ Know: Unanswerable Questions for SQuAD](https://arxiv.org/abs/1806.03822).
98
+
99
+ This library hosts a processor for each of the two versions:
100
+
101
+ ### Processors
102
+
103
+ Those processors are:
104
+
105
+ - [`~data.processors.utils.SquadV1Processor`]
106
+ - [`~data.processors.utils.SquadV2Processor`]
107
+
108
+ They both inherit from the abstract class [`~data.processors.utils.SquadProcessor`].
109
+
110
+ [[autodoc]] data.processors.squad.SquadProcessor
111
+ - all
112
+
113
+ Additionally, the following method can be used to convert SQuAD examples into
114
+ [`~data.processors.utils.SquadFeatures`] that can be used as model inputs.
115
+
116
+ [[autodoc]] data.processors.squad.squad_convert_examples_to_features
117
+
118
+
119
+ These processors as well as the aforementioned method can be used with files containing the data as well as with the
120
+ *tensorflow_datasets* package. Examples are given below.
121
+
122
+
123
+ ### Example usage
124
+
125
+ Here is an example using the processors as well as the conversion method using data files:
126
+
127
+ ```python
128
+ from transformers import SquadV1Processor, SquadV2Processor, squad_convert_examples_to_features
+
+ # Loading a V2 processor
129
+ processor = SquadV2Processor()
130
+ examples = processor.get_dev_examples(squad_v2_data_dir)
131
+
132
+ # Loading a V1 processor
133
+ processor = SquadV1Processor()
134
+ examples = processor.get_dev_examples(squad_v1_data_dir)
135
+
136
+ features = squad_convert_examples_to_features(
137
+ examples=examples,
138
+ tokenizer=tokenizer,
139
+ max_seq_length=max_seq_length,
140
+ doc_stride=args.doc_stride,
141
+ max_query_length=max_query_length,
142
+ is_training=not evaluate,
143
+ )
144
+ ```
145
+
146
+ Using *tensorflow_datasets* is as easy as using a data file:
147
+
148
+ ```python
149
+ import tensorflow_datasets as tfds
+
+ # tensorflow_datasets only handles SQuAD V1.
150
+ tfds_examples = tfds.load("squad")
151
+ examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)
152
+
153
+ features = squad_convert_examples_to_features(
154
+ examples=examples,
155
+ tokenizer=tokenizer,
156
+ max_seq_length=max_seq_length,
157
+ doc_stride=args.doc_stride,
158
+ max_query_length=max_query_length,
159
+ is_training=not evaluate,
160
+ )
161
+ ```
162
+
163
+ Another example using these processors is given in the [run_squad.py](https://github.com/huggingface/transformers/tree/main/examples/legacy/question-answering/run_squad.py) script.
docs/transformers/docs/source/en/main_classes/quantization.md ADDED
@@ -0,0 +1,98 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2023 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Quantization
18
+
19
+ Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types like 8-bit integers (int8). This makes it possible to load larger models that normally wouldn't fit into memory and to speed up inference. Transformers supports the AWQ and GPTQ quantization algorithms, as well as 8-bit and 4-bit quantization with bitsandbytes.
20
+
21
+ Quantization techniques that aren't supported in Transformers can be added with the [`HfQuantizer`] class.
22
+
23
+ <Tip>
24
+
25
+ Learn how to quantize models in the [Quantization](../quantization) guide.
26
+
27
+ </Tip>
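+
+ As a brief, hedged sketch of how a quantization config is passed to `from_pretrained` (the checkpoint is only an example, and 4-bit bitsandbytes quantization assumes a CUDA GPU with the `bitsandbytes` package installed):
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+
+ # 4-bit NF4 quantization with bfloat16 compute
+ quantization_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,
+ )
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "facebook/opt-350m",  # example checkpoint
+     quantization_config=quantization_config,
+     device_map="auto",
+ )
+ ```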
28
+
29
+ ## QuantoConfig
30
+
31
+ [[autodoc]] QuantoConfig
32
+
33
+ ## AqlmConfig
34
+
35
+ [[autodoc]] AqlmConfig
36
+
37
+ ## VptqConfig
38
+
39
+ [[autodoc]] VptqConfig
40
+
41
+ ## AwqConfig
42
+
43
+ [[autodoc]] AwqConfig
44
+
45
+ ## EetqConfig
46
+ [[autodoc]] EetqConfig
47
+
48
+ ## GPTQConfig
49
+
50
+ [[autodoc]] GPTQConfig
51
+
52
+ ## BitsAndBytesConfig
53
+
54
+ [[autodoc]] BitsAndBytesConfig
55
+
56
+ ## HfQuantizer
57
+
58
+ [[autodoc]] quantizers.base.HfQuantizer
59
+
60
+ ## HiggsConfig
61
+
62
+ [[autodoc]] HiggsConfig
63
+
64
+ ## HqqConfig
65
+
66
+ [[autodoc]] HqqConfig
67
+
68
+ ## FbgemmFp8Config
69
+
70
+ [[autodoc]] FbgemmFp8Config
71
+
72
+ ## CompressedTensorsConfig
73
+
74
+ [[autodoc]] CompressedTensorsConfig
75
+
76
+ ## TorchAoConfig
77
+
78
+ [[autodoc]] TorchAoConfig
79
+
80
+ ## BitNetConfig
81
+
82
+ [[autodoc]] BitNetConfig
83
+
84
+ ## SpQRConfig
85
+
86
+ [[autodoc]] SpQRConfig
87
+
88
+ ## FineGrainedFP8Config
89
+
90
+ [[autodoc]] FineGrainedFP8Config
91
+
92
+ ## QuarkConfig
93
+
94
+ [[autodoc]] QuarkConfig
95
+
96
+ ## AutoRoundConfig
97
+
98
+ [[autodoc]] AutoRoundConfig
docs/transformers/docs/source/en/main_classes/text_generation.md ADDED
@@ -0,0 +1,59 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2022 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Generation
18
+
19
+ Each framework has a generate method for text generation, implemented in its respective `GenerationMixin` class:
20
+
21
+ - PyTorch [`~generation.GenerationMixin.generate`] is implemented in [`~generation.GenerationMixin`].
22
+ - TensorFlow [`~generation.TFGenerationMixin.generate`] is implemented in [`~generation.TFGenerationMixin`].
23
+ - Flax/JAX [`~generation.FlaxGenerationMixin.generate`] is implemented in [`~generation.FlaxGenerationMixin`].
24
+
25
+ Regardless of your framework of choice, you can parameterize the generate method with a [`~generation.GenerationConfig`]
26
+ class instance. Please refer to this class for the complete list of generation parameters, which control the behavior
27
+ of the generation method.
28
+
29
+ To learn how to inspect a model's generation configuration, what the defaults are, how to change the parameters ad hoc,
30
+ and how to create and save a customized generation configuration, refer to the
31
+ [text generation strategies guide](../generation_strategies). The guide also explains how to use related features,
32
+ like token streaming.
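+
+ As a small sketch (the GPT-2 checkpoint is only an example), a [`~generation.GenerationConfig`] can be created once and passed to `generate`:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
+
+ tokenizer = AutoTokenizer.from_pretrained("gpt2")
+ model = AutoModelForCausalLM.from_pretrained("gpt2")
+
+ generation_config = GenerationConfig(max_new_tokens=20, do_sample=True, temperature=0.7)
+
+ inputs = tokenizer("The quick brown fox", return_tensors="pt")
+ outputs = model.generate(**inputs, generation_config=generation_config)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```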
33
+
34
+ ## GenerationConfig
35
+
36
+ [[autodoc]] generation.GenerationConfig
37
+ - from_pretrained
38
+ - from_model_config
39
+ - save_pretrained
40
+ - update
41
+ - validate
42
+ - get_generation_mode
43
+
44
+ ## GenerationMixin
45
+
46
+ [[autodoc]] GenerationMixin
47
+ - generate
48
+ - compute_transition_scores
49
+
50
+ ## TFGenerationMixin
51
+
52
+ [[autodoc]] TFGenerationMixin
53
+ - generate
54
+ - compute_transition_scores
55
+
56
+ ## FlaxGenerationMixin
57
+
58
+ [[autodoc]] FlaxGenerationMixin
59
+ - generate
docs/transformers/docs/source/en/main_classes/tokenizer.md ADDED
@@ -0,0 +1,104 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Tokenizer
18
+
19
+ A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most
20
+ of the tokenizers are available in two flavors: a full python implementation and a "Fast" implementation based on the
21
+ Rust library [🤗 Tokenizers](https://github.com/huggingface/tokenizers). The "Fast" implementations allow:
22
+
23
+ 1. a significant speed-up in particular when doing batched tokenization and
24
+ 2. additional methods to map between the original string (characters and words) and the token space (e.g. getting the
25
+ index of the token comprising a given character or the span of characters corresponding to a given token).
26
+
27
+ The base classes [`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`]
28
+ implement the common methods for encoding string inputs in model inputs (see below) and instantiating/saving python and
29
+ "Fast" tokenizers either from a local file or directory or from a pretrained tokenizer provided by the library
30
+ (downloaded from the Hugging Face Hub). They both rely on
31
+ [`~tokenization_utils_base.PreTrainedTokenizerBase`] that contains the common methods, and
32
+ [`~tokenization_utils_base.SpecialTokensMixin`].
33
+
34
+ [`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`] thus implement the main
35
+ methods for using all the tokenizers:
36
+
37
+ - Tokenizing (splitting strings into sub-word token strings), converting token strings to ids and back, and
38
+ encoding/decoding (i.e., tokenizing and converting to integers).
39
+ - Adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece...).
40
+ - Managing special tokens (like mask, beginning-of-sentence, etc.): adding them, assigning them to attributes in the
41
+ tokenizer for easy access and making sure they are not split during tokenization (a short sketch follows this list).
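+
+ A short sketch of the last two points (the checkpoint and token strings are arbitrary examples):
+
+ ```python
+ from transformers import AutoModel, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+ model = AutoModel.from_pretrained("bert-base-uncased")
+
+ # add regular tokens and an extra special token; special tokens are never split during tokenization
+ tokenizer.add_tokens(["new_tok1", "new_tok2"])
+ tokenizer.add_special_tokens({"additional_special_tokens": ["<custom_sep>"]})
+
+ # the embedding matrix must be resized to make room for the new entries
+ model.resize_token_embeddings(len(tokenizer))
+ ```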
42
+
43
+ [`BatchEncoding`] holds the output of the
44
+ [`~tokenization_utils_base.PreTrainedTokenizerBase`]'s encoding methods (`__call__`,
45
+ `encode_plus` and `batch_encode_plus`) and is derived from a Python dictionary. When the tokenizer is a pure python
46
+ tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by
47
+ these methods (`input_ids`, `attention_mask`...). When the tokenizer is a "Fast" tokenizer (i.e., backed by
48
+ HuggingFace [tokenizers library](https://github.com/huggingface/tokenizers)), this class provides in addition
49
+ several advanced alignment methods which can be used to map between the original string (characters and words) and the
50
+ token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding
51
+ to a given token).
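+
+ For instance (a small sketch using a fast tokenizer; the checkpoint is just an example), the alignment helpers look like this:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # a fast tokenizer is loaded by default
+
+ encoding = tokenizer("Transformers is great!")
+ print(encoding.tokens())          # wordpiece tokens, including [CLS] and [SEP]
+ print(encoding.char_to_token(0))  # index of the token that covers the first character
+ print(encoding.word_ids())        # word index for every token, None for special tokens
+ ```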
52
+
53
+
54
+ ## Multimodal Tokenizer
55
+
56
+ Apart from that, each tokenizer can be a "multimodal" tokenizer, which means that the tokenizer will hold all relevant special tokens
57
+ as part of tokenizer attributes for easier access. For example, if the tokenizer is loaded from a vision-language model like LLaVA, you will
58
+ be able to access `tokenizer.image_token_id` to obtain the special image token used as a placeholder.
59
+
60
+ To enable extra special tokens for any type of tokenizer, you have to add the following lines and save the tokenizer. Extra special tokens do not
61
+ have to be modality-related and can be anything that the model often needs access to. In the code below, the tokenizer saved at `output_dir` will have direct access
62
+ to three more special tokens.
63
+
64
+ ```python
65
+ from transformers import AutoTokenizer
+
+ vision_tokenizer = AutoTokenizer.from_pretrained(
66
+ "llava-hf/llava-1.5-7b-hf",
67
+ extra_special_tokens={"image_token": "<image>", "boi_token": "<image_start>", "eoi_token": "<image_end>"}
68
+ )
+ vision_tokenizer.save_pretrained("output_dir")  # saving makes the extra special tokens part of the tokenizer at `output_dir`
69
+ print(vision_tokenizer.image_token, vision_tokenizer.image_token_id)
70
+ ("<image>", 32000)
71
+ ```
72
+
73
+ ## PreTrainedTokenizer
74
+
75
+ [[autodoc]] PreTrainedTokenizer
76
+ - __call__
77
+ - add_tokens
78
+ - add_special_tokens
79
+ - apply_chat_template
80
+ - batch_decode
81
+ - decode
82
+ - encode
83
+ - push_to_hub
84
+ - all
85
+
86
+ ## PreTrainedTokenizerFast
87
+
88
+ The [`PreTrainedTokenizerFast`] depends on the [tokenizers](https://huggingface.co/docs/tokenizers) library. The tokenizers obtained from the 🤗 tokenizers library can be
89
+ loaded very simply into 🤗 transformers. Take a look at the [Using tokenizers from 🤗 tokenizers](../fast_tokenizers) page to understand how this is done.
90
+
91
+ [[autodoc]] PreTrainedTokenizerFast
92
+ - __call__
93
+ - add_tokens
94
+ - add_special_tokens
95
+ - apply_chat_template
96
+ - batch_decode
97
+ - decode
98
+ - encode
99
+ - push_to_hub
100
+ - all
101
+
102
+ ## BatchEncoding
103
+
104
+ [[autodoc]] BatchEncoding
docs/transformers/docs/source/en/main_classes/trainer.md ADDED
@@ -0,0 +1,54 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Trainer
18
+
19
+ The [`Trainer`] class provides an API for feature-complete training in PyTorch, and it supports distributed training on multiple GPUs/TPUs, mixed precision for [NVIDIA GPUs](https://nvidia.github.io/apex/), [AMD GPUs](https://rocm.docs.amd.com/en/latest/rocm.html), and [`torch.amp`](https://pytorch.org/docs/stable/amp.html) for PyTorch. [`Trainer`] goes hand-in-hand with the [`TrainingArguments`] class, which offers a wide range of options to customize how a model is trained. Together, these two classes provide a complete training API.
20
+
21
+ [`Seq2SeqTrainer`] and [`Seq2SeqTrainingArguments`] inherit from the [`Trainer`] and [`TrainingArguments`] classes and they're adapted for training models for sequence-to-sequence tasks such as summarization or translation.
22
+
23
+ <Tip warning={true}>
24
+
25
+ The [`Trainer`] class is optimized for 🤗 Transformers models and can have surprising behaviors
26
+ when used with other models. When using it with your own model, make sure:
27
+
28
+ - your model always returns tuples or subclasses of [`~utils.ModelOutput`]
29
+ - your model can compute the loss if a `labels` argument is provided and that the loss is returned as the first
30
+ element of the tuple (if your model returns tuples)
31
+ - your model can accept multiple label arguments (use `label_names` in [`TrainingArguments`] to indicate their name to the [`Trainer`]) but none of them should be named `"label"`
32
+
33
+ </Tip>
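+
+ As a minimal, hedged sketch of how the two classes fit together (the checkpoint is only an example and the two-example dataset is a toy stand-in for your own data):
+
+ ```python
+ from datasets import Dataset
+ from transformers import (
+     AutoModelForSequenceClassification,
+     AutoTokenizer,
+     Trainer,
+     TrainingArguments,
+ )
+
+ tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+ model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
+
+ # toy dataset, only to show the expected shape of the inputs
+ raw = Dataset.from_dict({"text": ["great movie", "terrible movie"], "label": [1, 0]})
+ train_dataset = raw.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)
+
+ training_args = TrainingArguments(output_dir="test_trainer", num_train_epochs=1, per_device_train_batch_size=2)
+
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=train_dataset,
+     processing_class=tokenizer,  # called `tokenizer` in older versions of Transformers
+ )
+ trainer.train()
+ ```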
34
+
35
+ ## Trainer[[api-reference]]
36
+
37
+ [[autodoc]] Trainer
38
+ - all
39
+
40
+ ## Seq2SeqTrainer
41
+
42
+ [[autodoc]] Seq2SeqTrainer
43
+ - evaluate
44
+ - predict
45
+
46
+ ## TrainingArguments
47
+
48
+ [[autodoc]] TrainingArguments
49
+ - all
50
+
51
+ ## Seq2SeqTrainingArguments
52
+
53
+ [[autodoc]] Seq2SeqTrainingArguments
54
+ - all
docs/transformers/docs/source/en/model_doc/albert.md ADDED
@@ -0,0 +1,307 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2020 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # ALBERT
18
+
19
+ <div class="flex flex-wrap space-x-1">
20
+ <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
21
+ <img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white">
22
+ <img alt="Flax" src="https://img.shields.io/badge/Flax-29a79b.svg?style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAC0AAAAtCAMAAAANxBKoAAAC7lBMVEUAAADg5vYHPVgAoJH+/v76+v39/f9JbLP///9+AIgAnY3///+mcqzt8fXy9fgkXa3Ax9709fr+///9/f8qXq49qp5AaLGMwrv8/P0eW60VWawxYq8yqJzG2dytt9Wyu9elzci519Lf3O3S2efY3OrY0+Xp7PT///////+dqNCexMc6Z7AGpJeGvbenstPZ5ejQ1OfJzOLa7ejh4+/r8fT29vpccbklWK8PVa0AS6ghW63O498vYa+lsdKz1NDRt9Kw1c672tbD3tnAxt7R6OHp5vDe7OrDyuDn6vLl6/EAQKak0MgATakkppo3ZK/Bz9y8w9yzu9jey97axdvHzeG21NHH4trTwthKZrVGZLSUSpuPQJiGAI+GAI8SWKydycLL4d7f2OTi1+S9xNzL0ePT6OLGzeEAo5U0qJw/aLEAo5JFa7JBabEAp5Y4qZ2QxLyKmsm3kL2xoMOehrRNb7RIbbOZgrGre68AUqwAqZqNN5aKJ5N/lMq+qsd8kMa4pcWzh7muhLMEV69juq2kbKqgUaOTR5uMMZWLLZSGAI5VAIdEAH+ovNDHuNCnxcy3qcaYx8K8msGplrx+wLahjbYdXrV6vbMvYK9DrZ8QrZ8tqJuFms+Sos6sw8ecy8RffsNVeMCvmb43aLltv7Q4Y7EZWK4QWa1gt6meZKUdr6GOAZVeA4xPAISyveLUwtivxtKTpNJ2jcqfvcltiMiwwcfAoMVxhL+Kx7xjdrqTe60tsaNQs6KaRKACrJ6UTZwkqpqTL5pkHY4AloSgsd2ptNXPvNOOncuxxsqFl8lmg8apt8FJcr9EbryGxLqlkrkrY7dRa7ZGZLQ5t6iXUZ6PPpgVpZeJCJFKAIGareTa0+KJod3H0deY2M+esM25usmYu8d2zsJOdcBVvrCLbqcAOaaHaKQAMaScWqKBXqCXMJ2RHpiLF5NmJZAdAHN2kta11dKu1M+DkcZLdb+Mcql3TppyRJdzQ5ZtNZNlIY+DF4+voCOQAAAAZ3RSTlMABAT+MEEJ/RH+/TP+Zlv+pUo6Ifz8+fco/fz6+evr39S9nJmOilQaF/7+/f38+smmoYp6b1T+/v7++vj189zU0tDJxsGzsrKSfv34+Pf27dDOysG9t6+n/vv6+vr59uzr1tG+tZ6Qg9Ym3QAABR5JREFUSMeNlVVUG1EQhpcuxEspXqS0SKEtxQp1d3d332STTRpIQhIISQgJhODu7lAoDoUCpe7u7u7+1puGpqnCPOyZvffbOXPm/PsP9JfQgyCC+tmTABTOcbxDz/heENS7/1F+9nhvkHePG0wNDLbGWwdXL+rbLWvpmZHXD8+gMfBjTh+aSe6Gnn7lwQIOTR0c8wfX3PWgv7avbdKwf/ZoBp1Gp/PvuvXW3vw5ib7emnTW4OR+3D4jB9vjNJ/7gNvfWWeH/TO/JyYrsiKCRjVEZA3UB+96kON+DxOQ/NLE8PE5iUYgIXjFnCOlxEQMaSGVxjg4gxOnEycGz8bptuNjVx08LscIgrzH3umcn+KKtiBIyvzOO2O99aAdR8cF19oZalnCtvREUw79tCd5sow1g1UKM6kXqUx4T8wsi3sTjJ3yzDmmhenLXLpo8u45eG5y4Vvbk6kkC4LLtJMowkSQxmk4ggVJEG+7c6QpHT8vvW9X7/o7+3ELmiJi2mEzZJiz8cT6TBlanBk70cB5GGIGC1gRDdZ00yADLW1FL6gqhtvNXNG5S9gdSrk4M1qu7JAsmYshzDS4peoMrU/gT7qQdqYGZaYhxZmVbGJAm/CS/HloWyhRUlknQ9KYcExTwS80d3VNOxUZJpITYyspl0LbhArhpZCD9cRWEQuhYkNGMHToQ/2Cs6swJlb39CsllxdXX6IUKh/H5jbnSsPKjgmoaFQ1f8wRLR0UnGE/RcDEjj2jXG1WVTwUs8+zxfcrVO+vSsuOpVKxCfYZiQ0/aPKuxQbQ8lIz+DClxC8u+snlcJ7Yr1z1JPqUH0V+GDXbOwAib931Y4Imaq0NTIXPXY+N5L18GJ37SVWu+hwXff8l72Ds9XuwYIBaXPq6Shm4l+Vl/5QiOlV+uTk6YR9PxKsI9xNJny31ygK1e+nIRC1N97EGkFPI+jCpiHe5PCEy7oWqWSwRrpOvhFzcbTWMbm3ZJAOn1rUKpYIt/lDhW/5RHHteeWFN60qo98YJuoq1nK3uW5AabyspC1BcIEpOhft+SZAShYoLSvnmSfnYADUERP5jJn2h5XtsgCRuhYQqAvwTwn33+YWEKUI72HX5AtfSAZDe8F2DtPPm77afhl0EkthzuCQU0BWApgQIH9+KB0JhopMM7bJrdTRoleM2JAVNMyPF+wdoaz+XJpGoVAQ7WXUkcV7gT3oUZyi/ISIJAVKhgNp+4b4veCFhYVJw4locdSjZCp9cPUhLF9EZ3KKzURepMEtCDPP3VcWFx4UIiZIklIpFNfHpdEafIF2aRmOcrUmjohbT2WUllbmRvgfbythbQO3222fpDJoufaQPncYYuqoGtUEsCJZL6/3PR5b4syeSjZMQG/T2maGANlXT2v8S4AULWaUkCxfLyW8iW4kdka+nEMjxpL2NCwsYNBp+Q61PF43zyDg9Bm9+3NNySn78jMZUUkumqE4Gp7JmFOdP1vc8PpRrzj9+wPinCy8K1PiJ4aYbnTYpCCbDkBSbzhu2QJ1Gd82t8jI8TH51+OzvXoWbnXUOBkNW+0mWFwGcGOUVpU81/n3TOHb5oMt2FgYGjzau0Nif0Ss7Q3XB33hjjQHjHA5E5aOyIQc8CBrLdQSs3j92VG+3nNEjbkbdbBr9zm04ruvw37vh0QKOdeGIkckc80fX3KH/h7PT4BOjgCty8VZ5ux1MoO5Cf5naca2LAsEgehI+drX8o/0Nu+W0m6K/I9gGPd/dfx/EN/wN62AhsBWuAAAAAElFTkSuQmCC
23
+ ">
24
+ <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
25
+ </div>
26
+
27
+ ## Overview
28
+
29
+ The ALBERT model was proposed in [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942) by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma,
30
+ Radu Soricut. It presents two parameter-reduction techniques to lower memory consumption and increase the training
31
+ speed of BERT:
32
+
33
+ - Splitting the embedding matrix into two smaller matrices.
34
+ - Using repeating layers split among groups.
35
+
36
+ The abstract from the paper is the following:
37
+
38
+ *Increasing model size when pretraining natural language representations often results in improved performance on
39
+ downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations,
40
+ longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction
41
+ techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows
42
+ that our proposed methods lead to models that scale much better compared to the original BERT. We also use a
43
+ self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks
44
+ with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and
45
+ SQuAD benchmarks while having fewer parameters compared to BERT-large.*
46
+
47
+ This model was contributed by [lysandre](https://huggingface.co/lysandre). The JAX version of this model was contributed by
48
+ [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/ALBERT).
49
+
50
+ ## Usage tips
51
+
52
+ - ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
53
+ than the left.
54
+ - ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains
55
+ similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same
56
+ number of (repeating) layers.
57
+ - The embedding size E is different from the hidden size H. This is justified because the embeddings are context independent (one embedding vector represents one token), whereas hidden states are context dependent (one hidden state represents a sequence of tokens), so it makes more sense to have H >> E. Also, the embedding matrix is large since it is V x E (V being the vocab size); if E < H, it has fewer parameters (see the configuration sketch after this list).
58
+ - Layers are split in groups that share parameters (to save memory).
59
+ - Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict whether they have been swapped or not.
60
+
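+ A small configuration sketch of the two tips above (the sizes are illustrative and do not correspond to a released checkpoint):
+
+ ```python
+ from transformers import AlbertConfig, AlbertModel
+
+ # the embedding size E stays much smaller than the hidden size H, and all layers share one group of weights
+ config = AlbertConfig(
+     vocab_size=30000,
+     embedding_size=128,   # E
+     hidden_size=768,      # H, with H >> E
+     num_hidden_layers=12,
+     num_hidden_groups=1,  # the 12 layers reuse the same parameters
+ )
+ model = AlbertModel(config)
+ print(sum(p.numel() for p in model.parameters()))
+ ```
+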
61
+ ### Using Scaled Dot Product Attention (SDPA)
62
+
63
+ PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
64
+ encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
65
+ [official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
66
+ or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
67
+ page for more information.
68
+
69
+ SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
70
+ `attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
71
+
72
+ ```python
73
+ import torch
+
+ from transformers import AlbertModel
74
+ model = AlbertModel.from_pretrained("albert/albert-base-v1", torch_dtype=torch.float16, attn_implementation="sdpa")
75
+ ...
76
+ ```
77
+
78
+ For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`).
79
+
80
+ On a local benchmark (GeForce RTX 2060-8GB, PyTorch 2.3.1, OS Ubuntu 20.04) with `float16`, we saw the
81
+ following speedups during training and inference.
82
+
83
+ #### Training for 100 iterations
84
+
85
+ |batch_size|seq_len|Time per batch (eager - s)| Time per batch (sdpa - s)| Speedup (%)| Eager peak mem (MB)| sdpa peak mem (MB)| Mem saving (%)|
86
+ |----------|-------|--------------------------|--------------------------|------------|--------------------|-------------------|---------------|
87
+ |2 |256 |0.028 |0.024 |14.388 |358.411 |321.088 |11.624 |
88
+ |2 |512 |0.049 |0.041 |17.681 |753.458 |602.660 |25.022 |
89
+ |4 |256 |0.044 |0.039 |12.246 |679.534 |602.660 |12.756 |
90
+ |4 |512 |0.090 |0.076 |18.472 |1434.820 |1134.140 |26.512 |
91
+ |8 |256 |0.081 |0.072 |12.664 |1283.825 |1134.140 |13.198 |
92
+ |8 |512 |0.170 |0.143 |18.957 |2820.398 |2219.695 |27.062 |
93
+
94
+ #### Inference with 50 batches
95
+
96
+ |batch_size|seq_len|Per token latency eager (ms)|Per token latency SDPA (ms)|Speedup (%) |Mem eager (MB)|Mem BT (MB)|Mem saved (%)|
97
+ |----------|-------|----------------------------|---------------------------|------------|--------------|-----------|-------------|
98
+ |4 |128 |0.083 |0.071 |16.967 |48.319 |48.45 |-0.268 |
99
+ |4 |256 |0.148 |0.127 |16.37 |63.4 |63.922 |-0.817 |
100
+ |4 |512 |0.31 |0.247 |25.473 |110.092 |94.343 |16.693 |
101
+ |8 |128 |0.137 |0.124 |11.102 |63.4 |63.66 |-0.409 |
102
+ |8 |256 |0.271 |0.231 |17.271 |91.202 |92.246 |-1.132 |
103
+ |8 |512 |0.602 |0.48 |25.47 |186.159 |152.564 |22.021 |
104
+ |16 |128 |0.252 |0.224 |12.506 |91.202 |91.722 |-0.567 |
105
+ |16 |256 |0.526 |0.448 |17.604 |148.378 |150.467 |-1.388 |
106
+ |16 |512 |1.203 |0.96 |25.365 |338.293 |271.102 |24.784 |
107
+
108
110
+
111
+
112
+ ## Resources
113
+
114
+
115
+ The resources provided in the following sections consist of a list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ALBERT. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
116
+
117
+
118
+ <PipelineTag pipeline="text-classification"/>
119
+
120
+
121
+ - [`AlbertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification).
122
+
123
+
124
+ - [`TFAlbertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/text-classification).
125
+
126
+ - [`FlaxAlbertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification_flax.ipynb).
127
+ - Check the [Text classification task guide](../tasks/sequence_classification) on how to use the model.
128
+
129
+
130
+ <PipelineTag pipeline="token-classification"/>
131
+
132
+
133
+ - [`AlbertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification).
134
+
135
+
136
+ - [`TFAlbertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/token-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb).
137
+
138
+
139
+
140
+ - [`FlaxAlbertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/token-classification).
141
+ - [Token classification](https://huggingface.co/course/chapter7/2?fw=pt) chapter of the 🤗 Hugging Face Course.
142
+ - Check the [Token classification task guide](../tasks/token_classification) on how to use the model.
143
+
144
+ <PipelineTag pipeline="fill-mask"/>
145
+
146
+ - [`AlbertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#robertabertdistilbert-and-masked-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).
147
+ - [`TFAlbertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling#run_mlmpy) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb).
148
+ - [`FlaxAlbertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#masked-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/masked_language_modeling_flax.ipynb).
149
+ - [Masked language modeling](https://huggingface.co/course/chapter7/3?fw=pt) chapter of the 🤗 Hugging Face Course.
150
+ - Check the [Masked language modeling task guide](../tasks/masked_language_modeling) on how to use the model.
151
+
152
+ <PipelineTag pipeline="question-answering"/>
153
+
154
+ - [`AlbertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb).
155
+ - [`TFAlbertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/question-answering) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb).
156
+ - [`FlaxAlbertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/question-answering).
157
+ - [Question answering](https://huggingface.co/course/chapter7/7?fw=pt) chapter of the 🤗 Hugging Face Course.
158
+ - Check the [Question answering task guide](../tasks/question_answering) on how to use the model.
159
+
160
+ **Multiple choice**
161
+
162
+ - [`AlbertForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb).
163
+ - [`TFAlbertForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb).
164
+
165
+ - Check the [Multiple choice task guide](../tasks/multiple_choice) on how to use the model.
166
+
167
+
168
+ ## AlbertConfig
169
+
170
+ [[autodoc]] AlbertConfig
171
+
172
+ ## AlbertTokenizer
173
+
174
+ [[autodoc]] AlbertTokenizer
175
+ - build_inputs_with_special_tokens
176
+ - get_special_tokens_mask
177
+ - create_token_type_ids_from_sequences
178
+ - save_vocabulary
179
+
180
+ ## AlbertTokenizerFast
181
+
182
+ [[autodoc]] AlbertTokenizerFast
183
+
184
+ ## Albert specific outputs
185
+
186
+ [[autodoc]] models.albert.modeling_albert.AlbertForPreTrainingOutput
187
+
188
+ [[autodoc]] models.albert.modeling_tf_albert.TFAlbertForPreTrainingOutput
189
+
190
+ <frameworkcontent>
191
+ <pt>
192
+
193
+ ## AlbertModel
194
+
195
+ [[autodoc]] AlbertModel
196
+ - forward
197
+
198
+ ## AlbertForPreTraining
199
+
200
+ [[autodoc]] AlbertForPreTraining
201
+ - forward
202
+
203
+ ## AlbertForMaskedLM
204
+
205
+ [[autodoc]] AlbertForMaskedLM
206
+ - forward
207
+
208
+ ## AlbertForSequenceClassification
209
+
210
+ [[autodoc]] AlbertForSequenceClassification
211
+ - forward
212
+
213
+ ## AlbertForMultipleChoice
214
+
215
+ [[autodoc]] AlbertForMultipleChoice
216
+
217
+ ## AlbertForTokenClassification
218
+
219
+ [[autodoc]] AlbertForTokenClassification
220
+ - forward
221
+
222
+ ## AlbertForQuestionAnswering
223
+
224
+ [[autodoc]] AlbertForQuestionAnswering
225
+ - forward
226
+
227
+ </pt>
228
+
229
+ <tf>
230
+
231
+ ## TFAlbertModel
232
+
233
+ [[autodoc]] TFAlbertModel
234
+ - call
235
+
236
+ ## TFAlbertForPreTraining
237
+
238
+ [[autodoc]] TFAlbertForPreTraining
239
+ - call
240
+
241
+ ## TFAlbertForMaskedLM
242
+
243
+ [[autodoc]] TFAlbertForMaskedLM
244
+ - call
245
+
246
+ ## TFAlbertForSequenceClassification
247
+
248
+ [[autodoc]] TFAlbertForSequenceClassification
249
+ - call
250
+
251
+ ## TFAlbertForMultipleChoice
252
+
253
+ [[autodoc]] TFAlbertForMultipleChoice
254
+ - call
255
+
256
+ ## TFAlbertForTokenClassification
257
+
258
+ [[autodoc]] TFAlbertForTokenClassification
259
+ - call
260
+
261
+ ## TFAlbertForQuestionAnswering
262
+
263
+ [[autodoc]] TFAlbertForQuestionAnswering
264
+ - call
265
+
266
+ </tf>
267
+ <jax>
268
+
269
+ ## FlaxAlbertModel
270
+
271
+ [[autodoc]] FlaxAlbertModel
272
+ - __call__
273
+
274
+ ## FlaxAlbertForPreTraining
275
+
276
+ [[autodoc]] FlaxAlbertForPreTraining
277
+ - __call__
278
+
279
+ ## FlaxAlbertForMaskedLM
280
+
281
+ [[autodoc]] FlaxAlbertForMaskedLM
282
+ - __call__
283
+
284
+ ## FlaxAlbertForSequenceClassification
285
+
286
+ [[autodoc]] FlaxAlbertForSequenceClassification
287
+ - __call__
288
+
289
+ ## FlaxAlbertForMultipleChoice
290
+
291
+ [[autodoc]] FlaxAlbertForMultipleChoice
292
+ - __call__
293
+
294
+ ## FlaxAlbertForTokenClassification
295
+
296
+ [[autodoc]] FlaxAlbertForTokenClassification
297
+ - __call__
298
+
299
+ ## FlaxAlbertForQuestionAnswering
300
+
301
+ [[autodoc]] FlaxAlbertForQuestionAnswering
302
+ - __call__
303
+
304
+ </jax>
305
+ </frameworkcontent>
306
+
307
+
docs/transformers/docs/source/en/model_doc/align.md ADDED
@@ -0,0 +1,108 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2023 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # ALIGN
18
+
19
+ <div class="flex flex-wrap space-x-1">
20
+ <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
21
+ </div>
22
+
23
+ ## Overview
24
+
25
+ The ALIGN model was proposed in [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. ALIGN is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image classification. ALIGN features a dual-encoder architecture with [EfficientNet](efficientnet) as its vision encoder and [BERT](bert) as its text encoder, and learns to align visual and text representations with contrastive learning. Unlike previous work, ALIGN leverages a massive noisy dataset and shows that the scale of the corpus can be used to achieve SOTA representations with a simple recipe.
26
+
27
+ The abstract from the paper is the following:
28
+
29
+ *Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enables zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.*
30
+
31
+ This model was contributed by [Alara Dirik](https://huggingface.co/adirik).
32
+ The original code is not released; this implementation is based on the Kakao Brain implementation of the original paper.
33
+
34
+ ## Usage example
35
+
36
+ ALIGN uses EfficientNet to get visual features and BERT to get the text features. Both the text and visual features are then projected to a latent space with identical dimension. The dot product between the projected image and text features is then used as a similarity score.
37
+
38
+ [`AlignProcessor`] wraps [`EfficientNetImageProcessor`] and [`BertTokenizer`] into a single instance to both encode the text and preprocess the images. The following example shows how to get the image-text similarity scores using [`AlignProcessor`] and [`AlignModel`].
39
+
40
+ ```python
41
+ import requests
42
+ import torch
43
+ from PIL import Image
44
+ from transformers import AlignProcessor, AlignModel
45
+
46
+ processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
47
+ model = AlignModel.from_pretrained("kakaobrain/align-base")
48
+
49
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
50
+ image = Image.open(requests.get(url, stream=True).raw)
51
+ candidate_labels = ["an image of a cat", "an image of a dog"]
52
+
53
+ inputs = processor(images=image, text=candidate_labels, return_tensors="pt")
54
+
55
+ with torch.no_grad():
56
+     outputs = model(**inputs)
57
+
58
+ # this is the image-text similarity score
59
+ logits_per_image = outputs.logits_per_image
60
+
61
+ # we can take the softmax to get the label probabilities
62
+ probs = logits_per_image.softmax(dim=1)
63
+ print(probs)
64
+ ```
65
+
66
+ ## Resources
67
+
68
+ A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ALIGN.
69
+
70
+ - A blog post on [ALIGN and the COYO-700M dataset](https://huggingface.co/blog/vit-align).
71
+ - A zero-shot image classification [demo](https://huggingface.co/spaces/adirik/ALIGN-zero-shot-image-classification).
72
+ - [Model card](https://huggingface.co/kakaobrain/align-base) of `kakaobrain/align-base` model.
73
+
74
+ If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it. The resource should ideally demonstrate something new instead of duplicating an existing resource.
75
+
76
+ ## AlignConfig
77
+
78
+ [[autodoc]] AlignConfig
79
+ - from_text_vision_configs
80
+
81
+ ## AlignTextConfig
82
+
83
+ [[autodoc]] AlignTextConfig
84
+
85
+ ## AlignVisionConfig
86
+
87
+ [[autodoc]] AlignVisionConfig
88
+
89
+ ## AlignProcessor
90
+
91
+ [[autodoc]] AlignProcessor
92
+
93
+ ## AlignModel
94
+
95
+ [[autodoc]] AlignModel
96
+ - forward
97
+ - get_text_features
98
+ - get_image_features
99
+
100
+ ## AlignTextModel
101
+
102
+ [[autodoc]] AlignTextModel
103
+ - forward
104
+
105
+ ## AlignVisionModel
106
+
107
+ [[autodoc]] AlignVisionModel
108
+ - forward
docs/transformers/docs/source/en/model_doc/altclip.md ADDED
@@ -0,0 +1,116 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2022 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # AltCLIP
18
+
19
+ <div class="flex flex-wrap space-x-1">
20
+ <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
21
+ </div>
22
+
23
+ ## Overview
24
+
25
+ The AltCLIP model was proposed in [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679v2) by Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, Ledell Wu. AltCLIP
26
+ (Altering the Language Encoder in CLIP) is a neural network trained on a variety of image-text and text-text pairs. By switching CLIP's
27
+ text encoder with a pretrained multilingual text encoder XLM-R, we could obtain very close performances with CLIP on almost all tasks, and extended original CLIP's capabilities such as multilingual understanding.
28
+
29
+ The abstract from the paper is the following:
30
+
31
+ *In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model.
32
+ Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained
33
+ multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of
34
+ teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art
35
+ performances on a bunch of tasks including ImageNet-CN, Flicker30k- CN, and COCO-CN. Further, we obtain very close performances with
36
+ CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding.*
37
+
38
+ This model was contributed by [jongjyh](https://huggingface.co/jongjyh).
39
+
40
+ ## Usage tips and example
41
+
42
+ The usage of AltCLIP is very similar to that of CLIP. The difference from CLIP lies in the text encoder. Note that we use bidirectional attention instead of causal attention
43
+ and we take the [CLS] token in XLM-R to represent the text embedding.
44
+
45
+ AltCLIP is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image
46
+ classification. AltCLIP uses a ViT-like transformer to get visual features and a bidirectional language model to get the text
47
+ features. Both the text and visual features are then projected to a latent space with identical dimension. The dot
48
+ product between the projected image and text features is then used as a similarity score.
49
+
50
+ To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches,
51
+ which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image. The authors
52
+ also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.
53
+ The [`CLIPImageProcessor`] can be used to resize (or rescale) and normalize images for the model.
54
+
55
+ The [`AltCLIPProcessor`] wraps a [`CLIPImageProcessor`] and a [`XLMRobertaTokenizer`] into a single instance to both
56
+ encode the text and prepare the images. The following example shows how to get the image-text similarity scores using
57
+ [`AltCLIPProcessor`] and [`AltCLIPModel`].
58
+
59
+ ```python
60
+ >>> from PIL import Image
61
+ >>> import requests
62
+
63
+ >>> from transformers import AltCLIPModel, AltCLIPProcessor
64
+
65
+ >>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
66
+ >>> processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")
67
+
68
+ >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
69
+ >>> image = Image.open(requests.get(url, stream=True).raw)
70
+
71
+ >>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
72
+
73
+ >>> outputs = model(**inputs)
74
+ >>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score
75
+ >>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
76
+ ```
77
+
78
+ <Tip>
79
+
80
+ This model is based on `CLIPModel`; use it like you would use the original [CLIP](clip).
81
+
82
+ </Tip>
83
+
84
+ ## AltCLIPConfig
85
+
86
+ [[autodoc]] AltCLIPConfig
87
+ - from_text_vision_configs
88
+
89
+ ## AltCLIPTextConfig
90
+
91
+ [[autodoc]] AltCLIPTextConfig
92
+
93
+ ## AltCLIPVisionConfig
94
+
95
+ [[autodoc]] AltCLIPVisionConfig
96
+
97
+ ## AltCLIPProcessor
98
+
99
+ [[autodoc]] AltCLIPProcessor
100
+
101
+ ## AltCLIPModel
102
+
103
+ [[autodoc]] AltCLIPModel
104
+ - forward
105
+ - get_text_features
106
+ - get_image_features
107
+
108
+ ## AltCLIPTextModel
109
+
110
+ [[autodoc]] AltCLIPTextModel
111
+ - forward
112
+
113
+ ## AltCLIPVisionModel
114
+
115
+ [[autodoc]] AltCLIPVisionModel
116
+ - forward
docs/transformers/docs/source/en/model_doc/aria.md ADDED
@@ -0,0 +1,112 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!--Copyright 2024 The HuggingFace Team. All rights reserved.
2
+
3
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4
+ the License. You may obtain a copy of the License at
5
+
6
+ http://www.apache.org/licenses/LICENSE-2.0
7
+
8
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10
+ specific language governing permissions and limitations under the License.
11
+
12
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13
+ rendered properly in your Markdown viewer.
14
+
15
+ -->
16
+
17
+ # Aria
18
+
19
+ <div class="flex flex-wrap space-x-1">
20
+ <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
21
+ <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
22
+ <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
23
+ </div>
24
+
25
+ ## Overview
26
+
27
+ The Aria model was proposed in [Aria: An Open Multimodal Native Mixture-of-Experts Model](https://huggingface.co/papers/2410.05993) by Li et al. from the Rhymes.AI team.
28
+
29
+ Aria is an open multimodal-native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. It has a Mixture-of-Experts architecture, with respectively 3.9B and 3.5B activated parameters per visual token and text token.
30
+
31
+ The abstract from the paper is the following:
32
+
33
+ *Information comes in diverse modalities. Multimodal native AI models are essential to integrate real-world information and deliver comprehensive understanding. While proprietary multimodal native models exist, their lack of openness imposes obstacles for adoptions, let alone adaptations. To fill this gap, we introduce Aria, an open multimodal native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. Aria is a mixture-of-expert model with 3.9B and 3.5B activated parameters per visual token and text token, respectively. It outperforms Pixtral-12B and Llama3.2-11B, and is competitive against the best proprietary models on various multimodal tasks. We pre-train Aria from scratch following a 4-stage pipeline, which progressively equips the model with strong capabilities in language understanding, multimodal understanding, long context window, and instruction following. We open-source the model weights along with a codebase that facilitates easy adoptions and adaptations of Aria in real-world applications.*
34
+
35
+ This model was contributed by [m-ric](https://huggingface.co/m-ric).
36
+ The original code can be found [here](https://github.com/rhymes-ai/Aria).
37
+
38
+ ## Usage tips
39
+
40
+ Here's how to use the model for vision tasks:
41
+ ```python
42
+ import requests
43
+ import torch
44
+ from PIL import Image
45
+
46
+ from transformers import AriaProcessor, AriaForConditionalGeneration
47
+
48
+ model_id_or_path = "rhymes-ai/Aria"
49
+
50
+ model = AriaForConditionalGeneration.from_pretrained(
51
+ model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16
52
+ )
53
+
54
+ processor = AriaProcessor.from_pretrained(model_id_or_path)
55
+
56
+ image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
57
+
58
+ messages = [
59
+ {
60
+ "role": "user",
61
+ "content": [
62
+ {"type": "image"},
63
+ {"text": "what is the image?", "type": "text"},
64
+ ],
65
+ }
66
+ ]
67
+
68
+ text = processor.apply_chat_template(messages, add_generation_prompt=True)
69
+ inputs = processor(text=text, images=image, return_tensors="pt")
70
+ inputs.to(model.device)
71
+
72
+ output = model.generate(
73
+ **inputs,
74
+ max_new_tokens=15,
75
+ stop_strings=["<|im_end|>"],
76
+ tokenizer=processor.tokenizer,
77
+ do_sample=True,
78
+ temperature=0.9,
79
+ )
80
+ output_ids = output[0][inputs["input_ids"].shape[1]:]
81
+ response = processor.decode(output_ids, skip_special_tokens=True)
82
+ ```
83
+
84
+
85
+ ## AriaImageProcessor
86
+
87
+ [[autodoc]] AriaImageProcessor
88
+
89
+ ## AriaProcessor
90
+
91
+ [[autodoc]] AriaProcessor
92
+
93
+ ## AriaTextConfig
94
+
95
+ [[autodoc]] AriaTextConfig
96
+
97
+ ## AriaConfig
98
+
99
+ [[autodoc]] AriaConfig
100
+
101
+ ## AriaTextModel
102
+
103
+ [[autodoc]] AriaTextModel
104
+
105
+ ## AriaTextForCausalLM
106
+
107
+ [[autodoc]] AriaTextForCausalLM
108
+
109
+ ## AriaForConditionalGeneration
110
+
111
+ [[autodoc]] AriaForConditionalGeneration
112
+ - forward