| # CLAP | |
| [CLAP (Contrastive Language-Audio Pretraining)](https://huggingface.co/papers/2211.06687) is a multimodal model that combines audio data with natural language descriptions through contrastive learning. | |
| It incorporates feature fusion and keyword-to-caption augmentation to process variable-length audio inputs and to improve performance. CLAP doesn't require task-specific training data and can learn meaningful audio representations through natural language. | |
| You can find all the original CLAP checkpoints under the [CLAP](https://huggingface.co/collections/laion/clap-contrastive-language-audio-pretraining-65415c0b18373b607262a490) collection. | |
| > [!TIP] | |
| > This model was contributed by [ybelkada](https://huggingface.co/ybelkada) and [ArthurZ](https://huggingface.co/ArthurZ). | |
| > | |
| > Click on the CLAP models in the right sidebar for more examples of how to apply CLAP to different audio retrieval and classification tasks. | |
| The example below demonstrates how to extract text embeddings with the [AutoModel](/docs/transformers/pr_33892/en/model_doc/auto#transformers.AutoModel) class. | |
| <hfoptions id="usage"> | |
| <hfoption id="AutoModel"> | |
| ```python | |
| import torch | |
| from transformers import AutoTokenizer, AutoModel | |
| model = AutoModel.from_pretrained("laion/clap-htsat-unfused", dtype=torch.float16, device_map="auto") | |
| tokenizer = AutoTokenizer.from_pretrained("laion/clap-htsat-unfused") | |
| texts = ["the sound of a cat", "the sound of a dog", "music playing"] | |
| inputs = tokenizer(texts, padding=True, return_tensors="pt").to(model.device) | |
| with torch.no_grad(): | |
|     text_features = model.get_text_features(**inputs) | |
| print(f"Text embeddings shape: {text_features.shape}") | |
| print(f"Text embeddings: {text_features}") | |
| ``` | |
| </hfoption> | |
| </hfoptions> | |
| ## ClapConfig[[transformers.ClapConfig]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class transformers.ClapConfig</name><anchor>transformers.ClapConfig</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/clap/configuration_clap.py#L284</source><parameters>[{"name": "text_config", "val": " = None"}, {"name": "audio_config", "val": " = None"}, {"name": "logit_scale_init_value", "val": " = 14.285714285714285"}, {"name": "projection_dim", "val": " = 512"}, {"name": "projection_hidden_act", "val": " = 'relu'"}, {"name": "initializer_factor", "val": " = 1.0"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **text_config** (`dict`, *optional*) -- | |
| Dictionary of configuration options used to initialize [ClapTextConfig](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapTextConfig). | |
| - **audio_config** (`dict`, *optional*) -- | |
| Dictionary of configuration options used to initialize [ClapAudioConfig](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapAudioConfig). | |
| - **logit_scale_init_value** (`float`, *optional*, defaults to 14.29) -- | |
| The initial value of the *logit_scale* parameter. Default is used as per the original CLAP implementation. | |
| - **projection_dim** (`int`, *optional*, defaults to 512) -- | |
| Dimensionality of text and audio projection layers. | |
| - **projection_hidden_act** (`str`, *optional*, defaults to `"relu"`) -- | |
| Activation function for the projection layers. | |
| - **initializer_factor** (`float`, *optional*, defaults to 1.0) -- | |
| Factor to scale the initialization of the model weights. | |
| - **kwargs** (*optional*) -- | |
| Dictionary of keyword arguments.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| [ClapConfig](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapConfig) is the configuration class to store the configuration of a [ClapModel](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapModel). It is used to instantiate | |
| a CLAP model according to the specified arguments, defining the text model and audio model configs. Instantiating a | |
| configuration with the defaults will yield a similar configuration to that of the CLAP | |
| [laion/clap-htsat-fused](https://huggingface.co/laion/clap-htsat-fused) architecture. | |
| Configuration objects inherit from [PreTrainedConfig](/docs/transformers/pr_33892/en/main_classes/configuration#transformers.PreTrainedConfig) and can be used to control the model outputs. Read the | |
| documentation from [PreTrainedConfig](/docs/transformers/pr_33892/en/main_classes/configuration#transformers.PreTrainedConfig) for more information. | |
| <ExampleCodeBlock anchor="transformers.ClapConfig.example"> | |
| Example: | |
| ```python | |
| >>> from transformers import ClapConfig, ClapModel | |
| >>> # Initializing a ClapConfig with laion-ai/base style configuration | |
| >>> configuration = ClapConfig() | |
| >>> # Initializing a ClapModel (with random weights) from the laion-ai/base style configuration | |
| >>> model = ClapModel(configuration) | |
| >>> # Accessing the model configuration | |
| >>> configuration = model.config | |
| >>> # We can also initialize a ClapConfig from a ClapTextConfig and a ClapAudioConfig | |
| >>> from transformers import ClapTextConfig, ClapAudioConfig | |
| >>> # Initializing a ClapText and ClapAudioConfig configuration | |
| >>> config_text = ClapTextConfig() | |
| >>> config_audio = ClapAudioConfig() | |
| >>> config = ClapConfig(text_config=config_text, audio_config=config_audio) | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
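The default `logit_scale_init_value` of 14.285714285714285 equals `1 / 0.07`, the reciprocal of a temperature commonly used in contrastive pretraining. The sketch below only illustrates that arithmetic and does not call the library. The `temperature` name and the log-space storage are assumptions about how such a scale is often parameterized, not something this page states.

```python
import math

# Assumed temperature; its reciprocal reproduces the documented default.
temperature = 0.07
logit_scale_init_value = 1 / temperature

# Assumption: the scale is kept in log space and exponentiated at use time,
# which keeps the effective scale positive while it is being trained.
log_scale = math.log(logit_scale_init_value)
effective_scale = math.exp(log_scale)

print(f"{logit_scale_init_value:.4f}")
print(f"{effective_scale:.4f}")
```

A larger scale sharpens the softmax over similarity scores, so matching audio-text pairs dominate the contrastive loss more strongly.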
| ## ClapTextConfig[[transformers.ClapTextConfig]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class transformers.ClapTextConfig</name><anchor>transformers.ClapTextConfig</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/clap/configuration_clap.py#L24</source><parameters>[{"name": "vocab_size", "val": " = 50265"}, {"name": "hidden_size", "val": " = 768"}, {"name": "num_hidden_layers", "val": " = 12"}, {"name": "num_attention_heads", "val": " = 12"}, {"name": "intermediate_size", "val": " = 3072"}, {"name": "hidden_act", "val": " = 'gelu'"}, {"name": "hidden_dropout_prob", "val": " = 0.1"}, {"name": "attention_probs_dropout_prob", "val": " = 0.1"}, {"name": "max_position_embeddings", "val": " = 514"}, {"name": "type_vocab_size", "val": " = 1"}, {"name": "initializer_factor", "val": " = 1.0"}, {"name": "layer_norm_eps", "val": " = 1e-12"}, {"name": "projection_dim", "val": " = 512"}, {"name": "pad_token_id", "val": " = 1"}, {"name": "bos_token_id", "val": " = 0"}, {"name": "eos_token_id", "val": " = 2"}, {"name": "use_cache", "val": " = True"}, {"name": "projection_hidden_act", "val": " = 'relu'"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **vocab_size** (`int`, *optional*, defaults to 50265) -- | |
| Vocabulary size of the CLAP model. Defines the number of different tokens that can be represented by the | |
| `input_ids` passed when calling [ClapTextModel](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapTextModel). | |
| - **hidden_size** (`int`, *optional*, defaults to 768) -- | |
| Dimensionality of the encoder layers and the pooler layer. | |
| - **num_hidden_layers** (`int`, *optional*, defaults to 12) -- | |
| Number of hidden layers in the Transformer encoder. | |
| - **num_attention_heads** (`int`, *optional*, defaults to 12) -- | |
| Number of attention heads for each attention layer in the Transformer encoder. | |
| - **intermediate_size** (`int`, *optional*, defaults to 3072) -- | |
| Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder. | |
| - **hidden_act** (`str` or `Callable`, *optional*, defaults to `"gelu"`) -- | |
| The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, | |
| `"relu"`, `"silu"` and `"gelu_new"` are supported. | |
| - **hidden_dropout_prob** (`float`, *optional*, defaults to 0.1) -- | |
| The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. | |
| - **attention_probs_dropout_prob** (`float`, *optional*, defaults to 0.1) -- | |
| The dropout ratio for the attention probabilities. | |
| - **max_position_embeddings** (`int`, *optional*, defaults to 514) -- | |
| The maximum sequence length that this model might ever be used with. Typically set this to something large | |
| just in case (e.g., 512 or 1024 or 2048). | |
| - **type_vocab_size** (`int`, *optional*, defaults to 1) -- | |
| The vocabulary size of the `token_type_ids` passed when calling [ClapTextModel](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapTextModel). | |
| - **layer_norm_eps** (`float`, *optional*, defaults to 1e-12) -- | |
| The epsilon used by the layer normalization layers. | |
| - **is_decoder** (`bool`, *optional*, defaults to `False`) -- | |
| Whether the model is used as a decoder or not. If `False`, the model is used as an encoder. | |
| - **use_cache** (`bool`, *optional*, defaults to `True`) -- | |
| Whether or not the model should return the last key/values attentions (not used by all models). Only | |
| relevant if `config.is_decoder=True`. | |
| - **projection_hidden_act** (`str`, *optional*, defaults to `"relu"`) -- | |
| The non-linear activation function (function or string) in the projection layer. If string, `"gelu"`, | |
| `"relu"`, `"silu"` and `"gelu_new"` are supported. | |
| - **projection_dim** (`int`, *optional*, defaults to 512) -- | |
| Dimension of the projection head of the `ClapTextModelWithProjection`.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| This is the configuration class to store the configuration of a [ClapTextModel](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapTextModel). It is used to instantiate a CLAP | |
| model according to the specified arguments, defining the model architecture. Instantiating a configuration with the | |
| defaults will yield a similar configuration to that of the CLAP | |
| [laion/clap-htsat-fused](https://huggingface.co/laion/clap-htsat-fused) architecture. | |
| Configuration objects inherit from [PreTrainedConfig](/docs/transformers/pr_33892/en/main_classes/configuration#transformers.PreTrainedConfig) and can be used to control the model outputs. Read the | |
| documentation from [PreTrainedConfig](/docs/transformers/pr_33892/en/main_classes/configuration#transformers.PreTrainedConfig) for more information. | |
| <ExampleCodeBlock anchor="transformers.ClapTextConfig.example"> | |
| Examples: | |
| ```python | |
| >>> from transformers import ClapTextConfig, ClapTextModel | |
| >>> # Initializing a CLAP text configuration | |
| >>> configuration = ClapTextConfig() | |
| >>> # Initializing a model (with random weights) from the configuration | |
| >>> model = ClapTextModel(configuration) | |
| >>> # Accessing the model configuration | |
| >>> configuration = model.config | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
| ## ClapAudioConfig[[transformers.ClapAudioConfig]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class transformers.ClapAudioConfig</name><anchor>transformers.ClapAudioConfig</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/clap/configuration_clap.py#L131</source><parameters>[{"name": "window_size", "val": " = 8"}, {"name": "num_mel_bins", "val": " = 64"}, {"name": "spec_size", "val": " = 256"}, {"name": "hidden_act", "val": " = 'gelu'"}, {"name": "patch_size", "val": " = 4"}, {"name": "patch_stride", "val": " = [4, 4]"}, {"name": "num_classes", "val": " = 527"}, {"name": "hidden_size", "val": " = 768"}, {"name": "projection_dim", "val": " = 512"}, {"name": "depths", "val": " = [2, 2, 6, 2]"}, {"name": "num_attention_heads", "val": " = [4, 8, 16, 32]"}, {"name": "enable_fusion", "val": " = False"}, {"name": "hidden_dropout_prob", "val": " = 0.1"}, {"name": "fusion_type", "val": " = None"}, {"name": "patch_embed_input_channels", "val": " = 1"}, {"name": "flatten_patch_embeds", "val": " = True"}, {"name": "patch_embeds_hidden_size", "val": " = 96"}, {"name": "enable_patch_layer_norm", "val": " = True"}, {"name": "drop_path_rate", "val": " = 0.0"}, {"name": "attention_probs_dropout_prob", "val": " = 0.0"}, {"name": "qkv_bias", "val": " = True"}, {"name": "mlp_ratio", "val": " = 4.0"}, {"name": "aff_block_r", "val": " = 4"}, {"name": "num_hidden_layers", "val": " = 4"}, {"name": "projection_hidden_act", "val": " = 'relu'"}, {"name": "layer_norm_eps", "val": " = 1e-05"}, {"name": "initializer_factor", "val": " = 1.0"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **window_size** (`int`, *optional*, defaults to 8) -- | |
| Image size of the spectrogram | |
| - **num_mel_bins** (`int`, *optional*, defaults to 64) -- | |
| Number of mel features used per frame. Should correspond to the value used in the `ClapProcessor` class. | |
| - **spec_size** (`int`, *optional*, defaults to 256) -- | |
| Desired input size of the spectrogram that the model supports. It can be different from the output of the | |
| `ClapFeatureExtractor`, in which case the input features will be resized. Corresponds to the `image_size` | |
| of the audio models. | |
| - **hidden_act** (`str`, *optional*, defaults to `"gelu"`) -- | |
| The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, | |
| `"relu"`, `"silu"` and `"gelu_new"` are supported. | |
| - **patch_size** (`int`, *optional*, defaults to 4) -- | |
| Patch size for the audio spectrogram | |
| - **patch_stride** (`list`, *optional*, defaults to `[4, 4]`) -- | |
| Patch stride for the audio spectrogram | |
| - **num_classes** (`int`, *optional*, defaults to 527) -- | |
| Number of classes used for the head training | |
| - **hidden_size** (`int`, *optional*, defaults to 768) -- | |
| Hidden size of the output of the audio encoder. Corresponds to the dimension of the penultimate layer's | |
| output, which is sent to the projection MLP layer. | |
| - **projection_dim** (`int`, *optional*, defaults to 512) -- | |
| Hidden size of the projection layer. | |
| - **depths** (`list`, *optional*, defaults to `[2, 2, 6, 2]`) -- | |
| Depths used for the Swin Layers of the audio model | |
| - **num_attention_heads** (`list`, *optional*, defaults to `[4, 8, 16, 32]`) -- | |
| Number of attention heads used for the Swin Layers of the audio model | |
| - **enable_fusion** (`bool`, *optional*, defaults to `False`) -- | |
| Whether or not to enable patch fusion. This is the main contribution of the authors, and should give the | |
| best results. | |
| - **hidden_dropout_prob** (`float`, *optional*, defaults to 0.1) -- | |
| The dropout probability for all fully connected layers in the encoder. | |
| - **fusion_type** (`[type]`, *optional*) -- | |
| Fusion type used for the patch fusion. | |
| - **patch_embed_input_channels** (`int`, *optional*, defaults to 1) -- | |
| Number of channels used for the input spectrogram | |
| - **flatten_patch_embeds** (`bool`, *optional*, defaults to `True`) -- | |
| Whether or not to flatten the patch embeddings | |
| - **patch_embeds_hidden_size** (`int`, *optional*, defaults to 96) -- | |
| Hidden size of the patch embeddings. It is used as the number of output channels. | |
| - **enable_patch_layer_norm** (`bool`, *optional*, defaults to `True`) -- | |
| Whether or not to enable layer normalization for the patch embeddings | |
| - **drop_path_rate** (`float`, *optional*, defaults to 0.0) -- | |
| Drop path rate for the patch fusion | |
| - **attention_probs_dropout_prob** (`float`, *optional*, defaults to 0.0) -- | |
| The dropout ratio for the attention probabilities. | |
| - **qkv_bias** (`bool`, *optional*, defaults to `True`) -- | |
| Whether or not to add a bias to the query, key, value projections. | |
| - **mlp_ratio** (`float`, *optional*, defaults to 4.0) -- | |
| Ratio of the mlp hidden dim to embedding dim. | |
| - **aff_block_r** (`int`, *optional*, defaults to 4) -- | |
| downsize_ratio used in the AudioFF block | |
| - **num_hidden_layers** (`int`, *optional*, defaults to 4) -- | |
| Number of hidden layers in the Transformer encoder. | |
| - **projection_hidden_act** (`str`, *optional*, defaults to `"relu"`) -- | |
| The non-linear activation function (function or string) in the projection layer. If string, `"gelu"`, | |
| `"relu"`, `"silu"` and `"gelu_new"` are supported. | |
| - **layer_norm_eps** (`float`, *optional*, defaults to 1e-05) -- | |
| The epsilon used by the layer normalization layers. | |
| - **initializer_factor** (`float`, *optional*, defaults to 1.0) -- | |
| A factor for initializing all weight matrices (should be kept to 1, used internally for initialization | |
| testing).</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| This is the configuration class to store the configuration of a [ClapAudioModel](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapAudioModel). It is used to instantiate a | |
| CLAP audio encoder according to the specified arguments, defining the model architecture. Instantiating a | |
| configuration with the defaults will yield a similar configuration to that of the audio encoder of the CLAP | |
| [laion/clap-htsat-fused](https://huggingface.co/laion/clap-htsat-fused) architecture. | |
| Configuration objects inherit from [PreTrainedConfig](/docs/transformers/pr_33892/en/main_classes/configuration#transformers.PreTrainedConfig) and can be used to control the model outputs. Read the | |
| documentation from [PreTrainedConfig](/docs/transformers/pr_33892/en/main_classes/configuration#transformers.PreTrainedConfig) for more information. | |
| <ExampleCodeBlock anchor="transformers.ClapAudioConfig.example"> | |
| Example: | |
| ```python | |
| >>> from transformers import ClapAudioConfig, ClapAudioModel | |
| >>> # Initializing a ClapAudioConfig with laion/clap-htsat-fused style configuration | |
| >>> configuration = ClapAudioConfig() | |
| >>> # Initializing a ClapAudioModel (with random weights) from the laion/clap-htsat-fused style configuration | |
| >>> model = ClapAudioModel(configuration) | |
| >>> # Accessing the model configuration | |
| >>> configuration = model.config | |
| ``` | |
| </ExampleCodeBlock> | |
| </div> | |
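The audio defaults listed above are mutually consistent if the channel width doubles at every stage, as in Swin-style hierarchical encoders. The sketch below checks that arithmetic with values copied from the signature. The doubling rule itself is an assumption made for illustration and is not stated on this page.

```python
# Default values copied from the ClapAudioConfig signature.
patch_embeds_hidden_size = 96
depths = [2, 2, 6, 2]
num_attention_heads = [4, 8, 16, 32]
hidden_size = 768
num_hidden_layers = 4

# Assumption: the channel width doubles after every stage (Swin-style).
stage_dims = [patch_embeds_hidden_size * 2**i for i in range(len(depths))]

# Per-head width at each stage under that assumption.
head_dims = [dim // heads for dim, heads in zip(stage_dims, num_attention_heads)]

print(stage_dims)
print(head_dims)
```

Under this assumption the last stage is 768 channels wide, matching `hidden_size`, and every stage uses the same per-head width. If you change `depths`, `num_attention_heads`, or `patch_embeds_hidden_size`, it is worth re-checking that the three stay compatible.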
| ## ClapFeatureExtractor[[transformers.ClapFeatureExtractor]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class transformers.ClapFeatureExtractor</name><anchor>transformers.ClapFeatureExtractor</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/clap/feature_extraction_clap.py#L34</source><parameters>[{"name": "feature_size", "val": " = 64"}, {"name": "sampling_rate", "val": " = 48000"}, {"name": "hop_length", "val": " = 480"}, {"name": "max_length_s", "val": " = 10"}, {"name": "fft_window_size", "val": " = 1024"}, {"name": "padding_value", "val": " = 0.0"}, {"name": "return_attention_mask", "val": " = False"}, {"name": "frequency_min", "val": ": float = 0"}, {"name": "frequency_max", "val": ": float = 14000"}, {"name": "top_db", "val": ": typing.Optional[int] = None"}, {"name": "truncation", "val": ": str = 'fusion'"}, {"name": "padding", "val": ": str = 'repeatpad'"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **feature_size** (`int`, *optional*, defaults to 64) -- | |
| The feature dimension of the extracted Mel spectrograms. This corresponds to the number of mel filters | |
| (`n_mels`). | |
| - **sampling_rate** (`int`, *optional*, defaults to 48000) -- | |
| The sampling rate at which the audio files should be digitized, expressed in hertz (Hz). This only serves | |
| to warn users if the audio fed to the feature extractor does not have the same sampling rate. | |
| - **hop_length** (`int`, *optional*, defaults to 480) -- | |
| Length of the overlapping windows for the STFT used to obtain the Mel Spectrogram. The audio will be split | |
| in smaller `frames` with a step of `hop_length` between each frame. | |
| - **max_length_s** (`int`, *optional*, defaults to 10) -- | |
| The maximum input length of the model in seconds. This is used to pad the audio. | |
| - **fft_window_size** (`int`, *optional*, defaults to 1024) -- | |
| Size of the window (in samples) on which the Fourier transform is applied. This controls the frequency | |
| resolution of the spectrogram. For example, 400 means that the Fourier transform is computed on windows of 400 samples. | |
| - **padding_value** (`float`, *optional*, defaults to 0.0) -- | |
| Padding value used to pad the audio. Should correspond to silences. | |
| - **return_attention_mask** (`bool`, *optional*, defaults to `False`) -- | |
| Whether or not the model should return the attention masks corresponding to the input. | |
| - **frequency_min** (`float`, *optional*, defaults to 0) -- | |
| The lowest frequency of interest. The STFT will not be computed for values below this. | |
| - **frequency_max** (`float`, *optional*, defaults to 14000) -- | |
| The highest frequency of interest. The STFT will not be computed for values above this. | |
| - **top_db** (`float`, *optional*) -- | |
| The highest decibel value used to convert the mel spectrogram to the log scale. For more details see the | |
| `audio_utils.power_to_db` function | |
| - **truncation** (`str`, *optional*, defaults to `"fusion"`) -- | |
| Truncation pattern for long audio inputs. Two patterns are available: | |
| - `fusion` will use `_random_mel_fusion`, which stacks 3 random crops from the mel spectrogram and a | |
| downsampled version of the entire mel spectrogram. | |
| If `config.fusion` is set to True, shorter audios also need to return 4 mels, which will just be a copy | |
| of the original mel obtained from the padded audio. | |
| - `rand_trunc` will select a random crop of the mel spectrogram. | |
| - **padding** (`str`, *optional*, defaults to `"repeatpad"`) -- | |
| Padding pattern for shorter audio inputs. Three patterns were originally implemented: | |
| - `repeatpad`: the audio is repeated, and then padded to fit the `max_length`. | |
| - `repeat`: the audio is repeated and then cut to fit the `max_length` | |
| - `pad`: the audio is padded.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Constructs a CLAP feature extractor. | |
| This feature extractor inherits from [SequenceFeatureExtractor](/docs/transformers/pr_33892/en/main_classes/feature_extractor#transformers.SequenceFeatureExtractor) which contains | |
| most of the main methods. Users should refer to this superclass for more information regarding those methods. | |
| This class extracts mel-filter bank features from raw speech using a custom numpy implementation of the *Short Time | |
| Fourier Transform* (STFT) which should match pytorch's `torch.stft` equivalent. | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>to_dict</name><anchor>transformers.ClapFeatureExtractor.to_dict</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/clap/feature_extraction_clap.py#L139</source><parameters>[]</parameters><rettype>`dict[str, Any]`</rettype><retdesc>Dictionary of all the attributes that make up this configuration instance, except for the | |
| mel filter banks, which do not need to be saved or printed as they are too long.</retdesc></docstring> | |
| Serializes this instance to a Python dictionary. | |
| </div></div> | |
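The three `padding` patterns described for short inputs can be sketched in plain Python. This is an illustrative re-implementation based only on the descriptions above, not the library code. The function name `pad_short_audio` is hypothetical, and `max_length` here is a length in samples.

```python
def pad_short_audio(waveform, max_length, padding="repeatpad", padding_value=0.0):
    """Pad a waveform shorter than max_length using one of three patterns."""
    if not 0 < len(waveform) < max_length:
        raise ValueError("this sketch only handles non-empty inputs shorter than max_length")

    n_repeat = max_length // len(waveform)
    if padding == "repeat":
        # Repeat one extra time, then cut to exactly max_length samples.
        out = (list(waveform) * (n_repeat + 1))[:max_length]
    elif padding == "repeatpad":
        # Keep only whole copies; the remainder is filled with padding below.
        out = list(waveform) * n_repeat
    elif padding == "pad":
        out = list(waveform)
    else:
        raise ValueError(f"unknown padding pattern: {padding!r}")

    # Fill whatever is still missing with the padding value (silence).
    return out + [padding_value] * (max_length - len(out))


clip = [1.0, 2.0, 3.0]
print(pad_short_audio(clip, 8, "repeatpad"))
print(pad_short_audio(clip, 8, "repeat"))
print(pad_short_audio(clip, 8, "pad"))
```

With the documented defaults, a 10 second limit at 48000 Hz would correspond to 10 * 48000 = 480000 samples. `repeatpad` avoids cutting a copy of the clip mid-way, at the cost of some trailing silence.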
| ## ClapProcessor[[transformers.ClapProcessor]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class transformers.ClapProcessor</name><anchor>transformers.ClapProcessor</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/clap/processing_clap.py#L31</source><parameters>[{"name": "feature_extractor", "val": ""}, {"name": "tokenizer", "val": ""}]</parameters><paramsdesc>- **feature_extractor** ([ClapFeatureExtractor](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapFeatureExtractor)) -- | |
| The audio processor is a required input. | |
| - **tokenizer** ([RobertaTokenizerFast](/docs/transformers/pr_33892/en/model_doc/roberta#transformers.RobertaTokenizerFast)) -- | |
| The tokenizer is a required input.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Constructs a CLAP processor which wraps a CLAP feature extractor and a RoBERTa tokenizer into a single processor. | |
| [ClapProcessor](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapProcessor) offers all the functionalities of [ClapFeatureExtractor](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapFeatureExtractor) and [RobertaTokenizerFast](/docs/transformers/pr_33892/en/model_doc/roberta#transformers.RobertaTokenizerFast). See the | |
| `__call__()` and [decode()](/docs/transformers/pr_33892/en/main_classes/processors#transformers.ProcessorMixin.decode) for more information. | |
| </div> | |
| ## ClapModel[[transformers.ClapModel]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class transformers.ClapModel</name><anchor>transformers.ClapModel</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/clap/modeling_clap.py#L1507</source><parameters>[{"name": "config", "val": ": ClapConfig"}]</parameters><paramsdesc>- **config** ([ClapConfig](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapConfig)) -- | |
| Model configuration class with all the parameters of the model. Initializing with a config file does not | |
| load the weights associated with the model, only the configuration. Check out the | |
| [from_pretrained()](/docs/transformers/pr_33892/en/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| The bare Clap Model outputting raw hidden-states without any specific head on top. | |
| This model inherits from [PreTrainedModel](/docs/transformers/pr_33892/en/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads | |
| etc.). | |
| This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. | |
| Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage | |
| and behavior. | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>forward</name><anchor>transformers.ClapModel.forward</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/clap/modeling_clap.py#L1615</source><parameters>[{"name": "input_ids", "val": ": typing.Optional[torch.LongTensor] = None"}, {"name": "input_features", "val": ": typing.Optional[torch.FloatTensor] = None"}, {"name": "is_longer", "val": ": typing.Optional[torch.BoolTensor] = None"}, {"name": "attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "position_ids", "val": ": typing.Optional[torch.LongTensor] = None"}, {"name": "return_loss", "val": ": typing.Optional[bool] = None"}, {"name": "output_attentions", "val": ": typing.Optional[bool] = None"}, {"name": "output_hidden_states", "val": ": typing.Optional[bool] = None"}, {"name": "return_dict", "val": ": typing.Optional[bool] = None"}]</parameters><paramsdesc>- **input_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. | |
| Indices can be obtained using [AutoTokenizer](/docs/transformers/pr_33892/en/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](/docs/transformers/pr_33892/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode) and | |
| [PreTrainedTokenizer.__call__()](/docs/transformers/pr_33892/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__) for details. | |
| [What are input IDs?](../glossary#input-ids) | |
| - **input_features** (`torch.FloatTensor` of shape `(batch_size, sequence_length, feature_dim)`, *optional*) -- | |
| The tensors corresponding to the input audio features. Audio features can be obtained using | |
| [ClapFeatureExtractor](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapFeatureExtractor). See `ClapFeatureExtractor.__call__()` for details ([ClapProcessor](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapProcessor) uses | |
| [ClapFeatureExtractor](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapFeatureExtractor) for processing audios). | |
| - **is_longer** (`torch.BoolTensor` of shape `(batch_size, 1)`, *optional*) -- | |
| Whether the audio clip is longer than `max_length`. If `True`, a feature fusion will be enabled to enhance | |
| the features. | |
| - **attention_mask** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: | |
| - 1 for tokens that are **not masked**, | |
| - 0 for tokens that are **masked**. | |
| [What are attention masks?](../glossary#attention-mask) | |
| - **position_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`. | |
| [What are position IDs?](../glossary#position-ids) | |
| - **return_loss** (`bool`, *optional*) -- | |
| Whether or not to return the contrastive loss. | |
| - **output_attentions** (`bool`, *optional*) -- | |
| Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned | |
| tensors for more detail. | |
| - **output_hidden_states** (`bool`, *optional*) -- | |
| Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for | |
| more detail. | |
| - **return_dict** (`bool`, *optional*) -- | |
| Whether or not to return a [ModelOutput](/docs/transformers/pr_33892/en/main_classes/output#transformers.utils.ModelOutput) instead of a plain tuple.</paramsdesc><paramgroups>0</paramgroups><rettype>`transformers.models.clap.modeling_clap.ClapOutput` or `tuple(torch.FloatTensor)`</rettype><retdesc>A `transformers.models.clap.modeling_clap.ClapOutput` or a tuple of | |
| `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various | |
| elements depending on the configuration ([ClapConfig](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapConfig)) and inputs. | |
| - **loss** (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`) -- Contrastive loss for audio-text similarity. | |
| - **logits_per_audio** (`torch.FloatTensor` of shape `(audio_batch_size, text_batch_size)`) -- The scaled dot product scores between `audio_embeds` and `text_embeds`. This represents the audio-text | |
| similarity scores. | |
| - **logits_per_text** (`torch.FloatTensor` of shape `(text_batch_size, audio_batch_size)`) -- The scaled dot product scores between `text_embeds` and `audio_embeds`. This represents the text-audio | |
| similarity scores. | |
| - **text_embeds** (`torch.FloatTensor` of shape `(batch_size, output_dim)`) -- The text embeddings obtained by applying the projection layer to the pooled output of [ClapTextModel](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapTextModel). | |
| - **audio_embeds** (`torch.FloatTensor` of shape `(batch_size, output_dim)`) -- The audio embeddings obtained by applying the projection layer to the pooled output of [ClapAudioModel](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapAudioModel). | |
| - **text_model_output** (`BaseModelOutputWithPooling`, defaults to `None`) -- The output of the [ClapTextModel](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapTextModel). | |
| - **audio_model_output** (`BaseModelOutputWithPooling`, defaults to `None`) -- The output of the [ClapAudioModel](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapAudioModel).</retdesc></docstring> | |
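The `logits_per_audio`, `logits_per_text`, and `loss` fields described above can be illustrated with a small dependency-free sketch. It is not the library's implementation: the embeddings are hand-written toy values, the helper names are hypothetical, and a single constant `logit_scale` is assumed for both directions (the real model may learn its scaling factors).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def scaled_similarity(rows, cols, scale):
    """Return scale * (rows @ cols^T) for two lists of vectors."""
    return [[scale * sum(a * b for a, b in zip(r, c)) for c in cols] for r in rows]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


# Toy 2-d embeddings (hypothetical values, not model outputs).
# Pair i of audio and text is assumed to belong together.
audio_embeds = [l2_normalize(v) for v in [[1.0, 0.1], [0.2, 1.0]]]
text_embeds = [l2_normalize(v) for v in [[0.9, 0.2], [0.1, 0.8]]]
logit_scale = 10.0  # assumed constant shared by both directions

logits_per_audio = scaled_similarity(audio_embeds, text_embeds, logit_scale)
logits_per_text = scaled_similarity(text_embeds, audio_embeds, logit_scale)

# Symmetric contrastive loss: average of the audio-to-text and text-to-audio terms.
loss = (cross_entropy_diagonal(logits_per_audio) + cross_entropy_diagonal(logits_per_text)) / 2
print(f"contrastive loss: {loss:.4f}")
```

With a shared scale, `logits_per_text` is the transpose of `logits_per_audio`, which is why the two shapes in the return description mirror each other.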
| The [ClapModel](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapModel) forward method overrides the `__call__` special method. | |
| <Tip> | |
| Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while | |
| the latter silently ignores them. | |
| </Tip> | |
| <ExampleCodeBlock anchor="transformers.ClapModel.forward.example"> | |
| Examples: | |
| ```python | |
| >>> from datasets import load_dataset | |
| >>> from transformers import AutoProcessor, ClapModel | |
| >>> dataset = load_dataset("hf-internal-testing/ashraq-esc50-1-dog-example") | |
| >>> audio_sample = dataset["train"]["audio"][0]["array"] | |
| >>> model = ClapModel.from_pretrained("laion/clap-htsat-unfused") | |
| >>> processor = AutoProcessor.from_pretrained("laion/clap-htsat-unfused") | |
| >>> input_text = ["Sound of a dog", "Sound of vacuum cleaner"] | |
| >>> inputs = processor(text=input_text, audios=audio_sample, return_tensors="pt", padding=True) | |
| >>> outputs = model(**inputs) | |
| >>> logits_per_audio = outputs.logits_per_audio # this is the audio-text similarity score | |
| >>> probs = logits_per_audio.softmax(dim=-1) # we can take the softmax to get the label probabilities | |
| ``` | |
| </ExampleCodeBlock> | |
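The softmax step in the example above can be sketched without any dependencies. The logits below are made-up numbers standing in for one row of `logits_per_audio`; only the label strings come from the example.

```python
import math


def softmax(row):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["Sound of a dog", "Sound of vacuum cleaner"]

# Hypothetical logits for a single audio clip scored against both labels,
# laid out as (audio_batch_size, text_batch_size).
logits_per_audio = [[4.2, -1.3]]

probs = [softmax(row) for row in logits_per_audio]

# The predicted label is the one with the highest probability.
best = max(range(len(labels)), key=lambda j: probs[0][j])
predicted = labels[best]
print(predicted, probs[0])
```

Because softmax is applied along the text dimension, each audio clip receives its own probability distribution over the candidate labels.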
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>get_text_features</name><anchor>transformers.ClapModel.get_text_features</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/clap/modeling_clap.py#L1542</source><parameters>[{"name": "input_ids", "val": ": Tensor"}, {"name": "attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "position_ids", "val": ": typing.Optional[torch.Tensor] = None"}]</parameters><paramsdesc>- **input_ids** (`torch.Tensor` of shape `(batch_size, sequence_length)`) -- | |
| Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. | |
| Indices can be obtained using [AutoTokenizer](/docs/transformers/pr_33892/en/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](/docs/transformers/pr_33892/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode) and | |
| [PreTrainedTokenizer.__call__()](/docs/transformers/pr_33892/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__) for details. | |
| [What are input IDs?](../glossary#input-ids) | |
| - **attention_mask** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: | |
| - 1 for tokens that are **not masked**, | |
| - 0 for tokens that are **masked**. | |
| [What are attention masks?](../glossary#attention-mask) | |
| - **position_ids** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Indices of the position of each input sequence token in the position embeddings. Selected in the range `[0, config.max_position_embeddings - 1]`. | |
| [What are position IDs?](../glossary#position-ids)</paramsdesc><paramgroups>0</paramgroups><rettype>text_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`)</rettype><retdesc>The text embeddings obtained by | |
| applying the projection layer to the pooled output of [ClapTextModel](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapTextModel).</retdesc></docstring> | |
| <ExampleCodeBlock anchor="transformers.ClapModel.get_text_features.example"> | |
| Examples: | |
| ```python | |
| >>> import torch | |
| >>> from transformers import AutoTokenizer, ClapModel | |
| >>> model = ClapModel.from_pretrained("laion/clap-htsat-unfused") | |
| >>> tokenizer = AutoTokenizer.from_pretrained("laion/clap-htsat-unfused") | |
| >>> inputs = tokenizer(["the sound of a cat", "the sound of a dog"], padding=True, return_tensors="pt") | |
| >>> with torch.inference_mode(): | |
| ... text_features = model.get_text_features(**inputs) | |
| ``` | |
| </ExampleCodeBlock> | |
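One common use of the returned text embeddings is ranking captions against a query embedding. The sketch below shows the idea with cosine similarity in plain Python. The 3-d vectors are hypothetical stand-ins for real embeddings of shape `(batch_size, output_dim)`, and `query` stands in for an embedding from the other modality.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings keyed by caption (not real model outputs).
text_features = {
    "the sound of a cat": [0.9, 0.1, 0.0],
    "the sound of a dog": [0.1, 0.9, 0.1],
}

# Hypothetical query embedding, e.g. for an audio clip.
query = [0.8, 0.2, 0.1]

# Rank captions from most to least similar to the query.
ranked = sorted(
    text_features,
    key=lambda caption: cosine_similarity(query, text_features[caption]),
    reverse=True,
)
print(ranked)
```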
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>get_audio_features</name><anchor>transformers.ClapModel.get_audio_features</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/clap/modeling_clap.py#L1576</source><parameters>[{"name": "input_features", "val": ": Tensor"}, {"name": "is_longer", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}]</parameters><paramsdesc>- **input_features** (`torch.Tensor` of shape `(batch_size, sequence_length, feature_dim)`) -- | |
| The tensors corresponding to the input audio features. Audio features can be obtained using | |
| [ClapFeatureExtractor](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapFeatureExtractor). See `ClapFeatureExtractor.__call__()` for details ([ClapProcessor](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapProcessor) uses | |
| [ClapFeatureExtractor](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapFeatureExtractor) for processing audios). | |
| - **is_longer** (`torch.FloatTensor`, of shape `(batch_size, 1)`, *optional*) -- | |
| Whether the audio clip is longer than `max_length`. If `True`, a feature fusion will be enabled to enhance | |
| the features. | |
| - **attention_mask** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: | |
| - 1 for tokens that are **not masked**, | |
| - 0 for tokens that are **masked**. | |
| [What are attention masks?](../glossary#attention-mask)</paramsdesc><paramgroups>0</paramgroups><rettype>audio_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`)</rettype><retdesc>The audio embeddings obtained by | |
| applying the projection layer to the pooled output of [ClapAudioModel](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapAudioModel).</retdesc></docstring> | |
| <ExampleCodeBlock anchor="transformers.ClapModel.get_audio_features.example"> | |
| Examples: | |
| ```python | |
| >>> import torch | |
| >>> from transformers import AutoFeatureExtractor, ClapModel | |
| >>> model = ClapModel.from_pretrained("laion/clap-htsat-unfused") | |
| >>> feature_extractor = AutoFeatureExtractor.from_pretrained("laion/clap-htsat-unfused") | |
| >>> random_audio = torch.rand((16_000)) | |
| >>> inputs = feature_extractor(random_audio, return_tensors="pt") | |
| >>> with torch.inference_mode(): | |
| ... audio_features = model.get_audio_features(**inputs) | |
| ``` | |
| </ExampleCodeBlock> | |
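The `is_longer` flag described above marks clips that exceed the maximum length. A minimal sketch of how such a flag could be derived from raw waveform lengths follows. The helper name and the default values (48 kHz sampling rate, 10-second limit) are assumptions for illustration; in practice the feature extractor produces this tensor and also handles padding and truncation.

```python
def is_longer_flags(waveform_lengths, sampling_rate=48_000, max_length_s=10):
    """Return one [bool] per clip, mirroring the (batch_size, 1) shape of `is_longer`.

    `sampling_rate` and `max_length_s` are assumed defaults, not values read
    from a checkpoint.
    """
    max_samples = sampling_rate * max_length_s
    return [[n_samples > max_samples] for n_samples in waveform_lengths]


# Three hypothetical clips: short, exactly at the limit, and over the limit.
flags = is_longer_flags([16_000, 480_000, 600_000])
print(flags)
```

Only the last clip exceeds the assumed limit of 480,000 samples, so only it would be a candidate for feature fusion.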
| </div></div> | |
| ## ClapTextModel[[transformers.ClapTextModel]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class transformers.ClapTextModel</name><anchor>transformers.ClapTextModel</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/clap/modeling_clap.py#L1409</source><parameters>[{"name": "config", "val": ""}, {"name": "add_pooling_layer", "val": " = True"}]</parameters><paramsdesc>- **config** ([ClapTextConfig](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapTextConfig)) -- | |
| Model configuration class with all the parameters of the model. Initializing with a config file does not | |
| load the weights associated with the model, only the configuration. Check out the | |
| [from_pretrained()](/docs/transformers/pr_33892/en/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights. | |
| - **add_pooling_layer** (`bool`, *optional*, defaults to `True`) -- | |
| Whether to add a pooling layer</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of | |
| cross-attention is added between the self-attention layers, following the architecture described in | |
| [Attention is all you need](https://huggingface.co/papers/1706.03762) by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz | |
| Kaiser and Illia Polosukhin. | |
| To behave as a decoder, the model needs to be initialized with the `is_decoder` argument of the configuration set | |
| to `True`. To be used in a Seq2Seq model, the model needs to be initialized with both the `is_decoder` argument and | |
| `add_cross_attention` set to `True`; `encoder_hidden_states` is then expected as an input to the forward pass. | |
| This model inherits from [PreTrainedModel](/docs/transformers/pr_33892/en/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, | |
| etc.). | |
| This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. | |
| Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage | |
| and behavior. | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>forward</name><anchor>transformers.ClapTextModel.forward</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/clap/modeling_clap.py#L1435</source><parameters>[{"name": "input_ids", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "token_type_ids", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "position_ids", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "inputs_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "output_attentions", "val": ": typing.Optional[bool] = None"}, {"name": "output_hidden_states", "val": ": typing.Optional[bool] = None"}, {"name": "return_dict", "val": ": typing.Optional[bool] = None"}]</parameters><paramsdesc>- **input_ids** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. | |
| Indices can be obtained using [AutoTokenizer](/docs/transformers/pr_33892/en/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](/docs/transformers/pr_33892/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode) and | |
| [PreTrainedTokenizer.__call__()](/docs/transformers/pr_33892/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__) for details. | |
| [What are input IDs?](../glossary#input-ids) | |
| - **attention_mask** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: | |
| - 1 for tokens that are **not masked**, | |
| - 0 for tokens that are **masked**. | |
| [What are attention masks?](../glossary#attention-mask) | |
| - **token_type_ids** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`: | |
| - 0 corresponds to a *sentence A* token, | |
| - 1 corresponds to a *sentence B* token. | |
| [What are token type IDs?](../glossary#token-type-ids) | |
| - **position_ids** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Indices of the position of each input sequence token in the position embeddings. Selected in the range `[0, config.max_position_embeddings - 1]`. | |
| [What are position IDs?](../glossary#position-ids) | |
| - **inputs_embeds** (`torch.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*) -- | |
| Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This | |
| is useful if you want more control over how to convert `input_ids` indices into associated vectors than the | |
| model's internal embedding lookup matrix. | |
| - **output_attentions** (`bool`, *optional*) -- | |
| Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned | |
| tensors for more detail. | |
| - **output_hidden_states** (`bool`, *optional*) -- | |
| Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for | |
| more detail. | |
| - **return_dict** (`bool`, *optional*) -- | |
| Whether or not to return a [ModelOutput](/docs/transformers/pr_33892/en/main_classes/output#transformers.utils.ModelOutput) instead of a plain tuple.</paramsdesc><paramgroups>0</paramgroups><rettype>[transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions](/docs/transformers/pr_33892/en/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions) or `tuple(torch.FloatTensor)`</rettype><retdesc>A [transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions](/docs/transformers/pr_33892/en/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions) or a tuple of | |
| `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various | |
| elements depending on the configuration ([ClapConfig](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapConfig)) and inputs. | |
| - **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`) -- Sequence of hidden-states at the output of the last layer of the model. | |
| - **pooler_output** (`torch.FloatTensor` of shape `(batch_size, hidden_size)`) -- Last layer hidden-state of the first token of the sequence (classification token) after further processing | |
| through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns | |
| the classification token after processing through a linear layer and a tanh activation function. The linear | |
| layer weights are trained from the next sentence prediction (classification) objective during pretraining. | |
| - **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + | |
| one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. | |
| Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. | |
| - **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, | |
| sequence_length)`. | |
| Attentions weights after the attention softmax, used to compute the weighted average in the self-attention | |
| heads. | |
| - **cross_attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` and `config.add_cross_attention=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, | |
| sequence_length)`. | |
| Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the | |
| weighted average in the cross-attention heads. | |
| - **past_key_values** (`Cache`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`) -- It is a [Cache](/docs/transformers/pr_33892/en/internal/generation_utils#transformers.Cache) instance. For more details, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache). | |
| Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if | |
| `config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values` | |
| input) to speed up sequential decoding.</retdesc></docstring> | |
| The [ClapTextModel](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapTextModel) forward method overrides the `__call__` special method. | |
| <Tip> | |
| Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while | |
| the latter silently ignores them. | |
| </Tip> | |
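The relationship between padded `input_ids` and the `attention_mask` parameter documented above can be sketched in plain Python. The token ids and the `pad_token_id` value are hypothetical; a real tokenizer called with `padding=True` returns both tensors for you.

```python
def pad_batch(token_id_lists, pad_token_id=1):
    """Right-pad variable-length id lists and build the matching attention mask.

    `pad_token_id=1` is an assumed value for illustration only.
    """
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks real tokens (not masked), 0 marks padding (masked).
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Two hypothetical tokenized sentences of different lengths.
input_ids, attention_mask = pad_batch([[0, 5, 6, 7, 2], [0, 8, 2]])
print(input_ids)
print(attention_mask)
```

Both outputs have shape `(batch_size, sequence_length)`, matching the shapes listed for `input_ids` and `attention_mask`.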
| </div></div> | |
| ## ClapTextModelWithProjection[[transformers.ClapTextModelWithProjection]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class transformers.ClapTextModelWithProjection</name><anchor>transformers.ClapTextModelWithProjection</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/clap/modeling_clap.py#L1714</source><parameters>[{"name": "config", "val": ": ClapTextConfig"}]</parameters><paramsdesc>- **config** ([ClapTextConfig](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapTextConfig)) -- | |
| Model configuration class with all the parameters of the model. Initializing with a config file does not | |
| load the weights associated with the model, only the configuration. Check out the | |
| [from_pretrained()](/docs/transformers/pr_33892/en/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| The CLAP text model with a projection layer on top (a linear layer on top of the pooled output). | |
| This model inherits from [PreTrainedModel](/docs/transformers/pr_33892/en/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, | |
| etc.). | |
| This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. | |
| Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage | |
| and behavior. | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>forward</name><anchor>transformers.ClapTextModelWithProjection.forward</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/clap/modeling_clap.py#L1731</source><parameters>[{"name": "input_ids", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "position_ids", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "output_attentions", "val": ": typing.Optional[bool] = None"}, {"name": "output_hidden_states", "val": ": typing.Optional[bool] = None"}, {"name": "return_dict", "val": ": typing.Optional[bool] = None"}]</parameters><paramsdesc>- **input_ids** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. | |
| Indices can be obtained using [AutoTokenizer](/docs/transformers/pr_33892/en/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](/docs/transformers/pr_33892/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode) and | |
| [PreTrainedTokenizer.__call__()](/docs/transformers/pr_33892/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__) for details. | |
| [What are input IDs?](../glossary#input-ids) | |
| - **attention_mask** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: | |
| - 1 for tokens that are **not masked**, | |
| - 0 for tokens that are **masked**. | |
| [What are attention masks?](../glossary#attention-mask) | |
| - **position_ids** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Indices of the position of each input sequence token in the position embeddings. Selected in the range `[0, config.max_position_embeddings - 1]`. | |
| [What are position IDs?](../glossary#position-ids) | |
| - **output_attentions** (`bool`, *optional*) -- | |
| Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned | |
| tensors for more detail. | |
| - **output_hidden_states** (`bool`, *optional*) -- | |
| Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for | |
| more detail. | |
| - **return_dict** (`bool`, *optional*) -- | |
| Whether or not to return a [ModelOutput](/docs/transformers/pr_33892/en/main_classes/output#transformers.utils.ModelOutput) instead of a plain tuple.</paramsdesc><paramgroups>0</paramgroups><rettype>`transformers.models.clap.modeling_clap.ClapTextModelOutput` or `tuple(torch.FloatTensor)`</rettype><retdesc>A `transformers.models.clap.modeling_clap.ClapTextModelOutput` or a tuple of | |
| `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various | |
| elements depending on the configuration ([ClapConfig](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapConfig)) and inputs. | |
| - **text_embeds** (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when model is initialized with `with_projection=True`) -- The text embeddings obtained by applying the projection layer to the `pooler_output`. | |
| - **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*, defaults to `None`) -- Sequence of hidden-states at the output of the last layer of the model. | |
| - **hidden_states** (`tuple[torch.FloatTensor, ...]`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + | |
| one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. | |
| Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. | |
| - **attentions** (`tuple[torch.FloatTensor, ...]`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, | |
| sequence_length)`. | |
| Attentions weights after the attention softmax, used to compute the weighted average in the self-attention | |
| heads.</retdesc></docstring> | |
| The [ClapTextModelWithProjection](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapTextModelWithProjection) forward method overrides the `__call__` special method. | |
| <Tip> | |
| Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while | |
| the latter silently ignores them. | |
| </Tip> | |
| <ExampleCodeBlock anchor="transformers.ClapTextModelWithProjection.forward.example"> | |
| Examples: | |
| ```python | |
| >>> from transformers import AutoTokenizer, ClapTextModelWithProjection | |
| >>> model = ClapTextModelWithProjection.from_pretrained("laion/clap-htsat-unfused") | |
| >>> tokenizer = AutoTokenizer.from_pretrained("laion/clap-htsat-unfused") | |
| >>> inputs = tokenizer(["a sound of a cat", "a sound of a dog"], padding=True, return_tensors="pt") | |
| >>> outputs = model(**inputs) | |
| >>> text_embeds = outputs.text_embeds | |
| ``` | |
| </ExampleCodeBlock> | |
| </div></div> | |
| ## ClapAudioModel[[transformers.ClapAudioModel]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class transformers.ClapAudioModel</name><anchor>transformers.ClapAudioModel</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/clap/modeling_clap.py#L1335</source><parameters>[{"name": "config", "val": ": ClapAudioConfig"}]</parameters></docstring> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>forward</name><anchor>transformers.ClapAudioModel.forward</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/clap/modeling_clap.py#L1349</source><parameters>[{"name": "input_features", "val": ": typing.Optional[torch.FloatTensor] = None"}, {"name": "is_longer", "val": ": typing.Optional[torch.BoolTensor] = None"}, {"name": "output_attentions", "val": ": typing.Optional[bool] = None"}, {"name": "output_hidden_states", "val": ": typing.Optional[bool] = None"}, {"name": "return_dict", "val": ": typing.Optional[bool] = None"}]</parameters><paramsdesc>- **input_features** (`torch.FloatTensor` of shape `(batch_size, sequence_length, feature_dim)`, *optional*) -- | |
| The tensors corresponding to the input audio features. Audio features can be obtained using | |
| [ClapFeatureExtractor](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapFeatureExtractor). See `ClapFeatureExtractor.__call__()` for details ([ClapProcessor](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapProcessor) uses | |
| [ClapFeatureExtractor](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapFeatureExtractor) for processing audios). | |
| - **is_longer** (`torch.FloatTensor`, of shape `(batch_size, 1)`, *optional*) -- | |
| Whether the audio clip is longer than `max_length`. If `True`, a feature fusion will be enabled to enhance | |
| the features. | |
| - **output_attentions** (`bool`, *optional*) -- | |
| Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned | |
| tensors for more detail. | |
| - **output_hidden_states** (`bool`, *optional*) -- | |
| Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for | |
| more detail. | |
| - **return_dict** (`bool`, *optional*) -- | |
| Whether or not to return a [ModelOutput](/docs/transformers/pr_33892/en/main_classes/output#transformers.utils.ModelOutput) instead of a plain tuple.</paramsdesc><paramgroups>0</paramgroups><rettype>[transformers.modeling_outputs.BaseModelOutputWithPooling](/docs/transformers/pr_33892/en/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or `tuple(torch.FloatTensor)`</rettype><retdesc>A [transformers.modeling_outputs.BaseModelOutputWithPooling](/docs/transformers/pr_33892/en/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or a tuple of | |
| `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various | |
| elements depending on the configuration ([ClapConfig](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapConfig)) and inputs. | |
| - **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`) -- Sequence of hidden-states at the output of the last layer of the model. | |
| - **pooler_output** (`torch.FloatTensor` of shape `(batch_size, hidden_size)`) -- Last layer hidden-state of the first token of the sequence (classification token) after further processing | |
| through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns | |
| the classification token after processing through a linear layer and a tanh activation function. The linear | |
| layer weights are trained from the next sentence prediction (classification) objective during pretraining. | |
| - **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + | |
| one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. | |
| Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. | |
| - **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, | |
| sequence_length)`. | |
| Attentions weights after the attention softmax, used to compute the weighted average in the self-attention | |
| heads.</retdesc></docstring> | |
| The [ClapAudioModel](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapAudioModel) forward method overrides the `__call__` special method. | |
| <Tip> | |
| Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while | |
| the latter silently ignores them. | |
| </Tip> | |
| <ExampleCodeBlock anchor="transformers.ClapAudioModel.forward.example"> | |
| Examples: | |
| ```python | |
| >>> from datasets import load_dataset | |
| >>> from transformers import AutoProcessor, ClapAudioModel | |
| >>> dataset = load_dataset("hf-internal-testing/ashraq-esc50-1-dog-example") | |
| >>> audio_sample = dataset["train"]["audio"][0]["array"] | |
| >>> model = ClapAudioModel.from_pretrained("laion/clap-htsat-fused") | |
| >>> processor = AutoProcessor.from_pretrained("laion/clap-htsat-fused") | |
| >>> inputs = processor(audios=audio_sample, return_tensors="pt") | |
| >>> outputs = model(**inputs) | |
| >>> last_hidden_state = outputs.last_hidden_state | |
| ``` | |
| </ExampleCodeBlock> | |
| </div></div> | |
| ## ClapAudioModelWithProjection[[transformers.ClapAudioModelWithProjection]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class transformers.ClapAudioModelWithProjection</name><anchor>transformers.ClapAudioModelWithProjection</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/clap/modeling_clap.py#L1780</source><parameters>[{"name": "config", "val": ": ClapAudioConfig"}]</parameters><paramsdesc>- **config** ([ClapAudioConfig](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapAudioConfig)) -- | |
| Model configuration class with all the parameters of the model. Initializing with a config file does not | |
| load the weights associated with the model, only the configuration. Check out the | |
| [from_pretrained()](/docs/transformers/pr_33892/en/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| The CLAP audio model with a projection layer on top (a linear layer on top of the pooled output). | |
| This model inherits from [PreTrainedModel](/docs/transformers/pr_33892/en/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving, resizing the input embeddings, and pruning | |
| heads). | |
| This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. | |
| Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage | |
| and behavior. | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>forward</name><anchor>transformers.ClapAudioModelWithProjection.forward</anchor><source>https://github.com/huggingface/transformers/blob/vr_33892/src/transformers/models/clap/modeling_clap.py#L1795</source><parameters>[{"name": "input_features", "val": ": typing.Optional[torch.FloatTensor] = None"}, {"name": "is_longer", "val": ": typing.Optional[torch.BoolTensor] = None"}, {"name": "output_attentions", "val": ": typing.Optional[bool] = None"}, {"name": "output_hidden_states", "val": ": typing.Optional[bool] = None"}, {"name": "return_dict", "val": ": typing.Optional[bool] = None"}]</parameters><paramsdesc>- **input_features** (`torch.FloatTensor` of shape `(batch_size, sequence_length, feature_dim)`, *optional*) -- | |
| The tensors corresponding to the input audio features. Audio features can be obtained using | |
| [ClapFeatureExtractor](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapFeatureExtractor). See `ClapFeatureExtractor.__call__()` for details ([ClapProcessor](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapProcessor) uses | |
| [ClapFeatureExtractor](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapFeatureExtractor) for processing audios). | |
| - **is_longer** (`torch.BoolTensor` of shape `(batch_size, 1)`, *optional*) -- | |
| Whether the audio clip is longer than `max_length`. If `True`, feature fusion is enabled to enhance | |
| the features. | |
| - **output_attentions** (`bool`, *optional*) -- | |
| Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned | |
| tensors for more detail. | |
| - **output_hidden_states** (`bool`, *optional*) -- | |
| Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for | |
| more detail. | |
| - **return_dict** (`bool`, *optional*) -- | |
| Whether or not to return a [ModelOutput](/docs/transformers/pr_33892/en/main_classes/output#transformers.utils.ModelOutput) instead of a plain tuple.</paramsdesc><paramgroups>0</paramgroups><rettype>`transformers.models.clap.modeling_clap.ClapAudioModelOutput` or `tuple(torch.FloatTensor)`</rettype><retdesc>A `transformers.models.clap.modeling_clap.ClapAudioModelOutput` or a tuple of | |
| `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various | |
| elements depending on the configuration ([ClapConfig](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapConfig)) and inputs. | |
| - **audio_embeds** (`torch.FloatTensor` of shape `(batch_size, hidden_size)`) -- The audio embeddings obtained by applying the projection layer to the `pooler_output`. | |
| - **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*, defaults to `None`) -- Sequence of hidden-states at the output of the last layer of the model. | |
| - **hidden_states** (`tuple[torch.FloatTensor, ...]`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + | |
| one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. | |
| Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. | |
| - **attentions** (`tuple[torch.FloatTensor, ...]`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, | |
| sequence_length)`. | |
| Attention weights after the attention softmax, used to compute the weighted average in the self-attention | |
| heads.</retdesc></docstring> | |
| The [ClapAudioModelWithProjection](/docs/transformers/pr_33892/en/model_doc/clap#transformers.ClapAudioModelWithProjection) forward method overrides the `__call__` special method. | |
| <Tip> | |
| Although the forward pass is defined within this function, call the `Module` instance rather than this method | |
| directly: calling the instance runs the pre- and post-processing steps, while calling `forward` directly silently | |
| skips them. | |
| </Tip> | |
| <ExampleCodeBlock anchor="transformers.ClapAudioModelWithProjection.forward.example"> | |
| Examples: | |
| ```python | |
| >>> from datasets import load_dataset | |
| >>> from transformers import ClapAudioModelWithProjection, ClapProcessor | |
| >>> model = ClapAudioModelWithProjection.from_pretrained("laion/clap-htsat-fused") | |
| >>> processor = ClapProcessor.from_pretrained("laion/clap-htsat-fused") | |
| >>> dataset = load_dataset("hf-internal-testing/ashraq-esc50-1-dog-example") | |
| >>> audio_sample = dataset["train"]["audio"][0]["array"] | |
| >>> inputs = processor(audios=audio_sample, return_tensors="pt") | |
| >>> outputs = model(**inputs) | |
| >>> audio_embeds = outputs.audio_embeds | |
| ``` | |
| </ExampleCodeBlock> | |
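Because CLAP is trained contrastively, a typical use of `audio_embeds` is to compare them with text embeddings by cosine similarity. The following is a minimal sketch under stated assumptions: random tensors stand in for real embeddings, the embedding width of 512 is illustrative only, and `cosine_similarity_matrix` is a hypothetical helper rather than part of the Transformers API.

```python
import torch
import torch.nn.functional as F


def cosine_similarity_matrix(audio_embeds, text_embeds):
    """Return an (n_audio, n_text) matrix of cosine similarities."""
    # L2-normalize each row so the dot product equals the cosine similarity.
    audio_embeds = F.normalize(audio_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    return audio_embeds @ text_embeds.T


torch.manual_seed(0)
audio_embeds = torch.randn(2, 512)  # stand-in for outputs.audio_embeds (assumption)
text_embeds = torch.randn(3, 512)   # stand-in for projected text embeddings (assumption)

similarity = cosine_similarity_matrix(audio_embeds, text_embeds)
best_text = similarity.argmax(dim=-1)  # index of the closest text for each audio clip

print(similarity.shape)
print(best_text.shape)
```

With real inputs, the text embeddings would need to come from the same checkpoint so that both sets of vectors share one projection space.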
| </div></div> | |