YoussefKejue committed
Commit 706f16b · verified · 1 Parent(s): cd16798

Upload UltravoxPipeline

README.md ADDED
@@ -0,0 +1,199 @@
+---
+library_name: transformers
+tags: []
+---
+
+# Model Card for Model ID
+
+<!-- Provide a quick summary of what the model is/does. -->
+
+
+
+## Model Details
+
+### Model Description
+
+<!-- Provide a longer summary of what this model is. -->
+
+This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+
+### Model Sources [optional]
+
+<!-- Provide the basic links for the model. -->
+
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+
+## Uses
+
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+### Direct Use
+
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+[More Information Needed]
+
+### Downstream Use [optional]
+
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+[More Information Needed]
+
+### Out-of-Scope Use
+
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+[More Information Needed]
+
+## Bias, Risks, and Limitations
+
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+[More Information Needed]
+
+### Recommendations
+
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+## How to Get Started with the Model
+
+Use the code below to get started with the model.
+
+[More Information Needed]
+
+## Training Details
+
+### Training Data
+
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+[More Information Needed]
+
+### Training Procedure
+
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+#### Preprocessing [optional]
+
+[More Information Needed]
+
+
+#### Training Hyperparameters
+
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+#### Speeds, Sizes, Times [optional]
+
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+[More Information Needed]
+
+## Evaluation
+
+<!-- This section describes the evaluation protocols and provides the results. -->
+
+### Testing Data, Factors & Metrics
+
+#### Testing Data
+
+<!-- This should link to a Dataset Card if possible. -->
+
+[More Information Needed]
+
+#### Factors
+
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+[More Information Needed]
+
+#### Metrics
+
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+[More Information Needed]
+
+### Results
+
+[More Information Needed]
+
+#### Summary
+
+
+
+## Model Examination [optional]
+
+<!-- Relevant interpretability work for the model goes here -->
+
+[More Information Needed]
+
+## Environmental Impact
+
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+
+## Technical Specifications [optional]
+
+### Model Architecture and Objective
+
+[More Information Needed]
+
+### Compute Infrastructure
+
+[More Information Needed]
+
+#### Hardware
+
+[More Information Needed]
+
+#### Software
+
+[More Information Needed]
+
+## Citation [optional]
+
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+**BibTeX:**
+
+[More Information Needed]
+
+**APA:**
+
+[More Information Needed]
+
+## Glossary [optional]
+
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+[More Information Needed]
+
+## More Information [optional]
+
+[More Information Needed]
+
+## Model Card Authors [optional]
+
+[More Information Needed]
+
+## Model Card Contact
+
+[More Information Needed]
config.json CHANGED
@@ -4,24 +4,21 @@
   ],
   "audio_latency_block_size": null,
   "audio_model_id": "openai/whisper-large-v3-turbo",
-  "audio_model_lora_config": {
-    "lora_alpha": 8,
-    "r": 0,
-    "target_modules": [
-      "k_proj",
-      "q_proj",
-      "linear_k",
-      "linear_q"
-    ]
-  },
   "audio_token_index": 151669,
   "auto_map": {
     "AutoConfig": "ultravox_config.UltravoxConfig",
-    "AutoModel": "ultravox_model.UltravoxModel",
-    "AutoProcessor": "ultravox_processing.UltravoxProcessor"
+    "AutoModel": "ultravox_model.UltravoxModel"
+  },
+  "custom_pipelines": {
+    "ultravox-pipeline": {
+      "impl": "ultravox_pipeline.UltravoxPipeline",
+      "pt": [
+        "AutoModel"
+      ],
+      "tf": [],
+      "type": "multimodal"
+    }
   },
-  "dtype": "bfloat16",
-  "eos_token_id": 151645,
   "hidden_size": 4096,
   "ignore_index": -100,
   "initializer_range": 0.02,
@@ -33,16 +30,7 @@
   "projector_ln_mid": true,
   "stack_factor": 8,
   "text_model_id": "Qwen/Qwen3-32B",
-  "text_model_lora_config": {
-    "lora_alpha": 8,
-    "r": 0,
-    "target_modules": [
-      "k_proj",
-      "q_proj",
-      "linear_k",
-      "linear_q"
-    ]
-  },
-  "transformers_version": "4.57.6",
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.51.3",
   "vocab_size": 151936
-}
+}
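The new `custom_pipelines` entry is what lets `transformers.pipeline(...)` discover `UltravoxPipeline` from this repository once remote code is enabled. A minimal sketch of how a consumer would pick it up (the repository id below is a placeholder, not the actual Hub path):

```python
import transformers

# Hypothetical repo id; substitute the repository this config.json belongs to.
pipe = transformers.pipeline(
    model="your-namespace/your-ultravox-repo",
    trust_remote_code=True,  # required so ultravox_pipeline.UltravoxPipeline is loaded from the repo
)
print(type(pipe).__name__)  # expected to be UltravoxPipeline
```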
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:1e8c23828793ed1ed3d7f19be86b2a8b0aaa9349bc037aaab5fa6cbc49b1b023
-size 1378876648
+oid sha256:f0f539fc56c7210733c76cec906a33fb283048ba9f916fdc6ddc7160fa13255f
+size 104882656
special_tokens_map.json CHANGED
@@ -15,5 +15,11 @@
     "rstrip": false,
     "single_word": false
   },
-  "pad_token": "<|im_end|>"
+  "pad_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
 }
ultravox_pipeline.py ADDED
@@ -0,0 +1,133 @@
+import logging
+from typing import Any, Dict, List, Optional
+
+import numpy as np
+import transformers
+
+# We must use relative import in this directory to allow uploading to HF Hub
+# Even "from . import X" pattern doesn't work (undocumented and unclear why)
+from .ultravox_model import UltravoxModel
+from .ultravox_processing import UltravoxProcessor
+from .ultravox_tokenizer import from_pretrained_text_tokenizer
+from .ultravox_tokenizer import get_audio_token_id
+
+
+class UltravoxPipeline(transformers.Pipeline):
+    def __init__(
+        self,
+        model: UltravoxModel,
+        tokenizer: Optional[transformers.PreTrainedTokenizerBase] = None,
+        audio_processor: Optional[transformers.ProcessorMixin] = None,
+        chat_template: Optional[str] = None,
+        **kwargs
+    ):
+        if tokenizer is None:
+            try:
+                tokenizer = from_pretrained_text_tokenizer(model.config._name_or_path)
+            except:  # noqa: E722
+                tokenizer = from_pretrained_text_tokenizer(
+                    model.config.text_model_id or model.config.text_config._name_or_path
+                )
+        if chat_template:
+            tokenizer.chat_template = chat_template
+
+        model.config.audio_token_index = get_audio_token_id(tokenizer)
+
+        if audio_processor is None:
+            audio_processor = transformers.AutoProcessor.from_pretrained(
+                model.config.audio_model_id or model.config.audio_config._name_or_path
+            )
+
+        super().__init__(model=model, tokenizer=tokenizer, **kwargs)
+
+        self.processor = UltravoxProcessor(
+            audio_processor=audio_processor,
+            tokenizer=tokenizer,
+            stack_factor=model.config.stack_factor,
+            audio_context_size=model.audio_tower_context_length,
+        )
+
+    def _sanitize_parameters(self, **kwargs):
+        generation_keys = ["temperature", "max_new_tokens", "repetition_penalty"]
+        generation_kwargs = {k: kwargs[k] for k in kwargs if k in generation_keys}
+        return {}, generation_kwargs, {}
+
+    def preprocess(self, inputs: Dict[str, Any]):
+        turns: list = inputs.get("turns", [])
+
+        audio = inputs.get("audio", None)
+        # Convert to float32 if needed.
+        if isinstance(audio, np.ndarray):
+            if audio.dtype == np.float64:
+                audio = audio.astype(np.float32)
+            elif audio.dtype == np.int16:
+                audio = audio.astype(np.float32) / np.float32(32768.0)
+            elif audio.dtype == np.int32:
+                audio = audio.astype(np.float32) / np.float32(2147483648.0)
+
+        if audio is not None and (len(turns) == 0 or turns[-1]["role"] != "user"):
+            prompt = inputs.get("prompt", "<|audio|>")
+            if "<|audio|>" not in prompt:
+                logging.warning(
+                    "Prompt does not contain '<|audio|>', appending '<|audio|>' to the end of the prompt."
+                )
+
+                prompt += " <|audio|>"
+            turns.append({"role": "user", "content": prompt})
+
+        text = self.processor.tokenizer.apply_chat_template(
+            turns, add_generation_prompt=True, tokenize=False
+        )
+
+        if "sampling_rate" not in inputs and audio is not None:
+            logging.warning(
+                "No sampling rate provided, using default of 16kHz. We highly recommend providing the correct sampling rate."
+            )
+
+        output = self.processor(
+            text=text,
+            audio=audio,
+            sampling_rate=inputs.get("sampling_rate", 16000),
+        )
+        if "audio_values" in output:
+            output["audio_values"] = output["audio_values"].to(self.model.dtype)
+
+        return output
+
+    def _forward(
+        self,
+        model_inputs: Dict[str, Any],
+        temperature: Optional[float] = None,
+        max_new_tokens: Optional[int] = None,
+        repetition_penalty: float = 1.1,
+    ) -> List[int]:
+        temperature = temperature or None
+        do_sample = temperature is not None
+
+        terminators = [self.tokenizer.eos_token_id]
+        if "<|eot_id|>" in self.tokenizer.added_tokens_encoder:
+            terminators.append(self.tokenizer.convert_tokens_to_ids("<|eot_id|>"))
+
+        input_len = model_inputs["input_ids"].shape[1]
+
+        outputs = self.model.generate(
+            **model_inputs,
+            do_sample=do_sample,
+            temperature=temperature,
+            max_new_tokens=max_new_tokens,
+            repetition_penalty=repetition_penalty,
+            eos_token_id=terminators
+        )
+        return outputs[0][input_len:]
+
+    def postprocess(self, model_outputs) -> str:
+        output_text = self.tokenizer.decode(model_outputs, skip_special_tokens=True)
+        return output_text
+
+
+transformers.pipelines.PIPELINE_REGISTRY.register_pipeline(
+    "ultravox-pipeline",
+    pipeline_class=UltravoxPipeline,
+    pt_model=transformers.AutoModel,
+    type="multimodal",
+)
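As a point of reference, `preprocess()` expects a single dict with optional `audio`, `sampling_rate`, `turns`, and `prompt` keys, and `_sanitize_parameters()` forwards only `temperature`, `max_new_tokens`, and `repetition_penalty` to generation. A minimal sketch of a call, assuming a pipeline instance `pipe` created as in the earlier snippet and using silence as stand-in audio:

```python
import numpy as np

# int16 audio is accepted; preprocess() rescales it to float32 in [-1, 1).
audio = np.zeros(16000, dtype=np.int16)  # one second of silence at 16 kHz

result = pipe(
    {
        "audio": audio,
        "sampling_rate": 16000,
        "turns": [{"role": "system", "content": "You are a helpful assistant."}],
        # No trailing user turn here, so preprocess() appends one containing "<|audio|>".
    },
    max_new_tokens=30,
)
print(result)  # postprocess() returns the decoded string
```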
ultravox_processing.py CHANGED
@@ -67,13 +67,14 @@ class DataCollatorForSeq2SeqWithAudio(transformers.DataCollatorForSeq2Seq):
 class UltravoxProcessor(transformers.ProcessorMixin):
     """
     Constructs an Ultravox processor which wraps an audio processor and a tokenizer into a single processor.
+
     Args:
         audio_processor: The audio processor for the audio encoder.
         tokenizer: The tokenizer for the language model.
     """
 
     attributes = ["audio_processor", "tokenizer"]
-    audio_processor_class = ("WhisperProcessor",)
+    audio_processor_class = ("WhisperFeatureExtractor",)
     tokenizer_class = (
         "PreTrainedTokenizer",
         "PreTrainedTokenizerFast",
@@ -112,12 +113,24 @@
             tokenizer.eos_token is not None
         ), "The tokenizer has no EOS token. Cannot recover."
         self.vocab = tokenizer.get_vocab()
+        # VLLM currently relies on updating audio_token_replacement, hence to be safe
+        # we should not update it. This dependency should be removed in the future.
         self.audio_token_replacement = tokenizer.eos_token
         if tokenizer.pad_token_id is None:
             tokenizer.pad_token_id = tokenizer.eos_token_id
 
-        super().__init__(audio_processor=audio_processor, tokenizer=tokenizer)
+        # Use a dummy audio processor to satisfy the base class for text-only training
+        if audio_processor is None:
+            audio_processor = transformers.AutoProcessor.from_pretrained(
+                "openai/whisper-tiny"
+            )
 
+        # Extract feature extractor if a full processor was passed,
+        # as transformers 5.x expects a FeatureExtractionMixin for this attribute.
+        if hasattr(audio_processor, "feature_extractor"):
+            audio_processor = audio_processor.feature_extractor
+
+        super().__init__(audio_processor=audio_processor, tokenizer=tokenizer)
 
     @classmethod
     def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs):
@@ -151,15 +164,18 @@
         """
         Processes the audio batch by chunking any items in the batch according to the audio_context_size,
         padding the last chunk if needed, and returns a dictionary with updated audio data.
+
         Args:
             audio_values (torch.Tensor): A tensor of audio values (e.g., in B, D, T format).
             audio_lens (torch.Tensor): A tensor of audio lengths.
+
         Returns:
             Dict[str, Any]: Dictionary with the following keys:
                 - "audio_values": The concatenated audio tensor after chunking and padding.
                 - "audio_lens": Tensor of lengths for each chunk.
                 - "audio_is_continuation": Tensor of booleans indicating if the chunk is a continuation of the previous chunk.
                 - "audio_batch_size": A Tensor with one integer representing the number of chunks.
+
         """
         chunked_audio_values: List[torch.Tensor] = []
         chunked_audio_lens: List[int] = []
@@ -225,6 +241,7 @@
         the text. To prepare the audio(s), this method forwards the `audio`, `sampling_rate` and `kwargs` arguments to
         audio processor's [`~WhisperProcessor.__call__`] if `audio` is not `None`. Please refer to the docstring
         of the above two methods for more information.
+
         Args:
             text (`str`, `List[str]`):
                 The sequence to be encoded. Sequence can be a string or (pretokenized string).
@@ -237,12 +254,15 @@
                 you are doing.
             return_tensors (`str` or [`~utils.TensorType`], *optional*):
                 If set, will return tensors of a particular framework. Acceptable values are:
+
                 - `'tf'`: Return TensorFlow `tf.constant` objects.
                 - `'pt'`: Return PyTorch `torch.Tensor` objects.
                 - `'np'`: Return NumPy `np.ndarray` objects.
                 - `'jax'`: Return JAX `jnp.ndarray` objects.
+
         Returns:
             [`BatchFeature`]: A [`BatchFeature`] with the following fields:
+
             - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
             - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
               `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
@@ -370,4 +390,4 @@
 
 UltravoxProcessor.register_for_auto_class()
 
-transformers.AutoProcessor.register(UltravoxConfig, UltravoxProcessor)
+transformers.AutoProcessor.register(UltravoxConfig, UltravoxProcessor)
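Taken together, these processing changes mean the audio side of the processor is now a `WhisperFeatureExtractor`, with a fallback to `openai/whisper-tiny` when no audio processor is passed. A small sketch of that fallback path, assuming `audio_processor` defaults to `None` in the full `__init__` signature and using a flat import in place of the repo's relative imports:

```python
import transformers
from ultravox_processing import UltravoxProcessor  # hypothetical flat import of the repo file

tokenizer = transformers.AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")

# No audio_processor passed: __init__ loads "openai/whisper-tiny" and keeps only its
# feature extractor, since audio_processor_class is now WhisperFeatureExtractor.
processor = UltravoxProcessor(tokenizer=tokenizer)
print(type(processor.audio_processor).__name__)  # expected: WhisperFeatureExtractor
```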
ultravox_tokenizer.py ADDED
@@ -0,0 +1,25 @@
+import logging
+
+import transformers
+
+AUDIO_TOKEN = "<|audio|>"
+
+
+def from_pretrained_text_tokenizer(
+    *args, **kwargs
+) -> transformers.PreTrainedTokenizerBase:
+    """
+    Create a tokenizer with the additional special token for audio.
+    This is mainly used for VLLM to work properly. This repo does not currently require it.
+    """
+
+    tokenizer = transformers.AutoTokenizer.from_pretrained(*args, **kwargs)
+    tokenizer.add_special_tokens({"additional_special_tokens": [AUDIO_TOKEN]})
+    logging.info(f"Audio token id: {get_audio_token_id(tokenizer)}")
+    return tokenizer
+
+
+def get_audio_token_id(tokenizer: transformers.PreTrainedTokenizerBase) -> int:
+    audio_token_id = tokenizer.encode(AUDIO_TOKEN, add_special_tokens=False)
+    assert len(audio_token_id) == 1, "Audio token should be a single token"
+    return audio_token_id[0]
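A short sketch of how these helpers are meant to be used with the text backbone named in config.json; the resulting token id depends on the tokenizer, so it is not hard-coded here, and the flat import assumes the repository files are on the Python path:

```python
from ultravox_tokenizer import from_pretrained_text_tokenizer, get_audio_token_id

# Adds "<|audio|>" as an additional special token on top of the base text tokenizer.
tokenizer = from_pretrained_text_tokenizer("Qwen/Qwen3-32B")

audio_token_id = get_audio_token_id(tokenizer)
print(audio_token_id, tokenizer.decode([audio_token_id]))  # the id and the "<|audio|>" token
```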