Upload UltravoxPipeline

Files changed (7)

README.md +199 -0
config.json +14 -26
model.safetensors +2 -2
special_tokens_map.json +7 -1
ultravox_pipeline.py +133 -0
ultravox_processing.py +23 -3
ultravox_tokenizer.py +25 -0

README.md ADDED

	@@ -0,0 +1,199 @@

+---
+library_name: transformers
+tags: []
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]

config.json CHANGED

@@ -4,24 +4,21 @@
   ],
   "audio_latency_block_size": null,
   "audio_model_id": "openai/whisper-large-v3-turbo",
-  "audio_model_lora_config": {
-    "lora_alpha": 8,
-    "r": 0,
-    "target_modules": [
-      "k_proj",
-      "q_proj",
-      "linear_k",
-      "linear_q"
-    ]
-  },
   "audio_token_index": 151669,
   "auto_map": {
     "AutoConfig": "ultravox_config.UltravoxConfig",
-    "AutoModel": "ultravox_model.UltravoxModel",
-    "AutoProcessor": "ultravox_processing.UltravoxProcessor"
+    "AutoModel": "ultravox_model.UltravoxModel"
+  },
+  "custom_pipelines": {
+    "ultravox-pipeline": {
+      "impl": "ultravox_pipeline.UltravoxPipeline",
+      "pt": [
+        "AutoModel"
+      ],
+      "tf": [],
+      "type": "multimodal"
+    }
   },
-  "dtype": "bfloat16",
-  "eos_token_id": 151645,
   "hidden_size": 4096,
   "ignore_index": -100,
   "initializer_range": 0.02,
@@ -33,16 +30,7 @@
   "projector_ln_mid": true,
   "stack_factor": 8,
   "text_model_id": "Qwen/Qwen3-32B",
-  "text_model_lora_config": {
-    "lora_alpha": 8,
-    "r": 0,
-    "target_modules": [
-      "k_proj",
-      "q_proj",
-      "linear_k",
-      "linear_q"
-    ]
-  },
-  "transformers_version": "4.57.6",
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.51.3",
   "vocab_size": 151936
-}
+}
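
The new `custom_pipelines` entry names the task (`ultravox-pipeline`) and points `impl` at a `module.Class` path inside the repo. As a rough illustration of that layout only (this is not a claim about how transformers resolves the entry internally), the sketch below parses the fragment with the standard library and splits `impl` into its module and class parts. `split_impl` is a hypothetical helper introduced for this example.

```python
import json

# Fragment copied from the new config.json in this commit.
CONFIG_FRAGMENT = """
{
  "custom_pipelines": {
    "ultravox-pipeline": {
      "impl": "ultravox_pipeline.UltravoxPipeline",
      "pt": ["AutoModel"],
      "tf": [],
      "type": "multimodal"
    }
  }
}
"""


def split_impl(impl: str) -> tuple[str, str]:
    """Hypothetical helper: split 'module.Class' into (module, class)."""
    module_name, _, class_name = impl.rpartition(".")
    return module_name, class_name


config = json.loads(CONFIG_FRAGMENT)
for task, entry in config["custom_pipelines"].items():
    module_name, class_name = split_impl(entry["impl"])
    print(f"{task}: class {class_name} in {module_name}.py, PyTorch classes {entry['pt']}")
```

The module part matches the `ultravox_pipeline.py` file added in this commit, which is why the file has to ship in the same repo as `config.json`.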

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:1e8c23828793ed1ed3d7f19be86b2a8b0aaa9349bc037aaab5fa6cbc49b1b023
-size 1378876648
+oid sha256:f0f539fc56c7210733c76cec906a33fb283048ba9f916fdc6ddc7160fa13255f
+size 104882656
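
`model.safetensors` is stored as a Git LFS pointer, so the diff shows only the hash and the byte size changing. A small sketch, using only the values shown above, that parses both pointer versions and compares their sizes. `parse_lfs_pointer` is a hypothetical helper written for this example, not part of any library.

```python
def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Hypothetical helper: parse the 'key value' lines of a Git LFS pointer."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    return fields


OLD_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:1e8c23828793ed1ed3d7f19be86b2a8b0aaa9349bc037aaab5fa6cbc49b1b023
size 1378876648"""

NEW_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:f0f539fc56c7210733c76cec906a33fb283048ba9f916fdc6ddc7160fa13255f
size 104882656"""

old_size = int(parse_lfs_pointer(OLD_POINTER)["size"])
new_size = int(parse_lfs_pointer(NEW_POINTER)["size"])

# Report both sizes and the ratio between them.
print(f"old: {old_size / 1e9:.2f} GB, new: {new_size / 1e6:.1f} MB")
print(f"new file is {new_size / old_size:.1%} of the old size")
```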

special_tokens_map.json CHANGED

@@ -15,5 +15,11 @@
     "rstrip": false,
     "single_word": false
   },
-  "pad_token": "<|im_end|>"
+  "pad_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
 }
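
The `pad_token` entry changes from a bare string to the expanded object form; both name the same token, `<|im_end|>`. A minimal sketch of reading either form follows. `token_content` is a hypothetical helper for illustration only.

```python
import json
from typing import Union


def token_content(entry: Union[str, dict]) -> str:
    """Hypothetical helper: return the token text for either representation."""
    if isinstance(entry, dict):
        return entry["content"]
    return entry


# Old form: the pad token is a plain string.
old_map = json.loads('{"pad_token": "<|im_end|>"}')

# New form: the pad token is an object with explicit stripping/normalization flags.
new_map = json.loads(
    '{"pad_token": {"content": "<|im_end|>", "lstrip": false, '
    '"normalized": false, "rstrip": false, "single_word": false}}'
)

print(token_content(old_map["pad_token"]) == token_content(new_map["pad_token"]))
```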

ultravox_pipeline.py ADDED

	@@ -0,0 +1,133 @@

+import logging
+from typing import Any, Dict, List, Optional
+import numpy as np
+import transformers
+# We must use relative import in this directory to allow uploading to HF Hub
+# Even "from . import X" pattern doesn't work (undocumented and unclear why)
+from .ultravox_model import UltravoxModel
+from .ultravox_processing import UltravoxProcessor
+from .ultravox_tokenizer import from_pretrained_text_tokenizer
+from .ultravox_tokenizer import get_audio_token_id
+class UltravoxPipeline(transformers.Pipeline):
+    def __init__(
+        self,
+        model: UltravoxModel,
+        tokenizer: Optional[transformers.PreTrainedTokenizerBase] = None,
+        audio_processor: Optional[transformers.ProcessorMixin] = None,
+        chat_template: Optional[str] = None,
+        **kwargs
+    ):
+        if tokenizer is None:
+            try:
+                tokenizer = from_pretrained_text_tokenizer(model.config._name_or_path)
+            except Exception:
+                tokenizer = from_pretrained_text_tokenizer(
+                    model.config.text_model_id or model.config.text_config._name_or_path
+                )
+        if chat_template:
+            tokenizer.chat_template = chat_template
+        model.config.audio_token_index = get_audio_token_id(tokenizer)
+        if audio_processor is None:
+            audio_processor = transformers.AutoProcessor.from_pretrained(
+                model.config.audio_model_id or model.config.audio_config._name_or_path
+            )
+        super().__init__(model=model, tokenizer=tokenizer, **kwargs)
+        self.processor = UltravoxProcessor(
+            audio_processor=audio_processor,
+            tokenizer=tokenizer,
+            stack_factor=model.config.stack_factor,
+            audio_context_size=model.audio_tower_context_length,
+        )
+    def _sanitize_parameters(self, **kwargs):
+        generation_keys = ["temperature", "max_new_tokens", "repetition_penalty"]
+        generation_kwargs = {k: kwargs[k] for k in kwargs if k in generation_keys}
+        return {}, generation_kwargs, {}
+    def preprocess(self, inputs: Dict[str, Any]):
+        turns: list = inputs.get("turns", [])
+        audio = inputs.get("audio", None)
+        # Convert to float32 if needed.
+        if isinstance(audio, np.ndarray):
+            if audio.dtype == np.float64:
+                audio = audio.astype(np.float32)
+            elif audio.dtype == np.int16:
+                audio = audio.astype(np.float32) / np.float32(32768.0)
+            elif audio.dtype == np.int32:
+                audio = audio.astype(np.float32) / np.float32(2147483648.0)
+        if audio is not None and (len(turns) == 0 or turns[-1]["role"] != "user"):
+            prompt = inputs.get("prompt", "<|audio|>")
+            if "<|audio|>" not in prompt:
+                logging.warning(
+                    "Prompt does not contain '<|audio|>', appending '<|audio|>' to the end of the prompt."
+                )
+                prompt += " <|audio|>"
+            turns.append({"role": "user", "content": prompt})
+        text = self.processor.tokenizer.apply_chat_template(
+            turns, add_generation_prompt=True, tokenize=False
+        )
+        if "sampling_rate" not in inputs and audio is not None:
+            logging.warning(
+                "No sampling rate provided, using default of 16kHz. We highly recommend providing the correct sampling rate."
+            )
+        output = self.processor(
+            text=text,
+            audio=audio,
+            sampling_rate=inputs.get("sampling_rate", 16000),
+        )
+        if "audio_values" in output:
+            output["audio_values"] = output["audio_values"].to(self.model.dtype)
+        return output
+    def _forward(
+        self,
+        model_inputs: Dict[str, Any],
+        temperature: Optional[float] = None,
+        max_new_tokens: Optional[int] = None,
+        repetition_penalty: float = 1.1,
+    ) -> List[int]:
+        temperature = temperature or None
+        do_sample = temperature is not None
+        terminators = [self.tokenizer.eos_token_id]
+        if "<|eot_id|>" in self.tokenizer.added_tokens_encoder:
+            terminators.append(self.tokenizer.convert_tokens_to_ids("<|eot_id|>"))
+        input_len = model_inputs["input_ids"].shape[1]
+        outputs = self.model.generate(
+            **model_inputs,
+            do_sample=do_sample,
+            temperature=temperature,
+            max_new_tokens=max_new_tokens,
+            repetition_penalty=repetition_penalty,
+            eos_token_id=terminators
+        )
+        return outputs[0][input_len:]
+    def postprocess(self, model_outputs) -> str:
+        output_text = self.tokenizer.decode(model_outputs, skip_special_tokens=True)
+        return output_text
+transformers.pipelines.PIPELINE_REGISTRY.register_pipeline(
+    "ultravox-pipeline",
+    pipeline_class=UltravoxPipeline,
+    pt_model=transformers.AutoModel,
+    type="multimodal",
+)
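
The turn handling in `preprocess` can be hard to follow inside the diff, so here is a standalone sketch of just that step: when audio is present and the last turn is not a user turn, a user turn is appended, and `<|audio|>` is added to the prompt if it is missing. `build_turns` is a hypothetical name, and unlike the committed code this sketch copies the `turns` list instead of appending to the caller's list.

```python
import logging
from typing import Any, Dict, List

AUDIO_PLACEHOLDER = "<|audio|>"


def build_turns(inputs: Dict[str, Any]) -> List[Dict[str, str]]:
    """Sketch of the turn handling in UltravoxPipeline.preprocess."""
    # Copy so the caller's list is left untouched (a deliberate difference).
    turns = list(inputs.get("turns", []))
    audio = inputs.get("audio")
    if audio is not None and (len(turns) == 0 or turns[-1]["role"] != "user"):
        prompt = inputs.get("prompt", AUDIO_PLACEHOLDER)
        if AUDIO_PLACEHOLDER not in prompt:
            logging.warning("Prompt lacks %s; appending it.", AUDIO_PLACEHOLDER)
            prompt += " " + AUDIO_PLACEHOLDER
        turns.append({"role": "user", "content": prompt})
    return turns


# Audio with a prompt that lacks the placeholder: the placeholder is appended.
print(build_turns({"audio": [0.0], "prompt": "Transcribe this:"}))

# The last turn is already a user turn: nothing is added.
print(build_turns({"audio": [0.0], "turns": [{"role": "user", "content": "Hi <|audio|>"}]}))
```

One consequence worth noting: if the caller's last turn is a user turn without `<|audio|>`, this logic does not insert the placeholder, so the caller has to include it.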

ultravox_processing.py CHANGED

@@ -67,13 +67,14 @@ class DataCollatorForSeq2SeqWithAudio(transformers.DataCollatorForSeq2Seq):
 class UltravoxProcessor(transformers.ProcessorMixin):
     """
     Constructs an Ultravox processor which wraps an audio processor and a tokenizer into a single processor.
     Args:
         audio_processor: The audio processor for the audio encoder.
         tokenizer: The tokenizer for the language model.
     """
     attributes = ["audio_processor", "tokenizer"]
-    audio_processor_class = ("WhisperProcessor",)
+    audio_processor_class = ("WhisperFeatureExtractor",)
     tokenizer_class = (
         "PreTrainedTokenizer",
         "PreTrainedTokenizerFast",
@@ -112,12 +113,24 @@ class UltravoxProcessor(transformers.ProcessorMixin):
             tokenizer.eos_token is not None
         ), "The tokenizer has no EOS token. Cannot recover."
         self.vocab = tokenizer.get_vocab()
+        # VLLM currently relies on updating audio_token_replacement, hence to be safe
+        # we should not update it. This dependency should be removed in the future.
         self.audio_token_replacement = tokenizer.eos_token
         if tokenizer.pad_token_id is None:
             tokenizer.pad_token_id = tokenizer.eos_token_id
-        super().__init__(audio_processor=audio_processor, tokenizer=tokenizer)
+        # Use a dummy audio processor to satisfy the base class for text-only training
+        if audio_processor is None:
+            audio_processor = transformers.AutoProcessor.from_pretrained(
+                "openai/whisper-tiny"
+            )
+        # Extract feature extractor if a full processor was passed,
+        # as transformers 5.x expects a FeatureExtractionMixin for this attribute.
+        if hasattr(audio_processor, "feature_extractor"):
+            audio_processor = audio_processor.feature_extractor
+        super().__init__(audio_processor=audio_processor, tokenizer=tokenizer)
     @classmethod
     def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs):
@@ -151,15 +164,18 @@ class UltravoxProcessor(transformers.ProcessorMixin):
         """
         Processes the audio batch by chunking any items in the batch according to the audio_context_size,
         padding the last chunk if needed, and returns a dictionary with updated audio data.
         Args:
             audio_values (torch.Tensor): A tensor of audio values (e.g., in B, D, T format).
             audio_lens (torch.Tensor): A tensor of audio lengths.
         Returns:
             Dict[str, Any]: Dictionary with the following keys:
                 - "audio_values": The concatenated audio tensor after chunking and padding.
                 - "audio_lens": Tensor of lengths for each chunk.
                 - "audio_is_continuation": Tensor of booleans indicating if the chunk is a continuation of the previous chunk.
                 - "audio_batch_size": A Tensor with one integer representing the number of chunks.
         """
         chunked_audio_values: List[torch.Tensor] = []
         chunked_audio_lens: List[int] = []
@@ -225,6 +241,7 @@ class UltravoxProcessor(transformers.ProcessorMixin):
         the text. To prepare the audio(s), this method forwards the `audio`, `sampling_rate` and `kwargs` arguments to
         audio processor's [`~WhisperProcessor.__call__`] if `audio` is not `None`. Please refer to the docstring
         of the above two methods for more information.
         Args:
             text (`str`, `List[str]`):
                 The sequence to be encoded. Sequence can be a string or (pretokenized string).
@@ -237,12 +254,15 @@ class UltravoxProcessor(transformers.ProcessorMixin):
                 you are doing.
             return_tensors (`str` or [`~utils.TensorType`], *optional*):
                 If set, will return tensors of a particular framework. Acceptable values are:
                 - `'tf'`: Return TensorFlow `tf.constant` objects.
                 - `'pt'`: Return PyTorch `torch.Tensor` objects.
                 - `'np'`: Return NumPy `np.ndarray` objects.
                 - `'jax'`: Return JAX `jnp.ndarray` objects.
         Returns:
             [`BatchFeature`]: A [`BatchFeature`] with the following fields:
             - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
             - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
               `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
@@ -370,4 +390,4 @@ class UltravoxProcessor(transformers.ProcessorMixin):
 UltravoxProcessor.register_for_auto_class()
-transformers.AutoProcessor.register(UltravoxConfig, UltravoxProcessor)
+transformers.AutoProcessor.register(UltravoxConfig, UltravoxProcessor)
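
The constructor change accepts either a bare feature extractor or a full processor that wraps one, and keeps only the feature extractor. The sketch below isolates that `hasattr` check using stand-in objects; `unwrap_feature_extractor` and the `SimpleNamespace` stand-ins are assumptions made for illustration and are not transformers classes.

```python
from types import SimpleNamespace


def unwrap_feature_extractor(audio_processor):
    """Return the nested feature extractor if one exists, else the object itself."""
    if hasattr(audio_processor, "feature_extractor"):
        return audio_processor.feature_extractor
    return audio_processor


# Stand-in for a feature extractor (e.g. what WhisperFeatureExtractor would be).
feature_extractor = SimpleNamespace(sampling_rate=16000)

# Stand-in for a full processor that bundles a feature extractor and a tokenizer.
full_processor = SimpleNamespace(feature_extractor=feature_extractor, tokenizer=object())

# Both inputs resolve to the same feature extractor object.
print(unwrap_feature_extractor(full_processor) is feature_extractor)
print(unwrap_feature_extractor(feature_extractor) is feature_extractor)
```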

ultravox_tokenizer.py ADDED

	@@ -0,0 +1,25 @@

+import logging
+import transformers
+AUDIO_TOKEN = "<|audio|>"
+def from_pretrained_text_tokenizer(
+    *args, **kwargs
+) -> transformers.PreTrainedTokenizerBase:
+    """
+    Create a tokenizer with the additional special token for audio.
+    This is mainly used for VLLM to work properly. This repo does not currently require it.
+    """
+    tokenizer = transformers.AutoTokenizer.from_pretrained(*args, **kwargs)
+    tokenizer.add_special_tokens({"additional_special_tokens": [AUDIO_TOKEN]})
+    logging.info(f"Audio token id: {get_audio_token_id(tokenizer)}")
+    return tokenizer
+def get_audio_token_id(tokenizer: transformers.PreTrainedTokenizerBase) -> int:
+    audio_token_id = tokenizer.encode(AUDIO_TOKEN, add_special_tokens=False)
+    assert len(audio_token_id) == 1, "Audio token should be a single token"
+    return audio_token_id[0]
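
The assertion in `get_audio_token_id` only holds once `<|audio|>` has been registered as a special token; otherwise a tokenizer would split the string into several pieces. The sketch below shows that contract with a toy tokenizer. `ToyTokenizer` is a hypothetical stand-in that falls back to one id per character; it is not a transformers class and real tokenizers split text differently.

```python
AUDIO_TOKEN = "<|audio|>"


class ToyTokenizer:
    """Hypothetical stand-in: special tokens map to one id, other text to one id per character."""

    def __init__(self):
        self.special = {}

    def add_special_tokens(self, mapping):
        for token in mapping["additional_special_tokens"]:
            # Assign ids above the character range; keep an existing id if re-added.
            self.special.setdefault(token, 1_000_000 + len(self.special))

    def encode(self, text, add_special_tokens=False):
        if text in self.special:
            return [self.special[text]]
        return [ord(ch) for ch in text]


def get_audio_token_id(tokenizer) -> int:
    """Same check as in the diff: the audio token must encode to exactly one id."""
    ids = tokenizer.encode(AUDIO_TOKEN, add_special_tokens=False)
    assert len(ids) == 1, "Audio token should be a single token"
    return ids[0]


tok = ToyTokenizer()
print(len(tok.encode(AUDIO_TOKEN)))  # more than one id before registration
tok.add_special_tokens({"additional_special_tokens": [AUDIO_TOKEN]})
print(get_audio_token_id(tok))  # a single id after registration
```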