mazesmazes committed
Commit 36087fa · verified · 1 Parent(s): 5a5380c

Training in progress - step 10
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,199 @@
+ ---
+ library_name: transformers
+ tags: []
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
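
The README's "How to Get Started with the Model" section is still a stub. Based on the `auto_map` and `custom_pipelines` entries registered in `asr_config.py` below, loading should go through the custom pipeline with `trust_remote_code=True`. A minimal sketch, assuming a placeholder repo id and a local `sample.wav`:

```python
# Usage sketch (untested); "your-username/your-asr-repo" is a placeholder repo id.
# The custom "automatic-speech-recognition" pipeline is registered by asr_config.py,
# so trust_remote_code=True is required to pull in the asr_*.py modules.
import transformers

pipe = transformers.pipeline(
    "automatic-speech-recognition",
    model="your-username/your-asr-repo",
    trust_remote_code=True,
)

# ASRPipeline.preprocess accepts a file path, a raw array, or a
# {"raw": array, "sampling_rate": 16000} dict (see asr_pipeline.py below).
result = pipe("sample.wav", task="transcribe")
print(result["text"])
```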
asr_config.py ADDED
@@ -0,0 +1,131 @@
+ from typing import Optional
+
+ import transformers
+
+
+ class ASRConfig(transformers.PretrainedConfig):
+     model_type = "asr_model"
+     is_composition = True
+
+     def __init__(
+         self,
+         audio_model_id: str = "openai/whisper-large-v3-turbo",
+         text_model_id: str = "HuggingFaceTB/SmolLM3-3B",
+         attn_implementation: str = "sdpa",
+         model_dtype: str = "bfloat16",
+         audio_downsample_rate: int = 5,  # Deprecated: use projector_pool_stride instead
+         num_beams: Optional[int] = None,
+         system_prompt: str = "/no_think /system_override",
+         user_prompt: str = "Transcribe: <audio>",
+         encoder_dim: Optional[int] = None,
+         llm_dim: Optional[int] = None,
+         # Audio processing constants
+         audio_sample_rate: int = 16000,
+         # Projector initialization constants
+         projector_init_std: float = 0.02,
+         projector_pool_stride: int = 2,  # AvgPool1d stride (2 = 4x total with Whisper, 1 = no pooling)
+         projector_hidden_dim: Optional[
+             int
+         ] = None,  # SwiGLU hidden dimension (defaults to encoder_dim * 4)
+         projector_dropout: float = 0.1,  # Dropout rate for projector layers
+         # Inference parameters
+         inference_diversity_penalty: float = 0.0,
+         inference_warmup_tokens: int = 10,
+         # Generation parameters
+         max_new_tokens: Optional[int] = None,
+         min_new_tokens: Optional[int] = None,
+         do_sample: Optional[bool] = None,
+         temperature: Optional[float] = None,
+         top_k: Optional[int] = None,
+         top_p: Optional[float] = None,
+         repetition_penalty: Optional[float] = None,
+         length_penalty: Optional[float] = None,
+         no_repeat_ngram_size: Optional[int] = None,
+         early_stopping: Optional[bool] = None,
+         use_cache: Optional[bool] = None,
+         **kwargs,
+     ):
+         # Set default generation parameters
+         generation_defaults = {
+             "num_beams": 1,
+             "max_new_tokens": 128,
+             "min_new_tokens": 1,
+             "do_sample": False,
+             "repetition_penalty": 1.05,
+             "no_repeat_ngram_size": 0,
+             "use_cache": True,
+         }
+
+         # Generation parameters arrive as *named* arguments (e.g. when
+         # PretrainedConfig.from_dict calls cls(**config_dict)), so fold any
+         # explicitly provided values back into kwargs; otherwise values saved
+         # in config.json would be silently dropped in favor of the defaults.
+         explicit_generation_params = {
+             "num_beams": num_beams,
+             "max_new_tokens": max_new_tokens,
+             "min_new_tokens": min_new_tokens,
+             "do_sample": do_sample,
+             "temperature": temperature,
+             "top_k": top_k,
+             "top_p": top_p,
+             "repetition_penalty": repetition_penalty,
+             "length_penalty": length_penalty,
+             "no_repeat_ngram_size": no_repeat_ngram_size,
+             "early_stopping": early_stopping,
+             "use_cache": use_cache,
+         }
+
+         # Apply defaults (explicit and config.json values take precedence)
+         kwargs = {
+             **generation_defaults,
+             **kwargs,
+             **{k: v for k, v in explicit_generation_params.items() if v is not None},
+         }
+
+         self.audio_model_id = audio_model_id
+         self.text_model_id = text_model_id
+         self.attn_implementation = attn_implementation
+         self.model_dtype = model_dtype
+         self.audio_downsample_rate = audio_downsample_rate
+         self.system_prompt = system_prompt
+         self.user_prompt = user_prompt
+         self.encoder_dim = encoder_dim
+         self.llm_dim = llm_dim
+         self.audio_sample_rate = audio_sample_rate
+         self.projector_init_std = projector_init_std
+         self.projector_pool_stride = projector_pool_stride
+         self.projector_hidden_dim = projector_hidden_dim
+         self.projector_dropout = projector_dropout
+         self.inference_diversity_penalty = inference_diversity_penalty
+         self.inference_warmup_tokens = inference_warmup_tokens
+         if "audio_config" not in kwargs:
+             self.audio_config = transformers.AutoConfig.from_pretrained(audio_model_id)
+         else:
+             self.audio_config = kwargs.pop("audio_config")
+
+         if "text_config" not in kwargs:
+             self.text_config = transformers.AutoConfig.from_pretrained(
+                 text_model_id, trust_remote_code=True
+             )
+         else:
+             self.text_config = kwargs.pop("text_config")
+
+         # Ensure configs are PretrainedConfig objects (in case loaded from dict)
+         if isinstance(self.text_config, dict):
+             # Reconstruct config from dict using the model_type stored in the dict
+             model_type = self.text_config.get("model_type")
+             if model_type:
+                 config_class = transformers.AutoConfig.for_model(model_type).__class__
+                 self.text_config = config_class(**self.text_config)
+             else:
+                 # Fallback: try to load from model_id
+                 self.text_config = transformers.AutoConfig.from_pretrained(
+                     text_model_id, trust_remote_code=True
+                 )
+
+         if isinstance(self.audio_config, dict):
+             model_type = self.audio_config.get("model_type")
+             if model_type:
+                 config_class = transformers.AutoConfig.for_model(model_type).__class__
+                 self.audio_config = config_class(**self.audio_config)
+
+         super().__init__(**kwargs)
+
+         self.auto_map = {
+             "AutoConfig": "asr_config.ASRConfig",
+             "AutoModel": "asr_modeling.ASRModel",
+             "AutoModelForSpeechSeq2Seq": "asr_modeling.ASRModel",
+             "AutoProcessor": "asr_processing.ASRProcessor",
+         }
+         self.custom_pipelines = {
+             "automatic-speech-recognition": {
+                 "impl": "asr_pipeline.ASRPipeline",
+                 "pt": ["AutoModelForSpeechSeq2Seq"],
+                 "tf": [],
+                 "type": "audio",
+             }
+         }
+         self.architectures = ["ASRModel"]
+         self.pipeline_tag = "automatic-speech-recognition"
+
+
+ # Register the config with transformers
+ # This is needed for AutoConfig.from_pretrained to work
+ transformers.AutoConfig.register("asr_model", ASRConfig)
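
To illustrate how the default merging above resolves, here is a small sketch (it assumes the explicit-override handling noted in the constructor; note that `ASRConfig()` fetches the two sub-model configs via `AutoConfig`, so running it needs network access or cached models):

```python
# Sketch: generation defaults vs. explicit overrides in ASRConfig.
from asr_config import ASRConfig

cfg = ASRConfig()
print(cfg.num_beams)           # 1    (from generation_defaults)
print(cfg.max_new_tokens)      # 128  (from generation_defaults)
print(cfg.repetition_penalty)  # 1.05

# An explicitly passed value (or one restored from config.json) wins:
cfg2 = ASRConfig(max_new_tokens=256)
print(cfg2.max_new_tokens)     # 256
```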
asr_modeling.py ADDED
@@ -0,0 +1,874 @@
+ from pathlib import Path
+ from typing import Optional, Union
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F  # noqa: N812
+ from transformers import (
+     AutoConfig,
+     AutoModel,
+     AutoModelForCausalLM,
+     AutoTokenizer,
+     PreTrainedModel,
+     Wav2Vec2FeatureExtractor,
+ )
+ from transformers.generation.utils import (
+     GenerateBeamDecoderOnlyOutput,
+     GenerateBeamEncoderDecoderOutput,
+     GenerateDecoderOnlyOutput,
+     GenerateEncoderDecoderOutput,
+ )
+
+ try:
+     from .asr_config import ASRConfig
+ except ImportError:
+     from asr_config import ASRConfig  # type: ignore[no-redef]
+
+
+ class SwiGLU(nn.Module):
+     """
+     SwiGLU activation MLP - based on LlamaMLP but with flexible output dimension.
+
+     This implements the same gated activation pattern as transformers.models.llama.modeling_llama.LlamaMLP,
+     but allows for different input/output dimensions (needed for cross-modal projection).
+
+     Structure: w1 (gate), w2 (up), w3 (down) with w3(silu(w1) * w2)
+     """
+
+     def __init__(self, in_features, hidden_features, out_features, bias=False, dropout=0.0):
+         super().__init__()
+         self.w1 = nn.Linear(in_features, hidden_features, bias=bias)
+         self.w2 = nn.Linear(in_features, hidden_features, bias=bias)
+         self.w3 = nn.Linear(hidden_features, out_features, bias=bias)
+         self.act = nn.SiLU()
+         self.dropout = nn.Dropout(dropout)
+
+     def forward(self, x):
+         x_gate = self.act(self.w1(x))
+         x_val = self.w2(x)
+         x = x_gate * x_val
+         x = self.dropout(x)  # Apply dropout after the gating operation
+         return self.w3(x)
+
+
+ class AudioProjector(nn.Module):
+     """
+     AudioProjector using a SwiGLU MLP with dropout.
+     """
+
+     def __init__(self, config):
+         super().__init__()
+         self.k = getattr(config, "projector_pool_stride", 2)  # Downsampling rate
+         in_dim = config.encoder_dim * self.k  # Input is k frames concatenated
+         out_dim = config.llm_dim
+         hidden_dim = config.projector_hidden_dim
+         if hidden_dim is None:
+             hidden_dim = config.encoder_dim * 4  # Default: 4x encoder dim for SwiGLU
+
+         # Get dropout rate from config
+         dropout_rate = getattr(config, "projector_dropout", 0.1)
+
+         # SwiGLU MLP (now takes concatenated frames as input) with dropout
+         self.proj = SwiGLU(in_dim, hidden_dim, out_dim, dropout=dropout_rate)
+
+         # Optional output dropout layer for additional regularization
+         self.output_dropout = nn.Dropout(dropout_rate)
+
+         # Initialize weights following LLaMA-style initialization for SwiGLU
+         # Uses smaller std to account for the multiplicative gating
+         with torch.no_grad():
+             # Standard deviation from config or default (0.02 is common for transformers)
+             std = getattr(config, "projector_init_std", 0.02)
+
+             # Initialize gate and up projections
+             nn.init.normal_(self.proj.w1.weight, mean=0.0, std=std)
+             nn.init.normal_(self.proj.w2.weight, mean=0.0, std=std)
+
+             # Initialize down projection with scaling to preserve variance after SwiGLU
+             # The 1/sqrt(2) factor accounts for the multiplicative interaction
+             nn.init.normal_(self.proj.w3.weight, mean=0.0, std=std / (2**0.5))
+
+     def forward(self, x):
+         # x: [batch, seq_len, dim]
+         batch_size, seq_len, dim = x.size()
+
+         # Ensure input dtype matches the projector weights
+         # This is crucial for MPS devices where encoder may output bfloat16
+         # but projector weights might be in float32 when loaded from checkpoint
+         target_dtype = self.proj.w1.weight.dtype
+         if x.dtype != target_dtype:
+             x = x.to(target_dtype)
+
+         # Pad the sequence to be divisible by k instead of truncating
+         remainder = seq_len % self.k
+         if remainder:
+             pad_len = self.k - remainder
+             x = F.pad(x, (0, 0, 0, pad_len))
+
+         # Reshape for temporal compression - concatenate k consecutive frames
+         x = x.contiguous().view(batch_size, -1, dim * self.k)
+
+         # Apply SwiGLU block
+         x = self.proj(x)
+
+         # Apply output dropout for additional regularization
+         return self.output_dropout(x)
+
+
+ class ASRModel(PreTrainedModel):
+     config_class = ASRConfig
+     base_model_prefix = "model"
+     main_input_name = "input_values"
+     _supports_flash_attn_2 = True
+     supports_gradient_checkpointing = True
+     _is_loading_from_pretrained: bool = False
+     _pretrained_model_path: Optional[str] = None
+
+     # Task to prompt mapping for generation
+     TASK_PROMPTS = {
+         "transcribe": "Transcribe: <audio>",
+         "continue": "Continue: <audio>",
+         "describe": "Describe: <audio>",
+         "emotion": "Emotion: <audio>",
+     }
+
+     @staticmethod
+     def _create_feature_extractor(audio_model_id: str):
+         """Factory method to create the appropriate feature extractor."""
+         is_whisper = "whisper" in audio_model_id.lower()
+         if is_whisper:
+             from transformers import WhisperConfig, WhisperFeatureExtractor
+
+             encoder_config = WhisperConfig.from_pretrained(audio_model_id)
+             num_mel_bins = encoder_config.num_mel_bins
+             return WhisperFeatureExtractor.from_pretrained(
+                 audio_model_id,
+                 feature_size=num_mel_bins,
+             )
+         return Wav2Vec2FeatureExtractor.from_pretrained(audio_model_id)
+
+     @classmethod
+     def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
+         from transformers import AutoFeatureExtractor
+
+         config = kwargs.pop("config", None)
+         if config is None:
+             config = ASRConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
+
+         # Load feature extractor from saved model directory
+         kwargs["feature_extractor"] = AutoFeatureExtractor.from_pretrained(
+             pretrained_model_name_or_path, **kwargs
+         )
+
+         cls._is_loading_from_pretrained = True
+         cls._pretrained_model_path = pretrained_model_name_or_path
+
+         try:
+             # Let parent class handle loading config and model.safetensors
+             model = super().from_pretrained(
+                 pretrained_model_name_or_path, *args, config=config, **kwargs
+             )
+
+             # Convert projector to target dtype after loading weights
+             target_dtype = getattr(torch, config.model_dtype)
+             model.projector = model.projector.to(dtype=target_dtype)
+
+             return model
+         finally:
+             cls._is_loading_from_pretrained = False
+             # Reset to the declared default rather than deleting the class
+             # attribute, so later lookups cannot hit an AttributeError
+             cls._pretrained_model_path = None
+
+     def __init__(self, config: ASRConfig, **kwargs):
+         super().__init__(config)
+
+         feature_extractor = kwargs.pop("feature_extractor", None)
+
+         self.system_prompt = config.system_prompt
+
+         self.encoder = self._create_encoder(config)
+
+         is_whisper = "whisper" in config.audio_model_id.lower() or (
+             hasattr(self.encoder.config, "model_type")
+             and "whisper" in self.encoder.config.model_type.lower()
+         )
+
+         if is_whisper:
+             self.main_input_name = "input_features"
+         else:
+             self.main_input_name = "input_values"
+
+         if feature_extractor is not None:
+             self.feature_extractor = feature_extractor
+         else:
+             self.feature_extractor = self._create_feature_extractor(config.audio_model_id)
+
+         self.decoder = self._create_decoder(config)
+         self.generation_config = self.decoder.generation_config
+
+         self._init_tokenizer()
+
+         from types import SimpleNamespace
+
+         # Auto-detect encoder_dim and llm_dim if not specified
+         encoder_dim = config.encoder_dim
+         if encoder_dim is None:
+             if hasattr(self.encoder.config, "hidden_size"):
+                 encoder_dim = self.encoder.config.hidden_size
+             elif hasattr(self.encoder.config, "d_model"):
+                 encoder_dim = self.encoder.config.d_model
+             else:
+                 raise ValueError("Could not auto-detect encoder_dim. Please specify in config.")
+
+         llm_dim = config.llm_dim
+         if llm_dim is None:
+             if hasattr(self.decoder.config, "hidden_size"):
+                 llm_dim = self.decoder.config.hidden_size
+             elif hasattr(self.decoder.config, "d_model"):
+                 llm_dim = self.decoder.config.d_model
+             else:
+                 raise ValueError("Could not auto-detect llm_dim. Please specify in config.")
+
+         projector_config = SimpleNamespace(
+             encoder_dim=encoder_dim,
+             llm_dim=llm_dim,
+             projector_pool_stride=getattr(config, "projector_pool_stride", 2),
+             projector_hidden_dim=getattr(config, "projector_hidden_dim", None),
+             projector_init_std=getattr(config, "projector_init_std", 0.02),
+             projector_dropout=getattr(config, "projector_dropout", 0.1),
+         )
+         self.projector = AudioProjector(projector_config)
+
+         # Convert projector to the same dtype as encoder/decoder
+         target_dtype = getattr(torch, config.model_dtype)
+         self.projector = self.projector.to(dtype=target_dtype)
+
+         self._no_split_modules = self.decoder._no_split_modules
+
+     @classmethod
+     def _create_encoder(cls, config: ASRConfig):
+         """Create and configure the audio encoder.
+
+         Args:
+             config: Model configuration
+
+         Returns:
+             Configured encoder model
+         """
+         target_dtype = getattr(torch, config.model_dtype)
+
+         encoder_kwargs = {
+             "attn_implementation": config.attn_implementation,
+             "dtype": target_dtype,
+             "low_cpu_mem_usage": True,
+         }
+         if not cls._is_loading_from_pretrained:
+             encoder_kwargs["device_map"] = "auto"
+
+         if "whisper" in config.audio_model_id.lower():
+             from transformers import WhisperModel
+
+             full_model = WhisperModel.from_pretrained(config.audio_model_id, **encoder_kwargs)
+             encoder = full_model.encoder
+             del full_model
+         else:
+             encoder = AutoModel.from_pretrained(config.audio_model_id, **encoder_kwargs)
+
+         is_whisper = "whisper" in config.audio_model_id.lower() or (
+             hasattr(encoder.config, "model_type") and "whisper" in encoder.config.model_type.lower()
+         )
+
+         # Wrap encoder forward to handle Whisper's input_features vs input_values
+         original_forward = encoder.forward
+         input_key = "input_features" if is_whisper else "input_values"
+
+         def safe_encoder_forward(self_encoder, input_values=None, **kwargs):
+             # Catch and discard invalid kwargs like input_ids
+             kwargs.pop("input_ids", None)
+             return original_forward(**{input_key: input_values}, **kwargs)
+
+         import types
+
+         encoder.forward = types.MethodType(safe_encoder_forward, encoder)
+
+         # Freeze all encoder parameters
+         encoder.requires_grad_(False)
+
+         return encoder
+
+     @classmethod
+     def _create_decoder(cls, config: ASRConfig):
+         """Create and configure the language model decoder.
+
+         Args:
+             config: Model configuration
+
+         Returns:
+             Configured decoder model
+         """
+         target_dtype = getattr(torch, config.model_dtype)
+
+         # When loading from pretrained, avoid device_map="auto" to prevent meta tensor issues
+         decoder_kwargs = {
+             "attn_implementation": config.attn_implementation,
+             "dtype": target_dtype,
+             "trust_remote_code": True,
+         }
+         # Don't use device_map="auto" as it can cause meta tensor issues with Trainer
+         # The Trainer will handle device placement
+
+         decoder = AutoModelForCausalLM.from_pretrained(config.text_model_id, **decoder_kwargs)
+
+         # use_cache is now safe because we pre-expand audio tokens for consistent sequence length
+         # Cache can be enabled/disabled via config.use_cache
+         decoder.config.use_cache = config.use_cache
+
+         # Freeze all decoder parameters (only projector is trainable)
+         decoder.requires_grad_(False)
+
+         return decoder
+
+     def _init_weights(self, module):
+         """Initialize weights for trainable modules.
+
+         Note: This is a no-op since:
+         - AudioProjector self-initializes in its __init__
+         - Encoder/decoder are loaded from pretrained weights
+         """
+         pass
+
+     def can_generate(self) -> bool:
+         """Return True to indicate this model supports generation.
+
+         Required for Transformers 4.50+ where PreTrainedModel no longer
+         inherits from GenerationMixin.
+         """
+         return True
+
+     @property
+     def _tied_weights_keys(self):
+         """Return list of weight keys that should be tied.
+
+         In this model, input and output embeddings of the decoder may be tied.
+         """
+         if hasattr(self.decoder, "_tied_weights_keys"):
+             return [f"decoder.{k}" for k in self.decoder._tied_weights_keys]
+         return []
+
+     def _init_tokenizer(self):
+         model_path = (
+             self.__class__._pretrained_model_path
+             if self._is_loading_from_pretrained
+             else self.config.text_model_id
+         )
+
+         self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+
+         # Set pad_token if not already set to avoid warnings during generation
+         # If pad_token is same as eos_token, we need a different token for padding
+         if (
+             self.tokenizer.pad_token is None
+             or self.tokenizer.pad_token_id == self.tokenizer.eos_token_id
+         ) and "<|finetune_right_pad_id|>" in self.tokenizer.get_vocab():
+             # For SmolLM3, use the dedicated finetune_right_pad_id token
+             self.tokenizer.pad_token = "<|finetune_right_pad_id|>"
+
+         existing_special = self.tokenizer.additional_special_tokens or []
+
+         # Add single audio token if not present
+         if "<audio>" not in existing_special:
+             special_tokens = {"additional_special_tokens": existing_special + ["<audio>"]}
+             num_added_tokens = self.tokenizer.add_special_tokens(special_tokens)
+             if num_added_tokens > 0:
+                 # Use mean_resizing=False since this is a structural token, not semantic
+                 self.decoder.resize_token_embeddings(len(self.tokenizer), mean_resizing=False)
+
+         current_embed_size = self.decoder.get_input_embeddings().weight.shape[0]
+         expected_size = len(self.tokenizer)
+         if current_embed_size != expected_size:
+             self.decoder.resize_token_embeddings(expected_size, mean_resizing=False)
+
+         self.audio_token_id = self.tokenizer.convert_tokens_to_ids("<audio>")
+
+         self.tokenizer.padding_side = "right"
+
+         for cfg in [self.config.text_config, self.decoder.config, self.generation_config]:
+             if isinstance(cfg, dict):
+                 cfg["pad_token_id"] = self.tokenizer.pad_token_id
+                 cfg["eos_token_id"] = self.tokenizer.eos_token_id
+                 cfg["bos_token_id"] = self.tokenizer.bos_token_id
+             else:
+                 cfg.pad_token_id = self.tokenizer.pad_token_id
+                 cfg.eos_token_id = self.tokenizer.eos_token_id
+                 cfg.bos_token_id = self.tokenizer.bos_token_id
+
+     def get_processor(self):
+         try:
+             from .asr_processing import ASRProcessor
+         except ImportError:
+             from asr_processing import ASRProcessor  # type: ignore[no-redef]
+
+         return ASRProcessor(feature_extractor=self.feature_extractor, tokenizer=self.tokenizer)
+
+     def state_dict(self, *args, **kwargs):
+         """Return only trainable parameters (projector weights).
+
+         Called by HuggingFace Trainer to save model.safetensors in checkpoints.
+         """
+         return self._get_trainable_state_dict()
+
+     def _get_trainable_state_dict(self):
+         """Get all trainable parameters as a single state dict.
+
+         This is used by Trainer for checkpointing during training.
+         """
+         state = {}
+
+         # Only projector params are trainable now (encoder and decoder are frozen)
+         projector_state = self.projector.state_dict()
+         for name, tensor in projector_state.items():
+             state[f"projector.{name}"] = tensor
+
+         return state
+
+     def get_input_embeddings(self):
+         """Delegate to decoder for proper HF Trainer integration."""
+         return self.decoder.get_input_embeddings()
+
+     def set_input_embeddings(self, value):
+         """Delegate to decoder for proper HF Trainer integration."""
+         self.decoder.set_input_embeddings(value)
+
+     def get_output_embeddings(self):
+         """Delegate to decoder for proper HF Trainer integration."""
+         return self.decoder.get_output_embeddings()
+
+     def set_output_embeddings(self, value):
+         """Delegate to decoder for proper HF Trainer integration."""
+         self.decoder.set_output_embeddings(value)
+
+     def _encode_audio(
+         self,
+         input_values: torch.Tensor,
+         audio_attention_mask: Optional[torch.Tensor] = None,
+     ) -> torch.Tensor:
+         # Ensure input is on encoder's device and has the right dtype
+         encoder_device = next(self.encoder.parameters()).device
+         encoder_dtype = next(self.encoder.parameters()).dtype
+         # Clone to prevent user tensor reuse contamination
+         input_values = input_values.clone().to(device=encoder_device, dtype=encoder_dtype)
+
+         # Only pass explicit valid arguments to encoder
+         # Never use **kwargs to prevent torch.compile from injecting decoder args like input_ids
+         # Always use no_grad since encoder is frozen
+         with torch.no_grad():
+             audio_features = self.encoder(
+                 input_values=input_values,
+                 attention_mask=audio_attention_mask,
+             ).last_hidden_state
+
+         # Project audio features and ensure dtype matches decoder
+         audio_embeds = self.projector(audio_features)
+
+         # Convert to decoder's dtype if needed (e.g., bfloat16)
+         decoder_dtype = next(self.decoder.parameters()).dtype
+         if audio_embeds.dtype != decoder_dtype:
+             audio_embeds = audio_embeds.to(dtype=decoder_dtype)
+
+         return audio_embeds
+
+     def _get_audio_expansion_details(self, input_ids: torch.Tensor, num_audio_tokens: int) -> dict:
+         """Calculate the positions and masks needed to expand audio tokens.
+
+         This helper consolidates the common cumsum logic used by both
+         _expand_audio_tokens and _expand_for_audio_tokens.
+
+         Args:
+             input_ids: Token IDs with single <audio> token per sample
+             num_audio_tokens: Number of tokens each audio token expands to
+
+         Returns:
+             Dictionary containing:
+             - new_seq_len: The total sequence length after expansion
+             - new_start_positions: [batch, old_seq_len] tensor mapping old indices to new
+             - audio_mask: [batch, old_seq_len] boolean mask for audio token positions
+         """
+         batch_size, seq_len = input_ids.shape
+         device = input_ids.device
+
+         # Find audio token positions
+         audio_mask = input_ids == self.audio_token_id
+
+         # Validate: each sample must have exactly one audio token
+         audio_counts = audio_mask.sum(dim=1)
+         if not (audio_counts == 1).all():
+             missing = (audio_counts == 0).any()
+             multiple = (audio_counts > 1).any()
+             if missing:
+                 raise ValueError("Some samples are missing audio token")
+             if multiple:
+                 raise ValueError("Some samples have multiple audio tokens")
+
+         # Create placeholder tensor: 1 for normal tokens, num_audio_tokens for audio token
+         token_counts = torch.where(audio_mask, num_audio_tokens, 1)
+
+         # Cumsum - 1 gives us the ENDING position of each token's expansion
+         cumsum_counts = torch.cumsum(token_counts, dim=1)
+
+         # The starting position of token i is cumsum[i-1]
+         new_start_positions = torch.cat(
+             [
+                 torch.zeros(batch_size, 1, dtype=torch.long, device=device),
+                 cumsum_counts[:, :-1],
+             ],
+             dim=1,
+         )
+
+         # Calculate new sequence length
+         new_seq_len = seq_len - 1 + num_audio_tokens
+
+         return {
+             "new_seq_len": new_seq_len,
+             "new_start_positions": new_start_positions,
+             "audio_mask": audio_mask,
+         }
+
+     def _expand_tensor_for_audio(
+         self,
+         input_ids: torch.Tensor,
+         tensor_to_expand: Optional[torch.Tensor],
+         num_audio_tokens: int,
+         fill_value: Optional[Union[int, float]] = None,
+         audio_fill_value: Optional[Union[int, float]] = None,
+     ) -> torch.Tensor:
+         """Generic method to expand any tensor to match audio token expansion.
+
+         Args:
+             input_ids: Token IDs with single <audio> token per sample
+             tensor_to_expand: Tensor to expand (input_ids, attention_mask, labels) or None
+             num_audio_tokens: Number of tokens each audio token expands to
+             fill_value: Default fill value for new tensor
+             audio_fill_value: Value to use for audio token positions (if different from fill_value)
+
+         Returns:
+             Expanded tensor matching the expanded sequence length
+         """
+         batch_size, seq_len = input_ids.shape
+         device = input_ids.device
+
+         details = self._get_audio_expansion_details(input_ids, num_audio_tokens)
+         new_seq_len = details["new_seq_len"]
+         new_start_positions = details["new_start_positions"]
+         audio_mask = details["audio_mask"]
+
+         # Determine the tensor we're actually expanding
+         if tensor_to_expand is None:
+             # Expanding input_ids themselves
+             tensor_to_expand = input_ids
+             fill_value = fill_value or self.tokenizer.pad_token_id
+             audio_fill_value = audio_fill_value or self.audio_token_id
+         else:
+             # Expanding other tensors (attention_mask, labels)
+             if fill_value is None:
+                 raise ValueError("fill_value must be provided when expanding non-input_ids tensors")
+             if audio_fill_value is None:
+                 audio_fill_value = fill_value
+
+         # Create output tensor
+         expanded = torch.full(
+             (batch_size, new_seq_len),
+             fill_value,
+             dtype=tensor_to_expand.dtype,
+             device=device,
+         )
+
+         # Scatter non-audio positions to their new positions
+         batch_indices = torch.arange(batch_size, device=device).unsqueeze(1).expand(-1, seq_len)
+         non_audio_mask = ~audio_mask
+         expanded[batch_indices[non_audio_mask], new_start_positions[non_audio_mask]] = (
+             tensor_to_expand[non_audio_mask]
+         )
+
+         # Fill audio positions if different from default fill
+         if audio_fill_value != fill_value:
+             audio_positions = audio_mask.int().argmax(dim=1)
+             audio_new_start = new_start_positions[
+                 torch.arange(batch_size, device=device), audio_positions
+             ]
+             audio_token_indices = torch.arange(num_audio_tokens, device=device).unsqueeze(0)
+             audio_positions_expanded = audio_new_start.unsqueeze(1) + audio_token_indices
+             batch_idx_expanded = (
+                 torch.arange(batch_size, device=device).unsqueeze(1).expand(-1, num_audio_tokens)
+             )
+             expanded[batch_idx_expanded, audio_positions_expanded] = audio_fill_value
+
+         return expanded
+
+     def _expand_audio_tokens(self, input_ids: torch.Tensor, num_audio_tokens: int) -> torch.Tensor:
+         """Convenience method for expanding input_ids."""
+         return self._expand_tensor_for_audio(input_ids, None, num_audio_tokens)
+
+     def _expand_for_audio_tokens(
+         self,
+         input_ids: torch.Tensor,
+         tensor_to_expand: torch.Tensor,
+         num_audio_tokens: int,
+         fill_value: Union[int, float],
+     ) -> torch.Tensor:
+         """Convenience method for expanding attention_mask or labels."""
+         return self._expand_tensor_for_audio(
+             input_ids, tensor_to_expand, num_audio_tokens, fill_value
+         )
+
+     def _prepare_audio_inputs_embeds(
+         self, expanded_input_ids: torch.Tensor, audio_embeds: torch.Tensor
+     ) -> torch.Tensor:
+         """Prepare inputs_embeds by replacing audio token embeddings with actual audio embeddings.
+
+         Args:
+             expanded_input_ids: Input IDs with expanded audio tokens
+             audio_embeds: Audio embeddings to inject
+
+         Returns:
+             inputs_embeds with audio embeddings injected
+         """
+         # Get text embeddings for expanded input_ids
+         inputs_embeds = self.decoder.get_input_embeddings()(expanded_input_ids)
+
+         # Simple masked scatter: replace audio token embeddings with actual audio embeddings
+         special_audio_mask = (expanded_input_ids == self.audio_token_id).unsqueeze(-1)
+         special_audio_mask = special_audio_mask.expand_as(inputs_embeds)
+         audio_embeds_flat = audio_embeds.reshape(-1, audio_embeds.shape[-1])
+         return inputs_embeds.masked_scatter(special_audio_mask, audio_embeds_flat)
+
+     def forward(
+         self,
+         input_ids: Optional[torch.Tensor] = None,
+         input_values: Optional[torch.Tensor] = None,
+         input_features: Optional[torch.Tensor] = None,  # For Whisper
+         labels: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         num_items_in_batch: Optional[
+             int
+         ] = None,  # HF Trainer provides this for gradient accumulation
+         **kwargs,
+     ):
+         audio_inputs = input_values if input_values is not None else input_features
+         if audio_inputs is not None:
+             # During inference, the pipeline may call forward with only audio inputs
+             # In that case, we should raise an error directing to use generate() instead
+             if input_ids is None:
+                 raise ValueError(
+                     "forward() requires both audio inputs and input_ids (for training). "
+                     "For inference, use the generate() method instead, or use the pipeline "
+                     "which will automatically call generate()."
+                 )
+
+             # Extract audio-specific kwargs, don't pass input_ids to encoder
+             audio_attention_mask = kwargs.pop("audio_attention_mask", None)
+
+             # Remove any decoder-specific kwargs that shouldn't go to the encoder
+             kwargs.pop("past_key_values", None)
+             use_cache = kwargs.pop("use_cache", None)
+
+             # Encode audio to get embeddings
+             audio_embeds = self._encode_audio(
+                 input_values=audio_inputs,  # Will be mapped to input_features for Whisper by safe_encoder_forward
+                 audio_attention_mask=audio_attention_mask,
+             )
+
+             # Validate audio token ID before using it
+             if self.audio_token_id is None:
+                 raise ValueError(f"Audio token not properly initialized: {self.audio_token_id}")
+
+             vocab_size = self.decoder.get_input_embeddings().weight.shape[0]
+             if self.audio_token_id >= vocab_size:
+                 raise ValueError(
+                     f"Audio token ID out of range. ID: {self.audio_token_id}, Vocab size: {vocab_size}"
+                 )
+
+             # Check that audio token exists
+             if not (input_ids == self.audio_token_id).any():
+                 raise ValueError("Audio token <audio> must be present in input")
+
+             # Expand audio tokens to match audio embedding length
+             num_audio_tokens = audio_embeds.shape[1]
+             expanded_input_ids = self._expand_audio_tokens(input_ids, num_audio_tokens)
+
+             # Prepare inputs_embeds with audio embeddings injected
+             inputs_embeds = self._prepare_audio_inputs_embeds(expanded_input_ids, audio_embeds)
+
+             # Expand attention mask to match new sequence length (vectorized)
+             if attention_mask is not None:
+                 full_attention_mask = self._expand_for_audio_tokens(
+                     input_ids, attention_mask, num_audio_tokens, fill_value=1
+                 )
+             else:
+                 full_attention_mask = None
+
+             # Expand labels to match new sequence length (vectorized, mark audio tokens as -100)
+             if labels is not None:
+                 labels = self._expand_for_audio_tokens(
+                     input_ids, labels, num_audio_tokens, fill_value=-100
+                 )
+         else:
+             inputs_embeds = self.decoder.get_input_embeddings()(input_ids)
+             full_attention_mask = attention_mask
+             use_cache = kwargs.pop("use_cache", None)
+
+         # Standard forward pass with built-in loss computation
+         return self.decoder(
+             inputs_embeds=inputs_embeds,
+             attention_mask=full_attention_mask,
+             labels=labels,
+             use_cache=use_cache if use_cache is not None else False,
+             **kwargs,
+         )
+
+     @torch.no_grad()
+     def generate(
+         self,
+         input_values: Optional[torch.Tensor] = None,
+         input_features: Optional[torch.Tensor] = None,  # For Whisper
+         system_prompt: Optional[str] = None,
+         user_prompt: Optional[str] = None,
+         task: Optional[str] = None,
+         **generate_kwargs,
+     ) -> Union[
+         torch.Tensor,
+         GenerateDecoderOnlyOutput,
+         GenerateEncoderDecoderOutput,
+         GenerateBeamDecoderOnlyOutput,
+         GenerateBeamEncoderDecoderOutput,
+     ]:
+         audio_inputs = input_values if input_values is not None else input_features
+         if audio_inputs is None:
+             raise ValueError("input_values or input_features must be provided for generation")
+
+         audio_embeds = self._encode_audio(audio_inputs)
+         batch_size = audio_embeds.shape[0]
+         device = audio_embeds.device
+
+         if system_prompt is None:
+             system_prompt = self.system_prompt
+
+         if user_prompt is None:
+             user_prompt = (
+                 self.TASK_PROMPTS.get(task, self.config.user_prompt or "Transcribe: <audio>")
+                 or "Transcribe: <audio>"
+             )
+
+         messages = []
+         if system_prompt:
+             messages.append({"role": "system", "content": system_prompt})
+         messages.append(
+             {
+                 "role": "user",
+                 "content": user_prompt,
+             }
+         )
+
+         prompt_ids = self.tokenizer.apply_chat_template(
+             messages,
+             tokenize=True,
+             add_generation_prompt=True,
+             return_tensors="pt",
+             enable_thinking=False,
+         ).to(device)
+
+         if len(prompt_ids.shape) == 1:
+             prompt_ids = prompt_ids.unsqueeze(0)
+
+         if prompt_ids.shape[0] == 1 and batch_size > 1:
+             prompt_ids = prompt_ids.expand(batch_size, -1)
+
+         if not (prompt_ids == self.audio_token_id).any():
+             raise ValueError("Audio token <audio> not found in prompt")
+
+         # Expand audio tokens to match audio embedding length
+         num_audio_tokens = audio_embeds.shape[1]
+         expanded_prompt_ids = self._expand_audio_tokens(prompt_ids, num_audio_tokens)
+
+         # Prepare inputs_embeds with audio embeddings injected
+         inputs_embeds = self._prepare_audio_inputs_embeds(expanded_prompt_ids, audio_embeds)
+
+         # Create attention mask for expanded sequence
+         total_seq_len = inputs_embeds.shape[1]
+         attention_mask = torch.ones(batch_size, total_seq_len, dtype=torch.long, device=device)
+
+         # Apply generation defaults from config
+         config_params = [
+             "max_new_tokens",
+             "min_new_tokens",
+             "num_beams",
+             "do_sample",
+             "temperature",
+             "top_k",
+             "top_p",
+             "repetition_penalty",
+             "length_penalty",
+             "no_repeat_ngram_size",
+             "early_stopping",
+         ]
+         for param in config_params:
+             if hasattr(self.config, param) and getattr(self.config, param) is not None:
+                 generate_kwargs.setdefault(param, getattr(self.config, param))
+
+         # Add special token defaults
+         generate_kwargs.setdefault("use_cache", True)
+         generate_kwargs.setdefault(
+             "eos_token_id", self.tokenizer.convert_tokens_to_ids("<|im_end|>")
+         )
+         generate_kwargs.setdefault("pad_token_id", self.tokenizer.pad_token_id)
+
+         # Track the prompt length to extract only newly generated tokens
+         prompt_length = expanded_prompt_ids.shape[1]
+
+         # Generate the full sequence
+         generated_ids = self.decoder.generate(
+             input_ids=expanded_prompt_ids,
+             inputs_embeds=inputs_embeds,
+             attention_mask=attention_mask,
+             **generate_kwargs,
+         )
+
+         # Return only the newly generated tokens (exclude the prompt)
+         return generated_ids[:, prompt_length:]
+
+     def save_pretrained(self, save_directory: Union[str, Path], **kwargs):
+         import shutil
+         from pathlib import Path as PathlibPath
+
+         save_dir = PathlibPath(save_directory)
+         save_dir.mkdir(parents=True, exist_ok=True)
+
+         actual_vocab_size = self.decoder.config.vocab_size
+         self.config.vocab_size = actual_vocab_size
+         self.config.text_config.vocab_size = actual_vocab_size
+
+         if hasattr(self.encoder.config, "num_mel_bins"):
+             self.config.audio_config.num_mel_bins = self.encoder.config.num_mel_bins
+
+         # Use parent class to save config and model.safetensors
+         super().save_pretrained(save_dir, **kwargs)
+
+         self.tokenizer.save_pretrained(save_dir)
+
+         # For Whisper models, ensure feature_size matches num_mel_bins from encoder config
+         if hasattr(self.encoder.config, "num_mel_bins"):
+             # For Whisper models, explicitly set the correct feature_size before saving
+             num_mel_bins = self.encoder.config.num_mel_bins
+             self.feature_extractor.feature_size = num_mel_bins
+             self.feature_extractor.num_mel_bins = num_mel_bins  # Explicitly set num_mel_bins
+             if hasattr(self.feature_extractor, "n_mels"):
+                 self.feature_extractor.n_mels = num_mel_bins
+             self.feature_extractor.nb_max_frames = 3000  # Whisper's max frames
+
+         self.get_processor().save_pretrained(save_dir)
+
+         src_dir = PathlibPath(__file__).parent
+         for asr_file in src_dir.glob("asr_*.py"):
+             shutil.copy(asr_file, save_dir / asr_file.name)
+
+
+ AutoConfig.register("asr_model", ASRConfig)
+ AutoModel.register(ASRConfig, ASRModel)
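
To make the cumsum-based index math in `_get_audio_expansion_details` concrete, here is a self-contained toy run of the same logic outside the class (standalone sketch with a made-up `audio_token_id`):

```python
# Toy walkthrough of the audio-token expansion math used above.
import torch

audio_token_id = 99                            # hypothetical id for <audio>
num_audio_tokens = 3                           # each <audio> expands to 3 embedding slots
input_ids = torch.tensor([[10, 99, 11, 12]])   # one <audio> at position 1

audio_mask = input_ids == audio_token_id
token_counts = torch.where(audio_mask, num_audio_tokens, 1)  # [1, 3, 1, 1]
cumsum_counts = torch.cumsum(token_counts, dim=1)            # [1, 4, 5, 6]
new_start_positions = torch.cat(
    [torch.zeros(1, 1, dtype=torch.long), cumsum_counts[:, :-1]], dim=1
)                                                            # [0, 1, 4, 5]
new_seq_len = input_ids.shape[1] - 1 + num_audio_tokens      # 6

# Token 10 lands at new index 0, the three audio slots occupy 1..3, and
# tokens 11, 12 land at 4 and 5 - matching _expand_audio_tokens above.
print(new_seq_len, new_start_positions.tolist())
```

For scale: Whisper's 30 s window yields 1500 encoder frames, so with the default `projector_pool_stride=2` the projector emits 750 audio embeddings, and each `<audio>` placeholder expands to 750 slots.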
asr_pipeline.py ADDED
@@ -0,0 +1,293 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from typing import Any, Dict
2
+
3
+ import torch
4
+ import transformers
5
+ from truecase import get_true_case
6
+
7
+ try:
8
+ from .asr_modeling import ASRModel
9
+ except ImportError:
10
+ from asr_modeling import ASRModel # type: ignore[no-redef]
11
+
12
+
13
+ class ASRPipeline(transformers.AutomaticSpeechRecognitionPipeline):
14
+ model: ASRModel
15
+
16
+ def __init__(self, model: ASRModel, **kwargs):
17
+ feature_extractor = kwargs.pop("feature_extractor", model.feature_extractor)
18
+ tokenizer = kwargs.pop("tokenizer", model.tokenizer)
19
+
20
+ super().__init__(
21
+ model=model, feature_extractor=feature_extractor, tokenizer=tokenizer, **kwargs
22
+ )
23
+
24
+ # Initialize text normalizer (same as train.py)
25
+ if hasattr(tokenizer, "normalize"):
26
+ self.text_normalizer = tokenizer
27
+ else:
28
+ # Fallback to whisper-tiny tokenizer for its normalize() method only
29
+ from transformers import WhisperTokenizer
30
+ self.text_normalizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
31
+
32
+ def __call__(self, inputs, **kwargs):
33
+ generate_kwargs = {}
34
+ for key in [
35
+ "max_new_tokens",
36
+ "num_beams",
37
+ "do_sample",
38
+ "length_penalty",
39
+ "repetition_penalty",
40
+ "no_repeat_ngram_size",
41
+ "early_stopping",
42
+ "num_beam_groups",
43
+ "diversity_penalty",
44
+ "top_k",
45
+ "temperature",
46
+ "top_p",
47
+ "user_prompt",
48
+ "task",
49
+ "text_input",
50
+ ]:
51
+ if key in kwargs:
52
+ generate_kwargs[key] = kwargs.pop(key)
53
+
54
+ # Handle text-only mode
55
+ task = generate_kwargs.get("task")
56
+ if task == "text" or generate_kwargs.get("text_input"):
57
+ return self._process_text_only(generate_kwargs)
58
+
59
+ if isinstance(inputs, list):
60
+ results = []
61
+ for single_input in inputs:
62
+ result = self.__call__(single_input, **kwargs, **generate_kwargs)
63
+ results.append(result)
64
+ return results
65
+
66
+ model_inputs = self.preprocess(inputs, **kwargs)
67
+
68
+ from collections.abc import Iterator
69
+
70
+ if isinstance(model_inputs, Iterator):
71
+ # Convert iterator to list to process chunks
72
+ chunks = list(model_inputs)
73
+
74
+ all_outputs = []
75
+ for _chunk_num, chunk in enumerate(chunks, start=1):
76
+ chunk_output = self._forward(chunk, **generate_kwargs)
77
+ # Move tensors to CPU before adding to outputs
78
+ for key, value in chunk_output.items():
79
+ if torch.is_tensor(value):
80
+ chunk_output[key] = value.cpu()
81
+ all_outputs.append(chunk_output)
82
+
83
+ # Merge chunks and decode ourselves to ensure skip_special_tokens=True
84
+ all_tokens: list[int] = []
85
+ for output in all_outputs:
86
+ tokens = output.get("tokens")
87
+ if tokens is None:
88
+ tokens = output.get("generated_ids")
89
+ if tokens is not None:
90
+ if torch.is_tensor(tokens):
91
+ tokens = tokens.cpu()
92
+ if len(tokens.shape) > 1:
93
+ tokens = tokens[0]
94
+ all_tokens.extend(tokens.tolist() if torch.is_tensor(tokens) else tokens)
95
+
96
+ # Decode the merged tokens with skip_special_tokens
97
+ text = self.tokenizer.decode(all_tokens, skip_special_tokens=True)
98
+ text = text.strip()
99
+
100
+ # Apply Whisper normalization (matches training)
101
+ text = self.text_normalizer.normalize(text)
102
+
103
+ # Apply truecasing for proper capitalization
104
+ text = get_true_case(text)
105
+
106
+ return {"text": text}
107
+
108
+ model_outputs = self._forward(model_inputs, **generate_kwargs)
109
+ return self.postprocess(model_outputs)
110
+
111
+ def preprocess(self, inputs, **preprocess_params):
112
+ if isinstance(inputs, list):
113
+ raise ValueError("Lists should not reach preprocess - bug in __call__")
114
+
115
+ # Set default chunking to 30 seconds with 5 second overlap
116
+ preprocess_params.setdefault("chunk_length_s", 30)
117
+ preprocess_params.setdefault("stride_length_s", (5, 5))
118
+
119
+ # Handle different formats from datasets
120
+ if isinstance(inputs, dict):
121
+ if "bytes" in inputs:
122
+ # Decode bytes to audio array using torchcodec
123
+ import tempfile
124
+
125
+ from torchcodec.decoders import AudioDecoder
126
+
127
+ wav_bytes = inputs["bytes"]
128
+ # Write to temp file for torchcodec to read
129
+ with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
130
+ f.write(wav_bytes)
131
+ temp_path = f.name
132
+ try:
133
+ decoder = AudioDecoder(temp_path)
134
+ # Get all audio samples
135
+ audio_result = decoder.get_all_samples()
136
+ audio_tensor = audio_result.data
137
+ sample_rate = audio_result.sample_rate
138
+ inputs = {"raw": audio_tensor.squeeze().numpy(), "sampling_rate": sample_rate}
139
+ finally:
140
+ from pathlib import Path
141
+
142
+ Path(temp_path).unlink()
143
+ elif "array" in inputs:
144
+ # Convert "array" key to "raw" key
145
+ inputs = {"raw": inputs["array"], "sampling_rate": inputs["sampling_rate"]}
146
+ # If it already has "raw" and "sampling_rate", it's good to go
147
+ elif hasattr(inputs, "array") and hasattr(inputs, "sampling_rate"):
148
+ # Audio object with attributes (not dict)
149
+ inputs = {"raw": inputs.array, "sampling_rate": inputs.sampling_rate}
150
+ elif hasattr(inputs, "__array__") and not isinstance(inputs, (dict, bytes, str)):
151
+ inputs = {"raw": inputs, "sampling_rate": self.model.config.audio_sample_rate}
152
+ elif torch.is_tensor(inputs):
153
+ inputs = {
154
+ "raw": inputs.cpu().numpy(),
155
+ "sampling_rate": self.model.config.audio_sample_rate,
156
+ }
157
+
158
+ return super().preprocess(inputs, **preprocess_params)
159
+
160
+ def _forward(self, model_inputs, **generate_kwargs):
161
+ # Extract task and set sampling parameters
162
+ task = generate_kwargs.pop("task", None)
163
+
164
+ # Task-specific sampling parameters
165
+ task_params: Dict[str, Dict[str, Any]] = {
166
+ "transcribe": {"do_sample": False},
167
+ "emotion": {"do_sample": True, "temperature": 0.7},
168
+ "describe": {"do_sample": True, "temperature": 0.7},
169
+ "continue": {"do_sample": True, "temperature": 1.0},
170
+ }
171
+
172
+ if task in task_params:
173
+ for key, value in task_params[task].items():
174
+ generate_kwargs.setdefault(key, value)
175
+
176
+ # Extract audio inputs from various formats
177
+ is_last = True
178
+ audio_inputs = None
179
+ is_whisper = False # Track if this is Whisper input
180
+
181
+ # Normalize model_inputs to dict format
182
+ if isinstance(model_inputs, torch.Tensor):
183
+ audio_inputs = model_inputs
184
+ elif isinstance(model_inputs, (list, tuple)) and model_inputs:
185
+ model_inputs = (
186
+ model_inputs[0]
187
+ if isinstance(model_inputs[0], dict)
188
+ else {"input_values": model_inputs[0]}
189
+ )
190
+
191
+ if isinstance(model_inputs, dict):
192
+ # Pop metadata fields
193
+ is_last = model_inputs.pop("is_last", True)
194
+ model_inputs.pop("stride", None)
195
+ # Get audio input (Whisper uses input_features, others use input_values)
196
+ if "input_features" in model_inputs:
197
+ audio_inputs = model_inputs["input_features"]
198
+ is_whisper = True
199
+ else:
200
+ audio_inputs = model_inputs.get("input_values")
201
+
202
+ if audio_inputs is None:
203
+ raise ValueError(
204
+ f"Could not extract input_values or input_features from {type(model_inputs)}"
205
+ )
206
+
207
+ if isinstance(audio_inputs, torch.Tensor):
208
+ audio_inputs = audio_inputs.to(self.model.device)
209
+ else:
210
+ raise ValueError(f"audio inputs must be a tensor, got {type(audio_inputs)}")
211
+
212
+ im_end_id = self.model.tokenizer.convert_tokens_to_ids("<|im_end|>")
213
+ generate_kwargs.setdefault("eos_token_id", im_end_id)
214
+ generate_kwargs.setdefault("max_new_tokens", self.model.config.max_new_tokens)
215
+
216
+ # Pass the appropriate input type to generate
217
+ if is_whisper:
218
+ # Whisper model - use input_features
219
+ generated_ids = self.model.generate(
220
+ input_features=audio_inputs,
221
+ system_prompt=self.model.config.system_prompt,
222
+ task=task,
223
+ **generate_kwargs,
224
+ )
225
+ else:
226
+ # Wav2Vec2/HuBERT model - use input_values
227
+ generated_ids = self.model.generate(
228
+ input_values=audio_inputs,
229
+ system_prompt=self.model.config.system_prompt,
230
+ task=task,
231
+ **generate_kwargs,
232
+ )
233
+
234
+ return {"tokens": generated_ids, "is_last": is_last}
235
+
236
+ def _process_text_only(self, generate_kwargs):
237
+ """Process text-only input without audio encoding."""
238
+ text_input = generate_kwargs.pop("text_input", None)
239
+ if text_input is None:
240
+ raise ValueError("text_input is required for text task")
241
+
242
+ # Remove task from generate_kwargs to avoid duplicate argument
243
+ generate_kwargs.pop("task", None)
244
+
245
+ # Generate text using the model
246
+ generated_ids = self.model.generate(task="text", text_input=text_input, **generate_kwargs)
247
+
248
+ # Decode the generated text
249
+ generated_text = self.tokenizer.decode(generated_ids[0], skip_special_tokens=True)
250
+
251
+ return {"text": generated_text}
252
+
253
+ def postprocess(
254
+ self, model_outputs: Dict[str, Any], return_timestamps=None, return_language=None
255
+ ):
256
+ # Handle chunked outputs from iterator
257
+ if isinstance(model_outputs, list):
258
+ # Move all tensors to CPU before calling parent postprocess
259
+ for output_dict in model_outputs:
260
+ for key, value in output_dict.items():
261
+ if torch.is_tensor(value):
262
+ output_dict[key] = value.cpu()
263
+ return super().postprocess(model_outputs)
264
+
265
+ if "is_last" in model_outputs:
266
+ model_outputs.pop("is_last")
267
+
268
+ tokens = model_outputs.get("tokens")
269
+ if tokens is None:
270
+ tokens = model_outputs.get("generated_ids")
271
+
272
+ if tokens is None:
273
+ raise ValueError(
274
+ f"Expected 'tokens' or 'generated_ids' in model_outputs, got: {model_outputs.keys()}"
275
+ )
276
+
277
+ # Move to CPU if on MPS or other device
278
+ if torch.is_tensor(tokens) and tokens.device.type != "cpu":
279
+ tokens = tokens.cpu()
280
+
281
+ if len(tokens.shape) > 1:
282
+ tokens = tokens[0]
283
+
284
+ text = self.tokenizer.decode(tokens, skip_special_tokens=True)
285
+ text = text.strip()
286
+
287
+ # Apply Whisper normalization (matches training)
288
+ text = self.text_normalizer.normalize(text)
289
+
290
+ # Apply truecasing for proper capitalization
291
+ text = get_true_case(text)
292
+
293
+ return {"text": text}
asr_processing.py ADDED
@@ -0,0 +1,78 @@
+ import transformers
+ from transformers import AutoTokenizer, ProcessorMixin
+
+ # Handle both package and standalone imports
+ try:
+     from .asr_config import ASRConfig
+ except ImportError:
+     from asr_config import ASRConfig  # type: ignore[no-redef]
+
+
+ class ASRProcessor(ProcessorMixin):
+     """Generic processor that can handle both Wav2Vec2 and Whisper feature extractors."""
+
+     feature_extractor_class = "AutoFeatureExtractor"
+     tokenizer_class = "AutoTokenizer"
+
+     def __init__(self, feature_extractor, tokenizer):
+         self.feature_extractor = feature_extractor
+         self.tokenizer = tokenizer
+
+     @classmethod
+     def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
+         from transformers import AutoFeatureExtractor
+
+         # Load feature extractor and tokenizer from saved model directory
+         feature_extractor = AutoFeatureExtractor.from_pretrained(
+             pretrained_model_name_or_path, **kwargs
+         )
+
+         tokenizer = AutoTokenizer.from_pretrained(
+             pretrained_model_name_or_path, trust_remote_code=True, **kwargs
+         )
+
+         return cls(feature_extractor=feature_extractor, tokenizer=tokenizer)
+
+     def save_pretrained(self, save_directory, **kwargs):
+         """Override save_pretrained to avoid attribute errors from base class."""
+         import json
+         from pathlib import Path
+
+         save_path = Path(save_directory)
+         save_path.mkdir(parents=True, exist_ok=True)
+
+         # Save the feature extractor (this creates preprocessor_config.json with all feature extractor settings)
+         if self.feature_extractor is not None:
+             self.feature_extractor.save_pretrained(save_directory)
+
+         # Save the tokenizer
+         if self.tokenizer is not None:
+             self.tokenizer.save_pretrained(save_directory)
+
+         # Load the existing preprocessor_config.json and add processor-specific metadata
+         config_path = save_path / "preprocessor_config.json"
+         if config_path.exists():
+             with config_path.open() as f:
+                 processor_config = json.load(f)
+         else:
+             processor_config = {}
+
+         # Add/update processor metadata while preserving feature extractor settings
+         feature_extractor_type = self.feature_extractor.__class__.__name__
+         processor_config.update(
+             {
+                 "processor_class": self.__class__.__name__,
+                 "feature_extractor_class": self.feature_extractor_class,
+                 "tokenizer_class": self.tokenizer_class,
+                 "feature_extractor_type": feature_extractor_type,  # Dynamic based on actual type
+                 "auto_map": {"AutoProcessor": "asr_processing.ASRProcessor"},
+             }
+         )
+
+         # Save the merged config
+         with config_path.open("w") as f:
+             json.dump(processor_config, f, indent=2)
+
+
+ ASRProcessor.register_for_auto_class()
+ transformers.AutoProcessor.register(ASRConfig, ASRProcessor)
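
Assuming `asr_config.py` in this repo defines `ASRConfig`, the two registration calls above let `AutoProcessor` resolve to `ASRProcessor`. A loading sketch (the directory path is illustrative):

```python
from transformers import AutoProcessor

# The auto_map written by save_pretrained points AutoProcessor at
# asr_processing.ASRProcessor, so trust_remote_code=True is required.
processor = AutoProcessor.from_pretrained("./saved_model", trust_remote_code=True)

print(type(processor.feature_extractor).__name__)  # WhisperFeatureExtractor for this checkpoint
print(type(processor.tokenizer).__name__)
```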
chat_template.jinja ADDED
@@ -0,0 +1,96 @@
+ {# ───── defaults ───── #}
+ {%- if enable_thinking is not defined -%}
+   {%- set enable_thinking = true -%}
+ {%- endif -%}
+ {%- set system_message = "" -%}
+ {%- set custom_instructions = "" -%}
+
+ {# ───── reasoning mode ───── #}
+ {%- if enable_thinking -%}
+   {%- set reasoning_mode = "/think" -%}
+ {%- else -%}
+   {%- set reasoning_mode = "/no_think" -%}
+ {%- endif -%}
+
+ {# ───── header (system message) ───── #}
+ {{- "<|im_start|>system\n" -}}
+
+ {%- if messages[0].role == "system" -%}
+   {%- set system_message = messages[0].content -%}
+   {%- if "/no_think" in system_message -%}
+     {%- set reasoning_mode = "/no_think" -%}
+   {%- elif "/think" in system_message -%}
+     {%- set reasoning_mode = "/think" -%}
+   {%- endif -%}
+   {%- set custom_instructions = system_message.replace("/no_think", "").replace("/think", "").rstrip() -%}
+ {%- endif -%}
+
+ {%- if "/system_override" in system_message -%}
+   {{- custom_instructions.replace("/system_override", "").rstrip() -}}
+   {{- "<|im_end|>\n" -}}
+ {%- else -%}
+   {{- "## Metadata\n\n" -}}
+   {{- "Knowledge Cutoff Date: June 2025\n" -}}
+   {%- set today = strftime_now("%d %B %Y") -%}
+   {{- "Today Date: " ~ today ~ "\n" -}}
+   {{- "Reasoning Mode: " + reasoning_mode + "\n\n" -}}
+
+   {{- "## Custom Instructions\n\n" -}}
+   {%- if custom_instructions -%}
+     {{- custom_instructions + "\n\n" -}}
+   {%- elif reasoning_mode == "/think" -%}
+     {{- "You are a helpful AI assistant named SmolLM, trained by Hugging Face. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracking, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> Thought section </think> Solution section. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion.\n\n" -}}
+   {%- else -%}
+     {{- "You are a helpful AI assistant named SmolLM, trained by Hugging Face.\n\n" -}}
+   {%- endif -%}
+
+   {%- if xml_tools or python_tools or tools -%}
+     {{- "### Tools\n\n" -}}
+     {%- if xml_tools or tools -%}
+       {%- if tools -%}
+         {%- set xml_tools = tools -%}
+       {%- endif -%}
+       {%- set ns = namespace(xml_tool_string="You may call one or more functions to assist with the user query.\nYou are provided with function signatures within <tools></tools> XML tags:\n\n<tools>\n") -%}
+       {%- for tool in xml_tools[:] -%} {# The slicing makes sure that xml_tools is a list #}
+         {%- set ns.xml_tool_string = ns.xml_tool_string ~ (tool | string) ~ "\n" -%}
+       {%- endfor -%}
+       {%- set xml_tool_string = ns.xml_tool_string + "</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>" -%}
+       {{- xml_tool_string -}}
+     {%- endif -%}
+     {%- if python_tools -%}
+       {%- set ns = namespace(python_tool_string="When you send a message containing Python code between '<code>' and '</code>' tags, it will be executed in a stateful Jupyter notebook environment, and you will then be given the output to continue reasoning in an agentic loop.\n\nYou can use the following tools in your python code like regular functions:\n<tools>\n") -%}
+       {%- for tool in python_tools[:] -%} {# The slicing makes sure that python_tools is a list #}
+         {%- set ns.python_tool_string = ns.python_tool_string ~ (tool | string) ~ "\n" -%}
+       {%- endfor -%}
+       {%- set python_tool_string = ns.python_tool_string + "</tools>\n\nThe state persists between code executions: so variables that you define in one step are still available thereafter." -%}
+       {{- python_tool_string -}}
+     {%- endif -%}
+     {{- "\n\n" -}}
+   {%- endif -%}
+   {{- "<|im_end|>\n" -}}
+ {%- endif -%}
+ {# ───── main loop ───── #}
+ {%- for message in messages -%}
+   {%- set content = message.content if message.content is string else "" -%}
+   {%- if message.role == "user" -%}
+     {{ "<|im_start|>" + message.role + "\n" + content + "<|im_end|>\n" }}
+   {%- elif message.role == "assistant" -%}
+     {% generation %}
+     {%- if reasoning_mode == "/think" -%}
+       {{ "<|im_start|>assistant\n" + content.lstrip("\n") + "<|im_end|>\n" }}
+     {%- else -%}
+       {{ "<|im_start|>assistant\n" + "<think>\n\n</think>\n" + content.lstrip("\n") + "<|im_end|>\n" }}
+     {%- endif -%}
+     {% endgeneration %}
+   {%- elif message.role == "tool" -%}
+     {{ "<|im_start|>" + "user\n" + content + "<|im_end|>\n" }}
+   {%- endif -%}
+ {%- endfor -%}
+ {# ───── generation prompt ───── #}
+ {%- if add_generation_prompt -%}
+   {%- if reasoning_mode == "/think" -%}
+     {{ "<|im_start|>assistant\n" }}
+   {%- else -%}
+     {{ "<|im_start|>assistant\n" + "<think>\n\n</think>\n" }}
+   {%- endif -%}
+ {%- endif -%}
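
A rendering sketch for the template above, assuming it is attached to this repo's tokenizer. `apply_chat_template` passes extra keyword arguments such as `enable_thinking` into the template context, and a `/think` or `/no_think` marker in the system message overrides that default:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./saved_model", trust_remote_code=True)  # illustrative path

messages = [
    {"role": "system", "content": "Answer briefly. /no_think"},
    {"role": "user", "content": "What is the capital of France?"},
]

prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
# "/no_think" in the system text wins over enable_thinking=True, so the
# generation prompt ends with a pre-filled empty <think>\n\n</think> block.
print(prompt)
```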
preprocessor_config.json ADDED
@@ -0,0 +1,21 @@
+ {
+   "chunk_length": 30,
+   "dither": 0.0,
+   "feature_extractor_type": "WhisperFeatureExtractor",
+   "feature_size": 128,
+   "hop_length": 160,
+   "n_fft": 400,
+   "n_samples": 480000,
+   "nb_max_frames": 3000,
+   "num_mel_bins": 128,
+   "padding_side": "right",
+   "padding_value": 0.0,
+   "processor_class": "ASRProcessor",
+   "return_attention_mask": false,
+   "sampling_rate": 16000,
+   "feature_extractor_class": "AutoFeatureExtractor",
+   "tokenizer_class": "AutoTokenizer",
+   "auto_map": {
+     "AutoProcessor": "asr_processing.ASRProcessor"
+   }
+ }
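
The derived fields in this config are mutually consistent; a quick arithmetic check:

```python
# Sanity check on the Whisper feature-extractor settings above.
chunk_length = 30        # seconds per chunk
sampling_rate = 16_000   # Hz
hop_length = 160         # samples per frame hop (10 ms at 16 kHz)

assert chunk_length * sampling_rate == 480_000  # "n_samples"
assert 480_000 // hop_length == 3_000           # "nb_max_frames"
# feature_size == num_mel_bins == 128: 128 log-mel channels per frame.
```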
special_tokens_map.json ADDED
@@ -0,0 +1,19 @@
+ {
+   "additional_special_tokens": [
+     {
+       "content": "<audio>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false
+     }
+   ],
+   "eos_token": {
+     "content": "<|im_end|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<|finetune_right_pad_id|>"
+ }
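
These entries line up with the pipeline code earlier in this commit: `<|im_end|>` is the eos token that `_forward` sets as `eos_token_id`, while `<audio>` is an extra special token, presumably marking where audio features enter the text sequence. A lookup sketch (the path is a placeholder):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./saved_model", trust_remote_code=True)

print(tok.convert_tokens_to_ids("<|im_end|>"))  # used as eos_token_id in _forward
print(tok.convert_tokens_to_ids("<audio>"))     # additional special token
print(tok.pad_token)                            # "<|finetune_right_pad_id|>"
```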
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d4aeaf198f783cbf58d8cd59812baac429ffe49147bf9648f6618de20b8d4a4c
+ size 17209003
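
The three lines above are a Git LFS pointer, not the tokenizer itself; the ~17 MB `tokenizer.json` is materialized by LFS on checkout. A quick way to tell which one a local clone has:

```python
from pathlib import Path

p = Path("tokenizer.json")
if p.read_text(errors="ignore").startswith("version https://git-lfs"):
    print("LFS pointer only - run `git lfs pull` to fetch the real file")
else:
    print(f"materialized: {p.stat().st_size} bytes")
```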
tokenizer_config.json ADDED
Binary file (50.6 kB).