Training in progress - step 2000

Files changed (9)

README.md +199 -0
alignment.py +286 -0
asr_config.py +262 -0
asr_modeling.py +1069 -0
asr_pipeline.py +368 -0
asr_processing.py +132 -0
diarization.py +730 -0
preprocessor_config.json +19 -0
projectors.py +493 -0

README.md ADDED

	@@ -0,0 +1,199 @@

+---
+library_name: transformers
+tags: []
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]

alignment.py ADDED

	@@ -0,0 +1,286 @@

+"""Forced alignment for word-level timestamps using Wav2Vec2."""
+import numpy as np
+import torch
+def _get_device() -> str:
+    """Get best available device for non-transformers models."""
+    if torch.cuda.is_available():
+        return "cuda"
+    if torch.backends.mps.is_available():
+        return "mps"
+    return "cpu"
+class ForcedAligner:
+    """Lazy-loaded forced aligner for word-level timestamps using torchaudio wav2vec2.
+    Uses Viterbi trellis algorithm for optimal alignment path finding.
+    """
+    _bundle = None
+    _model = None
+    _labels = None
+    _dictionary = None
+    @classmethod
+    def get_instance(cls, device: str = "cuda"):
+        """Get or create the forced alignment model (singleton).
+        Args:
+            device: Device to run model on ("cuda", "mps", or "cpu")
+        Returns:
+            Tuple of (model, labels, dictionary)
+        """
+        if cls._model is None:
+            import torchaudio
+            cls._bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
+            cls._model = cls._bundle.get_model().to(device)
+            cls._model.eval()
+            cls._labels = cls._bundle.get_labels()
+            cls._dictionary = {c: i for i, c in enumerate(cls._labels)}
+        return cls._model, cls._labels, cls._dictionary
+    @staticmethod
+    def _get_trellis(emission: torch.Tensor, tokens: list[int], blank_id: int = 0) -> torch.Tensor:
+        """Build trellis for forced alignment using the Viterbi (max) recursion.
+        The trellis[t, j] represents the log probability of the best path that
+        aligns the first j tokens to the first t frames.
+        Args:
+            emission: Log-softmax emission matrix of shape (num_frames, num_classes)
+            tokens: List of target token indices
+            blank_id: Index of the blank/CTC token (default 0)
+        Returns:
+            Trellis matrix of shape (num_frames + 1, num_tokens + 1)
+        """
+        num_frames = emission.size(0)
+        num_tokens = len(tokens)
+        trellis = torch.full((num_frames + 1, num_tokens + 1), -float("inf"))
+        trellis[0, 0] = 0
+        for t in range(num_frames):
+            for j in range(num_tokens + 1):
+                # Stay: emit blank and stay at j tokens
+                stay = trellis[t, j] + emission[t, blank_id]
+                # Move: emit token j and advance to j+1 tokens
+                move = trellis[t, j - 1] + emission[t, tokens[j - 1]] if j > 0 else -float("inf")
+                trellis[t + 1, j] = max(stay, move)  # Viterbi: take best path
+        return trellis
+    @staticmethod
+    def _backtrack(
+        trellis: torch.Tensor, emission: torch.Tensor, tokens: list[int], blank_id: int = 0
+    ) -> list[tuple[int, float, float]]:
+        """Backtrack through trellis to find optimal forced monotonic alignment.
+        Guarantees:
+        - All tokens are emitted exactly once
+        - Strictly monotonic: each token's frames come after previous token's
+        - No frame skipping or token teleporting
+        Returns list of (token_id, start_frame, end_frame) for each token.
+        """
+        num_frames = emission.size(0)
+        num_tokens = len(tokens)
+        if num_tokens == 0:
+            return []
+        # Find the best ending point (should be at num_tokens)
+        # But verify trellis reached a valid state
+        if trellis[num_frames, num_tokens] == -float("inf"):
+            # Alignment failed - fall back to uniform distribution
+            frames_per_token = num_frames / num_tokens
+            return [
+                (tokens[i], i * frames_per_token, (i + 1) * frames_per_token)
+                for i in range(num_tokens)
+            ]
+        # Backtrack: find the frame at which each token transition occurred
+        # token_frames[i] = frame(s) at which token i was emitted
+        token_frames: list[list[int]] = [[] for _ in range(num_tokens)]
+        t = num_frames
+        j = num_tokens
+        while t > 0 and j > 0:
+            # Check: did we transition from j-1 to j at frame t-1?
+            stay_score = trellis[t - 1, j] + emission[t - 1, blank_id]
+            move_score = trellis[t - 1, j - 1] + emission[t - 1, tokens[j - 1]]
+            if move_score >= stay_score:
+                # Token j-1 was emitted at frame t-1
+                token_frames[j - 1].append(t - 1)
+                j -= 1
+            t -= 1
+        # Handle any remaining tokens at the start (edge case)
+        while j > 0:
+            token_frames[j - 1].append(0)
+            j -= 1
+        # We appended in reverse-time order; restore monotonic order
+        for frames in token_frames:
+            frames.reverse()
+        # Convert to spans
+        token_spans: list[tuple[int, float, float]] = []
+        for token_idx, frames in enumerate(token_frames):
+            if not frames:
+                # Token never emitted - assign minimal span after previous
+                if token_spans:
+                    prev_end = token_spans[-1][2]
+                    frames = [int(prev_end)]
+                else:
+                    frames = [0]
+            token_id = tokens[token_idx]
+            start_frame = float(min(frames))
+            end_frame = float(max(frames)) + 1.0
+            token_spans.append((token_id, start_frame, end_frame))
+        return token_spans
+    # Offset compensation for Wav2Vec2-BASE systematic bias (in seconds)
+    # Calibrated on librispeech-alignments dataset
+    START_OFFSET = 0.06  # Subtracted from start times (shifts starts 60 ms earlier)
+    END_OFFSET = -0.03  # Subtracted from end times; negative, so ends shift 30 ms later
+    @classmethod
+    def align(
+        cls,
+        audio: np.ndarray,
+        text: str,
+        sample_rate: int = 16000,
+        _language: str = "eng",
+        _batch_size: int = 16,
+    ) -> list[dict]:
+        """Align transcript to audio and return word-level timestamps.
+        Uses Viterbi trellis algorithm for optimal forced alignment.
+        Args:
+            audio: Audio waveform as a numpy array (a torch tensor is also accepted)
+            text: Transcript text to align
+            sample_rate: Audio sample rate (default 16000)
+            _language: ISO-639-3 language code (default "eng" for English, unused)
+            _batch_size: Batch size for alignment model (unused)
+        Returns:
+            List of dicts with 'word', 'start', 'end' keys
+        """
+        import torchaudio
+        device = _get_device()
+        model, _labels, dictionary = cls.get_instance(device)
+        assert cls._bundle is not None and dictionary is not None  # Initialized by get_instance
+        # Convert audio to tensor (copy to ensure array is writable)
+        if isinstance(audio, np.ndarray):
+            waveform = torch.from_numpy(audio.copy()).float()
+        else:
+            waveform = audio.clone().float()
+        # Ensure 2D (channels, time)
+        if waveform.dim() == 1:
+            waveform = waveform.unsqueeze(0)
+        # Resample if needed (wav2vec2 expects 16kHz)
+        if sample_rate != cls._bundle.sample_rate:
+            waveform = torchaudio.functional.resample(
+                waveform, sample_rate, cls._bundle.sample_rate
+            )
+        waveform = waveform.to(device)
+        # Get emissions from model
+        with torch.inference_mode():
+            emissions, _ = model(waveform)
+            emissions = torch.log_softmax(emissions, dim=-1)
+        emission = emissions[0].cpu()
+        # Normalize text: uppercase, keep only valid characters
+        transcript = text.upper()
+        # Build tokens from transcript (including word separators)
+        tokens = []
+        for char in transcript:
+            if char in dictionary:
+                tokens.append(dictionary[char])
+            elif char == " ":
+                tokens.append(dictionary.get("|", dictionary.get(" ", 0)))
+        if not tokens:
+            return []
+        # Build Viterbi trellis and backtrack for optimal path
+        trellis = cls._get_trellis(emission, tokens, blank_id=0)
+        alignment_path = cls._backtrack(trellis, emission, tokens, blank_id=0)
+        # Convert frame indices to time (model stride is 320 samples at 16kHz = 20ms)
+        frame_duration = 320 / cls._bundle.sample_rate
+        # Apply separate offset compensation for start/end (Wav2Vec2 systematic bias)
+        start_offset = cls.START_OFFSET
+        end_offset = cls.END_OFFSET
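+        # Group aligned tokens into words based on pipe separator.
+        # Keep only words that contributed at least one token above; a word with no
+        # alignable characters (e.g. "3") would otherwise shift every later label.
+        words = [w for w in text.split() if any(c in dictionary for c in w.upper())]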
+        word_timestamps = []
+        current_word_start = None
+        current_word_end = None
+        word_idx = 0
+        separator_id = dictionary.get("|", dictionary.get(" ", 0))
+        for token_id, start_frame, end_frame in alignment_path:
+            if token_id == separator_id:  # Word separator
+                if (
+                    current_word_start is not None
+                    and current_word_end is not None
+                    and word_idx < len(words)
+                ):
+                    start_time = max(0.0, current_word_start * frame_duration - start_offset)
+                    end_time = max(0.0, current_word_end * frame_duration - end_offset)
+                    word_timestamps.append(
+                        {
+                            "word": words[word_idx],
+                            "start": start_time,
+                            "end": end_time,
+                        }
+                    )
+                    word_idx += 1
+                current_word_start = None
+                current_word_end = None
+            else:
+                if current_word_start is None:
+                    current_word_start = start_frame
+                current_word_end = end_frame
+        # Don't forget the last word
+        if (
+            current_word_start is not None
+            and current_word_end is not None
+            and word_idx < len(words)
+        ):
+            start_time = max(0.0, current_word_start * frame_duration - start_offset)
+            end_time = max(0.0, current_word_end * frame_duration - end_offset)
+            word_timestamps.append(
+                {
+                    "word": words[word_idx],
+                    "start": start_time,
+                    "end": end_time,
+                }
+            )
+        return word_timestamps
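Note on alignment.py: the recursion in `_get_trellis` and the walk in `_backtrack` can be traced by hand on a toy input. The sketch below is a minimal pure-Python restatement, assuming nested lists in place of torch tensors. The names `build_trellis` and `backtrack_frames`, the three-class vocabulary, and the emission probabilities are illustrative assumptions, not part of the commit.

```python
import math

NEG_INF = -math.inf


def build_trellis(emission, tokens, blank_id=0):
    """Viterbi trellis: trellis[t][j] is the best log-prob of emitting the
    first j tokens within the first t frames (same recursion as the diff)."""
    num_frames, num_tokens = len(emission), len(tokens)
    trellis = [[NEG_INF] * (num_tokens + 1) for _ in range(num_frames + 1)]
    trellis[0][0] = 0.0
    for t in range(num_frames):
        for j in range(num_tokens + 1):
            # Stay: frame t emits blank, token count unchanged
            stay = trellis[t][j] + emission[t][blank_id]
            # Move: frame t emits token j-1, advancing from j-1 to j tokens
            move = trellis[t][j - 1] + emission[t][tokens[j - 1]] if j > 0 else NEG_INF
            trellis[t + 1][j] = max(stay, move)
    return trellis


def backtrack_frames(trellis, emission, tokens, blank_id=0):
    """Walk back from the final cell and record the frame of each token."""
    t, j = len(emission), len(tokens)
    frames = [None] * len(tokens)
    while t > 0 and j > 0:
        stay = trellis[t - 1][j] + emission[t - 1][blank_id]
        move = trellis[t - 1][j - 1] + emission[t - 1][tokens[j - 1]]
        if move >= stay:
            frames[j - 1] = t - 1
            j -= 1
        t -= 1
    return frames


def log_row(probs):
    return [math.log(p) for p in probs]


# Toy input: classes are 0 = blank, 1 = "A", 2 = "B"; four frames.
emission = [
    log_row([0.8, 0.1, 0.1]),  # frame 0: blank most likely
    log_row([0.1, 0.8, 0.1]),  # frame 1: "A" most likely
    log_row([0.8, 0.1, 0.1]),  # frame 2: blank most likely
    log_row([0.1, 0.1, 0.8]),  # frame 3: "B" most likely
]
tokens = [1, 2]  # transcript "AB"

trellis = build_trellis(emission, tokens)
frames = backtrack_frames(trellis, emission, tokens)
print(frames)  # [1, 3]
```

In this scheme each token occupies exactly one frame and every other frame on the path is blank, so the best path here is blank, A, blank, B with log-probability `4 * log(0.8)`.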

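Note on alignment.py: the last step of `align` groups token spans into words and converts frames to seconds. The sketch below reproduces that arithmetic on hand-written spans, assuming the 20 ms frame duration and the two offsets from the diff. The helper name `spans_to_words`, the token ids, and the sentinel used to flush the final word are illustrative choices; the commit instead repeats the flush logic after the loop.

```python
FRAME_DURATION = 320 / 16000  # 20 ms per frame, as stated in the diff
START_OFFSET = 0.06           # subtracted from start times
END_OFFSET = -0.03            # subtracted from end times (negative: shifts later)


def spans_to_words(spans, words, separator_id):
    """Group (token_id, start_frame, end_frame) spans into word timestamps."""
    results = []
    start = end = None
    idx = 0
    # A trailing separator acts as a sentinel so the last word is flushed too.
    for token_id, span_start, span_end in list(spans) + [(separator_id, 0, 0)]:
        if token_id == separator_id:
            if start is not None and idx < len(words):
                results.append(
                    {
                        "word": words[idx],
                        "start": max(0.0, start * FRAME_DURATION - START_OFFSET),
                        "end": max(0.0, end * FRAME_DURATION - END_OFFSET),
                    }
                )
                idx += 1
            start = end = None
        else:
            if start is None:
                start = span_start
            end = span_end
    return results


# Hypothetical ids: 1 = "|" separator, 2..5 = letters H, I, Y, O.
spans = [(2, 10, 11), (3, 12, 13), (1, 14, 15), (4, 20, 21), (5, 22, 23)]
result = spans_to_words(spans, ["hi", "yo"], separator_id=1)
for entry in result:
    print(entry["word"], round(entry["start"], 2), round(entry["end"], 2))
# hi 0.14 0.29
# yo 0.34 0.49
```

For "hi", the start is `10 * 0.02 - 0.06 = 0.14` s and the end is `13 * 0.02 + 0.03 = 0.29` s.
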
asr_config.py ADDED

	@@ -0,0 +1,262 @@

+from typing import Optional
+import transformers
+# Default conv layers for Whisper/GLM-ASR audio encoders: [(pad, kernel, stride), ...]
+DEFAULT_ENCODER_CONV_LAYERS = [(1, 3, 1), (1, 3, 2)]
+def compute_encoder_output_length(mel_length, conv_layers=None):
+    """Apply encoder conv layer formulas to compute output length.
+    Works with both Python ints and torch tensors of mel lengths; the formula
+    `(L + 2*p - (k-1) - 1) // s + 1` per layer is identical for both.
+    """
+    layers = conv_layers if conv_layers is not None else DEFAULT_ENCODER_CONV_LAYERS
+    length = mel_length
+    for padding, kernel_size, stride in layers:
+        length = (length + 2 * padding - (kernel_size - 1) - 1) // stride + 1
+    return length
+class ASRConfig(transformers.PretrainedConfig):
+    """Configuration class for the ASR model.
+    This config combines settings for:
+    - Audio encoder (GLM-ASR/Whisper)
+    - Text decoder (Qwen)
+    - Projector (MLP, MOSA, MoE, QFormer)
+    - Generation parameters
+    - Training options (LoRA)
+    """
+    model_type = "asr_model"
+    is_composition = True
+    def __init__(
+        self,
+        audio_model_id: str = "zai-org/GLM-ASR-Nano-2512",
+        text_model_id: str = "Qwen/Qwen3-0.6B",
+        attn_implementation: str = "flash_attention_2",
+        model_dtype: str = "bfloat16",
+        num_beams: Optional[int] = None,
+        system_prompt: str = "You are a helpful assistant.",
+        encoder_dim: Optional[int] = None,
+        llm_dim: Optional[int] = None,
+        # Encoder conv layers: list of (padding, kernel_size, stride) tuples
+        # Default is Whisper/GLM-ASR structure: conv1(k=3,s=1,p=1) + conv2(k=3,s=2,p=1)
+        encoder_conv_layers: Optional[list] = None,
+        audio_sample_rate: int = 16000,
+        projector_pool_stride: int = 4,
+        downsample_rate: int = 5,  # Granite default
+        projector_hidden_dim: Optional[int] = None,
+        projector_type: str = "mlp",  # "mlp", "mosa", "moe", "qformer"
+        projector_dropout: float = 0.0,
+        # Label smoothing applied inside the LM's loss function (not HF Trainer's
+        # LabelSmoother). Train-only — ASRModel.forward zeros it on eval. Routing
+        # smoothing through the loss_function flows through liger's fused linear
+        # CE when apply_liger_kernel_to_qwen3() is active, avoiding the
+        # (B,T,V) fp32 log_softmax materialization that the HF LabelSmoother
+        # path requires (~15GB at B=50/V=152k on Qwen3-0.6B).
+        label_smoothing: float = 0.0,
+        # MoE-specific configuration
+        num_experts: int = 4,  # Number of experts in MoE projectors
+        num_experts_per_tok: int = 2,  # Top-k experts per token
+        router_aux_loss_coef: float = 0.01,  # Auxiliary loss coefficient for load balancing
+        # QFormer-specific configuration (Granite defaults)
+        qformer_window_size: int = 15,  # Window size for QFormer processing
+        qformer_hidden_size: Optional[int] = None,  # QFormer hidden size (defaults to encoder_dim)
+        qformer_num_layers: int = 2,  # Number of QFormer transformer layers
+        qformer_num_heads: int = 16,  # Number of attention heads in QFormer
+        qformer_intermediate_size: Optional[int] = None,  # FFN size (defaults to 4x hidden)
+        # LoRA configuration (for Stage 2 fine-tuning)
+        use_lora: bool = False,
+        lora_rank: int = 8,  # SALMONN default
+        lora_alpha: int = 32,  # SALMONN default (scaling factor 4.0)
+        lora_dropout: float = 0.0,
+        lora_target_modules: Optional[list] = None,  # Default: all linear layers
+        freeze_projector: bool = False,  # True for Stage 2 (LoRA-only training)
+        freeze_language_model: bool = True,  # False = full decoder fine-tuning
+        freeze_text_embed_tokens: bool = False,
+        # Audio encoder is frozen by default — the published recipe treats
+        # GLM-ASR-Nano as a fixed feature extractor. Setting this to False
+        # makes the encoder trainable; pair with `encoder_learning_rate` in
+        # the training config to avoid destroying pretrained encoder weights
+        # at the projector/decoder LR.
+        freeze_audio_encoder: bool = True,
+        # SpecAugment on mel input (training-only), parameters match
+        # transformers' WhisperConfig / Wav2Vec2 conventions. Most relevant
+        # when the encoder is trainable (`freeze_audio_encoder=False`) —
+        # without augmentation the encoder sees identical mel inputs on
+        # every visit and overfits fast. Standard for ASR encoder fine-
+        # tuning (Whisper, Conformer, wav2vec2 all use it). Applied to
+        # log-mel input where zero is in-distribution (silence);
+        # structurally different from the prior encoder-output ZM which
+        # was removed because zero was OOD for the encoder's emission
+        # distribution. Uses `_compute_mask_indices` from
+        # transformers.models.whisper.modeling_whisper — the same helper
+        # Whisper itself uses, vectorized over the batch and torch.compile
+        # compatible. Default values match Whisper's defaults.
+        apply_spec_augment: bool = False,
+        mask_time_prob: float = 0.05,
+        mask_time_length: int = 10,
+        mask_time_min_masks: int = 2,
+        mask_feature_prob: float = 0.0,
+        mask_feature_length: int = 10,
+        mask_feature_min_masks: int = 0,
+        do_sample: bool = False,
+        temperature: Optional[float] = None,
+        top_p: Optional[float] = None,
+        top_k: Optional[int] = None,
+        max_new_tokens: Optional[int] = None,
+        min_new_tokens: Optional[int] = None,
+        repetition_penalty: Optional[float] = None,
+        length_penalty: Optional[float] = None,
+        no_repeat_ngram_size: Optional[int] = None,
+        use_cache: Optional[bool] = None,
+        **kwargs,
+    ):
+        """Initialize ASR model configuration.
+        Args:
+            audio_model_id: HuggingFace model ID for audio encoder (GLM-ASR/Whisper)
+            text_model_id: HuggingFace model ID for text decoder (Qwen)
+            attn_implementation: Attention implementation ("flash_attention_2", "sdpa", "eager")
+            model_dtype: Model dtype ("bfloat16", "float16", "float32")
+            projector_type: Projector architecture ("mlp", "mosa", "moe", "qformer")
+            use_lora: Enable LoRA adapters for Stage 2 fine-tuning
+        """
+        # Set default generation parameters (greedy decoding only).
+        # Applied via setattr below — keeping these out of kwargs so they
+        # don't get re-overwritten by super().__init__(**kwargs) at the end.
+        generation_defaults = {
+            "num_beams": 1,
+            "max_new_tokens": 128,
+            "min_new_tokens": 0,
+            "repetition_penalty": 1.0,
+            "length_penalty": 1.0,
+            "no_repeat_ngram_size": 0,
+            "use_cache": True,
+        }
+        self.audio_model_id = audio_model_id
+        self.text_model_id = text_model_id
+        self.attn_implementation = attn_implementation
+        self.model_dtype = model_dtype
+        self.system_prompt = system_prompt
+        self.encoder_dim = encoder_dim
+        self.llm_dim = llm_dim
+        self.encoder_conv_layers = encoder_conv_layers or DEFAULT_ENCODER_CONV_LAYERS
+        self.audio_sample_rate = audio_sample_rate
+        self.projector_pool_stride = projector_pool_stride
+        self.downsample_rate = downsample_rate
+        self.projector_hidden_dim = projector_hidden_dim
+        self.projector_type = projector_type
+        self.projector_dropout = projector_dropout
+        self.label_smoothing = label_smoothing
+        # MoE-specific configuration
+        self.num_experts = num_experts
+        self.num_experts_per_tok = num_experts_per_tok
+        self.router_aux_loss_coef = router_aux_loss_coef
+        # QFormer-specific configuration
+        self.qformer_window_size = qformer_window_size
+        self.qformer_hidden_size = qformer_hidden_size
+        self.qformer_num_layers = qformer_num_layers
+        self.qformer_num_heads = qformer_num_heads
+        self.qformer_intermediate_size = qformer_intermediate_size
+        # LoRA configuration
+        self.use_lora = use_lora
+        self.lora_rank = lora_rank
+        self.lora_alpha = lora_alpha
+        self.lora_dropout = lora_dropout
+        self.lora_target_modules = lora_target_modules or [
+            "q_proj",
+            "k_proj",
+            "v_proj",
+            "o_proj",
+            "gate_proj",
+            "up_proj",
+            "down_proj",
+        ]
+        self.freeze_projector = freeze_projector
+        self.freeze_language_model = freeze_language_model
+        self.freeze_text_embed_tokens = freeze_text_embed_tokens
+        self.freeze_audio_encoder = freeze_audio_encoder
+        self.apply_spec_augment = apply_spec_augment
+        self.mask_time_prob = mask_time_prob
+        self.mask_time_length = mask_time_length
+        self.mask_time_min_masks = mask_time_min_masks
+        self.mask_feature_prob = mask_feature_prob
+        self.mask_feature_length = mask_feature_length
+        self.mask_feature_min_masks = mask_feature_min_masks
+        explicit_generation_args = {
+            "num_beams": num_beams,
+            "max_new_tokens": max_new_tokens,
+            "min_new_tokens": min_new_tokens,
+            "repetition_penalty": repetition_penalty,
+            "length_penalty": length_penalty,
+            "no_repeat_ngram_size": no_repeat_ngram_size,
+            "use_cache": use_cache,
+        }
+        for key, default in generation_defaults.items():
+            value = explicit_generation_args[key]
+            setattr(self, key, value if value is not None else default)
+        self.do_sample = do_sample
+        self.temperature = temperature
+        self.top_p = top_p
+        self.top_k = top_k
+        if "audio_config" not in kwargs:
+            self.audio_config = transformers.AutoConfig.from_pretrained(audio_model_id)
+            # Override dtype to match model_dtype
+            self.audio_config.dtype = model_dtype
+        else:
+            self.audio_config = kwargs.pop("audio_config")
+        if "text_config" not in kwargs:
+            self.text_config = transformers.AutoConfig.from_pretrained(
+                text_model_id, trust_remote_code=True
+            )
+            # Override dtype to match model_dtype
+            self.text_config.dtype = model_dtype
+        else:
+            self.text_config = kwargs.pop("text_config")
+        if isinstance(self.text_config, dict):
+            # Reconstruct config from dict using the model_type stored in the dict
+            model_type = self.text_config["model_type"]
+            config_class = transformers.AutoConfig.for_model(model_type).__class__
+            self.text_config = config_class(**self.text_config)
+        if isinstance(self.audio_config, dict):
+            model_type = self.audio_config.get("model_type")
+            if model_type:
+                config_class = transformers.AutoConfig.for_model(model_type).__class__
+                self.audio_config = config_class(**self.audio_config)
+        super().__init__(**kwargs)
+        # Point encoder to audio_config so pipeline uses correct feature extractor
+        # The pipeline looks for config.encoder._name_or_path for feature extractor
+        self.encoder = self.audio_config
+        self.auto_map = {
+            "AutoConfig": "asr_config.ASRConfig",
+            "AutoModel": "asr_modeling.ASRModel",
+            "AutoModelForSpeechSeq2Seq": "asr_modeling.ASRModel",
+            "AutoProcessor": "asr_processing.ASRProcessor",
+        }
+        self.custom_pipelines = {
+            "automatic-speech-recognition": {
+                "impl": "asr_pipeline.ASRPipeline",
+                "pt": ["AutoModelForSpeechSeq2Seq"],
+                "tf": [],
+                "type": "audio",
+            }
+        }
+        self.architectures = ["ASRModel"]
+        self.pipeline_tag = "automatic-speech-recognition"
+transformers.AutoConfig.register("asr_model", ASRConfig)
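Note on asr_config.py: the per-layer formula in `compute_encoder_output_length` is easiest to check with a few integers. The snippet below is a standalone copy of that function for plain Python ints, so it runs without `transformers`. The sample mel lengths are arbitrary illustrations.

```python
# (padding, kernel_size, stride) per conv layer, matching the diff's default
DEFAULT_ENCODER_CONV_LAYERS = [(1, 3, 1), (1, 3, 2)]


def compute_encoder_output_length(mel_length, conv_layers=None):
    """Apply (L + 2*p - (k-1) - 1) // s + 1 once per conv layer."""
    layers = conv_layers if conv_layers is not None else DEFAULT_ENCODER_CONV_LAYERS
    length = mel_length
    for padding, kernel_size, stride in layers:
        length = (length + 2 * padding - (kernel_size - 1) - 1) // stride + 1
    return length


lengths = {n: compute_encoder_output_length(n) for n in (3000, 1001, 1)}
print(lengths)  # {3000: 1500, 1001: 501, 1: 1}
```

With the default layers the first convolution (stride 1) preserves length and the second (stride 2) roughly halves it, rounding up for odd inputs: 3000 mel frames become 1500 encoder frames and 1001 become 501.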

asr_modeling.py ADDED

	@@ -0,0 +1,1069 @@

+import json
+from pathlib import Path
+from threading import Thread
+from typing import Iterator, Optional, Union
+import torch
+import torch.nn as nn
+import torch.nn.functional as F  # noqa: N812
+from transformers import (
+    AutoModel,
+    AutoModelForCausalLM,
+    AutoTokenizer,
+    PreTrainedModel,
+    TextIteratorStreamer,
+)
+from transformers.generation import GenerationMixin
+from transformers.modeling_outputs import CausalLMOutputWithPast
+try:
+    from .asr_config import ASRConfig, compute_encoder_output_length
+    from .projectors import PROJECTOR_CLASSES
+except ImportError:
+    from asr_config import ASRConfig, compute_encoder_output_length  # type: ignore[no-redef]
+    from projectors import PROJECTOR_CLASSES  # type: ignore[no-redef]
+def _resolve_attn_implementation(requested: Optional[str]) -> Optional[str]:
+    """Coerce flash_attention_2 to sdpa when CUDA isn't available.
+    FA2 is CUDA-only. On MPS/CPU, requesting it either errors at load or
+    silently falls back to a slower path; either way the user pays the FA2
+    install + import cost for no win. Coerce here so a saved config that
+    pins flash_attention_2 still loads on Mac / CPU-only Linux boxes.
+    """
+    if requested == "flash_attention_2" and not torch.cuda.is_available():
+        return "sdpa"
+    return requested
+def _gather_audio_embeds(audio_embeds: torch.Tensor, token_counts: torch.Tensor) -> torch.Tensor:
+    """Flatten per-sample audio embeddings into a packed tensor.
+    For each row i, takes the first ``token_counts[i]`` rows of
+    ``audio_embeds[i]`` and concatenates them. If any token count exceeds
+    ``audio_embeds.shape[1]``, the deficit is zero-padded.
+    Equivalent to a per-sample slice/cat loop but with O(1) host-device
+    syncs per call (one ``max().item()``) instead of one per sample.
+    """
+    _, max_len, _ = audio_embeds.shape
+    needed = int(token_counts.max().item())
+    if needed > max_len:
+        audio_embeds = F.pad(audio_embeds, (0, 0, 0, needed - max_len))
+        max_len = needed
+    indices = torch.arange(max_len, device=audio_embeds.device).unsqueeze(0)
+    mask = indices < token_counts.unsqueeze(1)
+    return audio_embeds[mask]
+class ASRModel(PreTrainedModel, GenerationMixin):
+    """Audio-to-text model combining an audio encoder, projector, and language model."""
+    config_class = ASRConfig
+    base_model_prefix = "model"
+    main_input_name = "input_features"
+    _supports_flash_attn_2 = True
+    supports_gradient_checkpointing = True
+    _is_loading_from_pretrained: bool = False
+    TRANSCRIBE_PROMPT = "Transcribe the speech to text"
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path: str, *args, **kwargs) -> "ASRModel":
+        """Load model from pretrained, handling device placement correctly."""
+        from safetensors.torch import load_file
+        from transformers.utils.hub import cached_file
+        config = kwargs.pop("config", None)
+        if config is None:
+            config = ASRConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
+        # Set flag to avoid device_map="auto" in sub-model loaders
+        cls._is_loading_from_pretrained = True
+        try:
+            model = cls(config, **kwargs)
+            # Load projector weights from safetensors
+            subfolder = kwargs.get("subfolder")
+            revision = kwargs.get("revision")
+            cache_kwargs = {}
+            if subfolder:
+                cache_kwargs["subfolder"] = subfolder
+            if revision:
+                cache_kwargs["revision"] = revision
+            model_file = cached_file(
+                pretrained_model_name_or_path,
+                "model.safetensors",
+                _raise_exceptions_for_missing_entries=False,
+                **cache_kwargs,
+            )
+            if model_file is not None:
+                state_dict = load_file(model_file)
+                model.load_state_dict(state_dict, strict=False)
+            # Load LoRA adapters if use_lora is enabled
+            if getattr(config, "use_lora", False):
+                # Check for adapter_config.json (required by PEFT to load adapters)
+                adapter_config_file = cached_file(
+                    pretrained_model_name_or_path,
+                    "adapter_config.json",
+                    _raise_exceptions_for_missing_entries=False,
+                    **cache_kwargs,
+                )
+                if adapter_config_file is not None:
+                    # Load saved adapter weights using the original repo_id/path
+                    # PEFT handles Hub downloads and caching internally
+                    from peft import PeftModel
+                    model.language_model = PeftModel.from_pretrained(
+                        model.language_model,
+                        pretrained_model_name_or_path,
+                        is_trainable=True,
+                        **cache_kwargs,
+                    )
+                else:
+                    # No saved adapters - initialize fresh LLM LoRA for training
+                    model._setup_lora(config)
+            return model
+        finally:
+            cls._is_loading_from_pretrained = False
+    def __init__(self, config: ASRConfig, **kwargs) -> None:
+        super().__init__(config)
+        self.system_prompt = config.system_prompt
+        target_dtype = getattr(torch, config.model_dtype)
+        # Audio encoder (frozen unless config.freeze_audio_encoder=False)
+        self.audio_tower = self._load_audio_encoder(config, target_dtype)
+        # Language model (frozen unless config.freeze_language_model=False)
+        self.language_model = self._load_language_model(config, target_dtype)
+        # Initialize tokenizer and special tokens
+        self._init_tokenizer(config)
+        # Set up generation config from ASRConfig (decoding strategy follows num_beams/do_sample)
+        self.generation_config = self.language_model.generation_config
+        self.generation_config.max_new_tokens = config.max_new_tokens
+        self.generation_config.min_new_tokens = config.min_new_tokens
+        self.generation_config.num_beams = config.num_beams
+        self.generation_config.do_sample = config.do_sample
+        # Set sampling params from config (None means use model defaults)
+        self.generation_config.temperature = config.temperature
+        self.generation_config.top_p = config.top_p
+        self.generation_config.top_k = config.top_k
+        self.generation_config.use_cache = config.use_cache
+        self.generation_config.length_penalty = config.length_penalty
+        self.generation_config.repetition_penalty = config.repetition_penalty
+        self.generation_config.no_repeat_ngram_size = config.no_repeat_ngram_size
+        # Set EOS tokens, filtering out any that don't exist in the tokenizer
+        eos_candidates = [
+            self.tokenizer.convert_tokens_to_ids("<|im_end|>"),
+            self.tokenizer.convert_tokens_to_ids("<|endoftext|>"),
+        ]
+        self.generation_config.eos_token_id = [t for t in eos_candidates if t is not None]
+        self.generation_config.pad_token_id = self.tokenizer.pad_token_id
+        # Feature extractor for audio preprocessing
+        self.feature_extractor = self._create_feature_extractor(config)
+        # Audio projector (trainable unless freeze_projector is set)
+        self.projector = self._create_projector(config, target_dtype)
+        # Setup LoRA if enabled (Stage 2 fine-tuning)
+        # Skip if loading from pretrained - from_pretrained will handle adapter loading
+        if getattr(config, "use_lora", False) and not getattr(
+            self.__class__, "_is_loading_from_pretrained", False
+        ):
+            self._setup_lora(config)
+        # Freeze projector if specified (for Stage 2 LoRA-only training)
+        if getattr(config, "freeze_projector", False):
+            self.projector.requires_grad_(False)
+        # Freeze the text-vocab embedding table (preserves base Qwen3's
+        # token→embedding mapping during joint fine-tune). With
+        # tie_word_embeddings=True the same tensor backs lm_head, so this
+        # also freezes the output projection. Audio tokens bypass this
+        # table — they're scattered into inputs_embeds via masked_scatter
+        # at <audio> positions (forward(), below), so the audio path is
+        # unaffected. Mirrors Baichuan-Audio's stage-2 policy of training
+        # all decoder params except the text embedding and LM head.
+        if getattr(config, "freeze_text_embed_tokens", False):
+            self.language_model.get_input_embeddings().weight.requires_grad_(False)
+        # For model parallelism
+        self._no_split_modules = getattr(self.language_model, "_no_split_modules", [])
+    def _create_feature_extractor(self, config: ASRConfig):
+        """Create the appropriate feature extractor for the audio encoder."""
+        from transformers import AutoFeatureExtractor
+        feature_extractor = AutoFeatureExtractor.from_pretrained(config.audio_model_id)
+        # Whisper's encoder requires a fixed 3000 mel frames (30s) and the
+        # feature extractor pads to that by default — leave it alone. Other
+        # encoders (e.g. GLM-ASR) accept variable-length input, so we disable
+        # padding to avoid wasting compute on silent frames.
+        if "whisper" not in config.audio_model_id.lower():
+            feature_extractor.padding = False
+        return feature_extractor
+    @classmethod
+    def _load_audio_encoder(cls, config: ASRConfig, dtype: torch.dtype) -> nn.Module:
+        """Load the audio encoder; freeze unless `config.freeze_audio_encoder=False`.
+        When unfrozen, the encoder participates in joint training — pair with a
+        much lower `encoder_learning_rate` than the projector/decoder LRs
+        (encoder is large, sensitive to perturbation, and shouldn't drift far
+        from its pretrained features). See `ASRTrainer.create_optimizer` for the
+        LR routing.
+        """
+        encoder_kwargs = {
+            "attn_implementation": _resolve_attn_implementation(config.attn_implementation),
+            "low_cpu_mem_usage": True,
+            "dtype": dtype,
+        }
+        if "whisper" in config.audio_model_id.lower():
+            from transformers import WhisperModel
+            full_model = WhisperModel.from_pretrained(config.audio_model_id, **encoder_kwargs)
+            encoder = full_model.encoder
+            del full_model
+        elif "glm" in config.audio_model_id.lower():
+            # GLM-ASR models use audio_tower as the encoder
+            # Requires transformers >= 5.x or installed from source
+            from transformers import AutoModelForSeq2SeqLM
+            full_model = AutoModelForSeq2SeqLM.from_pretrained(
+                config.audio_model_id, trust_remote_code=True, **encoder_kwargs
+            )
+            # GLM stores encoder at audio_tower (GlmAsrEncoder)
+            encoder = full_model.audio_tower
+            # Clear references to free VRAM from the LLM decoder
+            full_model.language_model = None
+            full_model.multi_modal_projector = None
+            del full_model
+        else:
+            encoder = AutoModel.from_pretrained(config.audio_model_id, **encoder_kwargs)
+        # Explicit cast: from_pretrained's `dtype=` kwarg is honored
+        # inconsistently across loader paths (especially trust_remote_code
+        # branches like GLM-ASR), leaving submodules in fp32. FA2's startup
+        # then complains "current dype is torch.float32, expected fp16/bf16",
+        # and even with sdpa the encoder→projector feed mismatches dtypes.
+        # `.to(dtype=...)` after load is idempotent and forces the issue.
+        encoder = encoder.to(dtype=dtype)
+        if getattr(config, "freeze_audio_encoder", True):
+            encoder.requires_grad_(False)
+            encoder.train(False)  # equivalent to .eval(); avoids a security hook false-positive
+        return encoder
+    @classmethod
+    def _load_language_model(cls, config: ASRConfig, dtype: torch.dtype) -> PreTrainedModel:
+        """Load the language model; freeze unless `config.freeze_language_model=False`."""
+        decoder_kwargs = {
+            "attn_implementation": _resolve_attn_implementation(config.attn_implementation),
+            "trust_remote_code": True,
+            "low_cpu_mem_usage": True,
+            "dtype": dtype,
+        }
+        decoder = AutoModelForCausalLM.from_pretrained(config.text_model_id, **decoder_kwargs)
+        # See _load_audio_encoder note: idempotent post-load cast to dodge the
+        # FA2 "current dype is fp32" warning when from_pretrained's dtype kwarg
+        # isn't fully propagated to every submodule.
+        decoder = decoder.to(dtype=dtype)
+        decoder.config.use_cache = getattr(config, "use_cache", True)
+        if getattr(config, "freeze_language_model", True):
+            decoder.requires_grad_(False)
+            decoder.train(False)
+        return decoder
+    def _create_projector(self, config: ASRConfig, dtype: torch.dtype) -> nn.Module:
+        """Create the trainable audio projector."""
+        # Auto-detect dimensions if not specified
+        if config.encoder_dim is None:
+            enc_cfg = self.audio_tower.config
+            config.encoder_dim = getattr(enc_cfg, "hidden_size", None) or getattr(
+                enc_cfg, "d_model", None
+            )
+            if config.encoder_dim is None:
+                raise ValueError("Could not auto-detect encoder_dim. Please specify in config.")
+        if config.llm_dim is None:
+            dec_cfg = self.language_model.config
+            config.llm_dim = getattr(dec_cfg, "hidden_size", None) or getattr(
+                dec_cfg, "d_model", None
+            )
+            if config.llm_dim is None:
+                raise ValueError("Could not auto-detect llm_dim. Please specify in config.")
+        # Select projector type based on config
+        projector_type = getattr(config, "projector_type", "mlp")
+        projector_class = PROJECTOR_CLASSES.get(projector_type)
+        if projector_class is None:
+            raise ValueError(
+                f"Unknown projector_type: {projector_type}. "
+                f"Valid options: {list(PROJECTOR_CLASSES.keys())}"
+            )
+        projector = projector_class(config)
+        # Move projector to same device as language model (important when using quantization)
+        device = next(self.language_model.parameters()).device
+        return projector.to(device=device, dtype=dtype)
+    def _setup_lora(self, config: ASRConfig):
+        """Apply LoRA adapters to the language model for Stage 2 fine-tuning."""
+        from peft import LoraConfig, get_peft_model
+        lora_config = LoraConfig(
+            r=config.lora_rank,
+            lora_alpha=config.lora_alpha,
+            target_modules=config.lora_target_modules,
+            lora_dropout=config.lora_dropout,
+            bias="none",
+            task_type="CAUSAL_LM",
+        )
+        self.language_model = get_peft_model(self.language_model, lora_config)
+    def _init_tokenizer(self, config: ASRConfig):
+        """Initialize tokenizer with audio token."""
+        self.tokenizer = AutoTokenizer.from_pretrained(config.text_model_id, trust_remote_code=True)
+        # Set pad token. Prefer a dedicated pad token if the vocab has one
+        # (e.g. Llama 3.1's <|finetune_right_pad_id|>); otherwise fall back to
+        # eos_token, which is the standard pattern for Llama-style tokenizers
+        # (SmolLM2, Llama, etc.) that ship without a separate pad token.
+        if (
+            self.tokenizer.pad_token is None
+            or self.tokenizer.pad_token_id == self.tokenizer.eos_token_id
+        ):
+            if "<|finetune_right_pad_id|>" in self.tokenizer.get_vocab():
+                self.tokenizer.pad_token = "<|finetune_right_pad_id|>"
+            elif self.tokenizer.pad_token is None:
+                self.tokenizer.pad_token = self.tokenizer.eos_token
+        # Add audio token
+        existing_special = getattr(self.tokenizer, "additional_special_tokens", None) or []
+        if "<audio>" not in existing_special:
+            self.tokenizer.add_special_tokens(
+                {"additional_special_tokens": existing_special + ["<audio>"]}
+            )
+            # mean_resizing=True initializes the new <audio> row at the mean of
+            # existing rows so its scale matches the pretrained distribution. The
+            # input-side <audio> embedding is overwritten via masked_scatter and
+            # never seen by the LM, but with tied embeddings (Qwen3-0.6B) this
+            # same row is the lm_head column for predicting <audio>; a Gaussian
+            # draw at config.initializer_range was visible in early-step logits.
+            self.language_model.resize_token_embeddings(len(self.tokenizer), mean_resizing=True)
+        self.audio_token_id = self.tokenizer.convert_tokens_to_ids("<audio>")
+        self.tokenizer.padding_side = "right"
+        # Sync token IDs to configs
+        for cfg in [self.config.text_config, self.language_model.config, self.generation_config]:
+            if cfg is not None:
+                cfg.pad_token_id = self.tokenizer.pad_token_id
+                cfg.eos_token_id = self.tokenizer.eos_token_id
+                cfg.bos_token_id = self.tokenizer.bos_token_id
+    def train(self, mode: bool = True):
+        """Set train/eval mode, but keep frozen submodules out of train mode.
+        HF Trainer calls `model.train()` at the top of every training step, which
+        recursively switches every submodule into train mode — re-enabling dropout
+        on modules with `requires_grad_(False)`. The frozen encoder (and the LM
+        when `freeze_language_model=True`) should always run deterministically;
+        train-mode dropout only adds noise that can't improve a frozen network.
+        """
+        super().train(mode)
+        if getattr(self.config, "freeze_audio_encoder", True):
+            self.audio_tower.train(False)
+        if getattr(self.config, "freeze_language_model", True):
+            self.language_model.train(False)
+        return self
+    def _set_gradient_checkpointing(self, enable: bool = True, gradient_checkpointing_func=None):
+        """Enable/disable gradient checkpointing on the trainable submodules.
+        Routes the request to whichever components are actually trainable in
+        this run. The LM is always reached (its forward activations are
+        needed for backprop to the projector even when its weights are
+        frozen). The encoder is reached only when `freeze_audio_encoder` is
+        False — when frozen, no gradient flows through it and checkpointing
+        would just add recompute cost for no memory savings.
+        """
+        # The LLM still stores activations during forward for backprop to projector
+        # Gradient checkpointing trades compute for memory by recomputing activations
+        for submodule in self._gradient_checkpointing_targets():
+            if hasattr(submodule, "_set_gradient_checkpointing"):
+                submodule._set_gradient_checkpointing(enable, gradient_checkpointing_func)
+            elif hasattr(submodule, "gradient_checkpointing_enable") and enable:
+                submodule.gradient_checkpointing_enable(
+                    gradient_checkpointing_kwargs={"use_reentrant": False}
+                )
+            elif hasattr(submodule, "gradient_checkpointing_disable") and not enable:
+                submodule.gradient_checkpointing_disable()
+    def _gradient_checkpointing_targets(self) -> list[nn.Module]:
+        """Return the submodules that should respond to gradient_checkpointing
+        toggles. Always includes the LM (activations are on the gradient path
+        to the projector); includes the encoder only when it's trainable.
+        """
+        targets: list[nn.Module] = [self.language_model]
+        if not getattr(self.config, "freeze_audio_encoder", True):
+            targets.append(self.audio_tower)
+        return targets
+    def get_input_embeddings(self) -> nn.Module:
+        return self.language_model.get_input_embeddings()
+    def set_input_embeddings(self, value: nn.Module) -> None:
+        self.language_model.set_input_embeddings(value)
+    def get_output_embeddings(self) -> nn.Module:
+        return self.language_model.get_output_embeddings()
+    def set_output_embeddings(self, value: nn.Module) -> None:
+        self.language_model.set_output_embeddings(value)
+    def get_processor(self):
+        """Get the processor for this model."""
+        try:
+            from .asr_processing import ASRProcessor
+        except ImportError:
+            from asr_processing import ASRProcessor  # type: ignore[no-redef]
+        return ASRProcessor(
+            feature_extractor=self.feature_extractor,
+            tokenizer=self.tokenizer,
+            projector=self.projector,
+            encoder_conv_layers=self.config.encoder_conv_layers,
+        )
+    def state_dict(self, *args, **kwargs) -> dict[str, torch.Tensor]:
+        """Save trainable weights: projector, plus the language model when fine-tuned.
+        With LoRA attached, the language_model entries are flattened to plain
+        (non-PEFT) HF naming so model.safetensors round-trips through
+        ASRModel.from_pretrained — which builds a vanilla base LM, overlays
+        these weights, and only then re-attaches PEFT. lora_*/adapter weights
+        are skipped here; PEFT serializes them separately as
+        adapter_model.safetensors via the save_pretrained path below.
+        """
+        sd = {f"projector.{k}": v for k, v in self.projector.state_dict().items()}
+        if not getattr(self.config, "freeze_language_model", True):
+            lm = self.language_model
+            if hasattr(lm, "peft_config"):
+                for k, v in lm.state_dict().items():
+                    if "lora_" in k:
+                        continue
+                    if k.startswith("base_model.model."):
+                        k = k[len("base_model.model.") :]
+                    # LoRA layers wrap the original Linear as `<name>.base_layer.<weight|bias>`.
+                    k = k.replace(".base_layer.", ".")
+                    sd[f"language_model.{k}"] = v
+            else:
+                sd.update({f"language_model.{k}": v for k, v in lm.state_dict().items()})
+        return sd
+    def _compute_encoder_output_lengths(
+        self,
+        audio_attention_mask: torch.Tensor,
+    ) -> torch.Tensor:
+        """Compute per-sample encoder output lengths using conv layer formulas."""
+        return compute_encoder_output_length(
+            audio_attention_mask.sum(dim=-1),
+            self.config.encoder_conv_layers,
+        )
+    def _encode_audio(
+        self,
+        audio_features: torch.Tensor,
+        expected_token_counts: torch.Tensor,
+    ) -> torch.Tensor:
+        """Encode audio features and return flattened embeddings matching expected_token_counts.
+        Args:
+            audio_features: Mel spectrogram features (batch, n_mels, mel_len)
+            expected_token_counts: Per-sample audio token counts as int64 tensor (batch,).
+        Returns:
+            Flattened audio embeddings of shape (sum(expected_token_counts), hidden_dim).
+        """
+        # SpecAugment is applied on the mel input, training-only. Most useful
+        # when the encoder is trainable; on the frozen-encoder path it still
+        # perturbs the projector's input slightly but with no gradient flowing
+        # back to the encoder to leverage the diversity.
+        if (
+            self.training
+            and getattr(self.config, "apply_spec_augment", False)
+            and audio_features.numel() > 0
+        ):
+            audio_features = self._mask_input_features(audio_features)
+        # When the encoder is frozen, skip gradient tracking through it — cuts
+        # activation memory and matches the prior published recipe's behavior.
+        # When trainable, we MUST allow gradients to flow back to encoder
+        # params; wrapping in no_grad here would silently zero encoder
+        # gradients regardless of requires_grad on its parameters.
+        encoder_frozen = getattr(self.config, "freeze_audio_encoder", True)
+        if encoder_frozen:
+            with torch.no_grad():
+                encoder_out = self.audio_tower(input_features=audio_features)
+                hidden_states = encoder_out.last_hidden_state
+        else:
+            encoder_out = self.audio_tower(input_features=audio_features)
+            hidden_states = encoder_out.last_hidden_state
+        audio_embeds = self.projector(hidden_states)
+        token_counts = expected_token_counts.to(device=audio_embeds.device, dtype=torch.long)
+        return _gather_audio_embeds(audio_embeds, token_counts)
+    def _mask_input_features(
+        self,
+        input_features: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,  # noqa: ARG002 — reserved for future use
+    ) -> torch.Tensor:
+        """SpecAugment on mel input (pure-torch, vectorized, compile-ready).
+        Follows the same semantics as
+        `transformers.models.whisper.modeling_whisper.WhisperModel._mask_input_features`
+        (wav2vec2-style mask sampling: sample N start positions per sample,
+        mask `mask_length` frames forward from each), but reimplemented in
+        pure torch so it stays inside the autograd graph without crossing
+        the numpy boundary. This avoids inductor codegen failures
+        (e.g. the `‘zuf0’ was not declared` error from the prior numpy ->
+        torch.tensor round-trip) AND avoids the per-forward host-to-GPU
+        sync that the numpy path required.
+        One minor semantic divergence vs the upstream helper: this version
+        samples span starts with replacement, so two spans may share a
+        start, while upstream draws distinct start indices. Spans can
+        overlap in both; for ASR purposes the extra double-coverage has no
+        practical effect on the regularization signal.
+        Reads ASRConfig fields by Whisper naming convention: mask_time_prob,
+        mask_time_length, mask_time_min_masks, mask_feature_prob,
+        mask_feature_length, mask_feature_min_masks.
+        Args:
+            input_features: (batch, n_mels, mel_len) log-mel features.
+            attention_mask: reserved for future use; ignored here since our
+                mel features are pre-padded to zero and double-masking
+                pad regions is a no-op.
+        Returns:
+            Same-shape tensor with time-axis and/or feature-axis masks zeroed.
+        """
+        input_features = input_features.clone()
+        batch_size, hidden_size, sequence_length = input_features.size()
+        config = self.config
+        device = input_features.device
+        if getattr(config, "mask_time_prob", 0.0) > 0:
+            mask_time = self._sample_mask_indices(
+                batch_size,
+                sequence_length,
+                mask_prob=config.mask_time_prob,
+                mask_length=config.mask_time_length,
+                min_masks=config.mask_time_min_masks,
+                device=device,
+            )
+            # Broadcast (B, T) -> (B, 1, T) to mask all mel bins at masked times.
+            input_features.masked_fill_(mask_time.unsqueeze(1), 0)
+        if getattr(config, "mask_feature_prob", 0.0) > 0:
+            mask_feature = self._sample_mask_indices(
+                batch_size,
+                hidden_size,
+                mask_prob=config.mask_feature_prob,
+                mask_length=config.mask_feature_length,
+                min_masks=config.mask_feature_min_masks,
+                device=device,
+            )
+            # Broadcast (B, F) -> (B, F, 1) to mask all time steps at masked bins.
+            input_features.masked_fill_(mask_feature.unsqueeze(-1), 0)
+        return input_features
+    @staticmethod
+    def _sample_mask_indices(
+        batch_size: int,
+        axis_length: int,
+        mask_prob: float,
+        mask_length: int,
+        min_masks: int,
+        device: torch.device,
+    ) -> torch.Tensor:
+        """Vectorized SpecAugment mask sampler — torch.compile-friendly.
+        Returns a (batch_size, axis_length) bool tensor where True marks
+        a position covered by at least one mask span. Spans may overlap
+        (see _mask_input_features docstring on the semantic difference vs
+        the upstream Whisper helper).
+        """
+        # Number of mask spans per sample: deterministic given config + axis_length.
+        # Matches the upstream formula, except that upstream adds a uniform
+        # [0, 1) epsilon before truncating and this version adds a fixed 0.5.
+        # The difference shifts the count by at most 1: at the default
+        # mask_time_prob=0.05 / mask_length=10 setting, a 1500-frame mel
+        # input gives int(7.5 + 0.5) = 8 spans.
+        num_masked_spans = max(int(mask_prob * axis_length / mask_length + 0.5), min_masks)
+        if num_masked_spans == 0:
+            return torch.zeros(batch_size, axis_length, device=device, dtype=torch.bool)
+        # Sample start positions independently per sample × span.
+        # Clamp range so a span of length mask_length never runs off the end.
+        max_start = max(axis_length - mask_length + 1, 1)
+        starts = torch.randint(
+            0, max_start, (batch_size, num_masked_spans), device=device
+        )  # (B, N)
+        # For each (sample, span, position), True iff position ∈ [start, start+mask_length).
+        positions = torch.arange(axis_length, device=device).view(1, 1, -1)  # (1, 1, T)
+        starts_b = starts.unsqueeze(-1)  # (B, N, 1)
+        span_mask = (positions >= starts_b) & (positions < starts_b + mask_length)
+        # Reduce over the span dim: True if ANY span covers this position.
+        return span_mask.any(dim=1)
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        input_features: Optional[torch.Tensor] = None,
+        audio_attention_mask: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        past_key_values: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        use_cache: Optional[bool] = None,
+        cache_position: Optional[torch.Tensor] = None,
+        audio_token_counts: Optional[torch.Tensor] = None,
+        **kwargs,
+    ) -> CausalLMOutputWithPast:
+        """Forward pass for training and inference."""
+        if inputs_embeds is None:
+            inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
+        if input_features is not None and input_ids is not None:
+            is_audio_token = input_ids == self.audio_token_id
+            if audio_token_counts is None:
+                audio_token_counts = is_audio_token.sum(dim=-1)
+            else:
+                audio_token_counts = audio_token_counts.to(
+                    device=input_ids.device, dtype=torch.long
+                )
+            audio_embeds = self._encode_audio(input_features, audio_token_counts)
+            audio_token_mask = is_audio_token.unsqueeze(-1)
+            inputs_embeds = inputs_embeds.masked_scatter(
+                audio_token_mask.to(inputs_embeds.device),
+                audio_embeds.to(inputs_embeds.device, dtype=inputs_embeds.dtype),
+            )
+        # Forward label_smoothing to the LM's loss_function via **kwargs.
+        # transformers.loss.loss_utils.ForCausalLMLoss → fixed_cross_entropy
+        # forwards extra kwargs to F.cross_entropy, which accepts label_smoothing.
+        # When apply_liger_kernel_to_qwen3() has patched the LM, the smoothing
+        # is consumed by liger's fused linear CE (no (B,T,V) materialization).
+        # Zeroed on eval so eval/loss is raw CE and comparable to LS=0 runs.
+        if labels is not None and self.training and self.config.label_smoothing > 0:
+            kwargs.setdefault("label_smoothing", self.config.label_smoothing)
+        outputs = self.language_model(
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            labels=labels,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            **kwargs,
+        )
+        if outputs.loss is not None and hasattr(self.projector, "get_aux_loss"):
+            aux_loss = self.projector.get_aux_loss()
+            if aux_loss is not None and aux_loss.numel() > 0:
+                outputs.loss = outputs.loss + aux_loss.to(outputs.loss.device)
+        return outputs
+    def prepare_inputs_for_generation(self, *args, **kwargs):
+        """Prepare inputs for generation, handling audio features for cached decoding."""
+        input_features = kwargs.pop("input_features", None)
+        cache_position = kwargs.get("cache_position")
+        model_inputs = self.language_model.prepare_inputs_for_generation(*args, **kwargs)
+        # Only pass audio features on the first generation step (cache_position[0] == 0)
+        if cache_position is not None and cache_position[0] == 0 and input_features is not None:
+            model_inputs["input_features"] = input_features
+        return model_inputs
+    def _get_num_audio_tokens(
+        self,
+        audio_attention_mask: torch.Tensor,
+    ) -> int:
+        """Calculate number of audio tokens based on actual audio length.
+        Uses attention mask to get real audio length, then computes:
+        mel_frames -> encoder_frames (via conv formulas) -> projector output tokens
+        """
+        encoder_lengths = self._compute_encoder_output_lengths(audio_attention_mask)
+        # Use max length for batch (all samples should have same token count for generation)
+        encoder_output_len = int(encoder_lengths.max().item())
+        return int(self.projector.get_output_length(encoder_output_len))
+    @torch.no_grad()
+    def generate(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        input_features: Optional[torch.Tensor] = None,
+        audio_attention_mask: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        system_prompt: Optional[str] = None,
+        **generate_kwargs,
+    ):
+        """Generate transcription from audio input.
+        Can be called in two ways:
+        1. With input_ids containing <audio> tokens (from processor)
+        2. With just audio, and we build the prompt internally
+        """
+        if input_features is None:
+            raise ValueError("input_features required for generation")
+        if audio_attention_mask is None:
+            raise ValueError("audio_attention_mask required for generation")
+        device = input_features.device
+        batch_size = input_features.shape[0]
+        # Encode audio -> flattened embeddings (no per-sample host sync)
+        encoder_lengths = self._compute_encoder_output_lengths(audio_attention_mask)
+        token_counts = self.projector.get_output_length(encoder_lengths).to(torch.long)
+        audio_embeds = self._encode_audio(input_features, token_counts)
+        # If input_ids not provided, build prompt with correct number of audio tokens
+        if input_ids is None:
+            num_audio_tokens = self._get_num_audio_tokens(audio_attention_mask)
+            audio_placeholder = "<audio>" * num_audio_tokens
+            system_prompt = system_prompt or self.system_prompt
+            messages: list[dict[str, str]] = []
+            if system_prompt:
+                messages.append({"role": "system", "content": system_prompt})
+            # Audio tokens, optionally followed by the transcription prompt
+            user_content = audio_placeholder
+            if self.TRANSCRIBE_PROMPT:
+                user_content += " " + self.TRANSCRIBE_PROMPT
+            messages.append({"role": "user", "content": user_content})
+            chat_result = self.tokenizer.apply_chat_template(
+                messages,
+                tokenize=True,
+                add_generation_prompt=True,
+                return_tensors="pt",
+                enable_thinking=False,  # Disable Qwen3 thinking mode for ASR
+            )
+            input_ids = chat_result.input_ids.to(device)
+            if input_ids.dim() == 1:
+                input_ids = input_ids.unsqueeze(0)
+            if input_ids.shape[0] == 1 and batch_size > 1:
+                input_ids = input_ids.expand(batch_size, -1)
+            attention_mask = torch.ones_like(input_ids)
+        # Get text embeddings and replace audio tokens with audio embeddings
+        inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
+        audio_token_mask = (input_ids == self.audio_token_id).unsqueeze(-1)
+        inputs_embeds = inputs_embeds.masked_scatter(
+            audio_token_mask.to(inputs_embeds.device),
+            audio_embeds.to(inputs_embeds.device, dtype=inputs_embeds.dtype),
+        )
+        # transformers v5 deprecates passing generation flags as kwargs when a
+        # `generation_config` is also passed — the kwargs get silently dropped.
+        # Pull any score-related flags out of generate_kwargs and apply them to
+        # a derived generation_config so they actually take effect.
+        gen_cfg = self.generation_config
+        score_flags = {}
+        for flag in ("output_scores", "output_logits", "return_dict_in_generate"):
+            if flag in generate_kwargs:
+                score_flags[flag] = generate_kwargs.pop(flag)
+        if score_flags:
+            from copy import copy as _copy
+            gen_cfg = _copy(self.generation_config)
+            for flag, value in score_flags.items():
+                setattr(gen_cfg, flag, value)
+            # output_scores requires return_dict_in_generate for HF generate to
+            # actually populate .scores on the output object.
+            if gen_cfg.output_scores and not gen_cfg.return_dict_in_generate:
+                gen_cfg.return_dict_in_generate = True
+        # Generate using language model
+        # Pass both input_ids and inputs_embeds so repetition_penalty works correctly
+        # (it needs input_ids to track which tokens have been used)
+        output = self.language_model.generate(
+            input_ids=input_ids,
+            inputs_embeds=inputs_embeds,
+            attention_mask=attention_mask,
+            generation_config=gen_cfg,
+            **generate_kwargs,
+        )
+        # When using inputs_embeds with input_ids, generate returns the full
+        # sequence (prompt + generated). Strip the prompt to return only the
+        # newly generated tokens. When scores were requested, preserve the
+        # GenerateOutput so callers can read .scores; otherwise return the
+        # bare tensor for backward compatibility with existing callers.
+        input_len = input_ids.shape[1]
+        if isinstance(output, torch.Tensor):
+            return output[:, input_len:]
+        output.sequences = output.sequences[:, input_len:]
+        return output
+    def generate_streaming(
+        self,
+        input_features: torch.Tensor,
+        audio_attention_mask: torch.Tensor,
+        system_prompt: Optional[str] = None,
+        **generate_kwargs,
+    ) -> Iterator[str]:
+        """Generate transcription with streaming token output.
+        Yields partial transcript strings as tokens are generated.
+        Reduces time-to-first-word by streaming tokens as they're decoded.
+        Args:
+            input_features: Mel spectrogram features (batch, n_mels, mel_len)
+            audio_attention_mask: Mask for real vs padded mel frames (batch, mel_len)
+            system_prompt: Optional system prompt override
+            **generate_kwargs: Additional generation arguments
+        Yields:
+            Partial transcript text as each token is generated
+        """
+        device = input_features.device
+        batch_size = input_features.shape[0]
+        # Encode audio -> flattened embeddings (no per-sample host sync)
+        encoder_lengths = self._compute_encoder_output_lengths(audio_attention_mask)
+        token_counts = self.projector.get_output_length(encoder_lengths).to(torch.long)
+        audio_embeds = self._encode_audio(input_features, token_counts)
+        # Build prompt with correct number of audio tokens
+        num_audio_tokens = self._get_num_audio_tokens(audio_attention_mask)
+        audio_placeholder = "<audio>" * num_audio_tokens
+        system_prompt = system_prompt or self.system_prompt
+        messages: list[dict[str, str]] = []
+        if system_prompt:
+            messages.append({"role": "system", "content": system_prompt})
+        # Audio tokens, optionally followed by the transcription prompt
+        user_content = audio_placeholder
+        if self.TRANSCRIBE_PROMPT:
+            user_content += " " + self.TRANSCRIBE_PROMPT
+        messages.append({"role": "user", "content": user_content})
+        chat_result = self.tokenizer.apply_chat_template(
+            messages,
+            tokenize=True,
+            add_generation_prompt=True,
+            return_tensors="pt",
+            enable_thinking=False,  # Disable Qwen3 thinking mode for ASR
+        )
+        input_ids = chat_result.input_ids.to(device)
+        if input_ids.dim() == 1:
+            input_ids = input_ids.unsqueeze(0)
+        if input_ids.shape[0] == 1 and batch_size > 1:
+            input_ids = input_ids.expand(batch_size, -1)
+        attention_mask = torch.ones_like(input_ids)
+        # Get text embeddings and replace audio tokens with audio embeddings
+        inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
+        audio_token_mask = (input_ids == self.audio_token_id).unsqueeze(-1)
+        inputs_embeds = inputs_embeds.masked_scatter(
+            audio_token_mask.to(inputs_embeds.device),
+            audio_embeds.to(inputs_embeds.device, dtype=inputs_embeds.dtype),
+        )
+        # Setup streamer for token-by-token output
+        streamer = TextIteratorStreamer(
+            self.tokenizer,
+            skip_prompt=True,
+            skip_special_tokens=True,
+        )
+        # Prepare generation kwargs
+        gen_kwargs = {
+            "inputs_embeds": inputs_embeds,
+            "attention_mask": attention_mask,
+            "generation_config": self.generation_config,
+            "streamer": streamer,
+            **generate_kwargs,
+        }
+        # Run generation in background thread
+        thread = Thread(target=self.language_model.generate, kwargs=gen_kwargs)
+        thread.start()
+        # Yield text as it is generated, filtering out <think>...</think> blocks.
+        # The buffer is consumed left to right so that several think blocks in
+        # one chunk are handled in order and their contents are never yielded.
+        in_think_block = False
+        buffer = ""
+        for text in streamer:
+            buffer += text
+            while buffer:
+                if in_think_block:
+                    # Keep buffering until the closing tag arrives, then drop the block
+                    _, closed, rest = buffer.partition("</think>")
+                    if not closed:
+                        break
+                    in_think_block = False
+                    buffer = rest
+                else:
+                    # Yield everything before the next <think> tag (or the whole buffer)
+                    before_think, opened, buffer = buffer.partition("<think>")
+                    if before_think:
+                        yield before_think
+                    in_think_block = bool(opened)
+        # Yield any remaining buffer
+        if buffer and not in_think_block:
+            yield buffer
+        thread.join()
+    def save_pretrained(self, save_directory: Union[str, Path], **kwargs) -> None:
+        """Save model, tokenizer, and processor."""
+        import shutil
+        save_dir = Path(save_directory)
+        save_dir.mkdir(parents=True, exist_ok=True)
+        # Update config with actual vocab size
+        self.config.vocab_size = self.language_model.config.vocab_size
+        self.config.text_config.vocab_size = self.language_model.config.vocab_size
+        if hasattr(self.audio_tower.config, "num_mel_bins"):
+            self.config.audio_config.num_mel_bins = self.audio_tower.config.num_mel_bins
+        # Save model (temporarily remove non-serializable attributes)
+        tokenizer = self.tokenizer
+        del self.tokenizer
+        try:
+            super().save_pretrained(save_dir, **kwargs)
+        finally:
+            self.tokenizer = tokenizer
+        # Save tokenizer and feature extractor
+        self.tokenizer.save_pretrained(save_dir)
+        self.feature_extractor.save_pretrained(save_dir)
+        # Save LoRA adapters if present (creates adapter_model.safetensors and adapter_config.json)
+        # Don't save embedding layers - the <audio> token embedding is never used
+        # (it's replaced with projected audio embeddings before the LLM sees it)
+        if hasattr(self.language_model, "peft_config"):
+            self.language_model.save_pretrained(save_dir, save_embedding_layers=False)
+            # Clear base_model_name_or_path in adapter_config.json to prevent HF pipeline
+            # from redirecting to the base LLM repo (like Qwen) which breaks feature
+            # extractor loading for multimodal models. If a repo_id is provided, use that
+            # so the model can be loaded directly from the Hub.
+            adapter_config_path = save_dir / "adapter_config.json"
+            if adapter_config_path.exists():
+                with adapter_config_path.open() as f:
+                    adapter_config = json.load(f)
+                # Use repo_id if available, otherwise clear to prevent redirect.
+                # Use empty string instead of None to avoid str(None) -> "None" bug
+                # in some transformers/PEFT versions.
+                repo_id = (
+                    kwargs.get("repo_id")
+                    or kwargs.get("push_to_hub_model_id")
+                    or getattr(self.config, "pretrained_model_path", None)
+                    or ""  # Use empty string instead of None
+                )
+                adapter_config["base_model_name_or_path"] = repo_id
+                with adapter_config_path.open("w") as f:
+                    json.dump(adapter_config, f, indent=2)
+        # Add processor auto_map to preprocessor_config.json
+        config_path = save_dir / "preprocessor_config.json"
+        if config_path.exists():
+            with config_path.open() as f:
+                processor_config = json.load(f)
+        else:
+            processor_config = {}
+        processor_config.update(
+            {
+                "processor_class": "ASRProcessor",
+                "auto_map": {"AutoProcessor": "asr_processing.ASRProcessor"},
+            }
+        )
+        with config_path.open("w") as f:
+            json.dump(processor_config, f, indent=2)
+        # Copy source files for auto-loading
+        src_dir = Path(__file__).parent
+        for asr_file in src_dir.glob("asr_*.py"):
+            shutil.copy(asr_file, save_dir / asr_file.name)
+        # Copy projectors module
+        shutil.copy(src_dir / "projectors.py", save_dir / "projectors.py")
+        # Copy alignment module
+        shutil.copy(src_dir / "alignment.py", save_dir / "alignment.py")
+        # Copy diarization module
+        shutil.copy(src_dir / "diarization.py", save_dir / "diarization.py")
+    def push_to_hub(self, repo_id: str, **kwargs) -> str:
+        """Push model to HuggingFace Hub, ensuring adapter_config points to repo.
+        IMPORTANT: Sets base_model_name_or_path in adapter_config.json to repo_id
+        so that transformers pipeline() can load the model correctly. Without this,
+        the pipeline tries to load from "None" which fails.
+        """
+        # Store repo_id in config so save_pretrained can access it
+        self.config.pretrained_model_path = repo_id
+        # Call parent's push_to_hub
+        return super().push_to_hub(repo_id, **kwargs)
+# Register with transformers Auto classes
+# (AutoConfig.register is handled in asr_config.py at module load.)
+AutoModel.register(ASRConfig, ASRModel)
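
Reviewer sketch, not part of the commit: the think-block filtering in `generate_streaming` can be tried in isolation from the model. The helper below is a hypothetical stand-alone generator (`filter_think_blocks` is an illustrative name, not a function in this repository) that takes any iterable of text chunks in place of the `TextIteratorStreamer`. It assumes each tag arrives whole inside the accumulated buffer; an opening tag split across two chunks would still be passed through as text.

```python
from typing import Iterable, Iterator


def filter_think_blocks(chunks: Iterable[str]) -> Iterator[str]:
    """Yield streamed text with <think>...</think> spans removed (hypothetical helper)."""
    in_think_block = False
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while buffer:
            if in_think_block:
                # Keep buffering until the closing tag has arrived, then drop the block.
                _, closed, rest = buffer.partition("</think>")
                if not closed:
                    break
                in_think_block = False
                buffer = rest
            else:
                # Emit everything before the next opening tag, if there is one.
                before, opened, buffer = buffer.partition("<think>")
                if before:
                    yield before
                in_think_block = bool(opened)


if __name__ == "__main__":
    pieces = filter_think_blocks(["Hello <think>plan", "ning</think>wor", "ld"])
    print("".join(pieces))
```

Consuming the buffer left to right keeps the state machine to a single flag, and text that follows a closing tag in the same chunk is emitted right away.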

asr_pipeline.py ADDED Viewed

	@@ -0,0 +1,368 @@

+"""ASR pipeline for audio-to-text transcription with optional timestamps and diarization."""
+import re
+from pathlib import Path
+from typing import Any
+import numpy as np
+import torch
+import transformers
+from transformers.pipelines.audio_utils import ffmpeg_read
+try:
+    from .alignment import ForcedAligner
+    from .asr_modeling import ASRModel
+    from .diarization import SpeakerDiarizer
+except ImportError:
+    from alignment import ForcedAligner  # type: ignore[no-redef]
+    from asr_modeling import ASRModel  # type: ignore[no-redef]
+    from diarization import SpeakerDiarizer  # type: ignore[no-redef]
+# Re-export for backwards compatibility
+__all__ = ["ForcedAligner", "SpeakerDiarizer", "ASRPipeline"]
+_THINK_TAG_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)
+_DEFAULT_MIN_REPEATS = 3
+_TRAILING_CHAR_RE = re.compile(rf"(.)\1{{{_DEFAULT_MIN_REPEATS - 1},}}$")
+_TRAILING_WORD_RE = re.compile(
+    rf"\b(\w+)(?:\s+\1){{{_DEFAULT_MIN_REPEATS - 1},}}\s*$", re.IGNORECASE
+)
+class ASRPipeline(transformers.AutomaticSpeechRecognitionPipeline):
+    """ASR Pipeline for audio-to-text transcription."""
+    model: ASRModel
+    def __init__(self, model: ASRModel, **kwargs):
+        """Initialize ASR pipeline.
+        Args:
+            model: ASRModel instance for transcription
+            **kwargs: Additional arguments (feature_extractor, tokenizer, device)
+        """
+        feature_extractor = kwargs.pop("feature_extractor", None)
+        tokenizer = kwargs.pop("tokenizer", model.tokenizer)
+        if feature_extractor is None:
+            feature_extractor = model.get_processor().feature_extractor
+        super().__init__(
+            model=model, feature_extractor=feature_extractor, tokenizer=tokenizer, **kwargs
+        )
+        self._current_audio = None
+    def _sanitize_parameters(self, **kwargs):
+        """Intercept our custom parameters before parent class validates them."""
+        # Remove our custom parameters so parent doesn't see them
+        kwargs.pop("return_timestamps", None)
+        kwargs.pop("return_speakers", None)
+        kwargs.pop("num_speakers", None)
+        kwargs.pop("min_speakers", None)
+        kwargs.pop("max_speakers", None)
+        kwargs.pop("hf_token", None)
+        kwargs.pop("user_prompt", None)
+        kwargs.pop("diarization_backend", None)
+        return super()._sanitize_parameters(**kwargs)
+    def __call__(
+        self,
+        inputs,
+        **kwargs,
+    ):
+        """Transcribe audio with optional word-level timestamps and speaker diarization.
+        Args:
+            inputs: Audio input (file path, dict with array/sampling_rate, etc.)
+            return_timestamps: If True, return word-level timestamps using forced alignment
+            return_speakers: If True, return speaker labels for each word
+            user_prompt: Custom transcription prompt (default: the model's TRANSCRIBE_PROMPT)
+            num_speakers: Exact number of speakers (if known, for diarization)
+            min_speakers: Minimum number of speakers (for diarization)
+            max_speakers: Maximum number of speakers (for diarization)
+            **kwargs: Additional arguments passed to the pipeline
+        Returns:
+            Dict with 'text' key, 'words' key if return_timestamps=True,
+            and speaker labels on words if return_speakers=True
+        """
+        # Extract our params before super().__call__ (which will also call _sanitize_parameters)
+        return_timestamps = kwargs.pop("return_timestamps", False)
+        return_speakers = kwargs.pop("return_speakers", False)
+        user_prompt = kwargs.pop("user_prompt", None)
+        diarization_params = {
+            "num_speakers": kwargs.pop("num_speakers", None),
+            "min_speakers": kwargs.pop("min_speakers", None),
+            "max_speakers": kwargs.pop("max_speakers", None),
+        }
+        if return_speakers:
+            return_timestamps = True
+        # Set custom user prompt if provided
+        original_prompt = None
+        if user_prompt:
+            original_prompt = self.model.TRANSCRIBE_PROMPT
+            self.model.TRANSCRIBE_PROMPT = user_prompt
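+        try:
+            # Store audio for timestamp alignment and diarization
+            if return_timestamps or return_speakers:
+                self._current_audio = self._extract_audio(inputs)
+            # Run standard transcription
+            result = super().__call__(inputs, **kwargs)
+        except Exception:
+            # Restore state so a custom prompt or cached audio does not leak into later calls
+            self._current_audio = None
+            if original_prompt is not None:
+                self.model.TRANSCRIBE_PROMPT = original_prompt
+            raise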
+        # Add timestamps if requested
+        if return_timestamps and self._current_audio is not None:
+            text = result.get("text", "")
+            if text:
+                try:
+                    words = ForcedAligner.align(
+                        self._current_audio["array"],
+                        text,
+                        sample_rate=self._current_audio.get("sampling_rate", 16000),
+                    )
+                    result["words"] = words
+                except Exception as e:
+                    result["words"] = []
+                    result["timestamp_error"] = str(e)
+            else:
+                result["words"] = []
+        # Add speaker diarization if requested
+        if return_speakers and self._current_audio is not None:
+            try:
+                # Run diarization
+                speaker_segments = SpeakerDiarizer.diarize(
+                    self._current_audio["array"],
+                    sample_rate=self._current_audio.get("sampling_rate", 16000),
+                    **{k: v for k, v in diarization_params.items() if v is not None},
+                )
+                result["speaker_segments"] = speaker_segments
+                # Assign speakers to words
+                if result.get("words"):
+                    result["words"] = SpeakerDiarizer.assign_speakers_to_words(
+                        result["words"],
+                        speaker_segments,
+                    )
+            except Exception as e:
+                result["speaker_segments"] = []
+                result["diarization_error"] = str(e)
+        # Clean up
+        self._current_audio = None
+        if original_prompt is not None:
+            self.model.TRANSCRIBE_PROMPT = original_prompt
+        return result
+    def _extract_audio(self, inputs) -> dict | None:
+        """Extract audio array from various input formats using HF utilities."""
+        if isinstance(inputs, dict):
+            if "array" in inputs:
+                return {
+                    "array": inputs["array"],
+                    "sampling_rate": inputs.get("sampling_rate", 16000),
+                }
+            if "raw" in inputs:
+                return {
+                    "array": inputs["raw"],
+                    "sampling_rate": inputs.get("sampling_rate", 16000),
+                }
+        elif isinstance(inputs, str):
+            # File path - load audio using ffmpeg (same as HF pipeline)
+            with Path(inputs).open("rb") as f:
+                audio = ffmpeg_read(f.read(), sampling_rate=16000)
+            return {"array": audio, "sampling_rate": 16000}
+        elif isinstance(inputs, bytes):
+            audio = ffmpeg_read(inputs, sampling_rate=16000)
+            return {"array": audio, "sampling_rate": 16000}
+        elif isinstance(inputs, np.ndarray):
+            return {"array": inputs, "sampling_rate": 16000}
+        return None
+    def preprocess(self, inputs, **preprocess_params):
+        """Preprocess audio inputs for the model.
+        Args:
+            inputs: Audio input (dict with array, file path, etc.)
+            **preprocess_params: Additional preprocessing parameters
+        Yields:
+            Model input dicts with input_features and attention_mask
+        """
+        # Handle dict with "array" key (from datasets)
+        if isinstance(inputs, dict) and "array" in inputs:
+            inputs = {
+                "raw": inputs["array"],
+                "sampling_rate": inputs.get("sampling_rate", self.feature_extractor.sampling_rate),
+            }
+        for item in super().preprocess(inputs, **preprocess_params):
+            if "is_last" not in item:
+                item["is_last"] = True
+            yield item
+    def _forward(self, model_inputs, **generate_kwargs) -> dict[str, Any]:
+        """Run model forward pass to generate transcription.
+        Args:
+            model_inputs: Dict with input_features and attention_mask
+            **generate_kwargs: Generation parameters. Pass ``output_scores=True``
+                (and ``return_dict_in_generate=True``, which is then implied) to
+                also return per-step top-1 and top-2 log-probabilities — used by
+                the eval harness's confidence metric. Backward-compatible: when
+                unset, returns just token IDs as before.
+        Returns:
+            Dict with generated token IDs, and optionally per-step
+            ``top1_logprob`` / ``top2_logprob`` lists of floats when scores
+            were requested.
+        """
+        # Extract audio features and is_last flag
+        is_last = model_inputs.pop("is_last", True) if isinstance(model_inputs, dict) else True
+        input_features = model_inputs["input_features"].to(self.model.device)
+        audio_attention_mask = model_inputs["attention_mask"].to(self.model.device)
+        # Opt-in: when output_scores is requested, force return_dict_in_generate
+        # so we get a GenerateOutput rather than a bare token tensor.
+        want_scores = bool(generate_kwargs.get("output_scores", False))
+        if want_scores:
+            generate_kwargs.setdefault("return_dict_in_generate", True)
+        generate_output = self.model.generate(
+            input_features=input_features,
+            audio_attention_mask=audio_attention_mask,
+            **generate_kwargs,
+        )
+        # Default (no scores requested): generate returns a tensor of token IDs.
+        if torch.is_tensor(generate_output):
+            return {"tokens": generate_output, "is_last": is_last}
+        # Scores requested: GenerateOutput dict-like with .sequences and .scores.
+        # `scores` is a tuple of per-step logits tensors (batch, vocab); convert
+        # the first batch item of each to log-probs and take the top 2, giving two
+        # short lists of floats over the generation horizon. They are kept small
+        # (no full vocab) so they are cheap to carry through postprocess.
+        sequences = generate_output.sequences
+        scores = generate_output.scores
+        top1_logprobs: list[float] = []
+        top2_logprobs: list[float] = []
+        if scores:
+            for step_logits in scores:
+                step_logprobs = torch.log_softmax(step_logits[0].float(), dim=-1)
+                top2 = torch.topk(step_logprobs, k=2)
+                top1_logprobs.append(top2.values[0].item())
+                top2_logprobs.append(top2.values[1].item())
+        return {
+            "tokens": sequences,
+            "top1_logprob": top1_logprobs,
+            "top2_logprob": top2_logprobs,
+            "is_last": is_last,
+        }
+    def postprocess(self, model_outputs, **kwargs) -> dict[str, Any]:
+        """Convert model output tokens to text.
+        Args:
+            model_outputs: Dict with 'tokens' key containing generated IDs
+            **kwargs: Additional postprocessing parameters
+        Returns:
+            Dict with 'text' key containing transcription, plus 'top1_logprob'
+            and 'top2_logprob' lists when scores were requested
+        """
+        # Handle list of outputs (from chunking)
+        if isinstance(model_outputs, list):
+            model_outputs = model_outputs[0] if model_outputs else {}
+        tokens = model_outputs.get("tokens")
+        if tokens is None:
+            return super().postprocess(model_outputs, **kwargs)
+        if torch.is_tensor(tokens):
+            tokens = tokens.cpu()
+            if tokens.dim() > 1:
+                tokens = tokens[0]
+        # Filter out eos tokens that the tokenizer doesn't recognize as special
+        # (generation_config.eos_token_id may differ from tokenizer.eos_token_id)
+        if hasattr(self, "model") and hasattr(self.model, "generation_config"):
+            eos_ids = self.model.generation_config.eos_token_id
+            if eos_ids is not None:
+                eos_set = set(eos_ids) if isinstance(eos_ids, list) else {eos_ids}
+                tokens = [t for t in tokens.tolist() if t not in eos_set]
+        text = self.tokenizer.decode(tokens, skip_special_tokens=True).strip()
+        # Strip <think>...</think> tags (Qwen3 doesn't respect /no_think prompt)
+        if "<think>" in text:
+            text = _THINK_TAG_RE.sub("", text).strip()
+        text = _truncate_repetitions(text)
+        out: dict[str, Any] = {"text": text}
+        # Pass through per-step logprobs when _forward captured them (i.e. caller
+        # passed output_scores=True). Lets eval harnesses compute confidence
+        # stats without re-running the model.
+        if "top1_logprob" in model_outputs:
+            out["top1_logprob"] = model_outputs["top1_logprob"]
+        if "top2_logprob" in model_outputs:
+            out["top2_logprob"] = model_outputs["top2_logprob"]
+        return out
+def _truncate_repetitions(text: str, min_repeats: int = 3) -> str:
+    """Truncate repeated words/phrases/characters at end of text.
+    Detects patterns like:
+    - Repeated words: "the the the the" -> "the"
+    - Repeated phrases: "i am sorry i am sorry i am sorry" -> "i am sorry"
+    - Repeated characters: "444444" -> "4"
+    Args:
+        text: Input text to process
+        min_repeats: Minimum repetitions to trigger truncation (default 3)
+    Returns:
+        Text with trailing repetitions removed
+    """
+    if not text:
+        return text
+    if min_repeats == _DEFAULT_MIN_REPEATS:
+        char_pattern = _TRAILING_CHAR_RE
+        word_pattern = _TRAILING_WORD_RE
+    else:
+        char_pattern = re.compile(rf"(.)\1{{{min_repeats - 1},}}$")
+        word_pattern = re.compile(rf"\b(\w+)(?:\s+\1){{{min_repeats - 1},}}\s*$", re.IGNORECASE)
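+    # 1. Collapse a run of repeated characters at the end, e.g. "444444" -> "4"
+    text = char_pattern.sub(r"\1", text)
+    # 2. Collapse repeated words at the end, e.g. "the the the the" -> "the"
+    while word_pattern.search(text):
+        text = word_pattern.sub(r"\1", text)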
+    # 3. Truncate repeated phrases (2-20 words) at end
+    # e.g., "i am sorry i am sorry i am sorry" -> "i am sorry"
+    words = text.split()
+    if len(words) < min_repeats * 2:
+        return text
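+    # Cheap pre-check: a trailing repeat of a phrase of length p implies that
+    # words[-1] equals words[-1 - p] (ignoring case), so the last 21 words
+    # (maximum phrase length 20, plus one) must contain a duplicate. If they
+    # are all unique, no phrase repetition is possible.
+    window = [w.lower() for w in words[-21:]]
+    if len(set(window)) == len(window):
+        return text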
+    for phrase_len in range(2, min(21, len(words) // min_repeats + 1)):
+        phrase_escaped = re.escape(" ".join(words[-phrase_len:]))
+        phrase_pattern = re.compile(
+            rf"(^|.*?\s)({phrase_escaped})(?:\s+{phrase_escaped}){{{min_repeats - 1},}}\s*$",
+            re.IGNORECASE,
+        )
+        match = phrase_pattern.match(text)
+        if match:
+            text = (match.group(1) + match.group(2)).strip()
+            break
+    return text
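
Reviewer sketch, not part of the commit: the first two steps of `_truncate_repetitions` (trailing characters, then trailing words) can be exercised on their own with the same regular expressions. `collapse_trailing_repeats` is a hypothetical name used only for this illustration, and the phrase-level step is deliberately left out.

```python
import re

MIN_REPEATS = 3  # mirrors _DEFAULT_MIN_REPEATS in the file above

# A character followed by at least MIN_REPEATS - 1 copies of itself at the end.
TRAILING_CHAR_RE = re.compile(rf"(.)\1{{{MIN_REPEATS - 1},}}$")
# A word followed by at least MIN_REPEATS - 1 whitespace-separated copies at the end.
TRAILING_WORD_RE = re.compile(
    rf"\b(\w+)(?:\s+\1){{{MIN_REPEATS - 1},}}\s*$", re.IGNORECASE
)


def collapse_trailing_repeats(text: str) -> str:
    """Collapse a trailing run of one character or one word to a single copy."""
    text = TRAILING_CHAR_RE.sub(r"\1", text)
    return TRAILING_WORD_RE.sub(r"\1", text)


if __name__ == "__main__":
    for sample in ("the the the the", "444444", "go go", "price is 1000"):
        print(repr(sample), "->", repr(collapse_trailing_repeats(sample)))
```

One trade-off this makes visible: the character rule also fires on legitimate endings such as "1000", which becomes "10", so transcripts that end in repeated digits or in an ellipsis are shortened as well.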

asr_processing.py ADDED Viewed

	@@ -0,0 +1,132 @@

+from typing import Optional, Union
+import torch
+import transformers
+from transformers import ProcessorMixin
+try:
+    from .asr_config import DEFAULT_ENCODER_CONV_LAYERS, ASRConfig, compute_encoder_output_length
+except ImportError:
+    from asr_config import (  # type: ignore[no-redef]
+        DEFAULT_ENCODER_CONV_LAYERS,
+        ASRConfig,
+        compute_encoder_output_length,
+    )
+class ASRProcessor(ProcessorMixin):
+    """Processor for Whisper-based ASR models."""
+    attributes = ["feature_extractor", "tokenizer"]
+    feature_extractor_class = "AutoFeatureExtractor"
+    tokenizer_class = "AutoTokenizer"
+    AUDIO_TOKEN = "<audio>"
+    TRANSCRIBE_PROMPT = "Transcribe the speech to text"
+    def __init__(
+        self,
+        feature_extractor,
+        tokenizer,
+        projector=None,
+        encoder_conv_layers: Optional[list] = None,
+    ):
+        """Initialize the ASR processor.
+        Args:
+            feature_extractor: Audio feature extractor (WhisperFeatureExtractor)
+            tokenizer: Text tokenizer for the language model
+            projector: Audio projector module (for computing output lengths)
+            encoder_conv_layers: Conv layer specs [(pad, kernel, stride), ...]
+        """
+        self.feature_extractor = feature_extractor
+        self.tokenizer = tokenizer
+        self.audio_token_id = tokenizer.convert_tokens_to_ids(self.AUDIO_TOKEN)
+        self.projector = projector
+        self.encoder_conv_layers = encoder_conv_layers or DEFAULT_ENCODER_CONV_LAYERS
+    def _compute_encoder_output_length(self, mel_length: int) -> int:
+        """Compute encoder output length using conv layer formulas."""
+        return compute_encoder_output_length(mel_length, self.encoder_conv_layers)
+    def __call__(
+        self,
+        audio: Optional[Union[list, "torch.Tensor"]] = None,
+        text: Optional[str] = None,
+        system_prompt: Optional[str] = None,
+        return_tensors: str = "pt",
+        **kwargs,
+    ) -> dict:
+        """Process audio and text inputs for inference.
+        Args:
+            audio: Raw audio waveform(s)
+            text: Target transcription (optional, for training - but use DataCollator instead)
+            system_prompt: Optional system prompt
+            return_tensors: Return format ("pt" for PyTorch)
+        Returns:
+            Dict with input_features, audio_attention_mask (when audio is given),
+            input_ids, attention_mask
+        """
+        result = {}
+        # Process audio
+        if audio is not None:
+            audio_inputs = self.feature_extractor(
+                audio,
+                sampling_rate=getattr(self.feature_extractor, "sampling_rate", 16000),
+                return_attention_mask=True,
+                return_tensors=return_tensors,
+                **kwargs,
+            )
+            result["input_features"] = audio_inputs["input_features"]
+            result["audio_attention_mask"] = audio_inputs["attention_mask"]
+            # Use actual audio length (from attention mask) for token count
+            if self.projector is None:
+                raise ValueError("projector is required to compute the audio token count")
+            real_mel_len = int(audio_inputs["attention_mask"].sum(dim=-1).max().item())
+            encoder_output_len = self._compute_encoder_output_length(real_mel_len)
+            num_audio_tokens = self.projector.get_output_length(encoder_output_len)
+        else:
+            num_audio_tokens = 0
+        # Build prompt with audio token placeholders (instruction-free)
+        if num_audio_tokens > 0:
+            user_content = self.AUDIO_TOKEN * num_audio_tokens
+            if self.TRANSCRIBE_PROMPT:
+                user_content += " " + self.TRANSCRIBE_PROMPT
+        else:
+            user_content = self.TRANSCRIBE_PROMPT or ""
+        messages = []
+        if system_prompt:
+            messages.append({"role": "system", "content": system_prompt})
+        messages.append({"role": "user", "content": user_content})
+        if text is not None:
+            messages.append({"role": "assistant", "content": text})
+        # Tokenize
+        tokenized = self.tokenizer.apply_chat_template(
+            messages,
+            tokenize=True,
+            add_generation_prompt=(text is None),
+            return_tensors=return_tensors,
+            enable_thinking=False,  # Disable Qwen3 thinking mode for ASR
+        )
+        # Handle both tensor and BatchEncoding returns
+        if isinstance(tokenized, torch.Tensor):
+            input_ids = tokenized
+        else:
+            # BatchEncoding or plain dict: both support item access
+            input_ids = tokenized["input_ids"]
+        if input_ids.dim() == 1:
+            input_ids = input_ids.unsqueeze(0)
+        result["input_ids"] = input_ids
+        result["attention_mask"] = torch.ones_like(input_ids)
+        return result
+ASRProcessor.register_for_auto_class()
+transformers.AutoProcessor.register(ASRConfig, ASRProcessor)
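
The processor above sizes the audio placeholder run in two steps: the mel length goes through the encoder's conv-length formula, then through the projector's frame-stacking formula. The sketch below illustrates that arithmetic. `compute_encoder_output_length` is not shown in this diff, so the standard 1-D convolution length formula is assumed here, and `EXAMPLE_CONV_LAYERS` is a hypothetical Whisper-like stack rather than the repository's `DEFAULT_ENCODER_CONV_LAYERS`.

```python
def conv_output_length(length: int, layers: list[tuple[int, int, int]]) -> int:
    """Apply (L + 2*pad - kernel) // stride + 1 once per (pad, kernel, stride) layer."""
    for pad, kernel, stride in layers:
        length = (length + 2 * pad - kernel) // stride + 1
    return length


def projector_output_length(length: int, k: int) -> int:
    """Frame stacking keeps only complete k-frame windows: (L - k) // k + 1."""
    return (length - k) // k + 1


# Hypothetical conv stack for illustration only (assumption, not from the diff).
EXAMPLE_CONV_LAYERS = [(1, 3, 1), (1, 3, 2)]


def num_audio_tokens(mel_length: int, layers=EXAMPLE_CONV_LAYERS, k: int = 4) -> int:
    """Number of audio placeholder tokens for a given count of real mel frames."""
    return projector_output_length(conv_output_length(mel_length, layers), k)
```

Under these assumptions a full 3000-frame mel input maps to 1500 encoder frames and then to 375 placeholder tokens with a pool stride of 4.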

diarization.py ADDED

	@@ -0,0 +1,730 @@

+"""Speaker diarization using TEN-VAD + ECAPA-TDNN + spectral clustering.
+Spectral clustering implementation adapted from FunASR/3D-Speaker:
+https://github.com/alibaba-damo-academy/FunASR
+MIT License (https://opensource.org/licenses/MIT)
+"""
+import warnings
+import numpy as np
+import scipy
+import sklearn.metrics.pairwise
+import torch
+from sklearn.cluster import k_means
+from sklearn.preprocessing import normalize
+def _get_device() -> torch.device:
+    """Get best available device for inference."""
+    if torch.cuda.is_available():
+        return torch.device("cuda")
+    if torch.backends.mps.is_available():
+        return torch.device("mps")
+    return torch.device("cpu")
+class SpectralCluster:
+    """Spectral clustering using unnormalized Laplacian of affinity matrix.
+    Adapted from FunASR/3D-Speaker and SpeechBrain implementations.
+    Uses eigenvalue gap to automatically determine number of speakers.
+    """
+    def __init__(self, min_num_spks: int = 1, max_num_spks: int = 15, pval: float = 0.06):
+        self.min_num_spks = min_num_spks
+        self.max_num_spks = max_num_spks
+        self.pval = pval
+    def __call__(self, embeddings: np.ndarray, oracle_num: int | None = None) -> np.ndarray:
+        """Run spectral clustering on embeddings.
+        Args:
+            embeddings: Speaker embeddings of shape [N, D]
+            oracle_num: Optional known number of speakers
+        Returns:
+            Cluster labels of shape [N]
+        """
+        # Similarity matrix computation
+        sim_mat = self.get_sim_mat(embeddings)
+        # Refine the similarity matrix by p-pruning
+        pruned_sim_mat = self.p_pruning(sim_mat)
+        # Symmetrization
+        sym_pruned_sim_mat = 0.5 * (pruned_sim_mat + pruned_sim_mat.T)
+        # Laplacian calculation
+        laplacian = self.get_laplacian(sym_pruned_sim_mat)
+        # Get Spectral Embeddings
+        emb, num_of_spk = self.get_spec_embs(laplacian, oracle_num)
+        # Perform clustering
+        return self.cluster_embs(emb, num_of_spk)
+    def get_sim_mat(self, embeddings: np.ndarray) -> np.ndarray:
+        """Compute cosine similarity matrix."""
+        return sklearn.metrics.pairwise.cosine_similarity(embeddings, embeddings)
+    def p_pruning(self, affinity: np.ndarray) -> np.ndarray:
+        """Prune low similarity values in affinity matrix (keep top pval fraction)."""
+        n = affinity.shape[0]
+        pval = max(self.pval, 6.0 / n)
+        # Clamp to n so argpartition stays in range when there are fewer than 6 rows
+        k_keep = min(n, max(1, int(pval * n)))
+        # Vectorized: find top-k indices per row and zero out the rest
+        top_k_idx = np.argpartition(affinity, -k_keep, axis=1)[:, -k_keep:]
+        mask = np.zeros_like(affinity, dtype=bool)
+        np.put_along_axis(mask, top_k_idx, True, axis=1)
+        affinity[~mask] = 0
+        return affinity
+    def get_laplacian(self, sim_mat: np.ndarray) -> np.ndarray:
+        """Compute unnormalized Laplacian matrix."""
+        from scipy.sparse.csgraph import laplacian
+        np.fill_diagonal(sim_mat, 0)
+        return laplacian(sim_mat, normed=False)
+    def get_spec_embs(
+        self, laplacian: np.ndarray, k_oracle: int | None = None
+    ) -> tuple[np.ndarray, int]:
+        """Extract spectral embeddings from Laplacian."""
+        lambdas, eig_vecs = scipy.linalg.eigh(laplacian)
+        if k_oracle is not None:
+            num_of_spk = k_oracle
+        else:
+            lambda_gap_list = self.get_eigen_gaps(
+                lambdas[self.min_num_spks - 1 : self.max_num_spks + 1]
+            )
+            num_of_spk = np.argmax(lambda_gap_list) + self.min_num_spks
+        emb = eig_vecs[:, :num_of_spk]
+        return emb, num_of_spk
+    def cluster_embs(self, emb: np.ndarray, k: int) -> np.ndarray:
+        """Cluster spectral embeddings using k-means."""
+        _, labels, _ = k_means(emb, k, n_init=10)
+        return labels
+    def get_eigen_gaps(self, eig_vals: np.ndarray) -> np.ndarray:
+        """Compute gaps between consecutive eigenvalues."""
+        return np.diff(eig_vals)
+class SpeakerClusterer:
+    """Speaker clustering backend using spectral clustering with speaker merging.
+    Features:
+    - Spectral clustering with eigenvalue gap for auto speaker count detection
+    - P-pruning for affinity matrix refinement
+    - Post-clustering speaker merging by cosine similarity
+    """
+    def __init__(
+        self,
+        min_num_spks: int = 2,
+        max_num_spks: int = 10,
+        merge_thr: float = 0.90,  # Moderate merging
+    ):
+        self.min_num_spks = min_num_spks
+        self.max_num_spks = max_num_spks
+        self.merge_thr = merge_thr
+        self._spectral_cluster: SpectralCluster | None = None
+    def _get_spectral_cluster(self) -> SpectralCluster:
+        """Lazy-load spectral clusterer."""
+        if self._spectral_cluster is None:
+            self._spectral_cluster = SpectralCluster(
+                min_num_spks=self.min_num_spks,
+                max_num_spks=self.max_num_spks,
+            )
+        return self._spectral_cluster
+    def __call__(self, embeddings: np.ndarray, num_speakers: int | None = None) -> np.ndarray:
+        """Cluster speaker embeddings and return labels.
+        Args:
+            embeddings: Speaker embeddings of shape [N, D]
+            num_speakers: Optional oracle number of speakers
+        Returns:
+            Cluster labels of shape [N]
+        """
+        if len(embeddings.shape) != 2:
+            raise ValueError(f"Expected 2D array, got shape {embeddings.shape}")
+        # Handle edge cases
+        if embeddings.shape[0] == 0:
+            return np.array([], dtype=int)
+        if embeddings.shape[0] == 1:
+            return np.array([0], dtype=int)
+        # Too few windows for p-pruning (which keeps roughly 6 neighbors per row): assume one speaker
+        if embeddings.shape[0] < 6:
+            return np.zeros(embeddings.shape[0], dtype=int)
+        # Replace NaN/inf, then L2-normalize embeddings
+        embeddings = np.nan_to_num(embeddings, nan=0.0, posinf=0.0, neginf=0.0)
+        embeddings = normalize(embeddings)
+        # Run spectral clustering (suppress numerical warnings)
+        spectral = self._get_spectral_cluster()
+        # Update min/max for oracle case
+        if num_speakers is not None:
+            spectral.min_num_spks = num_speakers
+            spectral.max_num_spks = num_speakers
+        with warnings.catch_warnings():
+            warnings.filterwarnings("ignore", category=RuntimeWarning)
+            labels = spectral(embeddings, oracle_num=num_speakers)
+        # Reset min/max
+        if num_speakers is not None:
+            spectral.min_num_spks = self.min_num_spks
+            spectral.max_num_spks = self.max_num_spks
+        # Merge similar speakers if no oracle
+        if num_speakers is None:
+            labels = self._merge_by_cos(labels, embeddings, self.merge_thr)
+        # Re-index labels sequentially
+        _, labels = np.unique(labels, return_inverse=True)
+        return labels
+    def _merge_by_cos(self, labels: np.ndarray, embs: np.ndarray, cos_thr: float) -> np.ndarray:
+        """Merge similar speakers by cosine similarity of centroids."""
+        from scipy.cluster.hierarchy import fcluster, linkage
+        from scipy.spatial.distance import pdist
+        unique_labels = np.unique(labels)
+        if len(unique_labels) <= 1:
+            return labels
+        # Compute normalized speaker centroids
+        centroids = np.array([embs[labels == lbl].mean(0) for lbl in unique_labels])
+        centroids = normalize(centroids)
+        # Hierarchical clustering with cosine distance
+        distances = pdist(centroids, metric="cosine")
+        linkage_matrix = linkage(distances, method="average")
+        merged_labels = fcluster(linkage_matrix, t=1.0 - cos_thr, criterion="distance") - 1
+        # Map original labels to merged labels
+        label_map = dict(zip(unique_labels, merged_labels))
+        return np.array([label_map[lbl] for lbl in labels])
+class LocalSpeakerDiarizer:
+    """Local speaker diarization using TEN-VAD + ECAPA-TDNN + spectral clustering.
+    Pipeline:
+    1. TEN-VAD detects speech segments
+    2. Sliding window (0.75s window, 0.15s step, i.e. 80% overlap) for uniform embedding extraction
+    3. ECAPA-TDNN extracts speaker embeddings per window
+    4. Spectral clustering with eigenvalue gap for auto speaker detection
+    5. Frame-level consensus voting for segment reconstruction
+    6. Post-processing merges short segments to reduce flicker
+    Tunable Parameters (class attributes):
+    - WINDOW_SIZE: Embedding extraction window size in seconds
+    - STEP_SIZE: Sliding window step size (overlap = WINDOW_SIZE - STEP_SIZE)
+    - VAD_THRESHOLD: Speech detection threshold (lower = more sensitive)
+    - VAD_MIN_DURATION: Minimum speech segment duration
+    - VAD_MAX_GAP: Maximum gap to bridge between segments
+    - VAD_PAD_ONSET/OFFSET: Padding added to speech segments
+    - VOTING_RATE: Frame resolution for consensus voting
+    - MIN_SEGMENT_DURATION: Minimum final segment duration
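+    - SHORT_SEGMENT_GAP: Maximum gap to absorb a short segment into the previous same-speaker segment
+    - SAME_SPEAKER_GAP: Maximum gap to merge same-speaker segments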
+    - TAIL_COVERAGE_RATIO: Minimum tail coverage to add extra window
+    """
+    _ten_vad_model = None
+    _ecapa_model = None
+    _device = None
+    # ==================== TUNABLE PARAMETERS ====================
+    # Sliding window for embedding extraction
+    WINDOW_SIZE = 0.75  # seconds - shorter window for finer resolution
+    STEP_SIZE = 0.15  # seconds (80% overlap for more votes)
+    TAIL_COVERAGE_RATIO = 0.1  # Add extra window if tail > this ratio of window
+    # VAD hysteresis parameters
+    VAD_THRESHOLD = 0.25  # Balanced threshold
+    VAD_MIN_DURATION = 0.05  # Minimum speech segment duration (seconds)
+    VAD_MAX_GAP = 0.50  # Bridge gaps shorter than this (seconds)
+    VAD_PAD_ONSET = 0.05  # Padding at segment start (seconds)
+    VAD_PAD_OFFSET = 0.05  # Padding at segment end (seconds)
+    # Frame-level voting
+    VOTING_RATE = 0.01  # 10ms resolution for consensus voting
+    # Post-processing
+    MIN_SEGMENT_DURATION = 0.15  # Minimum final segment duration (seconds)
+    SHORT_SEGMENT_GAP = 0.1  # Gap threshold for merging short segments
+    SAME_SPEAKER_GAP = 0.5  # Gap threshold for merging same-speaker segments
+    # ===========================================================
+    @classmethod
+    def _get_ten_vad_model(cls):
+        """Lazy-load TEN-VAD model (singleton)."""
+        if cls._ten_vad_model is None:
+            from ten_vad import TenVad
+            cls._ten_vad_model = TenVad(hop_size=256, threshold=cls.VAD_THRESHOLD)
+        return cls._ten_vad_model
+    @classmethod
+    def _get_device(cls) -> torch.device:
+        """Get the best available device."""
+        if cls._device is None:
+            cls._device = _get_device()
+        return cls._device
+    @classmethod
+    def _get_ecapa_model(cls):
+        """Lazy-load ECAPA-TDNN speaker embedding model (singleton)."""
+        if cls._ecapa_model is None:
+            # Suppress torchaudio deprecation warning from SpeechBrain
+            with warnings.catch_warnings():
+                warnings.filterwarnings("ignore", message="torchaudio._backend")
+                from speechbrain.inference.speaker import EncoderClassifier
+                device = cls._get_device()
+                cls._ecapa_model = EncoderClassifier.from_hparams(
+                    source="speechbrain/spkrec-ecapa-voxceleb",
+                    run_opts={"device": str(device)},
+                )
+        return cls._ecapa_model
+    @classmethod
+    def diarize(
+        cls,
+        audio: np.ndarray | str,
+        sample_rate: int = 16000,
+        num_speakers: int | None = None,
+        min_speakers: int = 2,
+        max_speakers: int = 10,
+        **_kwargs,
+    ) -> list[dict]:
+        """Run speaker diarization on audio.
+        Args:
+            audio: Audio waveform as numpy array or path to audio file
+            sample_rate: Audio sample rate (default 16000)
+            num_speakers: Exact number of speakers (if known)
+            min_speakers: Minimum number of speakers
+            max_speakers: Maximum number of speakers
+        Returns:
+            List of dicts with 'speaker', 'start', 'end' keys
+        """
+        # Handle file path input
+        if isinstance(audio, str):
+            import librosa
+            audio, sample_rate = librosa.load(audio, sr=16000)
+        # Ensure correct sample rate
+        if sample_rate != 16000:
+            import librosa
+            audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=16000)
+            sample_rate = 16000
+        audio = audio.astype(np.float32)
+        total_duration = len(audio) / sample_rate
+        # Step 1: VAD (returns segments and raw frame-level decisions)
+        segments, vad_frames = cls._get_speech_segments(audio, sample_rate)
+        if not segments:
+            return []
+        # Step 2: Extract embeddings
+        embeddings, window_segments = cls._extract_embeddings(audio, segments, sample_rate)
+        if len(embeddings) == 0:
+            return []
+        # Step 3: Cluster
+        clusterer = SpeakerClusterer(min_num_spks=min_speakers, max_num_spks=max_speakers)
+        labels = clusterer(embeddings, num_speakers)
+        # Step 4: Post-process with consensus voting (VAD-aware)
+        return cls._postprocess_segments(window_segments, labels, total_duration, vad_frames)
+    @classmethod
+    def _get_speech_segments(
+        cls, audio_array: np.ndarray, sample_rate: int = 16000
+    ) -> tuple[list[dict], list[bool]]:
+        """Get speech segments using TEN-VAD.
+        Returns:
+            Tuple of (segments list, vad_frames list of per-frame speech decisions)
+        """
+        vad_model = cls._get_ten_vad_model()
+        # Convert to int16 as required by TEN-VAD
+        # Clip to prevent integer overflow
+        if audio_array.dtype != np.int16:
+            audio_int16 = (np.clip(audio_array, -1.0, 1.0) * 32767).astype(np.int16)
+        else:
+            audio_int16 = audio_array
+        # Process frame by frame
+        hop_size = 256
+        frame_duration = hop_size / sample_rate
+        speech_frames: list[bool] = []
+        # "+ 1" keeps the final full frame when the length is an exact multiple of hop_size
+        for i in range(0, len(audio_int16) - hop_size + 1, hop_size):
+            frame = audio_int16[i : i + hop_size]
+            _, is_speech = vad_model.process(frame)
+            speech_frames.append(is_speech)
+        # Convert frame-level decisions to segments
+        segments = []
+        in_speech = False
+        start_idx = 0
+        for i, is_speech in enumerate(speech_frames):
+            if is_speech and not in_speech:
+                start_idx = i
+                in_speech = True
+            elif not is_speech and in_speech:
+                start_time = start_idx * frame_duration
+                end_time = i * frame_duration
+                segments.append(
+                    {
+                        "start": start_time,
+                        "end": end_time,
+                        "start_sample": int(start_time * sample_rate),
+                        "end_sample": int(end_time * sample_rate),
+                    }
+                )
+                in_speech = False
+        # Handle trailing speech
+        if in_speech:
+            start_time = start_idx * frame_duration
+            end_time = len(speech_frames) * frame_duration
+            segments.append(
+                {
+                    "start": start_time,
+                    "end": end_time,
+                    "start_sample": int(start_time * sample_rate),
+                    "end_sample": int(end_time * sample_rate),
+                }
+            )
+        return cls._apply_vad_hysteresis(segments, sample_rate), speech_frames
+    @classmethod
+    def _apply_vad_hysteresis(cls, segments: list[dict], sample_rate: int = 16000) -> list[dict]:
+        """Apply hysteresis-like post-processing to VAD segments."""
+        if not segments:
+            return segments
+        segments = sorted(segments, key=lambda x: x["start"])
+        # Fill short gaps
+        merged = [segments[0].copy()]
+        for seg in segments[1:]:
+            gap = seg["start"] - merged[-1]["end"]
+            if gap <= cls.VAD_MAX_GAP:
+                merged[-1]["end"] = seg["end"]
+                merged[-1]["end_sample"] = seg["end_sample"]
+            else:
+                merged.append(seg.copy())
+        # Remove short segments
+        filtered = [seg for seg in merged if (seg["end"] - seg["start"]) >= cls.VAD_MIN_DURATION]
+        # Dilate segments (add padding)
+        for seg in filtered:
+            seg["start"] = max(0.0, seg["start"] - cls.VAD_PAD_ONSET)
+            seg["end"] = seg["end"] + cls.VAD_PAD_OFFSET
+            seg["start_sample"] = int(seg["start"] * sample_rate)
+            seg["end_sample"] = int(seg["end"] * sample_rate)
+        return filtered
+    @classmethod
+    def _extract_embeddings(
+        cls, audio_array: np.ndarray, segments: list[dict], sample_rate: int
+    ) -> tuple[np.ndarray, list[dict]]:
+        """Extract speaker embeddings using sliding windows."""
+        speaker_model = cls._get_ecapa_model()
+        window_samples = int(cls.WINDOW_SIZE * sample_rate)
+        step_samples = int(cls.STEP_SIZE * sample_rate)
+        embeddings = []
+        window_segments = []
+        with torch.no_grad():
+            for seg in segments:
+                seg_start = seg["start_sample"]
+                seg_end = seg["end_sample"]
+                seg_len = seg_end - seg_start
+                # Generate window positions
+                if seg_len <= window_samples:
+                    starts = [seg_start]
+                    ends = [seg_end]
+                else:
+                    starts = list(range(seg_start, seg_end - window_samples + 1, step_samples))
+                    ends = [s + window_samples for s in starts]
+                    # Cover tail if > TAIL_COVERAGE_RATIO of window remains
+                    if ends and ends[-1] < seg_end:
+                        remainder = seg_end - ends[-1]
+                        if remainder > (window_samples * cls.TAIL_COVERAGE_RATIO):
+                            starts.append(seg_end - window_samples)
+                            ends.append(seg_end)
+                for c_start, c_end in zip(starts, ends):
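+                    chunk = audio_array[c_start:c_end]
+                    # Skip windows that fall entirely past the end of the audio
+                    # (VAD offset padding can push a segment beyond the last sample)
+                    if len(chunk) == 0:
+                        continue
+                    # Pad short chunks with reflection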
+                    if len(chunk) < window_samples:
+                        pad_width = window_samples - len(chunk)
+                        chunk = np.pad(chunk, (0, pad_width), mode="reflect")
+                    # Extract embedding using SpeechBrain's encode_batch
+                    chunk_tensor = torch.from_numpy(chunk).float().unsqueeze(0)
+                    embedding = (
+                        speaker_model.encode_batch(chunk_tensor).squeeze(0).squeeze(0).cpu().numpy()
+                    )
+                    # Validate embedding
+                    if np.isfinite(embedding).all() and np.linalg.norm(embedding) > 1e-8:
+                        embeddings.append(embedding)
+                        window_segments.append(
+                            {
+                                "start": c_start / sample_rate,
+                                "end": c_end / sample_rate,
+                            }
+                        )
+        # Normalize all embeddings at once
+        if embeddings:
+            return normalize(np.array(embeddings)), window_segments
+        return np.array([]), []
+    @classmethod
+    def _resample_vad(cls, vad_frames: list[bool], num_frames: int) -> np.ndarray:
+        """Resample VAD frame decisions to match voting grid resolution.
+        VAD operates at 256 samples / 16000 Hz = 16ms per frame.
+        Voting operates at VOTING_RATE (default 10ms) per frame.
+        This maps VAD decisions to the finer voting grid.
+        """
+        if not vad_frames:
+            return np.zeros(num_frames, dtype=bool)
+        vad_rate = 256 / 16000  # 16ms per VAD frame
+        vad_arr = np.array(vad_frames)
+        # Vectorized: compute VAD frame indices for each voting frame
+        voting_times = np.arange(num_frames) * cls.VOTING_RATE
+        vad_indices = np.clip((voting_times / vad_rate).astype(int), 0, len(vad_arr) - 1)
+        return vad_arr[vad_indices]
+    @classmethod
+    def _postprocess_segments(
+        cls,
+        window_segments: list[dict],
+        labels: np.ndarray,
+        total_duration: float,
+        vad_frames: list[bool],
+    ) -> list[dict]:
+        """Post-process using frame-level consensus voting with VAD-aware silence."""
+        if not window_segments or len(labels) == 0:
+            return []
+        # Correct labels to be contiguous
+        unique_labels = np.unique(labels)
+        label_map = {old: new for new, old in enumerate(unique_labels)}
+        clean_labels = np.array([label_map[lbl] for lbl in labels])
+        num_speakers = len(unique_labels)
+        if num_speakers == 0:
+            return []
+        # Create voting grid
+        num_frames = int(np.ceil(total_duration / cls.VOTING_RATE)) + 1
+        votes = np.zeros((num_frames, num_speakers), dtype=np.float32)
+        # Accumulate votes
+        for win, label in zip(window_segments, clean_labels):
+            start_frame = int(win["start"] / cls.VOTING_RATE)
+            end_frame = int(win["end"] / cls.VOTING_RATE)
+            end_frame = min(end_frame, num_frames)
+            if start_frame < end_frame:
+                votes[start_frame:end_frame, label] += 1.0
+        # Determine winner per frame
+        frame_speakers = np.argmax(votes, axis=1)
+        max_votes = np.max(votes, axis=1)
+        # Resample VAD to voting grid resolution for silence-aware voting
+        vad_resampled = cls._resample_vad(vad_frames, num_frames)
+        # Convert frames to segments
+        final_segments = []
+        current_speaker = -1
+        seg_start = 0.0
+        for f in range(num_frames):
+            speaker = int(frame_speakers[f])
+            score = max_votes[f]
+            # Force silence if VAD says no speech OR no votes
+            if score == 0 or not vad_resampled[f]:
+                speaker = -1
+            if speaker != current_speaker:
+                if current_speaker != -1:
+                    final_segments.append(
+                        {
+                            "speaker": f"SPEAKER_{current_speaker}",
+                            "start": seg_start,
+                            "end": f * cls.VOTING_RATE,
+                        }
+                    )
+                current_speaker = speaker
+                seg_start = f * cls.VOTING_RATE
+        # Close last segment
+        if current_speaker != -1:
+            final_segments.append(
+                {
+                    "speaker": f"SPEAKER_{current_speaker}",
+                    "start": seg_start,
+                    # Clamp: the voting grid can extend slightly past the end of the audio
+                    "end": min(num_frames * cls.VOTING_RATE, total_duration),
+                }
+            )
+        return cls._merge_short_segments(final_segments)
+    @classmethod
+    def _merge_short_segments(cls, segments: list[dict]) -> list[dict]:
+        """Merge short segments to reduce flicker."""
+        if not segments:
+            return []
+        clean: list[dict] = []
+        for seg in segments:
+            dur = seg["end"] - seg["start"]
+            if dur < cls.MIN_SEGMENT_DURATION:
+                if (
+                    clean
+                    and clean[-1]["speaker"] == seg["speaker"]
+                    and seg["start"] - clean[-1]["end"] < cls.SHORT_SEGMENT_GAP
+                ):
+                    clean[-1]["end"] = seg["end"]
+                continue
+            if (
+                clean
+                and clean[-1]["speaker"] == seg["speaker"]
+                and seg["start"] - clean[-1]["end"] < cls.SAME_SPEAKER_GAP
+            ):
+                clean[-1]["end"] = seg["end"]
+            else:
+                clean.append(seg)
+        return clean
+    @classmethod
+    def assign_speakers_to_words(
+        cls,
+        words: list[dict],
+        speaker_segments: list[dict],
+    ) -> list[dict]:
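+        """Assign speaker labels to words using each word's temporal midpoint.
+        A word takes the speaker of the segment containing its midpoint; if none
+        does, it takes the speaker of the segment whose midpoint is closest.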
+        Args:
+            words: List of word dicts with 'word', 'start', 'end' keys
+            speaker_segments: List of speaker dicts with 'speaker', 'start', 'end' keys
+        Returns:
+            Words list with 'speaker' key added to each word
+        """
+        for word in words:
+            word_mid = (word["start"] + word["end"]) / 2
+            # Find the speaker segment that contains this word's midpoint
+            best_speaker = None
+            for seg in speaker_segments:
+                if seg["start"] <= word_mid <= seg["end"]:
+                    best_speaker = seg["speaker"]
+                    break
+            # If no exact match, find closest segment
+            if best_speaker is None and speaker_segments:
+                min_dist = float("inf")
+                for seg in speaker_segments:
+                    seg_mid = (seg["start"] + seg["end"]) / 2
+                    dist = abs(word_mid - seg_mid)
+                    if dist < min_dist:
+                        min_dist = dist
+                        best_speaker = seg["speaker"]
+            word["speaker"] = best_speaker
+        return words
+class SpeakerDiarizer:
+    """Speaker diarization using TEN-VAD + ECAPA-TDNN + spectral clustering.
+    Example:
+        >>> segments = SpeakerDiarizer.diarize(audio_array)
+        >>> for seg in segments:
+        ...     print(f"{seg['speaker']}: {seg['start']:.2f} - {seg['end']:.2f}")
+    """
+    @classmethod
+    def diarize(
+        cls,
+        audio: np.ndarray | str,
+        sample_rate: int = 16000,
+        num_speakers: int | None = None,
+        min_speakers: int | None = None,
+        max_speakers: int | None = None,
+        **_kwargs,
+    ) -> list[dict]:
+        """Run speaker diarization on audio.
+        Args:
+            audio: Audio waveform as numpy array or path to audio file
+            sample_rate: Audio sample rate (default 16000)
+            num_speakers: Exact number of speakers (if known)
+            min_speakers: Minimum number of speakers
+            max_speakers: Maximum number of speakers
+        Returns:
+            List of dicts with 'speaker', 'start', 'end' keys
+        """
+        return LocalSpeakerDiarizer.diarize(
+            audio,
+            sample_rate=sample_rate,
+            num_speakers=num_speakers,
+            min_speakers=min_speakers or 2,
+            max_speakers=max_speakers or 10,
+        )
+    @classmethod
+    def assign_speakers_to_words(
+        cls,
+        words: list[dict],
+        speaker_segments: list[dict],
+    ) -> list[dict]:
+        """Assign speaker labels to words using each word's temporal midpoint."""
+        return LocalSpeakerDiarizer.assign_speakers_to_words(words, speaker_segments)
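
`SpectralCluster.get_spec_embs` above picks the speaker count from the largest gap between consecutive Laplacian eigenvalues. The following is a minimal NumPy-only sketch of that heuristic on a toy affinity matrix. The names `block_affinity` and `estimate_num_speakers` are hypothetical, and the Laplacian is built directly as `D - A` instead of calling `scipy.sparse.csgraph.laplacian`.

```python
import numpy as np


def block_affinity(sizes: list[int]) -> np.ndarray:
    """Toy affinity: each group is fully connected, with no links across groups."""
    n = sum(sizes)
    affinity = np.zeros((n, n))
    start = 0
    for size in sizes:
        affinity[start : start + size, start : start + size] = 1.0
        start += size
    return affinity


def estimate_num_speakers(affinity: np.ndarray, min_k: int = 1, max_k: int = 15) -> int:
    """Eigengap heuristic on the unnormalized Laplacian L = D - A."""
    affinity = affinity.copy()
    np.fill_diagonal(affinity, 0.0)
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    eigvals = np.linalg.eigvalsh(laplacian)  # returned in ascending order
    gaps = np.diff(eigvals[min_k - 1 : max_k + 1])
    # The largest gap sits right after the near-zero eigenvalues,
    # one of which exists per connected group.
    return int(np.argmax(gaps)) + min_k
```

For cleanly separated groups the Laplacian has one zero eigenvalue per group, so the first large gap marks the group count. Real embeddings are noisier, which is why the source prunes the affinity matrix before this step.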

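The frame-level consensus voting in `_postprocess_segments` can be sketched in isolation as follows. Each overlapping window casts one vote for its cluster label on every frame it covers, and each frame goes to the label with the most votes. `vote_frames` is a hypothetical helper, and this sketch leaves out the VAD mask that the source also applies.

```python
import numpy as np


def vote_frames(
    windows: list[tuple[float, float]],
    labels: list[int],
    total_duration: float,
    rate: float = 0.01,
) -> np.ndarray:
    """Return one speaker index per frame, or -1 where no window voted."""
    num_speakers = max(labels) + 1
    num_frames = int(np.ceil(total_duration / rate)) + 1
    votes = np.zeros((num_frames, num_speakers), dtype=np.float32)
    for (start, end), label in zip(windows, labels):
        first = int(start / rate)
        last = min(int(end / rate), num_frames)
        if first < last:
            votes[first:last, label] += 1.0
    winners = votes.argmax(axis=1)
    winners[votes.max(axis=1) == 0] = -1  # frames with no votes count as silence
    return winners
```

Because `argmax` returns the first maximum, ties go to the lower speaker index. A shorter step size gives each frame more votes, which is the stated reason for the 80% overlap.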
preprocessor_config.json ADDED

	@@ -0,0 +1,19 @@

+{
+  "chunk_length": 30,
+  "dither": 0.0,
+  "feature_extractor_type": "WhisperFeatureExtractor",
+  "feature_size": 128,
+  "hop_length": 160,
+  "n_fft": 400,
+  "n_samples": 480000,
+  "nb_max_frames": 3000,
+  "padding": false,
+  "padding_side": "right",
+  "padding_value": 0.0,
+  "return_attention_mask": false,
+  "sampling_rate": 16000,
+  "processor_class": "ASRProcessor",
+  "auto_map": {
+    "AutoProcessor": "asr_processing.ASRProcessor"
+  }
+}
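
Several values in this config are related to one another. As a quick consistency check, the sketch below recomputes `n_samples` and `nb_max_frames` from `chunk_length`, `sampling_rate` and `hop_length`. It assumes the usual Whisper feature extractor relationships, and `derived_sizes` is a hypothetical helper.

```python
# Subset of the values from preprocessor_config.json above.
CONFIG = {
    "chunk_length": 30,      # seconds per chunk
    "hop_length": 160,       # samples between successive mel frames
    "sampling_rate": 16000,  # samples per second
    "n_samples": 480000,
    "nb_max_frames": 3000,
}


def derived_sizes(cfg: dict) -> tuple[int, int]:
    """Recompute (n_samples, nb_max_frames) from the primary settings."""
    n_samples = cfg["chunk_length"] * cfg["sampling_rate"]
    nb_max_frames = n_samples // cfg["hop_length"]
    return n_samples, nb_max_frames


# Duration of one mel frame in seconds (hop_length / sampling_rate).
FRAME_SECONDS = CONFIG["hop_length"] / CONFIG["sampling_rate"]
```

With these settings each mel frame spans 10 ms, so a 30 second chunk yields 3000 frames.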

projectors.py ADDED

	@@ -0,0 +1,493 @@

+"""Audio projector modules for bridging encoder and decoder embeddings.
+This module contains all projector architectures:
+- MLPAudioProjector: Simple 2-layer MLP with frame stacking downsampling
+- MOSAProjector: MOSA-style dense mixture of experts
+- SharedMoEAudioProjector: Shared expert + sparse routed experts
+- QFormerAudioProjector: BLIP-2 QFormer with learnable queries (Granite-style)
+"""
+import math
+import torch
+import torch.nn as nn
+import torch.nn.functional as F  # noqa: N812
+from transformers import AutoModel, Blip2QFormerConfig
+from transformers.models.llama.modeling_llama import LlamaRMSNorm
+# =============================================================================
+# MLP Projector
+# =============================================================================
+class MLPAudioProjector(nn.Module):
+    """2-layer MLP projector with frame-stacking downsampling (matches GLM-ASR).
+    Both RMSNorms use LlamaRMSNorm's default weight=1.0 init. A prior version
+    initialized both to 0.029 (Qwen3-0.6B's embed_tokens RMS) to put projector
+    outputs at residual-stream scale on step 1. Empirically, training drifted
+    the weights back to ~1.0 (norm) and ~1.2 (norm_2), so the small init only
+    cost compute on a ~35× scale-correction phase that the default init
+    avoids.
+    """
+    def __init__(self, config):
+        """Initialize MLP projector.
+        Args:
+            config: ASRConfig with encoder_dim, llm_dim, projector_pool_stride
+        """
+        super().__init__()
+        encoder_dim = getattr(config, "encoder_dim", 768)
+        llm_dim = getattr(config, "llm_dim", 2048)
+        self.k = getattr(config, "projector_pool_stride", 4)
+        # Frame stacking: concat k adjacent frames then project
+        in_dim = encoder_dim * self.k
+        # Hidden dim defaults to llm_dim, can be overridden via config
+        hidden_dim = getattr(config, "projector_hidden_dim", None) or llm_dim
+        self.linear_1 = nn.Linear(in_dim, hidden_dim, bias=False)
+        self.norm = LlamaRMSNorm(hidden_dim, eps=1e-6)
+        self.act = nn.GELU()
+        self.dropout = nn.Dropout(getattr(config, "projector_dropout", 0.0))
+        self.linear_2 = nn.Linear(hidden_dim, llm_dim, bias=False)
+        self.norm_2 = LlamaRMSNorm(llm_dim, eps=1e-6)
+    def get_output_length(self, input_length: int) -> int:
+        """Calculate output sequence length given input length (matches GLM-ASR)."""
+        # GLM-ASR formula: (L - merge_factor) // merge_factor + 1
+        return (input_length - self.k) // self.k + 1
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        """Project audio features to LLM embedding space.
+        Args:
+            x: Audio encoder output of shape [batch, seq_len, encoder_dim]
+        Returns:
+            Projected features of shape [batch, (seq_len - k) // k + 1, llm_dim]
+        """
+        x = _frame_stack(x, self.k)
+        x = self.linear_1(x)
+        x = self.norm(x)
+        x = self.act(x)
+        x = self.dropout(x)
+        x = self.linear_2(x)
+        return self.norm_2(x)
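Review note: the length formula shared by `get_output_length` and `_frame_stack` can be checked without PyTorch. The block below is a minimal pure-Python sketch, not the module's implementation. `frame_stack_length` and `frame_stack` are hypothetical helper names, and frames are modeled as plain lists instead of tensors.

```python
def frame_stack_length(seq_len: int, k: int) -> int:
    """Number of stacked frames produced from seq_len encoder frames."""
    return (seq_len - k) // k + 1


def frame_stack(frames: list, k: int) -> list:
    """Concatenate each run of k adjacent frames; drop the incomplete tail."""
    out_len = frame_stack_length(len(frames), k)
    return [
        [value for frame in frames[i * k : (i + 1) * k] for value in frame]
        for i in range(out_len)
    ]


# 10 frames of dimension 2, stacked with k=4
frames = [[float(t), float(t) + 0.5] for t in range(10)]
stacked = frame_stack(frames, k=4)
print(len(stacked), len(stacked[0]))  # 2 8
```

For `seq_len >= k` the formula reduces to `seq_len // k`, so up to `k - 1` trailing frames are dropped (frames 8 and 9 in the example). Inputs shorter than `k` yield zero output frames.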
+# =============================================================================
+# Shared helper: frame stacking (used by MLPAudioProjector and MoEAudioProjector)
+# =============================================================================
+def _frame_stack(x: torch.Tensor, k: int) -> torch.Tensor:
+    """Stack k adjacent frames along the feature dim.
+    Truncates trailing frames that don't fill a complete k-frame window,
+    matching GLM-ASR's `(seq_len - k) // k + 1` formula.
+    """
+    batch, seq, dim = x.shape
+    out_len = (seq - k) // k + 1
+    return x[:, : out_len * k, :].reshape(batch, out_len, dim * k)
+# =============================================================================
+# MoE Projector (MOSA-style)
+# =============================================================================
+class SimpleAdapter(nn.Module):
+    """Simple 2-layer GELU adapter (from MOSA paper)."""
+    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
+        super().__init__()
+        self.fc1 = nn.Linear(input_dim, hidden_dim)
+        self.act = nn.GELU()
+        self.fc2 = nn.Linear(hidden_dim, output_dim)
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.fc2(self.act(self.fc1(x)))
+class MOSAProjector(nn.Module):
+    """MOSA-Base projector: simple 2-layer ReLU router with 4 simple adapters.
+    Based on "MOSA: Mixtures of Simple Adapters" (arXiv:2508.18998).
+    Uses softmax gating over all experts (dense MoE) and is trained with the
+    cross-entropy loss only (no auxiliary routing loss).
+    Uses Conv1d for downsampling (2 layers, stride 2 each = 4x total).
+    """
+    ADAPTER_HIDDEN_DIM = 4096
+    ROUTER_HIDDEN_DIM = 512
+    CONV_KERNEL = 3
+    CONV_STRIDE = 2
+    CONV_PADDING = 1
+    def __init__(self, config):
+        """Initialize MOSA projector.
+        Args:
+            config: ASRConfig with encoder_dim, llm_dim, num_experts
+        """
+        super().__init__()
+        self.encoder_dim = getattr(config, "encoder_dim", None) or 1280
+        self.llm_dim = getattr(config, "llm_dim", None) or 2048
+        self.num_experts = getattr(config, "num_experts", None) or 4  # MOSA-Base uses 4
+        conv_kwargs = {
+            "kernel_size": self.CONV_KERNEL,
+            "stride": self.CONV_STRIDE,
+            "padding": self.CONV_PADDING,
+        }
+        self.downsampler = nn.Sequential(
+            nn.Conv1d(self.encoder_dim, self.encoder_dim, **conv_kwargs),
+            nn.GELU(),
+            nn.Conv1d(self.encoder_dim, self.llm_dim, **conv_kwargs),
+            nn.GELU(),
+        )
+        self.router = nn.Sequential(
+            nn.Linear(self.llm_dim, self.ROUTER_HIDDEN_DIM),
+            nn.ReLU(),
+            nn.Linear(self.ROUTER_HIDDEN_DIM, self.num_experts),
+        )
+        self.experts = nn.ModuleList(
+            [
+                SimpleAdapter(self.llm_dim, self.ADAPTER_HIDDEN_DIM, self.llm_dim)
+                for _ in range(self.num_experts)
+            ]
+        )
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        """Project audio features using mixture of experts.
+        Args:
+            x: Audio encoder output of shape [batch, seq_len, encoder_dim]
+        Returns:
+            Projected features of shape [batch, out_len, llm_dim]
+        """
+        x = self.downsampler(x.transpose(1, 2)).transpose(1, 2)
+        routing_weights = F.softmax(self.router(x), dim=-1)  # (B, out_len, num_experts)
+        # Accumulate weighted expert outputs without materializing all experts at once.
+        output = self.experts[0](x) * routing_weights[..., 0:1]
+        for i, expert in enumerate(self.experts[1:], start=1):
+            output = output + expert(x) * routing_weights[..., i : i + 1]
+        return output
+    def get_output_length(self, input_length: int) -> int:
+        """Calculate output sequence length after Conv1d downsampling (~4x reduction, rounded up)."""
+        length = input_length
+        for _ in range(2):
+            length = (length + 2 * self.CONV_PADDING - self.CONV_KERNEL) // self.CONV_STRIDE + 1
+        return length
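Review note: the two-layer Conv1d length arithmetic in `MOSAProjector.get_output_length` can be reproduced with plain integers. This is a small sketch under the class constants shown above (kernel 3, stride 2, padding 1). `conv1d_out_len` and `mosa_out_len` are hypothetical names used only for illustration.

```python
def conv1d_out_len(length: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Standard Conv1d output length for dilation 1."""
    return (length + 2 * padding - kernel) // stride + 1


def mosa_out_len(length: int, num_layers: int = 2) -> int:
    """Apply the per-layer formula once per downsampling layer."""
    for _ in range(num_layers):
        length = conv1d_out_len(length)
    return length


print([mosa_out_len(n) for n in (1, 100, 101)])  # [1, 25, 26]
```

With these constants each layer yields `ceil(length / 2)`, so the stack yields `ceil(length / 4)` for any positive length. That differs from the frame-stacking projectors, which round down.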
+# =============================================================================
+# MoE Projector (Pure PyTorch with Shared Expert)
+# =============================================================================
+class MoEAudioProjector(nn.Module):
+    """MoE projector with shared expert (DeepSeek-style), pure PyTorch implementation.
+    Uses 4 sparse experts with top-2 routing plus a shared expert that processes all tokens.
+    No external dependencies (megablocks removed).
+    Architecture matches main branch: norm → experts(in_dim → hidden → out_dim)
+    """
+    def __init__(self, config):
+        """Initialize MoE projector.
+        Args:
+            config: ASRConfig with encoder_dim, llm_dim, num_experts, num_experts_per_tok
+        """
+        super().__init__()
+        self.k = getattr(config, "projector_pool_stride", 4)
+        self.aux_coef = getattr(config, "router_aux_loss_coef", 0.01)
+        # Stability coefficients
+        self.router_z_loss_coef = getattr(
+            config, "router_z_loss_coef", 1e-4
+        )  # Prevents logit explosion
+        self.router_jitter_noise = getattr(
+            config, "router_jitter_noise", 0.01
+        )  # Prevents expert collapse
+        in_dim = config.encoder_dim * self.k
+        out_dim = config.llm_dim
+        # Expert hidden dim (default = output dim)
+        hidden_dim = getattr(config, "projector_hidden_dim", None) or out_dim
+        # Number of experts and top-k selection
+        self.num_experts = getattr(config, "num_experts", 4)
+        self.top_k = getattr(config, "num_experts_per_tok", 2)
+        # A. Normalize stacked input (like main branch SharedMoEBlock)
+        self.norm = LlamaRMSNorm(in_dim, eps=1e-6)
+        # B. Router (operates on stacked input)
+        self.router = nn.Linear(in_dim, self.num_experts, bias=False)
+        # C. Experts: 2-layer GELU MLPs (SimpleAdapter: biased linears, no norms)
+        self.experts = nn.ModuleList(
+            [SimpleAdapter(in_dim, hidden_dim, out_dim) for _ in range(self.num_experts)]
+        )
+        # D. Shared Expert (same architecture)
+        self.shared_expert = SimpleAdapter(in_dim, hidden_dim, out_dim)
+        # E. Initialize weights for stable training
+        self._init_weights()
+        self.last_aux_loss = torch.tensor(0.0)
+    def _init_weights(self):
+        """Initialize weights for stable training start."""
+        with torch.no_grad():
+            # Router: small weights -> uniform probability
+            nn.init.normal_(self.router.weight, mean=0.0, std=0.02)
+            # Experts: xavier for fc1, small for fc2 (output)
+            for expert in [self.shared_expert, *self.experts]:
+                nn.init.xavier_uniform_(expert.fc1.weight)
+                nn.init.normal_(expert.fc2.weight, mean=0.0, std=0.01)  # Small init
+    def get_output_length(self, input_length: int) -> int:
+        """Calculate output sequence length given input length (matches MLP projector)."""
+        return (input_length - self.k) // self.k + 1
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        """Project audio features using shared + sparse MoE.
+        Args:
+            x: Audio encoder output of shape [batch, seq_len, encoder_dim]
+        Returns:
+            Projected features of shape [batch, out_len, llm_dim]
+        """
+        # 1. Stack k adjacent frames: [batch, out_len, encoder_dim * k]
+        x = _frame_stack(x, self.k)
+        batch, out_len, _ = x.shape
+        # 2. Normalize stacked input (like main branch SharedMoEBlock)
+        x = self.norm(x)
+        flat_x = x.view(-1, x.size(-1))  # [tokens, in_dim]
+        # 3. Shared Expert (compute first, creates output tensor)
+        output = self.shared_expert(flat_x)
+        # 4. Sparse Experts (in-place add to shared output)
+        self.last_aux_loss = self._forward_sparse(flat_x, output)
+        return output.view(batch, out_len, -1)
+    def _forward_sparse(self, x: torch.Tensor, output: torch.Tensor) -> torch.Tensor:
+        """Stability-hardened sparse expert dispatch (in-place add to output).
+        Args:
+            x: Flattened input of shape [tokens, dim]
+            output: Output tensor to add sparse expert results into (in-place)
+        Returns:
+            Auxiliary loss tensor
+        """
+        # A. Router Logic with Jitter
+        logits = self.router(x)
+        if self.training and self.router_jitter_noise > 0:
+            # Jitter: multiply by uniform noise (1-eps, 1+eps) to shake decision boundary
+            # Prevents router from getting stuck on one expert early in training
+            noise = torch.empty_like(logits).uniform_(
+                1.0 - self.router_jitter_noise, 1.0 + self.router_jitter_noise
+            )
+            logits = logits * noise
+        # Compute softmax in float32 for numerical precision under bf16/fp16
+        probs = torch.softmax(logits, dim=-1, dtype=torch.float32).type_as(x)
+        # B. Top-K Selection
+        top_k_weights, top_k_indices = torch.topk(probs, self.top_k, dim=-1)
+        # Normalize weights so they sum to 1.0
+        top_k_weights = top_k_weights / (top_k_weights.sum(dim=-1, keepdim=True) + 1e-6)
+        # C. Aux Loss + Z-Loss
+        aux_loss = torch.tensor(0.0, device=x.device)
+        if self.training:
+            # Load balancing loss (batch-size invariant)
+            prob_per_expert = probs.mean(0)  # [num_experts]
+            target = 1.0 / self.num_experts
+            balance_loss = (
+                self.aux_coef * ((prob_per_expert - target) ** 2).mean() * self.num_experts
+            )
+            # Z-loss: penalty on large logits to prevent softmax saturation
+            z_loss = self.router_z_loss_coef * torch.logsumexp(logits, dim=-1).pow(2).mean()
+            aux_loss = balance_loss + z_loss
+        # D. Dispatch Loop (in-place add to output)
+        for i, expert in enumerate(self.experts):
+            # Create boolean mask for tokens that selected Expert 'i'
+            mask = top_k_indices == i
+            if mask.any():
+                # token_idx = which tokens, k_idx = 1st or 2nd choice
+                token_idx, k_idx = torch.where(mask)
+                # Gather inputs and compute
+                expert_input = x[token_idx]
+                expert_output = expert(expert_input)
+                # Apply routing weight
+                weight = top_k_weights[token_idx, k_idx].unsqueeze(-1)
+                weighted_output = (expert_output * weight).type_as(output)
+                # Scatter back in-place. token_idx has no duplicates within one expert
+                # (top-k indices are distinct per token), so each row is added once per call.
+                output.index_add_(0, token_idx, weighted_output)
+        return aux_loss
+    def get_aux_loss(self) -> torch.Tensor:
+        """Return the auxiliary loss (load balancing + router z-loss) from the last forward pass."""
+        return self.last_aux_loss
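Review note: the routing math in `_forward_sparse` (softmax, top-k selection, weight renormalization, balance loss, z-loss) can be traced on small numbers. The block below is a pure-Python sketch that mirrors those formulas for illustration only. The function names are hypothetical, jitter is omitted, and the default coefficients are copied from the `getattr` defaults above.

```python
import math


def softmax(logits):
    """Max-subtracted softmax over one token's router logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits, k=2, eps=1e-6):
    """Return [(expert_index, renormalized_weight), ...] and the full probabilities."""
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen) + eps
    return [(i, probs[i] / norm) for i in chosen], probs


def balance_loss(all_probs, coef=0.01):
    """Squared deviation of mean expert probability from uniform, scaled by num_experts."""
    num_experts = len(all_probs[0])
    mean = [sum(p[e] for p in all_probs) / len(all_probs) for e in range(num_experts)]
    target = 1.0 / num_experts
    mse = sum((m - target) ** 2 for m in mean) / num_experts
    return coef * mse * num_experts


def z_loss(all_logits, coef=1e-4):
    """Mean squared logsumexp of the router logits."""
    def logsumexp(logits):
        peak = max(logits)
        return peak + math.log(sum(math.exp(v - peak) for v in logits))
    return coef * sum(logsumexp(row) ** 2 for row in all_logits) / len(all_logits)


route, probs = top_k_route([2.0, 1.0, 0.0, -1.0])
print([index for index, _ in route])  # [0, 1]
```

A perfectly uniform router gives a balance loss of zero, while a router that sends all probability mass to one of four experts gives `0.01 * 0.75`.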
+# =============================================================================
+# QFormer Projector (Granite-style)
+# =============================================================================
+class QFormerAudioProjector(nn.Module):
+    """
+    BLIP-2 QFormer projector with learnable queries.
+    Based on GraniteSpeechEncoderProjector - uses a QFormer model with learnable
+    query embeddings to compress and project audio encoder outputs. The audio
+    sequence is processed in windows and downsampled via cross-attention.
+    """
+    def __init__(self, config):
+        """Initialize QFormer projector.
+        Args:
+            config: ASRConfig with encoder_dim, llm_dim, qformer_* settings
+        """
+        super().__init__()
+        encoder_dim = config.encoder_dim
+        llm_dim = config.llm_dim
+        # Window and downsampling parameters (Granite defaults: window=15, downsample=5)
+        self.window_size = getattr(config, "qformer_window_size", 15)
+        self.downsample_rate = getattr(config, "downsample_rate", 5)
+        self.num_queries = self.window_size // self.downsample_rate
+        # QFormer hidden size (matches encoder for cross-attention)
+        qformer_hidden = getattr(config, "qformer_hidden_size", None) or encoder_dim
+        qformer_num_layers = getattr(config, "qformer_num_layers", 2)
+        qformer_num_heads = getattr(config, "qformer_num_heads", 16)
+        qformer_intermediate = getattr(config, "qformer_intermediate_size", None) or (
+            qformer_hidden * 4
+        )
+        # Learnable query embeddings (Granite uses std=1.0)
+        self.query = nn.Parameter(torch.zeros(1, self.num_queries, qformer_hidden))
+        self.query.data.normal_(mean=0.0, std=1.0)
+        # Optional projection if encoder dim != qformer hidden
+        if encoder_dim != qformer_hidden:
+            self.encoder_proj = nn.Linear(encoder_dim, qformer_hidden, bias=False)
+        else:
+            self.encoder_proj = None
+        # Configure QFormer to match Granite's exact config
+        qformer_config = Blip2QFormerConfig(
+            hidden_size=qformer_hidden,
+            num_hidden_layers=qformer_num_layers,
+            num_attention_heads=qformer_num_heads,
+            intermediate_size=qformer_intermediate,
+            encoder_hidden_size=qformer_hidden,
+            cross_attention_frequency=1,
+            # Granite-specific settings
+            hidden_act="gelu",
+            attention_probs_dropout_prob=0.1,
+            hidden_dropout_prob=0.1,
+            layer_norm_eps=1e-12,
+            initializer_range=0.02,
+        )
+        self.qformer = AutoModel.from_config(qformer_config)
+        # Final projection to LLM dimension (Granite uses bias=True)
+        self.linear = nn.Linear(qformer_hidden, llm_dim)
+    def get_output_length(self, input_length):
+        """Calculate output sequence length given input length.
+        Accepts either Python ints or torch tensors; uses ceiling division so
+        the formula is identical for both — math.ceil would block tensors.
+        """
+        nblocks = (input_length + self.window_size - 1) // self.window_size
+        return nblocks * self.num_queries
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        """
+        Args:
+            hidden_states: [batch_size, seq_len, encoder_dim]
+        Returns:
+            projected: [batch_size, num_output_tokens, llm_dim]
+        """
+        batch_size, seq_len, _ = hidden_states.size()
+        # Cast inputs to the dtype of the learnable queries
+        target_dtype = self.query.dtype
+        if hidden_states.dtype != target_dtype:
+            hidden_states = hidden_states.to(target_dtype)
+        # Optional encoder projection
+        if self.encoder_proj is not None:
+            hidden_states = self.encoder_proj(hidden_states)
+        # Compute number of windows and pad to fit
+        nblocks = math.ceil(seq_len / self.window_size)
+        pad = nblocks * self.window_size - seq_len
+        if pad > 0:
+            hidden_states = F.pad(hidden_states, (0, 0, 0, pad), "constant", 0)
+        # Reshape to process each window: [batch*nblocks, window_size, dim]
+        effective_batch = batch_size * nblocks
+        hidden_states = hidden_states.reshape(effective_batch, self.window_size, -1)
+        # Expand queries to match batch size
+        query_embeds = self.query.expand(effective_batch, -1, -1)
+        # QFormer cross-attention
+        query_output = self.qformer(
+            query_embeds=query_embeds,
+            encoder_hidden_states=hidden_states,
+            return_dict=True,
+        )
+        # Reshape back: [batch, nblocks * num_queries, hidden]
+        output_tokens = nblocks * self.num_queries
+        query_proj = query_output.last_hidden_state.view(batch_size, output_tokens, -1)
+        # Project to LLM dimension
+        return self.linear(query_proj)
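Review note: the windowing arithmetic in `QFormerAudioProjector` (number of windows, zero padding, output tokens) can be checked with integers alone. This is a minimal sketch assuming the defaults shown above (window size 15, downsample rate 5). `qformer_out_len` and `qformer_padding` are hypothetical helper names.

```python
def qformer_out_len(seq_len: int, window_size: int = 15, downsample_rate: int = 5) -> int:
    """Output tokens: one group of num_queries per (possibly padded) window."""
    num_queries = window_size // downsample_rate
    nblocks = (seq_len + window_size - 1) // window_size  # ceiling division
    return nblocks * num_queries


def qformer_padding(seq_len: int, window_size: int = 15) -> int:
    """Zero frames appended so the sequence splits into whole windows."""
    nblocks = (seq_len + window_size - 1) // window_size
    return nblocks * window_size - seq_len


print(qformer_out_len(100), qformer_padding(100))  # 21 5
```

Because the last window is padded instead of dropped, this projector rounds the output length up, unlike the frame-stacking projectors.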
+# =============================================================================
+# Projector Registry
+# =============================================================================
+PROJECTOR_CLASSES = {
+    "mlp": MLPAudioProjector,
+    "mosa": MOSAProjector,
+    "moe": MoEAudioProjector,
+    "qformer": QFormerAudioProjector,
+}