Training in progress - step 5000

Files changed (12)

.gitattributes +1 -0
README.md +199 -0
asr_config.py +225 -0
asr_modeling.py +860 -0
asr_pipeline.py +482 -0
asr_processing.py +131 -0
chat_template.jinja +89 -0
diarization.py +853 -0
preprocessor_config.json +19 -0
projectors.py +484 -0
tokenizer.json +3 -0
tokenizer_config.json +17 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text

 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

@@ -0,0 +1,199 @@

+---
+library_name: transformers
+tags: []
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]

asr_config.py ADDED

@@ -0,0 +1,225 @@

+from typing import Optional
+import transformers
+class ASRConfig(transformers.PretrainedConfig):
+    """Configuration class for the ASR model.
+    This config combines settings for:
+    - Audio encoder (GLM-ASR/Whisper)
+    - Text decoder (Qwen)
+    - Projector (MLP, MOSA, MoE, QFormer)
+    - Generation parameters
+    - Training options (SpecAugment, LoRA)
+    """
+    model_type = "asr_model"
+    is_composition = True
+    def __init__(
+        self,
+        audio_model_id: str = "zai-org/GLM-ASR-Nano-2512",
+        text_model_id: str = "Qwen/Qwen3-0.6B",
+        attn_implementation: str = "flash_attention_2",
+        model_dtype: str = "bfloat16",
+        num_beams: Optional[int] = None,
+        system_prompt: str = "You are a helpful assistant.",
+        encoder_dim: Optional[int] = None,
+        llm_dim: Optional[int] = None,
+        # Encoder conv layers: list of (padding, kernel_size, stride) tuples
+        # Default is Whisper/GLM-ASR structure: conv1(k=3,s=1,p=1) + conv2(k=3,s=2,p=1)
+        encoder_conv_layers: Optional[list] = None,
+        audio_sample_rate: int = 16000,
+        projector_pool_stride: int = 4,
+        downsample_rate: int = 5,  # Granite default
+        projector_hidden_dim: Optional[int] = None,
+        projector_type: str = "mlp",  # "mlp", "mosa", "moe", "qformer"
+        projector_num_layers: int = 2,  # Number of layers in MLP projector
+        projector_init_std: float = 0.02,  # Weight initialization std
+        projector_dropout: float = 0.0,  # Dropout rate for projector layers
+        # MoE-specific configuration
+        num_experts: int = 4,  # Number of experts in MoE projectors
+        num_experts_per_tok: int = 2,  # Top-k experts per token
+        router_aux_loss_coef: float = 0.01,  # Auxiliary loss coefficient for load balancing
+        # QFormer-specific configuration (Granite defaults)
+        qformer_window_size: int = 15,  # Window size for QFormer processing
+        qformer_hidden_size: Optional[int] = None,  # QFormer hidden size (defaults to encoder_dim)
+        qformer_num_layers: int = 2,  # Number of QFormer transformer layers
+        qformer_num_heads: int = 16,  # Number of attention heads in QFormer
+        qformer_intermediate_size: Optional[int] = None,  # FFN size (defaults to 4x hidden)
+        label_smoothing: float = 0.0,  # Label smoothing for cross-entropy loss
+        inference_warmup_tokens: int = 10,
+        # SpecAugment settings
+        use_specaugment: bool = False,
+        num_time_masks: int = 2,
+        time_mask_length: int = 10,
+        num_freq_masks: int = 0,
+        freq_mask_length: int = 10,
+        # LoRA configuration (for Stage 2 fine-tuning)
+        use_lora: bool = False,
+        lora_rank: int = 8,  # SALMONN default
+        lora_alpha: int = 32,  # SALMONN default (scaling factor 4.0)
+        lora_dropout: float = 0.0,
+        lora_target_modules: Optional[list] = None,  # Default: all linear layers
+        freeze_projector: bool = False,  # True for Stage 2 (LoRA-only training)
+        max_new_tokens: Optional[int] = None,
+        min_new_tokens: Optional[int] = None,
+        repetition_penalty: Optional[float] = None,
+        length_penalty: Optional[float] = None,
+        no_repeat_ngram_size: Optional[int] = None,
+        use_cache: Optional[bool] = None,
+        **kwargs,
+    ):
+        """Initialize ASR model configuration.
+        Args:
+            audio_model_id: HuggingFace model ID for audio encoder (GLM-ASR/Whisper)
+            text_model_id: HuggingFace model ID for text decoder (Qwen)
+            attn_implementation: Attention implementation ("flash_attention_2", "sdpa", "eager")
+            model_dtype: Model dtype ("bfloat16", "float16", "float32")
+            projector_type: Projector architecture ("mlp", "mosa", "moe", "qformer")
+            use_lora: Enable LoRA adapters for Stage 2 fine-tuning
+            use_specaugment: Enable SpecAugment data augmentation
+        """
+        # Set default generation parameters (greedy decoding unless num_beams is overridden)
+        generation_defaults = {
+            "num_beams": 1,
+            "max_new_tokens": 128,
+            "min_new_tokens": 0,
+            "repetition_penalty": 1.0,
+            "length_penalty": 1.0,
+            "no_repeat_ngram_size": 0,  # 0 disables n-gram blocking; e.g. 3 would block repeats like "so so so"
+            "use_cache": True,
+        }
+        # Apply defaults (config.json values take precedence)
+        kwargs = {**generation_defaults, **kwargs}
+        self.audio_model_id = audio_model_id
+        self.text_model_id = text_model_id
+        self.attn_implementation = attn_implementation
+        self.model_dtype = model_dtype
+        self.system_prompt = system_prompt
+        self.encoder_dim = encoder_dim
+        self.llm_dim = llm_dim
+        # Default conv layers for Whisper/GLM-ASR: [(pad, kernel, stride), ...]
+        self.encoder_conv_layers = encoder_conv_layers or [(1, 3, 1), (1, 3, 2)]
+        self.audio_sample_rate = audio_sample_rate
+        self.projector_init_std = projector_init_std
+        self.projector_pool_stride = projector_pool_stride
+        self.downsample_rate = downsample_rate
+        self.projector_hidden_dim = projector_hidden_dim
+        self.projector_type = projector_type
+        self.projector_num_layers = projector_num_layers
+        self.projector_dropout = projector_dropout
+        # MoE-specific configuration
+        self.num_experts = num_experts
+        self.num_experts_per_tok = num_experts_per_tok
+        self.router_aux_loss_coef = router_aux_loss_coef
+        # QFormer-specific configuration
+        self.qformer_window_size = qformer_window_size
+        self.qformer_hidden_size = qformer_hidden_size
+        self.qformer_num_layers = qformer_num_layers
+        self.qformer_num_heads = qformer_num_heads
+        self.qformer_intermediate_size = qformer_intermediate_size
+        self.label_smoothing = label_smoothing
+        self.inference_warmup_tokens = inference_warmup_tokens
+        # SpecAugment configuration
+        self.use_specaugment = use_specaugment
+        self.num_time_masks = num_time_masks
+        self.time_mask_length = time_mask_length
+        self.num_freq_masks = num_freq_masks
+        self.freq_mask_length = freq_mask_length
+        # LoRA configuration
+        self.use_lora = use_lora
+        self.lora_rank = lora_rank
+        self.lora_alpha = lora_alpha
+        self.lora_dropout = lora_dropout
+        self.lora_target_modules = lora_target_modules or [
+            "q_proj",
+            "k_proj",
+            "v_proj",
+            "o_proj",
+            "gate_proj",
+            "up_proj",
+            "down_proj",
+        ]
+        self.freeze_projector = freeze_projector
+        # Generation parameters (use explicit value if provided, else use default)
+        self.num_beams = num_beams if num_beams is not None else generation_defaults["num_beams"]
+        self.max_new_tokens = (
+            max_new_tokens if max_new_tokens is not None else generation_defaults["max_new_tokens"]
+        )
+        self.min_new_tokens = (
+            min_new_tokens if min_new_tokens is not None else generation_defaults["min_new_tokens"]
+        )
+        self.repetition_penalty = (
+            repetition_penalty
+            if repetition_penalty is not None
+            else generation_defaults["repetition_penalty"]
+        )
+        self.length_penalty = (
+            length_penalty if length_penalty is not None else generation_defaults["length_penalty"]
+        )
+        self.no_repeat_ngram_size = (
+            no_repeat_ngram_size
+            if no_repeat_ngram_size is not None
+            else generation_defaults["no_repeat_ngram_size"]
+        )
+        self.use_cache = use_cache if use_cache is not None else generation_defaults["use_cache"]
+        if "audio_config" not in kwargs:
+            self.audio_config = transformers.AutoConfig.from_pretrained(audio_model_id)
+            # Override dtype to match model_dtype
+            self.audio_config.dtype = model_dtype
+        else:
+            self.audio_config = kwargs.pop("audio_config")
+        if "text_config" not in kwargs:
+            self.text_config = transformers.AutoConfig.from_pretrained(
+                text_model_id, trust_remote_code=True
+            )
+            # Override dtype to match model_dtype
+            self.text_config.dtype = model_dtype
+        else:
+            self.text_config = kwargs.pop("text_config")
+        if isinstance(self.text_config, dict):
+            # Reconstruct config from dict using the model_type stored in the dict
+            model_type = self.text_config["model_type"]
+            config_class = transformers.AutoConfig.for_model(model_type).__class__
+            self.text_config = config_class(**self.text_config)
+        if isinstance(self.audio_config, dict):
+            model_type = self.audio_config.get("model_type")
+            if model_type:
+                config_class = transformers.AutoConfig.for_model(model_type).__class__
+                self.audio_config = config_class(**self.audio_config)
+        super().__init__(**kwargs)
+        # Point encoder to audio_config so pipeline uses correct feature extractor
+        # The pipeline looks for config.encoder._name_or_path for feature extractor
+        self.encoder = self.audio_config
+        self.auto_map = {
+            "AutoConfig": "asr_config.ASRConfig",
+            "AutoModel": "asr_modeling.ASRModel",
+            "AutoModelForSpeechSeq2Seq": "asr_modeling.ASRModel",
+            "AutoProcessor": "asr_processing.ASRProcessor",
+        }
+        self.custom_pipelines = {
+            "automatic-speech-recognition": {
+                "impl": "asr_pipeline.ASRPipeline",
+                "pt": ["AutoModelForSpeechSeq2Seq"],
+                "tf": [],
+                "type": "audio",
+            }
+        }
+        self.architectures = ["ASRModel"]
+        self.pipeline_tag = "automatic-speech-recognition"
+transformers.AutoConfig.register("asr_model", ASRConfig)
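
Review note on the generation parameters in `asr_config.py`: each one is resolved with an `is not None` check rather than `or`. That distinction matters because `0` and `False` are meaningful values for `min_new_tokens`, `no_repeat_ngram_size`, and `use_cache`. The sketch below isolates the pattern in plain Python. The `resolve` helper is hypothetical and not part of this commit; the default values are copied from the `generation_defaults` dict in the diff.

```python
from typing import Any, Optional

# Defaults copied from generation_defaults in asr_config.py.
GENERATION_DEFAULTS = {
    "num_beams": 1,
    "max_new_tokens": 128,
    "min_new_tokens": 0,
    "repetition_penalty": 1.0,
    "length_penalty": 1.0,
    "no_repeat_ngram_size": 0,
    "use_cache": True,
}


def resolve(name: str, value: Optional[Any]) -> Any:
    """Hypothetical helper: keep an explicit value, fall back only on None."""
    return value if value is not None else GENERATION_DEFAULTS[name]


# An explicit False survives the `is not None` check...
explicit_cache = resolve("use_cache", False)
# ...whereas a plain `or` would silently replace it with the default.
or_based_cache = False or GENERATION_DEFAULTS["use_cache"]
# None means "not provided", so the default applies.
default_tokens = resolve("max_new_tokens", None)
```

With `or`, a caller passing `use_cache=False` would still end up with caching enabled, which is presumably why the config spells out the longer form.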

asr_modeling.py ADDED

@@ -0,0 +1,860 @@

+import json
+from pathlib import Path
+from threading import Thread
+from typing import Iterator, Optional, Union
+import torch
+import torch.nn as nn
+from transformers import (
+    AutoConfig,
+    AutoModel,
+    AutoModelForCausalLM,
+    AutoTokenizer,
+    PreTrainedModel,
+    TextIteratorStreamer,
+)
+from transformers.generation import GenerationMixin
+from transformers.modeling_outputs import CausalLMOutputWithPast
+try:
+    from .asr_config import ASRConfig
+    from .projectors import PROJECTOR_CLASSES
+except ImportError:
+    from asr_config import ASRConfig  # type: ignore[no-redef]
+    from projectors import PROJECTOR_CLASSES  # type: ignore[no-redef]
+from torchaudio.transforms import SpecAugment
+class ASRModel(PreTrainedModel, GenerationMixin):
+    """Audio-to-text model combining an audio encoder, projector, and language model."""
+    config_class = ASRConfig
+    base_model_prefix = "model"
+    main_input_name = "input_features"
+    _supports_flash_attn_2 = True
+    supports_gradient_checkpointing = True
+    _is_loading_from_pretrained: bool = False
+    _pretrained_model_path: Optional[str] = None
+    TRANSCRIBE_PROMPT = "Transcribe speech to text"  # Audio tokens come BEFORE this
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path: str, *args, **kwargs) -> "ASRModel":
+        """Load model from pretrained, handling device placement correctly."""
+        from safetensors.torch import load_file
+        from transformers.utils.hub import cached_file
+        config = kwargs.pop("config", None)
+        if config is None:
+            config = ASRConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
+        # Set flag to avoid device_map="auto" in sub-model loaders
+        cls._is_loading_from_pretrained = True
+        cls._pretrained_model_path = pretrained_model_name_or_path
+        try:
+            model = cls(config, **kwargs)
+            # Load projector weights from safetensors
+            subfolder = kwargs.get("subfolder")
+            revision = kwargs.get("revision")
+            cache_kwargs = {}
+            if subfolder:
+                cache_kwargs["subfolder"] = subfolder
+            if revision:
+                cache_kwargs["revision"] = revision
+            model_file = cached_file(
+                pretrained_model_name_or_path,
+                "model.safetensors",
+                _raise_exceptions_for_missing_entries=False,
+                **cache_kwargs,
+            )
+            if model_file is not None:
+                state_dict = load_file(model_file)
+                model.load_state_dict(state_dict, strict=False)
+            # Load LoRA adapters if use_lora is enabled
+            if getattr(config, "use_lora", False):
+                # Check for adapter_config.json (required by PEFT to load adapters)
+                adapter_config_file = cached_file(
+                    pretrained_model_name_or_path,
+                    "adapter_config.json",
+                    _raise_exceptions_for_missing_entries=False,
+                    **cache_kwargs,
+                )
+                if adapter_config_file is not None:
+                    # Load saved adapter weights using the original repo_id/path
+                    # PEFT handles Hub downloads and caching internally
+                    from peft import PeftModel
+                    model.language_model = PeftModel.from_pretrained(
+                        model.language_model,
+                        pretrained_model_name_or_path,
+                        is_trainable=True,
+                        **cache_kwargs,
+                    )
+                else:
+                    # No saved adapters - initialize fresh LLM LoRA for training
+                    from peft import LoraConfig, get_peft_model
+                    lora_config = LoraConfig(
+                        r=config.lora_rank,
+                        lora_alpha=config.lora_alpha,
+                        target_modules=config.lora_target_modules,
+                        lora_dropout=config.lora_dropout,
+                        bias="none",
+                        task_type="CAUSAL_LM",
+                    )
+                    model.language_model = get_peft_model(model.language_model, lora_config)
+            return model
+        finally:
+            cls._is_loading_from_pretrained = False
+            cls._pretrained_model_path = None
+    def __init__(self, config: ASRConfig, **kwargs) -> None:
+        super().__init__(config)
+        self.system_prompt = config.system_prompt
+        target_dtype = getattr(torch, config.model_dtype)
+        # Audio encoder (frozen)
+        self.audio_tower = self._load_audio_encoder(config, target_dtype)
+        # Language model (frozen)
+        self.language_model = self._load_language_model(config, target_dtype)
+        # Initialize tokenizer and special tokens
+        self._init_tokenizer(config)
+        # Set up generation config with greedy decoding defaults
+        self.generation_config = self.language_model.generation_config
+        self.generation_config.max_new_tokens = config.max_new_tokens
+        self.generation_config.min_new_tokens = config.min_new_tokens
+        self.generation_config.num_beams = config.num_beams
+        self.generation_config.do_sample = False
+        # Clear sampling params (inherited from LLM) since we use greedy decoding
+        self.generation_config.temperature = None
+        self.generation_config.top_p = None
+        self.generation_config.top_k = None
+        self.generation_config.use_cache = config.use_cache
+        self.generation_config.length_penalty = config.length_penalty
+        self.generation_config.repetition_penalty = config.repetition_penalty
+        self.generation_config.no_repeat_ngram_size = config.no_repeat_ngram_size
+        self.generation_config.eos_token_id = [
+            self.tokenizer.convert_tokens_to_ids("<|im_end|>"),
+            self.tokenizer.convert_tokens_to_ids("<|endoftext|>"),
+        ]
+        self.generation_config.pad_token_id = self.tokenizer.pad_token_id
+        # Feature extractor for audio preprocessing
+        self.feature_extractor = self._create_feature_extractor(config)
+        # Audio projector (trainable unless freeze_projector is set)
+        self.projector = self._create_projector(config, target_dtype)
+        # Setup LoRA if enabled (Stage 2 fine-tuning)
+        # Skip if loading from pretrained - from_pretrained will handle adapter loading
+        if getattr(config, "use_lora", False) and not getattr(
+            self.__class__, "_is_loading_from_pretrained", False
+        ):
+            self._setup_lora(config)
+        # Freeze projector if specified (for Stage 2 LoRA-only training)
+        if getattr(config, "freeze_projector", False):
+            self.projector.requires_grad_(False)
+        # SpecAugment for data augmentation during training
+        if getattr(config, "use_specaugment", False):
+            self.spec_augment = SpecAugment(
+                n_time_masks=config.num_time_masks,
+                time_mask_param=config.time_mask_length,
+                n_freq_masks=config.num_freq_masks,
+                freq_mask_param=config.freq_mask_length,
+            )
+        else:
+            self.spec_augment = None
+        # For model parallelism
+        self._no_split_modules = getattr(self.language_model, "_no_split_modules", [])
+    def _create_feature_extractor(self, config: ASRConfig):
+        """Create the appropriate feature extractor for the audio encoder."""
+        from transformers import AutoFeatureExtractor
+        feature_extractor = AutoFeatureExtractor.from_pretrained(config.audio_model_id)
+        # Disable padding by default - use actual audio length
+        feature_extractor.padding = False
+        return feature_extractor
+    @classmethod
+    def _load_audio_encoder(cls, config: ASRConfig, dtype: torch.dtype) -> nn.Module:
+        """Load and freeze the audio encoder."""
+        encoder_kwargs = {
+            "attn_implementation": config.attn_implementation,
+            "low_cpu_mem_usage": True,
+            "dtype": dtype,
+        }
+        if "whisper" in config.audio_model_id.lower():
+            from transformers import WhisperModel
+            full_model = WhisperModel.from_pretrained(config.audio_model_id, **encoder_kwargs)
+            encoder = full_model.encoder
+            del full_model
+        elif "glm" in config.audio_model_id.lower():
+            # GLM-ASR models use audio_tower as the encoder
+            # Requires transformers >= 5.x or installed from source
+            from transformers import AutoModelForSeq2SeqLM
+            full_model = AutoModelForSeq2SeqLM.from_pretrained(
+                config.audio_model_id, trust_remote_code=True, **encoder_kwargs
+            )
+            # GLM stores encoder at audio_tower (GlmAsrEncoder)
+            encoder = full_model.audio_tower
+            # Clear references to free VRAM from the LLM decoder
+            full_model.language_model = None
+            full_model.multi_modal_projector = None
+            del full_model
+        else:
+            encoder = AutoModel.from_pretrained(config.audio_model_id, **encoder_kwargs)
+        encoder.requires_grad_(False)
+        encoder.eval()
+        return encoder
+    @classmethod
+    def _load_language_model(cls, config: ASRConfig, dtype: torch.dtype) -> PreTrainedModel:
+        """Load and freeze the language model."""
+        decoder_kwargs = {
+            "attn_implementation": config.attn_implementation,
+            "trust_remote_code": True,
+            "tie_word_embeddings": False,
+            "low_cpu_mem_usage": True,
+            "dtype": dtype,
+        }
+        decoder = AutoModelForCausalLM.from_pretrained(config.text_model_id, **decoder_kwargs)
+        decoder.config.use_cache = getattr(config, "use_cache", True)
+        decoder.requires_grad_(False)
+        decoder.eval()
+        return decoder
+    def _create_projector(self, config: ASRConfig, dtype: torch.dtype) -> nn.Module:
+        """Create the trainable audio projector."""
+        # Auto-detect dimensions if not specified
+        if config.encoder_dim is None:
+            enc_cfg = self.audio_tower.config
+            config.encoder_dim = getattr(enc_cfg, "hidden_size", None) or getattr(
+                enc_cfg, "d_model", None
+            )
+            if config.encoder_dim is None:
+                raise ValueError("Could not auto-detect encoder_dim. Please specify in config.")
+        if config.llm_dim is None:
+            dec_cfg = self.language_model.config
+            config.llm_dim = getattr(dec_cfg, "hidden_size", None) or getattr(
+                dec_cfg, "d_model", None
+            )
+            if config.llm_dim is None:
+                raise ValueError("Could not auto-detect llm_dim. Please specify in config.")
+        # Select projector type based on config
+        projector_type = getattr(config, "projector_type", "mlp")
+        projector_class = PROJECTOR_CLASSES.get(projector_type)
+        if projector_class is None:
+            raise ValueError(
+                f"Unknown projector_type: {projector_type}. "
+                f"Valid options: {list(PROJECTOR_CLASSES.keys())}"
+            )
+        projector = projector_class(config)
+        # Move projector to same device as language model (important when using quantization)
+        device = next(self.language_model.parameters()).device
+        return projector.to(device=device, dtype=dtype)
+    def _setup_lora(self, config: ASRConfig):
+        """Apply LoRA adapters to the language model for Stage 2 fine-tuning."""
+        from peft import LoraConfig, get_peft_model
+        lora_config = LoraConfig(
+            r=config.lora_rank,
+            lora_alpha=config.lora_alpha,
+            target_modules=config.lora_target_modules,
+            lora_dropout=config.lora_dropout,
+            bias="none",
+            task_type="CAUSAL_LM",
+        )
+        self.language_model = get_peft_model(self.language_model, lora_config)
+    def _init_tokenizer(self, config: ASRConfig):
+        """Initialize tokenizer with audio token."""
+        self.tokenizer = AutoTokenizer.from_pretrained(config.text_model_id, trust_remote_code=True)
+        # Set pad token
+        if (
+            self.tokenizer.pad_token is None
+            or self.tokenizer.pad_token_id == self.tokenizer.eos_token_id
+        ) and "<|finetune_right_pad_id|>" in self.tokenizer.get_vocab():
+            self.tokenizer.pad_token = "<|finetune_right_pad_id|>"
+        # Add audio token
+        existing_special = getattr(self.tokenizer, "additional_special_tokens", None) or []
+        if "<audio>" not in existing_special:
+            self.tokenizer.add_special_tokens(
+                {"additional_special_tokens": existing_special + ["<audio>"]}
+            )
+            self.language_model.resize_token_embeddings(len(self.tokenizer), mean_resizing=False)
+        self.audio_token_id = self.tokenizer.convert_tokens_to_ids("<audio>")
+        self.tokenizer.padding_side = "right"
+        # Sync token IDs to configs
+        for cfg in [self.config.text_config, self.language_model.config, self.generation_config]:
+            if cfg is not None:
+                cfg.pad_token_id = self.tokenizer.pad_token_id
+                cfg.eos_token_id = self.tokenizer.eos_token_id
+                cfg.bos_token_id = self.tokenizer.bos_token_id
+    def _init_weights(self, _module):
+        """Weight initialization (projector weights are initialized in MoEAudioProjector)."""
+        pass
+    def _set_gradient_checkpointing(self, enable: bool = True, gradient_checkpointing_func=None):
+        """Enable/disable gradient checkpointing for the language model."""
+        # The LLM still stores activations during forward for backprop to projector
+        # Gradient checkpointing trades compute for memory by recomputing activations
+        if hasattr(self.language_model, "_set_gradient_checkpointing"):
+            self.language_model._set_gradient_checkpointing(enable, gradient_checkpointing_func)
+        elif hasattr(self.language_model, "gradient_checkpointing_enable") and enable:
+            self.language_model.gradient_checkpointing_enable(
+                gradient_checkpointing_kwargs={"use_reentrant": False}
+            )
+        elif hasattr(self.language_model, "gradient_checkpointing_disable") and not enable:
+            self.language_model.gradient_checkpointing_disable()
+    def get_input_embeddings(self) -> nn.Module:
+        return self.language_model.get_input_embeddings()
+    def set_input_embeddings(self, value: nn.Module) -> None:
+        self.language_model.set_input_embeddings(value)
+    def get_output_embeddings(self) -> nn.Module:
+        return self.language_model.get_output_embeddings()
+    def set_output_embeddings(self, value: nn.Module) -> None:
+        self.language_model.set_output_embeddings(value)
+    def get_processor(self):
+        """Get the processor for this model."""
+        try:
+            from .asr_processing import ASRProcessor
+        except ImportError:
+            from asr_processing import ASRProcessor  # type: ignore[no-redef]
+        return ASRProcessor(
+            feature_extractor=self.feature_extractor,
+            tokenizer=self.tokenizer,
+            projector=self.projector,
+            encoder_conv_layers=self.config.encoder_conv_layers,
+        )
+    def state_dict(self, *args, **kwargs) -> dict[str, torch.Tensor]:
+        """Only save trainable projector weights."""
+        return {f"projector.{k}": v for k, v in self.projector.state_dict().items()}
+    def _compute_encoder_output_lengths(
+        self,
+        audio_attention_mask: torch.Tensor,
+    ) -> torch.Tensor:
+        """Compute per-sample encoder output lengths using conv layer formulas.
+        Args:
+            audio_attention_mask: Mask indicating real vs padded mel frames (batch, mel_len)
+        Returns:
+            Tensor of encoder output lengths per sample (batch,)
+        """
+        # Get mel frame lengths from attention mask
+        lengths = audio_attention_mask.sum(dim=-1)
+        # Apply conv layer formulas: output = (input + 2*pad - (kernel-1) - 1) // stride + 1
+        for padding, kernel_size, stride in self.config.encoder_conv_layers:
+            lengths = (lengths + 2 * padding - (kernel_size - 1) - 1) // stride + 1
+        return lengths
+    def _encode_audio(
+        self,
+        audio_features: torch.Tensor,
+        audio_attention_mask: torch.Tensor,
+    ) -> torch.Tensor:
+        """Encode audio and project to LLM embedding space.
+        Args:
+            audio_features: Mel spectrogram features (batch, n_mels, mel_len)
+            audio_attention_mask: Mask indicating real vs padded mel frames (batch, mel_len)
+        Returns:
+            Flattened audio embeddings of shape (total_audio_tokens, hidden_dim).
+        """
+        with torch.no_grad():
+            encoder_out = self.audio_tower(input_features=audio_features)
+            hidden_states = encoder_out.last_hidden_state
+        # Compute per-sample encoder output lengths using conv formulas
+        encoder_lengths = self._compute_encoder_output_lengths(audio_attention_mask)
+        # Project to LLM space
+        audio_embeds = self.projector(hidden_states)
+        # Compute per-sample projector output lengths
+        projector_lengths = torch.tensor(
+            [self.projector.get_output_length(int(length.item())) for length in encoder_lengths],
+            device=audio_embeds.device,
+        )
+        # Create valid mask for variable-length samples and extract only real embeddings
+        max_len = audio_embeds.shape[1]
+        valid_mask = (
+            torch.arange(max_len, device=audio_embeds.device)[None, :] < projector_lengths[:, None]
+        )
+        return audio_embeds[valid_mask]
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        input_features: Optional[torch.Tensor] = None,
+        audio_attention_mask: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        past_key_values: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        use_cache: Optional[bool] = None,
+        cache_position: Optional[torch.Tensor] = None,
+        **kwargs,
+    ) -> CausalLMOutputWithPast:
+        """Forward pass for training and inference."""
+        # Get text embeddings if not provided
+        if inputs_embeds is None:
+            inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
+        if input_features is not None and input_ids is not None:
+            # Apply SpecAugment during training if enabled
+            if self.training and self.spec_augment is not None:
+                input_features = self.spec_augment(input_features)
+            # Encode audio -> flattened (total_audio_tokens, hidden_dim)
+            audio_embeds = self._encode_audio(input_features, audio_attention_mask)
+            # Replace <audio> token placeholders with audio embeddings using masked_scatter
+            audio_token_mask = (input_ids == self.audio_token_id).unsqueeze(-1)
+            inputs_embeds = inputs_embeds.masked_scatter(
+                audio_token_mask.to(inputs_embeds.device),
+                audio_embeds.to(inputs_embeds.device, dtype=inputs_embeds.dtype),
+            )
+        # Run through language model (let it compute loss if labels provided)
+        outputs = self.language_model(
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            labels=labels,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            **kwargs,
+        )
+        # Add auxiliary loss from MoE projectors if available
+        if outputs.loss is not None and hasattr(self.projector, "get_aux_loss"):
+            aux_loss = self.projector.get_aux_loss()
+            if aux_loss is not None and aux_loss.numel() > 0:
+                outputs.loss = outputs.loss + aux_loss.to(outputs.loss.device)
+        return outputs
+    def prepare_inputs_for_generation(self, *args, **kwargs):
+        """Prepare inputs for generation, handling audio features for cached decoding."""
+        input_features = kwargs.pop("input_features", None)
+        cache_position = kwargs.get("cache_position")
+        model_inputs = self.language_model.prepare_inputs_for_generation(*args, **kwargs)
+        # Only pass audio features on the first generation step (cache_position[0] == 0)
+        if cache_position is not None and cache_position[0] == 0 and input_features is not None:
+            model_inputs["input_features"] = input_features
+        return model_inputs
+    def _get_num_audio_tokens(
+        self,
+        audio_attention_mask: torch.Tensor,
+    ) -> int:
+        """Calculate number of audio tokens based on actual audio length.
+        Uses attention mask to get real audio length, then computes:
+        mel_frames -> encoder_frames (via conv formulas) -> projector output tokens
+        """
+        encoder_lengths = self._compute_encoder_output_lengths(audio_attention_mask)
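+        # Use the batch max so every row gets the same number of <audio> placeholders.
+        # This assumes equal-length audio within a batch: _encode_audio keeps only each
+        # sample's valid frames, so shorter samples would supply fewer embeddings than
+        # placeholders and masked_scatter would fail.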
+        encoder_output_len = int(encoder_lengths.max().item())
+        return int(self.projector.get_output_length(encoder_output_len))
+    @torch.no_grad()
+    def generate(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        input_features: Optional[torch.Tensor] = None,
+        audio_attention_mask: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        system_prompt: Optional[str] = None,
+        **generate_kwargs,
+    ) -> torch.Tensor:
+        """Generate transcription from audio input.
+        Can be called in two ways:
+        1. With input_ids containing <audio> tokens (from processor)
+        2. With just audio, and we build the prompt internally
+        """
+        if input_features is None:
+            raise ValueError("input_features required for generation")
+        if audio_attention_mask is None:
+            raise ValueError("audio_attention_mask required for generation")
+        device = input_features.device
+        batch_size = input_features.shape[0]
+        # Encode audio -> flattened embeddings
+        audio_embeds = self._encode_audio(input_features, audio_attention_mask)
+        # If input_ids not provided, build prompt with correct number of audio tokens
+        if input_ids is None:
+            num_audio_tokens = self._get_num_audio_tokens(audio_attention_mask)
+            audio_placeholder = "<audio>" * num_audio_tokens
+            system_prompt = system_prompt or self.system_prompt
+            messages: list[dict[str, str]] = []
+            if system_prompt:
+                messages.append({"role": "system", "content": system_prompt})
+            # Audio BEFORE prompt for proper causal attention
+            messages.append(
+                {"role": "user", "content": audio_placeholder + " " + self.TRANSCRIBE_PROMPT}
+            )
+            chat_result = self.tokenizer.apply_chat_template(
+                messages,
+                tokenize=True,
+                add_generation_prompt=True,
+                return_tensors="pt",
+                return_dict=True,  # Always return a BatchEncoding so .input_ids exists
+                enable_thinking=False,  # Disable Qwen3 thinking mode for ASR
+            )
+            input_ids = chat_result.input_ids.to(device)
+            if input_ids.dim() == 1:
+                input_ids = input_ids.unsqueeze(0)
+            if input_ids.shape[0] == 1 and batch_size > 1:
+                input_ids = input_ids.expand(batch_size, -1)
+            attention_mask = torch.ones_like(input_ids)
+        # Get text embeddings and replace audio tokens with audio embeddings
+        inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
+        audio_token_mask = (input_ids == self.audio_token_id).unsqueeze(-1)
+        inputs_embeds = inputs_embeds.masked_scatter(
+            audio_token_mask.to(inputs_embeds.device),
+            audio_embeds.to(inputs_embeds.device, dtype=inputs_embeds.dtype),
+        )
+        # Generate using language model
+        output = self.language_model.generate(
+            inputs_embeds=inputs_embeds,
+            attention_mask=attention_mask,
+            generation_config=self.generation_config,
+            **generate_kwargs,
+        )
+        # When using inputs_embeds without input_ids, generate returns only new tokens
+        if isinstance(output, torch.Tensor):
+            return output
+        return output.sequences
+    def generate_streaming(
+        self,
+        input_features: torch.Tensor,
+        audio_attention_mask: torch.Tensor,
+        system_prompt: Optional[str] = None,
+        **generate_kwargs,
+    ) -> Iterator[str]:
+        """Generate transcription with streaming token output.
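+        Yields partial transcript strings as tokens are generated.
+        Reduces time-to-first-word by streaming tokens as they're decoded.
+        Only a single sample (batch size 1) is supported, because
+        TextIteratorStreamer does not handle batched generation.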
+        Args:
+            input_features: Mel spectrogram features (batch, n_mels, mel_len)
+            audio_attention_mask: Mask for real vs padded mel frames (batch, mel_len)
+            system_prompt: Optional system prompt override
+            **generate_kwargs: Additional generation arguments
+        Yields:
+            Partial transcript text as each token is generated
+        """
+        device = input_features.device
+        batch_size = input_features.shape[0]
+        # Encode audio -> flattened embeddings (inference only, so skip autograd tracking)
+        with torch.no_grad():
+            audio_embeds = self._encode_audio(input_features, audio_attention_mask)
+        # Build prompt with correct number of audio tokens
+        num_audio_tokens = self._get_num_audio_tokens(audio_attention_mask)
+        audio_placeholder = "<audio>" * num_audio_tokens
+        system_prompt = system_prompt or self.system_prompt
+        messages: list[dict[str, str]] = []
+        if system_prompt:
+            messages.append({"role": "system", "content": system_prompt})
+        # Audio BEFORE prompt for proper causal attention
+        messages.append(
+            {"role": "user", "content": audio_placeholder + " " + self.TRANSCRIBE_PROMPT}
+        )
+        chat_result = self.tokenizer.apply_chat_template(
+            messages,
+            tokenize=True,
+            add_generation_prompt=True,
+            return_tensors="pt",
+            return_dict=True,  # Always return a BatchEncoding so .input_ids exists
+            enable_thinking=False,  # Disable Qwen3 thinking mode for ASR
+        )
+        input_ids = chat_result.input_ids.to(device)
+        if input_ids.dim() == 1:
+            input_ids = input_ids.unsqueeze(0)
+        if input_ids.shape[0] == 1 and batch_size > 1:
+            input_ids = input_ids.expand(batch_size, -1)
+        attention_mask = torch.ones_like(input_ids)
+        # Get text embeddings and replace audio tokens with audio embeddings
+        inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
+        audio_token_mask = (input_ids == self.audio_token_id).unsqueeze(-1)
+        inputs_embeds = inputs_embeds.masked_scatter(
+            audio_token_mask.to(inputs_embeds.device),
+            audio_embeds.to(inputs_embeds.device, dtype=inputs_embeds.dtype),
+        )
+        # Setup streamer for token-by-token output
+        streamer = TextIteratorStreamer(
+            self.tokenizer,
+            skip_prompt=True,
+            skip_special_tokens=True,
+        )
+        # Prepare generation kwargs
+        gen_kwargs = {
+            "inputs_embeds": inputs_embeds,
+            "attention_mask": attention_mask,
+            "generation_config": self.generation_config,
+            "streamer": streamer,
+            **generate_kwargs,
+        }
+        # Run generation in background thread
+        thread = Thread(target=self.language_model.generate, kwargs=gen_kwargs)
+        thread.start()
+        # Yield tokens as they're generated, filtering out <think>...</think> blocks
+        # Start assuming no think block - only filter when we see <think>
+        in_think_block = False
+        buffer = ""
+        for text in streamer:
+            buffer += text
+            # Check for think block start (in case model outputs think blocks)
+            while "<think>" in buffer:
+                in_think_block = True
+                # Yield any text before <think>
+                before_think = buffer.split("<think>")[0]
+                if before_think:
+                    yield before_think
+                buffer = buffer.split("<think>", 1)[-1]
+            # Check for think block end
+            while in_think_block and "</think>" in buffer:
+                in_think_block = False
+                buffer = buffer.split("</think>", 1)[-1]
+            # Yield text if not in think block
+            if not in_think_block and buffer:
+                yield buffer
+                buffer = ""
+        # Yield any remaining buffer
+        if buffer and not in_think_block:
+            yield buffer
+        thread.join()
+    @torch.no_grad()
+    def generate_text_only(
+        self,
+        messages: list[dict[str, str]],
+        max_new_tokens: int = 256,
+        **generate_kwargs,
+    ) -> str:
+        """Generate text using only the LLM (no audio encoding).
+        Used for SIFT-style response generation from metadata prompts.
+        Args:
+            messages: List of chat messages [{"role": "user", "content": "..."}]
+            max_new_tokens: Maximum tokens to generate
+            **generate_kwargs: Additional generation arguments
+        Returns:
+            Generated text response
+        """
+        device = next(self.language_model.parameters()).device
+        # Apply chat template
+        input_ids = self.tokenizer.apply_chat_template(
+            messages,
+            tokenize=True,
+            add_generation_prompt=True,
+            return_tensors="pt",
+            return_dict=False,  # Return the input_ids tensor itself, not a BatchEncoding
+            enable_thinking=False,
+        ).to(device)
+        if input_ids.dim() == 1:
+            input_ids = input_ids.unsqueeze(0)
+        attention_mask = torch.ones_like(input_ids)
+        # Generate using language model directly
+        output = self.language_model.generate(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            max_new_tokens=max_new_tokens,
+            do_sample=False,
+            pad_token_id=self.tokenizer.pad_token_id,
+            eos_token_id=self.tokenizer.eos_token_id,
+            **generate_kwargs,
+        )
+        # Decode only the new tokens
+        new_tokens = output[0, input_ids.shape[1] :]
+        response = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
+        return response.strip()
+    def save_pretrained(self, save_directory: Union[str, Path], **kwargs) -> None:
+        """Save model, tokenizer, and processor."""
+        import shutil
+        from pathlib import Path as PathlibPath
+        save_dir = PathlibPath(save_directory)
+        save_dir.mkdir(parents=True, exist_ok=True)
+        # Update config with actual vocab size
+        self.config.vocab_size = self.language_model.config.vocab_size
+        self.config.text_config.vocab_size = self.language_model.config.vocab_size
+        if hasattr(self.audio_tower.config, "num_mel_bins"):
+            self.config.audio_config.num_mel_bins = self.audio_tower.config.num_mel_bins
+        # Save model (temporarily remove non-serializable attributes)
+        tokenizer = self.tokenizer
+        del self.tokenizer
+        try:
+            super().save_pretrained(save_dir, **kwargs)
+        finally:
+            self.tokenizer = tokenizer
+        # Save tokenizer and feature extractor
+        self.tokenizer.save_pretrained(save_dir)
+        self.feature_extractor.save_pretrained(save_dir)
+        # Save LoRA adapters if present (creates adapter_model.safetensors and adapter_config.json)
+        # Don't save embedding layers - the <audio> token embedding is never used
+        # (it's replaced with projected audio embeddings before the LLM sees it)
+        if hasattr(self.language_model, "peft_config"):
+            self.language_model.save_pretrained(save_dir, save_embedding_layers=False)
+            # Clear base_model_name_or_path in adapter_config.json to prevent HF pipeline
+            # from redirecting to the base LLM repo (like Qwen) which breaks feature
+            # extractor loading for multimodal models. If a repo_id is provided, use that
+            # so the model can be loaded directly from the Hub.
+            adapter_config_path = save_dir / "adapter_config.json"
+            if adapter_config_path.exists():
+                with adapter_config_path.open() as f:
+                    adapter_config = json.load(f)
+                # Use repo_id if available, otherwise clear to prevent redirect.
+                # Use empty string instead of None to avoid str(None) -> "None" bug
+                # in some transformers/PEFT versions.
+                repo_id = (
+                    kwargs.get("repo_id")
+                    or kwargs.get("push_to_hub_model_id")
+                    or getattr(self.config, "pretrained_model_path", None)
+                    or ""  # Use empty string instead of None
+                )
+                adapter_config["base_model_name_or_path"] = repo_id
+                with adapter_config_path.open("w") as f:
+                    json.dump(adapter_config, f, indent=2)
+        # Add processor auto_map to preprocessor_config.json
+        config_path = save_dir / "preprocessor_config.json"
+        if config_path.exists():
+            with config_path.open() as f:
+                processor_config = json.load(f)
+        else:
+            processor_config = {}
+        processor_config.update(
+            {
+                "processor_class": "ASRProcessor",
+                "auto_map": {"AutoProcessor": "asr_processing.ASRProcessor"},
+            }
+        )
+        with config_path.open("w") as f:
+            json.dump(processor_config, f, indent=2)
+        # Copy source files for auto-loading
+        src_dir = PathlibPath(__file__).parent
+        for asr_file in src_dir.glob("asr_*.py"):
+            shutil.copy(asr_file, save_dir / asr_file.name)
+        # Copy projectors module
+        shutil.copy(src_dir / "projectors.py", save_dir / "projectors.py")
+        # Copy diarization module
+        shutil.copy(src_dir / "diarization.py", save_dir / "diarization.py")
+    def push_to_hub(self, repo_id: str, **kwargs) -> str:
+        """Push model to HuggingFace Hub, ensuring adapter_config points to repo.
+        IMPORTANT: Sets base_model_name_or_path in adapter_config.json to repo_id
+        so that transformers pipeline() can load the model correctly. Without this,
+        the pipeline tries to load from "None" which fails.
+        """
+        # Store repo_id in config so save_pretrained can access it
+        self.config.pretrained_model_path = repo_id
+        # Call parent's push_to_hub
+        return super().push_to_hub(repo_id, **kwargs)
+    def create_or_update_model_card(self, output_dir: Union[str, Path]) -> None:
+        """No-op for model card creation - we use MODEL_CARD.md in repo instead."""
+        pass
+# Register with transformers Auto classes
+AutoConfig.register("asr_model", ASRConfig)
+AutoModel.register(ASRConfig, ASRModel)
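
Note on `_compute_encoder_output_lengths` and `_encode_audio` in asr_modeling.py: the length arithmetic and the validity mask can be checked by hand with plain Python lists instead of tensors. This is only a sketch. The `(padding, kernel_size, stride)` triples are an assumption modeled on a Whisper-style two-layer convolutional front end (the real values come from `config.encoder_conv_layers`, which this diff does not show), and `conv_output_length` and `valid_mask` are hypothetical helper names.

```python
def conv_output_length(length, conv_layers):
    """Apply the 1-D convolution length formula (dilation 1) layer by layer."""
    for padding, kernel_size, stride in conv_layers:
        length = (length + 2 * padding - (kernel_size - 1) - 1) // stride + 1
    return length


def valid_mask(lengths, max_len):
    """Row i is True for the first lengths[i] positions and False for padding."""
    return [[pos < n for pos in range(max_len)] for n in lengths]


# Assumed layers as (padding, kernel_size, stride); not taken from the real config.
ASSUMED_CONV_LAYERS = [(1, 3, 1), (1, 3, 2)]

# Two samples with different numbers of real (unpadded) mel frames.
mel_lengths = [3000, 1000]
encoder_lengths = [conv_output_length(n, ASSUMED_CONV_LAYERS) for n in mel_lengths]
mask = valid_mask(encoder_lengths, max_len=max(encoder_lengths))

# Only True positions survive the flattening step in _encode_audio.
total_audio_tokens = sum(sum(row) for row in mask)
print(encoder_lengths, total_audio_tokens)
```

Under these assumed layers the total number of kept positions equals the sum of the per-sample lengths, which is why the number of `<audio>` placeholders in `input_ids` has to match that sum for `masked_scatter` to succeed.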

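Note on the `<think>` filtering in `generate_streaming` (asr_modeling.py): the loop there looks for complete tags in each buffered chunk. If a tag ever arrived split across two streamer chunks, the first half would be yielded before the tag is recognised. Whether the tokenizer and streamer can actually produce such a split is an assumption that has not been verified here. The sketch below shows one way to hold back a possible partial tag; `strip_think_blocks` and `_partial_suffix_len` are hypothetical names, not part of the repository.

```python
from typing import Iterable, Iterator

OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def _partial_suffix_len(text: str, tag: str) -> int:
    """Length of the longest proper prefix of `tag` that `text` ends with."""
    for size in range(min(len(tag) - 1, len(text)), 0, -1):
        if text.endswith(tag[:size]):
            return size
    return 0


def strip_think_blocks(chunks: Iterable[str]) -> Iterator[str]:
    """Yield streamed text with <think>...</think> spans removed."""
    buffer = ""
    in_think = False
    for chunk in chunks:
        buffer += chunk
        # Consume every complete tag currently in the buffer.
        while True:
            tag = CLOSE_TAG if in_think else OPEN_TAG
            index = buffer.find(tag)
            if index == -1:
                break
            if not in_think and index > 0:
                yield buffer[:index]  # Visible text before <think>
            buffer = buffer[index + len(tag):]
            in_think = not in_think
        # Hold back a trailing fragment that might be the start of the next tag.
        hold = _partial_suffix_len(buffer, CLOSE_TAG if in_think else OPEN_TAG)
        ready, buffer = buffer[: len(buffer) - hold], buffer[len(buffer) - hold:]
        if ready and not in_think:
            yield ready  # Text inside a think block is dropped instead
    # Flush whatever is left once the stream ends.
    if buffer and not in_think:
        yield buffer


pieces = list(strip_think_blocks(["Hello ", "<thi", "nk>secret</th", "ink> world", "!"]))
print("".join(pieces))
```

The trade-off is a little extra latency: a chunk ending in `<` is delayed until the next chunk shows whether it starts a tag.
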
asr_pipeline.py ADDED Viewed

	@@ -0,0 +1,482 @@

+"""ASR pipeline for audio-to-text transcription with optional timestamps and diarization."""
+import re
+from pathlib import Path
+from typing import Any
+import numpy as np
+import torch
+import transformers
+try:
+    from .asr_modeling import ASRModel
+except ImportError:
+    from asr_modeling import ASRModel  # type: ignore[no-redef]
+def _get_device() -> str:
+    """Get best available device for non-transformers models."""
+    if torch.cuda.is_available():
+        return "cuda"
+    if torch.backends.mps.is_available():
+        return "mps"
+    return "cpu"
+class ForcedAligner:
+    """Lazy-loaded forced aligner for word-level timestamps using torchaudio wav2vec2."""
+    _bundle = None
+    _model = None
+    _labels = None
+    _dictionary = None
+    @classmethod
+    def get_instance(cls, device: str = "cuda"):
+        """Get or create the forced alignment model (singleton).
+        Args:
+            device: Device to run model on ("cuda", "mps", or "cpu")
+        Returns:
+            Tuple of (model, labels, dictionary)
+        """
+        if cls._model is None:
+            import torchaudio
+            cls._bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
+            cls._model = cls._bundle.get_model().to(device)
+            cls._model.eval()
+            cls._labels = cls._bundle.get_labels()
+            cls._dictionary = {c: i for i, c in enumerate(cls._labels)}
+        return cls._model, cls._labels, cls._dictionary
+    @classmethod
+    def align(
+        cls,
+        audio: np.ndarray,
+        text: str,
+        sample_rate: int = 16000,
+        _language: str = "eng",
+        _batch_size: int = 16,
+    ) -> list[dict]:
+        """Align transcript to audio and return word-level timestamps.
+        Args:
+            audio: Audio waveform as numpy array
+            text: Transcript text to align
+            sample_rate: Audio sample rate (default 16000)
+            _language: ISO-639-3 language code (default "eng" for English, unused)
+            _batch_size: Batch size for alignment model (unused)
+        Returns:
+            List of dicts with 'word', 'start', 'end' keys
+        """
+        import torchaudio
+        from torchaudio.functional import forced_align, merge_tokens
+        device = _get_device()
+        model, labels, dictionary = cls.get_instance(device)
+        # Convert audio to tensor (copy to ensure array is writable)
+        if isinstance(audio, np.ndarray):
+            waveform = torch.from_numpy(audio.copy()).float()
+        else:
+            waveform = audio.clone().float()
+        # Ensure 2D (channels, time)
+        if waveform.dim() == 1:
+            waveform = waveform.unsqueeze(0)
+        # Resample if needed (wav2vec2 expects 16kHz)
+        if sample_rate != cls._bundle.sample_rate:
+            waveform = torchaudio.functional.resample(
+                waveform, sample_rate, cls._bundle.sample_rate
+            )
+        waveform = waveform.to(device)
+        # Get emissions from model
+        with torch.inference_mode():
+            emissions, _ = model(waveform)
+            emissions = torch.log_softmax(emissions, dim=-1)
+        emission = emissions[0].cpu()
+        # Normalize text: uppercase, keep only characters the aligner can emit.
+        # The CTC blank (index 0) and the "|" separator must never come from the text
+        # itself: forced_align rejects targets that contain the blank, and a stray
+        # separator would shift the word indices used below.
+        blank_id = 0
+        separator_id = dictionary["|"]
+        reserved = {blank_id, separator_id}
+        # Build tokens word by word so `words` stays in step with the "|" separators
+        words = []
+        tokens = []
+        for word in text.split():
+            word_tokens = [
+                dictionary[char]
+                for char in word.upper()
+                if char in dictionary and dictionary[char] not in reserved
+            ]
+            if not word_tokens:
+                continue  # Nothing alignable in this word (e.g. digits only)
+            if tokens:
+                tokens.append(separator_id)
+            tokens.extend(word_tokens)
+            words.append(word)
+        if not tokens:
+            return []
+        targets = torch.tensor([tokens], dtype=torch.int32)
+        # Run forced alignment
+        # Note: forced_align is deprecated in torchaudio 2.6+ and will be removed in 2.9 (late 2025)
+        # No official replacement announced yet. See https://github.com/pytorch/audio/issues/3902
+        aligned_tokens, scores = forced_align(emission.unsqueeze(0), targets, blank=0)
+        # Use torchaudio's merge_tokens to get token spans (removes blanks and merges repeats)
+        token_spans = merge_tokens(aligned_tokens[0], scores[0])
+        # Convert frame indices to time (model stride is 320 samples at 16kHz = 20ms)
+        frame_duration = 320 / cls._bundle.sample_rate
+        # Group token spans into words based on pipe separator
+        word_timestamps = []
+        current_word_start = None
+        current_word_end = None
+        word_idx = 0
+        for span in token_spans:
+            token_char = labels[span.token]
+            if token_char == "|":  # Word separator
+                if current_word_start is not None and word_idx < len(words):
+                    word_timestamps.append(
+                        {
+                            "word": words[word_idx],
+                            "start": current_word_start * frame_duration,
+                            "end": current_word_end * frame_duration,
+                        }
+                    )
+                    word_idx += 1
+                current_word_start = None
+                current_word_end = None
+            else:
+                if current_word_start is None:
+                    current_word_start = span.start
+                current_word_end = span.end
+        # Don't forget the last word
+        if current_word_start is not None and word_idx < len(words):
+            word_timestamps.append(
+                {
+                    "word": words[word_idx],
+                    "start": current_word_start * frame_duration,
+                    "end": current_word_end * frame_duration,
+                }
+            )
+        return word_timestamps
+try:
+    from .diarization import SpeakerDiarizer
+except ImportError:
+    from diarization import SpeakerDiarizer  # type: ignore[no-redef]
+# Re-export for backwards compatibility
+__all__ = ["ForcedAligner", "SpeakerDiarizer", "ASRPipeline"]
+class ASRPipeline(transformers.AutomaticSpeechRecognitionPipeline):
+    """ASR Pipeline for audio-to-text transcription."""
+    model: ASRModel
+    def __init__(self, model: ASRModel, **kwargs):
+        """Initialize ASR pipeline.
+        Args:
+            model: ASRModel instance for transcription
+            **kwargs: Additional arguments (feature_extractor, tokenizer, device)
+        """
+        feature_extractor = kwargs.pop("feature_extractor", None)
+        tokenizer = kwargs.pop("tokenizer", model.tokenizer)
+        if feature_extractor is None:
+            feature_extractor = model.get_processor().feature_extractor
+        super().__init__(
+            model=model, feature_extractor=feature_extractor, tokenizer=tokenizer, **kwargs
+        )
+        self._current_audio = None
+    def _sanitize_parameters(self, **kwargs):
+        """Intercept our custom parameters before parent class validates them."""
+        # Remove our custom parameters so parent doesn't see them
+        kwargs.pop("return_timestamps", None)
+        kwargs.pop("return_speakers", None)
+        kwargs.pop("num_speakers", None)
+        kwargs.pop("min_speakers", None)
+        kwargs.pop("max_speakers", None)
+        kwargs.pop("hf_token", None)
+        kwargs.pop("user_prompt", None)
+        kwargs.pop("diarization_backend", None)
+        return super()._sanitize_parameters(**kwargs)
+    def __call__(
+        self,
+        inputs,
+        **kwargs,
+    ):
+        """Transcribe audio with optional word-level timestamps and speaker diarization.
+        Args:
+            inputs: Audio input (file path, dict with array/sampling_rate, etc.)
+            return_timestamps: If True, return word-level timestamps using forced alignment
+            return_speakers: If True, return speaker labels for each word
+            user_prompt: Custom transcription prompt (default: "Transcribe: ")
+            num_speakers: Exact number of speakers (if known, for diarization)
+            min_speakers: Minimum number of speakers (for diarization)
+            max_speakers: Maximum number of speakers (for diarization)
+            hf_token: HuggingFace token for pyannote models (or set HF_TOKEN env var)
+            diarization_backend: Backend for diarization ("pyannote" or "local")
+            **kwargs: Additional arguments passed to the pipeline
+        Returns:
+            Dict with 'text' key, 'words' key if return_timestamps=True,
+            and speaker labels on words if return_speakers=True
+        """
+        # Extract our params before super().__call__ (which will also call _sanitize_parameters)
+        return_timestamps = kwargs.pop("return_timestamps", False)
+        return_speakers = kwargs.pop("return_speakers", False)
+        user_prompt = kwargs.pop("user_prompt", None)
+        diarization_params = {
+            "num_speakers": kwargs.pop("num_speakers", None),
+            "min_speakers": kwargs.pop("min_speakers", None),
+            "max_speakers": kwargs.pop("max_speakers", None),
+            "hf_token": kwargs.pop("hf_token", None),
+            "backend": kwargs.pop("diarization_backend", "pyannote"),
+        }
+        if return_speakers:
+            return_timestamps = True
+        # Set custom user prompt if provided
+        original_prompt = None
+        if user_prompt:
+            original_prompt = self.model.TRANSCRIBE_PROMPT
+            self.model.TRANSCRIBE_PROMPT = user_prompt
+        # Store audio for timestamp alignment and diarization
+        if return_timestamps or return_speakers:
+            self._current_audio = self._extract_audio(inputs)
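+        # Run standard transcription, restoring shared state if it fails
+        try:
+            result = super().__call__(inputs, **kwargs)
+        except Exception:
+            self._current_audio = None
+            if original_prompt is not None:
+                self.model.TRANSCRIBE_PROMPT = original_prompt
+            raise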
+        # Add timestamps if requested
+        if return_timestamps and self._current_audio is not None:
+            text = result.get("text", "")
+            if text:
+                try:
+                    words = ForcedAligner.align(
+                        self._current_audio["array"],
+                        text,
+                        sample_rate=self._current_audio.get("sampling_rate", 16000),
+                    )
+                    result["words"] = words
+                except Exception as e:
+                    result["words"] = []
+                    result["timestamp_error"] = str(e)
+            else:
+                result["words"] = []
+        # Add speaker diarization if requested
+        if return_speakers and self._current_audio is not None:
+            try:
+                # Run diarization
+                speaker_segments = SpeakerDiarizer.diarize(
+                    self._current_audio["array"],
+                    sample_rate=self._current_audio.get("sampling_rate", 16000),
+                    **{k: v for k, v in diarization_params.items() if v is not None},
+                )
+                result["speaker_segments"] = speaker_segments
+                # Assign speakers to words
+                if result.get("words"):
+                    result["words"] = SpeakerDiarizer.assign_speakers_to_words(
+                        result["words"],
+                        speaker_segments,
+                    )
+            except Exception as e:
+                result["speaker_segments"] = []
+                result["diarization_error"] = str(e)
+        # Clean up
+        self._current_audio = None
+        if original_prompt is not None:
+            self.model.TRANSCRIBE_PROMPT = original_prompt
+        return result
+    def _extract_audio(self, inputs) -> dict | None:
+        """Extract audio array from various input formats using HF utilities."""
+        from transformers.pipelines.audio_utils import ffmpeg_read
+        if isinstance(inputs, dict):
+            if "array" in inputs:
+                return {
+                    "array": inputs["array"],
+                    "sampling_rate": inputs.get("sampling_rate", 16000),
+                }
+            if "raw" in inputs:
+                return {
+                    "array": inputs["raw"],
+                    "sampling_rate": inputs.get("sampling_rate", 16000),
+                }
+        elif isinstance(inputs, str):
+            # File path - load audio using ffmpeg (same as HF pipeline)
+            with Path(inputs).open("rb") as f:
+                audio = ffmpeg_read(f.read(), sampling_rate=16000)
+            return {"array": audio, "sampling_rate": 16000}
+        elif isinstance(inputs, bytes):
+            audio = ffmpeg_read(inputs, sampling_rate=16000)
+            return {"array": audio, "sampling_rate": 16000}
+        elif isinstance(inputs, np.ndarray):
+            return {"array": inputs, "sampling_rate": 16000}
+        return None
+    def preprocess(self, inputs, **preprocess_params):
+        """Preprocess audio inputs for the model.
+        Args:
+            inputs: Audio input (dict with array, file path, etc.)
+            **preprocess_params: Additional preprocessing parameters
+        Yields:
+            Model input dicts with input_features and attention_mask
+        """
+        # Handle dict with "array" key (from datasets)
+        if isinstance(inputs, dict) and "array" in inputs:
+            inputs = {
+                "raw": inputs["array"],
+                "sampling_rate": inputs.get("sampling_rate", self.feature_extractor.sampling_rate),
+            }
+        for item in super().preprocess(inputs, **preprocess_params):
+            if "is_last" not in item:
+                item["is_last"] = True
+            yield item
+    def _forward(self, model_inputs, **generate_kwargs) -> dict[str, Any]:
+        """Run model forward pass to generate transcription.
+        Args:
+            model_inputs: Dict with input_features and attention_mask
+            **generate_kwargs: Generation parameters
+        Returns:
+            Dict with generated token IDs
+        """
+        # Extract audio features and is_last flag
+        is_last = model_inputs.pop("is_last", True) if isinstance(model_inputs, dict) else True
+        input_features = model_inputs["input_features"].to(self.model.device)
+        audio_attention_mask = model_inputs["attention_mask"].to(self.model.device)
+        generated_ids = self.model.generate(
+            input_features=input_features,
+            audio_attention_mask=audio_attention_mask,
+            **generate_kwargs,
+        )
+        return {"tokens": generated_ids, "is_last": is_last}
+    def postprocess(self, model_outputs, **kwargs) -> dict[str, str]:
+        """Convert model output tokens to text.
+        Args:
+            model_outputs: Dict with 'tokens' key containing generated IDs
+            **kwargs: Additional postprocessing parameters
+        Returns:
+            Dict with 'text' key containing transcription
+        """
+        # Handle list of outputs (from chunking)
+        if isinstance(model_outputs, list):
+            model_outputs = model_outputs[0] if model_outputs else {}
+        tokens = model_outputs.get("tokens")
+        if tokens is None:
+            return super().postprocess(model_outputs, **kwargs)
+        if torch.is_tensor(tokens):
+            tokens = tokens.cpu()
+            if tokens.dim() > 1:
+                tokens = tokens[0]
+        # Filter out eos tokens that the tokenizer doesn't recognize as special
+        # (generation_config.eos_token_id may differ from tokenizer.eos_token_id)
+        if hasattr(self, "model") and hasattr(self.model, "generation_config"):
+            eos_ids = self.model.generation_config.eos_token_id
+            if eos_ids is not None:
+                eos_set = set(eos_ids) if isinstance(eos_ids, list) else {eos_ids}
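+                # Accept tensors, arrays, or plain lists of token IDs
+                token_ids = tokens.tolist() if hasattr(tokens, "tolist") else list(tokens)
+                tokens = [t for t in token_ids if t not in eos_set]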
+        text = self.tokenizer.decode(tokens, skip_special_tokens=True).strip()
+        # Strip <think>...</think> tags (Qwen3 doesn't respect /no_think prompt)
+        text = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
+        # Truncate repetitions at end of text
+        text = _truncate_repetitions(text)
+        return {"text": text}
+def _truncate_repetitions(text: str, min_repeats: int = 3) -> str:
+    """Truncate repeated words/phrases/characters at end of text.
+    Detects patterns like:
+    - Repeated words: "the the the the" -> "the"
+    - Repeated phrases: "i am sorry i am sorry i am sorry" -> "i am sorry"
+    - Repeated characters: "444444" -> "4"
+    Args:
+        text: Input text to process
+        min_repeats: Minimum repetitions to trigger truncation (default 3)
+    Returns:
+        Text with trailing repetitions removed
+    """
+    if not text:
+        return text
+    # 1. Truncate repeated characters at end (e.g., "444444" -> "4")
+    char_pattern = re.compile(r"(.)\1{" + str(min_repeats - 1) + r",}$")
+    text = char_pattern.sub(r"\1", text)
+    # 2. Truncate repeated words at end (e.g., "the the the" -> "the")
+    word_pattern = re.compile(
+        r"\b(\w+)(?:\s+\1){" + str(min_repeats - 1) + r",}\s*$", re.IGNORECASE
+    )
+    while word_pattern.search(text):
+        text = word_pattern.sub(r"\1", text)
+    # 3. Truncate repeated phrases (2-20 words) at end
+    # e.g., "i am sorry i am sorry i am sorry" -> "i am sorry"
+    words = text.split()
+    if len(words) >= min_repeats * 2:
+        # Try phrase lengths from 2 to 20 words
+        for phrase_len in range(2, min(21, len(words) // min_repeats + 1)):
+            # Check if the last phrase_len words repeat
+            phrase = " ".join(words[-phrase_len:])
+            # Build pattern to match repeated phrases at end
+            phrase_escaped = re.escape(phrase)
+            phrase_pattern = re.compile(
+                r"(^|.*?\s)("
+                + phrase_escaped
+                + r")(?:\s+"
+                + phrase_escaped
+                + r"){"
+                + str(min_repeats - 1)
+                + r",}\s*$",
+                re.IGNORECASE,
+            )
+            match = phrase_pattern.match(text)
+            if match:
+                # Keep prefix + one instance of the phrase
+                text = (match.group(1) + match.group(2)).strip()
+                words = text.split()
+                break
+    return text
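
The character and word stages of `_truncate_repetitions` can be tried in isolation. The sketch below is a minimal standalone version that reuses the two regular expressions from the diff. The function name `collapse_trailing_repeats` is hypothetical, and the phrase-level stage is deliberately left out.

```python
import re


def collapse_trailing_repeats(text: str, min_repeats: int = 3) -> str:
    """Collapse a trailing run of one repeated character or word to a single copy."""
    if not text:
        return text
    # A character repeated at least min_repeats times at the very end of the text.
    char_pattern = re.compile(r"(.)\1{" + str(min_repeats - 1) + r",}$")
    text = char_pattern.sub(r"\1", text)
    # A whole word repeated at least min_repeats times at the very end of the text.
    word_pattern = re.compile(
        r"\b(\w+)(?:\s+\1){" + str(min_repeats - 1) + r",}\s*$", re.IGNORECASE
    )
    return word_pattern.sub(r"\1", text)


print(collapse_trailing_repeats("call 444444"))
print(collapse_trailing_repeats("go to the the the the"))
```

Both patterns are anchored with `$`, so repetitions in the middle of a transcript are left alone. Only runaway output at the end of generation is trimmed.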

asr_processing.py ADDED

	@@ -0,0 +1,131 @@

+from typing import Optional, Union
+import torch
+import transformers
+from transformers import ProcessorMixin
+try:
+    from .asr_config import ASRConfig
+except ImportError:
+    from asr_config import ASRConfig  # type: ignore[no-redef]
+class ASRProcessor(ProcessorMixin):
+    """Processor for Whisper-based ASR models."""
+    attributes = ["feature_extractor", "tokenizer"]
+    feature_extractor_class = "AutoFeatureExtractor"
+    tokenizer_class = "AutoTokenizer"
+    AUDIO_TOKEN = "<audio>"
+    TRANSCRIBE_PROMPT = "Transcribe speech to text"
+    # Default conv layers for Whisper/GLM-ASR: [(pad, kernel, stride), ...]
+    DEFAULT_ENCODER_CONV_LAYERS = [(1, 3, 1), (1, 3, 2)]
+    def __init__(
+        self,
+        feature_extractor,
+        tokenizer,
+        projector=None,
+        encoder_conv_layers: Optional[list] = None,
+    ):
+        """Initialize the ASR processor.
+        Args:
+            feature_extractor: Audio feature extractor (WhisperFeatureExtractor)
+            tokenizer: Text tokenizer for the language model
+            projector: Audio projector module (for computing output lengths)
+            encoder_conv_layers: Conv layer specs [(pad, kernel, stride), ...]
+        """
+        self.feature_extractor = feature_extractor
+        self.tokenizer = tokenizer
+        self.audio_token_id = tokenizer.convert_tokens_to_ids(self.AUDIO_TOKEN)
+        self.projector = projector
+        self.encoder_conv_layers = encoder_conv_layers or self.DEFAULT_ENCODER_CONV_LAYERS
+    def _compute_encoder_output_length(self, mel_length: int) -> int:
+        """Compute encoder output length using conv layer formulas."""
+        length = mel_length
+        for padding, kernel_size, stride in self.encoder_conv_layers:
+            length = (length + 2 * padding - (kernel_size - 1) - 1) // stride + 1
+        return length
+    def __call__(
+        self,
+        audio: Optional[Union[list, "torch.Tensor"]] = None,
+        text: Optional[str] = None,
+        system_prompt: Optional[str] = None,
+        return_tensors: str = "pt",
+        **kwargs,
+    ) -> dict:
+        """Process audio and text inputs for inference.
+        Args:
+            audio: Raw audio waveform(s)
+            text: Target transcription (optional, for training - but use DataCollator instead)
+            system_prompt: Optional system prompt
+            return_tensors: Return format ("pt" for PyTorch)
+        Returns:
+            Dict with input_ids and attention_mask, plus input_features and
+            audio_attention_mask when audio is given
+        """
+        result = {}
+        # Process audio
+        if audio is not None:
+            audio_inputs = self.feature_extractor(
+                audio,
+                sampling_rate=getattr(self.feature_extractor, "sampling_rate", 16000),
+                return_attention_mask=True,
+                return_tensors=return_tensors,
+                **kwargs,
+            )
+            result["input_features"] = audio_inputs["input_features"]
+            result["audio_attention_mask"] = audio_inputs["attention_mask"]
+            # Use actual audio length (from attention mask) for token count
+            real_mel_len = int(audio_inputs["attention_mask"].sum(dim=-1).max().item())
+            encoder_output_len = self._compute_encoder_output_length(real_mel_len)
+            num_audio_tokens = self.projector.get_output_length(encoder_output_len)
+        else:
+            num_audio_tokens = 0
+        # Build prompt with audio token placeholders (audio BEFORE prompt)
+        if num_audio_tokens > 0:
+            user_content = self.AUDIO_TOKEN * num_audio_tokens + " " + self.TRANSCRIBE_PROMPT
+        else:
+            user_content = self.TRANSCRIBE_PROMPT
+        messages = []
+        if system_prompt:
+            messages.append({"role": "system", "content": system_prompt})
+        messages.append({"role": "user", "content": user_content})
+        if text is not None:
+            messages.append({"role": "assistant", "content": text})
+        # Tokenize
+        tokenized = self.tokenizer.apply_chat_template(
+            messages,
+            tokenize=True,
+            add_generation_prompt=(text is None),
+            return_tensors=return_tensors,
+            enable_thinking=False,  # Disable Qwen3 thinking mode for ASR
+        )
+        # Handle both tensor and BatchEncoding returns
+        if isinstance(tokenized, torch.Tensor):
+            input_ids = tokenized
+        else:
+            # BatchEncoding or dict-like object
+            input_ids = tokenized["input_ids"]
+        if input_ids.dim() == 1:
+            input_ids = input_ids.unsqueeze(0)
+        result["input_ids"] = input_ids
+        result["attention_mask"] = torch.ones_like(input_ids)
+        return result
+ASRProcessor.register_for_auto_class()
+transformers.AutoProcessor.register(ASRConfig, ASRProcessor)
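
As a worked check of `_compute_encoder_output_length`, the sketch below applies the same per-layer formula outside the class. It assumes dilation 1 and the default `(pad, kernel, stride)` specs from `DEFAULT_ENCODER_CONV_LAYERS`. The function name `conv_output_length` is hypothetical.

```python
def conv_output_length(length: int, layers=((1, 3, 1), (1, 3, 2))) -> int:
    """Apply the 1D convolution output-length formula once per layer."""
    for padding, kernel_size, stride in layers:
        # Same expression as in the processor: dilation is assumed to be 1.
        length = (length + 2 * padding - (kernel_size - 1) - 1) // stride + 1
    return length


# 3000 mel frames (30 s at a 10 ms hop) pass through the stride-1 layer unchanged
# and are halved by the stride-2 layer.
print(conv_output_length(3000))
```

With the default layers the first convolution preserves length and the second halves it, rounding up for odd inputs. The number of `<audio>` placeholders then depends only on the projector's own downsampling.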

chat_template.jinja ADDED

	@@ -0,0 +1,89 @@

+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0].role == 'system' %}
+        {{- messages[0].content + '\n\n' }}
+    {%- endif %}
+    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0].role == 'system' %}
+        {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
+{%- for message in messages[::-1] %}
+    {%- set index = (messages|length - 1) - loop.index0 %}
+    {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
+        {%- set ns.multi_step_tool = false %}
+        {%- set ns.last_query_index = index %}
+    {%- endif %}
+{%- endfor %}
+{%- for message in messages %}
+    {%- if message.content is string %}
+        {%- set content = message.content %}
+    {%- else %}
+        {%- set content = '' %}
+    {%- endif %}
+    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
+        {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "assistant" %}
+        {%- set reasoning_content = '' %}
+        {%- if message.reasoning_content is string %}
+            {%- set reasoning_content = message.reasoning_content %}
+        {%- else %}
+            {%- if '</think>' in content %}
+                {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
+                {%- set content = content.split('</think>')[-1].lstrip('\n') %}
+            {%- endif %}
+        {%- endif %}
+        {%- if loop.index0 > ns.last_query_index %}
+            {%- if loop.last or (not loop.last and reasoning_content) %}
+                {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
+            {%- else %}
+                {{- '<|im_start|>' + message.role + '\n' + content }}
+            {%- endif %}
+        {%- else %}
+            {{- '<|im_start|>' + message.role + '\n' + content }}
+        {%- endif %}
+        {%- if message.tool_calls %}
+            {%- for tool_call in message.tool_calls %}
+                {%- if (loop.first and content) or (not loop.first) %}
+                    {{- '\n' }}
+                {%- endif %}
+                {%- if tool_call.function %}
+                    {%- set tool_call = tool_call.function %}
+                {%- endif %}
+                {{- '<tool_call>\n{"name": "' }}
+                {{- tool_call.name }}
+                {{- '", "arguments": ' }}
+                {%- if tool_call.arguments is string %}
+                    {{- tool_call.arguments }}
+                {%- else %}
+                    {{- tool_call.arguments | tojson }}
+                {%- endif %}
+                {{- '}\n</tool_call>' }}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- content }}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+    {%- if true %}
+        {{- '<think>\n\n</think>\n\n' }}
+    {%- endif %}
+{%- endif %}

diarization.py ADDED

	@@ -0,0 +1,853 @@

+"""Speaker diarization with support for pyannote and local (tiny-audio) backends.
+Provides two diarization backends:
+- pyannote: Uses pyannote-audio pipeline (requires HF token with model access)
+- local: Uses TEN-VAD + ERes2NetV2 + spectral clustering (no token required)
+Spectral clustering implementation adapted from FunASR/3D-Speaker:
+https://github.com/alibaba-damo-academy/FunASR
+MIT License (https://opensource.org/licenses/MIT)
+"""
+import numpy as np
+import scipy
+import sklearn.metrics.pairwise
+import torch
+from sklearn.cluster._kmeans import k_means
+def _get_device() -> torch.device:
+    """Get best available device for inference."""
+    if torch.cuda.is_available():
+        return torch.device("cuda")
+    if torch.backends.mps.is_available():
+        return torch.device("mps")
+    return torch.device("cpu")
+class SpectralCluster:
+    """Spectral clustering using unnormalized Laplacian of affinity matrix.
+    Adapted from FunASR/3D-Speaker and SpeechBrain implementations.
+    Uses eigenvalue gap to automatically determine number of speakers.
+    """
+    def __init__(self, min_num_spks: int = 1, max_num_spks: int = 15, pval: float = 0.06):
+        self.min_num_spks = min_num_spks
+        self.max_num_spks = max_num_spks
+        self.pval = pval
+    def __call__(self, embeddings: np.ndarray, oracle_num: int | None = None) -> np.ndarray:
+        """Run spectral clustering on embeddings.
+        Args:
+            embeddings: Speaker embeddings of shape [N, D]
+            oracle_num: Optional known number of speakers
+        Returns:
+            Cluster labels of shape [N]
+        """
+        # Similarity matrix computation
+        sim_mat = self.get_sim_mat(embeddings)
+        # Refine the similarity matrix by p-pruning
+        pruned_sim_mat = self.p_pruning(sim_mat)
+        # Symmetrization
+        sym_pruned_sim_mat = 0.5 * (pruned_sim_mat + pruned_sim_mat.T)
+        # Laplacian calculation
+        laplacian = self.get_laplacian(sym_pruned_sim_mat)
+        # Get Spectral Embeddings
+        emb, num_of_spk = self.get_spec_embs(laplacian, oracle_num)
+        # Perform clustering
+        return self.cluster_embs(emb, num_of_spk)
+    def get_sim_mat(self, embeddings: np.ndarray) -> np.ndarray:
+        """Compute cosine similarity matrix."""
+        return sklearn.metrics.pairwise.cosine_similarity(embeddings, embeddings)
+    def p_pruning(self, affinity: np.ndarray) -> np.ndarray:
+        """Prune low similarity values in affinity matrix."""
+        pval = 6.0 / affinity.shape[0] if affinity.shape[0] * self.pval < 6 else self.pval
+        n_elems = int((1 - pval) * affinity.shape[0])
+        # For each row in affinity matrix, zero out low similarities
+        for i in range(affinity.shape[0]):
+            low_indexes = np.argsort(affinity[i, :])
+            low_indexes = low_indexes[0:n_elems]
+            affinity[i, low_indexes] = 0
+        return affinity
+    def get_laplacian(self, sim_mat: np.ndarray) -> np.ndarray:
+        """Compute unnormalized Laplacian matrix."""
+        sim_mat[np.diag_indices(sim_mat.shape[0])] = 0
+        degree = np.sum(np.abs(sim_mat), axis=1)
+        degree_mat = np.diag(degree)
+        return degree_mat - sim_mat
+    def get_spec_embs(
+        self, laplacian: np.ndarray, k_oracle: int | None = None
+    ) -> tuple[np.ndarray, int]:
+        """Extract spectral embeddings from Laplacian."""
+        lambdas, eig_vecs = scipy.linalg.eigh(laplacian)
+        if k_oracle is not None:
+            num_of_spk = k_oracle
+        else:
+            lambda_gap_list = self.get_eigen_gaps(
+                lambdas[self.min_num_spks - 1 : self.max_num_spks + 1]
+            )
+            # Cast to a plain int so the return type matches the annotation
+            num_of_spk = int(np.argmax(lambda_gap_list)) + self.min_num_spks
+        emb = eig_vecs[:, :num_of_spk]
+        return emb, num_of_spk
+    def cluster_embs(self, emb: np.ndarray, k: int) -> np.ndarray:
+        """Cluster spectral embeddings using k-means."""
+        _, labels, _ = k_means(emb, k, n_init=10)
+        return labels
+    def get_eigen_gaps(self, eig_vals: np.ndarray) -> list[float]:
+        """Compute gaps between consecutive eigenvalues."""
+        eig_vals_gap_list = []
+        for i in range(len(eig_vals) - 1):
+            gap = float(eig_vals[i + 1]) - float(eig_vals[i])
+            eig_vals_gap_list.append(gap)
+        return eig_vals_gap_list
+class SpeakerClusterer:
+    """Speaker clustering backend using spectral clustering with speaker merging.
+    Features:
+    - Spectral clustering with eigenvalue gap for auto speaker count detection
+    - P-pruning for affinity matrix refinement
+    - Post-clustering speaker merging by cosine similarity
+    """
+    def __init__(
+        self,
+        min_num_spks: int = 2,
+        max_num_spks: int = 10,
+        merge_thr: float = 0.90,  # Moderate merging
+    ):
+        self.min_num_spks = min_num_spks
+        self.max_num_spks = max_num_spks
+        self.merge_thr = merge_thr
+        self._spectral_cluster: SpectralCluster | None = None
+    def _get_spectral_cluster(self) -> SpectralCluster:
+        """Lazy-load spectral clusterer."""
+        if self._spectral_cluster is None:
+            self._spectral_cluster = SpectralCluster(
+                min_num_spks=self.min_num_spks,
+                max_num_spks=self.max_num_spks,
+            )
+        return self._spectral_cluster
+    def __call__(self, embeddings: np.ndarray, num_speakers: int | None = None) -> np.ndarray:
+        """Cluster speaker embeddings and return labels.
+        Args:
+            embeddings: Speaker embeddings of shape [N, D]
+            num_speakers: Optional oracle number of speakers
+        Returns:
+            Cluster labels of shape [N]
+        """
+        import warnings
+        if len(embeddings.shape) != 2:
+            raise ValueError(f"Expected 2D array, got shape {embeddings.shape}")
+        # Handle edge cases
+        if embeddings.shape[0] == 0:
+            return np.array([], dtype=int)
+        if embeddings.shape[0] == 1:
+            return np.array([0], dtype=int)
+        if embeddings.shape[0] < 6:
+            return np.zeros(embeddings.shape[0], dtype=int)
+        # Normalize embeddings
+        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
+        norms = np.maximum(norms, 1e-10)
+        embeddings = embeddings / norms
+        # Replace NaN/inf with zeros
+        embeddings = np.nan_to_num(embeddings, nan=0.0, posinf=0.0, neginf=0.0)
+        # Run spectral clustering (suppress numerical warnings)
+        spectral = self._get_spectral_cluster()
+        # Update min/max for oracle case
+        if num_speakers is not None:
+            spectral.min_num_spks = num_speakers
+            spectral.max_num_spks = num_speakers
+        with warnings.catch_warnings():
+            warnings.filterwarnings("ignore", category=RuntimeWarning)
+            labels = spectral(embeddings, oracle_num=num_speakers)
+        # Reset min/max
+        if num_speakers is not None:
+            spectral.min_num_spks = self.min_num_spks
+            spectral.max_num_spks = self.max_num_spks
+        # Merge similar speakers if no oracle
+        if num_speakers is None:
+            labels = self._merge_by_cos(labels, embeddings, self.merge_thr)
+        # Re-index labels sequentially
+        _, labels = np.unique(labels, return_inverse=True)
+        return labels
+    def _merge_by_cos(self, labels: np.ndarray, embs: np.ndarray, cos_thr: float) -> np.ndarray:
+        """Merge similar speakers by cosine similarity of centroids."""
+        labels = labels.copy()
+        while True:
+            spk_num = labels.max() + 1
+            if spk_num == 1:
+                break
+            # Compute speaker centroids
+            spk_center = []
+            for i in range(spk_num):
+                spk_emb = embs[labels == i].mean(0)
+                spk_center.append(spk_emb)
+            if len(spk_center) == 0:
+                break
+            spk_center = np.stack(spk_center, axis=0)
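+            # Guard against zero-norm centroids, as done for the embeddings
+            center_norms = np.maximum(np.linalg.norm(spk_center, axis=1, keepdims=True), 1e-10)
+            norm_spk_center = spk_center / center_norms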
+            affinity = np.matmul(norm_spk_center, norm_spk_center.T)
+            affinity = np.triu(affinity, 1)
+            # Find most similar pair
+            spks = np.unravel_index(np.argmax(affinity), affinity.shape)
+            if affinity[spks] < cos_thr:
+                break
+            # Merge speakers
+            for i in range(len(labels)):
+                if labels[i] == spks[1]:
+                    labels[i] = spks[0]
+                elif labels[i] > spks[1]:
+                    labels[i] -= 1
+        return labels
+class LocalSpeakerDiarizer:
+    """Local speaker diarization using TEN-VAD + ERes2NetV2 + spectral clustering.
+    Pipeline:
+    1. TEN-VAD detects speech segments
+    2. Sliding window (0.75s, 80% overlap) for uniform embedding extraction
+    3. ERes2NetV2 extracts speaker embeddings per window
+    4. Spectral clustering with eigenvalue gap for auto speaker detection
+    5. Frame-level consensus voting for segment reconstruction
+    6. Post-processing merges short segments to reduce flicker
+    Tunable Parameters (class attributes):
+    - WINDOW_SIZE: Embedding extraction window size in seconds
+    - STEP_SIZE: Sliding window step size (overlap = WINDOW_SIZE - STEP_SIZE)
+    - VAD_THRESHOLD: Speech detection threshold (lower = more sensitive)
+    - VAD_MIN_DURATION: Minimum speech segment duration
+    - VAD_MAX_GAP: Maximum gap to bridge between segments
+    - VAD_PAD_ONSET/OFFSET: Padding added to speech segments
+    - VOTING_RATE: Frame resolution for consensus voting
+    - MIN_SEGMENT_DURATION: Minimum final segment duration
+    - SHORT_SEGMENT_GAP: Gap threshold for merging short segments
+    - SAME_SPEAKER_GAP: Maximum gap to merge same-speaker segments
+    - TAIL_COVERAGE_RATIO: Minimum tail coverage to add extra window
+    """
+    _ten_vad_model = None
+    _eres2netv2_model = None
+    _device = None
+    # ==================== TUNABLE PARAMETERS ====================
+    # Sliding window for embedding extraction
+    WINDOW_SIZE = 0.75  # seconds - shorter window for finer resolution
+    STEP_SIZE = 0.15  # seconds (80% overlap for more votes)
+    TAIL_COVERAGE_RATIO = 0.1  # Add extra window if tail > this ratio of window
+    # VAD hysteresis parameters
+    VAD_THRESHOLD = 0.25  # Balanced threshold
+    VAD_MIN_DURATION = 0.05  # Minimum speech segment duration (seconds)
+    VAD_MAX_GAP = 0.50  # Bridge gaps shorter than this (seconds)
+    VAD_PAD_ONSET = 0.05  # Padding at segment start (seconds)
+    VAD_PAD_OFFSET = 0.05  # Padding at segment end (seconds)
+    # Frame-level voting
+    VOTING_RATE = 0.01  # 10ms resolution for consensus voting
+    # Post-processing
+    MIN_SEGMENT_DURATION = 0.15  # Minimum final segment duration (seconds)
+    SHORT_SEGMENT_GAP = 0.1  # Gap threshold for merging short segments
+    SAME_SPEAKER_GAP = 0.5  # Gap threshold for merging same-speaker segments
+    # ===========================================================
+    @classmethod
+    def _get_ten_vad_model(cls):
+        """Lazy-load TEN-VAD model (singleton)."""
+        if cls._ten_vad_model is None:
+            from ten_vad import TenVad
+            cls._ten_vad_model = TenVad(hop_size=256, threshold=cls.VAD_THRESHOLD)
+        return cls._ten_vad_model
+    @classmethod
+    def _get_device(cls) -> torch.device:
+        """Get the best available device."""
+        if cls._device is None:
+            cls._device = _get_device()
+        return cls._device
+    @classmethod
+    def _get_eres2netv2_model(cls):
+        """Lazy-load ERes2NetV2 speaker embedding model (singleton)."""
+        if cls._eres2netv2_model is None:
+            from modelscope.pipelines import pipeline
+            from modelscope.utils.constant import Tasks
+            sv_pipeline = pipeline(
+                task=Tasks.speaker_verification,
+                model="iic/speech_eres2netv2_sv_zh-cn_16k-common",
+            )
+            cls._eres2netv2_model = sv_pipeline.model
+            # Move model to the best available device (CUDA, MPS, or CPU)
+            device = cls._get_device()
+            cls._eres2netv2_model = cls._eres2netv2_model.to(device)
+            cls._eres2netv2_model.device = device
+            cls._eres2netv2_model.eval()
+        return cls._eres2netv2_model
+    @classmethod
+    def diarize(
+        cls,
+        audio: np.ndarray | str,
+        sample_rate: int = 16000,
+        num_speakers: int | None = None,
+        min_speakers: int = 2,
+        max_speakers: int = 10,
+        **_kwargs,
+    ) -> list[dict]:
+        """Run speaker diarization on audio.
+        Args:
+            audio: Audio waveform as numpy array or path to audio file
+            sample_rate: Audio sample rate (default 16000)
+            num_speakers: Exact number of speakers (if known)
+            min_speakers: Minimum number of speakers
+            max_speakers: Maximum number of speakers
+        Returns:
+            List of dicts with 'speaker', 'start', 'end' keys
+        """
+        # Handle file path input
+        if isinstance(audio, str):
+            import librosa
+            audio, sample_rate = librosa.load(audio, sr=16000)
+        # Ensure correct sample rate
+        if sample_rate != 16000:
+            import librosa
+            audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=16000)
+            sample_rate = 16000
+        audio = audio.astype(np.float32)
+        total_duration = len(audio) / sample_rate
+        # Step 1: VAD (returns segments and raw frame-level decisions)
+        segments, vad_frames = cls._get_speech_segments(audio, sample_rate)
+        if not segments:
+            return []
+        # Step 2: Extract embeddings
+        embeddings, window_segments = cls._extract_embeddings(audio, segments, sample_rate)
+        if len(embeddings) == 0:
+            return []
+        # Step 3: Cluster
+        clusterer = SpeakerClusterer(min_num_spks=min_speakers, max_num_spks=max_speakers)
+        labels = clusterer(embeddings, num_speakers)
+        # Step 4: Post-process with consensus voting (VAD-aware)
+        return cls._postprocess_segments(window_segments, labels, total_duration, vad_frames)
+    @classmethod
+    def _get_speech_segments(
+        cls, audio_array: np.ndarray, sample_rate: int = 16000
+    ) -> tuple[list[dict], list[bool]]:
+        """Get speech segments using TEN-VAD.
+        Returns:
+            Tuple of (segments list, vad_frames list of per-frame speech decisions)
+        """
+        vad_model = cls._get_ten_vad_model()
+        # Convert to int16 as required by TEN-VAD
+        # Clip to prevent integer overflow
+        if audio_array.dtype != np.int16:
+            audio_int16 = (np.clip(audio_array, -1.0, 1.0) * 32767).astype(np.int16)
+        else:
+            audio_int16 = audio_array
+        # Process frame by frame
+        hop_size = 256
+        frame_duration = hop_size / sample_rate
+        speech_frames: list[bool] = []
+        # +1 so the final full frame is included; a partial tail frame is still skipped
+        for i in range(0, len(audio_int16) - hop_size + 1, hop_size):
+            frame = audio_int16[i : i + hop_size]
+            _, is_speech = vad_model.process(frame)
+            speech_frames.append(is_speech)
+        # Convert frame-level decisions to segments
+        segments = []
+        in_speech = False
+        start_idx = 0
+        for i, is_speech in enumerate(speech_frames):
+            if is_speech and not in_speech:
+                start_idx = i
+                in_speech = True
+            elif not is_speech and in_speech:
+                start_time = start_idx * frame_duration
+                end_time = i * frame_duration
+                segments.append(
+                    {
+                        "start": start_time,
+                        "end": end_time,
+                        "start_sample": int(start_time * sample_rate),
+                        "end_sample": int(end_time * sample_rate),
+                    }
+                )
+                in_speech = False
+        # Handle trailing speech
+        if in_speech:
+            start_time = start_idx * frame_duration
+            end_time = len(speech_frames) * frame_duration
+            segments.append(
+                {
+                    "start": start_time,
+                    "end": end_time,
+                    "start_sample": int(start_time * sample_rate),
+                    "end_sample": int(end_time * sample_rate),
+                }
+            )
+        return cls._apply_vad_hysteresis(segments, sample_rate), speech_frames
+    @classmethod
+    def _apply_vad_hysteresis(cls, segments: list[dict], sample_rate: int = 16000) -> list[dict]:
+        """Smooth VAD segments: merge short gaps, drop short segments, then pad the edges."""
+        if not segments:
+            return segments
+        segments = sorted(segments, key=lambda x: x["start"])
+        # Fill short gaps
+        merged = [segments[0].copy()]
+        for seg in segments[1:]:
+            gap = seg["start"] - merged[-1]["end"]
+            if gap <= cls.VAD_MAX_GAP:
+                merged[-1]["end"] = seg["end"]
+                merged[-1]["end_sample"] = seg["end_sample"]
+            else:
+                merged.append(seg.copy())
+        # Remove short segments
+        filtered = [seg for seg in merged if (seg["end"] - seg["start"]) >= cls.VAD_MIN_DURATION]
+        # Dilate segments (add padding)
+        for seg in filtered:
+            seg["start"] = max(0.0, seg["start"] - cls.VAD_PAD_ONSET)
+            seg["end"] = seg["end"] + cls.VAD_PAD_OFFSET
+            seg["start_sample"] = int(seg["start"] * sample_rate)
+            seg["end_sample"] = int(seg["end"] * sample_rate)
+        return filtered
+    @classmethod
+    def _extract_embeddings(
+        cls, audio_array: np.ndarray, segments: list[dict], sample_rate: int
+    ) -> tuple[np.ndarray, list[dict]]:
+        """Extract speaker embeddings using sliding windows."""
+        speaker_model = cls._get_eres2netv2_model()
+        device = cls._get_device()
+        window_samples = int(cls.WINDOW_SIZE * sample_rate)
+        step_samples = int(cls.STEP_SIZE * sample_rate)
+        embeddings = []
+        window_segments = []
+        with torch.no_grad():
+            for seg in segments:
+                seg_start = seg["start_sample"]
+                seg_end = seg["end_sample"]
+                seg_len = seg_end - seg_start
+                # Generate window positions
+                if seg_len <= window_samples:
+                    starts = [seg_start]
+                    ends = [seg_end]
+                else:
+                    starts = list(range(seg_start, seg_end - window_samples + 1, step_samples))
+                    ends = [s + window_samples for s in starts]
+                    # Cover tail if > TAIL_COVERAGE_RATIO of window remains
+                    if ends and ends[-1] < seg_end:
+                        remainder = seg_end - ends[-1]
+                        if remainder > (window_samples * cls.TAIL_COVERAGE_RATIO):
+                            starts.append(seg_end - window_samples)
+                            ends.append(seg_end)
+                for c_start, c_end in zip(starts, ends):
+                    chunk = audio_array[c_start:c_end]
+                    # Pad short chunks with reflection
+                    if len(chunk) < window_samples:
+                        pad_width = window_samples - len(chunk)
+                        chunk = np.pad(chunk, (0, pad_width), mode="reflect")
+                    # Extract embedding
+                    chunk_tensor = torch.from_numpy(chunk).float().unsqueeze(0).to(device)
+                    embedding = speaker_model.forward(chunk_tensor).squeeze(0).cpu().numpy()
+                    # Validate and normalize
+                    if not np.isfinite(embedding).all():
+                        continue
+                    norm = np.linalg.norm(embedding)
+                    if norm > 1e-8:
+                        embeddings.append(embedding / norm)
+                        window_segments.append(
+                            {
+                                "start": c_start / sample_rate,
+                                "end": c_end / sample_rate,
+                            }
+                        )
+        if embeddings:
+            return np.array(embeddings), window_segments
+        return np.array([]), []
+    @classmethod
+    def _resample_vad(cls, vad_frames: list[bool], num_frames: int) -> np.ndarray:
+        """Resample VAD frame decisions to match voting grid resolution.
+        VAD operates at 256 samples / 16000 Hz = 16ms per frame.
+        Voting operates at VOTING_RATE (default 10ms) per frame.
+        This maps VAD decisions to the finer voting grid.
+        """
+        if not vad_frames:
+            return np.zeros(num_frames, dtype=bool)
+        vad_rate = 256 / 16000  # 16ms per VAD frame
+        result = np.zeros(num_frames, dtype=bool)
+        for i in range(num_frames):
+            voting_time = i * cls.VOTING_RATE
+            vad_frame = int(voting_time / vad_rate)
+            if vad_frame < len(vad_frames):
+                result[i] = vad_frames[vad_frame]
+        return result
+    @classmethod
+    def _postprocess_segments(
+        cls,
+        window_segments: list[dict],
+        labels: np.ndarray,
+        total_duration: float,
+        vad_frames: list[bool],
+    ) -> list[dict]:
+        """Post-process using frame-level consensus voting with VAD-aware silence."""
+        if not window_segments or len(labels) == 0:
+            return []
+        # Remap cluster labels to contiguous indices 0..num_speakers-1
+        unique_labels = np.unique(labels)
+        label_map = {old: new for new, old in enumerate(unique_labels)}
+        clean_labels = np.array([label_map[lbl] for lbl in labels])
+        num_speakers = len(unique_labels)
+        if num_speakers == 0:
+            return []
+        # Create voting grid
+        num_frames = int(np.ceil(total_duration / cls.VOTING_RATE)) + 1
+        votes = np.zeros((num_frames, num_speakers), dtype=np.float32)
+        # Accumulate votes
+        for win, label in zip(window_segments, clean_labels):
+            start_frame = int(win["start"] / cls.VOTING_RATE)
+            end_frame = int(win["end"] / cls.VOTING_RATE)
+            end_frame = min(end_frame, num_frames)
+            if start_frame < end_frame:
+                votes[start_frame:end_frame, label] += 1.0
+        # Determine winner per frame
+        frame_speakers = np.argmax(votes, axis=1)
+        max_votes = np.max(votes, axis=1)
+        # Resample VAD to voting grid resolution for silence-aware voting
+        vad_resampled = cls._resample_vad(vad_frames, num_frames)
+        # Convert frames to segments
+        final_segments = []
+        current_speaker = -1
+        seg_start = 0.0
+        for f in range(num_frames):
+            speaker = int(frame_speakers[f])
+            score = max_votes[f]
+            # Force silence if VAD says no speech OR no votes
+            if score == 0 or not vad_resampled[f]:
+                speaker = -1
+            if speaker != current_speaker:
+                if current_speaker != -1:
+                    final_segments.append(
+                        {
+                            "speaker": f"SPEAKER_{current_speaker}",
+                            "start": seg_start,
+                            "end": f * cls.VOTING_RATE,
+                        }
+                    )
+                current_speaker = speaker
+                seg_start = f * cls.VOTING_RATE
+        # Close last segment
+        if current_speaker != -1:
+            final_segments.append(
+                {
+                    "speaker": f"SPEAKER_{current_speaker}",
+                    "start": seg_start,
+                    "end": num_frames * cls.VOTING_RATE,
+                }
+            )
+        return cls._merge_short_segments(final_segments)
+    @classmethod
+    def _merge_short_segments(cls, segments: list[dict]) -> list[dict]:
+        """Merge short segments to reduce flicker."""
+        if not segments:
+            return []
+        clean: list[dict] = []
+        for seg in segments:
+            dur = seg["end"] - seg["start"]
+            if dur < cls.MIN_SEGMENT_DURATION:
+                if (
+                    clean
+                    and clean[-1]["speaker"] == seg["speaker"]
+                    and seg["start"] - clean[-1]["end"] < cls.SHORT_SEGMENT_GAP
+                ):
+                    clean[-1]["end"] = seg["end"]
+                continue
+            if (
+                clean
+                and clean[-1]["speaker"] == seg["speaker"]
+                and seg["start"] - clean[-1]["end"] < cls.SAME_SPEAKER_GAP
+            ):
+                clean[-1]["end"] = seg["end"]
+            else:
+                clean.append(seg)
+        return clean
+    @classmethod
+    def assign_speakers_to_words(
+        cls,
+        words: list[dict],
+        speaker_segments: list[dict],
+    ) -> list[dict]:
+        """Assign speaker labels to words using each word's midpoint timestamp.
+        Args:
+            words: List of word dicts with 'word', 'start', 'end' keys
+            speaker_segments: List of speaker dicts with 'speaker', 'start', 'end' keys
+        Returns:
+            Words list with 'speaker' key added to each word
+        """
+        for word in words:
+            word_mid = (word["start"] + word["end"]) / 2
+            # Find the speaker segment that contains this word's midpoint
+            best_speaker = None
+            for seg in speaker_segments:
+                if seg["start"] <= word_mid <= seg["end"]:
+                    best_speaker = seg["speaker"]
+                    break
+            # If no segment contains the midpoint, fall back to the nearest segment boundary
+            # (distance to a segment's own midpoint would favor short segments over adjacent long ones)
+            if best_speaker is None and speaker_segments:
+                min_dist = float("inf")
+                for seg in speaker_segments:
+                    dist = min(abs(word_mid - seg["start"]), abs(word_mid - seg["end"]))
+                    if dist < min_dist:
+                        min_dist = dist
+                        best_speaker = seg["speaker"]
+            word["speaker"] = best_speaker
+        return words
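The consensus-voting idea in `_postprocess_segments` can be illustrated in isolation. The sketch below is a simplified pure-Python stand-in, not the code from this commit: it leaves out the VAD mask and the short-segment merge, and `vote_segments` with its `rate` default of 0.01 s is a name and value assumed here for illustration. Each overlapping window casts one vote per grid frame it covers, and the speaker with the most votes wins the frame.

```python
def vote_segments(windows, labels, total_duration, rate=0.01):
    """Turn overlapping labeled windows into non-overlapping speaker segments."""
    num_speakers = max(labels) + 1
    num_frames = int(total_duration / rate + 0.5) + 1
    votes = [[0] * num_speakers for _ in range(num_frames)]

    # Every window votes for its cluster label on each frame it covers.
    for (start, end), label in zip(windows, labels):
        first = int(round(start / rate))
        last = min(int(round(end / rate)), num_frames)
        for f in range(first, last):
            votes[f][label] += 1

    # Walk the grid; a change of winner closes the running segment.
    # The extra iteration at f == num_frames flushes the last segment.
    segments = []
    current, seg_start = -1, 0
    for f in range(num_frames + 1):
        if f < num_frames and max(votes[f]) > 0:
            speaker = votes[f].index(max(votes[f]))  # ties go to the lowest index
        else:
            speaker = -1  # no votes: treat the frame as silence
        if speaker != current:
            if current != -1:
                segments.append((current, seg_start, f))
            current, seg_start = speaker, f
    return [(f"SPEAKER_{spk}", a * rate, b * rate) for spk, a, b in segments]


# Three 1-second windows with a 0.5-second step; the last one belongs to another speaker.
segments = vote_segments([(0.0, 1.0), (0.5, 1.5), (1.0, 2.0)], [0, 0, 1], 2.0)
# SPEAKER_0 covers 0.0-1.5 s and SPEAKER_1 covers 1.5-2.0 s; the tied
# frames between 1.0 s and 1.5 s go to the lower speaker index.
```

Working in integer frame indices and converting to seconds only at the end keeps the segment boundaries free of accumulated floating-point drift.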
+class SpeakerDiarizer:
+    """Unified speaker diarization interface supporting multiple backends.
+    Backends:
+    - 'pyannote': Uses pyannote-audio pipeline (requires HF token)
+    - 'local': Uses TEN-VAD + ERes2NetV2 + spectral clustering
+    Example:
+        >>> segments = SpeakerDiarizer.diarize(audio_array, backend="local")
+        >>> for seg in segments:
+        ...     print(f"{seg['speaker']}: {seg['start']:.2f} - {seg['end']:.2f}")
+    """
+    _pyannote_pipeline = None
+    @classmethod
+    def _get_pyannote_pipeline(cls, hf_token: str | None = None):
+        """Get or create the pyannote diarization pipeline."""
+        if cls._pyannote_pipeline is None:
+            from pyannote.audio import Pipeline
+            cls._pyannote_pipeline = Pipeline.from_pretrained(
+                "pyannote/speaker-diarization-3.1",
+                use_auth_token=hf_token,
+            )
+            cls._pyannote_pipeline.to(torch.device(_get_device()))
+        return cls._pyannote_pipeline
+    @classmethod
+    def diarize(
+        cls,
+        audio: np.ndarray | str,
+        sample_rate: int = 16000,
+        num_speakers: int | None = None,
+        min_speakers: int | None = None,
+        max_speakers: int | None = None,
+        hf_token: str | None = None,
+        backend: str = "pyannote",
+    ) -> list[dict]:
+        """Run speaker diarization on audio.
+        Args:
+            audio: Audio waveform as numpy array or path to audio file
+            sample_rate: Audio sample rate (default 16000)
+            num_speakers: Exact number of speakers (if known)
+            min_speakers: Minimum number of speakers
+            max_speakers: Maximum number of speakers
+            hf_token: HuggingFace token for pyannote models
+            backend: Diarization backend ("pyannote" or "local")
+        Returns:
+            List of dicts with 'speaker', 'start', 'end' keys
+        """
+        if backend == "local":
+            return LocalSpeakerDiarizer.diarize(
+                audio,
+                sample_rate=sample_rate,
+                num_speakers=num_speakers,
+                min_speakers=min_speakers or 2,
+                max_speakers=max_speakers or 10,
+            )
+        # Default to pyannote
+        return cls._diarize_pyannote(
+            audio,
+            sample_rate=sample_rate,
+            num_speakers=num_speakers,
+            min_speakers=min_speakers,
+            max_speakers=max_speakers,
+            hf_token=hf_token,
+        )
+    @classmethod
+    def _diarize_pyannote(
+        cls,
+        audio: np.ndarray | str,
+        sample_rate: int = 16000,
+        num_speakers: int | None = None,
+        min_speakers: int | None = None,
+        max_speakers: int | None = None,
+        hf_token: str | None = None,
+    ) -> list[dict]:
+        """Run pyannote diarization."""
+        pipeline = cls._get_pyannote_pipeline(hf_token)
+        # Prepare audio input
+        if isinstance(audio, np.ndarray):
+            waveform = torch.from_numpy(audio.copy())
+            # pyannote expects (channel, time); add the channel axis only for mono 1-D input
+            if waveform.dim() == 1:
+                waveform = waveform.unsqueeze(0)
+            audio_input = {"waveform": waveform, "sample_rate": sample_rate}
+        else:
+            audio_input = audio
+        # Run diarization
+        diarization_args = {}
+        if num_speakers is not None:
+            diarization_args["num_speakers"] = num_speakers
+        if min_speakers is not None:
+            diarization_args["min_speakers"] = min_speakers
+        if max_speakers is not None:
+            diarization_args["max_speakers"] = max_speakers
+        diarization = pipeline(audio_input, **diarization_args)
+        # Handle different pyannote return types
+        if hasattr(diarization, "itertracks"):
+            annotation = diarization
+        elif hasattr(diarization, "speaker_diarization"):
+            annotation = diarization.speaker_diarization
+        elif isinstance(diarization, tuple):
+            annotation = diarization[0]
+        else:
+            raise TypeError(f"Unexpected diarization output type: {type(diarization)}")
+        # Convert to simple format
+        segments = []
+        for turn, _, speaker in annotation.itertracks(yield_label=True):
+            segments.append(
+                {
+                    "speaker": speaker,
+                    "start": turn.start,
+                    "end": turn.end,
+                }
+            )
+        return segments
+    @classmethod
+    def assign_speakers_to_words(
+        cls,
+        words: list[dict],
+        speaker_segments: list[dict],
+    ) -> list[dict]:
+        """Assign speaker labels to words using each word's midpoint timestamp."""
+        return LocalSpeakerDiarizer.assign_speakers_to_words(words, speaker_segments)
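As a usage illustration for `assign_speakers_to_words`, the midpoint rule can be reproduced without loading any model. The function below is a standalone sketch written for this note (`assign_by_midpoint` is a hypothetical name, not part of the commit); it copies only the containment check and omits the nearest-segment fallback, so a word whose midpoint falls in a gap would stay `None` here.

```python
def assign_by_midpoint(words, speaker_segments):
    """Label each word with the first segment that contains its midpoint."""
    labeled = []
    for word in words:
        mid = (word["start"] + word["end"]) / 2
        speaker = None
        for seg in speaker_segments:
            if seg["start"] <= mid <= seg["end"]:
                speaker = seg["speaker"]
                break
        labeled.append({**word, "speaker": speaker})  # copy, do not mutate the input
    return labeled


segments = [
    {"speaker": "SPEAKER_0", "start": 0.0, "end": 1.2},
    {"speaker": "SPEAKER_1", "start": 1.4, "end": 3.0},
]
words = [
    {"word": "hello", "start": 0.2, "end": 0.6},
    {"word": "there", "start": 0.9, "end": 1.3},  # spills into the gap, midpoint is 1.1
    {"word": "hi", "start": 1.6, "end": 1.9},
]
labeled = assign_by_midpoint(words, segments)
```

The second word overlaps the silence between the two turns, yet its midpoint still lies inside the first segment, so it is attributed to `SPEAKER_0`. Unlike the method in the diff, this sketch returns new dicts rather than adding the `speaker` key in place.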

preprocessor_config.json ADDED

	@@ -0,0 +1,19 @@

+{
+  "chunk_length": 30,
+  "dither": 0.0,
+  "feature_extractor_type": "WhisperFeatureExtractor",
+  "feature_size": 128,
+  "hop_length": 160,
+  "n_fft": 400,
+  "n_samples": 480000,
+  "nb_max_frames": 3000,
+  "padding": false,
+  "padding_side": "right",
+  "padding_value": 0.0,
+  "return_attention_mask": false,
+  "sampling_rate": 16000,
+  "processor_class": "ASRProcessor",
+  "auto_map": {
+    "AutoProcessor": "asr_processing.ASRProcessor"
+  }
+}
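Several fields in this config are derived from three base values. The check below is a small sketch of that arithmetic, assuming the usual Whisper feature-extractor relationships (`n_samples = chunk_length * sampling_rate` and `nb_max_frames = n_samples // hop_length`); the variable names mirror the JSON keys, and `frame_step_ms` is introduced here only for illustration.

```python
chunk_length = 30      # seconds, from "chunk_length"
sampling_rate = 16000  # Hz, from "sampling_rate"
hop_length = 160       # samples between successive frames, from "hop_length"

n_samples = chunk_length * sampling_rate           # samples per 30-second chunk
nb_max_frames = n_samples // hop_length            # mel frames per chunk
frame_step_ms = 1000 * hop_length / sampling_rate  # time between mel frames

print(n_samples, nb_max_frames, frame_step_ms)  # 480000 3000 10.0
```

The results agree with the `n_samples` and `nb_max_frames` values stored in the file, which makes this a quick sanity check when any of the base values is changed.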

projectors.py ADDED

	@@ -0,0 +1,484 @@

+"""Audio projector modules for bridging encoder and decoder embeddings.
+This module contains all projector architectures:
+- MLPAudioProjector: Simple 2-layer MLP with frame stacking downsampling
+- MOSAProjector: MOSA-style dense mixture of experts
+- MoEAudioProjector: Shared expert + sparse routed experts
+- QFormerAudioProjector: BLIP-2 QFormer with learnable queries (Granite-style)
+"""
+import math
+import torch
+import torch.nn as nn
+import torch.nn.functional as F  # noqa: N812
+from transformers import AutoModel, Blip2QFormerConfig
+from transformers.models.llama.modeling_llama import LlamaRMSNorm
+# =============================================================================
+# MLP Projector
+# =============================================================================
+class MLPAudioProjector(nn.Module):
+    """2-layer MLP projector with frame-stacking downsampling (matches GLM-ASR)."""
+    def __init__(self, config):
+        """Initialize MLP projector.
+        Args:
+            config: ASRConfig with encoder_dim, llm_dim, projector_pool_stride
+        """
+        super().__init__()
+        encoder_dim = getattr(config, "encoder_dim", 768)
+        llm_dim = getattr(config, "llm_dim", 2048)
+        self.k = getattr(config, "projector_pool_stride", 4)
+        # Frame stacking: concat k adjacent frames then project
+        # Hidden dim uses 2x expansion like GLM-ASR's GlmAsrMultiModalProjector
+        in_dim = encoder_dim * self.k
+        hidden_dim = llm_dim * 2
+        self.linear_1 = nn.Linear(in_dim, hidden_dim)
+        self.act = nn.GELU()
+        self.linear_2 = nn.Linear(hidden_dim, llm_dim)
+    def get_output_length(self, input_length: int) -> int:
+        """Calculate output sequence length given input length (matches GLM-ASR)."""
+        # GLM-ASR formula: (L - merge_factor) // merge_factor + 1
+        return (input_length - self.k) // self.k + 1
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        """Project audio features to LLM embedding space.
+        Args:
+            x: Audio encoder output of shape [batch, seq_len, encoder_dim]
+        Returns:
+            Projected features of shape [batch, (seq_len - k) // k + 1, llm_dim]
+        """
+        batch, seq, dim = x.shape
+        # Truncate to match GLM-ASR: use (seq - k) // k + 1 frames
+        # This drops trailing frames that don't fill a complete k-frame window
+        out_len = (seq - self.k) // self.k + 1
+        x = x[:, : out_len * self.k, :]  # Truncate to exact multiple
+        x = x.reshape(batch, out_len, dim * self.k)
+        x = self.linear_1(x)
+        x = self.act(x)
+        return self.linear_2(x)
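The frame-stacking step in `MLPAudioProjector.forward` is a truncate-then-reshape. A list-based sketch, assuming nothing beyond the formula in `get_output_length`, shows what that reshape does to the time axis; `stack_frames` is a hypothetical helper written for this note and does not use torch.

```python
def stack_frames(frames, k):
    """Concatenate each run of k adjacent frames into one longer frame."""
    out_len = (len(frames) - k) // k + 1  # same formula as get_output_length
    stacked = []
    for i in range(out_len):
        group = frames[i * k : (i + 1) * k]
        # Flatten the k frames of this group into a single feature vector.
        stacked.append([value for frame in group for value in frame])
    return stacked


frames = [[t, t + 0.5] for t in range(10)]  # 10 frames, feature dim 2
stacked = stack_frames(frames, k=4)
print(len(stacked), len(stacked[0]))  # 2 8
```

With 10 input frames and `k=4`, frames 8 and 9 do not fill a complete group and are dropped, and an input shorter than `k` produces an empty output.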
+# =============================================================================
+# MoE Projector (MOSA-style)
+# =============================================================================
+class SimpleAdapter(nn.Module):
+    """Simple 2-layer GELU adapter (from MOSA paper)."""
+    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
+        super().__init__()
+        self.fc1 = nn.Linear(input_dim, hidden_dim)
+        self.act = nn.GELU()
+        self.fc2 = nn.Linear(hidden_dim, output_dim)
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.fc2(self.act(self.fc1(x)))
+class MOSAProjector(nn.Module):
+    """MOSA-Base projector: simple 2-layer ReLU router with 4 simple adapters.
+    Based on "MOSA: Mixtures of Simple Adapters" (arXiv:2508.18998).
+    Uses softmax gating over all experts (dense MoE) with only cross-entropy loss.
+    Uses Conv1d for downsampling (2 layers, stride 2 each = 4x total).
+    """
+    def __init__(self, config):
+        """Initialize MOSA projector.
+        Args:
+            config: ASRConfig with encoder_dim, llm_dim, num_experts
+        """
+        super().__init__()
+        self.encoder_dim = getattr(config, "encoder_dim", None) or 1280
+        self.llm_dim = getattr(config, "llm_dim", None) or 2048
+        self.num_experts = getattr(config, "num_experts", None) or 4  # MOSA-Base uses 4
+        adapter_hidden = getattr(config, "adapter_hidden_dim", None) or 4096
+        router_hidden = getattr(config, "router_hidden_dim", None) or 512
+        # --- 1. Conv1d Downsampler (4x reduction) ---
+        # 2 layers of stride-2 convolution
+        self.downsampler = nn.Sequential(
+            nn.Conv1d(self.encoder_dim, self.encoder_dim, kernel_size=3, stride=2, padding=1),
+            nn.GELU(),
+            nn.Conv1d(self.encoder_dim, self.llm_dim, kernel_size=3, stride=2, padding=1),
+            nn.GELU(),
+        )
+        # --- 2. Simple Router (MOSA-Base: 2 layers with ReLU) ---
+        # Takes downsampled features (llm_dim) -> 512 -> num_experts
+        self.router = nn.Sequential(
+            nn.Linear(self.llm_dim, router_hidden),
+            nn.ReLU(),
+            nn.Linear(router_hidden, self.num_experts),
+        )
+        # --- 3. Experts (Simple 2-layer GELU adapters) ---
+        # Each expert: llm_dim -> hidden -> llm_dim (much smaller than frame-stacking)
+        self.experts = nn.ModuleList(
+            [
+                SimpleAdapter(self.llm_dim, adapter_hidden, self.llm_dim)
+                for _ in range(self.num_experts)
+            ]
+        )
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        """Project audio features using mixture of experts.
+        Args:
+            x: Audio encoder output of shape [batch, seq_len, encoder_dim]
+        Returns:
+            Projected features of shape [batch, out_len, llm_dim]
+        """
+        # --- 1. Conv1d Downsampling ---
+        # Permute for Conv1d: [B, S, D] -> [B, D, S]
+        x = x.transpose(1, 2)
+        x = self.downsampler(x)
+        # Permute back: [B, D, S] -> [B, S, D]
+        x = x.transpose(1, 2)
+        # --- 2. Routing ---
+        routing_weights = F.softmax(self.router(x), dim=-1)  # (B, out_len, num_experts)
+        # --- 3. Expert Mixture (Dense Execution) ---
+        expert_outputs = torch.stack([expert(x) for expert in self.experts])  # (E, B, out_len, D)
+        return torch.einsum("ebsd, bse -> bsd", expert_outputs, routing_weights)
+    def get_output_length(self, input_length: int) -> int:
+        """Calculate output sequence length after Conv1d downsampling (4x reduction)."""
+        # Conv1d with stride 2, kernel 3, padding 1: out = (in + 2*1 - 3) // 2 + 1 = (in - 1) // 2 + 1
+        # Applied twice for 4x total reduction
+        after_conv1 = (input_length + 2 * 1 - 3) // 2 + 1
+        return (after_conv1 + 2 * 1 - 3) // 2 + 1
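The length arithmetic in `MOSAProjector.get_output_length` follows the standard output-size formula for an undilated 1-D convolution. The sketch below restates it in plain Python; `conv1d_out_len` and `downsampled_len` are names assumed for this note, with defaults matching the kernel size 3, stride 2, padding 1 used by both layers of the downsampler.

```python
import math


def conv1d_out_len(length, kernel_size=3, stride=2, padding=1):
    """Output length of an undilated 1-D convolution."""
    return (length + 2 * padding - kernel_size) // stride + 1


def downsampled_len(length):
    """Length after the two stacked stride-2 convolutions."""
    return conv1d_out_len(conv1d_out_len(length))


lengths = {n: downsampled_len(n) for n in (7, 8, 9, 1500)}
print(lengths)  # {7: 2, 8: 2, 9: 3, 1500: 375}

# For these settings each layer rounds up, so two layers give ceil(n / 4).
assert all(downsampled_len(n) == math.ceil(n / 4) for n in range(1, 2000))
```

Because each layer rounds up rather than down, this projector keeps a partial trailing group that the frame-stacking MLP projector would drop.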
+# =============================================================================
+# MoE Projector (Shared Expert + Sparse Routed Experts)
+# =============================================================================
+class SharedMoEBlock(nn.Module):
+    """MoE block with Shared + Sigmoid-Routed Experts."""
+    def __init__(
+        self,
+        input_dim: int,
+        hidden_dim: int,
+        output_dim: int,
+        num_experts: int = 4,
+        top_k: int = 2,
+    ):
+        super().__init__()
+        self.num_experts = num_experts
+        self.top_k = top_k
+        self.output_dim = output_dim
+        # RMSNorm before routing
+        self.norm = LlamaRMSNorm(input_dim, eps=1e-8)
+        self.router = nn.Linear(input_dim, num_experts, bias=False)
+        nn.init.normal_(self.router.weight, mean=0.0, std=0.02)
+        self.shared_expert = SimpleAdapter(input_dim, hidden_dim, output_dim)
+        self.experts = nn.ModuleList(
+            [SimpleAdapter(input_dim, hidden_dim, output_dim) for _ in range(num_experts)]
+        )
+        self.last_router_logits = None
+        self.last_router_probs = None
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        batch_size, seq_len, dim = hidden_states.shape
+        # 1. Apply Shared Expert
+        normed_states = self.norm(hidden_states)
+        shared_out = self.shared_expert(normed_states)
+        # 2. Router Logic (Sigmoid Style)
+        flat_hidden = normed_states.view(-1, dim)
+        router_logits = self.router(flat_hidden)
+        # Sigmoid routing
+        router_probs = torch.sigmoid(router_logits)
+        self.last_router_logits = router_logits
+        self.last_router_probs = router_probs
+        # 3. Top-K Selection
+        top_k_scores, top_k_indices = torch.topk(router_probs, self.top_k, dim=-1)
+        # Normalize weights
+        top_k_weights = top_k_scores / (top_k_scores.sum(dim=-1, keepdim=True) + 1e-6)
+        top_k_weights = top_k_weights.to(hidden_states.dtype)
+        # 4. Dispatch
+        routed_out = self._dispatch_experts(flat_hidden, top_k_indices, top_k_weights)
+        routed_out = routed_out.view(batch_size, seq_len, -1)
+        return shared_out + routed_out
+    def _dispatch_experts(
+        self,
+        hidden_states: torch.Tensor,
+        top_k_indices: torch.Tensor,
+        top_k_weights: torch.Tensor,
+    ) -> torch.Tensor:
+        num_tokens = hidden_states.shape[0]
+        output = torch.zeros(
+            num_tokens, self.output_dim, device=hidden_states.device, dtype=hidden_states.dtype
+        )
+        for expert_idx, expert in enumerate(self.experts):
+            expert_mask = top_k_indices == expert_idx
+            if not expert_mask.any():
+                continue
+            token_indices, slot_indices = torch.where(expert_mask)
+            expert_input = hidden_states[token_indices]
+            expert_output = expert(expert_input).to(output.dtype)
+            weights = top_k_weights[token_indices, slot_indices].unsqueeze(-1)
+            output.index_add_(0, token_indices, expert_output * weights)
+        return output
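+def load_balancing_loss(router_probs: torch.Tensor, num_experts: int, top_k: int) -> torch.Tensor:
+    """Auxiliary loss to encourage balanced expert usage.
+    Penalizes the squared deviation of each expert's mean routing probability
+    from the mean across experts. `top_k` is currently unused.
+    """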
+    prob_per_expert = router_probs.mean(dim=0)
+    target_mean = prob_per_expert.mean()
+    return (prob_per_expert - target_mean).square().sum() * num_experts
+def z_loss(router_logits: torch.Tensor) -> torch.Tensor:
+    """Z-loss to prevent router logits from growing too large."""
+    return torch.logsumexp(router_logits.float(), dim=-1).square().mean()
+class MoEAudioProjector(nn.Module):
+    """MoE projector with shared expert + sparse routed experts."""
+    def __init__(self, config):
+        """Initialize MoE projector.
+        Args:
+            config: ASRConfig with encoder_dim, llm_dim, num_experts, num_experts_per_tok
+        """
+        super().__init__()
+        self.k = getattr(config, "projector_pool_stride", 4)
+        encoder_dim = config.encoder_dim
+        # Depthwise Conv for temporal mixing
+        self.temporal_conv = nn.Conv1d(
+            encoder_dim, encoder_dim, kernel_size=3, padding=1, groups=encoder_dim
+        )
+        in_dim = encoder_dim * self.k
+        out_dim = config.llm_dim
+        hidden_dim = getattr(config, "projector_hidden_dim", None) or in_dim
+        self.num_experts = getattr(config, "num_experts", 4)
+        self.top_k = getattr(config, "num_experts_per_tok", 2)
+        self.aux_loss_coef = getattr(config, "router_aux_loss_coef", 0.02)
+        self.z_loss_coef = getattr(config, "router_z_loss_coef", 0.001)
+        self.moe = SharedMoEBlock(in_dim, hidden_dim, out_dim, self.num_experts, self.top_k)
+        self._init_weights()
+    def _init_weights(self):
+        with torch.no_grad():
+            nn.init.orthogonal_(self.moe.shared_expert.fc1.weight)
+            nn.init.orthogonal_(self.moe.shared_expert.fc2.weight, gain=0.5)
+            for expert in self.moe.experts:
+                nn.init.orthogonal_(expert.fc1.weight)
+                nn.init.orthogonal_(expert.fc2.weight, gain=0.01)
+    def get_output_length(self, input_length: int) -> int:
+        """Calculate output sequence length given input length."""
+        # Temporal pooling with stride k
+        if input_length % self.k:
+            input_length += self.k - input_length % self.k
+        return input_length // self.k
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        """Project audio features using shared + sparse MoE.
+        Args:
+            x: Audio encoder output of shape [batch, seq_len, encoder_dim]
+        Returns:
+            Projected features of shape [batch, out_len, llm_dim]
+        """
+        batch_size, seq_len, dim = x.size()
+        target_dtype = self.moe.shared_expert.fc1.weight.dtype
+        if x.dtype != target_dtype:
+            x = x.to(target_dtype)
+        # Temporal Context Injection
+        x_ctx = x.transpose(1, 2)
+        x_ctx = self.temporal_conv(x_ctx)
+        x = x + x_ctx.transpose(1, 2)
+        if seq_len % self.k:
+            x = F.pad(x, (0, 0, 0, self.k - seq_len % self.k))
+        x = x.view(batch_size, -1, dim * self.k)
+        return self.moe(x)
+    def get_aux_loss(self) -> torch.Tensor:
+        if self.moe.last_router_logits is None:
+            return torch.tensor(0.0, device=self.moe.router.weight.device)
+        balance = load_balancing_loss(self.moe.last_router_probs, self.num_experts, self.top_k)
+        z = z_loss(self.moe.last_router_logits)
+        return self.aux_loss_coef * balance + self.z_loss_coef * z
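The routing rule in `SharedMoEBlock.forward` (sigmoid scores, top-k selection, renormalization) can be traced for a single token with plain floats. This is an illustrative sketch rather than the batched tensor code above; `route_token` is a hypothetical name, and `eps` mirrors the `1e-6` added to the denominator in the diff.

```python
import math


def route_token(logits, top_k=2, eps=1e-6):
    """Return (expert_index, weight) pairs for one token."""
    # Sigmoid scores are independent per expert and do not sum to 1.
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize over the selected experts only.
    total = sum(probs[i] for i in chosen) + eps
    return [(i, probs[i] / total) for i in chosen]


routes = route_token([2.0, -1.0, 0.5, 0.0])
for expert, weight in routes:
    print(expert, round(weight, 3))
```

Here experts 0 and 2 are selected and their weights sum to just under 1 because of `eps`. The shared expert is not part of this selection; in the diff its output is added to the routed output for every token.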
+# =============================================================================
+# QFormer Projector (Granite-style)
+# =============================================================================
+class QFormerAudioProjector(nn.Module):
+    """
+    BLIP-2 QFormer projector with learnable queries.
+    Based on GraniteSpeechEncoderProjector - uses a QFormer model with learnable
+    query embeddings to compress and project audio encoder outputs. The audio
+    sequence is processed in windows and downsampled via cross-attention.
+    """
+    def __init__(self, config):
+        """Initialize QFormer projector.
+        Args:
+            config: ASRConfig with encoder_dim, llm_dim, qformer_* settings
+        """
+        super().__init__()
+        encoder_dim = config.encoder_dim
+        llm_dim = config.llm_dim
+        # Window and downsampling parameters (Granite defaults: window=15, downsample=5)
+        self.window_size = getattr(config, "qformer_window_size", 15)
+        self.downsample_rate = getattr(config, "downsample_rate", 5)
+        self.num_queries = self.window_size // self.downsample_rate
+        # QFormer hidden size (matches encoder for cross-attention)
+        qformer_hidden = getattr(config, "qformer_hidden_size", None) or encoder_dim
+        qformer_num_layers = getattr(config, "qformer_num_layers", 2)
+        qformer_num_heads = getattr(config, "qformer_num_heads", 16)
+        qformer_intermediate = getattr(config, "qformer_intermediate_size", None) or (
+            qformer_hidden * 4
+        )
+        # Learnable query embeddings (Granite uses std=1.0)
+        self.query = nn.Parameter(torch.zeros(1, self.num_queries, qformer_hidden))
+        self.query.data.normal_(mean=0.0, std=1.0)
+        # Optional projection if encoder dim != qformer hidden
+        if encoder_dim != qformer_hidden:
+            self.encoder_proj = nn.Linear(encoder_dim, qformer_hidden, bias=False)
+        else:
+            self.encoder_proj = None
+        # Configure QFormer to match Granite's exact config
+        qformer_config = Blip2QFormerConfig(
+            hidden_size=qformer_hidden,
+            num_hidden_layers=qformer_num_layers,
+            num_attention_heads=qformer_num_heads,
+            intermediate_size=qformer_intermediate,
+            encoder_hidden_size=qformer_hidden,
+            cross_attention_frequency=1,
+            # Granite-specific settings
+            hidden_act="gelu",
+            attention_probs_dropout_prob=0.1,
+            hidden_dropout_prob=0.1,
+            layer_norm_eps=1e-12,
+            initializer_range=0.02,
+        )
+        self.qformer = AutoModel.from_config(qformer_config)
+        # Final projection to LLM dimension (Granite uses bias=True)
+        self.linear = nn.Linear(qformer_hidden, llm_dim)
+    def get_output_length(self, input_length: int) -> int:
+        """Calculate output sequence length given input length."""
+        # QFormer uses window-based processing with num_queries per window
+        nblocks = math.ceil(input_length / self.window_size)
+        return nblocks * self.num_queries
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        """
+        Args:
+            hidden_states: [batch_size, seq_len, encoder_dim]
+        Returns:
+            projected: [batch_size, num_output_tokens, llm_dim]
+        """
+        batch_size, seq_len, _ = hidden_states.size()
+        # Cast the input to the dtype of the learnable queries so both QFormer inputs match
+        target_dtype = self.query.dtype
+        if hidden_states.dtype != target_dtype:
+            hidden_states = hidden_states.to(target_dtype)
+        # Optional encoder projection
+        if self.encoder_proj is not None:
+            hidden_states = self.encoder_proj(hidden_states)
+        # Compute number of windows and pad to fit
+        nblocks = math.ceil(seq_len / self.window_size)
+        pad = nblocks * self.window_size - seq_len
+        if pad > 0:
+            hidden_states = F.pad(hidden_states, (0, 0, 0, pad), "constant", 0)
+        # Reshape to process each window: [batch*nblocks, window_size, dim]
+        # (reshape rather than view, so a non-contiguous input cannot raise)
+        effective_batch = batch_size * nblocks
+        hidden_states = hidden_states.reshape(effective_batch, self.window_size, -1)
+        # Expand queries to match batch size
+        query_embeds = self.query.expand(effective_batch, -1, -1)
+        # QFormer cross-attention
+        query_output = self.qformer(
+            query_embeds=query_embeds,
+            encoder_hidden_states=hidden_states,
+            return_dict=True,
+        )
+        # Reshape back: [batch, nblocks * num_queries, hidden]
+        output_tokens = nblocks * self.num_queries
+        query_proj = query_output.last_hidden_state.view(batch_size, output_tokens, -1)
+        # Project to LLM dimension
+        return self.linear(query_proj)
+# =============================================================================
+# Projector Registry
+# =============================================================================
+PROJECTOR_CLASSES = {
+    "mlp": MLPAudioProjector,
+    "mosa": MOSAProjector,
+    "moe": MoEAudioProjector,
+    "qformer": QFormerAudioProjector,
+}
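
The windowing arithmetic in `QFormerAudioProjector` (`get_output_length` and the padding step in `forward`) can be checked in isolation. The following is a minimal sketch, not part of the commit: `qformer_output_length` is a hypothetical helper name, and the defaults simply repeat the `window_size=15` and `downsample_rate=5` values from the diff.

```python
import math


def qformer_output_length(seq_len: int, window_size: int = 15, downsample_rate: int = 5) -> dict:
    """Hypothetical helper mirroring the projector's window arithmetic."""
    num_queries = window_size // downsample_rate  # queries emitted per window
    nblocks = math.ceil(seq_len / window_size)    # windows after right-padding
    pad = nblocks * window_size - seq_len         # zero frames appended at the end
    return {
        "num_queries": num_queries,
        "nblocks": nblocks,
        "pad": pad,
        "output_tokens": nblocks * num_queries,
    }


# 100 encoder frames -> 7 windows of 15 frames (5 padded), 3 queries each
info = qformer_output_length(100)
print(info)
```

With the defaults, 100 encoder frames are padded by 5 frames into 7 windows and yield 21 output tokens. Because `num_queries` uses integer division, a `window_size` that is not a multiple of `downsample_rate` is silently rounded down. The `forward` shown above also passes no attention mask to the QFormer, so the zero-padded frames in the last window take part in cross-attention.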

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:33b674fb8444e2553eae8f1b261093371920a28ef75b5c18f4deb3f9217ed0ba
+size 11422834

tokenizer_config.json ADDED

	@@ -0,0 +1,17 @@

+{
+  "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": [
+    "<audio>"
+  ],
+  "is_local": false,
+  "model_max_length": 131072,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}
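
A quick consistency check on this config can be sketched with the standard library alone. This is an illustrative sketch rather than anything shipped in the commit: `check_tokenizer_config` is a hypothetical helper, and the assumption that `<audio>` is the placeholder the model expects comes only from its presence in `extra_special_tokens` above.

```python
import json
from typing import List

# Values copied from the tokenizer_config.json diff above
TOKENIZER_CONFIG = """
{
  "add_prefix_space": false,
  "backend": "tokenizers",
  "bos_token": null,
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "extra_special_tokens": ["<audio>"],
  "is_local": false,
  "model_max_length": 131072,
  "pad_token": "<|endoftext|>",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null
}
"""


def check_tokenizer_config(cfg: dict, audio_token: str = "<audio>") -> List[str]:
    """Hypothetical sanity check; returns a list of human-readable problems."""
    problems = []
    if audio_token not in cfg.get("extra_special_tokens", []):
        problems.append(f"{audio_token} is not registered as a special token")
    if cfg.get("pad_token") is None:
        problems.append("pad_token is unset, so batched padding will fail")
    elif cfg.get("pad_token") == cfg.get("eos_token"):
        problems.append("pad_token equals eos_token")
    return problems


cfg = json.loads(TOKENIZER_CONFIG)
problems = check_tokenizer_config(cfg)
print(problems or "config looks consistent")
```

Keeping `pad_token` distinct from `eos_token`, as this config does, avoids masking out the end-of-turn token when padding positions are excluded from the loss.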