mrbeniwal committed · verified
Commit 643b247 · 1 Parent(s): b89cd41

Upload 14 files
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,178 @@
---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- audio
- asr
- speech-to-text
- whisper
- tiny-audio
base_model:
- openai/whisper-large-v3-turbo
- HuggingFaceTB/SmolLM3-3B
datasets:
- speechbrain/LoquaciousSet
metrics:
- wer
---

# Tiny Audio ASR - LoquaciousSet Training

A speech-to-text model trained using the [Tiny Audio](https://github.com/alexkroman/tiny-audio) framework, combining a frozen Whisper encoder with a trained MLP projector and a frozen SmolLM3-3B decoder.

## Model Description

This model uses an encoder-projector-decoder architecture for automatic speech recognition:

| Component | Model | Parameters | Training Status |
|-----------|-------|------------|-----------------|
| Audio Encoder | openai/whisper-large-v3-turbo | ~800M | Frozen |
| Projector | MLP | 11.7M | **Trained** |
| Language Model | HuggingFaceTB/SmolLM3-3B | 3B | Frozen |
| **Total** | - | **3.72B** | 0.32% trainable |

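The projector itself lives in `projectors.py`, which is not reproduced on this page. As a rough, editorial sketch of the shape bookkeeping only (the layer sizes and the stride-2 frame stacking here are assumptions, not the shipped implementation), a minimal MLP projector mapping Whisper states into the SmolLM3 embedding space could look like this:

```python
import torch
import torch.nn as nn


class MLPProjectorSketch(nn.Module):
    """Hypothetical stride-2 MLP projector: halves the sequence, maps 1280 -> 2048."""

    def __init__(self, encoder_dim: int = 1280, llm_dim: int = 2048, stride: int = 2):
        super().__init__()
        self.stride = stride
        self.net = nn.Sequential(
            nn.Linear(encoder_dim * stride, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        t = t - (t % self.stride)  # drop any ragged tail frame
        # Stack adjacent frame pairs so each output token sees two encoder states
        x = x[:, :t, :].reshape(b, t // self.stride, d * self.stride)
        return self.net(x)


frames = torch.randn(1, 1500, 1280)        # Whisper encoder output for 30 s of audio
print(MLPProjectorSketch()(frames).shape)  # torch.Size([1, 750, 2048])
```
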
## Training Details

### Infrastructure
- **GPU**: NVIDIA H100 80GB HBM3
- **Cloud Provider**: E2E Networks
- **Framework**: PyTorch 2.8.0, Transformers 4.57.3

### Hyperparameters
- **Dataset**: speechbrain/LoquaciousSet (small subset)
- **Train Samples**: 1,000
- **Evaluation Samples**: 100
- **Batch Size**: 8
- **Learning Rate**: 3e-4
- **Max Steps**: 500
- **Warmup Steps**: 50
- **Precision**: BF16
- **Gradient Checkpointing**: Enabled

### Training Metrics

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 3.078 | 3.165 |
| 200 | 2.543 | 3.163 |
| 300 | 0.500 | 0.813 |
| 400 | 0.140 | 0.728 |
| 500 | 0.101 | 0.764 |

Training time: ~18 minutes on a single H100.

## Usage

```python
import torch
import torchaudio

from src.asr_config import ASRConfig
from src.asr_modeling import ASRModel

# Initialize model
config = ASRConfig(
    audio_model_id="openai/whisper-large-v3-turbo",
    text_model_id="HuggingFaceTB/SmolLM3-3B",
    projector_type="mlp",
    attn_implementation="sdpa",
)
model = ASRModel(config)

# Load audio and resample to the 16 kHz the encoder expects
waveform, sample_rate = torchaudio.load("audio.wav")
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
audio_array = waveform.squeeze().numpy()

# Transcribe
inputs = model.feature_extractor(
    audio_array, sampling_rate=16000, return_tensors="pt"
).input_features.to(model.device).to(model.dtype)

with torch.no_grad():
    output = model.generate(input_features=inputs, max_new_tokens=256)

transcription = model.tokenizer.decode(output[0], skip_special_tokens=True)
print(transcription)
```

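Alternatively, since `asr_config.py` (shipped with this checkpoint) registers a custom `automatic-speech-recognition` pipeline, loading through `transformers.pipeline` should also work; a hedged sketch, with the repository id left as a placeholder:

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="<this-repo-id>",  # placeholder for this checkpoint's Hub id
    trust_remote_code=True,
)
print(asr("audio.wav")["text"])
```
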
## Example Results

**Input Audio**: Sample from the LoquaciousSet evaluation set

**Ground Truth**:
```
THESE ARE REFORMS THAT WILL DISCIPLINE AND CONSTRAIN THE EXERCISE OF POWER
BY THE GOVERNMENT AND ANY OTHER ECONOMIC OR POLITICAL ACTOR FOR GENERATIONS TO COME
```

**Model Output**:
```
These are reforms that will discipline and constrain the exercise of power
by the government and any other economic or political actor for generations to come
```

## Limitations

- Trained on a small subset (1,000 samples) for demonstration purposes
- Full training (50,000+ steps) is recommended for production use
- English language only
- Optimized for clean speech; performance may degrade on noisy audio

## Citation

### Tiny Audio Framework
```bibtex
@software{kroman2025tinyaudio,
  author = {Kroman, Alex},
  title = {Tiny Audio: Train Your Own Speech Recognition Model in 24 Hours},
  year = {2025},
  url = {https://github.com/alexkroman/tiny-audio}
}
```

### LoquaciousSet Dataset
```bibtex
@misc{speechbrain2024loquaciousset,
  author = {{SpeechBrain Team}},
  title = {LoquaciousSet: 25,000 Hours of Transcribed English Speech},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/speechbrain/LoquaciousSet}
}
```

### Whisper
```bibtex
@article{radford2022whisper,
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year = {2022}
}
```

### SmolLM
```bibtex
@misc{smollm2024,
  author = {{Hugging Face}},
  title = {SmolLM: Smaller Language Models for Efficient Inference},
  year = {2024},
  url = {https://huggingface.co/HuggingFaceTB/SmolLM3-3B}
}
```

## License

Apache 2.0 - see the [Tiny Audio repository](https://github.com/alexkroman/tiny-audio) for details.

## Acknowledgments

- [Alex Kroman](https://github.com/alexkroman) for the Tiny Audio framework
- [SpeechBrain](https://speechbrain.github.io/) for the LoquaciousSet dataset
- [OpenAI](https://openai.com/) for Whisper
- [Hugging Face](https://huggingface.co/) for SmolLM3 and infrastructure
- [E2E Networks](https://www.e2enetworks.com/) for GPU cloud infrastructure
asr_config.py ADDED
@@ -0,0 +1,157 @@
from typing import Optional

import transformers


class ASRConfig(transformers.PretrainedConfig):
    model_type = "asr_model"
    is_composition = True

    def __init__(
        self,
        audio_model_id: str = "openai/whisper-large-v3-turbo",
        text_model_id: str = "HuggingFaceTB/SmolLM3-3B",
        attn_implementation: str = "flash_attention_2",
        model_dtype: str = "bfloat16",
        num_beams: Optional[int] = None,
        system_prompt: str = "/no_think /system_override",
        user_prompt: str = "Transcribe: <audio>",
        encoder_dim: Optional[int] = None,
        llm_dim: Optional[int] = None,
        audio_sample_rate: int = 16000,
        projector_init_std: float = 0.02,
        projector_pool_stride: int = 2,
        downsample_rate: int = 16,
        projector_hidden_dim: Optional[int] = None,
        projector_type: str = "moe",  # "moe", "swiglu", "residual", "shared_moe", "mlp"
        projector_num_layers: int = 2,  # Number of layers (for residual projector)
        projector_dropout: float = 0.05,  # Dropout rate for projector layers
        projector_input_noise: float = 0.02,  # Input noise for projector
        # MoE-specific configuration
        num_experts: int = 4,  # Number of experts in MoE projectors
        num_experts_per_tok: int = 2,  # Top-k experts per token
        router_aux_loss_coef: float = 0.01,  # Auxiliary loss coefficient for load balancing
        use_specaugment: bool = True,  # Apply SpecAugment during training
        label_smoothing: float = 0.0,  # Label smoothing for cross-entropy loss
        inference_diversity_penalty: float = 0.0,
        inference_warmup_tokens: int = 10,
        max_new_tokens: Optional[int] = None,
        min_new_tokens: Optional[int] = None,
        repetition_penalty: Optional[float] = None,
        length_penalty: Optional[float] = None,
        no_repeat_ngram_size: Optional[int] = None,
        use_cache: Optional[bool] = None,
        **kwargs,
    ):
        # Set default generation parameters (greedy decoding only)
        generation_defaults = {
            "num_beams": 1,
            "max_new_tokens": 96,
            "min_new_tokens": 0,
            "repetition_penalty": 1.0,
            "length_penalty": 1.0,
            "no_repeat_ngram_size": 0,
            "use_cache": True,
        }

        # Apply defaults (config.json values take precedence)
        kwargs = {**generation_defaults, **kwargs}

        self.audio_model_id = audio_model_id
        self.text_model_id = text_model_id
        self.attn_implementation = attn_implementation
        self.model_dtype = model_dtype
        self.system_prompt = system_prompt
        self.user_prompt = user_prompt
        self.encoder_dim = encoder_dim
        self.llm_dim = llm_dim
        self.audio_sample_rate = audio_sample_rate
        self.projector_init_std = projector_init_std
        self.projector_pool_stride = projector_pool_stride
        self.downsample_rate = downsample_rate
        self.projector_hidden_dim = projector_hidden_dim
        self.projector_type = projector_type
        self.projector_num_layers = projector_num_layers
        self.projector_dropout = projector_dropout
        self.projector_input_noise = projector_input_noise
        # MoE-specific configuration
        self.num_experts = num_experts
        self.num_experts_per_tok = num_experts_per_tok
        self.router_aux_loss_coef = router_aux_loss_coef
        self.use_specaugment = use_specaugment
        self.label_smoothing = label_smoothing
        self.inference_diversity_penalty = inference_diversity_penalty
        self.inference_warmup_tokens = inference_warmup_tokens

        # Generation parameters (use explicit value if provided, else use default)
        self.num_beams = num_beams if num_beams is not None else generation_defaults["num_beams"]
        self.max_new_tokens = (
            max_new_tokens if max_new_tokens is not None else generation_defaults["max_new_tokens"]
        )
        self.min_new_tokens = (
            min_new_tokens if min_new_tokens is not None else generation_defaults["min_new_tokens"]
        )
        self.repetition_penalty = (
            repetition_penalty
            if repetition_penalty is not None
            else generation_defaults["repetition_penalty"]
        )
        self.length_penalty = (
            length_penalty if length_penalty is not None else generation_defaults["length_penalty"]
        )
        self.no_repeat_ngram_size = (
            no_repeat_ngram_size
            if no_repeat_ngram_size is not None
            else generation_defaults["no_repeat_ngram_size"]
        )
        self.use_cache = use_cache if use_cache is not None else generation_defaults["use_cache"]

        if "audio_config" not in kwargs:
            self.audio_config = transformers.AutoConfig.from_pretrained(audio_model_id)
            # Override dtype to match model_dtype
            self.audio_config.dtype = model_dtype
        else:
            self.audio_config = kwargs.pop("audio_config")

        if "text_config" not in kwargs:
            self.text_config = transformers.AutoConfig.from_pretrained(
                text_model_id, trust_remote_code=True
            )
            # Override dtype to match model_dtype
            self.text_config.dtype = model_dtype
        else:
            self.text_config = kwargs.pop("text_config")

        if isinstance(self.text_config, dict):
            # Reconstruct config from dict using the model_type stored in the dict
            model_type = self.text_config["model_type"]
            config_class = transformers.AutoConfig.for_model(model_type).__class__
            self.text_config = config_class(**self.text_config)

        if isinstance(self.audio_config, dict):
            model_type = self.audio_config.get("model_type")
            if model_type:
                config_class = transformers.AutoConfig.for_model(model_type).__class__
                self.audio_config = config_class(**self.audio_config)

        super().__init__(**kwargs)

        self.auto_map = {
            "AutoConfig": "asr_config.ASRConfig",
            "AutoModel": "asr_modeling.ASRModel",
            "AutoModelForSpeechSeq2Seq": "asr_modeling.ASRModel",
            "AutoProcessor": "asr_processing.ASRProcessor",
        }
        self.custom_pipelines = {
            "automatic-speech-recognition": {
                "impl": "asr_pipeline.ASRPipeline",
                "pt": ["AutoModelForSpeechSeq2Seq"],
                "tf": [],
                "type": "audio",
            }
        }
        self.architectures = ["ASRModel"]
        self.pipeline_tag = "automatic-speech-recognition"


transformers.AutoConfig.register("asr_model", ASRConfig)
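A quick editorial sketch of how the defaults above resolve (not part of the commit; note that instantiating the config fetches both base-model configs from the Hub):

```python
from asr_config import ASRConfig

cfg = ASRConfig(projector_type="mlp")
print(cfg.num_beams)       # 1  (greedy-decoding default)
print(cfg.max_new_tokens)  # 96 (generation default)
print(cfg.projector_type)  # "mlp"
```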
asr_modeling.py ADDED
@@ -0,0 +1,549 @@
import json
from pathlib import Path
from typing import Optional, Union

import torch
import torch.nn as nn
from transformers import (
    AutoConfig,
    AutoModel,
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedModel,
)
from transformers.generation import GenerationMixin
from transformers.modeling_outputs import CausalLMOutputWithPast
from transformers.models.whisper.modeling_whisper import (
    _compute_mask_indices,
)

try:
    from .asr_config import ASRConfig
    from .projectors import PROJECTOR_CLASSES
except ImportError:
    from asr_config import ASRConfig  # type: ignore[no-redef]
    from projectors import PROJECTOR_CLASSES  # type: ignore[no-redef]


class ASRModel(PreTrainedModel, GenerationMixin):
    """Audio-to-text model combining an audio encoder, projector, and language model."""

    config_class = ASRConfig
    base_model_prefix = "model"
    main_input_name = "input_features"
    _supports_flash_attn_2 = True
    supports_gradient_checkpointing = True
    _is_loading_from_pretrained: bool = False
    _pretrained_model_path: Optional[str] = None

    TRANSCRIBE_PROMPT = "Transcribe: "

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
        """Load model from pretrained, handling device placement correctly."""
        from safetensors.torch import load_file
        from transformers.utils.hub import cached_file

        config = kwargs.pop("config", None)
        if config is None:
            config = ASRConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)

        # Set flag to avoid device_map="auto" in sub-model loaders
        cls._is_loading_from_pretrained = True
        cls._pretrained_model_path = pretrained_model_name_or_path

        try:
            model = cls(config, **kwargs)

            # Load projector weights from safetensors
            subfolder = kwargs.get("subfolder")
            revision = kwargs.get("revision")
            cache_kwargs = {}
            if subfolder:
                cache_kwargs["subfolder"] = subfolder
            if revision:
                cache_kwargs["revision"] = revision

            model_file = cached_file(
                pretrained_model_name_or_path,
                "model.safetensors",
                _raise_exceptions_for_missing_entries=False,
                **cache_kwargs,
            )

            if model_file is not None:
                state_dict = load_file(model_file)
                model.load_state_dict(state_dict, strict=False)

            return model
        finally:
            cls._is_loading_from_pretrained = False
            cls._pretrained_model_path = None

    def __init__(self, config: ASRConfig, **kwargs):
        super().__init__(config)

        self.system_prompt = config.system_prompt
        target_dtype = getattr(torch, config.model_dtype)

        # Audio encoder (frozen)
        self.audio_tower = self._load_audio_encoder(config, target_dtype)

        # Language model (frozen)
        self.language_model = self._load_language_model(config, target_dtype)

        # Initialize tokenizer and special tokens
        self._init_tokenizer(config)

        # Set up generation config with greedy decoding defaults
        self.generation_config = self.language_model.generation_config
        self.generation_config.max_new_tokens = config.max_new_tokens
        self.generation_config.num_beams = config.num_beams
        self.generation_config.do_sample = False
        # Clear sampling params (inherited from LLM) since we use greedy decoding
        self.generation_config.temperature = None
        self.generation_config.top_p = None
        self.generation_config.top_k = None
        self.generation_config.use_cache = config.use_cache
        self.generation_config.length_penalty = config.length_penalty
        self.generation_config.repetition_penalty = config.repetition_penalty
        self.generation_config.no_repeat_ngram_size = config.no_repeat_ngram_size
        self.generation_config.eos_token_id = self.tokenizer.convert_tokens_to_ids("<|im_end|>")
        self.generation_config.pad_token_id = self.tokenizer.pad_token_id

        # Feature extractor for audio preprocessing
        self.feature_extractor = self._create_feature_extractor(config)

        # Audio projector (trainable)
        self.projector = self._create_projector(config, target_dtype)

        # For model parallelism
        self._no_split_modules = getattr(self.language_model, "_no_split_modules", [])

    def _create_feature_extractor(self, config: ASRConfig):
        """Create the appropriate feature extractor for the audio encoder."""
        from transformers import AutoFeatureExtractor

        return AutoFeatureExtractor.from_pretrained(config.audio_model_id)

    @classmethod
    def _load_audio_encoder(cls, config: ASRConfig, dtype: torch.dtype) -> nn.Module:
        """Load and freeze the audio encoder."""
        encoder_kwargs = {
            "attn_implementation": config.attn_implementation,
            "low_cpu_mem_usage": True,
            "dtype": dtype,
        }

        if "whisper" in config.audio_model_id.lower():
            from transformers import WhisperModel

            full_model = WhisperModel.from_pretrained(config.audio_model_id, **encoder_kwargs)
            encoder = full_model.encoder
            del full_model
        else:
            encoder = AutoModel.from_pretrained(config.audio_model_id, **encoder_kwargs)

        encoder.requires_grad_(False)
        encoder.eval()
        return encoder

    @classmethod
    def _load_language_model(cls, config: ASRConfig, dtype: torch.dtype) -> PreTrainedModel:
        """Load and freeze the language model."""
        decoder_kwargs = {
            "attn_implementation": config.attn_implementation,
            "trust_remote_code": True,
            "tie_word_embeddings": True,
            "low_cpu_mem_usage": True,
            "dtype": dtype,
        }

        decoder = AutoModelForCausalLM.from_pretrained(config.text_model_id, **decoder_kwargs)
        decoder.config.use_cache = getattr(config, "use_cache", True)
        decoder.requires_grad_(False)
        decoder.eval()
        return decoder

    def _create_projector(self, config: ASRConfig, dtype: torch.dtype) -> nn.Module:
        """Create the trainable audio projector."""
        # Auto-detect dimensions if not specified
        if config.encoder_dim is None:
            enc_cfg = self.audio_tower.config
            config.encoder_dim = getattr(enc_cfg, "hidden_size", None) or getattr(
                enc_cfg, "d_model", None
            )
            if config.encoder_dim is None:
                raise ValueError("Could not auto-detect encoder_dim. Please specify in config.")

        if config.llm_dim is None:
            dec_cfg = self.language_model.config
            config.llm_dim = getattr(dec_cfg, "hidden_size", None) or getattr(
                dec_cfg, "d_model", None
            )
            if config.llm_dim is None:
                raise ValueError("Could not auto-detect llm_dim. Please specify in config.")

        # Select projector type based on config
        projector_type = getattr(config, "projector_type", "mlp")
        projector_class = PROJECTOR_CLASSES.get(projector_type)
        if projector_class is None:
            raise ValueError(
                f"Unknown projector_type: {projector_type}. "
                f"Valid options: {list(PROJECTOR_CLASSES.keys())}"
            )
        projector = projector_class(config)

        # Move projector to same device as language model (important when using quantization)
        device = next(self.language_model.parameters()).device
        return projector.to(device=device, dtype=dtype)

    def _init_tokenizer(self, config: ASRConfig):
        """Initialize tokenizer with audio token."""
        self.tokenizer = AutoTokenizer.from_pretrained(
            config.text_model_id, trust_remote_code=True
        )

        # Set pad token
        if (
            self.tokenizer.pad_token is None
            or self.tokenizer.pad_token_id == self.tokenizer.eos_token_id
        ) and "<|finetune_right_pad_id|>" in self.tokenizer.get_vocab():
            self.tokenizer.pad_token = "<|finetune_right_pad_id|>"

        # Add audio token
        existing_special = self.tokenizer.additional_special_tokens or []
        if "<audio>" not in existing_special:
            self.tokenizer.add_special_tokens(
                {"additional_special_tokens": existing_special + ["<audio>"]}
            )
            self.language_model.resize_token_embeddings(len(self.tokenizer), mean_resizing=False)

        self.audio_token_id = self.tokenizer.convert_tokens_to_ids("<audio>")
        self.tokenizer.padding_side = "right"

        # Sync token IDs to configs
        for cfg in [self.config.text_config, self.language_model.config, self.generation_config]:
            if cfg is not None:
                cfg.pad_token_id = self.tokenizer.pad_token_id
                cfg.eos_token_id = self.tokenizer.eos_token_id
                cfg.bos_token_id = self.tokenizer.bos_token_id

    def _init_weights(self, module):
        """Weight initialization (projector weights are initialized in MoEAudioProjector)."""
        pass

    def _set_gradient_checkpointing(self, enable: bool = True, gradient_checkpointing_func=None):
        """Enable/disable gradient checkpointing for the language model."""
        # The LLM still stores activations during forward for backprop to the projector.
        # Gradient checkpointing trades compute for memory by recomputing activations.
        if hasattr(self.language_model, "_set_gradient_checkpointing"):
            self.language_model._set_gradient_checkpointing(enable, gradient_checkpointing_func)
        elif hasattr(self.language_model, "gradient_checkpointing_enable") and enable:
            self.language_model.gradient_checkpointing_enable(
                gradient_checkpointing_kwargs={"use_reentrant": False}
            )
        elif hasattr(self.language_model, "gradient_checkpointing_disable") and not enable:
            self.language_model.gradient_checkpointing_disable()

    def get_input_embeddings(self):
        return self.language_model.get_input_embeddings()

    def set_input_embeddings(self, value):
        self.language_model.set_input_embeddings(value)

    def get_output_embeddings(self):
        return self.language_model.get_output_embeddings()

    def set_output_embeddings(self, value):
        self.language_model.set_output_embeddings(value)

    def get_processor(self):
        """Get the processor for this model."""
        try:
            from .asr_processing import ASRProcessor
        except ImportError:
            from asr_processing import ASRProcessor  # type: ignore[no-redef]

        return ASRProcessor(feature_extractor=self.feature_extractor, tokenizer=self.tokenizer)

    def state_dict(self, *args, **kwargs):
        """Only save trainable projector weights."""
        return {f"projector.{k}": v for k, v in self.projector.state_dict().items()}

    def _apply_specaugment(
        self,
        input_features: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        if not getattr(self.config, "use_specaugment", False):
            return input_features

        if not self.training:
            return input_features

        # Input shape: (batch_size, num_mel_bins, sequence_length) for Whisper
        batch_size, hidden_size, sequence_length = input_features.size()

        mask_time_prob = getattr(self.config, "mask_time_prob", 0.05)
        mask_time_length = getattr(self.config, "mask_time_length", 10)
        mask_feature_prob = getattr(self.config, "mask_feature_prob", 0.0)
        mask_feature_length = getattr(self.config, "mask_feature_length", 10)

        # Time masking
        if mask_time_prob > 0:
            mask_time_np = _compute_mask_indices(
                (batch_size, sequence_length),
                mask_prob=mask_time_prob,
                mask_length=mask_time_length,
                attention_mask=attention_mask,
                min_masks=2,
            )
            mask_time_indices = torch.tensor(
                mask_time_np, device=input_features.device, dtype=torch.bool
            )
            # Expand to cover all features: (batch, seq) -> (batch, features, seq)
            mask_time_expanded = mask_time_indices[:, None].expand(-1, hidden_size, -1)
            input_features = input_features.masked_fill(mask_time_expanded, 0.0)

        # Feature masking
        if mask_feature_prob > 0:
            mask_feature_np = _compute_mask_indices(
                (batch_size, hidden_size),
                mask_prob=mask_feature_prob,
                mask_length=mask_feature_length,
                min_masks=2,
            )
            mask_feature_indices = torch.tensor(
                mask_feature_np, device=input_features.device, dtype=torch.bool
            )
            # Expand: (batch, features) -> (batch, features, seq)
            mask_feature_expanded = mask_feature_indices[:, :, None].expand(-1, -1, sequence_length)
            input_features = input_features.masked_fill(mask_feature_expanded, 0.0)

        return input_features

    def _encode_audio(
        self,
        audio_features: torch.Tensor,
        audio_attention_mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        """Encode audio and project to LLM embedding space.

        Returns flattened audio embeddings of shape (total_audio_tokens, hidden_dim).
        """
        # Apply SpecAugment during training (before encoding)
        audio_features = self._apply_specaugment(audio_features, audio_attention_mask)

        with torch.no_grad():
            encoder_out = self.audio_tower(
                input_features=audio_features, attention_mask=audio_attention_mask
            )
            hidden_states = encoder_out.last_hidden_state

        audio_embeds = self.projector(hidden_states)

        # Flatten: (batch, seq, hidden) -> (batch * seq, hidden)
        # This allows masked_scatter to do 1:1 replacement
        return audio_embeds.reshape(-1, audio_embeds.shape[-1])

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        input_features: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        past_key_values: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        use_cache: Optional[bool] = None,
        cache_position: Optional[torch.Tensor] = None,
        audio_attention_mask: Optional[torch.Tensor] = None,
        **kwargs,
    ) -> CausalLMOutputWithPast:
        """Forward pass for training and inference."""
        # Get text embeddings if not provided
        if inputs_embeds is None:
            inputs_embeds = self.language_model.get_input_embeddings()(input_ids)

        if input_features is not None and input_ids is not None:
            # Encode audio -> flattened (total_audio_tokens, hidden_dim)
            audio_embeds = self._encode_audio(input_features, audio_attention_mask)

            # Replace <audio> token placeholders with audio embeddings using masked_scatter
            audio_token_mask = (input_ids == self.audio_token_id).unsqueeze(-1)
            inputs_embeds = inputs_embeds.masked_scatter(
                audio_token_mask.to(inputs_embeds.device),
                audio_embeds.to(inputs_embeds.device, dtype=inputs_embeds.dtype),
            )

        # Run through language model (let it compute loss if labels provided)
        outputs = self.language_model(
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            labels=labels,
            use_cache=use_cache,
            cache_position=cache_position,
            **kwargs,
        )

        # Add auxiliary loss from MoE projectors if available
        if outputs.loss is not None and hasattr(self.projector, "get_aux_loss"):
            aux_loss = self.projector.get_aux_loss()
            if aux_loss is not None and aux_loss.numel() > 0:
                outputs.loss = outputs.loss + aux_loss.to(outputs.loss.device)

        return outputs

    def prepare_inputs_for_generation(self, *args, **kwargs):
        """Prepare inputs for generation, handling audio features for cached decoding."""
        input_features = kwargs.pop("input_features", None)
        cache_position = kwargs.get("cache_position")

        model_inputs = self.language_model.prepare_inputs_for_generation(*args, **kwargs)

        # Only pass audio features on the first generation step (cache_position[0] == 0)
        if cache_position is not None and cache_position[0] == 0 and input_features is not None:
            model_inputs["input_features"] = input_features

        return model_inputs

    def _get_num_audio_tokens(self, input_features: torch.Tensor) -> int:
        """Calculate number of audio tokens based on input shape.

        Whisper: input_features shape is (batch, n_mels, mel_len).
        Encoder output is mel_len // 2 due to the stride-2 conv;
        the MLP projector adds another stride-2 for 4x total downsampling.
        """
        mel_len = input_features.shape[-1]
        return mel_len // 4

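    # Editorial worked example of the arithmetic above (assumes Whisper's fixed
    # 30 s window): 3000 mel frames -> 1500 encoder states after the stride-2
    # conv -> 750 projector outputs, i.e. 3000 // 4 <audio> placeholder tokens.
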
    @torch.no_grad()
    def generate(
        self,
        input_ids: Optional[torch.Tensor] = None,
        input_features: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        audio_attention_mask: Optional[torch.Tensor] = None,
        system_prompt: Optional[str] = None,
        **generate_kwargs,
    ) -> torch.Tensor:
        """Generate transcription from audio input.

        Can be called in two ways:
        1. With input_ids containing <audio> tokens (from processor)
        2. With just audio, and we build the prompt internally
        """
        if input_features is None:
            raise ValueError("input_features required for generation")

        device = input_features.device
        batch_size = input_features.shape[0]

        # Encode audio -> flattened embeddings
        audio_embeds = self._encode_audio(input_features, audio_attention_mask)

        # If input_ids not provided, build prompt with correct number of audio tokens
        if input_ids is None:
            num_audio_tokens = self._get_num_audio_tokens(input_features)
            audio_placeholder = "<audio>" * num_audio_tokens

            system_prompt = system_prompt or self.system_prompt

            messages: list[dict[str, str]] = []
            if system_prompt:
                messages.append({"role": "system", "content": system_prompt})
            messages.append({"role": "user", "content": self.TRANSCRIBE_PROMPT + audio_placeholder})

            input_ids = self.tokenizer.apply_chat_template(
                messages,
                tokenize=True,
                add_generation_prompt=True,
                return_tensors="pt",
            ).to(device)

            if input_ids.dim() == 1:
                input_ids = input_ids.unsqueeze(0)
            if input_ids.shape[0] == 1 and batch_size > 1:
                input_ids = input_ids.expand(batch_size, -1)

        attention_mask = torch.ones_like(input_ids)

        # Get text embeddings and replace audio tokens with audio embeddings
        inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
        audio_token_mask = (input_ids == self.audio_token_id).unsqueeze(-1)
        inputs_embeds = inputs_embeds.masked_scatter(
            audio_token_mask.to(inputs_embeds.device),
            audio_embeds.to(inputs_embeds.device, dtype=inputs_embeds.dtype),
        )

        # Generate using language model
        output = self.language_model.generate(
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask,
            generation_config=self.generation_config,
            **generate_kwargs,
        )

        # When using inputs_embeds without input_ids, generate returns only new tokens
        if isinstance(output, torch.Tensor):
            return output
        return output.sequences

    def save_pretrained(self, save_directory: Union[str, Path], **kwargs):
        """Save model, tokenizer, and processor."""
        import shutil
        from pathlib import Path as PathlibPath

        save_dir = PathlibPath(save_directory)
        save_dir.mkdir(parents=True, exist_ok=True)

        # Update config with actual vocab size
        self.config.vocab_size = self.language_model.config.vocab_size
        self.config.text_config.vocab_size = self.language_model.config.vocab_size

        if hasattr(self.audio_tower.config, "num_mel_bins"):
            self.config.audio_config.num_mel_bins = self.audio_tower.config.num_mel_bins

        # Save model (temporarily remove non-serializable attributes)
        tokenizer = self.tokenizer
        del self.tokenizer

        try:
            super().save_pretrained(save_dir, **kwargs)
        finally:
            self.tokenizer = tokenizer

        # Save tokenizer and feature extractor
        self.tokenizer.save_pretrained(save_dir)
        self.feature_extractor.save_pretrained(save_dir)

        # Add processor auto_map to preprocessor_config.json
        config_path = save_dir / "preprocessor_config.json"
        if config_path.exists():
            with config_path.open() as f:
                processor_config = json.load(f)
        else:
            processor_config = {}

        processor_config.update(
            {
                "processor_class": "ASRProcessor",
                "auto_map": {"AutoProcessor": "asr_processing.ASRProcessor"},
            }
        )

        with config_path.open("w") as f:
            json.dump(processor_config, f, indent=2)

        # Copy source files for auto-loading
        src_dir = PathlibPath(__file__).parent
        for asr_file in src_dir.glob("asr_*.py"):
            shutil.copy(asr_file, save_dir / asr_file.name)
        # Copy projectors module
        shutil.copy(src_dir / "projectors.py", save_dir / "projectors.py")


# Register with transformers Auto classes
AutoConfig.register("asr_model", ASRConfig)
AutoModel.register(ASRConfig, ASRModel)
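To make the `masked_scatter` placeholder trick used in `forward()` and `generate()` concrete, here is a small editorial toy example (not part of the commit):

```python
import torch

# 99 stands in for the <audio> token id; the two placeholder positions are
# replaced, row by row, with the two flattened audio embedding vectors.
input_ids = torch.tensor([[7, 99, 99, 5]])
inputs_embeds = torch.zeros(1, 4, 3)
audio_embeds = torch.arange(6, dtype=torch.float32).reshape(2, 3)

mask = (input_ids == 99).unsqueeze(-1)          # (1, 4, 1), broadcast over hidden dim
merged = inputs_embeds.masked_scatter(mask, audio_embeds)
print(merged[0])  # rows 1 and 2 now hold the audio embeddings
```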
asr_pipeline.py ADDED
@@ -0,0 +1,472 @@
from typing import Any

import numpy as np
import torch
import transformers

try:
    from .asr_modeling import ASRModel
except ImportError:
    from asr_modeling import ASRModel  # type: ignore[no-redef]


class ForcedAligner:
    """Lazy-loaded forced aligner for word-level timestamps using torchaudio wav2vec2."""

    _bundle = None
    _model = None
    _labels = None
    _dictionary = None

    @classmethod
    def get_instance(cls, device: str = "cuda"):
        if cls._model is None:
            import torchaudio

            cls._bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
            cls._model = cls._bundle.get_model().to(device)
            cls._model.eval()
            cls._labels = cls._bundle.get_labels()
            cls._dictionary = {c: i for i, c in enumerate(cls._labels)}
        return cls._model, cls._labels, cls._dictionary

    @classmethod
    def align(
        cls,
        audio: np.ndarray,
        text: str,
        sample_rate: int = 16000,
        language: str = "eng",
        batch_size: int = 16,
    ) -> list[dict]:
        """Align transcript to audio and return word-level timestamps.

        Args:
            audio: Audio waveform as numpy array
            text: Transcript text to align
            sample_rate: Audio sample rate (default 16000)
            language: ISO-639-3 language code (default "eng" for English, unused)
            batch_size: Batch size for alignment model (unused)

        Returns:
            List of dicts with 'word', 'start', 'end' keys
        """
        import torchaudio
        from torchaudio.functional import forced_align, merge_tokens

        device = "cuda" if torch.cuda.is_available() else "cpu"
        model, labels, dictionary = cls.get_instance(device)

        # Convert audio to tensor (copy to ensure array is writable)
        if isinstance(audio, np.ndarray):
            waveform = torch.from_numpy(audio.copy()).float()
        else:
            waveform = audio.clone().float()

        # Ensure 2D (channels, time)
        if waveform.dim() == 1:
            waveform = waveform.unsqueeze(0)

        # Resample if needed (wav2vec2 expects 16kHz)
        if sample_rate != cls._bundle.sample_rate:
            waveform = torchaudio.functional.resample(
                waveform, sample_rate, cls._bundle.sample_rate
            )

        waveform = waveform.to(device)

        # Get emissions from model
        with torch.inference_mode():
            emissions, _ = model(waveform)
            emissions = torch.log_softmax(emissions, dim=-1)

        emission = emissions[0].cpu()

        # Normalize text: uppercase, keep only valid characters
        transcript = text.upper()
        # Build tokens from transcript
        tokens = []
        for char in transcript:
            if char in dictionary:
                tokens.append(dictionary[char])
            elif char == " ":
                tokens.append(dictionary.get("|", dictionary.get(" ", 0)))

        if not tokens:
            return []

        targets = torch.tensor([tokens], dtype=torch.int32)

        # Run forced alignment
        # Note: forced_align is deprecated in torchaudio 2.6+ and will be removed in 2.9 (late 2025)
        # No official replacement announced yet. See https://github.com/pytorch/audio/issues/3902
        aligned_tokens, scores = forced_align(emission.unsqueeze(0), targets, blank=0)

        # Use torchaudio's merge_tokens to get token spans (removes blanks and merges repeats)
        token_spans = merge_tokens(aligned_tokens[0], scores[0])

        # Convert frame indices to time (model stride is 320 samples at 16kHz = 20ms)
        frame_duration = 320 / cls._bundle.sample_rate

        # Group token spans into words based on the pipe separator
        words = text.split()
        word_timestamps = []
        current_word_start = None
        current_word_end = None
        word_idx = 0

        for span in token_spans:
            token_char = labels[span.token]
            if token_char == "|":  # Word separator
                if current_word_start is not None and word_idx < len(words):
                    word_timestamps.append({
                        "word": words[word_idx],
                        "start": current_word_start * frame_duration,
                        "end": current_word_end * frame_duration,
                    })
                    word_idx += 1
                current_word_start = None
                current_word_end = None
            else:
                if current_word_start is None:
                    current_word_start = span.start
                current_word_end = span.end

        # Don't forget the last word
        if current_word_start is not None and word_idx < len(words):
            word_timestamps.append({
                "word": words[word_idx],
                "start": current_word_start * frame_duration,
                "end": current_word_end * frame_duration,
            })

        return word_timestamps


class SpeakerDiarizer:
    """Lazy-loaded speaker diarization using pyannote-audio."""

    _pipeline = None

    @classmethod
    def get_instance(cls, hf_token: str | None = None):
        """Get or create the diarization pipeline.

        Args:
            hf_token: HuggingFace token with access to pyannote models.
                Can also be set via HF_TOKEN environment variable.
        """
        if cls._pipeline is None:
            from pyannote.audio import Pipeline

            cls._pipeline = Pipeline.from_pretrained(
                "pyannote/speaker-diarization-3.1",
            )

            # Move to GPU if available
            if torch.cuda.is_available():
                cls._pipeline.to(torch.device("cuda"))
            elif torch.backends.mps.is_available():
                cls._pipeline.to(torch.device("mps"))

        return cls._pipeline

    @classmethod
    def diarize(
        cls,
        audio: np.ndarray | str,
        sample_rate: int = 16000,
        num_speakers: int | None = None,
        min_speakers: int | None = None,
        max_speakers: int | None = None,
        hf_token: str | None = None,
    ) -> list[dict]:
        """Run speaker diarization on audio.

        Args:
            audio: Audio waveform as numpy array or path to audio file
            sample_rate: Audio sample rate (default 16000)
            num_speakers: Exact number of speakers (if known)
            min_speakers: Minimum number of speakers
            max_speakers: Maximum number of speakers
            hf_token: HuggingFace token for pyannote models

        Returns:
            List of dicts with 'speaker', 'start', 'end' keys
        """
        pipeline = cls.get_instance(hf_token)

        # Prepare audio input
        if isinstance(audio, np.ndarray):
            # pyannote expects {"waveform": tensor, "sample_rate": int}
            waveform = torch.from_numpy(audio).unsqueeze(0)  # Add channel dim
            if waveform.dim() == 1:
                waveform = waveform.unsqueeze(0)
            audio_input = {"waveform": waveform, "sample_rate": sample_rate}
        else:
            # File path
            audio_input = audio

        # Run diarization
        diarization_args = {}
        if num_speakers is not None:
            diarization_args["num_speakers"] = num_speakers
        if min_speakers is not None:
            diarization_args["min_speakers"] = min_speakers
        if max_speakers is not None:
            diarization_args["max_speakers"] = max_speakers

        diarization = pipeline(audio_input, **diarization_args)

        # Handle different pyannote return types
        # pyannote 3.x returns DiarizeOutput dataclass, older versions return Annotation
        if hasattr(diarization, "itertracks"):
            annotation = diarization
        elif hasattr(diarization, "speaker_diarization"):
            # pyannote 3.x DiarizeOutput dataclass
            annotation = diarization.speaker_diarization
        elif isinstance(diarization, tuple):
            # Some versions return (annotation, embeddings) tuple
            annotation = diarization[0]
        else:
            raise TypeError(f"Unexpected diarization output type: {type(diarization)}")

        # Convert to simple format
        segments = []
        for turn, _, speaker in annotation.itertracks(yield_label=True):
            segments.append({
                "speaker": speaker,
                "start": turn.start,
                "end": turn.end,
            })

        return segments

    @classmethod
    def assign_speakers_to_words(
        cls,
        words: list[dict],
        speaker_segments: list[dict],
    ) -> list[dict]:
        """Assign speaker labels to words based on timestamp overlap.

        Args:
            words: List of word dicts with 'word', 'start', 'end' keys
            speaker_segments: List of speaker dicts with 'speaker', 'start', 'end' keys

        Returns:
            Words list with 'speaker' key added to each word
        """
        for word in words:
            word_mid = (word["start"] + word["end"]) / 2

            # Find the speaker segment that contains this word's midpoint
            best_speaker = None
            for seg in speaker_segments:
                if seg["start"] <= word_mid <= seg["end"]:
                    best_speaker = seg["speaker"]
                    break

            # If no exact match, find closest segment
            if best_speaker is None and speaker_segments:
                min_dist = float("inf")
                for seg in speaker_segments:
                    seg_mid = (seg["start"] + seg["end"]) / 2
                    dist = abs(word_mid - seg_mid)
                    if dist < min_dist:
                        min_dist = dist
                        best_speaker = seg["speaker"]

            word["speaker"] = best_speaker

        return words


class ASRPipeline(transformers.AutomaticSpeechRecognitionPipeline):
    """ASR Pipeline for audio-to-text transcription."""

    model: ASRModel

    def __init__(self, model: ASRModel, **kwargs):
        feature_extractor = kwargs.pop("feature_extractor", None)
        tokenizer = kwargs.pop("tokenizer", model.tokenizer)

        if feature_extractor is None:
            feature_extractor = model.get_processor().feature_extractor

        super().__init__(
            model=model, feature_extractor=feature_extractor, tokenizer=tokenizer, **kwargs
        )
        self._current_audio = None

    def _sanitize_parameters(self, **kwargs):
        """Intercept our custom parameters before the parent class validates them."""
        # Remove our custom parameters so the parent doesn't see them
        kwargs.pop("return_timestamps", None)
        kwargs.pop("return_speakers", None)
        kwargs.pop("num_speakers", None)
        kwargs.pop("min_speakers", None)
        kwargs.pop("max_speakers", None)
        kwargs.pop("hf_token", None)

        return super()._sanitize_parameters(**kwargs)

    def __call__(
        self,
        inputs,
        **kwargs,
    ):
        """Transcribe audio with optional word-level timestamps and speaker diarization.

        Args:
            inputs: Audio input (file path, dict with array/sampling_rate, etc.)
            return_timestamps: If True, return word-level timestamps using forced alignment
            return_speakers: If True, return speaker labels for each word
            num_speakers: Exact number of speakers (if known, for diarization)
            min_speakers: Minimum number of speakers (for diarization)
            max_speakers: Maximum number of speakers (for diarization)
            hf_token: HuggingFace token for pyannote models (or set HF_TOKEN env var)
            **kwargs: Additional arguments passed to the pipeline

        Returns:
            Dict with 'text' key, 'words' key if return_timestamps=True,
            and speaker labels on words if return_speakers=True
        """
        # Extract our params before super().__call__ (which will also call _sanitize_parameters)
        return_timestamps = kwargs.pop("return_timestamps", False)
        return_speakers = kwargs.pop("return_speakers", False)
        diarization_params = {
            "num_speakers": kwargs.pop("num_speakers", None),
            "min_speakers": kwargs.pop("min_speakers", None),
            "max_speakers": kwargs.pop("max_speakers", None),
            "hf_token": kwargs.pop("hf_token", None),
        }

        if return_speakers:
            return_timestamps = True

        # Store audio for timestamp alignment and diarization
        if return_timestamps or return_speakers:
            self._current_audio = self._extract_audio(inputs)

        # Run standard transcription
        result = super().__call__(inputs, **kwargs)

        # Add timestamps if requested
        if return_timestamps and self._current_audio is not None:
            text = result.get("text", "")
            if text:
                try:
                    words = ForcedAligner.align(
                        self._current_audio["array"],
                        text,
                        sample_rate=self._current_audio.get("sampling_rate", 16000),
                    )
                    result["words"] = words
                except Exception as e:
                    result["words"] = []
                    result["timestamp_error"] = str(e)
            else:
                result["words"] = []

        # Add speaker diarization if requested
        if return_speakers and self._current_audio is not None:
            try:
                # Run diarization
                speaker_segments = SpeakerDiarizer.diarize(
                    self._current_audio["array"],
                    sample_rate=self._current_audio.get("sampling_rate", 16000),
                    **{k: v for k, v in diarization_params.items() if v is not None},
                )
                result["speaker_segments"] = speaker_segments

                # Assign speakers to words
                if result.get("words"):
                    result["words"] = SpeakerDiarizer.assign_speakers_to_words(
                        result["words"],
                        speaker_segments,
                    )
            except Exception as e:
                result["speaker_segments"] = []
                result["diarization_error"] = str(e)

        # Clean up
        self._current_audio = None

        return result

    def _extract_audio(self, inputs) -> dict | None:
        """Extract audio array from various input formats using HF utilities."""
        from transformers.pipelines.audio_utils import ffmpeg_read

        if isinstance(inputs, dict):
            if "array" in inputs:
                return {
                    "array": inputs["array"],
                    "sampling_rate": inputs.get("sampling_rate", 16000),
                }
            if "raw" in inputs:
                return {
                    "array": inputs["raw"],
                    "sampling_rate": inputs.get("sampling_rate", 16000),
                }
        elif isinstance(inputs, str):
            # File path - load audio using ffmpeg (same as HF pipeline)
            with open(inputs, "rb") as f:
                audio = ffmpeg_read(f.read(), sampling_rate=16000)
            return {"array": audio, "sampling_rate": 16000}
        elif isinstance(inputs, bytes):
            audio = ffmpeg_read(inputs, sampling_rate=16000)
            return {"array": audio, "sampling_rate": 16000}
        elif isinstance(inputs, np.ndarray):
            return {"array": inputs, "sampling_rate": 16000}

        return None

    def preprocess(self, inputs, **preprocess_params):
        # Handle dict with "array" key (from datasets)
        if isinstance(inputs, dict) and "array" in inputs:
            inputs = {
                "raw": inputs["array"],
                "sampling_rate": inputs.get("sampling_rate", self.feature_extractor.sampling_rate),
            }

        for item in super().preprocess(inputs, **preprocess_params):
            if "is_last" not in item:
                item["is_last"] = True
            yield item

    def _forward(self, model_inputs, **generate_kwargs) -> dict[str, Any]:
        # Extract audio features and is_last flag
        is_last = model_inputs.pop("is_last", True) if isinstance(model_inputs, dict) else True

        if isinstance(model_inputs, dict):
            input_features = model_inputs.get("input_features")
            if input_features is not None:
                input_features = input_features.to(self.model.device)
        else:
            input_features = model_inputs.to(self.model.device)

        generated_ids = self.model.generate(
            input_features=input_features,
            **generate_kwargs,
        )

        return {"tokens": generated_ids, "is_last": is_last}

    def postprocess(self, model_outputs, **kwargs) -> dict[str, str]:
        # Handle list of outputs (from chunking)
        if isinstance(model_outputs, list):
            model_outputs = model_outputs[0] if model_outputs else {}

        tokens = model_outputs.get("tokens")
        if tokens is None:
            return super().postprocess(model_outputs, **kwargs)

        if torch.is_tensor(tokens):
            tokens = tokens.cpu()
        if tokens.dim() > 1:
            tokens = tokens[0]

        text = self.tokenizer.decode(tokens, skip_special_tokens=True).strip()
        return {"text": text}
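A hedged usage sketch for the extra pipeline features above (editorial; the repository id is a placeholder, and `return_speakers` additionally requires granted access to the pyannote models):

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="<this-repo-id>",  # placeholder for this checkpoint's Hub id
    trust_remote_code=True,
)

result = asr("meeting.wav", return_speakers=True, min_speakers=2, max_speakers=4)
print(result["text"])
for w in result.get("words", []):
    print(f"{w['speaker']}: {w['word']} ({w['start']:.2f}s-{w['end']:.2f}s)")
```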
asr_processing.py ADDED
@@ -0,0 +1,92 @@
from typing import Optional, Union

import torch
import transformers
from transformers import ProcessorMixin

try:
    from .asr_config import ASRConfig
except ImportError:
    from asr_config import ASRConfig  # type: ignore[no-redef]


class ASRProcessor(ProcessorMixin):
    """Processor for Whisper-based ASR models."""

    attributes = ["feature_extractor", "tokenizer"]
    feature_extractor_class = "AutoFeatureExtractor"
    tokenizer_class = "AutoTokenizer"
    AUDIO_TOKEN = "<audio>"
    TRANSCRIBE_PROMPT = "Transcribe: "

    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer
        self.audio_token_id = tokenizer.convert_tokens_to_ids(self.AUDIO_TOKEN)

    def __call__(
        self,
        audio: Optional[Union[list, "torch.Tensor"]] = None,
        text: Optional[str] = None,
        system_prompt: Optional[str] = None,
        return_tensors: str = "pt",
        **kwargs,
    ) -> dict:
        """Process audio and text inputs for inference.

        Args:
            audio: Raw audio waveform(s)
            text: Target transcription (optional, for training - but use DataCollator instead)
            system_prompt: Optional system prompt
            return_tensors: Return format ("pt" for PyTorch)

        Returns:
            Dict with input_features, input_ids, attention_mask
        """
        result = {}

        # Process audio
        if audio is not None:
            audio_inputs = self.feature_extractor(
                audio,
                sampling_rate=getattr(self.feature_extractor, "sampling_rate", 16000),
                return_tensors=return_tensors,
                **kwargs,
            )
            result["input_features"] = audio_inputs["input_features"]
            # Whisper encoder output length = mel_len // 2 (stride-2 conv)
            num_audio_tokens = audio_inputs["input_features"].shape[-1] // 2
        else:
            num_audio_tokens = 0

        # Build prompt with audio token placeholders
        user_content = self.TRANSCRIBE_PROMPT
        if num_audio_tokens > 0:
            user_content += self.AUDIO_TOKEN * num_audio_tokens

        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": user_content})
        if text is not None:
            messages.append({"role": "assistant", "content": text})

        # Tokenize
        input_ids = self.tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=(text is None),
            return_tensors=return_tensors,
        )

        if isinstance(input_ids, torch.Tensor) and input_ids.dim() == 1:
            input_ids = input_ids.unsqueeze(0)

        result["input_ids"] = input_ids
        result["attention_mask"] = torch.ones_like(input_ids)

        return result


ASRProcessor.register_for_auto_class()
transformers.AutoProcessor.register(ASRConfig, ASRProcessor)
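A short editorial sketch of calling the processor directly (assumes network access to the two base models; in the real flow `ASRModel._init_tokenizer` adds the `<audio>` token before `get_processor()` is called, so we add it by hand here):

```python
import numpy as np
from transformers import AutoFeatureExtractor, AutoTokenizer

from asr_processing import ASRProcessor

feature_extractor = AutoFeatureExtractor.from_pretrained("openai/whisper-large-v3-turbo")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
tokenizer.add_special_tokens({"additional_special_tokens": ["<audio>"]})

processor = ASRProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)

audio = np.zeros(16000, dtype=np.float32)  # one second of silence at 16 kHz
batch = processor(audio=audio)
print(batch["input_features"].shape)  # (1, n_mels, mel_len), padded to 30 s
print(batch["input_ids"].shape)       # prompt ids incl. mel_len // 2 <audio> placeholders
```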
chat_template.jinja ADDED
@@ -0,0 +1,94 @@
1
+ {# ───── defaults ───── #}
2
+ {%- if enable_thinking is not defined -%}
3
+ {%- set enable_thinking = true -%}
4
+ {%- endif -%}
5
+
6
+ {# ───── reasoning mode ───── #}
7
+ {%- if enable_thinking -%}
8
+ {%- set reasoning_mode = "/think" -%}
9
+ {%- else -%}
10
+ {%- set reasoning_mode = "/no_think" -%}
11
+ {%- endif -%}
12
+
13
+ {# ───── header (system message) ───── #}
14
+ {{- "<|im_start|>system\n" -}}
15
+
16
+ {%- if messages[0].role == "system" -%}
17
+ {%- set system_message = messages[0].content -%}
18
+ {%- if "/no_think" in system_message -%}
19
+ {%- set reasoning_mode = "/no_think" -%}
20
+ {%- elif "/think" in system_message -%}
21
+ {%- set reasoning_mode = "/think" -%}
22
+ {%- endif -%}
23
+ {%- set custom_instructions = system_message.replace("/no_think", "").replace("/think", "").rstrip() -%}
24
+ {%- endif -%}
25
+
26
+ {%- if "/system_override" in system_message -%}
27
+ {{- custom_instructions.replace("/system_override", "").rstrip() -}}
28
+ {{- "<|im_end|>\n" -}}
29
+ {%- else -%}
30
+ {{- "## Metadata\n\n" -}}
31
+ {{- "Knowledge Cutoff Date: June 2025\n" -}}
32
+ {%- set today = strftime_now("%d %B %Y") -%}
33
+ {{- "Today Date: " ~ today ~ "\n" -}}
34
+ {{- "Reasoning Mode: " + reasoning_mode + "\n\n" -}}
35
+
36
+ {{- "## Custom Instructions\n\n" -}}
37
+ {%- if custom_instructions -%}
38
+ {{- custom_instructions + "\n\n" -}}
39
+ {%- elif reasoning_mode == "/think" -%}
40
+ {{- "You are a helpful AI assistant named SmolLM, trained by Hugging Face. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracking, and iteration to develop a well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> Thought section </think> Solution section. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion.\n\n" -}}
41
+ {%- else -%}
42
+ {{- "You are a helpful AI assistant named SmolLM, trained by Hugging Face.\n\n" -}}
43
+ {%- endif -%}
44
+
45
+ {%- if xml_tools or python_tools or tools -%}
46
+ {{- "### Tools\n\n" -}}
47
+ {%- if xml_tools or tools -%}
48
+ {%- if tools -%}
49
+ {%- set xml_tools = tools -%}
50
+ {%- endif -%}
51
+ {%- set ns = namespace(xml_tool_string="You may call one or more functions to assist with the user query.\nYou are provided with function signatures within <tools></tools> XML tags:\n\n<tools>\n") -%}
52
+ {%- for tool in xml_tools[:] -%} {# The slicing makes sure that xml_tools is a list #}
53
+ {%- set ns.xml_tool_string = ns.xml_tool_string ~ (tool | string) ~ "\n" -%}
54
+ {%- endfor -%}
55
+ {%- set xml_tool_string = ns.xml_tool_string + "</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>" -%}
56
+ {{- xml_tool_string -}}
57
+ {%- endif -%}
58
+ {%- if python_tools -%}
59
+ {%- set ns = namespace(python_tool_string="When you send a message containing Python code between '<code>' and '</code>' tags, it will be executed in a stateful Jupyter notebook environment, and you will then be given the output to continue reasoning in an agentic loop.\n\nYou can use the following tools in your python code like regular functions:\n<tools>\n") -%}
60
+ {%- for tool in python_tools[:] -%} {# The slicing makes sure that python_tools is a list #}
61
+ {%- set ns.python_tool_string = ns.python_tool_string ~ (tool | string) ~ "\n" -%}
62
+ {%- endfor -%}
63
+ {%- set python_tool_string = ns.python_tool_string + "</tools>\n\nThe state persists between code executions: so variables that you define in one step are still available thereafter." -%}
64
+ {{- python_tool_string -}}
65
+ {%- endif -%}
66
+ {{- "\n\n" -}}
67
+ {{- "<|im_end|>\n" -}}
68
+ {%- endif -%}
69
+ {%- endif -%}
70
+ {# ───── main loop ───── #}
71
+ {%- for message in messages -%}
72
+ {%- set content = message.content if message.content is string else "" -%}
73
+ {%- if message.role == "user" -%}
74
+ {{ "<|im_start|>" + message.role + "\n" + content + "<|im_end|>\n" }}
75
+ {%- elif message.role == "assistant" -%}
76
+ {% generation %}
77
+ {%- if reasoning_mode == "/think" -%}
78
+ {{ "<|im_start|>assistant\n" + content.lstrip("\n") + "<|im_end|>\n" }}
79
+ {%- else -%}
80
+ {{ "<|im_start|>assistant\n" + "<think>\n\n</think>\n" + content.lstrip("\n") + "<|im_end|>\n" }}
81
+ {%- endif -%}
82
+ {% endgeneration %}
83
+ {%- elif message.role == "tool" -%}
84
+ {{ "<|im_start|>" + "user\n" + content + "<|im_end|>\n" }}
85
+ {%- endif -%}
86
+ {%- endfor -%}
87
+ {# ───── generation prompt ───── #}
88
+ {%- if add_generation_prompt -%}
89
+ {%- if reasoning_mode == "/think" -%}
90
+ {{ "<|im_start|>assistant\n" }}
91
+ {%- else -%}
92
+ {{ "<|im_start|>assistant\n" + "<think>\n\n</think>\n" }}
93
+ {%- endif -%}
94
+ {%- endif -%}
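To see how the `/no_think` and `/system_override` directives interact, here is a sketch that renders the template to text without tokenizing (repo id again hypothetical). With this system prompt the metadata header is suppressed and the assistant turn is pre-seeded with an empty `<think></think>` block:

```python
# Render the chat template to inspect the /no_think + /system_override path.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mrbeniwal/tiny-audio-asr")  # hypothetical id

messages = [
    {"role": "system", "content": "/no_think /system_override"},  # as in config.json
    {"role": "user", "content": "Transcribe: <audio>"},
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
# <|im_start|>system
# <|im_end|>
# <|im_start|>user
# Transcribe: <audio><|im_end|>
# <|im_start|>assistant
# <think>
#
# </think>
```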
config.json ADDED
@@ -0,0 +1,203 @@
1
+ {
2
+ "architectures": [
3
+ "ASRModel"
4
+ ],
5
+ "attn_implementation": "sdpa",
6
+ "audio_config": {
7
+ "_name_or_path": "openai/whisper-large-v3-turbo",
8
+ "activation_dropout": 0.0,
9
+ "activation_function": "gelu",
10
+ "apply_spec_augment": false,
11
+ "architectures": [
12
+ "WhisperForConditionalGeneration"
13
+ ],
14
+ "attention_dropout": 0.0,
15
+ "bos_token_id": 50257,
16
+ "classifier_proj_size": 256,
17
+ "d_model": 1280,
18
+ "decoder_attention_heads": 20,
19
+ "decoder_ffn_dim": 5120,
20
+ "decoder_layerdrop": 0.0,
21
+ "decoder_layers": 4,
22
+ "decoder_start_token_id": 50258,
23
+ "dropout": 0.0,
24
+ "dtype": "bfloat16",
25
+ "encoder_attention_heads": 20,
26
+ "encoder_ffn_dim": 5120,
27
+ "encoder_layerdrop": 0.0,
28
+ "encoder_layers": 32,
29
+ "eos_token_id": 50257,
30
+ "init_std": 0.02,
31
+ "mask_feature_length": 10,
32
+ "mask_feature_min_masks": 0,
33
+ "mask_feature_prob": 0.0,
34
+ "mask_time_length": 10,
35
+ "mask_time_min_masks": 2,
36
+ "mask_time_prob": 0.05,
37
+ "max_source_positions": 1500,
38
+ "max_target_positions": 448,
39
+ "median_filter_width": 7,
40
+ "model_type": "whisper",
41
+ "num_hidden_layers": 32,
42
+ "num_mel_bins": 128,
43
+ "pad_token_id": 50257,
44
+ "scale_embedding": false,
45
+ "use_cache": true,
46
+ "use_weighted_layer_sum": false,
47
+ "vocab_size": 51866
48
+ },
49
+ "audio_model_id": "openai/whisper-large-v3-turbo",
50
+ "audio_sample_rate": 16000,
51
+ "auto_map": {
52
+ "AutoConfig": "asr_config.ASRConfig",
53
+ "AutoModel": "asr_modeling.ASRModel",
54
+ "AutoModelForSpeechSeq2Seq": "asr_modeling.ASRModel",
55
+ "AutoProcessor": "asr_processing.ASRProcessor"
56
+ },
57
+ "custom_pipelines": {
58
+ "automatic-speech-recognition": {
59
+ "impl": "asr_pipeline.ASRPipeline",
60
+ "pt": [
61
+ "AutoModelForSpeechSeq2Seq"
62
+ ],
63
+ "tf": [],
64
+ "type": "audio"
65
+ }
66
+ },
67
+ "downsample_rate": 16,
68
+ "dtype": "bfloat16",
69
+ "encoder_dim": 1280,
70
+ "inference_diversity_penalty": 0.0,
71
+ "inference_warmup_tokens": 10,
72
+ "label_smoothing": 0.0,
73
+ "llm_dim": 2048,
74
+ "max_new_tokens": 96,
75
+ "min_new_tokens": 0,
76
+ "model_dtype": "bfloat16",
77
+ "model_type": "asr_model",
78
+ "num_experts": 4,
79
+ "num_experts_per_tok": 2,
80
+ "pipeline_tag": "automatic-speech-recognition",
81
+ "projector_dropout": 0.05,
82
+ "projector_hidden_dim": null,
83
+ "projector_init_std": 0.02,
84
+ "projector_input_noise": 0.02,
85
+ "projector_num_layers": 2,
86
+ "projector_pool_stride": 2,
87
+ "projector_type": "mlp",
88
+ "router_aux_loss_coef": 0.01,
89
+ "system_prompt": "/no_think /system_override",
90
+ "text_config": {
91
+ "_name_or_path": "HuggingFaceTB/SmolLM3-3B",
92
+ "architectures": [
93
+ "SmolLM3ForCausalLM"
94
+ ],
95
+ "attention_bias": false,
96
+ "attention_dropout": 0.0,
97
+ "bos_token_id": null,
98
+ "dtype": "bfloat16",
99
+ "eos_token_id": 128012,
100
+ "hidden_act": "silu",
101
+ "hidden_size": 2048,
102
+ "initializer_range": 0.02,
103
+ "intermediate_size": 11008,
104
+ "layer_types": [
105
+ "full_attention",
106
+ "full_attention",
107
+ "full_attention",
108
+ "full_attention",
109
+ "full_attention",
110
+ "full_attention",
111
+ "full_attention",
112
+ "full_attention",
113
+ "full_attention",
114
+ "full_attention",
115
+ "full_attention",
116
+ "full_attention",
117
+ "full_attention",
118
+ "full_attention",
119
+ "full_attention",
120
+ "full_attention",
121
+ "full_attention",
122
+ "full_attention",
123
+ "full_attention",
124
+ "full_attention",
125
+ "full_attention",
126
+ "full_attention",
127
+ "full_attention",
128
+ "full_attention",
129
+ "full_attention",
130
+ "full_attention",
131
+ "full_attention",
132
+ "full_attention",
133
+ "full_attention",
134
+ "full_attention",
135
+ "full_attention",
136
+ "full_attention",
137
+ "full_attention",
138
+ "full_attention",
139
+ "full_attention",
140
+ "full_attention"
141
+ ],
142
+ "max_position_embeddings": 65536,
143
+ "max_window_layers": 28,
144
+ "mlp_bias": false,
145
+ "model_type": "smollm3",
146
+ "no_rope_layer_interval": 4,
147
+ "no_rope_layers": [
148
+ 1,
149
+ 1,
150
+ 1,
151
+ 0,
152
+ 1,
153
+ 1,
154
+ 1,
155
+ 0,
156
+ 1,
157
+ 1,
158
+ 1,
159
+ 0,
160
+ 1,
161
+ 1,
162
+ 1,
163
+ 0,
164
+ 1,
165
+ 1,
166
+ 1,
167
+ 0,
168
+ 1,
169
+ 1,
170
+ 1,
171
+ 0,
172
+ 1,
173
+ 1,
174
+ 1,
175
+ 0,
176
+ 1,
177
+ 1,
178
+ 1,
179
+ 0,
180
+ 1,
181
+ 1,
182
+ 1,
183
+ 0
184
+ ],
185
+ "num_attention_heads": 16,
186
+ "num_hidden_layers": 36,
187
+ "num_key_value_heads": 4,
188
+ "pretraining_tp": 2,
189
+ "rms_norm_eps": 1e-06,
190
+ "rope_scaling": null,
191
+ "rope_theta": 5000000.0,
192
+ "sliding_window": null,
193
+ "use_cache": false,
194
+ "use_sliding_window": false,
195
+ "vocab_size": 128257
196
+ },
197
+ "text_model_id": "HuggingFaceTB/SmolLM3-3B",
198
+ "transformers_version": "4.57.3",
199
+ "use_cache": false,
200
+ "use_specaugment": true,
201
+ "user_prompt": "Transcribe: <audio>",
202
+ "vocab_size": 128257
203
+ }
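Two details in this config are worth noting: `encoder_dim` (1280) and `llm_dim` (2048) are exactly Whisper-turbo's `d_model` and SmolLM3's `hidden_size`, which is what fixes the projector's input/output shapes, and the `auto_map` / `custom_pipelines` entries let the checkpoint load through the stock `pipeline()` API. A sketch of the latter, under the same hypothetical repo id as above:

```python
# Load the full encoder-projector-decoder stack via the registered custom pipeline.
import transformers

asr = transformers.pipeline(
    "automatic-speech-recognition",
    model="mrbeniwal/tiny-audio-asr",  # hypothetical repo id
    trust_remote_code=True,  # pulls asr_modeling.ASRModel / asr_pipeline.ASRPipeline
)
result = asr("audio.wav")  # typically {'text': '...'} for ASR pipelines
print(result)
```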
generation_config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "bos_token_id": 128000,
3
+ "eos_token_id": 128012,
4
+ "max_new_tokens": 96,
5
+ "pad_token_id": 128004,
6
+ "temperature": null,
7
+ "top_k": null,
8
+ "top_p": null,
9
+ "transformers_version": "4.57.3"
10
+ }
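With `temperature`, `top_k`, and `top_p` all null, these defaults amount to greedy decoding capped at 96 new tokens. A hedged sketch of overriding that cap at call time; `model` and `inputs` are assumed to come from the snippets above, and forwarding both `input_features` and `input_ids` this way is an assumption based on the processor's output keys:

```python
# Greedy decoding is the default here; only the token budget is changed.
output_ids = model.generate(
    **inputs,                # input_features + input_ids from the processor (assumed)
    max_new_tokens=128,      # override the 96-token default in generation_config.json
    do_sample=False,         # explicit greedy, matching the nulled sampling fields
    eos_token_id=128012,     # <|im_end|>
    pad_token_id=128004,     # <|finetune_right_pad_id|>
)
```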
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:59a82cd5acbf8e93a0566583e3eecdd2a8108d36856408f1e7d6fc9259059181
3
+ size 23462224
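The payload size is itself a useful sanity check: it corresponds almost exactly to the 11.7M trained projector parameters stored in bfloat16, confirming that only the projector, not the frozen encoder or LLM, is serialized here:

```python
# Safetensors size vs. projector parameter count (2 bytes per bf16 weight).
size_bytes = 23_462_224            # from the LFS pointer above
approx_params = size_bytes / 2     # ignores the small safetensors header
print(f"~{approx_params / 1e6:.1f}M parameters")  # ~11.7M
```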
preprocessor_config.json ADDED
@@ -0,0 +1,18 @@
1
+ {
2
+ "chunk_length": 30,
3
+ "dither": 0.0,
4
+ "feature_extractor_type": "WhisperFeatureExtractor",
5
+ "feature_size": 128,
6
+ "hop_length": 160,
7
+ "n_fft": 400,
8
+ "n_samples": 480000,
9
+ "nb_max_frames": 3000,
10
+ "padding_side": "right",
11
+ "padding_value": 0.0,
12
+ "processor_class": "ASRProcessor",
13
+ "return_attention_mask": false,
14
+ "sampling_rate": 16000,
15
+ "auto_map": {
16
+ "AutoProcessor": "asr_processing.ASRProcessor"
17
+ }
18
+ }
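The numbers in this file are mutually consistent and pin down the audio token budget: 30 s at 16 kHz gives `n_samples`, dividing by the hop length gives the 3000 mel frames (`nb_max_frames`), and Whisper's stride-2 conv halves that to 1500 encoder frames, the count the processor above uses for `<audio>` placeholders:

```python
# How chunk_length, hop_length, n_samples, and nb_max_frames relate.
sampling_rate = 16_000
chunk_length_s = 30
hop_length = 160

n_samples = sampling_rate * chunk_length_s   # 480000, as configured
nb_max_frames = n_samples // hop_length      # 3000 mel frames
encoder_frames = nb_max_frames // 2          # 1500 after Whisper's stride-2 conv
print(n_samples, nb_max_frames, encoder_frames)
```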
projectors.py ADDED
@@ -0,0 +1,527 @@
1
+ """Audio projector modules for bridging encoder and decoder embeddings.
2
+
3
+ This module contains all projector architectures:
4
+ - MLPAudioProjector: Simple 2-layer MLP with conv downsampling
5
+ - MoEAudioProjector: MOSA-style dense mixture of experts
6
+ - SwiGLUAudioProjector: SwiGLU-based projector with temporal pooling
7
+ - ResidualAudioProjector: Residual MLP blocks with linear projection
8
+ - SharedMoEAudioProjector: Shared expert + sparse routed experts
9
+ """
10
+
11
+ import torch
12
+ import torch.nn as nn
13
+ import torch.nn.functional as F # noqa: N812
14
+ from transformers.models.llama.modeling_llama import LlamaRMSNorm
15
+
16
+ # =============================================================================
17
+ # MLP Projector
18
+ # =============================================================================
19
+
20
+
21
+ class MLPAudioProjector(nn.Module):
22
+ """2-layer MLP projector with conv-based 2x temporal downsampling."""
23
+
24
+ def __init__(self, config):
25
+ super().__init__()
26
+
27
+ encoder_dim = getattr(config, "encoder_dim", 768)
28
+ llm_dim = getattr(config, "llm_dim", 2048)
29
+
30
+ self.downsample = nn.Conv1d(
31
+ encoder_dim, encoder_dim, kernel_size=3, stride=2, padding=1, bias=False
32
+ )
33
+ self.linear_1 = nn.Linear(encoder_dim, llm_dim, bias=False)
34
+ self.act = nn.GELU()
35
+ self.linear_2 = nn.Linear(llm_dim, llm_dim, bias=False)
36
+
37
+ self.apply(self._init_weights)
38
+
39
+ def _init_weights(self, module):
40
+ if isinstance(module, nn.Linear):
41
+ nn.init.normal_(module.weight, mean=0.0, std=0.02)
42
+ elif isinstance(module, nn.Conv1d):
43
+ nn.init.normal_(module.weight, mean=0.0, std=0.02)
44
+ if module.bias is not None:
45
+ nn.init.zeros_(module.bias)
46
+
47
+ def forward(self, x):
48
+ """
49
+ x: [Batch, Seq_Len, Dim]
50
+ Returns: [Batch, Seq_Len // 2, llm_dim]
51
+ """
52
+ # Conv1d expects [Batch, Channels, Seq_Len]
53
+ x = x.transpose(1, 2)
54
+ x = self.downsample(x)
55
+ x = x.transpose(1, 2)
56
+
57
+ x = self.linear_1(x)
58
+ x = self.act(x)
59
+ return self.linear_2(x)
60
+
61
+
62
+ # =============================================================================
63
+ # MoE Projector (MOSA-style)
64
+ # =============================================================================
65
+
66
+
67
+ class SimpleAdapter(nn.Module):
68
+ """Simple adapter: Linear -> ReLU -> Dropout -> Linear."""
69
+
70
+ def __init__(self, in_features, hidden_features, out_features, dropout=0.0):
71
+ super().__init__()
72
+ self.fc1 = nn.Linear(in_features, hidden_features)
73
+ self.relu = nn.ReLU()
74
+ self.dropout = nn.Dropout(dropout)
75
+ self.fc2 = nn.Linear(hidden_features, out_features)
76
+
77
+ def forward(self, x):
78
+ x = self.fc1(x)
79
+ x = self.relu(x)
80
+ x = self.dropout(x)
81
+ return self.fc2(x)
82
+
83
+
84
+ class MoEAudioProjector(nn.Module):
85
+ """
86
+ MOSA-style projector: Mixture of Simple Adapters.
87
+
88
+ From paper (arXiv:2508.18998):
89
+ - Dense mixture (softmax over ALL experts) instead of sparse Top-K
90
+ - Simple Linear->ReLU->Linear adapters
91
+ - No auxiliary losses - just cross-entropy on transcripts
92
+ - Conv downsampling: stride 4 total (two conv layers, stride 2 each)
93
+ """
94
+
95
+ def __init__(self, config):
96
+ super().__init__()
97
+
98
+ self.encoder_dim = config.encoder_dim
99
+ self.llm_dim = config.llm_dim
100
+ self.num_experts = getattr(config, "num_experts", 4)
101
+ adapter_hidden = getattr(config, "projector_hidden_dim", None) or 4096
102
+ self.dropout_rate = getattr(config, "projector_dropout", 0.1)
103
+
104
+ # Convolutional Subsampling (stride 4 total)
105
+ self.conv = nn.Sequential(
106
+ nn.Conv1d(self.encoder_dim, self.llm_dim, kernel_size=3, stride=2, padding=1),
107
+ nn.ReLU(),
108
+ nn.Conv1d(self.llm_dim, self.llm_dim, kernel_size=3, stride=2, padding=1),
109
+ nn.ReLU(),
110
+ )
111
+
112
+ # Router
113
+ router_hidden = 512
114
+ self.router = nn.Sequential(
115
+ nn.Linear(self.encoder_dim, router_hidden),
116
+ nn.ReLU(),
117
+ nn.Linear(router_hidden, self.num_experts),
118
+ )
119
+
120
+ # Experts
121
+ self.experts = nn.ModuleList(
122
+ [
123
+ SimpleAdapter(self.llm_dim, adapter_hidden, self.llm_dim, dropout=self.dropout_rate)
124
+ for _ in range(self.num_experts)
125
+ ]
126
+ )
127
+
128
+ self.ln_post = LlamaRMSNorm(self.llm_dim, eps=1e-6)
129
+ self._init_weights()
130
+
131
+ def _init_weights(self):
132
+ std = 0.02
133
+ with torch.no_grad():
134
+ for module in self.conv:
135
+ if isinstance(module, nn.Conv1d):
136
+ nn.init.normal_(module.weight, mean=0.0, std=std)
137
+ if module.bias is not None:
138
+ nn.init.zeros_(module.bias)
139
+
140
+ for module in self.router:
141
+ if isinstance(module, nn.Linear):
142
+ nn.init.normal_(module.weight, mean=0.0, std=std)
143
+ if module.bias is not None:
144
+ nn.init.zeros_(module.bias)
145
+
146
+ for expert in self.experts:
147
+ nn.init.normal_(expert.fc1.weight, mean=0.0, std=std)
148
+ nn.init.normal_(expert.fc2.weight, mean=0.0, std=std)
149
+ if expert.fc1.bias is not None:
150
+ nn.init.zeros_(expert.fc1.bias)
151
+ if expert.fc2.bias is not None:
152
+ nn.init.zeros_(expert.fc2.bias)
153
+
154
+ self.ln_post.weight.data.fill_(1.0)
155
+
156
+ def forward(self, x):
157
+ batch_size, seq_len, _ = x.shape
158
+
159
+ # Pad to be divisible by stride (4)
160
+ pad_amt = (4 - (seq_len % 4)) % 4
161
+ if pad_amt > 0:
162
+ x = F.pad(x, (0, 0, 0, pad_amt))
163
+ seq_len = x.shape[1]
164
+
165
+ # Convolutional Downsampling
166
+ h_conv = self.conv(x.permute(0, 2, 1)).permute(0, 2, 1)
167
+
168
+ # Router on high-res input, then downsample weights
169
+ router_logits = self.router(x)
170
+ router_logits = router_logits.view(batch_size, seq_len // 4, 4, self.num_experts).mean(
171
+ dim=2
172
+ )
173
+ routing_weights = F.softmax(router_logits, dim=-1)
174
+
175
+ # Weighted sum of expert outputs
176
+ final_out = torch.zeros_like(h_conv)
177
+ for i, expert in enumerate(self.experts):
178
+ expert_out = expert(h_conv)
179
+ expert_weight = routing_weights[:, :, i : i + 1]
180
+ final_out.add_(expert_out * expert_weight)
181
+
182
+ return self.ln_post(final_out)
183
+
184
+ def get_aux_loss(self) -> torch.Tensor:
185
+ """Return auxiliary loss (none for dense MoE)."""
186
+ return torch.tensor(0.0)
187
+
188
+
189
+ # =============================================================================
190
+ # SwiGLU Projector
191
+ # =============================================================================
192
+
193
+
194
+ class SwiGLU(nn.Module):
195
+ def __init__(self, in_features, hidden_features, out_features, bias=False, dropout=0.0):
196
+ super().__init__()
197
+ self.w1 = nn.Linear(in_features, hidden_features, bias=bias)
198
+ self.w2 = nn.Linear(in_features, hidden_features, bias=bias)
199
+ self.w3 = nn.Linear(hidden_features, out_features, bias=bias)
200
+ self.act = nn.SiLU()
201
+ self.dropout = nn.Dropout(dropout)
202
+
203
+ def forward(self, x):
204
+ x_gate = self.act(self.w1(x))
205
+ x_val = self.w2(x)
206
+ x = x_gate * x_val
207
+ x = self.dropout(x)
208
+ return self.w3(x)
209
+
210
+
211
+ class SwiGLUAudioProjector(nn.Module):
212
+ """SwiGLU-based projector with temporal pooling."""
213
+
214
+ def __init__(self, config):
215
+ super().__init__()
216
+ self.k = getattr(config, "projector_pool_stride", 4)
217
+ in_dim = config.encoder_dim * self.k
218
+ out_dim = config.llm_dim
219
+ hidden_dim = config.projector_hidden_dim
220
+ if hidden_dim is None:
221
+ hidden_dim = config.encoder_dim * 2
222
+
223
+ dropout_rate = getattr(config, "projector_dropout", 0.0)
224
+
225
+ self.proj1 = SwiGLU(in_dim, hidden_dim, hidden_dim, dropout=dropout_rate)
226
+ self.proj2 = SwiGLU(hidden_dim, hidden_dim, out_dim, dropout=dropout_rate)
227
+ self.output_dropout = nn.Dropout(dropout_rate)
228
+
229
+ with torch.no_grad():
230
+ std = getattr(config, "projector_init_std", 0.02)
231
+ nn.init.normal_(self.proj1.w1.weight, mean=0.0, std=std)
232
+ nn.init.normal_(self.proj1.w2.weight, mean=0.0, std=std)
233
+ nn.init.normal_(self.proj1.w3.weight, mean=0.0, std=std)
234
+ nn.init.normal_(self.proj2.w1.weight, mean=0.0, std=std)
235
+ nn.init.normal_(self.proj2.w2.weight, mean=0.0, std=std)
236
+ nn.init.normal_(self.proj2.w3.weight, mean=0.0, std=std)
237
+
238
+ def forward(self, x):
239
+ batch_size, seq_len, dim = x.size()
240
+
241
+ target_dtype = self.proj1.w1.weight.dtype
242
+ if x.dtype != target_dtype:
243
+ x = x.to(target_dtype)
244
+
245
+ remainder = seq_len % self.k
246
+ if remainder:
247
+ pad_len = self.k - remainder
248
+ x = F.pad(x, (0, 0, 0, pad_len))
249
+
250
+ x = x.contiguous().view(batch_size, -1, dim * self.k)
251
+ x = self.proj1(x)
252
+ x = self.proj2(x)
253
+
254
+ return self.output_dropout(x)
255
+
256
+
257
+ # Alias for backwards compatibility
258
+ AudioProjector = SwiGLUAudioProjector
259
+
260
+
261
+ # =============================================================================
262
+ # Residual Projector
263
+ # =============================================================================
264
+
265
+
266
+ class ResidualMLP(nn.Module):
267
+ """MLP block with residual connection: Output = x + MLP(x)."""
268
+
269
+ def __init__(self, dim, hidden_dim, dropout=0.0):
270
+ super().__init__()
271
+ self.fc1 = nn.Linear(dim, hidden_dim)
272
+ self.fc2 = nn.Linear(hidden_dim, dim)
273
+ self.act = nn.GELU()
274
+ self.dropout = nn.Dropout(dropout)
275
+
276
+ def forward(self, x):
277
+ residual = x
278
+ x = self.fc1(x)
279
+ x = self.act(x)
280
+ x = self.dropout(x)
281
+ x = self.fc2(x)
282
+ x = self.dropout(x)
283
+ return residual + x
284
+
285
+
286
+ class ResidualAudioProjector(nn.Module):
287
+ """Residual MLP projector for audio-to-LLM feature translation."""
288
+
289
+ def __init__(self, config):
290
+ super().__init__()
291
+
292
+ self.k = getattr(config, "projector_pool_stride", 4)
293
+ in_dim = config.encoder_dim * self.k
294
+ out_dim = config.llm_dim
295
+ hidden_dim = getattr(config, "projector_hidden_dim", None) or out_dim * 4
296
+ self.num_layers = getattr(config, "projector_num_layers", 2)
297
+ dropout_rate = getattr(config, "projector_dropout", 0.0)
298
+
299
+ self.input_proj = nn.Linear(in_dim, out_dim)
300
+ self.ln_input = LlamaRMSNorm(out_dim, eps=1e-6)
301
+
302
+ self.layers = nn.ModuleList(
303
+ [ResidualMLP(out_dim, hidden_dim, dropout=dropout_rate) for _ in range(self.num_layers)]
304
+ )
305
+ self.layer_norms = nn.ModuleList(
306
+ [LlamaRMSNorm(out_dim, eps=1e-6) for _ in range(self.num_layers)]
307
+ )
308
+
309
+ self.output_dropout = nn.Dropout(dropout_rate)
310
+ self._init_weights(config)
311
+
312
+ def _init_weights(self, config):
313
+ std = getattr(config, "projector_init_std", 0.02)
314
+
315
+ with torch.no_grad():
316
+ nn.init.normal_(self.input_proj.weight, mean=0.0, std=std)
317
+ if self.input_proj.bias is not None:
318
+ nn.init.zeros_(self.input_proj.bias)
319
+
320
+ self.ln_input.weight.data.fill_(1.0)
321
+ for ln in self.layer_norms:
322
+ ln.weight.data.fill_(1.0)
323
+
324
+ for layer in self.layers:
325
+ nn.init.normal_(layer.fc1.weight, mean=0.0, std=std)
326
+ nn.init.normal_(layer.fc2.weight, mean=0.0, std=std * 0.1)
327
+ if layer.fc1.bias is not None:
328
+ nn.init.zeros_(layer.fc1.bias)
329
+ if layer.fc2.bias is not None:
330
+ nn.init.zeros_(layer.fc2.bias)
331
+
332
+ def forward(self, x):
333
+ batch_size, seq_len, dim = x.size()
334
+
335
+ target_dtype = self.input_proj.weight.dtype
336
+ if x.dtype != target_dtype:
337
+ x = x.to(target_dtype)
338
+
339
+ remainder = seq_len % self.k
340
+ if remainder:
341
+ pad_len = self.k - remainder
342
+ x = F.pad(x, (0, 0, 0, pad_len))
343
+
344
+ x = x.contiguous().view(batch_size, -1, dim * self.k)
345
+ x = self.input_proj(x)
346
+ x = self.ln_input(x)
347
+
348
+ for layer, ln in zip(self.layers, self.layer_norms):
349
+ x = layer(x)
350
+ x = ln(x)
351
+
352
+ return self.output_dropout(x)
353
+
354
+
355
+ # =============================================================================
356
+ # Shared MoE Projector
357
+ # =============================================================================
358
+
359
+
360
+ class SwiGLUExpert(nn.Module):
361
+ """SwiGLU expert MLP."""
362
+
363
+ def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
364
+ super().__init__()
365
+ self.gate_proj = nn.Linear(input_dim, hidden_dim, bias=False)
366
+ self.up_proj = nn.Linear(input_dim, hidden_dim, bias=False)
367
+ self.down_proj = nn.Linear(hidden_dim, output_dim, bias=False)
368
+ self.act = nn.SiLU()
369
+
370
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
371
+ return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))
372
+
373
+
374
+ class SharedMoEBlock(nn.Module):
375
+ """MoE block with shared expert + sparse routed experts."""
376
+
377
+ def __init__(
378
+ self,
379
+ input_dim: int,
380
+ hidden_dim: int,
381
+ output_dim: int,
382
+ num_experts: int = 4,
383
+ top_k: int = 2,
384
+ ):
385
+ super().__init__()
386
+ self.num_experts = num_experts
387
+ self.top_k = top_k
388
+ self.output_dim = output_dim
389
+
390
+ self.router = nn.Linear(input_dim, num_experts, bias=False)
391
+ nn.init.zeros_(self.router.weight)
392
+
393
+ self.shared_expert = SwiGLUExpert(input_dim, hidden_dim, output_dim)
394
+ self.experts = nn.ModuleList(
395
+ [SwiGLUExpert(input_dim, hidden_dim, output_dim) for _ in range(num_experts)]
396
+ )
397
+
398
+ self.last_router_logits = None
399
+ self.last_router_probs = None
400
+
401
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
402
+ batch_size, seq_len, dim = hidden_states.shape
403
+
404
+ shared_out = self.shared_expert(hidden_states)
405
+
406
+ flat_hidden = hidden_states.view(-1, dim)
407
+ router_logits = self.router(flat_hidden)
408
+ router_probs = F.softmax(router_logits.float(), dim=-1)
409
+
410
+ self.last_router_logits = router_logits
411
+ self.last_router_probs = router_probs
412
+
413
+ top_k_weights, top_k_indices = torch.topk(router_probs, self.top_k, dim=-1)
414
+ top_k_weights = top_k_weights / top_k_weights.sum(dim=-1, keepdim=True)
415
+ top_k_weights = top_k_weights.to(hidden_states.dtype)
416
+
417
+ routed_out = self._dispatch_experts(flat_hidden, top_k_indices, top_k_weights)
418
+ routed_out = routed_out.view(batch_size, seq_len, -1)
419
+
420
+ return shared_out + routed_out
421
+
422
+ def _dispatch_experts(
423
+ self,
424
+ hidden_states: torch.Tensor,
425
+ top_k_indices: torch.Tensor,
426
+ top_k_weights: torch.Tensor,
427
+ ) -> torch.Tensor:
428
+ num_tokens = hidden_states.shape[0]
429
+ output = torch.zeros(
430
+ num_tokens, self.output_dim, device=hidden_states.device, dtype=hidden_states.dtype
431
+ )
432
+
433
+ for expert_idx, expert in enumerate(self.experts):
434
+ expert_mask = top_k_indices == expert_idx
435
+ if not expert_mask.any():
436
+ continue
437
+
438
+ token_indices, slot_indices = torch.where(expert_mask)
439
+ expert_input = hidden_states[token_indices]
440
+ expert_output = expert(expert_input)
441
+ weights = top_k_weights[token_indices, slot_indices].unsqueeze(-1)
442
+ output.index_add_(0, token_indices, expert_output * weights)
443
+
444
+ return output
445
+
446
+
447
+ def load_balancing_loss(router_probs: torch.Tensor, num_experts: int, top_k: int) -> torch.Tensor:
448
+ """Auxiliary loss to encourage balanced expert usage."""
449
+ _, selected = torch.topk(router_probs, top_k, dim=-1)
450
+ expert_mask = F.one_hot(selected, num_experts).float()
451
+ tokens_per_expert = expert_mask.mean(dim=(0, 1))
452
+ prob_per_expert = router_probs.mean(dim=0)
453
+ return (tokens_per_expert * prob_per_expert).sum() * num_experts
454
+
455
+
456
+ def z_loss(router_logits: torch.Tensor) -> torch.Tensor:
457
+ """Z-loss to prevent router logits from growing too large."""
458
+ return torch.logsumexp(router_logits.float(), dim=-1).square().mean()
459
+
460
+
461
+ class SharedMoEAudioProjector(nn.Module):
462
+ """Shared expert + sparse routed experts projector."""
463
+
464
+ def __init__(self, config):
465
+ super().__init__()
466
+
467
+ self.k = getattr(config, "projector_pool_stride", 4)
468
+
469
+ encoder_dim = config.encoder_dim
470
+ in_dim = encoder_dim * self.k
471
+ out_dim = config.llm_dim
472
+ hidden_dim = getattr(config, "projector_hidden_dim", None) or in_dim
473
+
474
+ self.num_experts = getattr(config, "num_experts", 4)
475
+ self.top_k = getattr(config, "num_experts_per_tok", 2)
476
+ self.aux_loss_coef = getattr(config, "router_aux_loss_coef", 0.02)
477
+ self.z_loss_coef = getattr(config, "router_z_loss_coef", 0.001)
478
+
479
+ self.moe = SharedMoEBlock(in_dim, hidden_dim, out_dim, self.num_experts, self.top_k)
480
+ self._init_weights(in_dim)
481
+
482
+ def _init_weights(self, in_dim: int):
483
+ with torch.no_grad():
484
+ nn.init.orthogonal_(self.moe.shared_expert.gate_proj.weight)
485
+ nn.init.orthogonal_(self.moe.shared_expert.up_proj.weight)
486
+ nn.init.orthogonal_(self.moe.shared_expert.down_proj.weight, gain=0.5)
487
+
488
+ for expert in self.moe.experts:
489
+ nn.init.orthogonal_(expert.gate_proj.weight)
490
+ nn.init.orthogonal_(expert.up_proj.weight)
491
+ nn.init.orthogonal_(expert.down_proj.weight, gain=0.01)
492
+
493
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
494
+ batch_size, seq_len, dim = x.size()
495
+
496
+ target_dtype = self.moe.shared_expert.gate_proj.weight.dtype
497
+ if x.dtype != target_dtype:
498
+ x = x.to(target_dtype)
499
+
500
+ if seq_len % self.k:
501
+ x = F.pad(x, (0, 0, 0, self.k - seq_len % self.k))
502
+
503
+ x = x.view(batch_size, -1, dim * self.k)
504
+
505
+ return self.moe(x)
506
+
507
+ def get_aux_loss(self) -> torch.Tensor:
508
+ if self.moe.last_router_logits is None:
509
+ return torch.tensor(0.0, device=self.moe.router.weight.device)
510
+
511
+ balance = load_balancing_loss(self.moe.last_router_probs, self.num_experts, self.top_k)
512
+ z = z_loss(self.moe.last_router_logits)
513
+
514
+ return self.aux_loss_coef * balance + self.z_loss_coef * z
515
+
516
+
517
+ # =============================================================================
518
+ # Projector Registry
519
+ # =============================================================================
520
+
521
+ PROJECTOR_CLASSES = {
522
+ "mlp": MLPAudioProjector,
523
+ "moe": MoEAudioProjector,
524
+ "swiglu": SwiGLUAudioProjector,
525
+ "residual": ResidualAudioProjector,
526
+ "shared_moe": SharedMoEAudioProjector,
527
+ }
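A quick smoke test of the registry: build a projector from a bare namespace carrying just the fields the constructors read (values mirror `config.json`), and check the shape contract. Using `SimpleNamespace` as a stand-in for `ASRConfig` is an assumption for the sketch:

```python
# Instantiate a projector from PROJECTOR_CLASSES and verify the shape contract.
from types import SimpleNamespace

import torch

from projectors import PROJECTOR_CLASSES  # assumes this file is on the path

cfg = SimpleNamespace(
    encoder_dim=1280, llm_dim=2048,        # Whisper d_model -> SmolLM3 hidden_size
    projector_hidden_dim=None, projector_dropout=0.05,
    projector_init_std=0.02, projector_num_layers=2,
    projector_pool_stride=2, num_experts=4,
    num_experts_per_tok=2, router_aux_loss_coef=0.01,
    router_z_loss_coef=0.001,
)

proj = PROJECTOR_CLASSES["mlp"](cfg)
x = torch.randn(1, 1500, cfg.encoder_dim)  # 30 s of Whisper encoder output
print(proj(x).shape)                       # torch.Size([1, 750, 2048])
```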
special_tokens_map.json ADDED
@@ -0,0 +1,19 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ {
4
+ "content": "<audio>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false
9
+ }
10
+ ],
11
+ "eos_token": {
12
+ "content": "<|im_end|>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false
17
+ },
18
+ "pad_token": "<|finetune_right_pad_id|>"
19
+ }
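Because `<audio>` is registered as an additional special token, the processor's `convert_tokens_to_ids(self.AUDIO_TOKEN)` resolves to a single dedicated id rather than a multi-token split; a one-line check (hypothetical repo id as before):

```python
# The <audio> placeholder must map to exactly one token id for the processor.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mrbeniwal/tiny-audio-asr")  # hypothetical id
audio_id = tok.convert_tokens_to_ids("<audio>")
assert tok.convert_ids_to_tokens(audio_id) == "<audio>"
print(audio_id)
```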
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d4aeaf198f783cbf58d8cd59812baac429ffe49147bf9648f6618de20b8d4a4c
3
+ size 17209003
tokenizer_config.json ADDED
@@ -0,0 +1,2075 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "128000": {
4
+ "content": "<|begin_of_text|>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "128001": {
12
+ "content": "<|end_of_text|>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "128002": {
20
+ "content": "<think>",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": false
26
+ },
27
+ "128003": {
28
+ "content": "</think>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": false
34
+ },
35
+ "128004": {
36
+ "content": "<|finetune_right_pad_id|>",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ },
43
+ "128005": {
44
+ "content": "<|reserved_special_token_2|>",
45
+ "lstrip": false,
46
+ "normalized": false,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": true
50
+ },
51
+ "128006": {
52
+ "content": "<|start_header_id|>",
53
+ "lstrip": false,
54
+ "normalized": false,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": true
58
+ },
59
+ "128007": {
60
+ "content": "<|end_header_id|>",
61
+ "lstrip": false,
62
+ "normalized": false,
63
+ "rstrip": false,
64
+ "single_word": false,
65
+ "special": true
66
+ },
67
+ "128008": {
68
+ "content": "<|eom_id|>",
69
+ "lstrip": false,
70
+ "normalized": false,
71
+ "rstrip": false,
72
+ "single_word": false,
73
+ "special": true
74
+ },
75
+ "128009": {
76
+ "content": "<|eot_id|>",
77
+ "lstrip": false,
78
+ "normalized": false,
79
+ "rstrip": false,
80
+ "single_word": false,
81
+ "special": true
82
+ },
83
+ "128010": {
84
+ "content": "<|python_tag|>",
85
+ "lstrip": false,
86
+ "normalized": false,
87
+ "rstrip": false,
88
+ "single_word": false,
89
+ "special": true
90
+ },
91
+ "128011": {
92
+ "content": "<|im_start|>",
93
+ "lstrip": false,
94
+ "normalized": false,
95
+ "rstrip": false,
96
+ "single_word": false,
97
+ "special": true
98
+ },
99
+ "128012": {
100
+ "content": "<|im_end|>",
101
+ "lstrip": false,
102
+ "normalized": false,
103
+ "rstrip": false,
104
+ "single_word": false,
105
+ "special": true
106
+ },
107
+ "128013": {
108
+ "content": "<tool_response>",
109
+ "lstrip": false,
110
+ "normalized": false,
111
+ "rstrip": false,
112
+ "single_word": false,
113
+ "special": false
114
+ },
115
+ "128014": {
116
+ "content": "</tool_response>",
117
+ "lstrip": false,
118
+ "normalized": false,
119
+ "rstrip": false,
120
+ "single_word": false,
121
+ "special": false
122
+ },
123
+ "128015": {
124
+ "content": "<tool_call>",
125
+ "lstrip": false,
126
+ "normalized": false,
127
+ "rstrip": false,
128
+ "single_word": false,
129
+ "special": false
130
+ },
131
+ "128016": {
132
+ "content": "</tool_call>",
133
+ "lstrip": false,
134
+ "normalized": false,
135
+ "rstrip": false,
136
+ "single_word": false,
137
+ "special": false
138
+ },
139
+ "128017": {
140
+ "content": "<code>",
141
+ "lstrip": false,
142
+ "normalized": false,
143
+ "rstrip": false,
144
+ "single_word": false,
145
+ "special": false
146
+ },
147
+ "128018": {
148
+ "content": "</code>",
149
+ "lstrip": false,
150
+ "normalized": false,
151
+ "rstrip": false,
152
+ "single_word": false,
153
+ "special": false
154
+ },
155
+ "128019": {
156
+ "content": "<|reserved_special_token_11|>",
157
+ "lstrip": false,
158
+ "normalized": false,
159
+ "rstrip": false,
160
+ "single_word": false,
161
+ "special": true
162
+ },
163
+ "128020": {
164
+ "content": "<|reserved_special_token_12|>",
165
+ "lstrip": false,
166
+ "normalized": false,
167
+ "rstrip": false,
168
+ "single_word": false,
169
+ "special": true
170
+ },
171
+ "128021": {
172
+ "content": "<|reserved_special_token_13|>",
173
+ "lstrip": false,
174
+ "normalized": false,
175
+ "rstrip": false,
176
+ "single_word": false,
177
+ "special": true
178
+ },
179
+ "128022": {
180
+ "content": "<|reserved_special_token_14|>",
181
+ "lstrip": false,
182
+ "normalized": false,
183
+ "rstrip": false,
184
+ "single_word": false,
185
+ "special": true
186
+ },
187
+ "128023": {
188
+ "content": "<|reserved_special_token_15|>",
189
+ "lstrip": false,
190
+ "normalized": false,
191
+ "rstrip": false,
192
+ "single_word": false,
193
+ "special": true
194
+ },
195
+ "128024": {
196
+ "content": "<|reserved_special_token_16|>",
197
+ "lstrip": false,
198
+ "normalized": false,
199
+ "rstrip": false,
200
+ "single_word": false,
201
+ "special": true
202
+ },
203
+ "128025": {
204
+ "content": "<|reserved_special_token_17|>",
205
+ "lstrip": false,
206
+ "normalized": false,
207
+ "rstrip": false,
208
+ "single_word": false,
209
+ "special": true
210
+ },
211
+ "128026": {
212
+ "content": "<|reserved_special_token_18|>",
213
+ "lstrip": false,
214
+ "normalized": false,
215
+ "rstrip": false,
216
+ "single_word": false,
217
+ "special": true
218
+ },
219
+ "128027": {
220
+ "content": "<|reserved_special_token_19|>",
221
+ "lstrip": false,
222
+ "normalized": false,
223
+ "rstrip": false,
224
+ "single_word": false,
225
+ "special": true
226
+ },
227
+ "128028": {
228
+ "content": "<|reserved_special_token_20|>",
229
+ "lstrip": false,
230
+ "normalized": false,
231
+ "rstrip": false,
232
+ "single_word": false,
233
+ "special": true
234
+ },
235
+ "128029": {
236
+ "content": "<|reserved_special_token_21|>",
237
+ "lstrip": false,
238
+ "normalized": false,
239
+ "rstrip": false,
240
+ "single_word": false,
241
+ "special": true
242
+ },
243
+ "128030": {
244
+ "content": "<|reserved_special_token_22|>",
245
+ "lstrip": false,
246
+ "normalized": false,
247
+ "rstrip": false,
248
+ "single_word": false,
249
+ "special": true
250
+ },
251
+ "128031": {
252
+ "content": "<|reserved_special_token_23|>",
253
+ "lstrip": false,
254
+ "normalized": false,
255
+ "rstrip": false,
256
+ "single_word": false,
257
+ "special": true
258
+ },
259
+ "128032": {
260
+ "content": "<|reserved_special_token_24|>",
261
+ "lstrip": false,
262
+ "normalized": false,
263
+ "rstrip": false,
264
+ "single_word": false,
265
+ "special": true
266
+ },
267
+ "128033": {
268
+ "content": "<|reserved_special_token_25|>",
269
+ "lstrip": false,
270
+ "normalized": false,
271
+ "rstrip": false,
272
+ "single_word": false,
273
+ "special": true
274
+ },
275
+ "128034": {
276
+ "content": "<|reserved_special_token_26|>",
277
+ "lstrip": false,
278
+ "normalized": false,
279
+ "rstrip": false,
280
+ "single_word": false,
281
+ "special": true
282
+ },
283
+ "128035": {
284
+ "content": "<|reserved_special_token_27|>",
285
+ "lstrip": false,
286
+ "normalized": false,
287
+ "rstrip": false,
288
+ "single_word": false,
289
+ "special": true
290
+ },
291
+ "128036": {
292
+ "content": "<|reserved_special_token_28|>",
293
+ "lstrip": false,
294
+ "normalized": false,
295
+ "rstrip": false,
296
+ "single_word": false,
297
+ "special": true
298
+ },
299
+ "128037": {
300
+ "content": "<|reserved_special_token_29|>",
301
+ "lstrip": false,
302
+ "normalized": false,
303
+ "rstrip": false,
304
+ "single_word": false,
305
+ "special": true
306
+ },
307
+ "128038": {
308
+ "content": "<|reserved_special_token_30|>",
309
+ "lstrip": false,
310
+ "normalized": false,
311
+ "rstrip": false,
312
+ "single_word": false,
313
+ "special": true
314
+ },
315
+ "128039": {
316
+ "content": "<|reserved_special_token_31|>",
317
+ "lstrip": false,
318
+ "normalized": false,
319
+ "rstrip": false,
320
+ "single_word": false,
321
+ "special": true
322
+ },
323
+ "128040": {
324
+ "content": "<|reserved_special_token_32|>",
325
+ "lstrip": false,
326
+ "normalized": false,
327
+ "rstrip": false,
328
+ "single_word": false,
329
+ "special": true
330
+ },
331
+ "128041": {
332
+ "content": "<|reserved_special_token_33|>",
333
+ "lstrip": false,
334
+ "normalized": false,
335
+ "rstrip": false,
336
+ "single_word": false,
337
+ "special": true
338
+ },
339
+ "128042": {
340
+ "content": "<|reserved_special_token_34|>",
341
+ "lstrip": false,
342
+ "normalized": false,
343
+ "rstrip": false,
344
+ "single_word": false,
345
+ "special": true
346
+ },
347
+ "128043": {
348
+ "content": "<|reserved_special_token_35|>",
349
+ "lstrip": false,
350
+ "normalized": false,
351
+ "rstrip": false,
352
+ "single_word": false,
353
+ "special": true
354
+ },
355
+ "128044": {
356
+ "content": "<|reserved_special_token_36|>",
357
+ "lstrip": false,
358
+ "normalized": false,
359
+ "rstrip": false,
360
+ "single_word": false,
361
+ "special": true
362
+ },
363
+ "128045": {
364
+ "content": "<|reserved_special_token_37|>",
365
+ "lstrip": false,
366
+ "normalized": false,
367
+ "rstrip": false,
368
+ "single_word": false,
369
+ "special": true
370
+ },
371
+ "128046": {
372
+ "content": "<|reserved_special_token_38|>",
373
+ "lstrip": false,
374
+ "normalized": false,
375
+ "rstrip": false,
376
+ "single_word": false,
377
+ "special": true
378
+ },
379
+ "128047": {
380
+ "content": "<|reserved_special_token_39|>",
381
+ "lstrip": false,
382
+ "normalized": false,
383
+ "rstrip": false,
384
+ "single_word": false,
385
+ "special": true
386
+ },
387
+ "128048": {
388
+ "content": "<|reserved_special_token_40|>",
389
+ "lstrip": false,
390
+ "normalized": false,
391
+ "rstrip": false,
392
+ "single_word": false,
393
+ "special": true
394
+ },
395
+ "128049": {
396
+ "content": "<|reserved_special_token_41|>",
397
+ "lstrip": false,
398
+ "normalized": false,
399
+ "rstrip": false,
400
+ "single_word": false,
401
+ "special": true
402
+ },
403
+ "128050": {
404
+ "content": "<|reserved_special_token_42|>",
405
+ "lstrip": false,
406
+ "normalized": false,
407
+ "rstrip": false,
408
+ "single_word": false,
409
+ "special": true
410
+ },
411
+ "128051": {
412
+ "content": "<|reserved_special_token_43|>",
413
+ "lstrip": false,
414
+ "normalized": false,
415
+ "rstrip": false,
416
+ "single_word": false,
417
+ "special": true
418
+ },
419
+ "128052": {
420
+ "content": "<|reserved_special_token_44|>",
421
+ "lstrip": false,
422
+ "normalized": false,
423
+ "rstrip": false,
424
+ "single_word": false,
425
+ "special": true
426
+ },
427
+ "128053": {
428
+ "content": "<|reserved_special_token_45|>",
429
+ "lstrip": false,
430
+ "normalized": false,
431
+ "rstrip": false,
432
+ "single_word": false,
433
+ "special": true
434
+ },
435
+ "128054": {
436
+ "content": "<|reserved_special_token_46|>",
437
+ "lstrip": false,
438
+ "normalized": false,
439
+ "rstrip": false,
440
+ "single_word": false,
441
+ "special": true
442
+ },
443
+ "128055": {
444
+ "content": "<|reserved_special_token_47|>",
445
+ "lstrip": false,
446
+ "normalized": false,
447
+ "rstrip": false,
448
+ "single_word": false,
449
+ "special": true
450
+ },
451
+ "128056": {
452
+ "content": "<|reserved_special_token_48|>",
453
+ "lstrip": false,
454
+ "normalized": false,
455
+ "rstrip": false,
456
+ "single_word": false,
457
+ "special": true
458
+ },
459
+ "128057": {
460
+ "content": "<|reserved_special_token_49|>",
461
+ "lstrip": false,
462
+ "normalized": false,
463
+ "rstrip": false,
464
+ "single_word": false,
465
+ "special": true
466
+ },
467
+ "128058": {
468
+ "content": "<|reserved_special_token_50|>",
469
+ "lstrip": false,
470
+ "normalized": false,
471
+ "rstrip": false,
472
+ "single_word": false,
473
+ "special": true
474
+ },
475
+ "128059": {
476
+ "content": "<|reserved_special_token_51|>",
477
+ "lstrip": false,
478
+ "normalized": false,
479
+ "rstrip": false,
480
+ "single_word": false,
481
+ "special": true
482
+ },
483
+ "128060": {
484
+ "content": "<|reserved_special_token_52|>",
485
+ "lstrip": false,
486
+ "normalized": false,
487
+ "rstrip": false,
488
+ "single_word": false,
489
+ "special": true
490
+ },
491
+ "128061": {
492
+ "content": "<|reserved_special_token_53|>",
493
+ "lstrip": false,
494
+ "normalized": false,
495
+ "rstrip": false,
496
+ "single_word": false,
497
+ "special": true
498
+ },
499
+ "128062": {
500
+ "content": "<|reserved_special_token_54|>",
501
+ "lstrip": false,
502
+ "normalized": false,
503
+ "rstrip": false,
504
+ "single_word": false,
505
+ "special": true
506
+ },
507
+ "128063": {
508
+ "content": "<|reserved_special_token_55|>",
509
+ "lstrip": false,
510
+ "normalized": false,
511
+ "rstrip": false,
512
+ "single_word": false,
513
+ "special": true
514
+ },
515
+ "128064": {
516
+ "content": "<|reserved_special_token_56|>",
517
+ "lstrip": false,
518
+ "normalized": false,
519
+ "rstrip": false,
520
+ "single_word": false,
521
+ "special": true
522
+ },
523
+ "128065": {
524
+ "content": "<|reserved_special_token_57|>",
525
+ "lstrip": false,
526
+ "normalized": false,
527
+ "rstrip": false,
528
+ "single_word": false,
529
+ "special": true
530
+ },
531
+ "128066": {
532
+ "content": "<|reserved_special_token_58|>",
533
+ "lstrip": false,
534
+ "normalized": false,
535
+ "rstrip": false,
536
+ "single_word": false,
537
+ "special": true
538
+ },
539
+ "128067": {
540
+ "content": "<|reserved_special_token_59|>",
541
+ "lstrip": false,
542
+ "normalized": false,
543
+ "rstrip": false,
544
+ "single_word": false,
545
+ "special": true
546
+ },
547
+ "128068": {
548
+ "content": "<|reserved_special_token_60|>",
549
+ "lstrip": false,
550
+ "normalized": false,
551
+ "rstrip": false,
552
+ "single_word": false,
553
+ "special": true
554
+ },
555
+ "128069": {
556
+ "content": "<|reserved_special_token_61|>",
557
+ "lstrip": false,
558
+ "normalized": false,
559
+ "rstrip": false,
560
+ "single_word": false,
561
+ "special": true
562
+ },
563
+ "128070": {
564
+ "content": "<|reserved_special_token_62|>",
565
+ "lstrip": false,
566
+ "normalized": false,
567
+ "rstrip": false,
568
+ "single_word": false,
569
+ "special": true
570
+ },
571
+ "128071": {
572
+ "content": "<|reserved_special_token_63|>",
573
+ "lstrip": false,
574
+ "normalized": false,
575
+ "rstrip": false,
576
+ "single_word": false,
577
+ "special": true
578
+ },
579
+ "128072": {
580
+ "content": "<|reserved_special_token_64|>",
581
+ "lstrip": false,
582
+ "normalized": false,
583
+ "rstrip": false,
584
+ "single_word": false,
585
+ "special": true
586
+ },
587
+ "128073": {
588
+ "content": "<|reserved_special_token_65|>",
589
+ "lstrip": false,
590
+ "normalized": false,
591
+ "rstrip": false,
592
+ "single_word": false,
593
+ "special": true
594
+ },
595
+ "128074": {
596
+ "content": "<|reserved_special_token_66|>",
597
+ "lstrip": false,
598
+ "normalized": false,
599
+ "rstrip": false,
600
+ "single_word": false,
601
+ "special": true
602
+ },
603
+ "128075": {
604
+ "content": "<|reserved_special_token_67|>",
605
+ "lstrip": false,
606
+ "normalized": false,
607
+ "rstrip": false,
608
+ "single_word": false,
609
+ "special": true
610
+ },
611
+ "128076": {
612
+ "content": "<|reserved_special_token_68|>",
613
+ "lstrip": false,
614
+ "normalized": false,
615
+ "rstrip": false,
616
+ "single_word": false,
617
+ "special": true
618
+ },
619
+ "128077": {
620
+ "content": "<|reserved_special_token_69|>",
621
+ "lstrip": false,
622
+ "normalized": false,
623
+ "rstrip": false,
624
+ "single_word": false,
625
+ "special": true
626
+ },
627
+ "128078": {
628
+ "content": "<|reserved_special_token_70|>",
629
+ "lstrip": false,
630
+ "normalized": false,
631
+ "rstrip": false,
632
+ "single_word": false,
633
+ "special": true
634
+ },
635
+ "128079": {
636
+ "content": "<|reserved_special_token_71|>",
637
+ "lstrip": false,
638
+ "normalized": false,
639
+ "rstrip": false,
640
+ "single_word": false,
641
+ "special": true
642
+ },
643
+ "128080": {
644
+ "content": "<|reserved_special_token_72|>",
645
+ "lstrip": false,
646
+ "normalized": false,
647
+ "rstrip": false,
648
+ "single_word": false,
649
+ "special": true
650
+ },
651
+ "128081": {
652
+ "content": "<|reserved_special_token_73|>",
653
+ "lstrip": false,
654
+ "normalized": false,
655
+ "rstrip": false,
656
+ "single_word": false,
657
+ "special": true
658
+ },
659
+ "128082": {
660
+ "content": "<|reserved_special_token_74|>",
661
+ "lstrip": false,
662
+ "normalized": false,
663
+ "rstrip": false,
664
+ "single_word": false,
665
+ "special": true
666
+ },
667
+ "128083": {
668
+ "content": "<|reserved_special_token_75|>",
669
+ "lstrip": false,
670
+ "normalized": false,
671
+ "rstrip": false,
672
+ "single_word": false,
673
+ "special": true
674
+ },
675
+ "128084": {
676
+ "content": "<|reserved_special_token_76|>",
677
+ "lstrip": false,
678
+ "normalized": false,
679
+ "rstrip": false,
680
+ "single_word": false,
681
+ "special": true
682
+ },
683
+ "128085": {
684
+ "content": "<|reserved_special_token_77|>",
685
+ "lstrip": false,
686
+ "normalized": false,
687
+ "rstrip": false,
688
+ "single_word": false,
689
+ "special": true
690
+ },
691
+ "128086": {
692
+ "content": "<|reserved_special_token_78|>",
693
+ "lstrip": false,
694
+ "normalized": false,
695
+ "rstrip": false,
696
+ "single_word": false,
697
+ "special": true
698
+ },
699
+ "128087": {
700
+ "content": "<|reserved_special_token_79|>",
701
+ "lstrip": false,
702
+ "normalized": false,
703
+ "rstrip": false,
704
+ "single_word": false,
705
+ "special": true
706
+ },
707
+ "128088": {
708
+ "content": "<|reserved_special_token_80|>",
709
+ "lstrip": false,
710
+ "normalized": false,
711
+ "rstrip": false,
712
+ "single_word": false,
713
+ "special": true
714
+ },
715
+ "128089": {
716
+ "content": "<|reserved_special_token_81|>",
717
+ "lstrip": false,
718
+ "normalized": false,
719
+ "rstrip": false,
720
+ "single_word": false,
721
+ "special": true
722
+ },
723
+ "128090": {
724
+ "content": "<|reserved_special_token_82|>",
725
+ "lstrip": false,
726
+ "normalized": false,
727
+ "rstrip": false,
728
+ "single_word": false,
729
+ "special": true
730
+ },
731
+ "128091": {
732
+ "content": "<|reserved_special_token_83|>",
733
+ "lstrip": false,
734
+ "normalized": false,
735
+ "rstrip": false,
736
+ "single_word": false,
737
+ "special": true
738
+ },
739
+ "128092": {
740
+ "content": "<|reserved_special_token_84|>",
741
+ "lstrip": false,
742
+ "normalized": false,
743
+ "rstrip": false,
744
+ "single_word": false,
745
+ "special": true
746
+ },
747
+ "128093": {
748
+ "content": "<|reserved_special_token_85|>",
749
+ "lstrip": false,
750
+ "normalized": false,
751
+ "rstrip": false,
752
+ "single_word": false,
753
+ "special": true
754
+ },
755
+ "128094": {
756
+ "content": "<|reserved_special_token_86|>",
757
+ "lstrip": false,
758
+ "normalized": false,
759
+ "rstrip": false,
760
+ "single_word": false,
761
+ "special": true
762
+ },
763
+ "128095": {
764
+ "content": "<|reserved_special_token_87|>",
765
+ "lstrip": false,
766
+ "normalized": false,
767
+ "rstrip": false,
768
+ "single_word": false,
769
+ "special": true
770
+ },
771
+ "128096": {
772
+ "content": "<|reserved_special_token_88|>",
773
+ "lstrip": false,
774
+ "normalized": false,
775
+ "rstrip": false,
776
+ "single_word": false,
777
+ "special": true
778
+ },
779
+ "128097": {
780
+ "content": "<|reserved_special_token_89|>",
781
+ "lstrip": false,
782
+ "normalized": false,
783
+ "rstrip": false,
784
+ "single_word": false,
785
+ "special": true
786
+ },
787
+ "128098": {
788
+ "content": "<|reserved_special_token_90|>",
789
+ "lstrip": false,
790
+ "normalized": false,
791
+ "rstrip": false,
792
+ "single_word": false,
793
+ "special": true
794
+ },
795
+ "128099": {
796
+ "content": "<|reserved_special_token_91|>",
797
+ "lstrip": false,
798
+ "normalized": false,
799
+ "rstrip": false,
800
+ "single_word": false,
801
+ "special": true
802
+ },
803
+ "128100": {
804
+ "content": "<|reserved_special_token_92|>",
805
+ "lstrip": false,
806
+ "normalized": false,
807
+ "rstrip": false,
808
+ "single_word": false,
809
+ "special": true
810
+ },
811
+ "128101": {
812
+ "content": "<|reserved_special_token_93|>",
813
+ "lstrip": false,
814
+ "normalized": false,
815
+ "rstrip": false,
816
+ "single_word": false,
817
+ "special": true
818
+ },
819
+ "128102": {
820
+ "content": "<|reserved_special_token_94|>",
821
+ "lstrip": false,
822
+ "normalized": false,
823
+ "rstrip": false,
824
+ "single_word": false,
825
+ "special": true
826
+ },
827
+ "128103": {
828
+ "content": "<|reserved_special_token_95|>",
829
+ "lstrip": false,
830
+ "normalized": false,
831
+ "rstrip": false,
832
+ "single_word": false,
833
+ "special": true
834
+ },
835
+ "128104": {
836
+ "content": "<|reserved_special_token_96|>",
837
+ "lstrip": false,
838
+ "normalized": false,
839
+ "rstrip": false,
840
+ "single_word": false,
841
+ "special": true
842
+ },
843
+ "128105": {
844
+ "content": "<|reserved_special_token_97|>",
845
+ "lstrip": false,
846
+ "normalized": false,
847
+ "rstrip": false,
848
+ "single_word": false,
849
+ "special": true
850
+ },
851
+ "128106": {
852
+ "content": "<|reserved_special_token_98|>",
853
+ "lstrip": false,
854
+ "normalized": false,
855
+ "rstrip": false,
856
+ "single_word": false,
857
+ "special": true
858
+ },
859
+ "128107": {
860
+ "content": "<|reserved_special_token_99|>",
861
+ "lstrip": false,
862
+ "normalized": false,
863
+ "rstrip": false,
864
+ "single_word": false,
865
+ "special": true
866
+ },
867
+ "128108": {
868
+ "content": "<|reserved_special_token_100|>",
869
+ "lstrip": false,
870
+ "normalized": false,
871
+ "rstrip": false,
872
+ "single_word": false,
873
+ "special": true
874
+ },
875
+ "128109": {
876
+ "content": "<|reserved_special_token_101|>",
877
+ "lstrip": false,
878
+ "normalized": false,
879
+ "rstrip": false,
880
+ "single_word": false,
881
+ "special": true
882
+ },
883
+ "128110": {
884
+ "content": "<|reserved_special_token_102|>",
885
+ "lstrip": false,
886
+ "normalized": false,
887
+ "rstrip": false,
888
+ "single_word": false,
889
+ "special": true
890
+ },
891
+ "128111": {
892
+ "content": "<|reserved_special_token_103|>",
893
+ "lstrip": false,
894
+ "normalized": false,
895
+ "rstrip": false,
896
+ "single_word": false,
897
+ "special": true
898
+ },
899
+ "128112": {
900
+ "content": "<|reserved_special_token_104|>",
901
+ "lstrip": false,
902
+ "normalized": false,
903
+ "rstrip": false,
904
+ "single_word": false,
905
+ "special": true
906
+ },
907
+ "128113": {
908
+ "content": "<|reserved_special_token_105|>",
909
+ "lstrip": false,
910
+ "normalized": false,
911
+ "rstrip": false,
912
+ "single_word": false,
913
+ "special": true
914
+ },
915
+ "128114": {
916
+ "content": "<|reserved_special_token_106|>",
917
+ "lstrip": false,
918
+ "normalized": false,
919
+ "rstrip": false,
920
+ "single_word": false,
921
+ "special": true
922
+ },
923
+ "128115": {
924
+ "content": "<|reserved_special_token_107|>",
925
+ "lstrip": false,
926
+ "normalized": false,
927
+ "rstrip": false,
928
+ "single_word": false,
929
+ "special": true
930
+ },
931
+ "128116": {
932
+ "content": "<|reserved_special_token_108|>",
933
+ "lstrip": false,
934
+ "normalized": false,
935
+ "rstrip": false,
936
+ "single_word": false,
937
+ "special": true
938
+ },
939
+ "128117": {
940
+ "content": "<|reserved_special_token_109|>",
941
+ "lstrip": false,
942
+ "normalized": false,
943
+ "rstrip": false,
944
+ "single_word": false,
945
+ "special": true
946
+ },
947
+ "128118": {
948
+ "content": "<|reserved_special_token_110|>",
949
+ "lstrip": false,
950
+ "normalized": false,
951
+ "rstrip": false,
952
+ "single_word": false,
953
+ "special": true
954
+ },
955
+ "128119": {
956
+ "content": "<|reserved_special_token_111|>",
957
+ "lstrip": false,
958
+ "normalized": false,
959
+ "rstrip": false,
960
+ "single_word": false,
961
+ "special": true
962
+ },
963
+ "128120": {
964
+ "content": "<|reserved_special_token_112|>",
965
+ "lstrip": false,
966
+ "normalized": false,
967
+ "rstrip": false,
968
+ "single_word": false,
969
+ "special": true
970
+ },
971
+ "128121": {
972
+ "content": "<|reserved_special_token_113|>",
973
+ "lstrip": false,
974
+ "normalized": false,
975
+ "rstrip": false,
976
+ "single_word": false,
977
+ "special": true
978
+ },
979
+ "128122": {
980
+ "content": "<|reserved_special_token_114|>",
981
+ "lstrip": false,
982
+ "normalized": false,
983
+ "rstrip": false,
984
+ "single_word": false,
985
+ "special": true
986
+ },
987
+ "128123": {
988
+ "content": "<|reserved_special_token_115|>",
989
+ "lstrip": false,
990
+ "normalized": false,
991
+ "rstrip": false,
992
+ "single_word": false,
993
+ "special": true
994
+ },
995
+ "128124": {
996
+ "content": "<|reserved_special_token_116|>",
997
+ "lstrip": false,
998
+ "normalized": false,
999
+ "rstrip": false,
1000
+ "single_word": false,
1001
+ "special": true
1002
+ },
1003
+ "128125": {
1004
+ "content": "<|reserved_special_token_117|>",
1005
+ "lstrip": false,
1006
+ "normalized": false,
1007
+ "rstrip": false,
1008
+ "single_word": false,
1009
+ "special": true
1010
+ },
1011
+ "128126": {
1012
+ "content": "<|reserved_special_token_118|>",
1013
+ "lstrip": false,
1014
+ "normalized": false,
1015
+ "rstrip": false,
1016
+ "single_word": false,
1017
+ "special": true
1018
+ },
1019
+ "128127": {
1020
+ "content": "<|reserved_special_token_119|>",
1021
+ "lstrip": false,
1022
+ "normalized": false,
1023
+ "rstrip": false,
1024
+ "single_word": false,
1025
+ "special": true
1026
+ },
1027
+ "128128": {
1028
+ "content": "<|reserved_special_token_120|>",
1029
+ "lstrip": false,
1030
+ "normalized": false,
1031
+ "rstrip": false,
1032
+ "single_word": false,
1033
+ "special": true
1034
+ },
1035
+ "128129": {
1036
+ "content": "<|reserved_special_token_121|>",
1037
+ "lstrip": false,
1038
+ "normalized": false,
1039
+ "rstrip": false,
1040
+ "single_word": false,
1041
+ "special": true
1042
+ },
1043
+ "128130": {
1044
+ "content": "<|reserved_special_token_122|>",
1045
+ "lstrip": false,
1046
+ "normalized": false,
1047
+ "rstrip": false,
1048
+ "single_word": false,
1049
+ "special": true
1050
+ },
1051
+ "128131": {
1052
+ "content": "<|reserved_special_token_123|>",
1053
+ "lstrip": false,
1054
+ "normalized": false,
1055
+ "rstrip": false,
1056
+ "single_word": false,
1057
+ "special": true
1058
+ },
1059
+ "128132": {
1060
+ "content": "<|reserved_special_token_124|>",
1061
+ "lstrip": false,
1062
+ "normalized": false,
1063
+ "rstrip": false,
1064
+ "single_word": false,
1065
+ "special": true
1066
+ },
1067
+ "128133": {
1068
+ "content": "<|reserved_special_token_125|>",
1069
+ "lstrip": false,
1070
+ "normalized": false,
1071
+ "rstrip": false,
1072
+ "single_word": false,
1073
+ "special": true
1074
+ },
1075
+ "128134": {
1076
+ "content": "<|reserved_special_token_126|>",
1077
+ "lstrip": false,
1078
+ "normalized": false,
1079
+ "rstrip": false,
1080
+ "single_word": false,
1081
+ "special": true
1082
+ },
1083
+ "128135": {
1084
+ "content": "<|reserved_special_token_127|>",
1085
+ "lstrip": false,
1086
+ "normalized": false,
1087
+ "rstrip": false,
1088
+ "single_word": false,
1089
+ "special": true
1090
+ },
1091
+ "128136": {
1092
+ "content": "<|reserved_special_token_128|>",
1093
+ "lstrip": false,
1094
+ "normalized": false,
1095
+ "rstrip": false,
1096
+ "single_word": false,
1097
+ "special": true
1098
+ },
1099
+ "128137": {
1100
+ "content": "<|reserved_special_token_129|>",
1101
+ "lstrip": false,
1102
+ "normalized": false,
1103
+ "rstrip": false,
1104
+ "single_word": false,
1105
+ "special": true
1106
+ },
1107
+ "128138": {
1108
+ "content": "<|reserved_special_token_130|>",
1109
+ "lstrip": false,
1110
+ "normalized": false,
1111
+ "rstrip": false,
1112
+ "single_word": false,
1113
+ "special": true
1114
+ },
1115
+ "128139": {
1116
+ "content": "<|reserved_special_token_131|>",
1117
+ "lstrip": false,
1118
+ "normalized": false,
1119
+ "rstrip": false,
1120
+ "single_word": false,
1121
+ "special": true
1122
+ },
1123
+ "128140": {
1124
+ "content": "<|reserved_special_token_132|>",
1125
+ "lstrip": false,
1126
+ "normalized": false,
1127
+ "rstrip": false,
1128
+ "single_word": false,
1129
+ "special": true
1130
+ },
1131
+ "128141": {
1132
+ "content": "<|reserved_special_token_133|>",
1133
+ "lstrip": false,
1134
+ "normalized": false,
1135
+ "rstrip": false,
1136
+ "single_word": false,
1137
+ "special": true
1138
+ },
1139
+ "128142": {
1140
+ "content": "<|reserved_special_token_134|>",
1141
+ "lstrip": false,
1142
+ "normalized": false,
1143
+ "rstrip": false,
1144
+ "single_word": false,
1145
+ "special": true
1146
+ },
1147
+ "128143": {
1148
+ "content": "<|reserved_special_token_135|>",
1149
+ "lstrip": false,
1150
+ "normalized": false,
1151
+ "rstrip": false,
1152
+ "single_word": false,
1153
+ "special": true
1154
+ },
1155
+ "128144": {
1156
+ "content": "<|reserved_special_token_136|>",
1157
+ "lstrip": false,
1158
+ "normalized": false,
1159
+ "rstrip": false,
1160
+ "single_word": false,
1161
+ "special": true
1162
+ },
1163
+ "128145": {
1164
+ "content": "<|reserved_special_token_137|>",
1165
+ "lstrip": false,
1166
+ "normalized": false,
1167
+ "rstrip": false,
1168
+ "single_word": false,
1169
+ "special": true
1170
+ },
1171
+ "128146": {
1172
+ "content": "<|reserved_special_token_138|>",
1173
+ "lstrip": false,
1174
+ "normalized": false,
1175
+ "rstrip": false,
1176
+ "single_word": false,
1177
+ "special": true
1178
+ },
1179
+ "128147": {
1180
+ "content": "<|reserved_special_token_139|>",
1181
+ "lstrip": false,
1182
+ "normalized": false,
1183
+ "rstrip": false,
1184
+ "single_word": false,
1185
+ "special": true
1186
+ },
1187
+ "128148": {
1188
+ "content": "<|reserved_special_token_140|>",
1189
+ "lstrip": false,
1190
+ "normalized": false,
1191
+ "rstrip": false,
1192
+ "single_word": false,
1193
+ "special": true
1194
+ },
1195
+ "128149": {
1196
+ "content": "<|reserved_special_token_141|>",
1197
+ "lstrip": false,
1198
+ "normalized": false,
1199
+ "rstrip": false,
1200
+ "single_word": false,
1201
+ "special": true
1202
+ },
1203
+ "128150": {
1204
+ "content": "<|reserved_special_token_142|>",
1205
+ "lstrip": false,
1206
+ "normalized": false,
1207
+ "rstrip": false,
1208
+ "single_word": false,
1209
+ "special": true
1210
+ },
1211
+ "128151": {
1212
+ "content": "<|reserved_special_token_143|>",
1213
+ "lstrip": false,
1214
+ "normalized": false,
1215
+ "rstrip": false,
1216
+ "single_word": false,
1217
+ "special": true
1218
+ },
1219
+ "128152": {
1220
+ "content": "<|reserved_special_token_144|>",
1221
+ "lstrip": false,
1222
+ "normalized": false,
1223
+ "rstrip": false,
1224
+ "single_word": false,
1225
+ "special": true
1226
+ },
1227
+ "128153": {
1228
+ "content": "<|reserved_special_token_145|>",
1229
+ "lstrip": false,
1230
+ "normalized": false,
1231
+ "rstrip": false,
1232
+ "single_word": false,
1233
+ "special": true
1234
+ },
1235
+ "128154": {
1236
+ "content": "<|reserved_special_token_146|>",
1237
+ "lstrip": false,
1238
+ "normalized": false,
1239
+ "rstrip": false,
1240
+ "single_word": false,
1241
+ "special": true
1242
+ },
1243
+ "128155": {
1244
+ "content": "<|reserved_special_token_147|>",
1245
+ "lstrip": false,
1246
+ "normalized": false,
1247
+ "rstrip": false,
1248
+ "single_word": false,
1249
+ "special": true
1250
+ },
1251
+ "128156": {
1252
+ "content": "<|reserved_special_token_148|>",
1253
+ "lstrip": false,
1254
+ "normalized": false,
1255
+ "rstrip": false,
1256
+ "single_word": false,
1257
+ "special": true
1258
+ },
1259
+ "128157": {
1260
+ "content": "<|reserved_special_token_149|>",
1261
+ "lstrip": false,
1262
+ "normalized": false,
1263
+ "rstrip": false,
1264
+ "single_word": false,
1265
+ "special": true
1266
+ },
1267
+ "128158": {
1268
+ "content": "<|reserved_special_token_150|>",
1269
+ "lstrip": false,
1270
+ "normalized": false,
1271
+ "rstrip": false,
1272
+ "single_word": false,
1273
+ "special": true
1274
+ },
1275
+ "128159": {
1276
+ "content": "<|reserved_special_token_151|>",
1277
+ "lstrip": false,
1278
+ "normalized": false,
1279
+ "rstrip": false,
1280
+ "single_word": false,
1281
+ "special": true
1282
+ },
1283
+ "128160": {
1284
+ "content": "<|reserved_special_token_152|>",
1285
+ "lstrip": false,
1286
+ "normalized": false,
1287
+ "rstrip": false,
1288
+ "single_word": false,
1289
+ "special": true
1290
+ },
1291
+ "128161": {
1292
+ "content": "<|reserved_special_token_153|>",
1293
+ "lstrip": false,
1294
+ "normalized": false,
1295
+ "rstrip": false,
1296
+ "single_word": false,
1297
+ "special": true
1298
+ },
1299
+ "128162": {
1300
+ "content": "<|reserved_special_token_154|>",
1301
+ "lstrip": false,
1302
+ "normalized": false,
1303
+ "rstrip": false,
1304
+ "single_word": false,
1305
+ "special": true
1306
+ },
1307
+ "128163": {
1308
+ "content": "<|reserved_special_token_155|>",
1309
+ "lstrip": false,
1310
+ "normalized": false,
1311
+ "rstrip": false,
1312
+ "single_word": false,
1313
+ "special": true
1314
+ },
1315
+ "128164": {
1316
+ "content": "<|reserved_special_token_156|>",
1317
+ "lstrip": false,
1318
+ "normalized": false,
1319
+ "rstrip": false,
1320
+ "single_word": false,
1321
+ "special": true
1322
+ },
1323
+ "128165": {
1324
+ "content": "<|reserved_special_token_157|>",
1325
+ "lstrip": false,
1326
+ "normalized": false,
1327
+ "rstrip": false,
1328
+ "single_word": false,
1329
+ "special": true
1330
+ },
1331
+ "128166": {
1332
+ "content": "<|reserved_special_token_158|>",
1333
+ "lstrip": false,
1334
+ "normalized": false,
1335
+ "rstrip": false,
1336
+ "single_word": false,
1337
+ "special": true
1338
+ },
1339
+ "128167": {
1340
+ "content": "<|reserved_special_token_159|>",
1341
+ "lstrip": false,
1342
+ "normalized": false,
1343
+ "rstrip": false,
1344
+ "single_word": false,
1345
+ "special": true
1346
+ },
1347
+ "128168": {
1348
+ "content": "<|reserved_special_token_160|>",
1349
+ "lstrip": false,
1350
+ "normalized": false,
1351
+ "rstrip": false,
1352
+ "single_word": false,
1353
+ "special": true
1354
+ },
1355
+ "128169": {
1356
+ "content": "<|reserved_special_token_161|>",
1357
+ "lstrip": false,
1358
+ "normalized": false,
1359
+ "rstrip": false,
1360
+ "single_word": false,
1361
+ "special": true
1362
+ },
1363
+ "128170": {
1364
+ "content": "<|reserved_special_token_162|>",
1365
+ "lstrip": false,
1366
+ "normalized": false,
1367
+ "rstrip": false,
1368
+ "single_word": false,
1369
+ "special": true
1370
+ },
1371
+ "128171": {
1372
+ "content": "<|reserved_special_token_163|>",
1373
+ "lstrip": false,
1374
+ "normalized": false,
1375
+ "rstrip": false,
1376
+ "single_word": false,
1377
+ "special": true
1378
+ },
1379
+ "128172": {
1380
+ "content": "<|reserved_special_token_164|>",
1381
+ "lstrip": false,
1382
+ "normalized": false,
1383
+ "rstrip": false,
1384
+ "single_word": false,
1385
+ "special": true
1386
+ },
1387
+ "128173": {
1388
+ "content": "<|reserved_special_token_165|>",
1389
+ "lstrip": false,
1390
+ "normalized": false,
1391
+ "rstrip": false,
1392
+ "single_word": false,
1393
+ "special": true
1394
+ },
1395
+ "128174": {
1396
+ "content": "<|reserved_special_token_166|>",
1397
+ "lstrip": false,
1398
+ "normalized": false,
1399
+ "rstrip": false,
1400
+ "single_word": false,
1401
+ "special": true
1402
+ },
1403
+ "128175": {
1404
+ "content": "<|reserved_special_token_167|>",
1405
+ "lstrip": false,
1406
+ "normalized": false,
1407
+ "rstrip": false,
1408
+ "single_word": false,
1409
+ "special": true
1410
+ },
1411
+ "128176": {
1412
+ "content": "<|reserved_special_token_168|>",
1413
+ "lstrip": false,
1414
+ "normalized": false,
1415
+ "rstrip": false,
1416
+ "single_word": false,
1417
+ "special": true
1418
+ },
1419
+ "128177": {
1420
+ "content": "<|reserved_special_token_169|>",
1421
+ "lstrip": false,
1422
+ "normalized": false,
1423
+ "rstrip": false,
1424
+ "single_word": false,
1425
+ "special": true
1426
+ },
1427
+ "128178": {
1428
+ "content": "<|reserved_special_token_170|>",
1429
+ "lstrip": false,
1430
+ "normalized": false,
1431
+ "rstrip": false,
1432
+ "single_word": false,
1433
+ "special": true
1434
+ },
1435
+ "128179": {
1436
+ "content": "<|reserved_special_token_171|>",
1437
+ "lstrip": false,
1438
+ "normalized": false,
1439
+ "rstrip": false,
1440
+ "single_word": false,
1441
+ "special": true
1442
+ },
1443
+ "128180": {
1444
+ "content": "<|reserved_special_token_172|>",
1445
+ "lstrip": false,
1446
+ "normalized": false,
1447
+ "rstrip": false,
1448
+ "single_word": false,
1449
+ "special": true
1450
+ },
1451
+ "128181": {
1452
+ "content": "<|reserved_special_token_173|>",
1453
+ "lstrip": false,
1454
+ "normalized": false,
1455
+ "rstrip": false,
1456
+ "single_word": false,
1457
+ "special": true
1458
+ },
1459
+ "128182": {
1460
+ "content": "<|reserved_special_token_174|>",
1461
+ "lstrip": false,
1462
+ "normalized": false,
1463
+ "rstrip": false,
1464
+ "single_word": false,
1465
+ "special": true
1466
+ },
1467
+ "128183": {
1468
+ "content": "<|reserved_special_token_175|>",
1469
+ "lstrip": false,
1470
+ "normalized": false,
1471
+ "rstrip": false,
1472
+ "single_word": false,
1473
+ "special": true
1474
+ },
1475
+ "128184": {
1476
+ "content": "<|reserved_special_token_176|>",
1477
+ "lstrip": false,
1478
+ "normalized": false,
1479
+ "rstrip": false,
1480
+ "single_word": false,
1481
+ "special": true
1482
+ },
1483
+ "128185": {
1484
+ "content": "<|reserved_special_token_177|>",
1485
+ "lstrip": false,
1486
+ "normalized": false,
1487
+ "rstrip": false,
1488
+ "single_word": false,
1489
+ "special": true
1490
+ },
1491
+ "128186": {
1492
+ "content": "<|reserved_special_token_178|>",
1493
+ "lstrip": false,
1494
+ "normalized": false,
1495
+ "rstrip": false,
1496
+ "single_word": false,
1497
+ "special": true
1498
+ },
1499
+ "128187": {
1500
+ "content": "<|reserved_special_token_179|>",
1501
+ "lstrip": false,
1502
+ "normalized": false,
1503
+ "rstrip": false,
1504
+ "single_word": false,
1505
+ "special": true
1506
+ },
1507
+ "128188": {
1508
+ "content": "<|reserved_special_token_180|>",
1509
+ "lstrip": false,
1510
+ "normalized": false,
1511
+ "rstrip": false,
1512
+ "single_word": false,
1513
+ "special": true
1514
+ },
1515
+ "128189": {
1516
+ "content": "<|reserved_special_token_181|>",
1517
+ "lstrip": false,
1518
+ "normalized": false,
1519
+ "rstrip": false,
1520
+ "single_word": false,
1521
+ "special": true
1522
+ },
1523
+ "128190": {
1524
+ "content": "<|reserved_special_token_182|>",
1525
+ "lstrip": false,
1526
+ "normalized": false,
1527
+ "rstrip": false,
1528
+ "single_word": false,
1529
+ "special": true
1530
+ },
1531
+ "128191": {
1532
+ "content": "<|reserved_special_token_183|>",
1533
+ "lstrip": false,
1534
+ "normalized": false,
1535
+ "rstrip": false,
1536
+ "single_word": false,
1537
+ "special": true
1538
+ },
1539
+ "128192": {
1540
+ "content": "<|reserved_special_token_184|>",
1541
+ "lstrip": false,
1542
+ "normalized": false,
1543
+ "rstrip": false,
1544
+ "single_word": false,
1545
+ "special": true
1546
+ },
1547
+ "128193": {
1548
+ "content": "<|reserved_special_token_185|>",
1549
+ "lstrip": false,
1550
+ "normalized": false,
1551
+ "rstrip": false,
1552
+ "single_word": false,
1553
+ "special": true
1554
+ },
1555
+ "128194": {
1556
+ "content": "<|reserved_special_token_186|>",
1557
+ "lstrip": false,
1558
+ "normalized": false,
1559
+ "rstrip": false,
1560
+ "single_word": false,
1561
+ "special": true
1562
+ },
1563
+ "128195": {
1564
+ "content": "<|reserved_special_token_187|>",
1565
+ "lstrip": false,
1566
+ "normalized": false,
1567
+ "rstrip": false,
1568
+ "single_word": false,
1569
+ "special": true
1570
+ },
1571
+ "128196": {
1572
+ "content": "<|reserved_special_token_188|>",
1573
+ "lstrip": false,
1574
+ "normalized": false,
1575
+ "rstrip": false,
1576
+ "single_word": false,
1577
+ "special": true
1578
+ },
1579
+ "128197": {
1580
+ "content": "<|reserved_special_token_189|>",
1581
+ "lstrip": false,
1582
+ "normalized": false,
1583
+ "rstrip": false,
1584
+ "single_word": false,
1585
+ "special": true
1586
+ },
1587
+ "128198": {
1588
+ "content": "<|reserved_special_token_190|>",
1589
+ "lstrip": false,
1590
+ "normalized": false,
1591
+ "rstrip": false,
1592
+ "single_word": false,
1593
+ "special": true
1594
+ },
1595
+ "128199": {
1596
+ "content": "<|reserved_special_token_191|>",
1597
+ "lstrip": false,
1598
+ "normalized": false,
1599
+ "rstrip": false,
1600
+ "single_word": false,
1601
+ "special": true
1602
+ },
1603
+ "128200": {
1604
+ "content": "<|reserved_special_token_192|>",
1605
+ "lstrip": false,
1606
+ "normalized": false,
1607
+ "rstrip": false,
1608
+ "single_word": false,
1609
+ "special": true
1610
+ },
1611
+ "128201": {
1612
+ "content": "<|reserved_special_token_193|>",
1613
+ "lstrip": false,
1614
+ "normalized": false,
1615
+ "rstrip": false,
1616
+ "single_word": false,
1617
+ "special": true
1618
+ },
1619
+ "128202": {
1620
+ "content": "<|reserved_special_token_194|>",
1621
+ "lstrip": false,
1622
+ "normalized": false,
1623
+ "rstrip": false,
1624
+ "single_word": false,
1625
+ "special": true
1626
+ },
1627
+ "128203": {
1628
+ "content": "<|reserved_special_token_195|>",
1629
+ "lstrip": false,
1630
+ "normalized": false,
1631
+ "rstrip": false,
1632
+ "single_word": false,
1633
+ "special": true
1634
+ },
1635
+ "128204": {
1636
+ "content": "<|reserved_special_token_196|>",
1637
+ "lstrip": false,
1638
+ "normalized": false,
1639
+ "rstrip": false,
1640
+ "single_word": false,
1641
+ "special": true
1642
+ },
1643
+ "128205": {
1644
+ "content": "<|reserved_special_token_197|>",
1645
+ "lstrip": false,
1646
+ "normalized": false,
1647
+ "rstrip": false,
1648
+ "single_word": false,
1649
+ "special": true
1650
+ },
1651
+ "128206": {
1652
+ "content": "<|reserved_special_token_198|>",
1653
+ "lstrip": false,
1654
+ "normalized": false,
1655
+ "rstrip": false,
1656
+ "single_word": false,
1657
+ "special": true
1658
+ },
1659
+ "128207": {
1660
+ "content": "<|reserved_special_token_199|>",
1661
+ "lstrip": false,
1662
+ "normalized": false,
1663
+ "rstrip": false,
1664
+ "single_word": false,
1665
+ "special": true
1666
+ },
1667
+ "128208": {
1668
+ "content": "<|reserved_special_token_200|>",
1669
+ "lstrip": false,
1670
+ "normalized": false,
1671
+ "rstrip": false,
1672
+ "single_word": false,
1673
+ "special": true
1674
+ },
1675
+ "128209": {
1676
+ "content": "<|reserved_special_token_201|>",
1677
+ "lstrip": false,
1678
+ "normalized": false,
1679
+ "rstrip": false,
1680
+ "single_word": false,
1681
+ "special": true
1682
+ },
1683
+ "128210": {
1684
+ "content": "<|reserved_special_token_202|>",
1685
+ "lstrip": false,
1686
+ "normalized": false,
1687
+ "rstrip": false,
1688
+ "single_word": false,
1689
+ "special": true
1690
+ },
1691
+ "128211": {
1692
+ "content": "<|reserved_special_token_203|>",
1693
+ "lstrip": false,
1694
+ "normalized": false,
1695
+ "rstrip": false,
1696
+ "single_word": false,
1697
+ "special": true
1698
+ },
1699
+ "128212": {
1700
+ "content": "<|reserved_special_token_204|>",
1701
+ "lstrip": false,
1702
+ "normalized": false,
1703
+ "rstrip": false,
1704
+ "single_word": false,
1705
+ "special": true
1706
+ },
1707
+ "128213": {
1708
+ "content": "<|reserved_special_token_205|>",
1709
+ "lstrip": false,
1710
+ "normalized": false,
1711
+ "rstrip": false,
1712
+ "single_word": false,
1713
+ "special": true
1714
+ },
1715
+ "128214": {
1716
+ "content": "<|reserved_special_token_206|>",
1717
+ "lstrip": false,
1718
+ "normalized": false,
1719
+ "rstrip": false,
1720
+ "single_word": false,
1721
+ "special": true
1722
+ },
1723
+ "128215": {
1724
+ "content": "<|reserved_special_token_207|>",
1725
+ "lstrip": false,
1726
+ "normalized": false,
1727
+ "rstrip": false,
1728
+ "single_word": false,
1729
+ "special": true
1730
+ },
1731
+ "128216": {
1732
+ "content": "<|reserved_special_token_208|>",
1733
+ "lstrip": false,
1734
+ "normalized": false,
1735
+ "rstrip": false,
1736
+ "single_word": false,
1737
+ "special": true
1738
+ },
1739
+ "128217": {
1740
+ "content": "<|reserved_special_token_209|>",
1741
+ "lstrip": false,
1742
+ "normalized": false,
1743
+ "rstrip": false,
1744
+ "single_word": false,
1745
+ "special": true
1746
+ },
1747
+ "128218": {
1748
+ "content": "<|reserved_special_token_210|>",
1749
+ "lstrip": false,
1750
+ "normalized": false,
1751
+ "rstrip": false,
1752
+ "single_word": false,
1753
+ "special": true
1754
+ },
1755
+ "128219": {
1756
+ "content": "<|reserved_special_token_211|>",
1757
+ "lstrip": false,
1758
+ "normalized": false,
1759
+ "rstrip": false,
1760
+ "single_word": false,
1761
+ "special": true
1762
+ },
1763
+ "128220": {
1764
+ "content": "<|reserved_special_token_212|>",
1765
+ "lstrip": false,
1766
+ "normalized": false,
1767
+ "rstrip": false,
1768
+ "single_word": false,
1769
+ "special": true
1770
+ },
1771
+ "128221": {
1772
+ "content": "<|reserved_special_token_213|>",
1773
+ "lstrip": false,
1774
+ "normalized": false,
1775
+ "rstrip": false,
1776
+ "single_word": false,
1777
+ "special": true
1778
+ },
1779
+ "128222": {
1780
+ "content": "<|reserved_special_token_214|>",
1781
+ "lstrip": false,
1782
+ "normalized": false,
1783
+ "rstrip": false,
1784
+ "single_word": false,
1785
+ "special": true
1786
+ },
1787
+ "128223": {
1788
+ "content": "<|reserved_special_token_215|>",
1789
+ "lstrip": false,
1790
+ "normalized": false,
1791
+ "rstrip": false,
1792
+ "single_word": false,
1793
+ "special": true
1794
+ },
1795
+ "128224": {
1796
+ "content": "<|reserved_special_token_216|>",
1797
+ "lstrip": false,
1798
+ "normalized": false,
1799
+ "rstrip": false,
1800
+ "single_word": false,
1801
+ "special": true
1802
+ },
1803
+ "128225": {
1804
+ "content": "<|reserved_special_token_217|>",
1805
+ "lstrip": false,
1806
+ "normalized": false,
1807
+ "rstrip": false,
1808
+ "single_word": false,
1809
+ "special": true
1810
+ },
1811
+ "128226": {
1812
+ "content": "<|reserved_special_token_218|>",
1813
+ "lstrip": false,
1814
+ "normalized": false,
1815
+ "rstrip": false,
1816
+ "single_word": false,
1817
+ "special": true
1818
+ },
1819
+ "128227": {
1820
+ "content": "<|reserved_special_token_219|>",
1821
+ "lstrip": false,
1822
+ "normalized": false,
1823
+ "rstrip": false,
1824
+ "single_word": false,
1825
+ "special": true
1826
+ },
1827
+ "128228": {
1828
+ "content": "<|reserved_special_token_220|>",
1829
+ "lstrip": false,
1830
+ "normalized": false,
1831
+ "rstrip": false,
1832
+ "single_word": false,
1833
+ "special": true
1834
+ },
1835
+ "128229": {
1836
+ "content": "<|reserved_special_token_221|>",
1837
+ "lstrip": false,
1838
+ "normalized": false,
1839
+ "rstrip": false,
1840
+ "single_word": false,
1841
+ "special": true
1842
+ },
1843
+ "128230": {
1844
+ "content": "<|reserved_special_token_222|>",
1845
+ "lstrip": false,
1846
+ "normalized": false,
1847
+ "rstrip": false,
1848
+ "single_word": false,
1849
+ "special": true
1850
+ },
1851
+ "128231": {
1852
+ "content": "<|reserved_special_token_223|>",
1853
+ "lstrip": false,
1854
+ "normalized": false,
1855
+ "rstrip": false,
1856
+ "single_word": false,
1857
+ "special": true
1858
+ },
1859
+ "128232": {
1860
+ "content": "<|reserved_special_token_224|>",
1861
+ "lstrip": false,
1862
+ "normalized": false,
1863
+ "rstrip": false,
1864
+ "single_word": false,
1865
+ "special": true
1866
+ },
1867
+ "128233": {
1868
+ "content": "<|reserved_special_token_225|>",
1869
+ "lstrip": false,
1870
+ "normalized": false,
1871
+ "rstrip": false,
1872
+ "single_word": false,
1873
+ "special": true
1874
+ },
1875
+ "128234": {
1876
+ "content": "<|reserved_special_token_226|>",
1877
+ "lstrip": false,
1878
+ "normalized": false,
1879
+ "rstrip": false,
1880
+ "single_word": false,
1881
+ "special": true
1882
+ },
1883
+ "128235": {
1884
+ "content": "<|reserved_special_token_227|>",
1885
+ "lstrip": false,
1886
+ "normalized": false,
1887
+ "rstrip": false,
1888
+ "single_word": false,
1889
+ "special": true
1890
+ },
1891
+ "128236": {
1892
+ "content": "<|reserved_special_token_228|>",
1893
+ "lstrip": false,
1894
+ "normalized": false,
1895
+ "rstrip": false,
1896
+ "single_word": false,
1897
+ "special": true
1898
+ },
1899
+ "128237": {
1900
+ "content": "<|reserved_special_token_229|>",
1901
+ "lstrip": false,
1902
+ "normalized": false,
1903
+ "rstrip": false,
1904
+ "single_word": false,
1905
+ "special": true
1906
+ },
1907
+ "128238": {
1908
+ "content": "<|reserved_special_token_230|>",
1909
+ "lstrip": false,
1910
+ "normalized": false,
1911
+ "rstrip": false,
1912
+ "single_word": false,
1913
+ "special": true
1914
+ },
1915
+ "128239": {
1916
+ "content": "<|reserved_special_token_231|>",
1917
+ "lstrip": false,
1918
+ "normalized": false,
1919
+ "rstrip": false,
1920
+ "single_word": false,
1921
+ "special": true
1922
+ },
1923
+ "128240": {
1924
+ "content": "<|reserved_special_token_232|>",
1925
+ "lstrip": false,
1926
+ "normalized": false,
1927
+ "rstrip": false,
1928
+ "single_word": false,
1929
+ "special": true
1930
+ },
1931
+ "128241": {
1932
+ "content": "<|reserved_special_token_233|>",
1933
+ "lstrip": false,
1934
+ "normalized": false,
1935
+ "rstrip": false,
1936
+ "single_word": false,
1937
+ "special": true
1938
+ },
1939
+ "128242": {
1940
+ "content": "<|reserved_special_token_234|>",
1941
+ "lstrip": false,
1942
+ "normalized": false,
1943
+ "rstrip": false,
1944
+ "single_word": false,
1945
+ "special": true
1946
+ },
1947
+ "128243": {
1948
+ "content": "<|reserved_special_token_235|>",
1949
+ "lstrip": false,
1950
+ "normalized": false,
1951
+ "rstrip": false,
1952
+ "single_word": false,
1953
+ "special": true
1954
+ },
1955
+ "128244": {
1956
+ "content": "<|reserved_special_token_236|>",
1957
+ "lstrip": false,
1958
+ "normalized": false,
1959
+ "rstrip": false,
1960
+ "single_word": false,
1961
+ "special": true
1962
+ },
1963
+ "128245": {
1964
+ "content": "<|reserved_special_token_237|>",
1965
+ "lstrip": false,
1966
+ "normalized": false,
1967
+ "rstrip": false,
1968
+ "single_word": false,
1969
+ "special": true
1970
+ },
1971
+ "128246": {
1972
+ "content": "<|reserved_special_token_238|>",
1973
+ "lstrip": false,
1974
+ "normalized": false,
1975
+ "rstrip": false,
1976
+ "single_word": false,
1977
+ "special": true
1978
+ },
1979
+ "128247": {
1980
+ "content": "<|reserved_special_token_239|>",
1981
+ "lstrip": false,
1982
+ "normalized": false,
1983
+ "rstrip": false,
1984
+ "single_word": false,
1985
+ "special": true
1986
+ },
1987
+ "128248": {
1988
+ "content": "<|reserved_special_token_240|>",
1989
+ "lstrip": false,
1990
+ "normalized": false,
1991
+ "rstrip": false,
1992
+ "single_word": false,
1993
+ "special": true
1994
+ },
1995
+ "128249": {
1996
+ "content": "<|reserved_special_token_241|>",
1997
+ "lstrip": false,
1998
+ "normalized": false,
1999
+ "rstrip": false,
2000
+ "single_word": false,
2001
+ "special": true
2002
+ },
2003
+ "128250": {
2004
+ "content": "<|reserved_special_token_242|>",
2005
+ "lstrip": false,
2006
+ "normalized": false,
2007
+ "rstrip": false,
2008
+ "single_word": false,
2009
+ "special": true
2010
+ },
2011
+ "128251": {
2012
+ "content": "<|reserved_special_token_243|>",
2013
+ "lstrip": false,
2014
+ "normalized": false,
2015
+ "rstrip": false,
2016
+ "single_word": false,
2017
+ "special": true
2018
+ },
2019
+ "128252": {
2020
+ "content": "<|reserved_special_token_244|>",
2021
+ "lstrip": false,
2022
+ "normalized": false,
2023
+ "rstrip": false,
2024
+ "single_word": false,
2025
+ "special": true
2026
+ },
2027
+ "128253": {
2028
+ "content": "<|reserved_special_token_245|>",
2029
+ "lstrip": false,
2030
+ "normalized": false,
2031
+ "rstrip": false,
2032
+ "single_word": false,
2033
+ "special": true
2034
+ },
2035
+ "128254": {
2036
+ "content": "<|reserved_special_token_246|>",
2037
+ "lstrip": false,
2038
+ "normalized": false,
2039
+ "rstrip": false,
2040
+ "single_word": false,
2041
+ "special": true
2042
+ },
2043
+ "128255": {
2044
+ "content": "<|reserved_special_token_247|>",
2045
+ "lstrip": false,
2046
+ "normalized": false,
2047
+ "rstrip": false,
2048
+ "single_word": false,
2049
+ "special": true
2050
+ },
2051
+ "128256": {
2052
+ "content": "<audio>",
2053
+ "lstrip": false,
2054
+ "normalized": false,
2055
+ "rstrip": false,
2056
+ "single_word": false,
2057
+ "special": true
2058
+ }
2059
+ },
2060
+ "additional_special_tokens": [
2061
+ "<audio>"
2062
+ ],
2063
+ "bos_token": null,
2064
+ "clean_up_tokenization_spaces": true,
2065
+ "eos_token": "<|im_end|>",
2066
+ "extra_special_tokens": {},
2067
+ "fast": false,
2068
+ "model_input_names": [
2069
+ "input_ids",
2070
+ "attention_mask"
2071
+ ],
2072
+ "model_max_length": 131072,
2073
+ "pad_token": "<|finetune_right_pad_id|>",
2074
+ "tokenizer_class": "PreTrainedTokenizerFast"
2075
+ }
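
The net effect of this tokenizer config: the base vocabulary's block of unused `<|reserved_special_token_*|>` placeholders is kept as-is, and a single new special token, `<audio>` (id 128256), is registered in `additional_special_tokens`. This is the placeholder the ASR model splices projected audio embeddings into. A minimal sketch of checking the uploaded tokenizer, assuming a local download or this repo's id (the repo path below is illustrative; the token contents, ids, and pad/eos settings come from the config above):

```python
from transformers import AutoTokenizer

# Illustrative path -- point this at the uploaded checkpoint directory.
tokenizer = AutoTokenizer.from_pretrained("mrbeniwal/tiny-audio-asr")

# <audio> is registered as an additional special token at id 128256.
assert "<audio>" in tokenizer.additional_special_tokens
print(tokenizer.convert_tokens_to_ids("<audio>"))  # 128256

# Padding and end-of-sequence tokens follow the config above.
print(tokenizer.pad_token)  # <|finetune_right_pad_id|>
print(tokenizer.eos_token)  # <|im_end|>

# In a prompt, <audio> marks where audio embeddings replace text embeddings.
ids = tokenizer("Transcribe the following audio: <audio>")["input_ids"]
```

Because `<audio>` is a special token (`"normalized": false`, `"special": true`), it tokenizes to exactly one id and is never split or lowercased, which is what makes it a reliable splice point for the projector's output.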