mrbeniwal committed · verified
Commit 643b247 · 1 Parent(s): b89cd41

Upload 14 files
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,178 @@
---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- audio
- asr
- speech-to-text
- whisper
- tiny-audio
base_model:
- openai/whisper-large-v3-turbo
- HuggingFaceTB/SmolLM3-3B
datasets:
- speechbrain/LoquaciousSet
metrics:
- wer
---

# Tiny Audio ASR - LoquaciousSet Training

A speech-to-text model trained using the [Tiny Audio](https://github.com/alexkroman/tiny-audio) framework, combining a frozen Whisper encoder with a trained MLP projector and a frozen SmolLM3-3B decoder.

## Model Description

This model uses an encoder-projector-decoder architecture for automatic speech recognition:

| Component | Model | Parameters | Training Status |
|-----------|-------|------------|-----------------|
| Audio Encoder | openai/whisper-large-v3-turbo | ~800M | Frozen |
| Projector | MLP | 11.7M | **Trained** |
| Language Model | HuggingFaceTB/SmolLM3-3B | 3B | Frozen |
| **Total** | - | **3.72B** | 0.32% trainable |

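The projector itself lives in `projectors.py`, which is not reproduced on this page. As a rough, editorial sketch of the shape bookkeeping only (the layer sizes and the stride-2 frame stacking here are assumptions, not the shipped implementation), a minimal MLP projector mapping Whisper states into the SmolLM3 embedding space could look like this:

```python
import torch
import torch.nn as nn


class MLPProjectorSketch(nn.Module):
    """Hypothetical stride-2 MLP projector: halves the sequence, maps 1280 -> 2048."""

    def __init__(self, encoder_dim: int = 1280, llm_dim: int = 2048, stride: int = 2):
        super().__init__()
        self.stride = stride
        self.net = nn.Sequential(
            nn.Linear(encoder_dim * stride, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        t = t - (t % self.stride)  # drop any ragged tail frame
        # Stack adjacent frame pairs so each output token sees two encoder states
        x = x[:, :t, :].reshape(b, t // self.stride, d * self.stride)
        return self.net(x)


frames = torch.randn(1, 1500, 1280)        # Whisper encoder output for 30 s of audio
print(MLPProjectorSketch()(frames).shape)  # torch.Size([1, 750, 2048])
```
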
## Training Details

### Infrastructure
- **GPU**: NVIDIA H100 80GB HBM3
- **Cloud Provider**: E2E Networks
- **Framework**: PyTorch 2.8.0, Transformers 4.57.3

### Hyperparameters
- **Dataset**: speechbrain/LoquaciousSet (small subset)
- **Train Samples**: 1,000
- **Evaluation Samples**: 100
- **Batch Size**: 8
- **Learning Rate**: 3e-4
- **Max Steps**: 500
- **Warmup Steps**: 50
- **Precision**: BF16
- **Gradient Checkpointing**: Enabled

### Training Metrics

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 3.078 | 3.165 |
| 200 | 2.543 | 3.163 |
| 300 | 0.500 | 0.813 |
| 400 | 0.140 | 0.728 |
| 500 | 0.101 | 0.764 |

Training time: ~18 minutes on a single H100.

## Usage

```python
import torch
import torchaudio

from src.asr_config import ASRConfig
from src.asr_modeling import ASRModel

# Initialize model
config = ASRConfig(
    audio_model_id="openai/whisper-large-v3-turbo",
    text_model_id="HuggingFaceTB/SmolLM3-3B",
    projector_type="mlp",
    attn_implementation="sdpa",
)
model = ASRModel(config)

# Load audio and resample to the 16 kHz the encoder expects
waveform, sample_rate = torchaudio.load("audio.wav")
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
audio_array = waveform.squeeze().numpy()

# Transcribe
inputs = model.feature_extractor(
    audio_array, sampling_rate=16000, return_tensors="pt"
).input_features.to(model.device).to(model.dtype)

with torch.no_grad():
    output = model.generate(input_features=inputs, max_new_tokens=256)

transcription = model.tokenizer.decode(output[0], skip_special_tokens=True)
print(transcription)
```

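Alternatively, since `asr_config.py` (shipped with this checkpoint) registers a custom `automatic-speech-recognition` pipeline, loading through `transformers.pipeline` should also work; a hedged sketch, with the repository id left as a placeholder:

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="<this-repo-id>",  # placeholder for this checkpoint's Hub id
    trust_remote_code=True,
)
print(asr("audio.wav")["text"])
```
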
## Example Results

**Input Audio**: Sample from the LoquaciousSet evaluation set

**Ground Truth**:
```
THESE ARE REFORMS THAT WILL DISCIPLINE AND CONSTRAIN THE EXERCISE OF POWER
BY THE GOVERNMENT AND ANY OTHER ECONOMIC OR POLITICAL ACTOR FOR GENERATIONS TO COME
```

**Model Output**:
```
These are reforms that will discipline and constrain the exercise of power
by the government and any other economic or political actor for generations to come
```

## Limitations

- Trained on a small subset (1,000 samples) for demonstration purposes
- Full training (50,000+ steps) is recommended for production use
- English language only
- Optimized for clean speech; performance may degrade on noisy audio

## Citation

### Tiny Audio Framework
```bibtex
@software{kroman2025tinyaudio,
  author = {Kroman, Alex},
  title = {Tiny Audio: Train Your Own Speech Recognition Model in 24 Hours},
  year = {2025},
  url = {https://github.com/alexkroman/tiny-audio}
}
```

### LoquaciousSet Dataset
```bibtex
@misc{speechbrain2024loquaciousset,
  author = {{SpeechBrain Team}},
  title = {LoquaciousSet: 25,000 Hours of Transcribed English Speech},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/speechbrain/LoquaciousSet}
}
```

### Whisper
```bibtex
@article{radford2022whisper,
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year = {2022}
}
```

### SmolLM
```bibtex
@misc{smollm2024,
  author = {{Hugging Face}},
  title = {SmolLM: Smaller Language Models for Efficient Inference},
  year = {2024},
  url = {https://huggingface.co/HuggingFaceTB/SmolLM3-3B}
}
```

## License

Apache 2.0 - see the [Tiny Audio repository](https://github.com/alexkroman/tiny-audio) for details.

## Acknowledgments

- [Alex Kroman](https://github.com/alexkroman) for the Tiny Audio framework
- [SpeechBrain](https://speechbrain.github.io/) for the LoquaciousSet dataset
- [OpenAI](https://openai.com/) for Whisper
- [Hugging Face](https://huggingface.co/) for SmolLM3 and infrastructure
- [E2E Networks](https://www.e2enetworks.com/) for GPU cloud infrastructure
asr_config.py ADDED
@@ -0,0 +1,157 @@
from typing import Optional

import transformers


class ASRConfig(transformers.PretrainedConfig):
    model_type = "asr_model"
    is_composition = True

    def __init__(
        self,
        audio_model_id: str = "openai/whisper-large-v3-turbo",
        text_model_id: str = "HuggingFaceTB/SmolLM3-3B",
        attn_implementation: str = "flash_attention_2",
        model_dtype: str = "bfloat16",
        num_beams: Optional[int] = None,
        system_prompt: str = "/no_think /system_override",
        user_prompt: str = "Transcribe: <audio>",
        encoder_dim: Optional[int] = None,
        llm_dim: Optional[int] = None,
        audio_sample_rate: int = 16000,
        projector_init_std: float = 0.02,
        projector_pool_stride: int = 2,
        downsample_rate: int = 16,
        projector_hidden_dim: Optional[int] = None,
        projector_type: str = "moe",  # "moe", "swiglu", "residual", "shared_moe", "mlp"
        projector_num_layers: int = 2,  # Number of layers (for residual projector)
        projector_dropout: float = 0.05,  # Dropout rate for projector layers
        projector_input_noise: float = 0.02,  # Input noise for projector
        # MoE-specific configuration
        num_experts: int = 4,  # Number of experts in MoE projectors
        num_experts_per_tok: int = 2,  # Top-k experts per token
        router_aux_loss_coef: float = 0.01,  # Auxiliary loss coefficient for load balancing
        use_specaugment: bool = True,  # Apply SpecAugment during training
        label_smoothing: float = 0.0,  # Label smoothing for cross-entropy loss
        inference_diversity_penalty: float = 0.0,
        inference_warmup_tokens: int = 10,
        max_new_tokens: Optional[int] = None,
        min_new_tokens: Optional[int] = None,
        repetition_penalty: Optional[float] = None,
        length_penalty: Optional[float] = None,
        no_repeat_ngram_size: Optional[int] = None,
        use_cache: Optional[bool] = None,
        **kwargs,
    ):
        # Set default generation parameters (greedy decoding only)
        generation_defaults = {
            "num_beams": 1,
            "max_new_tokens": 96,
            "min_new_tokens": 0,
            "repetition_penalty": 1.0,
            "length_penalty": 1.0,
            "no_repeat_ngram_size": 0,
            "use_cache": True,
        }

        # Apply defaults (config.json values take precedence)
        kwargs = {**generation_defaults, **kwargs}

        self.audio_model_id = audio_model_id
        self.text_model_id = text_model_id
        self.attn_implementation = attn_implementation
        self.model_dtype = model_dtype
        self.system_prompt = system_prompt
        self.user_prompt = user_prompt
        self.encoder_dim = encoder_dim
        self.llm_dim = llm_dim
        self.audio_sample_rate = audio_sample_rate
        self.projector_init_std = projector_init_std
        self.projector_pool_stride = projector_pool_stride
        self.downsample_rate = downsample_rate
        self.projector_hidden_dim = projector_hidden_dim
        self.projector_type = projector_type
        self.projector_num_layers = projector_num_layers
        self.projector_dropout = projector_dropout
        self.projector_input_noise = projector_input_noise
        # MoE-specific configuration
        self.num_experts = num_experts
        self.num_experts_per_tok = num_experts_per_tok
        self.router_aux_loss_coef = router_aux_loss_coef
        self.use_specaugment = use_specaugment
        self.label_smoothing = label_smoothing
        self.inference_diversity_penalty = inference_diversity_penalty
        self.inference_warmup_tokens = inference_warmup_tokens

        # Generation parameters (use explicit value if provided, else use default)
        self.num_beams = num_beams if num_beams is not None else generation_defaults["num_beams"]
        self.max_new_tokens = (
            max_new_tokens if max_new_tokens is not None else generation_defaults["max_new_tokens"]
        )
        self.min_new_tokens = (
            min_new_tokens if min_new_tokens is not None else generation_defaults["min_new_tokens"]
        )
        self.repetition_penalty = (
            repetition_penalty
            if repetition_penalty is not None
            else generation_defaults["repetition_penalty"]
        )
        self.length_penalty = (
            length_penalty if length_penalty is not None else generation_defaults["length_penalty"]
        )
        self.no_repeat_ngram_size = (
            no_repeat_ngram_size
            if no_repeat_ngram_size is not None
            else generation_defaults["no_repeat_ngram_size"]
        )
        self.use_cache = use_cache if use_cache is not None else generation_defaults["use_cache"]

        if "audio_config" not in kwargs:
            self.audio_config = transformers.AutoConfig.from_pretrained(audio_model_id)
            # Override dtype to match model_dtype
            self.audio_config.dtype = model_dtype
        else:
            self.audio_config = kwargs.pop("audio_config")

        if "text_config" not in kwargs:
            self.text_config = transformers.AutoConfig.from_pretrained(
                text_model_id, trust_remote_code=True
            )
            # Override dtype to match model_dtype
            self.text_config.dtype = model_dtype
        else:
            self.text_config = kwargs.pop("text_config")

        if isinstance(self.text_config, dict):
            # Reconstruct config from dict using the model_type stored in the dict
            model_type = self.text_config["model_type"]
            config_class = transformers.AutoConfig.for_model(model_type).__class__
            self.text_config = config_class(**self.text_config)

        if isinstance(self.audio_config, dict):
            model_type = self.audio_config.get("model_type")
            if model_type:
                config_class = transformers.AutoConfig.for_model(model_type).__class__
                self.audio_config = config_class(**self.audio_config)

        super().__init__(**kwargs)

        self.auto_map = {
            "AutoConfig": "asr_config.ASRConfig",
            "AutoModel": "asr_modeling.ASRModel",
            "AutoModelForSpeechSeq2Seq": "asr_modeling.ASRModel",
            "AutoProcessor": "asr_processing.ASRProcessor",
        }
        self.custom_pipelines = {
            "automatic-speech-recognition": {
                "impl": "asr_pipeline.ASRPipeline",
                "pt": ["AutoModelForSpeechSeq2Seq"],
                "tf": [],
                "type": "audio",
            }
        }
        self.architectures = ["ASRModel"]
        self.pipeline_tag = "automatic-speech-recognition"


transformers.AutoConfig.register("asr_model", ASRConfig)
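A quick editorial sketch of how the defaults above resolve (not part of the commit; note that instantiating the config fetches both base-model configs from the Hub):

```python
from asr_config import ASRConfig

cfg = ASRConfig(projector_type="mlp")
print(cfg.num_beams)       # 1  (greedy-decoding default)
print(cfg.max_new_tokens)  # 96 (generation default)
print(cfg.projector_type)  # "mlp"
```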
asr_modeling.py ADDED
@@ -0,0 +1,549 @@
import json
from pathlib import Path
from typing import Optional, Union

import torch
import torch.nn as nn
from transformers import (
    AutoConfig,
    AutoModel,
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedModel,
)
from transformers.generation import GenerationMixin
from transformers.modeling_outputs import CausalLMOutputWithPast
from transformers.models.whisper.modeling_whisper import (
    _compute_mask_indices,
)

try:
    from .asr_config import ASRConfig
    from .projectors import PROJECTOR_CLASSES
except ImportError:
    from asr_config import ASRConfig  # type: ignore[no-redef]
    from projectors import PROJECTOR_CLASSES  # type: ignore[no-redef]


class ASRModel(PreTrainedModel, GenerationMixin):
    """Audio-to-text model combining an audio encoder, projector, and language model."""

    config_class = ASRConfig
    base_model_prefix = "model"
    main_input_name = "input_features"
    _supports_flash_attn_2 = True
    supports_gradient_checkpointing = True
    _is_loading_from_pretrained: bool = False
    _pretrained_model_path: Optional[str] = None

    TRANSCRIBE_PROMPT = "Transcribe: "

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
        """Load model from pretrained, handling device placement correctly."""
        from safetensors.torch import load_file
        from transformers.utils.hub import cached_file

        config = kwargs.pop("config", None)
        if config is None:
            config = ASRConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)

        # Set flag to avoid device_map="auto" in sub-model loaders
        cls._is_loading_from_pretrained = True
        cls._pretrained_model_path = pretrained_model_name_or_path

        try:
            model = cls(config, **kwargs)

            # Load projector weights from safetensors
            subfolder = kwargs.get("subfolder")
            revision = kwargs.get("revision")
            cache_kwargs = {}
            if subfolder:
                cache_kwargs["subfolder"] = subfolder
            if revision:
                cache_kwargs["revision"] = revision

            model_file = cached_file(
                pretrained_model_name_or_path,
                "model.safetensors",
                _raise_exceptions_for_missing_entries=False,
                **cache_kwargs,
            )

            if model_file is not None:
                state_dict = load_file(model_file)
                model.load_state_dict(state_dict, strict=False)

            return model
        finally:
            cls._is_loading_from_pretrained = False
            cls._pretrained_model_path = None

    def __init__(self, config: ASRConfig, **kwargs):
        super().__init__(config)

        self.system_prompt = config.system_prompt
        target_dtype = getattr(torch, config.model_dtype)

        # Audio encoder (frozen)
        self.audio_tower = self._load_audio_encoder(config, target_dtype)

        # Language model (frozen)
        self.language_model = self._load_language_model(config, target_dtype)

        # Initialize tokenizer and special tokens
        self._init_tokenizer(config)

        # Set up generation config with greedy decoding defaults
        self.generation_config = self.language_model.generation_config
        self.generation_config.max_new_tokens = config.max_new_tokens
        self.generation_config.num_beams = config.num_beams
        self.generation_config.do_sample = False
        # Clear sampling params (inherited from LLM) since we use greedy decoding
        self.generation_config.temperature = None
        self.generation_config.top_p = None
        self.generation_config.top_k = None
        self.generation_config.use_cache = config.use_cache
        self.generation_config.length_penalty = config.length_penalty
        self.generation_config.repetition_penalty = config.repetition_penalty
        self.generation_config.no_repeat_ngram_size = config.no_repeat_ngram_size
        self.generation_config.eos_token_id = self.tokenizer.convert_tokens_to_ids("<|im_end|>")
        self.generation_config.pad_token_id = self.tokenizer.pad_token_id

        # Feature extractor for audio preprocessing
        self.feature_extractor = self._create_feature_extractor(config)

        # Audio projector (trainable)
        self.projector = self._create_projector(config, target_dtype)

        # For model parallelism
        self._no_split_modules = getattr(self.language_model, "_no_split_modules", [])

    def _create_feature_extractor(self, config: ASRConfig):
        """Create the appropriate feature extractor for the audio encoder."""
        from transformers import AutoFeatureExtractor

        return AutoFeatureExtractor.from_pretrained(config.audio_model_id)

    @classmethod
    def _load_audio_encoder(cls, config: ASRConfig, dtype: torch.dtype) -> nn.Module:
        """Load and freeze the audio encoder."""
        encoder_kwargs = {
            "attn_implementation": config.attn_implementation,
            "low_cpu_mem_usage": True,
            "dtype": dtype,
        }

        if "whisper" in config.audio_model_id.lower():
            from transformers import WhisperModel

            full_model = WhisperModel.from_pretrained(config.audio_model_id, **encoder_kwargs)
            encoder = full_model.encoder
            del full_model
        else:
            encoder = AutoModel.from_pretrained(config.audio_model_id, **encoder_kwargs)

        encoder.requires_grad_(False)
        encoder.eval()
        return encoder

    @classmethod
    def _load_language_model(cls, config: ASRConfig, dtype: torch.dtype) -> PreTrainedModel:
        """Load and freeze the language model."""
        decoder_kwargs = {
            "attn_implementation": config.attn_implementation,
            "trust_remote_code": True,
            "tie_word_embeddings": True,
            "low_cpu_mem_usage": True,
            "dtype": dtype,
        }

        decoder = AutoModelForCausalLM.from_pretrained(config.text_model_id, **decoder_kwargs)
        decoder.config.use_cache = getattr(config, "use_cache", True)
        decoder.requires_grad_(False)
        decoder.eval()
        return decoder

    def _create_projector(self, config: ASRConfig, dtype: torch.dtype) -> nn.Module:
        """Create the trainable audio projector."""
        # Auto-detect dimensions if not specified
        if config.encoder_dim is None:
            enc_cfg = self.audio_tower.config
            config.encoder_dim = getattr(enc_cfg, "hidden_size", None) or getattr(
                enc_cfg, "d_model", None
            )
            if config.encoder_dim is None:
                raise ValueError("Could not auto-detect encoder_dim. Please specify in config.")

        if config.llm_dim is None:
            dec_cfg = self.language_model.config
            config.llm_dim = getattr(dec_cfg, "hidden_size", None) or getattr(
                dec_cfg, "d_model", None
            )
            if config.llm_dim is None:
                raise ValueError("Could not auto-detect llm_dim. Please specify in config.")

        # Select projector type based on config
        projector_type = getattr(config, "projector_type", "mlp")
        projector_class = PROJECTOR_CLASSES.get(projector_type)
        if projector_class is None:
            raise ValueError(
                f"Unknown projector_type: {projector_type}. "
                f"Valid options: {list(PROJECTOR_CLASSES.keys())}"
            )
        projector = projector_class(config)

        # Move projector to same device as language model (important when using quantization)
        device = next(self.language_model.parameters()).device
        return projector.to(device=device, dtype=dtype)

    def _init_tokenizer(self, config: ASRConfig):
        """Initialize tokenizer with audio token."""
        self.tokenizer = AutoTokenizer.from_pretrained(
            config.text_model_id, trust_remote_code=True
        )

        # Set pad token
        if (
            self.tokenizer.pad_token is None
            or self.tokenizer.pad_token_id == self.tokenizer.eos_token_id
        ) and "<|finetune_right_pad_id|>" in self.tokenizer.get_vocab():
            self.tokenizer.pad_token = "<|finetune_right_pad_id|>"

        # Add audio token
        existing_special = self.tokenizer.additional_special_tokens or []
        if "<audio>" not in existing_special:
            self.tokenizer.add_special_tokens(
                {"additional_special_tokens": existing_special + ["<audio>"]}
            )
            self.language_model.resize_token_embeddings(len(self.tokenizer), mean_resizing=False)

        self.audio_token_id = self.tokenizer.convert_tokens_to_ids("<audio>")
        self.tokenizer.padding_side = "right"

        # Sync token IDs to configs
        for cfg in [self.config.text_config, self.language_model.config, self.generation_config]:
            if cfg is not None:
                cfg.pad_token_id = self.tokenizer.pad_token_id
                cfg.eos_token_id = self.tokenizer.eos_token_id
                cfg.bos_token_id = self.tokenizer.bos_token_id

    def _init_weights(self, module):
        """Weight initialization (projector weights are initialized in MoEAudioProjector)."""
        pass

    def _set_gradient_checkpointing(self, enable: bool = True, gradient_checkpointing_func=None):
        """Enable/disable gradient checkpointing for the language model."""
        # The LLM still stores activations during forward for backprop to the projector.
        # Gradient checkpointing trades compute for memory by recomputing activations.
        if hasattr(self.language_model, "_set_gradient_checkpointing"):
            self.language_model._set_gradient_checkpointing(enable, gradient_checkpointing_func)
        elif hasattr(self.language_model, "gradient_checkpointing_enable") and enable:
            self.language_model.gradient_checkpointing_enable(
                gradient_checkpointing_kwargs={"use_reentrant": False}
            )
        elif hasattr(self.language_model, "gradient_checkpointing_disable") and not enable:
            self.language_model.gradient_checkpointing_disable()

    def get_input_embeddings(self):
        return self.language_model.get_input_embeddings()

    def set_input_embeddings(self, value):
        self.language_model.set_input_embeddings(value)

    def get_output_embeddings(self):
        return self.language_model.get_output_embeddings()

    def set_output_embeddings(self, value):
        self.language_model.set_output_embeddings(value)

    def get_processor(self):
        """Get the processor for this model."""
        try:
            from .asr_processing import ASRProcessor
        except ImportError:
            from asr_processing import ASRProcessor  # type: ignore[no-redef]

        return ASRProcessor(feature_extractor=self.feature_extractor, tokenizer=self.tokenizer)

    def state_dict(self, *args, **kwargs):
        """Only save trainable projector weights."""
        return {f"projector.{k}": v for k, v in self.projector.state_dict().items()}

    def _apply_specaugment(
        self,
        input_features: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        if not getattr(self.config, "use_specaugment", False):
            return input_features

        if not self.training:
            return input_features

        # Input shape: (batch_size, num_mel_bins, sequence_length) for Whisper
        batch_size, hidden_size, sequence_length = input_features.size()

        mask_time_prob = getattr(self.config, "mask_time_prob", 0.05)
        mask_time_length = getattr(self.config, "mask_time_length", 10)
        mask_feature_prob = getattr(self.config, "mask_feature_prob", 0.0)
        mask_feature_length = getattr(self.config, "mask_feature_length", 10)

        # Time masking
        if mask_time_prob > 0:
            mask_time_np = _compute_mask_indices(
                (batch_size, sequence_length),
                mask_prob=mask_time_prob,
                mask_length=mask_time_length,
                attention_mask=attention_mask,
                min_masks=2,
            )
            mask_time_indices = torch.tensor(
                mask_time_np, device=input_features.device, dtype=torch.bool
            )
            # Expand to cover all features: (batch, seq) -> (batch, features, seq)
            mask_time_expanded = mask_time_indices[:, None].expand(-1, hidden_size, -1)
            input_features = input_features.masked_fill(mask_time_expanded, 0.0)

        # Feature masking
        if mask_feature_prob > 0:
            mask_feature_np = _compute_mask_indices(
                (batch_size, hidden_size),
                mask_prob=mask_feature_prob,
                mask_length=mask_feature_length,
                min_masks=2,
            )
            mask_feature_indices = torch.tensor(
                mask_feature_np, device=input_features.device, dtype=torch.bool
            )
            # Expand: (batch, features) -> (batch, features, seq)
            mask_feature_expanded = mask_feature_indices[:, :, None].expand(-1, -1, sequence_length)
            input_features = input_features.masked_fill(mask_feature_expanded, 0.0)

        return input_features

    def _encode_audio(
        self,
        audio_features: torch.Tensor,
        audio_attention_mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        """Encode audio and project to LLM embedding space.

        Returns flattened audio embeddings of shape (total_audio_tokens, hidden_dim).
        """
        # Apply SpecAugment during training (before encoding)
        audio_features = self._apply_specaugment(audio_features, audio_attention_mask)

        with torch.no_grad():
            encoder_out = self.audio_tower(
                input_features=audio_features, attention_mask=audio_attention_mask
            )
            hidden_states = encoder_out.last_hidden_state

        audio_embeds = self.projector(hidden_states)

        # Flatten: (batch, seq, hidden) -> (batch * seq, hidden)
        # This allows masked_scatter to do 1:1 replacement
        return audio_embeds.reshape(-1, audio_embeds.shape[-1])

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        input_features: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        past_key_values: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        use_cache: Optional[bool] = None,
        cache_position: Optional[torch.Tensor] = None,
        audio_attention_mask: Optional[torch.Tensor] = None,
        **kwargs,
    ) -> CausalLMOutputWithPast:
        """Forward pass for training and inference."""
        # Get text embeddings if not provided
        if inputs_embeds is None:
            inputs_embeds = self.language_model.get_input_embeddings()(input_ids)

        if input_features is not None and input_ids is not None:
            # Encode audio -> flattened (total_audio_tokens, hidden_dim)
            audio_embeds = self._encode_audio(input_features, audio_attention_mask)

            # Replace <audio> token placeholders with audio embeddings using masked_scatter
            audio_token_mask = (input_ids == self.audio_token_id).unsqueeze(-1)
            inputs_embeds = inputs_embeds.masked_scatter(
                audio_token_mask.to(inputs_embeds.device),
                audio_embeds.to(inputs_embeds.device, dtype=inputs_embeds.dtype),
            )

        # Run through language model (let it compute loss if labels provided)
        outputs = self.language_model(
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            labels=labels,
            use_cache=use_cache,
            cache_position=cache_position,
            **kwargs,
        )

        # Add auxiliary loss from MoE projectors if available
        if outputs.loss is not None and hasattr(self.projector, "get_aux_loss"):
            aux_loss = self.projector.get_aux_loss()
            if aux_loss is not None and aux_loss.numel() > 0:
                outputs.loss = outputs.loss + aux_loss.to(outputs.loss.device)

        return outputs

    def prepare_inputs_for_generation(self, *args, **kwargs):
        """Prepare inputs for generation, handling audio features for cached decoding."""
        input_features = kwargs.pop("input_features", None)
        cache_position = kwargs.get("cache_position")

        model_inputs = self.language_model.prepare_inputs_for_generation(*args, **kwargs)

        # Only pass audio features on the first generation step (cache_position[0] == 0)
        if cache_position is not None and cache_position[0] == 0 and input_features is not None:
            model_inputs["input_features"] = input_features

        return model_inputs

    def _get_num_audio_tokens(self, input_features: torch.Tensor) -> int:
        """Calculate number of audio tokens based on input shape.

        Whisper: input_features shape is (batch, n_mels, mel_len).
        Encoder output is mel_len // 2 due to the stride-2 conv;
        the MLP projector adds another stride-2 for 4x total downsampling.
        """
        mel_len = input_features.shape[-1]
        return mel_len // 4

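    # Editorial worked example of the arithmetic above (assumes Whisper's fixed
    # 30 s window): 3000 mel frames -> 1500 encoder states after the stride-2
    # conv -> 750 projector outputs, i.e. 3000 // 4 <audio> placeholder tokens.
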
    @torch.no_grad()
    def generate(
        self,
        input_ids: Optional[torch.Tensor] = None,
        input_features: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        audio_attention_mask: Optional[torch.Tensor] = None,
        system_prompt: Optional[str] = None,
        **generate_kwargs,
    ) -> torch.Tensor:
        """Generate transcription from audio input.

        Can be called in two ways:
        1. With input_ids containing <audio> tokens (from processor)
        2. With just audio, and we build the prompt internally
        """
        if input_features is None:
            raise ValueError("input_features required for generation")

        device = input_features.device
        batch_size = input_features.shape[0]

        # Encode audio -> flattened embeddings
        audio_embeds = self._encode_audio(input_features, audio_attention_mask)

        # If input_ids not provided, build prompt with correct number of audio tokens
        if input_ids is None:
            num_audio_tokens = self._get_num_audio_tokens(input_features)
            audio_placeholder = "<audio>" * num_audio_tokens

            system_prompt = system_prompt or self.system_prompt

            messages: list[dict[str, str]] = []
            if system_prompt:
                messages.append({"role": "system", "content": system_prompt})
            messages.append({"role": "user", "content": self.TRANSCRIBE_PROMPT + audio_placeholder})

            input_ids = self.tokenizer.apply_chat_template(
                messages,
                tokenize=True,
                add_generation_prompt=True,
                return_tensors="pt",
            ).to(device)

            if input_ids.dim() == 1:
                input_ids = input_ids.unsqueeze(0)
            if input_ids.shape[0] == 1 and batch_size > 1:
                input_ids = input_ids.expand(batch_size, -1)

        attention_mask = torch.ones_like(input_ids)

        # Get text embeddings and replace audio tokens with audio embeddings
        inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
        audio_token_mask = (input_ids == self.audio_token_id).unsqueeze(-1)
        inputs_embeds = inputs_embeds.masked_scatter(
            audio_token_mask.to(inputs_embeds.device),
            audio_embeds.to(inputs_embeds.device, dtype=inputs_embeds.dtype),
        )

        # Generate using language model
        output = self.language_model.generate(
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask,
            generation_config=self.generation_config,
            **generate_kwargs,
        )

        # When using inputs_embeds without input_ids, generate returns only new tokens
        if isinstance(output, torch.Tensor):
            return output
        return output.sequences

    def save_pretrained(self, save_directory: Union[str, Path], **kwargs):
        """Save model, tokenizer, and processor."""
        import shutil
        from pathlib import Path as PathlibPath

        save_dir = PathlibPath(save_directory)
        save_dir.mkdir(parents=True, exist_ok=True)

        # Update config with actual vocab size
        self.config.vocab_size = self.language_model.config.vocab_size
        self.config.text_config.vocab_size = self.language_model.config.vocab_size

        if hasattr(self.audio_tower.config, "num_mel_bins"):
            self.config.audio_config.num_mel_bins = self.audio_tower.config.num_mel_bins

        # Save model (temporarily remove non-serializable attributes)
        tokenizer = self.tokenizer
        del self.tokenizer

        try:
            super().save_pretrained(save_dir, **kwargs)
        finally:
            self.tokenizer = tokenizer

        # Save tokenizer and feature extractor
        self.tokenizer.save_pretrained(save_dir)
        self.feature_extractor.save_pretrained(save_dir)

        # Add processor auto_map to preprocessor_config.json
        config_path = save_dir / "preprocessor_config.json"
        if config_path.exists():
            with config_path.open() as f:
                processor_config = json.load(f)
        else:
            processor_config = {}

        processor_config.update(
            {
                "processor_class": "ASRProcessor",
                "auto_map": {"AutoProcessor": "asr_processing.ASRProcessor"},
            }
        )

        with config_path.open("w") as f:
            json.dump(processor_config, f, indent=2)

        # Copy source files for auto-loading
        src_dir = PathlibPath(__file__).parent
        for asr_file in src_dir.glob("asr_*.py"):
            shutil.copy(asr_file, save_dir / asr_file.name)
        # Copy projectors module
        shutil.copy(src_dir / "projectors.py", save_dir / "projectors.py")


# Register with transformers Auto classes
AutoConfig.register("asr_model", ASRConfig)
AutoModel.register(ASRConfig, ASRModel)
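To make the `masked_scatter` placeholder trick used in `forward()` and `generate()` concrete, here is a small editorial toy example (not part of the commit):

```python
import torch

# 99 stands in for the <audio> token id; the two placeholder positions are
# replaced, row by row, with the two flattened audio embedding vectors.
input_ids = torch.tensor([[7, 99, 99, 5]])
inputs_embeds = torch.zeros(1, 4, 3)
audio_embeds = torch.arange(6, dtype=torch.float32).reshape(2, 3)

mask = (input_ids == 99).unsqueeze(-1)          # (1, 4, 1), broadcast over hidden dim
merged = inputs_embeds.masked_scatter(mask, audio_embeds)
print(merged[0])  # rows 1 and 2 now hold the audio embeddings
```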
asr_pipeline.py ADDED
@@ -0,0 +1,472 @@
from typing import Any

import numpy as np
import torch
import transformers

try:
    from .asr_modeling import ASRModel
except ImportError:
    from asr_modeling import ASRModel  # type: ignore[no-redef]


class ForcedAligner:
    """Lazy-loaded forced aligner for word-level timestamps using torchaudio wav2vec2."""

    _bundle = None
    _model = None
    _labels = None
    _dictionary = None

    @classmethod
    def get_instance(cls, device: str = "cuda"):
        if cls._model is None:
            import torchaudio

            cls._bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
            cls._model = cls._bundle.get_model().to(device)
            cls._model.eval()
            cls._labels = cls._bundle.get_labels()
            cls._dictionary = {c: i for i, c in enumerate(cls._labels)}
        return cls._model, cls._labels, cls._dictionary

    @classmethod
    def align(
        cls,
        audio: np.ndarray,
        text: str,
        sample_rate: int = 16000,
        language: str = "eng",
        batch_size: int = 16,
    ) -> list[dict]:
        """Align transcript to audio and return word-level timestamps.

        Args:
            audio: Audio waveform as numpy array
            text: Transcript text to align
            sample_rate: Audio sample rate (default 16000)
            language: ISO-639-3 language code (default "eng" for English, unused)
            batch_size: Batch size for alignment model (unused)

        Returns:
            List of dicts with 'word', 'start', 'end' keys
        """
        import torchaudio
        from torchaudio.functional import forced_align, merge_tokens

        device = "cuda" if torch.cuda.is_available() else "cpu"
        model, labels, dictionary = cls.get_instance(device)

        # Convert audio to tensor (copy to ensure array is writable)
        if isinstance(audio, np.ndarray):
            waveform = torch.from_numpy(audio.copy()).float()
        else:
            waveform = audio.clone().float()

        # Ensure 2D (channels, time)
        if waveform.dim() == 1:
            waveform = waveform.unsqueeze(0)

        # Resample if needed (wav2vec2 expects 16kHz)
        if sample_rate != cls._bundle.sample_rate:
            waveform = torchaudio.functional.resample(
                waveform, sample_rate, cls._bundle.sample_rate
            )

        waveform = waveform.to(device)

        # Get emissions from model
        with torch.inference_mode():
            emissions, _ = model(waveform)
            emissions = torch.log_softmax(emissions, dim=-1)

        emission = emissions[0].cpu()

        # Normalize text: uppercase, keep only valid characters
        transcript = text.upper()
        # Build tokens from transcript
        tokens = []
        for char in transcript:
            if char in dictionary:
                tokens.append(dictionary[char])
            elif char == " ":
                tokens.append(dictionary.get("|", dictionary.get(" ", 0)))

        if not tokens:
            return []

        targets = torch.tensor([tokens], dtype=torch.int32)

        # Run forced alignment
        # Note: forced_align is deprecated in torchaudio 2.6+ and will be removed in 2.9 (late 2025)
        # No official replacement announced yet. See https://github.com/pytorch/audio/issues/3902
        aligned_tokens, scores = forced_align(emission.unsqueeze(0), targets, blank=0)

        # Use torchaudio's merge_tokens to get token spans (removes blanks and merges repeats)
        token_spans = merge_tokens(aligned_tokens[0], scores[0])

        # Convert frame indices to time (model stride is 320 samples at 16kHz = 20ms)
        frame_duration = 320 / cls._bundle.sample_rate

        # Group token spans into words based on the pipe separator
        words = text.split()
        word_timestamps = []
        current_word_start = None
        current_word_end = None
        word_idx = 0

        for span in token_spans:
            token_char = labels[span.token]
            if token_char == "|":  # Word separator
                if current_word_start is not None and word_idx < len(words):
                    word_timestamps.append({
                        "word": words[word_idx],
                        "start": current_word_start * frame_duration,
                        "end": current_word_end * frame_duration,
                    })
                    word_idx += 1
                current_word_start = None
                current_word_end = None
            else:
                if current_word_start is None:
                    current_word_start = span.start
                current_word_end = span.end

        # Don't forget the last word
        if current_word_start is not None and word_idx < len(words):
            word_timestamps.append({
                "word": words[word_idx],
                "start": current_word_start * frame_duration,
                "end": current_word_end * frame_duration,
            })

        return word_timestamps


class SpeakerDiarizer:
    """Lazy-loaded speaker diarization using pyannote-audio."""

    _pipeline = None

    @classmethod
    def get_instance(cls, hf_token: str | None = None):
        """Get or create the diarization pipeline.

        Args:
            hf_token: HuggingFace token with access to pyannote models.
                Can also be set via HF_TOKEN environment variable.
        """
        if cls._pipeline is None:
            from pyannote.audio import Pipeline

            cls._pipeline = Pipeline.from_pretrained(
                "pyannote/speaker-diarization-3.1",
            )

            # Move to GPU if available
            if torch.cuda.is_available():
                cls._pipeline.to(torch.device("cuda"))
            elif torch.backends.mps.is_available():
                cls._pipeline.to(torch.device("mps"))

        return cls._pipeline

    @classmethod
    def diarize(
        cls,
        audio: np.ndarray | str,
        sample_rate: int = 16000,
        num_speakers: int | None = None,
        min_speakers: int | None = None,
        max_speakers: int | None = None,
        hf_token: str | None = None,
    ) -> list[dict]:
        """Run speaker diarization on audio.

        Args:
            audio: Audio waveform as numpy array or path to audio file
            sample_rate: Audio sample rate (default 16000)
            num_speakers: Exact number of speakers (if known)
            min_speakers: Minimum number of speakers
            max_speakers: Maximum number of speakers
            hf_token: HuggingFace token for pyannote models

        Returns:
            List of dicts with 'speaker', 'start', 'end' keys
        """
        pipeline = cls.get_instance(hf_token)

        # Prepare audio input
        if isinstance(audio, np.ndarray):
            # pyannote expects {"waveform": tensor, "sample_rate": int}
            waveform = torch.from_numpy(audio).unsqueeze(0)  # Add channel dim
            if waveform.dim() == 1:
                waveform = waveform.unsqueeze(0)
            audio_input = {"waveform": waveform, "sample_rate": sample_rate}
        else:
            # File path
            audio_input = audio

        # Run diarization
        diarization_args = {}
        if num_speakers is not None:
            diarization_args["num_speakers"] = num_speakers
        if min_speakers is not None:
            diarization_args["min_speakers"] = min_speakers
        if max_speakers is not None:
            diarization_args["max_speakers"] = max_speakers

        diarization = pipeline(audio_input, **diarization_args)

        # Handle different pyannote return types
        # pyannote 3.x returns DiarizeOutput dataclass, older versions return Annotation
        if hasattr(diarization, "itertracks"):
            annotation = diarization
        elif hasattr(diarization, "speaker_diarization"):
            # pyannote 3.x DiarizeOutput dataclass
            annotation = diarization.speaker_diarization
        elif isinstance(diarization, tuple):
            # Some versions return (annotation, embeddings) tuple
            annotation = diarization[0]
        else:
            raise TypeError(f"Unexpected diarization output type: {type(diarization)}")

        # Convert to simple format
        segments = []
        for turn, _, speaker in annotation.itertracks(yield_label=True):
            segments.append({
                "speaker": speaker,
                "start": turn.start,
                "end": turn.end,
            })

        return segments

    @classmethod
    def assign_speakers_to_words(
        cls,
        words: list[dict],
        speaker_segments: list[dict],
    ) -> list[dict]:
        """Assign speaker labels to words based on timestamp overlap.

        Args:
            words: List of word dicts with 'word', 'start', 'end' keys
            speaker_segments: List of speaker dicts with 'speaker', 'start', 'end' keys

        Returns:
            Words list with 'speaker' key added to each word
        """
        for word in words:
            word_mid = (word["start"] + word["end"]) / 2

            # Find the speaker segment that contains this word's midpoint
            best_speaker = None
            for seg in speaker_segments:
                if seg["start"] <= word_mid <= seg["end"]:
                    best_speaker = seg["speaker"]
                    break

            # If no exact match, find closest segment
            if best_speaker is None and speaker_segments:
                min_dist = float("inf")
                for seg in speaker_segments:
                    seg_mid = (seg["start"] + seg["end"]) / 2
                    dist = abs(word_mid - seg_mid)
                    if dist < min_dist:
                        min_dist = dist
                        best_speaker = seg["speaker"]

            word["speaker"] = best_speaker

        return words


class ASRPipeline(transformers.AutomaticSpeechRecognitionPipeline):
    """ASR Pipeline for audio-to-text transcription."""

    model: ASRModel

    def __init__(self, model: ASRModel, **kwargs):
        feature_extractor = kwargs.pop("feature_extractor", None)
        tokenizer = kwargs.pop("tokenizer", model.tokenizer)

        if feature_extractor is None:
            feature_extractor = model.get_processor().feature_extractor

        super().__init__(
            model=model, feature_extractor=feature_extractor, tokenizer=tokenizer, **kwargs
        )
        self._current_audio = None

    def _sanitize_parameters(self, **kwargs):
        """Intercept our custom parameters before the parent class validates them."""
        # Remove our custom parameters so the parent doesn't see them
        kwargs.pop("return_timestamps", None)
        kwargs.pop("return_speakers", None)
        kwargs.pop("num_speakers", None)
        kwargs.pop("min_speakers", None)
        kwargs.pop("max_speakers", None)
        kwargs.pop("hf_token", None)

        return super()._sanitize_parameters(**kwargs)

    def __call__(
        self,
        inputs,
        **kwargs,
    ):
        """Transcribe audio with optional word-level timestamps and speaker diarization.

        Args:
            inputs: Audio input (file path, dict with array/sampling_rate, etc.)
            return_timestamps: If True, return word-level timestamps using forced alignment
            return_speakers: If True, return speaker labels for each word
            num_speakers: Exact number of speakers (if known, for diarization)
            min_speakers: Minimum number of speakers (for diarization)
            max_speakers: Maximum number of speakers (for diarization)
            hf_token: HuggingFace token for pyannote models (or set HF_TOKEN env var)
            **kwargs: Additional arguments passed to the pipeline

        Returns:
            Dict with 'text' key, 'words' key if return_timestamps=True,
            and speaker labels on words if return_speakers=True
        """
        # Extract our params before super().__call__ (which will also call _sanitize_parameters)
        return_timestamps = kwargs.pop("return_timestamps", False)
        return_speakers = kwargs.pop("return_speakers", False)
        diarization_params = {
            "num_speakers": kwargs.pop("num_speakers", None),
            "min_speakers": kwargs.pop("min_speakers", None),
            "max_speakers": kwargs.pop("max_speakers", None),
            "hf_token": kwargs.pop("hf_token", None),
        }

        if return_speakers:
            return_timestamps = True

        # Store audio for timestamp alignment and diarization
        if return_timestamps or return_speakers:
            self._current_audio = self._extract_audio(inputs)

        # Run standard transcription
        result = super().__call__(inputs, **kwargs)

        # Add timestamps if requested
        if return_timestamps and self._current_audio is not None:
            text = result.get("text", "")
            if text:
                try:
                    words = ForcedAligner.align(
                        self._current_audio["array"],
                        text,
                        sample_rate=self._current_audio.get("sampling_rate", 16000),
                    )
                    result["words"] = words
                except Exception as e:
                    result["words"] = []
                    result["timestamp_error"] = str(e)
            else:
                result["words"] = []

        # Add speaker diarization if requested
        if return_speakers and self._current_audio is not None:
            try:
                # Run diarization
                speaker_segments = SpeakerDiarizer.diarize(
                    self._current_audio["array"],
                    sample_rate=self._current_audio.get("sampling_rate", 16000),
                    **{k: v for k, v in diarization_params.items() if v is not None},
                )
                result["speaker_segments"] = speaker_segments

                # Assign speakers to words
                if result.get("words"):
                    result["words"] = SpeakerDiarizer.assign_speakers_to_words(
                        result["words"],
                        speaker_segments,
                    )
            except Exception as e:
                result["speaker_segments"] = []
                result["diarization_error"] = str(e)

        # Clean up
        self._current_audio = None

        return result

    def _extract_audio(self, inputs) -> dict | None:
        """Extract audio array from various input formats using HF utilities."""
        from transformers.pipelines.audio_utils import ffmpeg_read

        if isinstance(inputs, dict):
            if "array" in inputs:
                return {
                    "array": inputs["array"],
                    "sampling_rate": inputs.get("sampling_rate", 16000),
                }
            if "raw" in inputs:
                return {
                    "array": inputs["raw"],
                    "sampling_rate": inputs.get("sampling_rate", 16000),
                }
        elif isinstance(inputs, str):
            # File path - load audio using ffmpeg (same as HF pipeline)
            with open(inputs, "rb") as f:
                audio = ffmpeg_read(f.read(), sampling_rate=16000)
            return {"array": audio, "sampling_rate": 16000}
        elif isinstance(inputs, bytes):
            audio = ffmpeg_read(inputs, sampling_rate=16000)
            return {"array": audio, "sampling_rate": 16000}
        elif isinstance(inputs, np.ndarray):
            return {"array": inputs, "sampling_rate": 16000}

        return None

    def preprocess(self, inputs, **preprocess_params):
        # Handle dict with "array" key (from datasets)
        if isinstance(inputs, dict) and "array" in inputs:
            inputs = {
                "raw": inputs["array"],
                "sampling_rate": inputs.get("sampling_rate", self.feature_extractor.sampling_rate),
            }

        for item in super().preprocess(inputs, **preprocess_params):
            if "is_last" not in item:
                item["is_last"] = True
            yield item

    def _forward(self, model_inputs, **generate_kwargs) -> dict[str, Any]:
        # Extract audio features and is_last flag
        is_last = model_inputs.pop("is_last", True) if isinstance(model_inputs, dict) else True

        if isinstance(model_inputs, dict):
            input_features = model_inputs.get("input_features")
            if input_features is not None:
                input_features = input_features.to(self.model.device)
        else:
            input_features = model_inputs.to(self.model.device)

        generated_ids = self.model.generate(
            input_features=input_features,
            **generate_kwargs,
        )

        return {"tokens": generated_ids, "is_last": is_last}

    def postprocess(self, model_outputs, **kwargs) -> dict[str, str]:
        # Handle list of outputs (from chunking)
        if isinstance(model_outputs, list):
            model_outputs = model_outputs[0] if model_outputs else {}

        tokens = model_outputs.get("tokens")
        if tokens is None:
            return super().postprocess(model_outputs, **kwargs)

        if torch.is_tensor(tokens):
            tokens = tokens.cpu()
        if tokens.dim() > 1:
            tokens = tokens[0]

        text = self.tokenizer.decode(tokens, skip_special_tokens=True).strip()
        return {"text": text}
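A hedged usage sketch for the extra pipeline features above (editorial; the repository id is a placeholder, and `return_speakers` additionally requires granted access to the pyannote models):

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="<this-repo-id>",  # placeholder for this checkpoint's Hub id
    trust_remote_code=True,
)

result = asr("meeting.wav", return_speakers=True, min_speakers=2, max_speakers=4)
print(result["text"])
for w in result.get("words", []):
    print(f"{w['speaker']}: {w['word']} ({w['start']:.2f}s-{w['end']:.2f}s)")
```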
asr_processing.py ADDED
@@ -0,0 +1,92 @@
from typing import Optional, Union

import torch
import transformers
from transformers import ProcessorMixin

try:
    from .asr_config import ASRConfig
except ImportError:
    from asr_config import ASRConfig  # type: ignore[no-redef]


class ASRProcessor(ProcessorMixin):
    """Processor for Whisper-based ASR models."""

    attributes = ["feature_extractor", "tokenizer"]
    feature_extractor_class = "AutoFeatureExtractor"
    tokenizer_class = "AutoTokenizer"
    AUDIO_TOKEN = "<audio>"
    TRANSCRIBE_PROMPT = "Transcribe: "

    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer
        self.audio_token_id = tokenizer.convert_tokens_to_ids(self.AUDIO_TOKEN)

    def __call__(
        self,
        audio: Optional[Union[list, "torch.Tensor"]] = None,
        text: Optional[str] = None,
        system_prompt: Optional[str] = None,
        return_tensors: str = "pt",
        **kwargs,
    ) -> dict:
        """Process audio and text inputs for inference.

        Args:
            audio: Raw audio waveform(s)
            text: Target transcription (optional, for training - but use DataCollator instead)
            system_prompt: Optional system prompt
            return_tensors: Return format ("pt" for PyTorch)

        Returns:
            Dict with input_features, input_ids, attention_mask
        """
        result = {}

        # Process audio
        if audio is not None:
            audio_inputs = self.feature_extractor(
                audio,
                sampling_rate=getattr(self.feature_extractor, "sampling_rate", 16000),
                return_tensors=return_tensors,
                **kwargs,
            )
            result["input_features"] = audio_inputs["input_features"]
            # Whisper encoder output length = mel_len // 2 (stride-2 conv)
            num_audio_tokens = audio_inputs["input_features"].shape[-1] // 2
        else:
            num_audio_tokens = 0

        # Build prompt with audio token placeholders
        user_content = self.TRANSCRIBE_PROMPT
        if num_audio_tokens > 0:
            user_content += self.AUDIO_TOKEN * num_audio_tokens

        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": user_content})
        if text is not None:
            messages.append({"role": "assistant", "content": text})

        # Tokenize
        input_ids = self.tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=(text is None),
            return_tensors=return_tensors,
        )

        if isinstance(input_ids, torch.Tensor) and input_ids.dim() == 1:
            input_ids = input_ids.unsqueeze(0)

        result["input_ids"] = input_ids
        result["attention_mask"] = torch.ones_like(input_ids)

        return result


ASRProcessor.register_for_auto_class()
transformers.AutoProcessor.register(ASRConfig, ASRProcessor)
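A short editorial sketch of calling the processor directly (assumes network access to the two base models; in the real flow `ASRModel._init_tokenizer` adds the `<audio>` token before `get_processor()` is called, so we add it by hand here):

```python
import numpy as np
from transformers import AutoFeatureExtractor, AutoTokenizer

from asr_processing import ASRProcessor

feature_extractor = AutoFeatureExtractor.from_pretrained("openai/whisper-large-v3-turbo")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
tokenizer.add_special_tokens({"additional_special_tokens": ["<audio>"]})

processor = ASRProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)

audio = np.zeros(16000, dtype=np.float32)  # one second of silence at 16 kHz
batch = processor(audio=audio)
print(batch["input_features"].shape)  # (1, n_mels, mel_len), padded to 30 s
print(batch["input_ids"].shape)       # prompt ids incl. mel_len // 2 <audio> placeholders
```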
chat_template.jinja ADDED
@@ -0,0 +1,94 @@
1
+ {# ───── defaults ───── #}
2
+ {%- if enable_thinking is not defined -%}
3
+ {%- set enable_thinking = true -%}
4
+ {%- endif -%}
5
+
6
+ {# ───── reasoning mode ───── #}
7
+ {%- if enable_thinking -%}
8
+ {%- set reasoning_mode = "/think" -%}
9
+ {%- else -%}
10
+ {%- set reasoning_mode = "/no_think" -%}
11
+ {%- endif -%}
12
+
13
+ {# ───── header (system message) ───── #}
14
+ {{- "<|im_start|>system\n" -}}
15
+
16
+ {%- if messages[0].role == "system" -%}
17
+ {%- set system_message = messages[0].content -%}
18
+ {%- if "/no_think" in system_message -%}
19
+ {%- set reasoning_mode = "/no_think" -%}
20
+ {%- elif "/think" in system_message -%}
21
+ {%- set reasoning_mode = "/think" -%}
22
+ {%- endif -%}
23
+ {%- set custom_instructions = system_message.replace("/no_think", "").replace("/think", "").rstrip() -%}
24
+ {%- endif -%}
25
+
26
+ {%- if "/system_override" in system_message -%}
27
+ {{- custom_instructions.replace("/system_override", "").rstrip() -}}
28
+ {{- "<|im_end|>\n" -}}
29
+ {%- else -%}
30
+ {{- "## Metadata\n\n" -}}
31
+ {{- "Knowledge Cutoff Date: June 2025\n" -}}
32
+ {%- set today = strftime_now("%d %B %Y") -%}
33
+ {{- "Today Date: " ~ today ~ "\n" -}}
34
+ {{- "Reasoning Mode: " + reasoning_mode + "\n\n" -}}
35
+
36
+ {{- "## Custom Instructions\n\n" -}}
37
+ {%- if custom_instructions -%}
38
+ {{- custom_instructions + "\n\n" -}}
39
+ {%- elif reasoning_mode == "/think" -%}
40
+ {{- "You are a helpful AI assistant named SmolLM, trained by Hugging Face. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracking, and iteration to develop a well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> Thought section </think> Solution section. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion.\n\n" -}}
41
+ {%- else -%}
42
+ {{- "You are a helpful AI assistant named SmolLM, trained by Hugging Face.\n\n" -}}
43
+ {%- endif -%}
44
+
45
+ {%- if xml_tools or python_tools or tools -%}
46
+ {{- "### Tools\n\n" -}}
47
+ {%- if xml_tools or tools -%}
48
+ {%- if tools -%}
49
+ {%- set xml_tools = tools -%}
50
+ {%- endif -%}
51
+ {%- set ns = namespace(xml_tool_string="You may call one or more functions to assist with the user query.\nYou are provided with function signatures within <tools></tools> XML tags:\n\n<tools>\n") -%}
52
+ {%- for tool in xml_tools[:] -%} {# The slicing makes sure that xml_tools is a list #}
53
+ {%- set ns.xml_tool_string = ns.xml_tool_string ~ (tool | string) ~ "\n" -%}
54
+ {%- endfor -%}
55
+ {%- set xml_tool_string = ns.xml_tool_string + "</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>" -%}
56
+ {{- xml_tool_string -}}
57
+ {%- endif -%}
58
+ {%- if python_tools -%}
59
+ {%- set ns = namespace(python_tool_string="When you send a message containing Python code between '<code>' and '</code>' tags, it will be executed in a stateful Jupyter notebook environment, and you will then be given the output to continue reasoning in an agentic loop.\n\nYou can use the following tools in your python code like regular functions:\n<tools>\n") -%}
60
+ {%- for tool in python_tools[:] -%} {# The slicing makes sure that python_tools is a list #}
61
+ {%- set ns.python_tool_string = ns.python_tool_string ~ (tool | string) ~ "\n" -%}
62
+ {%- endfor -%}
63
+ {%- set python_tool_string = ns.python_tool_string + "</tools>\n\nThe state persists between code executions: so variables that you define in one step are still available thereafter." -%}
64
+ {{- python_tool_string -}}
65
+ {%- endif -%}
66
+ {{- "\n\n" -}}
67
+ {{- "<|im_end|>\n" -}}
68
+ {%- endif -%}
69
+ {%- endif -%}
70
+ {# ───── main loop ───── #}
71
+ {%- for message in messages -%}
72
+ {%- set content = message.content if message.content is string else "" -%}
73
+ {%- if message.role == "user" -%}
74
+ {{ "<|im_start|>" + message.role + "\n" + content + "<|im_end|>\n" }}
75
+ {%- elif message.role == "assistant" -%}
76
+ {% generation %}
77
+ {%- if reasoning_mode == "/think" -%}
78
+ {{ "<|im_start|>assistant\n" + content.lstrip("\n") + "<|im_end|>\n" }}
79
+ {%- else -%}
80
+ {{ "<|im_start|>assistant\n" + "<think>\n\n</think>\n" + content.lstrip("\n") + "<|im_end|>\n" }}
81
+ {%- endif -%}
82
+ {% endgeneration %}
83
+ {%- elif message.role == "tool" -%}
84
+ {{ "<|im_start|>" + "user\n" + content + "<|im_end|>\n" }}
85
+ {%- endif -%}
86
+ {%- endfor -%}
87
+ {# ───── generation prompt ───── #}
88
+ {%- if add_generation_prompt -%}
89
+ {%- if reasoning_mode == "/think" -%}
90
+ {{ "<|im_start|>assistant\n" }}
91
+ {%- else -%}
92
+ {{ "<|im_start|>assistant\n" + "<think>\n\n</think>\n" }}
93
+ {%- endif -%}
94
+ {%- endif -%}
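To see how the `/no_think` and `/system_override` directives interact, here is a sketch that renders the template to text without tokenizing (repo id again hypothetical). With this system prompt the metadata header is suppressed and the assistant turn is pre-seeded with an empty `<think></think>` block:

```python
# Render the chat template to inspect the /no_think + /system_override path.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mrbeniwal/tiny-audio-asr")  # hypothetical id

messages = [
    {"role": "system", "content": "/no_think /system_override"},  # as in config.json
    {"role": "user", "content": "Transcribe: <audio>"},
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
# <|im_start|>system
# <|im_end|>
# <|im_start|>user
# Transcribe: <audio><|im_end|>
# <|im_start|>assistant
# <think>
#
# </think>
```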
config.json ADDED
@@ -0,0 +1,203 @@
1
+ {
2
+ "architectures": [
3
+ "ASRModel"
4
+ ],
5
+ "attn_implementation": "sdpa",
6
+ "audio_config": {
7
+ "_name_or_path": "openai/whisper-large-v3-turbo",
8
+ "activation_dropout": 0.0,
9
+ "activation_function": "gelu",
10
+ "apply_spec_augment": false,
11
+ "architectures": [
12
+ "WhisperForConditionalGeneration"
13
+ ],
14
+ "attention_dropout": 0.0,
15
+ "bos_token_id": 50257,
16
+ "classifier_proj_size": 256,
17
+ "d_model": 1280,
18
+ "decoder_attention_heads": 20,
19
+ "decoder_ffn_dim": 5120,
20
+ "decoder_layerdrop": 0.0,
21
+ "decoder_layers": 4,
22
+ "decoder_start_token_id": 50258,
23
+ "dropout": 0.0,
24
+ "dtype": "bfloat16",
25
+ "encoder_attention_heads": 20,
26
+ "encoder_ffn_dim": 5120,
27
+ "encoder_layerdrop": 0.0,
28
+ "encoder_layers": 32,
29
+ "eos_token_id": 50257,
30
+ "init_std": 0.02,
31
+ "mask_feature_length": 10,
32
+ "mask_feature_min_masks": 0,
33
+ "mask_feature_prob": 0.0,
34
+ "mask_time_length": 10,
35
+ "mask_time_min_masks": 2,
36
+ "mask_time_prob": 0.05,
37
+ "max_source_positions": 1500,
38
+ "max_target_positions": 448,
39
+ "median_filter_width": 7,
40
+ "model_type": "whisper",
41
+ "num_hidden_layers": 32,
42
+ "num_mel_bins": 128,
43
+ "pad_token_id": 50257,
44
+ "scale_embedding": false,
45
+ "use_cache": true,
46
+ "use_weighted_layer_sum": false,
47
+ "vocab_size": 51866
48
+ },
49
+ "audio_model_id": "openai/whisper-large-v3-turbo",
50
+ "audio_sample_rate": 16000,
51
+ "auto_map": {
52
+ "AutoConfig": "asr_config.ASRConfig",
53
+ "AutoModel": "asr_modeling.ASRModel",
54
+ "AutoModelForSpeechSeq2Seq": "asr_modeling.ASRModel",
55
+ "AutoProcessor": "asr_processing.ASRProcessor"
56
+ },
57
+ "custom_pipelines": {
58
+ "automatic-speech-recognition": {
59
+ "impl": "asr_pipeline.ASRPipeline",
60
+ "pt": [
61
+ "AutoModelForSpeechSeq2Seq"
62
+ ],
63
+ "tf": [],
64
+ "type": "audio"
65
+ }
66
+ },
67
+ "downsample_rate": 16,
68
+ "dtype": "bfloat16",
69
+ "encoder_dim": 1280,
70
+ "inference_diversity_penalty": 0.0,
71
+ "inference_warmup_tokens": 10,
72
+ "label_smoothing": 0.0,
73
+ "llm_dim": 2048,
74
+ "max_new_tokens": 96,
75
+ "min_new_tokens": 0,
76
+ "model_dtype": "bfloat16",
77
+ "model_type": "asr_model",
78
+ "num_experts": 4,
79
+ "num_experts_per_tok": 2,
80
+ "pipeline_tag": "automatic-speech-recognition",
81
+ "projector_dropout": 0.05,
82
+ "projector_hidden_dim": null,
83
+ "projector_init_std": 0.02,
84
+ "projector_input_noise": 0.02,
85
+ "projector_num_layers": 2,
86
+ "projector_pool_stride": 2,
87
+ "projector_type": "mlp",
88
+ "router_aux_loss_coef": 0.01,
89
+ "system_prompt": "/no_think /system_override",
90
+ "text_config": {
91
+ "_name_or_path": "HuggingFaceTB/SmolLM3-3B",
92
+ "architectures": [
93
+ "SmolLM3ForCausalLM"
94
+ ],
95
+ "attention_bias": false,
96
+ "attention_dropout": 0.0,
97
+ "bos_token_id": null,
98
+ "dtype": "bfloat16",
99
+ "eos_token_id": 128012,
100
+ "hidden_act": "silu",
101
+ "hidden_size": 2048,
102
+ "initializer_range": 0.02,
103
+ "intermediate_size": 11008,
104
+ "layer_types": [
105
+ "full_attention",
106
+ "full_attention",
107
+ "full_attention",
108
+ "full_attention",
109
+ "full_attention",
110
+ "full_attention",
111
+ "full_attention",
112
+ "full_attention",
113
+ "full_attention",
114
+ "full_attention",
115
+ "full_attention",
116
+ "full_attention",
117
+ "full_attention",
118
+ "full_attention",
119
+ "full_attention",
120
+ "full_attention",
121
+ "full_attention",
122
+ "full_attention",
123
+ "full_attention",
124
+ "full_attention",
125
+ "full_attention",
126
+ "full_attention",
127
+ "full_attention",
128
+ "full_attention",
129
+ "full_attention",
130
+ "full_attention",
131
+ "full_attention",
132
+ "full_attention",
133
+ "full_attention",
134
+ "full_attention",
135
+ "full_attention",
136
+ "full_attention",
137
+ "full_attention",
138
+ "full_attention",
139
+ "full_attention",
140
+ "full_attention"
141
+ ],
142
+ "max_position_embeddings": 65536,
143
+ "max_window_layers": 28,
144
+ "mlp_bias": false,
145
+ "model_type": "smollm3",
146
+ "no_rope_layer_interval": 4,
147
+ "no_rope_layers": [
148
+ 1,
149
+ 1,
150
+ 1,
151
+ 0,
152
+ 1,
153
+ 1,
154
+ 1,
155
+ 0,
156
+ 1,
157
+ 1,
158
+ 1,
159
+ 0,
160
+ 1,
161
+ 1,
162
+ 1,
163
+ 0,
164
+ 1,
165
+ 1,
166
+ 1,
167
+ 0,
168
+ 1,
169
+ 1,
170
+ 1,
171
+ 0,
172
+ 1,
173
+ 1,
174
+ 1,
175
+ 0,
176
+ 1,
177
+ 1,
178
+ 1,
179
+ 0,
180
+ 1,
181
+ 1,
182
+ 1,
183
+ 0
184
+ ],
185
+ "num_attention_heads": 16,
186
+ "num_hidden_layers": 36,
187
+ "num_key_value_heads": 4,
188
+ "pretraining_tp": 2,
189
+ "rms_norm_eps": 1e-06,
190
+ "rope_scaling": null,
191
+ "rope_theta": 5000000.0,
192
+ "sliding_window": null,
193
+ "use_cache": false,
194
+ "use_sliding_window": false,
195
+ "vocab_size": 128257
196
+ },
197
+ "text_model_id": "HuggingFaceTB/SmolLM3-3B",
198
+ "transformers_version": "4.57.3",
199
+ "use_cache": false,
200
+ "use_specaugment": true,
201
+ "user_prompt": "Transcribe: <audio>",
202
+ "vocab_size": 128257
203
+ }
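Two details in this config are worth noting: `encoder_dim` (1280) and `llm_dim` (2048) are exactly Whisper-turbo's `d_model` and SmolLM3's `hidden_size`, which is what fixes the projector's input/output shapes, and the `auto_map` / `custom_pipelines` entries let the checkpoint load through the stock `pipeline()` API. A sketch of the latter, under the same hypothetical repo id as above:

```python
# Load the full encoder-projector-decoder stack via the registered custom pipeline.
import transformers

asr = transformers.pipeline(
    "automatic-speech-recognition",
    model="mrbeniwal/tiny-audio-asr",  # hypothetical repo id
    trust_remote_code=True,  # pulls asr_modeling.ASRModel / asr_pipeline.ASRPipeline
)
result = asr("audio.wav")  # typically {'text': '...'} for ASR pipelines
print(result)
```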
generation_config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "bos_token_id": 128000,
3
+ "eos_token_id": 128012,
4
+ "max_new_tokens": 96,
5
+ "pad_token_id": 128004,
6
+ "temperature": null,
7
+ "top_k": null,
8
+ "top_p": null,
9
+ "transformers_version": "4.57.3"
10
+ }
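With `temperature`, `top_k`, and `top_p` all null, these defaults amount to greedy decoding capped at 96 new tokens. A hedged sketch of overriding that cap at call time; `model` and `inputs` are assumed to come from the snippets above, and forwarding both `input_features` and `input_ids` this way is an assumption based on the processor's output keys:

```python
# Greedy decoding is the default here; only the token budget is changed.
output_ids = model.generate(
    **inputs,                # input_features + input_ids from the processor (assumed)
    max_new_tokens=128,      # override the 96-token default in generation_config.json
    do_sample=False,         # explicit greedy, matching the nulled sampling fields
    eos_token_id=128012,     # <|im_end|>
    pad_token_id=128004,     # <|finetune_right_pad_id|>
)
```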
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:59a82cd5acbf8e93a0566583e3eecdd2a8108d36856408f1e7d6fc9259059181
3
+ size 23462224
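The payload size is itself a useful sanity check: it corresponds almost exactly to the 11.7M trained projector parameters stored in bfloat16, confirming that only the projector, not the frozen encoder or LLM, is serialized here:

```python
# Safetensors size vs. projector parameter count (2 bytes per bf16 weight).
size_bytes = 23_462_224            # from the LFS pointer above
approx_params = size_bytes / 2     # ignores the small safetensors header
print(f"~{approx_params / 1e6:.1f}M parameters")  # ~11.7M
```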
preprocessor_config.json ADDED
@@ -0,0 +1,18 @@
1
+ {
2
+ "chunk_length": 30,
3
+ "dither": 0.0,
4
+ "feature_extractor_type": "WhisperFeatureExtractor",
5
+ "feature_size": 128,
6
+ "hop_length": 160,
7
+ "n_fft": 400,
8
+ "n_samples": 480000,
9
+ "nb_max_frames": 3000,
10
+ "padding_side": "right",
11
+ "padding_value": 0.0,
12
+ "processor_class": "ASRProcessor",
13
+ "return_attention_mask": false,
14
+ "sampling_rate": 16000,
15
+ "auto_map": {
16
+ "AutoProcessor": "asr_processing.ASRProcessor"
17
+ }
18
+ }
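The numbers in this file are mutually consistent and pin down the audio token budget: 30 s at 16 kHz gives `n_samples`, dividing by the hop length gives the 3000 mel frames (`nb_max_frames`), and Whisper's stride-2 conv halves that to 1500 encoder frames, the count the processor above uses for `<audio>` placeholders:

```python
# How chunk_length, hop_length, n_samples, and nb_max_frames relate.
sampling_rate = 16_000
chunk_length_s = 30
hop_length = 160

n_samples = sampling_rate * chunk_length_s   # 480000, as configured
nb_max_frames = n_samples // hop_length      # 3000 mel frames
encoder_frames = nb_max_frames // 2          # 1500 after Whisper's stride-2 conv
print(n_samples, nb_max_frames, encoder_frames)
```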
projectors.py ADDED
@@ -0,0 +1,527 @@
1
+ """Audio projector modules for bridging encoder and decoder embeddings.
2
+
3
+ This module contains all projector architectures:
4
+ - MLPAudioProjector: Simple 2-layer MLP with conv downsampling
5
+ - MoEAudioProjector: MOSA-style dense mixture of experts
6
+ - SwiGLUAudioProjector: SwiGLU-based projector with temporal pooling
7
+ - ResidualAudioProjector: Residual MLP blocks with linear projection
8
+ - SharedMoEAudioProjector: Shared expert + sparse routed experts
9
+ """
10
+
11
+ import torch
12
+ import torch.nn as nn
13
+ import torch.nn.functional as F # noqa: N812
14
+ from transformers.models.llama.modeling_llama import LlamaRMSNorm
15
+
16
+ # =============================================================================
17
+ # MLP Projector
18
+ # =============================================================================
19
+
20
+
21
+ class MLPAudioProjector(nn.Module):
22
+ """2-layer MLP projector with conv-based 2x temporal downsampling."""
23
+
24
+ def __init__(self, config):
25
+ super().__init__()
26
+
27
+ encoder_dim = getattr(config, "encoder_dim", 768)
28
+ llm_dim = getattr(config, "llm_dim", 2048)
29
+
30
+ self.downsample = nn.Conv1d(
31
+ encoder_dim, encoder_dim, kernel_size=3, stride=2, padding=1, bias=False
32
+ )
33
+ self.linear_1 = nn.Linear(encoder_dim, llm_dim, bias=False)
34
+ self.act = nn.GELU()
35
+ self.linear_2 = nn.Linear(llm_dim, llm_dim, bias=False)
36
+
37
+ self.apply(self._init_weights)
38
+
39
+ def _init_weights(self, module):
40
+ if isinstance(module, nn.Linear):
41
+ nn.init.normal_(module.weight, mean=0.0, std=0.02)
42
+ elif isinstance(module, nn.Conv1d):
43
+ nn.init.normal_(module.weight, mean=0.0, std=0.02)
44
+ if module.bias is not None:
45
+ nn.init.zeros_(module.bias)
46
+
47
+ def forward(self, x):
48
+ """
49
+ x: [Batch, Seq_Len, Dim]
50
+ Returns: [Batch, Seq_Len // 2, llm_dim]
51
+ """
52
+ # Conv1d expects [Batch, Channels, Seq_Len]
53
+ x = x.transpose(1, 2)
54
+ x = self.downsample(x)
55
+ x = x.transpose(1, 2)
56
+
57
+ x = self.linear_1(x)
58
+ x = self.act(x)
59
+ return self.linear_2(x)
60
+
61
+
62
+ # =============================================================================
63
+ # MoE Projector (MOSA-style)
64
+ # =============================================================================
65
+
66
+
67
+ class SimpleAdapter(nn.Module):
68
+ """Simple adapter: Linear -> ReLU -> Dropout -> Linear."""
69
+
70
+ def __init__(self, in_features, hidden_features, out_features, dropout=0.0):
71
+ super().__init__()
72
+ self.fc1 = nn.Linear(in_features, hidden_features)
73
+ self.relu = nn.ReLU()
74
+ self.dropout = nn.Dropout(dropout)
75
+ self.fc2 = nn.Linear(hidden_features, out_features)
76
+
77
+ def forward(self, x):
78
+ x = self.fc1(x)
79
+ x = self.relu(x)
80
+ x = self.dropout(x)
81
+ return self.fc2(x)
82
+
83
+
84
+ class MoEAudioProjector(nn.Module):
85
+ """
86
+ MOSA-style projector: Mixture of Simple Adapters.
87
+
88
+ From paper (arXiv:2508.18998):
89
+ - Dense mixture (softmax over ALL experts) instead of sparse Top-K
90
+ - Simple Linear->ReLU->Linear adapters
91
+ - No auxiliary losses - just cross-entropy on transcripts
92
+ - Conv downsampling: stride 4 total (two conv layers, stride 2 each)
93
+ """
94
+
95
+ def __init__(self, config):
96
+ super().__init__()
97
+
98
+ self.encoder_dim = config.encoder_dim
99
+ self.llm_dim = config.llm_dim
100
+ self.num_experts = getattr(config, "num_experts", 4)
101
+ adapter_hidden = getattr(config, "projector_hidden_dim", None) or 4096
102
+ self.dropout_rate = getattr(config, "projector_dropout", 0.1)
103
+
104
+ # Convolutional Subsampling (stride 4 total)
105
+ self.conv = nn.Sequential(
106
+ nn.Conv1d(self.encoder_dim, self.llm_dim, kernel_size=3, stride=2, padding=1),
107
+ nn.ReLU(),
108
+ nn.Conv1d(self.llm_dim, self.llm_dim, kernel_size=3, stride=2, padding=1),
109
+ nn.ReLU(),
110
+ )
111
+
112
+ # Router
113
+ router_hidden = 512
114
+ self.router = nn.Sequential(
115
+ nn.Linear(self.encoder_dim, router_hidden),
116
+ nn.ReLU(),
117
+ nn.Linear(router_hidden, self.num_experts),
118
+ )
119
+
120
+ # Experts
121
+ self.experts = nn.ModuleList(
122
+ [
123
+ SimpleAdapter(self.llm_dim, adapter_hidden, self.llm_dim, dropout=self.dropout_rate)
124
+ for _ in range(self.num_experts)
125
+ ]
126
+ )
127
+
128
+ self.ln_post = LlamaRMSNorm(self.llm_dim, eps=1e-6)
129
+ self._init_weights()
130
+
131
+ def _init_weights(self):
132
+ std = 0.02
133
+ with torch.no_grad():
134
+ for module in self.conv:
135
+ if isinstance(module, nn.Conv1d):
136
+ nn.init.normal_(module.weight, mean=0.0, std=std)
137
+ if module.bias is not None:
138
+ nn.init.zeros_(module.bias)
139
+
140
+ for module in self.router:
141
+ if isinstance(module, nn.Linear):
142
+ nn.init.normal_(module.weight, mean=0.0, std=std)
143
+ if module.bias is not None:
144
+ nn.init.zeros_(module.bias)
145
+
146
+ for expert in self.experts:
147
+ nn.init.normal_(expert.fc1.weight, mean=0.0, std=std)
148
+ nn.init.normal_(expert.fc2.weight, mean=0.0, std=std)
149
+ if expert.fc1.bias is not None:
150
+ nn.init.zeros_(expert.fc1.bias)
151
+ if expert.fc2.bias is not None:
152
+ nn.init.zeros_(expert.fc2.bias)
153
+
154
+ self.ln_post.weight.data.fill_(1.0)
155
+
156
+ def forward(self, x):
157
+ batch_size, seq_len, _ = x.shape
158
+
159
+ # Pad to be divisible by stride (4)
160
+ pad_amt = (4 - (seq_len % 4)) % 4
161
+ if pad_amt > 0:
162
+ x = F.pad(x, (0, 0, 0, pad_amt))
163
+ seq_len = x.shape[1]
164
+
165
+ # Convolutional Downsampling
166
+ h_conv = self.conv(x.permute(0, 2, 1)).permute(0, 2, 1)
167
+
168
+ # Router on high-res input, then downsample weights
169
+ router_logits = self.router(x)
170
+ router_logits = router_logits.view(batch_size, seq_len // 4, 4, self.num_experts).mean(
171
+ dim=2
172
+ )
173
+ routing_weights = F.softmax(router_logits, dim=-1)
174
+
175
+ # Weighted sum of expert outputs
176
+ final_out = torch.zeros_like(h_conv)
177
+ for i, expert in enumerate(self.experts):
178
+ expert_out = expert(h_conv)
179
+ expert_weight = routing_weights[:, :, i : i + 1]
180
+ final_out.add_(expert_out * expert_weight)
181
+
182
+ return self.ln_post(final_out)
183
+
184
+ def get_aux_loss(self) -> torch.Tensor:
185
+ """Return auxiliary loss (none for dense MoE)."""
186
+ return torch.tensor(0.0)
187
+
188
+
189
+ # =============================================================================
190
+ # SwiGLU Projector
191
+ # =============================================================================
192
+
193
+
194
+ class SwiGLU(nn.Module):
195
+ def __init__(self, in_features, hidden_features, out_features, bias=False, dropout=0.0):
196
+ super().__init__()
197
+ self.w1 = nn.Linear(in_features, hidden_features, bias=bias)
198
+ self.w2 = nn.Linear(in_features, hidden_features, bias=bias)
199
+ self.w3 = nn.Linear(hidden_features, out_features, bias=bias)
200
+ self.act = nn.SiLU()
201
+ self.dropout = nn.Dropout(dropout)
202
+
203
+ def forward(self, x):
204
+ x_gate = self.act(self.w1(x))
205
+ x_val = self.w2(x)
206
+ x = x_gate * x_val
207
+ x = self.dropout(x)
208
+ return self.w3(x)
209
+
210
+
211
+ class SwiGLUAudioProjector(nn.Module):
212
+ """SwiGLU-based projector with temporal pooling."""
213
+
214
+ def __init__(self, config):
215
+ super().__init__()
216
+ self.k = getattr(config, "projector_pool_stride", 4)
217
+ in_dim = config.encoder_dim * self.k
218
+ out_dim = config.llm_dim
219
+ hidden_dim = config.projector_hidden_dim
220
+ if hidden_dim is None:
221
+ hidden_dim = config.encoder_dim * 2
222
+
223
+ dropout_rate = getattr(config, "projector_dropout", 0.0)
224
+
225
+ self.proj1 = SwiGLU(in_dim, hidden_dim, hidden_dim, dropout=dropout_rate)
226
+ self.proj2 = SwiGLU(hidden_dim, hidden_dim, out_dim, dropout=dropout_rate)
227
+ self.output_dropout = nn.Dropout(dropout_rate)
228
+
229
+ with torch.no_grad():
230
+ std = getattr(config, "projector_init_std", 0.02)
231
+ nn.init.normal_(self.proj1.w1.weight, mean=0.0, std=std)
232
+ nn.init.normal_(self.proj1.w2.weight, mean=0.0, std=std)
233
+ nn.init.normal_(self.proj1.w3.weight, mean=0.0, std=std)
234
+ nn.init.normal_(self.proj2.w1.weight, mean=0.0, std=std)
235
+ nn.init.normal_(self.proj2.w2.weight, mean=0.0, std=std)
236
+ nn.init.normal_(self.proj2.w3.weight, mean=0.0, std=std)
237
+
238
+ def forward(self, x):
239
+ batch_size, seq_len, dim = x.size()
240
+
241
+ target_dtype = self.proj1.w1.weight.dtype
242
+ if x.dtype != target_dtype:
243
+ x = x.to(target_dtype)
244
+
245
+ remainder = seq_len % self.k
246
+ if remainder:
247
+ pad_len = self.k - remainder
248
+ x = F.pad(x, (0, 0, 0, pad_len))
249
+
250
+ x = x.contiguous().view(batch_size, -1, dim * self.k)
251
+ x = self.proj1(x)
252
+ x = self.proj2(x)
253
+
254
+ return self.output_dropout(x)
255
+
256
+
257
+ # Alias for backwards compatibility
258
+ AudioProjector = SwiGLUAudioProjector
259
+
260
+
261
+ # =============================================================================
262
+ # Residual Projector
263
+ # =============================================================================
264
+
265
+
266
+ class ResidualMLP(nn.Module):
267
+ """MLP block with residual connection: Output = x + MLP(x)."""
268
+
269
+ def __init__(self, dim, hidden_dim, dropout=0.0):
270
+ super().__init__()
271
+ self.fc1 = nn.Linear(dim, hidden_dim)
272
+ self.fc2 = nn.Linear(hidden_dim, dim)
273
+ self.act = nn.GELU()
274
+ self.dropout = nn.Dropout(dropout)
275
+
276
+ def forward(self, x):
277
+ residual = x
278
+ x = self.fc1(x)
279
+ x = self.act(x)
280
+ x = self.dropout(x)
281
+ x = self.fc2(x)
282
+ x = self.dropout(x)
283
+ return residual + x
284
+
285
+
286
+ class ResidualAudioProjector(nn.Module):
287
+ """Residual MLP projector for audio-to-LLM feature translation."""
288
+
289
+ def __init__(self, config):
290
+ super().__init__()
291
+
292
+ self.k = getattr(config, "projector_pool_stride", 4)
293
+ in_dim = config.encoder_dim * self.k
294
+ out_dim = config.llm_dim
295
+ hidden_dim = getattr(config, "projector_hidden_dim", None) or out_dim * 4
296
+ self.num_layers = getattr(config, "projector_num_layers", 2)
297
+ dropout_rate = getattr(config, "projector_dropout", 0.0)
298
+
299
+ self.input_proj = nn.Linear(in_dim, out_dim)
300
+ self.ln_input = LlamaRMSNorm(out_dim, eps=1e-6)
301
+
302
+ self.layers = nn.ModuleList(
303
+ [ResidualMLP(out_dim, hidden_dim, dropout=dropout_rate) for _ in range(self.num_layers)]
304
+ )
305
+ self.layer_norms = nn.ModuleList(
306
+ [LlamaRMSNorm(out_dim, eps=1e-6) for _ in range(self.num_layers)]
307
+ )
308
+
309
+ self.output_dropout = nn.Dropout(dropout_rate)
310
+ self._init_weights(config)
311
+
312
+ def _init_weights(self, config):
313
+ std = getattr(config, "projector_init_std", 0.02)
314
+
315
+ with torch.no_grad():
316
+ nn.init.normal_(self.input_proj.weight, mean=0.0, std=std)
317
+ if self.input_proj.bias is not None:
318
+ nn.init.zeros_(self.input_proj.bias)
319
+
320
+ self.ln_input.weight.data.fill_(1.0)
321
+ for ln in self.layer_norms:
322
+ ln.weight.data.fill_(1.0)
323
+
324
+ for layer in self.layers:
325
+ nn.init.normal_(layer.fc1.weight, mean=0.0, std=std)
326
+ nn.init.normal_(layer.fc2.weight, mean=0.0, std=std * 0.1)
327
+ if layer.fc1.bias is not None:
328
+ nn.init.zeros_(layer.fc1.bias)
329
+ if layer.fc2.bias is not None:
330
+ nn.init.zeros_(layer.fc2.bias)
331
+
332
+ def forward(self, x):
333
+ batch_size, seq_len, dim = x.size()
334
+
335
+ target_dtype = self.input_proj.weight.dtype
336
+ if x.dtype != target_dtype:
337
+ x = x.to(target_dtype)
338
+
339
+ remainder = seq_len % self.k
340
+ if remainder:
341
+ pad_len = self.k - remainder
342
+ x = F.pad(x, (0, 0, 0, pad_len))
343
+
344
+ x = x.contiguous().view(batch_size, -1, dim * self.k)
345
+ x = self.input_proj(x)
346
+ x = self.ln_input(x)
347
+
348
+ for layer, ln in zip(self.layers, self.layer_norms):
349
+ x = layer(x)
350
+ x = ln(x)
351
+
352
+ return self.output_dropout(x)
353
+
354
+
355
+ # =============================================================================
356
+ # Shared MoE Projector
357
+ # =============================================================================
358
+
359
+
360
+ class SwiGLUExpert(nn.Module):
361
+ """SwiGLU expert MLP."""
362
+
363
+ def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
364
+ super().__init__()
365
+ self.gate_proj = nn.Linear(input_dim, hidden_dim, bias=False)
366
+ self.up_proj = nn.Linear(input_dim, hidden_dim, bias=False)
367
+ self.down_proj = nn.Linear(hidden_dim, output_dim, bias=False)
368
+ self.act = nn.SiLU()
369
+
370
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
371
+ return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))
372
+
373
+
374
+ class SharedMoEBlock(nn.Module):
375
+ """MoE block with shared expert + sparse routed experts."""
376
+
377
+ def __init__(
378
+ self,
379
+ input_dim: int,
380
+ hidden_dim: int,
381
+ output_dim: int,
382
+ num_experts: int = 4,
383
+ top_k: int = 2,
384
+ ):
385
+ super().__init__()
386
+ self.num_experts = num_experts
387
+ self.top_k = top_k
388
+ self.output_dim = output_dim
389
+
390
+ self.router = nn.Linear(input_dim, num_experts, bias=False)
391
+ nn.init.zeros_(self.router.weight)
392
+
393
+ self.shared_expert = SwiGLUExpert(input_dim, hidden_dim, output_dim)
394
+ self.experts = nn.ModuleList(
395
+ [SwiGLUExpert(input_dim, hidden_dim, output_dim) for _ in range(num_experts)]
396
+ )
397
+
398
+ self.last_router_logits = None
399
+ self.last_router_probs = None
400
+
401
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
402
+ batch_size, seq_len, dim = hidden_states.shape
403
+
404
+ shared_out = self.shared_expert(hidden_states)
405
+
406
+ flat_hidden = hidden_states.view(-1, dim)
407
+ router_logits = self.router(flat_hidden)
408
+ router_probs = F.softmax(router_logits.float(), dim=-1)
409
+
410
+ self.last_router_logits = router_logits
411
+ self.last_router_probs = router_probs
412
+
413
+ top_k_weights, top_k_indices = torch.topk(router_probs, self.top_k, dim=-1)
414
+ top_k_weights = top_k_weights / top_k_weights.sum(dim=-1, keepdim=True)
415
+ top_k_weights = top_k_weights.to(hidden_states.dtype)
416
+
417
+ routed_out = self._dispatch_experts(flat_hidden, top_k_indices, top_k_weights)
418
+ routed_out = routed_out.view(batch_size, seq_len, -1)
419
+
420
+ return shared_out + routed_out
421
+
422
+ def _dispatch_experts(
423
+ self,
424
+ hidden_states: torch.Tensor,
425
+ top_k_indices: torch.Tensor,
426
+ top_k_weights: torch.Tensor,
427
+ ) -> torch.Tensor:
428
+ num_tokens = hidden_states.shape[0]
429
+ output = torch.zeros(
430
+ num_tokens, self.output_dim, device=hidden_states.device, dtype=hidden_states.dtype
431
+ )
432
+
433
+ for expert_idx, expert in enumerate(self.experts):
434
+ expert_mask = top_k_indices == expert_idx
435
+ if not expert_mask.any():
436
+ continue
437
+
438
+ token_indices, slot_indices = torch.where(expert_mask)
439
+ expert_input = hidden_states[token_indices]
440
+ expert_output = expert(expert_input)
441
+ weights = top_k_weights[token_indices, slot_indices].unsqueeze(-1)
442
+ output.index_add_(0, token_indices, expert_output * weights)
443
+
444
+ return output
445
+
446
+
447
+ def load_balancing_loss(router_probs: torch.Tensor, num_experts: int, top_k: int) -> torch.Tensor:
448
+ """Auxiliary loss to encourage balanced expert usage."""
449
+ _, selected = torch.topk(router_probs, top_k, dim=-1)
450
+ expert_mask = F.one_hot(selected, num_experts).float()
451
+ tokens_per_expert = expert_mask.mean(dim=(0, 1))
452
+ prob_per_expert = router_probs.mean(dim=0)
453
+ return (tokens_per_expert * prob_per_expert).sum() * num_experts
454
+
455
+
456
+ def z_loss(router_logits: torch.Tensor) -> torch.Tensor:
457
+ """Z-loss to prevent router logits from growing too large."""
458
+ return torch.logsumexp(router_logits.float(), dim=-1).square().mean()
459
+
460
+
461
+ class SharedMoEAudioProjector(nn.Module):
462
+ """Shared expert + sparse routed experts projector."""
463
+
464
+ def __init__(self, config):
465
+ super().__init__()
466
+
467
+ self.k = getattr(config, "projector_pool_stride", 4)
468
+
469
+ encoder_dim = config.encoder_dim
470
+ in_dim = encoder_dim * self.k
471
+ out_dim = config.llm_dim
472
+ hidden_dim = getattr(config, "projector_hidden_dim", None) or in_dim
473
+
474
+ self.num_experts = getattr(config, "num_experts", 4)
475
+ self.top_k = getattr(config, "num_experts_per_tok", 2)
476
+ self.aux_loss_coef = getattr(config, "router_aux_loss_coef", 0.02)
477
+ self.z_loss_coef = getattr(config, "router_z_loss_coef", 0.001)
478
+
479
+ self.moe = SharedMoEBlock(in_dim, hidden_dim, out_dim, self.num_experts, self.top_k)
480
+ self._init_weights(in_dim)
481
+
482
+ def _init_weights(self, in_dim: int):
483
+ with torch.no_grad():
484
+ nn.init.orthogonal_(self.moe.shared_expert.gate_proj.weight)
485
+ nn.init.orthogonal_(self.moe.shared_expert.up_proj.weight)
486
+ nn.init.orthogonal_(self.moe.shared_expert.down_proj.weight, gain=0.5)
487
+
488
+ for expert in self.moe.experts:
489
+ nn.init.orthogonal_(expert.gate_proj.weight)
490
+ nn.init.orthogonal_(expert.up_proj.weight)
491
+ nn.init.orthogonal_(expert.down_proj.weight, gain=0.01)
492
+
493
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
494
+ batch_size, seq_len, dim = x.size()
495
+
496
+ target_dtype = self.moe.shared_expert.gate_proj.weight.dtype
497
+ if x.dtype != target_dtype:
498
+ x = x.to(target_dtype)
499
+
500
+ if seq_len % self.k:
501
+ x = F.pad(x, (0, 0, 0, self.k - seq_len % self.k))
502
+
503
+ x = x.view(batch_size, -1, dim * self.k)
504
+
505
+ return self.moe(x)
506
+
507
+ def get_aux_loss(self) -> torch.Tensor:
508
+ if self.moe.last_router_logits is None:
509
+ return torch.tensor(0.0, device=self.moe.router.weight.device)
510
+
511
+ balance = load_balancing_loss(self.moe.last_router_probs, self.num_experts, self.top_k)
512
+ z = z_loss(self.moe.last_router_logits)
513
+
514
+ return self.aux_loss_coef * balance + self.z_loss_coef * z
515
+
516
+
517
+ # =============================================================================
518
+ # Projector Registry
519
+ # =============================================================================
520
+
521
+ PROJECTOR_CLASSES = {
522
+ "mlp": MLPAudioProjector,
523
+ "moe": MoEAudioProjector,
524
+ "swiglu": SwiGLUAudioProjector,
525
+ "residual": ResidualAudioProjector,
526
+ "shared_moe": SharedMoEAudioProjector,
527
+ }
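A quick smoke test of the registry: build a projector from a bare namespace carrying just the fields the constructors read (values mirror `config.json`), and check the shape contract. Using `SimpleNamespace` as a stand-in for `ASRConfig` is an assumption for the sketch:

```python
# Instantiate a projector from PROJECTOR_CLASSES and verify the shape contract.
from types import SimpleNamespace

import torch

from projectors import PROJECTOR_CLASSES  # assumes this file is on the path

cfg = SimpleNamespace(
    encoder_dim=1280, llm_dim=2048,        # Whisper d_model -> SmolLM3 hidden_size
    projector_hidden_dim=None, projector_dropout=0.05,
    projector_init_std=0.02, projector_num_layers=2,
    projector_pool_stride=2, num_experts=4,
    num_experts_per_tok=2, router_aux_loss_coef=0.01,
    router_z_loss_coef=0.001,
)

proj = PROJECTOR_CLASSES["mlp"](cfg)
x = torch.randn(1, 1500, cfg.encoder_dim)  # 30 s of Whisper encoder output
print(proj(x).shape)                       # torch.Size([1, 750, 2048])
```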
special_tokens_map.json ADDED
@@ -0,0 +1,19 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ {
4
+ "content": "<audio>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false
9
+ }
10
+ ],
11
+ "eos_token": {
12
+ "content": "<|im_end|>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false
17
+ },
18
+ "pad_token": "<|finetune_right_pad_id|>"
19
+ }
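Because `<audio>` is registered as an additional special token, the processor's `convert_tokens_to_ids(self.AUDIO_TOKEN)` resolves to a single dedicated id rather than a multi-token split; a one-line check (hypothetical repo id as before):

```python
# The <audio> placeholder must map to exactly one token id for the processor.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mrbeniwal/tiny-audio-asr")  # hypothetical id
audio_id = tok.convert_tokens_to_ids("<audio>")
assert tok.convert_ids_to_tokens(audio_id) == "<audio>"
print(audio_id)
```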
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d4aeaf198f783cbf58d8cd59812baac429ffe49147bf9648f6618de20b8d4a4c
3
+ size 17209003
tokenizer_config.json ADDED
@@ -0,0 +1,2075 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "128000": {
4
+ "content": "<|begin_of_text|>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "128001": {
12
+ "content": "<|end_of_text|>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "128002": {
20
+ "content": "<think>",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": false
26
+ },
27
+ "128003": {
28
+ "content": "</think>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": false
34
+ },
35
+ "128004": {
36
+ "content": "<|finetune_right_pad_id|>",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ },
43
+ "128005": {
44
+ "content": "<|reserved_special_token_2|>",
45
+ "lstrip": false,
46
+ "normalized": false,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": true
50
+ },
51
+ "128006": {
52
+ "content": "<|start_header_id|>",
53
+ "lstrip": false,
54
+ "normalized": false,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": true
58
+ },
59
+ "128007": {
60
+ "content": "<|end_header_id|>",
61
+ "lstrip": false,
62
+ "normalized": false,
63
+ "rstrip": false,
64
+ "single_word": false,
65
+ "special": true
66
+ },
67
+ "128008": {
68
+ "content": "<|eom_id|>",
69
+ "lstrip": false,
70
+ "normalized": false,
71
+ "rstrip": false,
72
+ "single_word": false,
73
+ "special": true
74
+ },
75
+ "128009": {
76
+ "content": "<|eot_id|>",
77
+ "lstrip": false,
78
+ "normalized": false,
79
+ "rstrip": false,
80
+ "single_word": false,
81
+ "special": true
82
+ },
83
+ "128010": {
84
+ "content": "<|python_tag|>",
85
+ "lstrip": false,
86
+ "normalized": false,
87
+ "rstrip": false,
88
+ "single_word": false,
89
+ "special": true
90
+ },
91
+ "128011": {
92
+ "content": "<|im_start|>",
93
+ "lstrip": false,
94
+ "normalized": false,
95
+ "rstrip": false,
96
+ "single_word": false,
97
+ "special": true
98
+ },
99
+ "128012": {
100
+ "content": "<|im_end|>",
101
+ "lstrip": false,
102
+ "normalized": false,
103
+ "rstrip": false,
104
+ "single_word": false,
105
+ "special": true
106
+ },
107
+ "128013": {
108
+ "content": "<tool_response>",
109
+ "lstrip": false,
110
+ "normalized": false,
111
+ "rstrip": false,
112
+ "single_word": false,
113
+ "special": false
114
+ },
115
+ "128014": {
116
+ "content": "</tool_response>",
117
+ "lstrip": false,
118
+ "normalized": false,
119
+ "rstrip": false,
120
+ "single_word": false,
121
+ "special": false
122
+ },
123
+ "128015": {
124
+ "content": "<tool_call>",
125
+ "lstrip": false,
126
+ "normalized": false,
127
+ "rstrip": false,
128
+ "single_word": false,
129
+ "special": false
130
+ },
131
+ "128016": {
132
+ "content": "</tool_call>",
133
+ "lstrip": false,
134
+ "normalized": false,
135
+ "rstrip": false,
136
+ "single_word": false,
137
+ "special": false
138
+ },
139
+ "128017": {
140
+ "content": "<code>",
141
+ "lstrip": false,
142
+ "normalized": false,
143
+ "rstrip": false,
144
+ "single_word": false,
145
+ "special": false
146
+ },
147
+ "128018": {
148
+ "content": "</code>",
149
+ "lstrip": false,
150
+ "normalized": false,
151
+ "rstrip": false,
152
+ "single_word": false,
153
+ "special": false
154
+ },
155
+ "128019": {
156
+ "content": "<|reserved_special_token_11|>",
157
+ "lstrip": false,
158
+ "normalized": false,
159
+ "rstrip": false,
160
+ "single_word": false,
161
+ "special": true
162
+ },
163
+ "128020": {
164
+ "content": "<|reserved_special_token_12|>",
165
+ "lstrip": false,
166
+ "normalized": false,
167
+ "rstrip": false,
168
+ "single_word": false,
169
+ "special": true
170
+ },
171
+ "128021": {
172
+ "content": "<|reserved_special_token_13|>",
173
+ "lstrip": false,
174
+ "normalized": false,
175
+ "rstrip": false,
176
+ "single_word": false,
177
+ "special": true
178
+ },
179
+ "128022": {
180
+ "content": "<|reserved_special_token_14|>",
181
+ "lstrip": false,
182
+ "normalized": false,
183
+ "rstrip": false,
184
+ "single_word": false,
185
+ "special": true
186
+ },
187
+ "128023": {
188
+ "content": "<|reserved_special_token_15|>",
189
+ "lstrip": false,
190
+ "normalized": false,
191
+ "rstrip": false,
192
+ "single_word": false,
193
+ "special": true
194
+ },
195
+ "128024": {
196
+ "content": "<|reserved_special_token_16|>",
197
+ "lstrip": false,
198
+ "normalized": false,
199
+ "rstrip": false,
200
+ "single_word": false,
201
+ "special": true
202
+ },
203
+ "128025": {
204
+ "content": "<|reserved_special_token_17|>",
205
+ "lstrip": false,
206
+ "normalized": false,
207
+ "rstrip": false,
208
+ "single_word": false,
209
+ "special": true
210
+ },
211
+ "128026": {
212
+ "content": "<|reserved_special_token_18|>",
213
+ "lstrip": false,
214
+ "normalized": false,
215
+ "rstrip": false,
216
+ "single_word": false,
217
+ "special": true
218
+ },
219
+ "128027": {
220
+ "content": "<|reserved_special_token_19|>",
221
+ "lstrip": false,
222
+ "normalized": false,
223
+ "rstrip": false,
224
+ "single_word": false,
225
+ "special": true
226
+ },
227
+ "128028": {
228
+ "content": "<|reserved_special_token_20|>",
229
+ "lstrip": false,
230
+ "normalized": false,
231
+ "rstrip": false,
232
+ "single_word": false,
233
+ "special": true
234
+ },
235
+ "128029": {
236
+ "content": "<|reserved_special_token_21|>",
237
+ "lstrip": false,
238
+ "normalized": false,
239
+ "rstrip": false,
240
+ "single_word": false,
241
+ "special": true
242
+ },
243
+ "128030": {
244
+ "content": "<|reserved_special_token_22|>",
245
+ "lstrip": false,
246
+ "normalized": false,
247
+ "rstrip": false,
248
+ "single_word": false,
249
+ "special": true
250
+ },
251
+ "128031": {
252
+ "content": "<|reserved_special_token_23|>",
253
+ "lstrip": false,
254
+ "normalized": false,
255
+ "rstrip": false,
256
+ "single_word": false,
257
+ "special": true
258
+ },
259
+ "128032": {
260
+ "content": "<|reserved_special_token_24|>",
261
+ "lstrip": false,
262
+ "normalized": false,
263
+ "rstrip": false,
264
+ "single_word": false,
265
+ "special": true
266
+ },
267
+ "128033": {
268
+ "content": "<|reserved_special_token_25|>",
269
+ "lstrip": false,
270
+ "normalized": false,
271
+ "rstrip": false,
272
+ "single_word": false,
273
+ "special": true
274
+ },
275
+ "128034": {
276
+ "content": "<|reserved_special_token_26|>",
277
+ "lstrip": false,
278
+ "normalized": false,
279
+ "rstrip": false,
280
+ "single_word": false,
281
+ "special": true
282
+ },
283
+ "128035": {
284
+ "content": "<|reserved_special_token_27|>",
285
+ "lstrip": false,
286
+ "normalized": false,
287
+ "rstrip": false,
288
+ "single_word": false,
289
+ "special": true
290
+ },
291
+ "128036": {
292
+ "content": "<|reserved_special_token_28|>",
293
+ "lstrip": false,
294
+ "normalized": false,
295
+ "rstrip": false,
296
+ "single_word": false,
297
+ "special": true
298
+ },
299
+ "128037": {
300
+ "content": "<|reserved_special_token_29|>",
301
+ "lstrip": false,
302
+ "normalized": false,
303
+ "rstrip": false,
304
+ "single_word": false,
305
+ "special": true
306
+ },
307
+ "128038": {
308
+ "content": "<|reserved_special_token_30|>",
309
+ "lstrip": false,
310
+ "normalized": false,
311
+ "rstrip": false,
312
+ "single_word": false,
313
+ "special": true
314
+ },
315
+ "128039": {
316
+ "content": "<|reserved_special_token_31|>",
317
+ "lstrip": false,
318
+ "normalized": false,
319
+ "rstrip": false,
320
+ "single_word": false,
321
+ "special": true
322
+ },
323
+ "128040": {
324
+ "content": "<|reserved_special_token_32|>",
325
+ "lstrip": false,
326
+ "normalized": false,
327
+ "rstrip": false,
328
+ "single_word": false,
329
+ "special": true
330
+ },
331
+ "128041": {
332
+ "content": "<|reserved_special_token_33|>",
333
+ "lstrip": false,
334
+ "normalized": false,
335
+ "rstrip": false,
336
+ "single_word": false,
337
+ "special": true
338
+ },
339
+ "128042": {
340
+ "content": "<|reserved_special_token_34|>",
341
+ "lstrip": false,
342
+ "normalized": false,
343
+ "rstrip": false,
344
+ "single_word": false,
345
+ "special": true
346
+ },
347
+ "128043": {
348
+ "content": "<|reserved_special_token_35|>",
349
+ "lstrip": false,
350
+ "normalized": false,
351
+ "rstrip": false,
352
+ "single_word": false,
353
+ "special": true
354
+ },
355
+ "128044": {
356
+ "content": "<|reserved_special_token_36|>",
357
+ "lstrip": false,
358
+ "normalized": false,
359
+ "rstrip": false,
360
+ "single_word": false,
361
+ "special": true
362
+ },
363
+ "128045": {
364
+ "content": "<|reserved_special_token_37|>",
365
+ "lstrip": false,
366
+ "normalized": false,
367
+ "rstrip": false,
368
+ "single_word": false,
369
+ "special": true
370
+ },
371
+ "128046": {
372
+ "content": "<|reserved_special_token_38|>",
373
+ "lstrip": false,
374
+ "normalized": false,
375
+ "rstrip": false,
376
+ "single_word": false,
377
+ "special": true
378
+ },
379
+ "128047": {
380
+ "content": "<|reserved_special_token_39|>",
381
+ "lstrip": false,
382
+ "normalized": false,
383
+ "rstrip": false,
384
+ "single_word": false,
385
+ "special": true
386
+ },
387
+ "128048": {
388
+ "content": "<|reserved_special_token_40|>",
389
+ "lstrip": false,
390
+ "normalized": false,
391
+ "rstrip": false,
392
+ "single_word": false,
393
+ "special": true
394
+ },
395
+ "128049": {
396
+ "content": "<|reserved_special_token_41|>",
397
+ "lstrip": false,
398
+ "normalized": false,
399
+ "rstrip": false,
400
+ "single_word": false,
401
+ "special": true
402
+ },
403
+ "128050": {
404
+ "content": "<|reserved_special_token_42|>",
405
+ "lstrip": false,
406
+ "normalized": false,
407
+ "rstrip": false,
408
+ "single_word": false,
409
+ "special": true
410
+ },
411
+ "128051": {
412
+ "content": "<|reserved_special_token_43|>",
413
+ "lstrip": false,
414
+ "normalized": false,
415
+ "rstrip": false,
416
+ "single_word": false,
417
+ "special": true
418
+ },
419
+ "128052": {
420
+ "content": "<|reserved_special_token_44|>",
421
+ "lstrip": false,
422
+ "normalized": false,
423
+ "rstrip": false,
424
+ "single_word": false,
425
+ "special": true
426
+ },
427
+ "128053": {
428
+ "content": "<|reserved_special_token_45|>",
429
+ "lstrip": false,
430
+ "normalized": false,
431
+ "rstrip": false,
432
+ "single_word": false,
433
+ "special": true
434
+ },
435
+ "128054": {
436
+ "content": "<|reserved_special_token_46|>",
437
+ "lstrip": false,
438
+ "normalized": false,
439
+ "rstrip": false,
440
+ "single_word": false,
441
+ "special": true
442
+ },
443
+ "128055": {
444
+ "content": "<|reserved_special_token_47|>",
445
+ "lstrip": false,
446
+ "normalized": false,
447
+ "rstrip": false,
448
+ "single_word": false,
449
+ "special": true
450
+ },
451
+ "128056": {
452
+ "content": "<|reserved_special_token_48|>",
453
+ "lstrip": false,
454
+ "normalized": false,
455
+ "rstrip": false,
456
+ "single_word": false,
457
+ "special": true
458
+ },
459
+ "128057": {
460
+ "content": "<|reserved_special_token_49|>",
461
+ "lstrip": false,
462
+ "normalized": false,
463
+ "rstrip": false,
464
+ "single_word": false,
465
+ "special": true
466
+ },
467
+ "128058": {
468
+ "content": "<|reserved_special_token_50|>",
469
+ "lstrip": false,
470
+ "normalized": false,
471
+ "rstrip": false,
472
+ "single_word": false,
473
+ "special": true
474
+ },
475
+ "128059": {
476
+ "content": "<|reserved_special_token_51|>",
477
+ "lstrip": false,
478
+ "normalized": false,
479
+ "rstrip": false,
480
+ "single_word": false,
481
+ "special": true
482
+ },
483
+ "128060": {
484
+ "content": "<|reserved_special_token_52|>",
485
+ "lstrip": false,
486
+ "normalized": false,
487
+ "rstrip": false,
488
+ "single_word": false,
489
+ "special": true
490
+ },
491
+ "128061": {
492
+ "content": "<|reserved_special_token_53|>",
493
+ "lstrip": false,
494
+ "normalized": false,
495
+ "rstrip": false,
496
+ "single_word": false,
497
+ "special": true
498
+ },
499
+ "128062": {
500
+ "content": "<|reserved_special_token_54|>",
501
+ "lstrip": false,
502
+ "normalized": false,
503
+ "rstrip": false,
504
+ "single_word": false,
505
+ "special": true
506
+ },
507
+ "128063": {
508
+ "content": "<|reserved_special_token_55|>",
509
+ "lstrip": false,
510
+ "normalized": false,
511
+ "rstrip": false,
512
+ "single_word": false,
513
+ "special": true
514
+ },
515
+ "128064": {
516
+ "content": "<|reserved_special_token_56|>",
517
+ "lstrip": false,
518
+ "normalized": false,
519
+ "rstrip": false,
520
+ "single_word": false,
521
+ "special": true
522
+ },
523
+ "128065": {
524
+ "content": "<|reserved_special_token_57|>",
525
+ "lstrip": false,
526
+ "normalized": false,
527
+ "rstrip": false,
528
+ "single_word": false,
529
+ "special": true
530
+ },
531
+ "128066": {
532
+ "content": "<|reserved_special_token_58|>",
533
+ "lstrip": false,
534
+ "normalized": false,
535
+ "rstrip": false,
536
+ "single_word": false,
537
+ "special": true
538
+ },
539
+ "128067": {
540
+ "content": "<|reserved_special_token_59|>",
541
+ "lstrip": false,
542
+ "normalized": false,
543
+ "rstrip": false,
544
+ "single_word": false,
545
+ "special": true
546
+ },
547
+ "128068": {
548
+ "content": "<|reserved_special_token_60|>",
549
+ "lstrip": false,
550
+ "normalized": false,
551
+ "rstrip": false,
552
+ "single_word": false,
553
+ "special": true
554
+ },
555
+ "128069": {
556
+ "content": "<|reserved_special_token_61|>",
557
+ "lstrip": false,
558
+ "normalized": false,
559
+ "rstrip": false,
560
+ "single_word": false,
561
+ "special": true
562
+ },
563
+ "128070": {
564
+ "content": "<|reserved_special_token_62|>",
565
+ "lstrip": false,
566
+ "normalized": false,
567
+ "rstrip": false,
568
+ "single_word": false,
569
+ "special": true
570
+ },
571
+ "128071": {
572
+ "content": "<|reserved_special_token_63|>",
573
+ "lstrip": false,
574
+ "normalized": false,
575
+ "rstrip": false,
576
+ "single_word": false,
577
+ "special": true
578
+ },
579
+ "128072": {
580
+ "content": "<|reserved_special_token_64|>",
581
+ "lstrip": false,
582
+ "normalized": false,
583
+ "rstrip": false,
584
+ "single_word": false,
585
+ "special": true
586
+ },
587
+ "128073": {
588
+ "content": "<|reserved_special_token_65|>",
589
+ "lstrip": false,
590
+ "normalized": false,
591
+ "rstrip": false,
592
+ "single_word": false,
593
+ "special": true
594
+ },
595
+ "128074": {
596
+ "content": "<|reserved_special_token_66|>",
597
+ "lstrip": false,
598
+ "normalized": false,
599
+ "rstrip": false,
600
+ "single_word": false,
601
+ "special": true
602
+ },
603
+ "128075": {
604
+ "content": "<|reserved_special_token_67|>",
605
+ "lstrip": false,
606
+ "normalized": false,
607
+ "rstrip": false,
608
+ "single_word": false,
609
+ "special": true
610
+ },
611
+ "128076": {
612
+ "content": "<|reserved_special_token_68|>",
613
+ "lstrip": false,
614
+ "normalized": false,
615
+ "rstrip": false,
616
+ "single_word": false,
617
+ "special": true
618
+ },
619
+ "128077": {
620
+ "content": "<|reserved_special_token_69|>",
621
+ "lstrip": false,
622
+ "normalized": false,
623
+ "rstrip": false,
624
+ "single_word": false,
625
+ "special": true
626
+ },
627
+ "128078": {
628
+ "content": "<|reserved_special_token_70|>",
629
+ "lstrip": false,
630
+ "normalized": false,
631
+ "rstrip": false,
632
+ "single_word": false,
633
+ "special": true
634
+ },
635
+ "128079": {
636
+ "content": "<|reserved_special_token_71|>",
637
+ "lstrip": false,
638
+ "normalized": false,
639
+ "rstrip": false,
640
+ "single_word": false,
641
+ "special": true
642
+ },
643
+ "128080": {
644
+ "content": "<|reserved_special_token_72|>",
645
+ "lstrip": false,
646
+ "normalized": false,
647
+ "rstrip": false,
648
+ "single_word": false,
649
+ "special": true
650
+ },
651
+ "128081": {
652
+ "content": "<|reserved_special_token_73|>",
653
+ "lstrip": false,
654
+ "normalized": false,
655
+ "rstrip": false,
656
+ "single_word": false,
657
+ "special": true
658
+ },
659
+ "128082": {
660
+ "content": "<|reserved_special_token_74|>",
661
+ "lstrip": false,
662
+ "normalized": false,
663
+ "rstrip": false,
664
+ "single_word": false,
665
+ "special": true
666
+ },
667
+ "128083": {
668
+ "content": "<|reserved_special_token_75|>",
669
+ "lstrip": false,
670
+ "normalized": false,
671
+ "rstrip": false,
672
+ "single_word": false,
673
+ "special": true
674
+ },
675
+ "128084": {
676
+ "content": "<|reserved_special_token_76|>",
677
+ "lstrip": false,
678
+ "normalized": false,
679
+ "rstrip": false,
680
+ "single_word": false,
681
+ "special": true
682
+ },
683
+ "128085": {
684
+ "content": "<|reserved_special_token_77|>",
685
+ "lstrip": false,
686
+ "normalized": false,
687
+ "rstrip": false,
688
+ "single_word": false,
689
+ "special": true
690
+ },
691
+ "128086": {
692
+ "content": "<|reserved_special_token_78|>",
693
+ "lstrip": false,
694
+ "normalized": false,
695
+ "rstrip": false,
696
+ "single_word": false,
697
+ "special": true
698
+ },
699
+ "128087": {
700
+ "content": "<|reserved_special_token_79|>",
701
+ "lstrip": false,
702
+ "normalized": false,
703
+ "rstrip": false,
704
+ "single_word": false,
705
+ "special": true
706
+ },
707
+ "128088": {
708
+ "content": "<|reserved_special_token_80|>",
709
+ "lstrip": false,
710
+ "normalized": false,
711
+ "rstrip": false,
712
+ "single_word": false,
713
+ "special": true
714
+ },
715
+ "128089": {
716
+ "content": "<|reserved_special_token_81|>",
717
+ "lstrip": false,
718
+ "normalized": false,
719
+ "rstrip": false,
720
+ "single_word": false,
721
+ "special": true
722
+ },
723
+ "128090": {
724
+ "content": "<|reserved_special_token_82|>",
725
+ "lstrip": false,
726
+ "normalized": false,
727
+ "rstrip": false,
728
+ "single_word": false,
729
+ "special": true
730
+ },
731
+ "128091": {
732
+ "content": "<|reserved_special_token_83|>",
733
+ "lstrip": false,
734
+ "normalized": false,
735
+ "rstrip": false,
736
+ "single_word": false,
737
+ "special": true
738
+ },
739
+ "128092": {
740
+ "content": "<|reserved_special_token_84|>",
741
+ "lstrip": false,
742
+ "normalized": false,
743
+ "rstrip": false,
744
+ "single_word": false,
745
+ "special": true
746
+ },
747
+ "128093": {
748
+ "content": "<|reserved_special_token_85|>",
749
+ "lstrip": false,
750
+ "normalized": false,
751
+ "rstrip": false,
752
+ "single_word": false,
753
+ "special": true
754
+ },
755
+ "128094": {
756
+ "content": "<|reserved_special_token_86|>",
757
+ "lstrip": false,
758
+ "normalized": false,
759
+ "rstrip": false,
760
+ "single_word": false,
761
+ "special": true
762
+ },
763
+ "128095": {
764
+ "content": "<|reserved_special_token_87|>",
765
+ "lstrip": false,
766
+ "normalized": false,
767
+ "rstrip": false,
768
+ "single_word": false,
769
+ "special": true
770
+ },
771
+ "128096": {
772
+ "content": "<|reserved_special_token_88|>",
773
+ "lstrip": false,
774
+ "normalized": false,
775
+ "rstrip": false,
776
+ "single_word": false,
777
+ "special": true
778
+ },
779
+ "128097": {
780
+ "content": "<|reserved_special_token_89|>",
781
+ "lstrip": false,
782
+ "normalized": false,
783
+ "rstrip": false,
784
+ "single_word": false,
785
+ "special": true
786
+ },
787
+ "128098": {
788
+ "content": "<|reserved_special_token_90|>",
789
+ "lstrip": false,
790
+ "normalized": false,
791
+ "rstrip": false,
792
+ "single_word": false,
793
+ "special": true
794
+ },
795
+ "128099": {
796
+ "content": "<|reserved_special_token_91|>",
797
+ "lstrip": false,
798
+ "normalized": false,
799
+ "rstrip": false,
800
+ "single_word": false,
801
+ "special": true
802
+ },
803
+ "128100": {
804
+ "content": "<|reserved_special_token_92|>",
805
+ "lstrip": false,
806
+ "normalized": false,
807
+ "rstrip": false,
808
+ "single_word": false,
809
+ "special": true
810
+ },
811
+ "128101": {
812
+ "content": "<|reserved_special_token_93|>",
813
+ "lstrip": false,
814
+ "normalized": false,
815
+ "rstrip": false,
816
+ "single_word": false,
817
+ "special": true
818
+ },
819
+ "128102": {
820
+ "content": "<|reserved_special_token_94|>",
821
+ "lstrip": false,
822
+ "normalized": false,
823
+ "rstrip": false,
824
+ "single_word": false,
825
+ "special": true
826
+ },
827
+ "128103": {
828
+ "content": "<|reserved_special_token_95|>",
829
+ "lstrip": false,
830
+ "normalized": false,
831
+ "rstrip": false,
832
+ "single_word": false,
833
+ "special": true
834
+ },
835
+ "128104": {
836
+ "content": "<|reserved_special_token_96|>",
837
+ "lstrip": false,
838
+ "normalized": false,
839
+ "rstrip": false,
840
+ "single_word": false,
841
+ "special": true
842
+ },
843
+ "128105": {
844
+ "content": "<|reserved_special_token_97|>",
845
+ "lstrip": false,
846
+ "normalized": false,
847
+ "rstrip": false,
848
+ "single_word": false,
849
+ "special": true
850
+ },
851
+ "128106": {
852
+ "content": "<|reserved_special_token_98|>",
853
+ "lstrip": false,
854
+ "normalized": false,
855
+ "rstrip": false,
856
+ "single_word": false,
857
+ "special": true
858
+ },
859
+ "128107": {
860
+ "content": "<|reserved_special_token_99|>",
861
+ "lstrip": false,
862
+ "normalized": false,
863
+ "rstrip": false,
864
+ "single_word": false,
865
+ "special": true
866
+ },
867
+ "128108": {
868
+ "content": "<|reserved_special_token_100|>",
869
+ "lstrip": false,
870
+ "normalized": false,
871
+ "rstrip": false,
872
+ "single_word": false,
873
+ "special": true
874
+ },
875
+ "128109": {
876
+ "content": "<|reserved_special_token_101|>",
877
+ "lstrip": false,
878
+ "normalized": false,
879
+ "rstrip": false,
880
+ "single_word": false,
881
+ "special": true
882
+ },
883
+ "128110": {
884
+ "content": "<|reserved_special_token_102|>",
885
+ "lstrip": false,
886
+ "normalized": false,
887
+ "rstrip": false,
888
+ "single_word": false,
889
+ "special": true
890
+ },
891
+ "128111": {
892
+ "content": "<|reserved_special_token_103|>",
893
+ "lstrip": false,
894
+ "normalized": false,
895
+ "rstrip": false,
896
+ "single_word": false,
897
+ "special": true
898
+ },
899
+ "128112": {
900
+ "content": "<|reserved_special_token_104|>",
901
+ "lstrip": false,
902
+ "normalized": false,
903
+ "rstrip": false,
904
+ "single_word": false,
905
+ "special": true
906
+ },
907
+ "128113": {
908
+ "content": "<|reserved_special_token_105|>",
909
+ "lstrip": false,
910
+ "normalized": false,
911
+ "rstrip": false,
912
+ "single_word": false,
913
+ "special": true
914
+ },
915
+ "128114": {
916
+ "content": "<|reserved_special_token_106|>",
917
+ "lstrip": false,
918
+ "normalized": false,
919
+ "rstrip": false,
920
+ "single_word": false,
921
+ "special": true
922
+ },
923
+ "128115": {
924
+ "content": "<|reserved_special_token_107|>",
925
+ "lstrip": false,
926
+ "normalized": false,
927
+ "rstrip": false,
928
+ "single_word": false,
929
+ "special": true
930
+ },
931
+ "128116": {
932
+ "content": "<|reserved_special_token_108|>",
933
+ "lstrip": false,
934
+ "normalized": false,
935
+ "rstrip": false,
936
+ "single_word": false,
937
+ "special": true
938
+ },
939
+ "128117": {
940
+ "content": "<|reserved_special_token_109|>",
941
+ "lstrip": false,
942
+ "normalized": false,
943
+ "rstrip": false,
944
+ "single_word": false,
945
+ "special": true
946
+ },
947
+ "128118": {
948
+ "content": "<|reserved_special_token_110|>",
949
+ "lstrip": false,
950
+ "normalized": false,
951
+ "rstrip": false,
952
+ "single_word": false,
953
+ "special": true
954
+ },
955
+ "128119": {
956
+ "content": "<|reserved_special_token_111|>",
957
+ "lstrip": false,
958
+ "normalized": false,
959
+ "rstrip": false,
960
+ "single_word": false,
961
+ "special": true
962
+ },
963
+ "128120": {
964
+ "content": "<|reserved_special_token_112|>",
965
+ "lstrip": false,
966
+ "normalized": false,
967
+ "rstrip": false,
968
+ "single_word": false,
969
+ "special": true
970
+ },
971
+ "128121": {
972
+ "content": "<|reserved_special_token_113|>",
973
+ "lstrip": false,
974
+ "normalized": false,
975
+ "rstrip": false,
976
+ "single_word": false,
977
+ "special": true
978
+ },
979
+ "128122": {
980
+ "content": "<|reserved_special_token_114|>",
981
+ "lstrip": false,
982
+ "normalized": false,
983
+ "rstrip": false,
984
+ "single_word": false,
985
+ "special": true
986
+ },
987
+ "128123": {
988
+ "content": "<|reserved_special_token_115|>",
989
+ "lstrip": false,
990
+ "normalized": false,
991
+ "rstrip": false,
992
+ "single_word": false,
993
+ "special": true
994
+ },
995
+ "128124": {
996
+ "content": "<|reserved_special_token_116|>",
997
+ "lstrip": false,
998
+ "normalized": false,
999
+ "rstrip": false,
1000
+ "single_word": false,
1001
+ "special": true
1002
+ },
1003
+ "128125": {
1004
+ "content": "<|reserved_special_token_117|>",
1005
+ "lstrip": false,
1006
+ "normalized": false,
1007
+ "rstrip": false,
1008
+ "single_word": false,
1009
+ "special": true
1010
+ },
1011
+ "128126": {
1012
+ "content": "<|reserved_special_token_118|>",
1013
+ "lstrip": false,
1014
+ "normalized": false,
1015
+ "rstrip": false,
1016
+ "single_word": false,
1017
+ "special": true
1018
+ },
1019
+ "128127": {
1020
+ "content": "<|reserved_special_token_119|>",
1021
+ "lstrip": false,
1022
+ "normalized": false,
1023
+ "rstrip": false,
1024
+ "single_word": false,
1025
+ "special": true
1026
+ },
1027
+ "128128": {
1028
+ "content": "<|reserved_special_token_120|>",
1029
+ "lstrip": false,
1030
+ "normalized": false,
1031
+ "rstrip": false,
1032
+ "single_word": false,
1033
+ "special": true
1034
+ },
1035
+ "128129": {
1036
+ "content": "<|reserved_special_token_121|>",
1037
+ "lstrip": false,
1038
+ "normalized": false,
1039
+ "rstrip": false,
1040
+ "single_word": false,
1041
+ "special": true
1042
+ },
1043
+ "128130": {
1044
+ "content": "<|reserved_special_token_122|>",
1045
+ "lstrip": false,
1046
+ "normalized": false,
1047
+ "rstrip": false,
1048
+ "single_word": false,
1049
+ "special": true
1050
+ },
1051
+ "128131": {
1052
+ "content": "<|reserved_special_token_123|>",
1053
+ "lstrip": false,
1054
+ "normalized": false,
1055
+ "rstrip": false,
1056
+ "single_word": false,
1057
+ "special": true
1058
+ },
1059
+ "128132": {
1060
+ "content": "<|reserved_special_token_124|>",
1061
+ "lstrip": false,
1062
+ "normalized": false,
1063
+ "rstrip": false,
1064
+ "single_word": false,
1065
+ "special": true
1066
+ },
1067
+ "128133": {
1068
+ "content": "<|reserved_special_token_125|>",
1069
+ "lstrip": false,
1070
+ "normalized": false,
1071
+ "rstrip": false,
1072
+ "single_word": false,
1073
+ "special": true
1074
+ },
1075
+ "128134": {
1076
+ "content": "<|reserved_special_token_126|>",
1077
+ "lstrip": false,
1078
+ "normalized": false,
1079
+ "rstrip": false,
1080
+ "single_word": false,
1081
+ "special": true
1082
+ },
1083
+ "128135": {
1084
+ "content": "<|reserved_special_token_127|>",
1085
+ "lstrip": false,
1086
+ "normalized": false,
1087
+ "rstrip": false,
1088
+ "single_word": false,
1089
+ "special": true
1090
+ },
1091
+ "128136": {
1092
+ "content": "<|reserved_special_token_128|>",
1093
+ "lstrip": false,
1094
+ "normalized": false,
1095
+ "rstrip": false,
1096
+ "single_word": false,
1097
+ "special": true
1098
+ },
1099
+ "128137": {
1100
+ "content": "<|reserved_special_token_129|>",
1101
+ "lstrip": false,
1102
+ "normalized": false,
1103
+ "rstrip": false,
1104
+ "single_word": false,
1105
+ "special": true
1106
+ },
1107
+ "128138": {
1108
+ "content": "<|reserved_special_token_130|>",
1109
+ "lstrip": false,
1110
+ "normalized": false,
1111
+ "rstrip": false,
1112
+ "single_word": false,
1113
+ "special": true
1114
+ },
1115
+ "128139": {
1116
+ "content": "<|reserved_special_token_131|>",
1117
+ "lstrip": false,
1118
+ "normalized": false,
1119
+ "rstrip": false,
1120
+ "single_word": false,
1121
+ "special": true
1122
+ },
1123
+ "128140": {
1124
+ "content": "<|reserved_special_token_132|>",
1125
+ "lstrip": false,
1126
+ "normalized": false,
1127
+ "rstrip": false,
1128
+ "single_word": false,
1129
+ "special": true
1130
+ },
1131
+ "128141": {
1132
+ "content": "<|reserved_special_token_133|>",
1133
+ "lstrip": false,
1134
+ "normalized": false,
1135
+ "rstrip": false,
1136
+ "single_word": false,
1137
+ "special": true
1138
+ },
1139
+ "128142": {
1140
+ "content": "<|reserved_special_token_134|>",
1141
+ "lstrip": false,
1142
+ "normalized": false,
1143
+ "rstrip": false,
1144
+ "single_word": false,
1145
+ "special": true
1146
+ },
1147
+ "128143": {
1148
+ "content": "<|reserved_special_token_135|>",
1149
+ "lstrip": false,
1150
+ "normalized": false,
1151
+ "rstrip": false,
1152
+ "single_word": false,
1153
+ "special": true
1154
+ },
1155
+ "128144": {
1156
+ "content": "<|reserved_special_token_136|>",
1157
+ "lstrip": false,
1158
+ "normalized": false,
1159
+ "rstrip": false,
1160
+ "single_word": false,
1161
+ "special": true
1162
+ },
1163
+ "128145": {
1164
+ "content": "<|reserved_special_token_137|>",
1165
+ "lstrip": false,
1166
+ "normalized": false,
1167
+ "rstrip": false,
1168
+ "single_word": false,
1169
+ "special": true
1170
+ },
1171
+ "128146": {
1172
+ "content": "<|reserved_special_token_138|>",
1173
+ "lstrip": false,
1174
+ "normalized": false,
1175
+ "rstrip": false,
1176
+ "single_word": false,
1177
+ "special": true
1178
+ },
1179
+ "128147": {
1180
+ "content": "<|reserved_special_token_139|>",
1181
+ "lstrip": false,
1182
+ "normalized": false,
1183
+ "rstrip": false,
1184
+ "single_word": false,
1185
+ "special": true
1186
+ },
1187
+ "128148": {
1188
+ "content": "<|reserved_special_token_140|>",
1189
+ "lstrip": false,
1190
+ "normalized": false,
1191
+ "rstrip": false,
1192
+ "single_word": false,
1193
+ "special": true
1194
+ },
1195
+ "128149": {
1196
+ "content": "<|reserved_special_token_141|>",
1197
+ "lstrip": false,
1198
+ "normalized": false,
1199
+ "rstrip": false,
1200
+ "single_word": false,
1201
+ "special": true
1202
+ },
1203
+ "128150": {
1204
+ "content": "<|reserved_special_token_142|>",
1205
+ "lstrip": false,
1206
+ "normalized": false,
1207
+ "rstrip": false,
1208
+ "single_word": false,
1209
+ "special": true
1210
+ },
1211
+ "128151": {
1212
+ "content": "<|reserved_special_token_143|>",
1213
+ "lstrip": false,
1214
+ "normalized": false,
1215
+ "rstrip": false,
1216
+ "single_word": false,
1217
+ "special": true
1218
+ },
1219
+ "128152": {
1220
+ "content": "<|reserved_special_token_144|>",
1221
+ "lstrip": false,
1222
+ "normalized": false,
1223
+ "rstrip": false,
1224
+ "single_word": false,
1225
+ "special": true
1226
+ },
1227
+ "128153": {
1228
+ "content": "<|reserved_special_token_145|>",
1229
+ "lstrip": false,
1230
+ "normalized": false,
1231
+ "rstrip": false,
1232
+ "single_word": false,
1233
+ "special": true
1234
+ },
1235
+ "128154": {
1236
+ "content": "<|reserved_special_token_146|>",
1237
+ "lstrip": false,
1238
+ "normalized": false,
1239
+ "rstrip": false,
1240
+ "single_word": false,
1241
+ "special": true
1242
+ },
1243
+ "128155": {
1244
+ "content": "<|reserved_special_token_147|>",
1245
+ "lstrip": false,
1246
+ "normalized": false,
1247
+ "rstrip": false,
1248
+ "single_word": false,
1249
+ "special": true
1250
+ },
1251
+ "128156": {
1252
+ "content": "<|reserved_special_token_148|>",
1253
+ "lstrip": false,
1254
+ "normalized": false,
1255
+ "rstrip": false,
1256
+ "single_word": false,
1257
+ "special": true
1258
+ },
1259
+ "128157": {
1260
+ "content": "<|reserved_special_token_149|>",
1261
+ "lstrip": false,
1262
+ "normalized": false,
1263
+ "rstrip": false,
1264
+ "single_word": false,
1265
+ "special": true
1266
+ },
1267
+ "128158": {
1268
+ "content": "<|reserved_special_token_150|>",
1269
+ "lstrip": false,
1270
+ "normalized": false,
1271
+ "rstrip": false,
1272
+ "single_word": false,
1273
+ "special": true
1274
+ },
1275
+ "128159": {
1276
+ "content": "<|reserved_special_token_151|>",
1277
+ "lstrip": false,
1278
+ "normalized": false,
1279
+ "rstrip": false,
1280
+ "single_word": false,
1281
+ "special": true
1282
+ },
1283
+ "128160": {
1284
+ "content": "<|reserved_special_token_152|>",
1285
+ "lstrip": false,
1286
+ "normalized": false,
1287
+ "rstrip": false,
1288
+ "single_word": false,
1289
+ "special": true
1290
+ },
1291
+ "128161": {
1292
+ "content": "<|reserved_special_token_153|>",
1293
+ "lstrip": false,
1294
+ "normalized": false,
1295
+ "rstrip": false,
1296
+ "single_word": false,
1297
+ "special": true
1298
+ },
1299
+ "128162": {
1300
+ "content": "<|reserved_special_token_154|>",
1301
+ "lstrip": false,
1302
+ "normalized": false,
1303
+ "rstrip": false,
1304
+ "single_word": false,
1305
+ "special": true
1306
+ },
1307
+ "128163": {
1308
+ "content": "<|reserved_special_token_155|>",
1309
+ "lstrip": false,
1310
+ "normalized": false,
1311
+ "rstrip": false,
1312
+ "single_word": false,
1313
+ "special": true
1314
+ },
1315
+ "128164": {
1316
+ "content": "<|reserved_special_token_156|>",
1317
+ "lstrip": false,
1318
+ "normalized": false,
1319
+ "rstrip": false,
1320
+ "single_word": false,
1321
+ "special": true
1322
+ },
1323
+ "128165": {
1324
+ "content": "<|reserved_special_token_157|>",
1325
+ "lstrip": false,
1326
+ "normalized": false,
1327
+ "rstrip": false,
1328
+ "single_word": false,
1329
+ "special": true
1330
+ },
1331
+ "128166": {
1332
+ "content": "<|reserved_special_token_158|>",
1333
+ "lstrip": false,
1334
+ "normalized": false,
1335
+ "rstrip": false,
1336
+ "single_word": false,
1337
+ "special": true
1338
+ },
1339
+ "128167": {
1340
+ "content": "<|reserved_special_token_159|>",
1341
+ "lstrip": false,
1342
+ "normalized": false,
1343
+ "rstrip": false,
1344
+ "single_word": false,
1345
+ "special": true
1346
+ },
1347
+ "128168": {
1348
+ "content": "<|reserved_special_token_160|>",
1349
+ "lstrip": false,
1350
+ "normalized": false,
1351
+ "rstrip": false,
1352
+ "single_word": false,
1353
+ "special": true
1354
+ },
1355
+ "128169": {
1356
+ "content": "<|reserved_special_token_161|>",
1357
+ "lstrip": false,
1358
+ "normalized": false,
1359
+ "rstrip": false,
1360
+ "single_word": false,
1361
+ "special": true
1362
+ },
1363
+ "128170": {
1364
+ "content": "<|reserved_special_token_162|>",
1365
+ "lstrip": false,
1366
+ "normalized": false,
1367
+ "rstrip": false,
1368
+ "single_word": false,
1369
+ "special": true
1370
+ },
1371
+ "128171": {
1372
+ "content": "<|reserved_special_token_163|>",
1373
+ "lstrip": false,
1374
+ "normalized": false,
1375
+ "rstrip": false,
1376
+ "single_word": false,
1377
+ "special": true
1378
+ },
1379
+ "128172": {
1380
+ "content": "<|reserved_special_token_164|>",
1381
+ "lstrip": false,
1382
+ "normalized": false,
1383
+ "rstrip": false,
1384
+ "single_word": false,
1385
+ "special": true
1386
+ },
1387
+ "128173": {
1388
+ "content": "<|reserved_special_token_165|>",
1389
+ "lstrip": false,
1390
+ "normalized": false,
1391
+ "rstrip": false,
1392
+ "single_word": false,
1393
+ "special": true
1394
+ },
1395
+ "128174": {
1396
+ "content": "<|reserved_special_token_166|>",
1397
+ "lstrip": false,
1398
+ "normalized": false,
1399
+ "rstrip": false,
1400
+ "single_word": false,
1401
+ "special": true
1402
+ },
1403
+ "128175": {
1404
+ "content": "<|reserved_special_token_167|>",
1405
+ "lstrip": false,
1406
+ "normalized": false,
1407
+ "rstrip": false,
1408
+ "single_word": false,
1409
+ "special": true
1410
+ },
1411
+ "128176": {
1412
+ "content": "<|reserved_special_token_168|>",
1413
+ "lstrip": false,
1414
+ "normalized": false,
1415
+ "rstrip": false,
1416
+ "single_word": false,
1417
+ "special": true
1418
+ },
1419
+ "128177": {
1420
+ "content": "<|reserved_special_token_169|>",
1421
+ "lstrip": false,
1422
+ "normalized": false,
1423
+ "rstrip": false,
1424
+ "single_word": false,
1425
+ "special": true
1426
+ },
1427
+ "128178": {
1428
+ "content": "<|reserved_special_token_170|>",
1429
+ "lstrip": false,
1430
+ "normalized": false,
1431
+ "rstrip": false,
1432
+ "single_word": false,
1433
+ "special": true
1434
+ },
1435
+ "128179": {
1436
+ "content": "<|reserved_special_token_171|>",
1437
+ "lstrip": false,
1438
+ "normalized": false,
1439
+ "rstrip": false,
1440
+ "single_word": false,
1441
+ "special": true
1442
+ },
1443
+ "128180": {
1444
+ "content": "<|reserved_special_token_172|>",
1445
+ "lstrip": false,
1446
+ "normalized": false,
1447
+ "rstrip": false,
1448
+ "single_word": false,
1449
+ "special": true
1450
+ },
1451
+ "128181": {
1452
+ "content": "<|reserved_special_token_173|>",
1453
+ "lstrip": false,
1454
+ "normalized": false,
1455
+ "rstrip": false,
1456
+ "single_word": false,
1457
+ "special": true
1458
+ },
1459
+ "128182": {
1460
+ "content": "<|reserved_special_token_174|>",
1461
+ "lstrip": false,
1462
+ "normalized": false,
1463
+ "rstrip": false,
1464
+ "single_word": false,
1465
+ "special": true
1466
+ },
1467
+ "128183": {
1468
+ "content": "<|reserved_special_token_175|>",
1469
+ "lstrip": false,
1470
+ "normalized": false,
1471
+ "rstrip": false,
1472
+ "single_word": false,
1473
+ "special": true
1474
+ },
1475
+ "128184": {
1476
+ "content": "<|reserved_special_token_176|>",
1477
+ "lstrip": false,
1478
+ "normalized": false,
1479
+ "rstrip": false,
1480
+ "single_word": false,
1481
+ "special": true
1482
+ },
1483
+ "128185": {
1484
+ "content": "<|reserved_special_token_177|>",
1485
+ "lstrip": false,
1486
+ "normalized": false,
1487
+ "rstrip": false,
1488
+ "single_word": false,
1489
+ "special": true
1490
+ },
1491
+ "128186": {
1492
+ "content": "<|reserved_special_token_178|>",
1493
+ "lstrip": false,
1494
+ "normalized": false,
1495
+ "rstrip": false,
1496
+ "single_word": false,
1497
+ "special": true
1498
+ },
1499
+ "128187": {
1500
+ "content": "<|reserved_special_token_179|>",
1501
+ "lstrip": false,
1502
+ "normalized": false,
1503
+ "rstrip": false,
1504
+ "single_word": false,
1505
+ "special": true
1506
+ },
1507
+ "128188": {
1508
+ "content": "<|reserved_special_token_180|>",
1509
+ "lstrip": false,
1510
+ "normalized": false,
1511
+ "rstrip": false,
1512
+ "single_word": false,
1513
+ "special": true
1514
+ },
1515
+ "128189": {
1516
+ "content": "<|reserved_special_token_181|>",
1517
+ "lstrip": false,
1518
+ "normalized": false,
1519
+ "rstrip": false,
1520
+ "single_word": false,
1521
+ "special": true
1522
+ },
1523
+ "128190": {
1524
+ "content": "<|reserved_special_token_182|>",
1525
+ "lstrip": false,
1526
+ "normalized": false,
1527
+ "rstrip": false,
1528
+ "single_word": false,
1529
+ "special": true
1530
+ },
1531
+ "128191": {
1532
+ "content": "<|reserved_special_token_183|>",
1533
+ "lstrip": false,
1534
+ "normalized": false,
1535
+ "rstrip": false,
1536
+ "single_word": false,
1537
+ "special": true
1538
+ },
1539
+ "128192": {
1540
+ "content": "<|reserved_special_token_184|>",
1541
+ "lstrip": false,
1542
+ "normalized": false,
1543
+ "rstrip": false,
1544
+ "single_word": false,
1545
+ "special": true
1546
+ },
1547
+ "128193": {
1548
+ "content": "<|reserved_special_token_185|>",
1549
+ "lstrip": false,
1550
+ "normalized": false,
1551
+ "rstrip": false,
1552
+ "single_word": false,
1553
+ "special": true
1554
+ },
1555
+ "128194": {
1556
+ "content": "<|reserved_special_token_186|>",
1557
+ "lstrip": false,
1558
+ "normalized": false,
1559
+ "rstrip": false,
1560
+ "single_word": false,
1561
+ "special": true
1562
+ },
1563
+ "128195": {
1564
+ "content": "<|reserved_special_token_187|>",
1565
+ "lstrip": false,
1566
+ "normalized": false,
1567
+ "rstrip": false,
1568
+ "single_word": false,
1569
+ "special": true
1570
+ },
1571
+ "128196": {
1572
+ "content": "<|reserved_special_token_188|>",
1573
+ "lstrip": false,
1574
+ "normalized": false,
1575
+ "rstrip": false,
1576
+ "single_word": false,
1577
+ "special": true
1578
+ },
1579
+ "128197": {
1580
+ "content": "<|reserved_special_token_189|>",
1581
+ "lstrip": false,
1582
+ "normalized": false,
1583
+ "rstrip": false,
1584
+ "single_word": false,
1585
+ "special": true
1586
+ },
1587
+ "128198": {
1588
+ "content": "<|reserved_special_token_190|>",
1589
+ "lstrip": false,
1590
+ "normalized": false,
1591
+ "rstrip": false,
1592
+ "single_word": false,
1593
+ "special": true
1594
+ },
1595
+ "128199": {
1596
+ "content": "<|reserved_special_token_191|>",
1597
+ "lstrip": false,
1598
+ "normalized": false,
1599
+ "rstrip": false,
1600
+ "single_word": false,
1601
+ "special": true
1602
+ },
1603
+ "128200": {
1604
+ "content": "<|reserved_special_token_192|>",
1605
+ "lstrip": false,
1606
+ "normalized": false,
1607
+ "rstrip": false,
1608
+ "single_word": false,
1609
+ "special": true
1610
+ },
1611
+ "128201": {
1612
+ "content": "<|reserved_special_token_193|>",
1613
+ "lstrip": false,
1614
+ "normalized": false,
1615
+ "rstrip": false,
1616
+ "single_word": false,
1617
+ "special": true
1618
+ },
1619
+ "128202": {
1620
+ "content": "<|reserved_special_token_194|>",
1621
+ "lstrip": false,
1622
+ "normalized": false,
1623
+ "rstrip": false,
1624
+ "single_word": false,
1625
+ "special": true
1626
+ },
1627
+ "128203": {
1628
+ "content": "<|reserved_special_token_195|>",
1629
+ "lstrip": false,
1630
+ "normalized": false,
1631
+ "rstrip": false,
1632
+ "single_word": false,
1633
+ "special": true
1634
+ },
1635
+ "128204": {
1636
+ "content": "<|reserved_special_token_196|>",
1637
+ "lstrip": false,
1638
+ "normalized": false,
1639
+ "rstrip": false,
1640
+ "single_word": false,
1641
+ "special": true
1642
+ },
1643
+ "128205": {
1644
+ "content": "<|reserved_special_token_197|>",
1645
+ "lstrip": false,
1646
+ "normalized": false,
1647
+ "rstrip": false,
1648
+ "single_word": false,
1649
+ "special": true
1650
+ },
1651
+ "128206": {
1652
+ "content": "<|reserved_special_token_198|>",
1653
+ "lstrip": false,
1654
+ "normalized": false,
1655
+ "rstrip": false,
1656
+ "single_word": false,
1657
+ "special": true
1658
+ },
1659
+ "128207": {
1660
+ "content": "<|reserved_special_token_199|>",
1661
+ "lstrip": false,
1662
+ "normalized": false,
1663
+ "rstrip": false,
1664
+ "single_word": false,
1665
+ "special": true
1666
+ },
1667
+ "128208": {
1668
+ "content": "<|reserved_special_token_200|>",
1669
+ "lstrip": false,
1670
+ "normalized": false,
1671
+ "rstrip": false,
1672
+ "single_word": false,
1673
+ "special": true
1674
+ },
1675
+ "128209": {
1676
+ "content": "<|reserved_special_token_201|>",
1677
+ "lstrip": false,
1678
+ "normalized": false,
1679
+ "rstrip": false,
1680
+ "single_word": false,
1681
+ "special": true
1682
+ },
1683
+ "128210": {
1684
+ "content": "<|reserved_special_token_202|>",
1685
+ "lstrip": false,
1686
+ "normalized": false,
1687
+ "rstrip": false,
1688
+ "single_word": false,
1689
+ "special": true
1690
+ },
1691
+ "128211": {
1692
+ "content": "<|reserved_special_token_203|>",
1693
+ "lstrip": false,
1694
+ "normalized": false,
1695
+ "rstrip": false,
1696
+ "single_word": false,
1697
+ "special": true
1698
+ },
1699
+ "128212": {
1700
+ "content": "<|reserved_special_token_204|>",
1701
+ "lstrip": false,
1702
+ "normalized": false,
1703
+ "rstrip": false,
1704
+ "single_word": false,
1705
+ "special": true
1706
+ },
1707
+ "128213": {
1708
+ "content": "<|reserved_special_token_205|>",
1709
+ "lstrip": false,
1710
+ "normalized": false,
1711
+ "rstrip": false,
1712
+ "single_word": false,
1713
+ "special": true
1714
+ },
1715
+ "128214": {
1716
+ "content": "<|reserved_special_token_206|>",
1717
+ "lstrip": false,
1718
+ "normalized": false,
1719
+ "rstrip": false,
1720
+ "single_word": false,
1721
+ "special": true
1722
+ },
1723
+ "128215": {
1724
+ "content": "<|reserved_special_token_207|>",
1725
+ "lstrip": false,
1726
+ "normalized": false,
1727
+ "rstrip": false,
1728
+ "single_word": false,
1729
+ "special": true
1730
+ },
1731
+ "128216": {
1732
+ "content": "<|reserved_special_token_208|>",
1733
+ "lstrip": false,
1734
+ "normalized": false,
1735
+ "rstrip": false,
1736
+ "single_word": false,
1737
+ "special": true
1738
+ },
1739
+ "128217": {
1740
+ "content": "<|reserved_special_token_209|>",
1741
+ "lstrip": false,
1742
+ "normalized": false,
1743
+ "rstrip": false,
1744
+ "single_word": false,
1745
+ "special": true
1746
+ },
1747
+ "128218": {
1748
+ "content": "<|reserved_special_token_210|>",
1749
+ "lstrip": false,
1750
+ "normalized": false,
1751
+ "rstrip": false,
1752
+ "single_word": false,
1753
+ "special": true
1754
+ },
1755
+ "128219": {
1756
+ "content": "<|reserved_special_token_211|>",
1757
+ "lstrip": false,
1758
+ "normalized": false,
1759
+ "rstrip": false,
1760
+ "single_word": false,
1761
+ "special": true
1762
+ },
1763
+ "128220": {
1764
+ "content": "<|reserved_special_token_212|>",
1765
+ "lstrip": false,
1766
+ "normalized": false,
1767
+ "rstrip": false,
1768
+ "single_word": false,
1769
+ "special": true
1770
+ },
1771
+ "128221": {
1772
+ "content": "<|reserved_special_token_213|>",
1773
+ "lstrip": false,
1774
+ "normalized": false,
1775
+ "rstrip": false,
1776
+ "single_word": false,
1777
+ "special": true
1778
+ },
1779
+ "128222": {
1780
+ "content": "<|reserved_special_token_214|>",
1781
+ "lstrip": false,
1782
+ "normalized": false,
1783
+ "rstrip": false,
1784
+ "single_word": false,
1785
+ "special": true
1786
+ },
1787
+ "128223": {
1788
+ "content": "<|reserved_special_token_215|>",
1789
+ "lstrip": false,
1790
+ "normalized": false,
1791
+ "rstrip": false,
1792
+ "single_word": false,
1793
+ "special": true
1794
+ },
1795
+ "128224": {
1796
+ "content": "<|reserved_special_token_216|>",
1797
+ "lstrip": false,
1798
+ "normalized": false,
1799
+ "rstrip": false,
1800
+ "single_word": false,
1801
+ "special": true
1802
+ },
1803
+ "128225": {
1804
+ "content": "<|reserved_special_token_217|>",
1805
+ "lstrip": false,
1806
+ "normalized": false,
1807
+ "rstrip": false,
1808
+ "single_word": false,
1809
+ "special": true
1810
+ },
1811
+ "128226": {
1812
+ "content": "<|reserved_special_token_218|>",
1813
+ "lstrip": false,
1814
+ "normalized": false,
1815
+ "rstrip": false,
1816
+ "single_word": false,
1817
+ "special": true
1818
+ },
1819
+ "128227": {
1820
+ "content": "<|reserved_special_token_219|>",
1821
+ "lstrip": false,
1822
+ "normalized": false,
1823
+ "rstrip": false,
1824
+ "single_word": false,
1825
+ "special": true
1826
+ },
1827
+ "128228": {
1828
+ "content": "<|reserved_special_token_220|>",
1829
+ "lstrip": false,
1830
+ "normalized": false,
1831
+ "rstrip": false,
1832
+ "single_word": false,
1833
+ "special": true
1834
+ },
1835
+ "128229": {
1836
+ "content": "<|reserved_special_token_221|>",
1837
+ "lstrip": false,
1838
+ "normalized": false,
1839
+ "rstrip": false,
1840
+ "single_word": false,
1841
+ "special": true
1842
+ },
1843
+ "128230": {
1844
+ "content": "<|reserved_special_token_222|>",
1845
+ "lstrip": false,
1846
+ "normalized": false,
1847
+ "rstrip": false,
1848
+ "single_word": false,
1849
+ "special": true
1850
+ },
1851
+ "128231": {
1852
+ "content": "<|reserved_special_token_223|>",
1853
+ "lstrip": false,
1854
+ "normalized": false,
1855
+ "rstrip": false,
1856
+ "single_word": false,
1857
+ "special": true
1858
+ },
1859
+ "128232": {
1860
+ "content": "<|reserved_special_token_224|>",
1861
+ "lstrip": false,
1862
+ "normalized": false,
1863
+ "rstrip": false,
1864
+ "single_word": false,
1865
+ "special": true
1866
+ },
1867
+ "128233": {
1868
+ "content": "<|reserved_special_token_225|>",
1869
+ "lstrip": false,
1870
+ "normalized": false,
1871
+ "rstrip": false,
1872
+ "single_word": false,
1873
+ "special": true
1874
+ },
1875
+ "128234": {
1876
+ "content": "<|reserved_special_token_226|>",
1877
+ "lstrip": false,
1878
+ "normalized": false,
1879
+ "rstrip": false,
1880
+ "single_word": false,
1881
+ "special": true
1882
+ },
1883
+ "128235": {
1884
+ "content": "<|reserved_special_token_227|>",
1885
+ "lstrip": false,
1886
+ "normalized": false,
1887
+ "rstrip": false,
1888
+ "single_word": false,
1889
+ "special": true
1890
+ },
1891
+ "128236": {
1892
+ "content": "<|reserved_special_token_228|>",
1893
+ "lstrip": false,
1894
+ "normalized": false,
1895
+ "rstrip": false,
1896
+ "single_word": false,
1897
+ "special": true
1898
+ },
1899
+ "128237": {
1900
+ "content": "<|reserved_special_token_229|>",
1901
+ "lstrip": false,
1902
+ "normalized": false,
1903
+ "rstrip": false,
1904
+ "single_word": false,
1905
+ "special": true
1906
+ },
1907
+ "128238": {
1908
+ "content": "<|reserved_special_token_230|>",
1909
+ "lstrip": false,
1910
+ "normalized": false,
1911
+ "rstrip": false,
1912
+ "single_word": false,
1913
+ "special": true
1914
+ },
1915
+ "128239": {
1916
+ "content": "<|reserved_special_token_231|>",
1917
+ "lstrip": false,
1918
+ "normalized": false,
1919
+ "rstrip": false,
1920
+ "single_word": false,
1921
+ "special": true
1922
+ },
1923
+ "128240": {
1924
+ "content": "<|reserved_special_token_232|>",
1925
+ "lstrip": false,
1926
+ "normalized": false,
1927
+ "rstrip": false,
1928
+ "single_word": false,
1929
+ "special": true
1930
+ },
1931
+ "128241": {
1932
+ "content": "<|reserved_special_token_233|>",
1933
+ "lstrip": false,
1934
+ "normalized": false,
1935
+ "rstrip": false,
1936
+ "single_word": false,
1937
+ "special": true
1938
+ },
1939
+ "128242": {
1940
+ "content": "<|reserved_special_token_234|>",
1941
+ "lstrip": false,
1942
+ "normalized": false,
1943
+ "rstrip": false,
1944
+ "single_word": false,
1945
+ "special": true
1946
+ },
1947
+ "128243": {
1948
+ "content": "<|reserved_special_token_235|>",
1949
+ "lstrip": false,
1950
+ "normalized": false,
1951
+ "rstrip": false,
1952
+ "single_word": false,
1953
+ "special": true
1954
+ },
1955
+ "128244": {
1956
+ "content": "<|reserved_special_token_236|>",
1957
+ "lstrip": false,
1958
+ "normalized": false,
1959
+ "rstrip": false,
1960
+ "single_word": false,
1961
+ "special": true
1962
+ },
1963
+ "128245": {
1964
+ "content": "<|reserved_special_token_237|>",
1965
+ "lstrip": false,
1966
+ "normalized": false,
1967
+ "rstrip": false,
1968
+ "single_word": false,
1969
+ "special": true
1970
+ },
1971
+ "128246": {
1972
+ "content": "<|reserved_special_token_238|>",
1973
+ "lstrip": false,
1974
+ "normalized": false,
1975
+ "rstrip": false,
1976
+ "single_word": false,
1977
+ "special": true
1978
+ },
1979
+ "128247": {
1980
+ "content": "<|reserved_special_token_239|>",
1981
+ "lstrip": false,
1982
+ "normalized": false,
1983
+ "rstrip": false,
1984
+ "single_word": false,
1985
+ "special": true
1986
+ },
1987
+ "128248": {
1988
+ "content": "<|reserved_special_token_240|>",
1989
+ "lstrip": false,
1990
+ "normalized": false,
1991
+ "rstrip": false,
1992
+ "single_word": false,
1993
+ "special": true
1994
+ },
1995
+ "128249": {
1996
+ "content": "<|reserved_special_token_241|>",
1997
+ "lstrip": false,
1998
+ "normalized": false,
1999
+ "rstrip": false,
2000
+ "single_word": false,
2001
+ "special": true
2002
+ },
2003
+ "128250": {
2004
+ "content": "<|reserved_special_token_242|>",
2005
+ "lstrip": false,
2006
+ "normalized": false,
2007
+ "rstrip": false,
2008
+ "single_word": false,
2009
+ "special": true
2010
+ },
2011
+ "128251": {
2012
+ "content": "<|reserved_special_token_243|>",
2013
+ "lstrip": false,
2014
+ "normalized": false,
2015
+ "rstrip": false,
2016
+ "single_word": false,
2017
+ "special": true
2018
+ },
2019
+ "128252": {
2020
+ "content": "<|reserved_special_token_244|>",
2021
+ "lstrip": false,
2022
+ "normalized": false,
2023
+ "rstrip": false,
2024
+ "single_word": false,
2025
+ "special": true
2026
+ },
2027
+ "128253": {
2028
+ "content": "<|reserved_special_token_245|>",
2029
+ "lstrip": false,
2030
+ "normalized": false,
2031
+ "rstrip": false,
2032
+ "single_word": false,
2033
+ "special": true
2034
+ },
2035
+ "128254": {
2036
+ "content": "<|reserved_special_token_246|>",
2037
+ "lstrip": false,
2038
+ "normalized": false,
2039
+ "rstrip": false,
2040
+ "single_word": false,
2041
+ "special": true
2042
+ },
2043
+ "128255": {
2044
+ "content": "<|reserved_special_token_247|>",
2045
+ "lstrip": false,
2046
+ "normalized": false,
2047
+ "rstrip": false,
2048
+ "single_word": false,
2049
+ "special": true
2050
+ },
2051
+ "128256": {
2052
+ "content": "<audio>",
2053
+ "lstrip": false,
2054
+ "normalized": false,
2055
+ "rstrip": false,
2056
+ "single_word": false,
2057
+ "special": true
2058
+ }
2059
+ },
2060
+ "additional_special_tokens": [
2061
+ "<audio>"
2062
+ ],
2063
+ "bos_token": null,
2064
+ "clean_up_tokenization_spaces": true,
2065
+ "eos_token": "<|im_end|>",
2066
+ "extra_special_tokens": {},
2067
+ "fast": false,
2068
+ "model_input_names": [
2069
+ "input_ids",
2070
+ "attention_mask"
2071
+ ],
2072
+ "model_max_length": 131072,
2073
+ "pad_token": "<|finetune_right_pad_id|>",
2074
+ "tokenizer_class": "PreTrainedTokenizerFast"
2075
+ }
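
The net effect of this tokenizer config: the base vocabulary's block of unused `<|reserved_special_token_*|>` placeholders is kept as-is, and a single new special token, `<audio>` (id 128256), is registered in `additional_special_tokens`. This is the placeholder the ASR model splices projected audio embeddings into. A minimal sketch of checking the uploaded tokenizer, assuming a local download or this repo's id (the repo path below is illustrative; the token contents, ids, and pad/eos settings come from the config above):

```python
from transformers import AutoTokenizer

# Illustrative path -- point this at the uploaded checkpoint directory.
tokenizer = AutoTokenizer.from_pretrained("mrbeniwal/tiny-audio-asr")

# <audio> is registered as an additional special token at id 128256.
assert "<audio>" in tokenizer.additional_special_tokens
print(tokenizer.convert_tokens_to_ids("<audio>"))  # 128256

# Padding and end-of-sequence tokens follow the config above.
print(tokenizer.pad_token)  # <|finetune_right_pad_id|>
print(tokenizer.eos_token)  # <|im_end|>

# In a prompt, <audio> marks where audio embeddings replace text embeddings.
ids = tokenizer("Transcribe the following audio: <audio>")["input_ids"]
```

Because `<audio>` is a special token (`"normalized": false`, `"special": true`), it tokenizes to exactly one id and is never split or lowercased, which is what makes it a reliable splice point for the projector's output.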