mazesmazes committed
Commit 6718653 · verified · 1 Parent(s): 8b4caf3

Update custom model files, README, and requirements

Files changed (5)
  1. README.md +28 -22
  2. asr_config.py +1 -15
  3. asr_modeling.py +10 -46
  4. asr_pipeline.py +18 -1
  5. projectors.py +527 -0
README.md CHANGED
@@ -14,40 +14,41 @@ tags:
 - audio
 - smollm
 - whisper
-- moe
+- mlp
 ---
 
-# Tiny Audio Model Card
+# Tiny Audio
 
-This model was born from a simple idea: what if anyone could train a powerful, modern speech recognition model for the price of a few coffees? This model is the result of the [Tiny Audio course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md), a free, hands-on guide to building your own ASR system from scratch.
-
-## The Story of this Model
-
-This model isn't the product of a massive research lab with an unlimited budget. It's the result of a 24-hour training run on a single GPU, made possible by an efficient projector-only training approach. By combining the strengths of OpenAI's Whisper encoder (`openai/whisper-large-v3-turbo`) and a powerful language model (`HuggingFaceTB/SmolLM3-3B`), and only training a Mixture of Simple Adapters (MOSA) projector between them, we can create a high-quality ASR model with minimal resources.
-
-This model is a testament to the power of open-source and the incredible tools and models that are now available to everyone.
+A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with the [Tiny Audio](https://github.com/alexkroman/tiny-audio) codebase—a minimal, hackable framework for training ASR models.
 
 ## Architecture
 
 ```
-Audio (16kHz) → Whisper Encoder (frozen) → MoE Projector (trainable) → SmolLM3-3B (frozen) → Text
+Audio (16kHz) → Whisper Encoder (frozen) → MLP Projector (trained) → SmolLM3-3B (frozen) → Text
 ```
 
-**MoE Projector (MOSA):**
+**MLP Projector:**
 - Convolutional downsampling: 4x sequence compression via two stride-2 conv layers
-- Router: Linear→ReLU→Linear with dense softmax over 4 experts
-- Experts: 4 adapters, each Linear→ReLU→Linear (2048→4096→2048)
+- Linear (1280 → 2048) → GELU → Linear (2048 → 2048)
 - Output normalization: RMSNorm
 
-## Intended Use
+## Training Details
 
-This model is for you. It's for the curious, the builders, the learners. It's for anyone who wants to understand how modern AI works by getting their hands dirty. Use it to transcribe your podcasts, your meetings, your voice memos. But more importantly, use it as a starting point. Fork it, fine-tune it, break it, and make it your own.
+| | |
+|---|---|
+| **Dataset** | LoquaciousSet (25,000 hours) |
+| **Hardware** | Single NVIDIA A40 40GB |
+| **Training Time** | ~24 hours |
+| **Cost** | ~$12 |
+| **Trainable Parameters** | ~12M (projector only) |
 
 ## Performance
 
-This model achieves a Word Error Rate (WER) of **12.14%** on the LoquaciousSet test set. It's not perfect, but it's a solid baseline that you can build on. See how it compares to other models on the [community leaderboard](https://github.com/alexkroman/tiny-audio#leaderboard).
+**Word Error Rate (WER): 12.14%** on the LoquaciousSet test set.
 
-## How to Use
+See the [community leaderboard](https://github.com/alexkroman/tiny-audio#leaderboard) for comparisons.
+
+## Usage
 
 ```python
 from transformers import pipeline
@@ -58,10 +59,15 @@ result = pipe("path/to/audio.wav")
 print(result["text"])
 ```
 
-## How to Get Involved
-
-This project is more than just a model; it's a community. Here's how you can get involved:
-
-- **Take the course**: The best way to start is to go through the [free 6-hour course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md) and train your own model.
-- **Share your results**: Add your model to the [leaderboard](https://github.com/alexkroman/tiny-audio#leaderboard) and share what you've learned.
-- **Join the conversation**: Ask questions, share your ideas, and connect with other builders in the [GitHub Discussions](https://github.com/alexkroman/tiny-audio/discussions).
+## Limitations
+
+- English only
+- Optimized for 16kHz audio; other sample rates are resampled automatically
+- Performance may degrade on heavily accented speech, noisy environments, or domain-specific jargon
+- Maximum audio length limited by context window
+
+## Learn More
+
+- **[Train your own model](https://github.com/alexkroman/tiny-audio)** — The full codebase with training scripts
+- **[Free 3-hour course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md)** — Build your own ASR system from scratch
+- **[Submit to leaderboard](https://github.com/alexkroman/tiny-audio#leaderboard)** — Share your trained model
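
The diff context above elides the middle of the README's usage block. A minimal sketch of the full invocation, consistent with the visible context lines: the model id `mazesmazes/tiny-audio` is a hypothetical placeholder (only the committer name is known here), and `trust_remote_code=True` is assumed because the repo ships custom `ASRModel`/`ASRPipeline` classes.

```python
from transformers import pipeline

# "mazesmazes/tiny-audio" is a placeholder model id, not confirmed by this commit.
# trust_remote_code=True lets transformers load the repo's custom pipeline classes.
pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
)

result = pipe("path/to/audio.wav")
print(result["text"])
```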
asr_config.py CHANGED
@@ -37,24 +37,17 @@ class ASRConfig(transformers.PretrainedConfig):
         inference_warmup_tokens: int = 10,
         max_new_tokens: Optional[int] = None,
         min_new_tokens: Optional[int] = None,
-        do_sample: Optional[bool] = None,
-        temperature: Optional[float] = None,
-        top_k: Optional[int] = None,
-        top_p: Optional[float] = None,
         repetition_penalty: Optional[float] = None,
         length_penalty: Optional[float] = None,
         no_repeat_ngram_size: Optional[int] = None,
-        early_stopping: Optional[bool] = None,
         use_cache: Optional[bool] = None,
         **kwargs,
     ):
-        # Set default generation parameters
+        # Set default generation parameters (greedy decoding only)
         generation_defaults = {
             "num_beams": 1,
             "max_new_tokens": 96,
             "min_new_tokens": 0,
-            "do_sample": False,
-            "temperature": 0.1,
             "repetition_penalty": 1.0,
             "length_penalty": 1.0,
             "no_repeat_ngram_size": 0,
@@ -98,7 +91,6 @@ class ASRConfig(transformers.PretrainedConfig):
         self.min_new_tokens = (
             min_new_tokens if min_new_tokens is not None else generation_defaults["min_new_tokens"]
         )
-        self.do_sample = do_sample if do_sample is not None else generation_defaults["do_sample"]
         self.repetition_penalty = (
             repetition_penalty
             if repetition_penalty is not None
@@ -113,12 +105,6 @@ class ASRConfig(transformers.PretrainedConfig):
             else generation_defaults["no_repeat_ngram_size"]
        )
         self.use_cache = use_cache if use_cache is not None else generation_defaults["use_cache"]
-        self.temperature = (
-            temperature if temperature is not None else generation_defaults["temperature"]
-        )
-        self.top_k = top_k
-        self.top_p = top_p
-        self.early_stopping = early_stopping
 
         if "audio_config" not in kwargs:
             self.audio_config = transformers.AutoConfig.from_pretrained(audio_model_id)
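
The constructor above resolves every generation argument with the same `value if value is not None else generation_defaults[key]` pattern, so only an explicit `None` falls back to a default. A minimal standalone sketch of that behavior (names here are illustrative, not from the repo):

```python
from typing import Optional

GENERATION_DEFAULTS = {"num_beams": 1, "max_new_tokens": 96, "min_new_tokens": 0}


def resolve(value: Optional[int], key: str) -> int:
    # Fall back to the default only when the caller passed None,
    # so explicit values (including 0) always win.
    return value if value is not None else GENERATION_DEFAULTS[key]


assert resolve(None, "max_new_tokens") == 96  # default applied
assert resolve(32, "max_new_tokens") == 32    # explicit value preserved
assert resolve(0, "min_new_tokens") == 0      # falsy but explicit, still kept
```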
asr_modeling.py CHANGED
@@ -19,27 +19,10 @@ from transformers.models.whisper.modeling_whisper import (
 
 try:
     from .asr_config import ASRConfig
-    from .mlp_projector import MLPAudioProjector
-    from .moe_projector import MoEAudioProjector
-    from .residual_projector import ResidualAudioProjector
-    from .shared_moe_projector import SharedMoEAudioProjector
-    from .swiglu_projector import AudioProjector
+    from .projectors import PROJECTOR_CLASSES
 except ImportError:
     from asr_config import ASRConfig  # type: ignore[no-redef]
-    from mlp_projector import MLPAudioProjector  # type: ignore[no-redef]
-    from moe_projector import MoEAudioProjector  # type: ignore[no-redef]
-    from residual_projector import ResidualAudioProjector  # type: ignore[no-redef]
-    from shared_moe_projector import SharedMoEAudioProjector  # type: ignore[no-redef]
-    from swiglu_projector import AudioProjector  # type: ignore[no-redef]
-
-# Map projector type names to classes
-PROJECTOR_CLASSES = {
-    "swiglu": AudioProjector,
-    "residual": ResidualAudioProjector,
-    "moe": MoEAudioProjector,
-    "shared_moe": SharedMoEAudioProjector,
-    "mlp": MLPAudioProjector,
-}
+    from projectors import PROJECTOR_CLASSES  # type: ignore[no-redef]
 
 
 class ASRModel(PreTrainedModel, GenerationMixin):
@@ -112,26 +95,15 @@ class ASRModel(PreTrainedModel, GenerationMixin):
         # Initialize tokenizer and special tokens
         self._init_tokenizer(config)
 
-        # Set up generation config with our defaults
+        # Set up generation config with greedy decoding defaults
         self.generation_config = self.language_model.generation_config
         self.generation_config.max_new_tokens = config.max_new_tokens
         self.generation_config.num_beams = config.num_beams
-        self.generation_config.do_sample = config.do_sample
+        self.generation_config.do_sample = False
         self.generation_config.use_cache = config.use_cache
         self.generation_config.length_penalty = config.length_penalty
         self.generation_config.repetition_penalty = config.repetition_penalty
         self.generation_config.no_repeat_ngram_size = config.no_repeat_ngram_size
-        # Only set sampling params when do_sample=True, otherwise clear them
-        if config.do_sample:
-            self.generation_config.temperature = config.temperature
-            if config.top_k is not None:
-                self.generation_config.top_k = config.top_k
-            if config.top_p is not None:
-                self.generation_config.top_p = config.top_p
-        else:
-            self.generation_config.temperature = None
-            self.generation_config.top_k = None
-            self.generation_config.top_p = None
         self.generation_config.eos_token_id = self.tokenizer.convert_tokens_to_ids("<|im_end|>")
         self.generation_config.pad_token_id = self.tokenizer.pad_token_id
 
@@ -209,7 +181,7 @@ class ASRModel(PreTrainedModel, GenerationMixin):
             raise ValueError("Could not auto-detect llm_dim. Please specify in config.")
 
         # Select projector type based on config
-        projector_type = getattr(config, "projector_type", "moe")
+        projector_type = getattr(config, "projector_type", "mlp")
         projector_class = PROJECTOR_CLASSES.get(projector_type)
         if projector_class is None:
             raise ValueError(
@@ -262,7 +234,9 @@ class ASRModel(PreTrainedModel, GenerationMixin):
         if hasattr(self.language_model, "_set_gradient_checkpointing"):
             self.language_model._set_gradient_checkpointing(enable, gradient_checkpointing_func)
         elif hasattr(self.language_model, "gradient_checkpointing_enable") and enable:
-            self.language_model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
+            self.language_model.gradient_checkpointing_enable(
+                gradient_checkpointing_kwargs={"use_reentrant": False}
+            )
         elif hasattr(self.language_model, "gradient_checkpointing_disable") and not enable:
             self.language_model.gradient_checkpointing_disable()
 
@@ -562,18 +536,8 @@ class ASRModel(PreTrainedModel, GenerationMixin):
         src_dir = PathlibPath(__file__).parent
         for asr_file in src_dir.glob("asr_*.py"):
             shutil.copy(asr_file, save_dir / asr_file.name)
-        # Copy projector files
-        projector_files = [
-            "mlp_projector.py",
-            "moe_projector.py",
-            "residual_projector.py",
-            "swiglu_projector.py",
-            "shared_moe_projector.py",
-        ]
-        for projector_file in projector_files:
-            src_path = src_dir / projector_file
-            if src_path.exists():
-                shutil.copy(src_path, save_dir / projector_file)
+        # Copy projectors module
+        shutil.copy(src_dir / "projectors.py", save_dir / "projectors.py")
 
 
 # Register with transformers Auto classes
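
After this change, projector selection is a plain dictionary lookup against the consolidated registry, with "mlp" as the new default. A sketch of that dispatch path, assuming an illustrative config object (1280 and 2048 follow the README's encoder and LLM widths):

```python
from types import SimpleNamespace

from projectors import PROJECTOR_CLASSES

# Illustrative config: 1280 is whisper-large-v3-turbo's encoder width,
# 2048 is SmolLM3-3B's hidden size, per the README above.
config = SimpleNamespace(encoder_dim=1280, llm_dim=2048, projector_type="mlp")

projector_type = getattr(config, "projector_type", "mlp")
projector_class = PROJECTOR_CLASSES.get(projector_type)
if projector_class is None:
    raise ValueError(f"Unknown projector_type: {projector_type!r}")

projector = projector_class(config)  # here: MLPAudioProjector
```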
asr_pipeline.py CHANGED
@@ -1,5 +1,6 @@
 from typing import Any
 
+import numpy as np
 import torch
 import transformers
 
@@ -9,6 +10,14 @@ except ImportError:
     from asr_modeling import ASRModel  # type: ignore[no-redef]
 
 
+def normalize_audio(audio: np.ndarray, target_peak: float = 0.95) -> np.ndarray:
+    """Normalize audio to target peak amplitude for consistent input levels."""
+    max_val = np.abs(audio).max()
+    if max_val > 0:
+        return audio / max_val * target_peak
+    return audio
+
+
 class ASRPipeline(transformers.AutomaticSpeechRecognitionPipeline):
     """ASR Pipeline for audio-to-text transcription."""
 
@@ -28,10 +37,18 @@ class ASRPipeline(transformers.AutomaticSpeechRecognitionPipeline):
     def preprocess(self, inputs, **preprocess_params):
         # Handle dict with "array" key (from datasets)
         if isinstance(inputs, dict) and "array" in inputs:
+            audio = inputs["array"]
+            if isinstance(audio, np.ndarray):
+                audio = normalize_audio(audio)
             inputs = {
-                "raw": inputs["array"],
+                "raw": audio,
                 "sampling_rate": inputs.get("sampling_rate", self.feature_extractor.sampling_rate),
             }
+        # Handle dict with "raw" key
+        elif isinstance(inputs, dict) and "raw" in inputs:
+            audio = inputs["raw"]
+            if isinstance(audio, np.ndarray):
+                inputs["raw"] = normalize_audio(audio)
 
         for item in super().preprocess(inputs, **preprocess_params):
             if "is_last" not in item:
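
The new `preprocess` branches peak-normalize raw numpy audio before delegating to the parent pipeline, so quiet recordings reach the encoder at a consistent level. A standalone check of what `normalize_audio` does (the helper is copied verbatim from the diff above):

```python
import numpy as np


def normalize_audio(audio: np.ndarray, target_peak: float = 0.95) -> np.ndarray:
    """Normalize audio to target peak amplitude for consistent input levels."""
    max_val = np.abs(audio).max()
    if max_val > 0:
        return audio / max_val * target_peak
    return audio


quiet = 0.01 * np.sin(np.linspace(0, 2 * np.pi, 16000))  # peak amplitude ~0.01
boosted = normalize_audio(quiet)
print(np.abs(quiet).max(), np.abs(boosted).max())  # ~0.01 -> 0.95; all-zero input passes through unchanged
```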
projectors.py ADDED
@@ -0,0 +1,527 @@
+"""Audio projector modules for bridging encoder and decoder embeddings.
+
+This module contains all projector architectures:
+- MLPAudioProjector: Simple 2-layer MLP with conv downsampling
+- MoEAudioProjector: MOSA-style dense mixture of experts
+- SwiGLUAudioProjector: SwiGLU-based projector with temporal pooling
+- ResidualAudioProjector: Residual MLP blocks with linear projection
+- SharedMoEAudioProjector: Shared expert + sparse routed experts
+"""
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F  # noqa: N812
+from transformers.models.llama.modeling_llama import LlamaRMSNorm
+
+# =============================================================================
+# MLP Projector
+# =============================================================================
+
+
+class MLPAudioProjector(nn.Module):
+    """2-layer MLP projector with conv-based 2x temporal downsampling."""
+
+    def __init__(self, config):
+        super().__init__()
+
+        encoder_dim = getattr(config, "encoder_dim", 768)
+        llm_dim = getattr(config, "llm_dim", 2048)
+
+        self.downsample = nn.Conv1d(
+            encoder_dim, encoder_dim, kernel_size=3, stride=2, padding=1, bias=False
+        )
+        self.linear_1 = nn.Linear(encoder_dim, llm_dim, bias=False)
+        self.act = nn.GELU()
+        self.linear_2 = nn.Linear(llm_dim, llm_dim, bias=False)
+
+        self.apply(self._init_weights)
+
+    def _init_weights(self, module):
+        if isinstance(module, nn.Linear):
+            nn.init.normal_(module.weight, mean=0.0, std=0.02)
+        elif isinstance(module, nn.Conv1d):
+            nn.init.normal_(module.weight, mean=0.0, std=0.02)
+            if module.bias is not None:
+                nn.init.zeros_(module.bias)
+
+    def forward(self, x):
+        """
+        x: [Batch, Seq_Len, Dim]
+        Returns: [Batch, Seq_Len // 2, llm_dim]
+        """
+        # Conv1d expects [Batch, Channels, Seq_Len]
+        x = x.transpose(1, 2)
+        x = self.downsample(x)
+        x = x.transpose(1, 2)
+
+        x = self.linear_1(x)
+        x = self.act(x)
+        return self.linear_2(x)
+
+
+# =============================================================================
+# MoE Projector (MOSA-style)
+# =============================================================================
+
+
+class SimpleAdapter(nn.Module):
+    """Simple adapter: Linear -> ReLU -> Dropout -> Linear."""
+
+    def __init__(self, in_features, hidden_features, out_features, dropout=0.0):
+        super().__init__()
+        self.fc1 = nn.Linear(in_features, hidden_features)
+        self.relu = nn.ReLU()
+        self.dropout = nn.Dropout(dropout)
+        self.fc2 = nn.Linear(hidden_features, out_features)
+
+    def forward(self, x):
+        x = self.fc1(x)
+        x = self.relu(x)
+        x = self.dropout(x)
+        return self.fc2(x)
+
+
+class MoEAudioProjector(nn.Module):
+    """
+    MOSA-style projector: Mixture of Simple Adapters.
+
+    From paper (arXiv:2508.18998):
+    - Dense mixture (softmax over ALL experts) instead of sparse Top-K
+    - Simple Linear->ReLU->Linear adapters
+    - No auxiliary losses - just cross-entropy on transcripts
+    - Conv downsampling: stride 4 total (two conv layers, stride 2 each)
+    """
+
+    def __init__(self, config):
+        super().__init__()
+
+        self.encoder_dim = config.encoder_dim
+        self.llm_dim = config.llm_dim
+        self.num_experts = getattr(config, "num_experts", 4)
+        adapter_hidden = getattr(config, "projector_hidden_dim", None) or 4096
+        self.dropout_rate = getattr(config, "projector_dropout", 0.1)
+
+        # Convolutional Subsampling (stride 4 total)
+        self.conv = nn.Sequential(
+            nn.Conv1d(self.encoder_dim, self.llm_dim, kernel_size=3, stride=2, padding=1),
+            nn.ReLU(),
+            nn.Conv1d(self.llm_dim, self.llm_dim, kernel_size=3, stride=2, padding=1),
+            nn.ReLU(),
+        )
+
+        # Router
+        router_hidden = 512
+        self.router = nn.Sequential(
+            nn.Linear(self.encoder_dim, router_hidden),
+            nn.ReLU(),
+            nn.Linear(router_hidden, self.num_experts),
+        )
+
+        # Experts
+        self.experts = nn.ModuleList(
+            [
+                SimpleAdapter(self.llm_dim, adapter_hidden, self.llm_dim, dropout=self.dropout_rate)
+                for _ in range(self.num_experts)
+            ]
+        )
+
+        self.ln_post = LlamaRMSNorm(self.llm_dim, eps=1e-6)
+        self._init_weights()
+
+    def _init_weights(self):
+        std = 0.02
+        with torch.no_grad():
+            for module in self.conv:
+                if isinstance(module, nn.Conv1d):
+                    nn.init.normal_(module.weight, mean=0.0, std=std)
+                    if module.bias is not None:
+                        nn.init.zeros_(module.bias)
+
+            for module in self.router:
+                if isinstance(module, nn.Linear):
+                    nn.init.normal_(module.weight, mean=0.0, std=std)
+                    if module.bias is not None:
+                        nn.init.zeros_(module.bias)
+
+            for expert in self.experts:
+                nn.init.normal_(expert.fc1.weight, mean=0.0, std=std)
+                nn.init.normal_(expert.fc2.weight, mean=0.0, std=std)
+                if expert.fc1.bias is not None:
+                    nn.init.zeros_(expert.fc1.bias)
+                if expert.fc2.bias is not None:
+                    nn.init.zeros_(expert.fc2.bias)
+
+            self.ln_post.weight.data.fill_(1.0)
+
+    def forward(self, x):
+        batch_size, seq_len, _ = x.shape
+
+        # Pad to be divisible by stride (4)
+        pad_amt = (4 - (seq_len % 4)) % 4
+        if pad_amt > 0:
+            x = F.pad(x, (0, 0, 0, pad_amt))
+            seq_len = x.shape[1]
+
+        # Convolutional Downsampling
+        h_conv = self.conv(x.permute(0, 2, 1)).permute(0, 2, 1)
+
+        # Router on high-res input, then downsample weights
+        router_logits = self.router(x)
+        router_logits = router_logits.view(batch_size, seq_len // 4, 4, self.num_experts).mean(
+            dim=2
+        )
+        routing_weights = F.softmax(router_logits, dim=-1)
+
+        # Weighted sum of expert outputs
+        final_out = torch.zeros_like(h_conv)
+        for i, expert in enumerate(self.experts):
+            expert_out = expert(h_conv)
+            expert_weight = routing_weights[:, :, i : i + 1]
+            final_out.add_(expert_out * expert_weight)
+
+        return self.ln_post(final_out)
+
+    def get_aux_loss(self) -> torch.Tensor:
+        """Return auxiliary loss (none for dense MoE)."""
+        return torch.tensor(0.0)
+
+
+# =============================================================================
+# SwiGLU Projector
+# =============================================================================
+
+
+class SwiGLU(nn.Module):
+    def __init__(self, in_features, hidden_features, out_features, bias=False, dropout=0.0):
+        super().__init__()
+        self.w1 = nn.Linear(in_features, hidden_features, bias=bias)
+        self.w2 = nn.Linear(in_features, hidden_features, bias=bias)
+        self.w3 = nn.Linear(hidden_features, out_features, bias=bias)
+        self.act = nn.SiLU()
+        self.dropout = nn.Dropout(dropout)
+
+    def forward(self, x):
+        x_gate = self.act(self.w1(x))
+        x_val = self.w2(x)
+        x = x_gate * x_val
+        x = self.dropout(x)
+        return self.w3(x)
+
+
+class SwiGLUAudioProjector(nn.Module):
+    """SwiGLU-based projector with temporal pooling."""
+
+    def __init__(self, config):
+        super().__init__()
+        self.k = getattr(config, "projector_pool_stride", 4)
+        in_dim = config.encoder_dim * self.k
+        out_dim = config.llm_dim
+        hidden_dim = config.projector_hidden_dim
+        if hidden_dim is None:
+            hidden_dim = config.encoder_dim * 2
+
+        dropout_rate = getattr(config, "projector_dropout", 0.0)
+
+        self.proj1 = SwiGLU(in_dim, hidden_dim, hidden_dim, dropout=dropout_rate)
+        self.proj2 = SwiGLU(hidden_dim, hidden_dim, out_dim, dropout=dropout_rate)
+        self.output_dropout = nn.Dropout(dropout_rate)
+
+        with torch.no_grad():
+            std = getattr(config, "projector_init_std", 0.02)
+            nn.init.normal_(self.proj1.w1.weight, mean=0.0, std=std)
+            nn.init.normal_(self.proj1.w2.weight, mean=0.0, std=std)
+            nn.init.normal_(self.proj1.w3.weight, mean=0.0, std=std)
+            nn.init.normal_(self.proj2.w1.weight, mean=0.0, std=std)
+            nn.init.normal_(self.proj2.w2.weight, mean=0.0, std=std)
+            nn.init.normal_(self.proj2.w3.weight, mean=0.0, std=std)
+
+    def forward(self, x):
+        batch_size, seq_len, dim = x.size()
+
+        target_dtype = self.proj1.w1.weight.dtype
+        if x.dtype != target_dtype:
+            x = x.to(target_dtype)
+
+        remainder = seq_len % self.k
+        if remainder:
+            pad_len = self.k - remainder
+            x = F.pad(x, (0, 0, 0, pad_len))
+
+        x = x.contiguous().view(batch_size, -1, dim * self.k)
+        x = self.proj1(x)
+        x = self.proj2(x)
+
+        return self.output_dropout(x)
+
+
+# Alias for backwards compatibility
+AudioProjector = SwiGLUAudioProjector
+
+
+# =============================================================================
+# Residual Projector
+# =============================================================================
+
+
+class ResidualMLP(nn.Module):
+    """MLP block with residual connection: Output = x + MLP(x)."""
+
+    def __init__(self, dim, hidden_dim, dropout=0.0):
+        super().__init__()
+        self.fc1 = nn.Linear(dim, hidden_dim)
+        self.fc2 = nn.Linear(hidden_dim, dim)
+        self.act = nn.GELU()
+        self.dropout = nn.Dropout(dropout)
+
+    def forward(self, x):
+        residual = x
+        x = self.fc1(x)
+        x = self.act(x)
+        x = self.dropout(x)
+        x = self.fc2(x)
+        x = self.dropout(x)
+        return residual + x
+
+
+class ResidualAudioProjector(nn.Module):
+    """Residual MLP projector for audio-to-LLM feature translation."""
+
+    def __init__(self, config):
+        super().__init__()
+
+        self.k = getattr(config, "projector_pool_stride", 4)
+        in_dim = config.encoder_dim * self.k
+        out_dim = config.llm_dim
+        hidden_dim = getattr(config, "projector_hidden_dim", None) or out_dim * 4
+        self.num_layers = getattr(config, "projector_num_layers", 2)
+        dropout_rate = getattr(config, "projector_dropout", 0.0)
+
+        self.input_proj = nn.Linear(in_dim, out_dim)
+        self.ln_input = LlamaRMSNorm(out_dim, eps=1e-6)
+
+        self.layers = nn.ModuleList(
+            [ResidualMLP(out_dim, hidden_dim, dropout=dropout_rate) for _ in range(self.num_layers)]
+        )
+        self.layer_norms = nn.ModuleList(
+            [LlamaRMSNorm(out_dim, eps=1e-6) for _ in range(self.num_layers)]
+        )
+
+        self.output_dropout = nn.Dropout(dropout_rate)
+        self._init_weights(config)
+
+    def _init_weights(self, config):
+        std = getattr(config, "projector_init_std", 0.02)
+
+        with torch.no_grad():
+            nn.init.normal_(self.input_proj.weight, mean=0.0, std=std)
+            if self.input_proj.bias is not None:
+                nn.init.zeros_(self.input_proj.bias)
+
+            self.ln_input.weight.data.fill_(1.0)
+            for ln in self.layer_norms:
+                ln.weight.data.fill_(1.0)
+
+            for layer in self.layers:
+                nn.init.normal_(layer.fc1.weight, mean=0.0, std=std)
+                nn.init.normal_(layer.fc2.weight, mean=0.0, std=std * 0.1)
+                if layer.fc1.bias is not None:
+                    nn.init.zeros_(layer.fc1.bias)
+                if layer.fc2.bias is not None:
+                    nn.init.zeros_(layer.fc2.bias)
+
+    def forward(self, x):
+        batch_size, seq_len, dim = x.size()
+
+        target_dtype = self.input_proj.weight.dtype
+        if x.dtype != target_dtype:
+            x = x.to(target_dtype)
+
+        remainder = seq_len % self.k
+        if remainder:
+            pad_len = self.k - remainder
+            x = F.pad(x, (0, 0, 0, pad_len))
+
+        x = x.contiguous().view(batch_size, -1, dim * self.k)
+        x = self.input_proj(x)
+        x = self.ln_input(x)
+
+        for layer, ln in zip(self.layers, self.layer_norms):
+            x = layer(x)
+            x = ln(x)
+
+        return self.output_dropout(x)
+
+
+# =============================================================================
+# Shared MoE Projector
+# =============================================================================
+
+
+class SwiGLUExpert(nn.Module):
+    """SwiGLU expert MLP."""
+
+    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
+        super().__init__()
+        self.gate_proj = nn.Linear(input_dim, hidden_dim, bias=False)
+        self.up_proj = nn.Linear(input_dim, hidden_dim, bias=False)
+        self.down_proj = nn.Linear(hidden_dim, output_dim, bias=False)
+        self.act = nn.SiLU()
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))
+
+
+class SharedMoEBlock(nn.Module):
+    """MoE block with shared expert + sparse routed experts."""
+
+    def __init__(
+        self,
+        input_dim: int,
+        hidden_dim: int,
+        output_dim: int,
+        num_experts: int = 4,
+        top_k: int = 2,
+    ):
+        super().__init__()
+        self.num_experts = num_experts
+        self.top_k = top_k
+        self.output_dim = output_dim
+
+        self.router = nn.Linear(input_dim, num_experts, bias=False)
+        nn.init.zeros_(self.router.weight)
+
+        self.shared_expert = SwiGLUExpert(input_dim, hidden_dim, output_dim)
+        self.experts = nn.ModuleList(
+            [SwiGLUExpert(input_dim, hidden_dim, output_dim) for _ in range(num_experts)]
+        )
+
+        self.last_router_logits = None
+        self.last_router_probs = None
+
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        batch_size, seq_len, dim = hidden_states.shape
+
+        shared_out = self.shared_expert(hidden_states)
+
+        flat_hidden = hidden_states.view(-1, dim)
+        router_logits = self.router(flat_hidden)
+        router_probs = F.softmax(router_logits.float(), dim=-1)
+
+        self.last_router_logits = router_logits
+        self.last_router_probs = router_probs
+
+        top_k_weights, top_k_indices = torch.topk(router_probs, self.top_k, dim=-1)
+        top_k_weights = top_k_weights / top_k_weights.sum(dim=-1, keepdim=True)
+        top_k_weights = top_k_weights.to(hidden_states.dtype)
+
+        routed_out = self._dispatch_experts(flat_hidden, top_k_indices, top_k_weights)
+        routed_out = routed_out.view(batch_size, seq_len, -1)
+
+        return shared_out + routed_out
+
+    def _dispatch_experts(
+        self,
+        hidden_states: torch.Tensor,
+        top_k_indices: torch.Tensor,
+        top_k_weights: torch.Tensor,
+    ) -> torch.Tensor:
+        num_tokens = hidden_states.shape[0]
+        output = torch.zeros(
+            num_tokens, self.output_dim, device=hidden_states.device, dtype=hidden_states.dtype
+        )
+
+        for expert_idx, expert in enumerate(self.experts):
+            expert_mask = top_k_indices == expert_idx
+            if not expert_mask.any():
+                continue
+
+            token_indices, slot_indices = torch.where(expert_mask)
+            expert_input = hidden_states[token_indices]
+            expert_output = expert(expert_input)
+            weights = top_k_weights[token_indices, slot_indices].unsqueeze(-1)
+            output.index_add_(0, token_indices, expert_output * weights)
+
+        return output
+
+
+def load_balancing_loss(router_probs: torch.Tensor, num_experts: int, top_k: int) -> torch.Tensor:
+    """Auxiliary loss to encourage balanced expert usage."""
+    _, selected = torch.topk(router_probs, top_k, dim=-1)
+    expert_mask = F.one_hot(selected, num_experts).float()
+    tokens_per_expert = expert_mask.mean(dim=(0, 1))
+    prob_per_expert = router_probs.mean(dim=0)
+    return (tokens_per_expert * prob_per_expert).sum() * num_experts
+
+
+def z_loss(router_logits: torch.Tensor) -> torch.Tensor:
+    """Z-loss to prevent router logits from growing too large."""
+    return torch.logsumexp(router_logits.float(), dim=-1).square().mean()
+
+
+class SharedMoEAudioProjector(nn.Module):
+    """Shared expert + sparse routed experts projector."""
+
+    def __init__(self, config):
+        super().__init__()
+
+        self.k = getattr(config, "projector_pool_stride", 4)
+
+        encoder_dim = config.encoder_dim
+        in_dim = encoder_dim * self.k
+        out_dim = config.llm_dim
+        hidden_dim = getattr(config, "projector_hidden_dim", None) or in_dim
+
+        self.num_experts = getattr(config, "num_experts", 4)
+        self.top_k = getattr(config, "num_experts_per_tok", 2)
+        self.aux_loss_coef = getattr(config, "router_aux_loss_coef", 0.02)
+        self.z_loss_coef = getattr(config, "router_z_loss_coef", 0.001)
+
+        self.moe = SharedMoEBlock(in_dim, hidden_dim, out_dim, self.num_experts, self.top_k)
+        self._init_weights(in_dim)
+
+    def _init_weights(self, in_dim: int):
+        with torch.no_grad():
+            nn.init.orthogonal_(self.moe.shared_expert.gate_proj.weight)
+            nn.init.orthogonal_(self.moe.shared_expert.up_proj.weight)
+            nn.init.orthogonal_(self.moe.shared_expert.down_proj.weight, gain=0.5)
+
+            for expert in self.moe.experts:
+                nn.init.orthogonal_(expert.gate_proj.weight)
+                nn.init.orthogonal_(expert.up_proj.weight)
+                nn.init.orthogonal_(expert.down_proj.weight, gain=0.01)
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        batch_size, seq_len, dim = x.size()
+
+        target_dtype = self.moe.shared_expert.gate_proj.weight.dtype
+        if x.dtype != target_dtype:
+            x = x.to(target_dtype)
+
+        if seq_len % self.k:
+            x = F.pad(x, (0, 0, 0, self.k - seq_len % self.k))
+
+        x = x.view(batch_size, -1, dim * self.k)
+
+        return self.moe(x)
+
+    def get_aux_loss(self) -> torch.Tensor:
+        if self.moe.last_router_logits is None:
+            return torch.tensor(0.0, device=self.moe.router.weight.device)
+
+        balance = load_balancing_loss(self.moe.last_router_probs, self.num_experts, self.top_k)
+        z = z_loss(self.moe.last_router_logits)
+
+        return self.aux_loss_coef * balance + self.z_loss_coef * z
+
+
+# =============================================================================
+# Projector Registry
+# =============================================================================
+
+PROJECTOR_CLASSES = {
+    "mlp": MLPAudioProjector,
+    "moe": MoEAudioProjector,
+    "swiglu": SwiGLUAudioProjector,
+    "residual": ResidualAudioProjector,
+    "shared_moe": SharedMoEAudioProjector,
+}
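
A quick smoke test of the registry and the default projector's shape behavior; the dummy config is an assumption for illustration (dims follow the README), and the 2x downsampling matches `MLPAudioProjector`'s single stride-2 conv:

```python
from types import SimpleNamespace

import torch

from projectors import PROJECTOR_CLASSES

# Dummy config for illustration; real runs use ASRConfig.
config = SimpleNamespace(encoder_dim=1280, llm_dim=2048)
proj = PROJECTOR_CLASSES["mlp"](config)

x = torch.randn(2, 100, 1280)  # [batch, encoder frames, encoder_dim]
y = proj(x)
print(y.shape)  # torch.Size([2, 50, 2048]): one stride-2 conv halves the sequence
```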