Upload distilled speech model
- README.md +79 -0
- __pycache__/configuration_distilled_speech.cpython-311.pyc +0 -0
- __pycache__/feature_extraction_distilled_speech.cpython-311.pyc +0 -0
- __pycache__/modeling_distilled_speech.cpython-311.pyc +0 -0
- config.json +56 -0
- configuration_distilled_speech.py +148 -0
- feature_extraction_distilled_speech.py +150 -0
- modeling_distilled_speech.py +525 -0
- preprocessor_config.json +6 -0
- pytorch_model.bin +3 -0
README.md
ADDED
@@ -0,0 +1,79 @@
---
license: apache-2.0
language:
- en
tags:
- speech
- audio
- data2vec
- distillation
- feature-extraction
library_name: transformers
pipeline_tag: feature-extraction
---

# Distilled Speech Encoder

A Data2Vec-style bidirectional speech encoder trained via distillation from AuriStream models.

## Model Details

- **Architecture**: 12-layer transformer with RoPE positional encoding
- **Hidden size**: 768
- **Attention heads**: 12
- **Parameters**: ~85M
- **Teacher model**: `TuKoResearch/AuriStream100M_40Pred_BigAudioDataset_500k`
- **Training step**: 100000
- **Input**: 16kHz raw audio waveform
- **Output**: 50Hz contextualized representations (768-dim)

## Usage

```python
from transformers import AutoModel, AutoFeatureExtractor
import torch

# Load model and feature extractor
model = AutoModel.from_pretrained("TuKoResearch/AuriStreamDistill_100M40PredTeacher_librispeech960", trust_remote_code=True)
feature_extractor = AutoFeatureExtractor.from_pretrained("TuKoResearch/AuriStreamDistill_100M40PredTeacher_librispeech960", trust_remote_code=True)

# Prepare audio (16kHz, mono)
audio = torch.randn(16000)  # 1 second of audio

# Extract features
inputs = feature_extractor(audio, return_tensors="pt", sample_rate=16000)
outputs = model(inputs.input_values, output_hidden_states=True)

# Get representations
last_hidden = outputs.last_hidden_state  # (1, 50, 768) for 1 second
all_hidden = outputs.hidden_states  # tuple of 13 tensors
```

## Hidden States

When `output_hidden_states=True`, the model returns hidden states from all layers:
- `hidden_states[0]`: Feature projection output (after conv encoder + projection)
- `hidden_states[1]` to `hidden_states[12]`: Transformer layer outputs
- `hidden_states[12]`: Final transformer layer output (`last_hidden_state` additionally passes this through the model's final layer norm)

This makes the model suitable for linear probing experiments at different layers (see the sketch below).

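As a concrete (hypothetical) probing recipe, the snippet below fits a linear readout on one layer's features; the layer index, class count, and random audio are placeholders for your task:

```python
import torch
from transformers import AutoModel, AutoFeatureExtractor

repo = "TuKoResearch/AuriStreamDistill_100M40PredTeacher_librispeech960"
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
feature_extractor = AutoFeatureExtractor.from_pretrained(repo, trust_remote_code=True)

layer = 9          # which hidden state to probe (0 = feature projection, 1-12 = layers)
num_classes = 40   # placeholder, e.g. a phone inventory
probe = torch.nn.Linear(768, num_classes)

audio = torch.randn(16000)  # stand-in for a real 16kHz clip
inputs = feature_extractor(audio, return_tensors="pt", sample_rate=16000)
with torch.no_grad():  # keep the encoder frozen; train only the probe
    feats = model(inputs.input_values, output_hidden_states=True).hidden_states[layer]
logits = probe(feats)  # (1, 50, num_classes); optimize with your frame labels
```
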
## Training

This model was trained using Data2Vec-style distillation (sketched in code below):
1. A frozen AuriStream teacher model generates target representations
2. The student sees masked audio and learns to predict teacher representations
3. Loss is computed only on masked positions
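
For intuition, here is a minimal sketch of that objective. It is illustrative only: `student_hidden`, `teacher_hidden`, and `mask` are placeholders, the actual training code is not part of this repository, and plain MSE is assumed for the regression loss:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_hidden: torch.Tensor,
                      teacher_hidden: torch.Tensor,
                      mask: torch.Tensor) -> torch.Tensor:
    """Regress frozen-teacher targets at masked frames only (Data2Vec-style).

    student_hidden: (B, T, D) student outputs computed on masked audio
    teacher_hidden: (B, T, D) teacher targets computed on clean audio
    mask:           (B, T) boolean, True where input frames were masked
    """
    return F.mse_loss(student_hidden[mask], teacher_hidden[mask])
```
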
## Citation

If you use this model, please cite:

```bibtex
@misc{distilled_speech_encoder,
  title={Distilled Speech Encoder},
  author={TuKo Research},
  year={2025},
  url={https://huggingface.co/TuKoResearch/AuriStreamDistill_100M40PredTeacher_librispeech960}
}
```
__pycache__/configuration_distilled_speech.cpython-311.pyc
ADDED
Binary file (6.37 kB)

__pycache__/feature_extraction_distilled_speech.cpython-311.pyc
ADDED
Binary file (7.38 kB)

__pycache__/modeling_distilled_speech.cpython-311.pyc
ADDED
Binary file (28.8 kB)

config.json
ADDED
@@ -0,0 +1,56 @@
{
  "hidden_size": 768,
  "num_hidden_layers": 12,
  "num_attention_heads": 12,
  "intermediate_size": 3072,
  "hidden_dropout": 0.1,
  "attention_dropout": 0.1,
  "activation_dropout": 0.0,
  "layer_norm_eps": 1e-05,
  "feat_extract_norm": "group",
  "feat_extract_activation": "gelu",
  "feat_proj_dropout": 0.0,
  "use_rope": true,
  "rope_theta": 10000.0,
  "sample_rate": 16000,
  "teacher_model_name": "TuKoResearch/AuriStream100M_40Pred_BigAudioDataset_500k",
  "teacher_hidden_size": 768,
  "conv_dim": [512, 512, 512, 512, 512, 512, 512],
  "conv_stride": [5, 2, 2, 2, 2, 2, 2],
  "conv_kernel": [10, 3, 3, 3, 3, 2, 2],
  "conv_bias": false,
  "model_type": "distilled_speech",
  "auto_map": {
    "AutoConfig": "configuration_distilled_speech.DistilledSpeechConfig",
    "AutoModel": "modeling_distilled_speech.DistilledSpeechModel",
    "AutoFeatureExtractor": "feature_extraction_distilled_speech.DistilledSpeechFeatureExtractor"
  },
  "architectures": ["DistilledSpeechModel"],
  "training_step": 100000
}
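
The `conv_stride` values above are where the 50Hz rate quoted in the README comes from; a quick sanity check (plain Python, using only values from this config):

```python
# Total downsampling is the product of the conv strides.
strides = [5, 2, 2, 2, 2, 2, 2]
total = 1
for s in strides:
    total *= s
print(total)           # 320 input samples per output frame
print(16000 // total)  # 50 output frames per second at 16kHz
```
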
configuration_distilled_speech.py
ADDED
@@ -0,0 +1,148 @@
"""
HuggingFace Configuration for Distilled Speech Encoder.

This is a Data2Vec-style bidirectional speech encoder distilled from AuriStream.
"""

from transformers import PretrainedConfig


class DistilledSpeechConfig(PretrainedConfig):
    """
    Configuration class for DistilledSpeechModel.

    This is a bidirectional transformer encoder for speech, trained via
    Data2Vec-style distillation from AuriStream models.

    Architecture:
        - 7-layer convolutional feature encoder (16kHz -> 50Hz)
        - N-layer bidirectional transformer with RoPE
        - Optional projection head (for distillation training)

    Args:
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer.
        intermediate_size (`int`, *optional*, defaults to 3072):
            Dimensionality of the "intermediate" (feed-forward) layer.
        hidden_act (`str`, *optional*, defaults to `"gelu"`):
            The non-linear activation function in the encoder.
        hidden_dropout (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers.
        attention_dropout (`float`, *optional*, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        layer_norm_eps (`float`, *optional*, defaults to 1e-5):
            The epsilon used by the layer normalization layers.
        conv_dim (`tuple`, *optional*):
            Tuple of integers defining the number of channels in each conv layer.
        conv_stride (`tuple`, *optional*):
            Tuple of integers defining the stride of each conv layer.
        conv_kernel (`tuple`, *optional*):
            Tuple of integers defining the kernel size of each conv layer.
        conv_bias (`bool`, *optional*, defaults to `False`):
            Whether to use bias in conv layers.
        feat_extract_norm (`str`, *optional*, defaults to `"group"`):
            Normalization type for the first conv layer ("group" or "layer").
        feat_extract_activation (`str`, *optional*, defaults to `"gelu"`):
            Activation function for conv layers.
        feat_proj_dropout (`float`, *optional*, defaults to 0.0):
            Dropout for the feature projection layer.
        use_rope (`bool`, *optional*, defaults to `True`):
            Whether to use Rotary Position Embeddings (RoPE).
        rope_theta (`float`, *optional*, defaults to 10000.0):
            Base frequency for RoPE.
        mask_time_prob (`float`, *optional*, defaults to 0.065):
            Probability of masking time steps (for training).
        mask_time_length (`int`, *optional*, defaults to 10):
            Length of masked time spans (for training).
    """

    model_type = "distilled_speech"

    def __init__(
        self,
        # Transformer architecture
        hidden_size: int = 768,
        num_hidden_layers: int = 12,
        num_attention_heads: int = 12,
        intermediate_size: int = 3072,
        hidden_act: str = "gelu",
        hidden_dropout: float = 0.1,
        attention_dropout: float = 0.1,
        activation_dropout: float = 0.0,
        layer_norm_eps: float = 1e-5,
        # Convolutional feature encoder
        conv_dim: tuple = (512, 512, 512, 512, 512, 512, 512),
        conv_stride: tuple = (5, 2, 2, 2, 2, 2, 2),
        conv_kernel: tuple = (10, 3, 3, 3, 3, 2, 2),
        conv_bias: bool = False,
        feat_extract_norm: str = "group",
        feat_extract_activation: str = "gelu",
        feat_proj_dropout: float = 0.0,
        # Positional encoding
        use_rope: bool = True,
        rope_theta: float = 10000.0,
        # Masking (for training, disabled by default for inference)
        mask_time_prob: float = 0.065,
        mask_time_length: int = 10,
        mask_time_min_masks: int = 2,
        # Teacher info (for reference, not used in inference)
        teacher_model_name: str = None,
        teacher_hidden_size: int = None,
        # Audio
        sample_rate: int = 16000,
        **kwargs,
    ):
        super().__init__(**kwargs)

        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_act = hidden_act
        self.hidden_dropout = hidden_dropout
        self.attention_dropout = attention_dropout
        self.activation_dropout = activation_dropout
        self.layer_norm_eps = layer_norm_eps

        # Conv encoder
        self.conv_dim = list(conv_dim)
        self.conv_stride = list(conv_stride)
        self.conv_kernel = list(conv_kernel)
        self.conv_bias = conv_bias
        self.feat_extract_norm = feat_extract_norm
        self.feat_extract_activation = feat_extract_activation
        self.feat_proj_dropout = feat_proj_dropout

        # Position encoding
        self.use_rope = use_rope
        self.rope_theta = rope_theta

        # Masking
        self.mask_time_prob = mask_time_prob
        self.mask_time_length = mask_time_length
        self.mask_time_min_masks = mask_time_min_masks

        # Teacher info
        self.teacher_model_name = teacher_model_name
        self.teacher_hidden_size = teacher_hidden_size

        # Audio
        self.sample_rate = sample_rate

    @property
    def output_hz(self) -> int:
        """Output frequency of the model in Hz."""
        stride_product = 1
        for s in self.conv_stride:
            stride_product *= s
        return self.sample_rate // stride_product  # 50 Hz for the default config
feature_extraction_distilled_speech.py
ADDED
@@ -0,0 +1,150 @@
"""
Feature extractor for Distilled Speech Model.

Handles audio preprocessing: normalization to zero mean and unit variance.
"""

from typing import List, Optional, Union

import numpy as np
import torch


class DistilledSpeechFeatureExtractor:
    """
    Feature extractor for DistilledSpeechModel.

    Normalizes audio to zero mean and unit variance (per-sample).
    Expected input: 16kHz mono audio.

    Example:
        >>> extractor = DistilledSpeechFeatureExtractor()
        >>> audio = np.random.randn(16000)  # 1 second
        >>> inputs = extractor(audio, return_tensors="pt", sample_rate=16000)
        >>> inputs.input_values.shape
        torch.Size([1, 16000])
    """

    def __init__(
        self,
        sampling_rate: int = 16000,
        do_normalize: bool = True,
        return_attention_mask: bool = False,
    ):
        self.sampling_rate = sampling_rate
        self.do_normalize = do_normalize
        self.return_attention_mask = return_attention_mask

    def __call__(
        self,
        raw_speech: Union[np.ndarray, List[float], torch.Tensor],
        return_tensors: Optional[str] = "pt",
        sample_rate: Optional[int] = None,
        **kwargs,
    ):
        """
        Process raw audio into model inputs.

        Args:
            raw_speech: Raw audio waveform (1D array or tensor)
            return_tensors: "pt" for PyTorch tensors, "np" for numpy
            sample_rate: Sample rate of the input audio (for validation)

        Returns:
            Object with an `input_values` attribute
        """
        # Validate sample rate
        if sample_rate is not None and sample_rate != self.sampling_rate:
            raise ValueError(
                f"Expected sample rate {self.sampling_rate}, got {sample_rate}. "
                f"Please resample your audio to {self.sampling_rate}Hz."
            )

        # Convert to numpy if needed (detach and move to CPU first so .numpy() is safe
        # for CUDA tensors and tensors that require grad)
        if isinstance(raw_speech, torch.Tensor):
            raw_speech = raw_speech.detach().cpu().numpy()
        elif isinstance(raw_speech, list):
            raw_speech = np.array(raw_speech)

        raw_speech = np.asarray(raw_speech, dtype=np.float32)

        # Ensure 1D
        if raw_speech.ndim > 1:
            raw_speech = raw_speech.squeeze()
        if raw_speech.ndim != 1:
            raise ValueError(f"Expected 1D audio, got shape {raw_speech.shape}")

        # Normalize
        if self.do_normalize:
            raw_speech = (raw_speech - raw_speech.mean()) / (raw_speech.std() + 1e-7)

        # Add batch dimension
        raw_speech = raw_speech[np.newaxis, :]

        # Convert to tensors
        if return_tensors == "pt":
            input_values = torch.from_numpy(raw_speech)
        else:
            input_values = raw_speech

        return FeatureExtractorOutput(input_values=input_values)

    def to_dict(self):
        """Serialize to dict for saving."""
        return {
            "sampling_rate": self.sampling_rate,
            "do_normalize": self.do_normalize,
            "return_attention_mask": self.return_attention_mask,
            "feature_extractor_type": "DistilledSpeechFeatureExtractor",
        }

    @classmethod
    def from_dict(cls, config_dict):
        """Load from dict."""
        return cls(
            sampling_rate=config_dict.get("sampling_rate", 16000),
            do_normalize=config_dict.get("do_normalize", True),
            return_attention_mask=config_dict.get("return_attention_mask", False),
        )

    def save_pretrained(self, save_directory: str):
        """Save feature extractor config."""
        import json
        import os
        os.makedirs(save_directory, exist_ok=True)
        with open(os.path.join(save_directory, "preprocessor_config.json"), "w") as f:
            json.dump(self.to_dict(), f, indent=2)

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs):
        """Load feature extractor from a local directory or the Hub."""
        import json
        import os

        if os.path.isdir(pretrained_model_name_or_path):
            config_path = os.path.join(pretrained_model_name_or_path, "preprocessor_config.json")
        else:
            # Try to download from the Hub
            from huggingface_hub import hf_hub_download
            config_path = hf_hub_download(
                repo_id=pretrained_model_name_or_path,
                filename="preprocessor_config.json",
            )

        with open(config_path, "r") as f:
            config = json.load(f)

        return cls.from_dict(config)


class FeatureExtractorOutput:
    """Simple container for feature extractor output."""

    def __init__(self, input_values):
        self.input_values = input_values

    def to(self, device):
        """Move tensors to device."""
        if isinstance(self.input_values, torch.Tensor):
            self.input_values = self.input_values.to(device)
        return self
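
A quick check of the per-sample normalization this extractor applies (illustrative; runs against the class above with synthetic, offset audio):

```python
import numpy as np

extractor = DistilledSpeechFeatureExtractor()
audio = 3.0 + 0.5 * np.random.randn(16000).astype(np.float32)  # nonzero mean/scale
out = extractor(audio, return_tensors="np")
print(out.input_values.mean())  # ~0 after normalization
print(out.input_values.std())   # ~1 after normalization
```
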
modeling_distilled_speech.py
ADDED
@@ -0,0 +1,525 @@
"""
HuggingFace Model for Distilled Speech Encoder.

A Data2Vec-style bidirectional speech encoder distilled from AuriStream.
Returns hidden states from all layers for downstream probing/finetuning.
"""

import math
from dataclasses import dataclass
from typing import Optional, Tuple, Union

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import PreTrainedModel
from transformers.modeling_outputs import BaseModelOutput

try:
    # When used as a HuggingFace model (trust_remote_code=True)
    from configuration_distilled_speech import DistilledSpeechConfig
except ImportError:
    # When used as part of a package
    from .configuration_distilled_speech import DistilledSpeechConfig


@dataclass
class DistilledSpeechOutput(BaseModelOutput):
    """
    Output type for DistilledSpeechModel.

    Args:
        last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
            Sequence of hidden-states at the output of the last layer of the model.
        hidden_states (`tuple(torch.FloatTensor)`, *optional*):
            Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for each layer)
            of shape `(batch_size, sequence_length, hidden_size)`.
        extract_features (`torch.FloatTensor` of shape `(batch_size, sequence_length, conv_dim[-1])`):
            Output of the convolutional feature encoder (before projection).
    """
    last_hidden_state: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
    extract_features: Optional[torch.FloatTensor] = None


# ==============================================================================
# Convolutional Feature Encoder
# ==============================================================================

class GroupNorm1D(nn.Module):
    """Group normalization for 1D convolutions (B, C, T) -> (B, C, T)."""

    def __init__(self, num_groups: int, num_channels: int, eps: float = 1e-5):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups, num_channels, eps=eps)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x)


class ConvLayer(nn.Module):
    """Single convolutional layer with normalization and activation."""

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int,
        stride: int,
        bias: bool = False,
        norm: str = "group",
        activation: str = "gelu",
    ):
        super().__init__()
        self.conv = nn.Conv1d(
            in_channels,
            out_channels,
            kernel_size=kernel_size,
            stride=stride,
            bias=bias,
        )

        if norm == "group":
            self.norm = GroupNorm1D(num_groups=out_channels, num_channels=out_channels)
        elif norm == "layer":
            self.norm = nn.LayerNorm(out_channels)
        else:
            self.norm = None

        if activation == "gelu":
            self.activation = nn.GELU()
        elif activation == "relu":
            self.activation = nn.ReLU()
        else:
            self.activation = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)
        if self.norm is not None:
            if isinstance(self.norm, nn.LayerNorm):
                # LayerNorm expects channels last: (B, C, T) -> (B, T, C) -> (B, C, T)
                x = x.transpose(1, 2)
                x = self.norm(x)
                x = x.transpose(1, 2)
            else:
                x = self.norm(x)
        if self.activation is not None:
            x = self.activation(x)
        return x


class ConvFeatureEncoder(nn.Module):
    """
    7-layer convolutional feature encoder.

    Transforms raw 16kHz audio into 50Hz feature representations.
    Total stride: 5 * 2 * 2 * 2 * 2 * 2 * 2 = 320 (16kHz / 320 = 50Hz)
    """

    def __init__(self, config: DistilledSpeechConfig):
        super().__init__()

        conv_layers = []
        in_channels = 1

        for i, (out_channels, kernel, stride) in enumerate(
            zip(config.conv_dim, config.conv_kernel, config.conv_stride)
        ):
            norm = "group" if i > 0 else config.feat_extract_norm
            conv_layers.append(
                ConvLayer(
                    in_channels=in_channels,
                    out_channels=out_channels,
                    kernel_size=kernel,
                    stride=stride,
                    bias=config.conv_bias,
                    norm=norm,
                    activation=config.feat_extract_activation,
                )
            )
            in_channels = out_channels

        self.conv_layers = nn.ModuleList(conv_layers)
        self.output_dim = config.conv_dim[-1]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Raw audio waveform (B, T) or (B, 1, T)

        Returns:
            Features (B, T', C) where T' = T // 320
        """
        if x.dim() == 2:
            x = x.unsqueeze(1)

        for conv_layer in self.conv_layers:
            x = conv_layer(x)

        x = x.transpose(1, 2)
        return x


class FeatureProjection(nn.Module):
    """Projects conv features to the transformer hidden size."""

    def __init__(self, config: DistilledSpeechConfig):
        super().__init__()
        self.layer_norm = nn.LayerNorm(config.conv_dim[-1], eps=config.layer_norm_eps)
        self.projection = nn.Linear(config.conv_dim[-1], config.hidden_size)
        self.dropout = nn.Dropout(config.feat_proj_dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.layer_norm(x)
        x = self.projection(x)
        x = self.dropout(x)
        return x


# ==============================================================================
# Rotary Position Embeddings
# ==============================================================================

class RotaryEmbedding(nn.Module):
    """Rotary Position Embedding (RoPE)."""

    def __init__(self, dim: int, theta: float = 10000.0, max_seq_len: int = 8192):
        super().__init__()
        self.dim = dim
        self.theta = theta
        self.max_seq_len = max_seq_len

        inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

        self._cos_cached = None
        self._sin_cached = None
        self._seq_len_cached = 0

    def _update_cache(self, seq_len: int, device: torch.device, dtype: torch.dtype):
        if seq_len > self._seq_len_cached or self._cos_cached is None:
            self._seq_len_cached = max(seq_len, self.max_seq_len)
            t = torch.arange(self._seq_len_cached, device=device, dtype=dtype)
            freqs = torch.outer(t, self.inv_freq.to(device))
            emb = torch.cat((freqs, freqs), dim=-1)
            self._cos_cached = emb.cos()
            self._sin_cached = emb.sin()

    def forward(self, x: torch.Tensor, seq_len: int) -> Tuple[torch.Tensor, torch.Tensor]:
        self._update_cache(seq_len, x.device, x.dtype)
        return (
            self._cos_cached[:seq_len].to(x.dtype),
            self._sin_cached[:seq_len].to(x.dtype),
        )


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Rotate half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb(
    q: torch.Tensor, k: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Apply rotary position embedding to query and key tensors."""
    cos = cos.unsqueeze(0).unsqueeze(0)
    sin = sin.unsqueeze(0).unsqueeze(0)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed


# ==============================================================================
# Transformer Layers
# ==============================================================================

class MultiHeadAttention(nn.Module):
    """Multi-head self-attention with RoPE support."""

    def __init__(self, config: DistilledSpeechConfig):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = config.hidden_size // config.num_attention_heads

        assert self.head_dim * self.num_heads == self.hidden_size

        self.q_proj = nn.Linear(config.hidden_size, config.hidden_size)
        self.k_proj = nn.Linear(config.hidden_size, config.hidden_size)
        self.v_proj = nn.Linear(config.hidden_size, config.hidden_size)
        self.out_proj = nn.Linear(config.hidden_size, config.hidden_size)

        self.dropout = nn.Dropout(config.attention_dropout)
        self.use_rope = config.use_rope

    def forward(
        self,
        x: torch.Tensor,
        cos: Optional[torch.Tensor] = None,
        sin: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        B, T, _ = x.shape

        q = self.q_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        if self.use_rope and cos is not None and sin is not None:
            q, k = apply_rotary_pos_emb(q, k, cos, sin)

        # Scaled dot-product attention
        attn_output = F.scaled_dot_product_attention(
            q, k, v,
            attn_mask=attention_mask,
            dropout_p=self.dropout.p if self.training else 0.0,
        )

        attn_output = attn_output.transpose(1, 2).contiguous().view(B, T, self.hidden_size)
        attn_output = self.out_proj(attn_output)

        return attn_output


class FeedForward(nn.Module):
    """Feed-forward network with GELU activation."""

    def __init__(self, config: DistilledSpeechConfig):
        super().__init__()
        self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.activation = nn.GELU()
        self.dropout = nn.Dropout(config.activation_dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc1(x)
        x = self.activation(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x


class TransformerLayer(nn.Module):
    """Single transformer encoder layer with pre-norm."""

    def __init__(self, config: DistilledSpeechConfig):
        super().__init__()
        self.attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)
        self.attention_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.ffn_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout)

    def forward(
        self,
        x: torch.Tensor,
        cos: Optional[torch.Tensor] = None,
        sin: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        # Self-attention with pre-norm
        residual = x
        x = self.attention_norm(x)
        x = self.attention(x, cos, sin, attention_mask)
        x = self.dropout(x)
        x = residual + x

        # Feed-forward with pre-norm
        residual = x
        x = self.ffn_norm(x)
        x = self.feed_forward(x)
        x = self.dropout(x)
        x = residual + x

        return x


class TransformerEncoder(nn.Module):
    """Stack of transformer encoder layers with hidden state collection."""

    def __init__(self, config: DistilledSpeechConfig):
        super().__init__()
        self.config = config
        self.layers = nn.ModuleList([
            TransformerLayer(config) for _ in range(config.num_hidden_layers)
        ])

        if config.use_rope:
            self.rotary_emb = RotaryEmbedding(
                dim=config.hidden_size // config.num_attention_heads,
                theta=config.rope_theta,
            )
        else:
            self.rotary_emb = None

    def forward(
        self,
        x: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        output_hidden_states: bool = False,
    ) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor, ...]]]:
        """
        Args:
            x: Input tensor (B, T, D)
            attention_mask: Optional attention mask
            output_hidden_states: Whether to return all hidden states

        Returns:
            Tuple of (last_hidden_state, all_hidden_states)
            all_hidden_states: tuple of (num_layers + 1) tensors if output_hidden_states=True
                - hidden_states[0]: input to the first transformer layer
                - hidden_states[i]: output of transformer layer i-1 (for i > 0)
        """
        B, T, _ = x.shape

        cos, sin = None, None
        if self.rotary_emb is not None:
            cos, sin = self.rotary_emb(x, T)

        all_hidden_states = () if output_hidden_states else None

        # Collect hidden state before the first layer (embedding output)
        if output_hidden_states:
            all_hidden_states = all_hidden_states + (x,)

        for layer in self.layers:
            x = layer(x, cos, sin, attention_mask)
            # Collect hidden state after each layer
            if output_hidden_states:
                all_hidden_states = all_hidden_states + (x,)

        return x, all_hidden_states


# ==============================================================================
# Main Model
# ==============================================================================

class DistilledSpeechModel(PreTrainedModel):
    """
    Distilled Speech Encoder Model.

    A Data2Vec-style bidirectional transformer encoder for speech,
    trained via distillation from AuriStream models.

    This model takes raw audio waveforms as input and outputs contextualized
    representations at 50Hz (20ms stride). It returns hidden states from all
    transformer layers, making it suitable for downstream probing and finetuning.

    Hidden states structure (for a 12-layer model, output_hidden_states=True):
        - hidden_states[0]: Feature projection output (input to the transformer)
        - hidden_states[1]: Output of transformer layer 0
        - hidden_states[2]: Output of transformer layer 1
        - ...
        - hidden_states[12]: Output of transformer layer 11
        Total: 13 hidden states (1 embedding + 12 layers)

    Example usage:
        >>> from transformers import AutoModel, AutoFeatureExtractor
        >>> model = AutoModel.from_pretrained("your-model-name", trust_remote_code=True)
        >>> processor = AutoFeatureExtractor.from_pretrained("your-model-name", trust_remote_code=True)
        >>> audio = torch.randn(16000)  # 1 second of audio at 16kHz
        >>> inputs = processor(audio, return_tensors="pt", sample_rate=16000)
        >>> outputs = model(inputs.input_values, output_hidden_states=True)
        >>> last_hidden = outputs.last_hidden_state  # (1, 50, 768)
        >>> all_hidden = outputs.hidden_states  # Tuple of 13 tensors
        >>> # Or use dict-style access:
        >>> all_hidden = outputs["hidden_states"]
    """

    config_class = DistilledSpeechConfig
    base_model_prefix = "distilled_speech"
    main_input_name = "input_values"
    supports_gradient_checkpointing = True

    def __init__(self, config: DistilledSpeechConfig):
        super().__init__(config)
        self.config = config

        # Feature extraction
        self.conv_encoder = ConvFeatureEncoder(config)
        self.feature_projection = FeatureProjection(config)

        # Transformer encoder
        self.encoder = TransformerEncoder(config)
        self.final_layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)

        # Initialize weights
        self.post_init()

    def _init_weights(self, module):
        """Initialize the weights."""
        if isinstance(module, nn.Linear):
            nn.init.trunc_normal_(module.weight, std=0.02)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.LayerNorm):
            nn.init.ones_(module.weight)
            nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Conv1d):
            nn.init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")
            if module.bias is not None:
                nn.init.zeros_(module.bias)

    def forward(
        self,
        input_values: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, DistilledSpeechOutput]:
        """
        Forward pass through the model.

        Args:
            input_values (`torch.Tensor` of shape `(batch_size, sequence_length)`):
                Raw audio waveform, normalized to zero mean and unit variance.
                Expected sample rate: 16kHz.
            attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Mask to avoid performing attention on padding tokens.
            output_hidden_states (`bool`, *optional*):
                Whether to return hidden states from all layers.
            return_dict (`bool`, *optional*):
                Whether to return a ModelOutput instead of a plain tuple.

        Returns:
            `DistilledSpeechOutput` or `tuple`:
                - last_hidden_state: (B, T', hidden_size) where T' = T // 320
                - hidden_states: Tuple of (B, T', hidden_size) for each layer if output_hidden_states=True
                - extract_features: (B, T', conv_dim[-1]) raw conv features
        """
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None
            else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # Conv encoder: (B, T) -> (B, T', conv_dim)
        extract_features = self.conv_encoder(input_values)

        # Feature projection: (B, T', conv_dim) -> (B, T', hidden_size)
        hidden_states = self.feature_projection(extract_features)

        # Transformer encoder
        encoder_output, all_hidden_states = self.encoder(
            hidden_states,
            attention_mask=attention_mask,
            output_hidden_states=output_hidden_states,
        )

        # Final layer norm
        last_hidden_state = self.final_layer_norm(encoder_output)

        if not return_dict:
            outputs = (last_hidden_state,)
            if output_hidden_states:
                outputs = outputs + (all_hidden_states,)
            outputs = outputs + (extract_features,)
            return outputs

        return DistilledSpeechOutput(
            last_hidden_state=last_hidden_state,
            hidden_states=all_hidden_states,
            extract_features=extract_features,
        )
preprocessor_config.json
ADDED
@@ -0,0 +1,6 @@
{
  "sampling_rate": 16000,
  "do_normalize": true,
  "return_attention_mask": false,
  "feature_extractor_type": "DistilledSpeechFeatureExtractor"
}
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5d818dc5701dedd635879dcc3a5df3056714f5f53ba80d90d11843e9b62fdc3d
size 358700726