---
license: apache-2.0
language:
- en
tags:
- speech
- audio
- data2vec
- distillation
- feature-extraction
library_name: transformers
pipeline_tag: feature-extraction
---

# Distilled Speech Encoder

A Data2Vec-style bidirectional speech encoder trained via distillation from AuriStream models.

## Model Details

- **Architecture**: 12-layer transformer with RoPE positional encoding
- **Hidden size**: 768
- **Attention heads**: 12
- **Parameters**: ~85M
- **Teacher model**: `TuKoResearch/AuriStream100M_40Pred_BigAudioDataset_500k`
- **Training step**: 100,000
- **Input**: 16kHz raw audio waveform
- **Output**: 50Hz contextualized representations (768-dim)
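
The 16 kHz input / 50 Hz output rates above imply a fixed downsampling factor of 320 samples per frame, which lets you predict output lengths. A minimal sketch (the exact convolutional stride is an assumption derived from the stated rates, not from the model card):

```python
# Predict the approximate number of output frames for a given audio length,
# assuming the stated 16 kHz input rate and 50 Hz output rate.
SAMPLE_RATE = 16_000                     # input samples per second
FRAME_RATE = 50                          # output frames per second
DOWNSAMPLE = SAMPLE_RATE // FRAME_RATE   # 320 samples per output frame

def num_frames(num_samples: int) -> int:
    """Approximate output sequence length for a raw-audio input."""
    return num_samples // DOWNSAMPLE

print(num_frames(16_000))  # 1 s of audio -> 50 frames
print(num_frames(48_000))  # 3 s of audio -> 150 frames
```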

## Usage

```python
from transformers import AutoModel, Wav2Vec2FeatureExtractor
import torch

# Load model and feature extractor
model = AutoModel.from_pretrained("TuKoResearch/AuriStreamDistill_100M40PredTeacher_librispeech960", trust_remote_code=True)
model.eval()  # Important for inference!
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("TuKoResearch/AuriStreamDistill_100M40PredTeacher_librispeech960")

# Prepare audio (16kHz, mono)
audio = torch.randn(16000).numpy()  # 1 second of audio

# Extract features
inputs = feature_extractor(audio, return_tensors="pt", sampling_rate=16000)
with torch.no_grad():
    outputs = model(inputs.input_values, output_hidden_states=True)

# Get representations
last_hidden = outputs.last_hidden_state  # (1, ~50, 768) for 1 second of audio
all_hidden = outputs.hidden_states  # Tuple of 13 tensors
```
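
For utterance-level downstream tasks, a common convention with feature-extraction models is to mean-pool the frame representations into a single embedding. A sketch (the pooling choice is illustrative, not prescribed by this card; a random tensor stands in for the model output):

```python
import torch

# Mean-pool frame representations into one utterance-level embedding.
last_hidden = torch.randn(1, 50, 768)    # stand-in for outputs.last_hidden_state
utterance_emb = last_hidden.mean(dim=1)  # average over the time axis -> (1, 768)
```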

## Hidden States

When `output_hidden_states=True`, the model returns hidden states from all layers:
- `hidden_states[0]`: Feature projection output (after conv encoder + projection)
- `hidden_states[1]` to `hidden_states[12]`: Transformer layer outputs
- `hidden_states[12]`: Final layer output (same as `last_hidden_state`)

This makes the model suitable for linear probing experiments at different layers.
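
A minimal layer-wise probing sketch, assuming frozen features and a linear classifier per layer (the probe setup, class count, and random stand-in features are illustrative, not part of the model card):

```python
import torch
import torch.nn.functional as F

# Illustrative linear probe over each of the 13 hidden states (768-dim).
num_layers, dim, num_classes, frames = 13, 768, 10, 50
hidden_states = [torch.randn(frames, dim) for _ in range(num_layers)]  # stand-in features
labels = torch.randint(0, num_classes, (frames,))

for layer, feats in enumerate(hidden_states):
    probe = torch.nn.Linear(dim, num_classes)          # one fresh probe per layer
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    for _ in range(20):                                # a few gradient steps
        opt.zero_grad()
        loss = F.cross_entropy(probe(feats), labels)
        loss.backward()
        opt.step()
    # In a real probe, report held-out accuracy per layer here.
```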

## Training

This model was trained using Data2Vec-style distillation:
1. A frozen AuriStream teacher model generates target representations
2. The student sees masked audio and learns to predict teacher representations
3. Loss is computed only on masked positions
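
The three steps above can be sketched as a masked-position regression loss (the MSE form and 50% masking rate here are assumptions; Data2Vec-style objectives typically use a smooth-L1 or MSE regression target, and random tensors stand in for the two models' outputs):

```python
import torch
import torch.nn.functional as F

# Sketch of a masked-distillation objective: student regresses onto
# frozen-teacher representations at masked positions only.
batch, frames, dim = 2, 50, 768
student_out = torch.randn(batch, frames, dim, requires_grad=True)
with torch.no_grad():                      # teacher is frozen (step 1)
    teacher_out = torch.randn(batch, frames, dim)

mask = torch.rand(batch, frames) < 0.5     # True at masked positions (step 2)
loss = F.mse_loss(student_out[mask], teacher_out[mask])  # masked positions only (step 3)
loss.backward()                            # gradients reach only the student
```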

## Citation

If you use this model, please cite:

```bibtex
@misc{distilled_speech_encoder,
  title={Distilled Speech Encoder},
  author={TuKo Research},
  year={2025},
  url={https://huggingface.co/TuKoResearch/AuriStreamDistill_100M40PredTeacher_librispeech960}
}
```