metadata
license: apache-2.0
language:
- en
tags:
- speech
- audio
- data2vec
- distillation
- feature-extraction
library_name: transformers
pipeline_tag: feature-extraction
Distilled Speech Encoder
A Data2Vec-style bidirectional speech encoder trained via distillation from AuriStream models.
Model Details
- Architecture: 24-layer transformer with RoPE positional encoding
- Hidden size: 1024
- Attention heads: 16
- Parameters: ~302M
- Teacher model: TuKoResearch/AuriStream100M_40Pred_BigAudioDataset_500k
- Training step: 100000
- Input: 16kHz raw audio waveform
- Output: 50Hz contextualized representations (1024-dim)
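The input and output rates above imply roughly one output frame per 320 input samples. The helper below is a minimal sketch of that arithmetic, assuming an exact hop of 320 samples with no edge effects. The real convolutional front end may yield a frame more or fewer, and the function names are illustrative, not part of the model's API.

```python
SAMPLE_RATE_HZ = 16_000  # input rate stated in Model Details
FRAME_RATE_HZ = 50       # output rate stated in Model Details

# Assumption: an exact hop with no padding or receptive-field edge effects.
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ // FRAME_RATE_HZ


def approx_num_frames(num_samples: int) -> int:
    """Approximate number of output frames for a waveform of num_samples."""
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    return num_samples // SAMPLES_PER_FRAME


def frame_start_seconds(frame_index: int) -> float:
    """Approximate start time, in seconds, of the given output frame."""
    return frame_index / FRAME_RATE_HZ
```

This is handy for sizing buffers or aligning frame-level labels, but the actual sequence length should always be read from the model output.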
Usage
from transformers import AutoModel, Wav2Vec2FeatureExtractor
import torch
# Load model and feature extractor
model = AutoModel.from_pretrained("TuKoResearch/AuriStreamDistillLarge_100M40PredTeacher_bad", trust_remote_code=True)
model.eval() # Important for inference!
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("TuKoResearch/AuriStreamDistillLarge_100M40PredTeacher_bad")
# Prepare audio (16kHz, mono)
audio = torch.randn(16000).numpy() # 1 second of random noise as a stand-in for real audio
# Extract features
inputs = feature_extractor(audio, return_tensors="pt", sampling_rate=16000)
with torch.no_grad():
    outputs = model(inputs.input_values, output_hidden_states=True)
# Get representations
last_hidden = outputs.last_hidden_state # (1, 50, 1024) for 1 second
all_hidden = outputs.hidden_states # Tuple of 25 tensors
Hidden States
When output_hidden_states=True, the model returns hidden states from all layers:
- hidden_states[0]: Feature projection output (after conv encoder + projection)
- hidden_states[1] to hidden_states[24]: Transformer layer outputs
- hidden_states[24]: Final layer output (same as last_hidden_state)
This makes the model suitable for linear probing experiments at different layers.
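A common first step for layer-wise probing is to pool each layer's frames into one utterance-level vector. The sketch below shows mean pooling over time using plain Python lists, so it runs without dependencies. The toy data and the names `mean_pool` and `probe_inputs` are illustrative assumptions, not part of the model's API.

```python
from typing import List, Sequence

Frame = Sequence[float]  # one feature vector (1024-dim in the real model)


def mean_pool(frames: Sequence[Frame]) -> List[float]:
    """Average a (time, dim) sequence of frames into one dim-sized vector."""
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def probe_inputs(hidden_states: Sequence[Sequence[Frame]]) -> List[List[float]]:
    """Return one pooled vector per layer, index-aligned with hidden_states."""
    return [mean_pool(layer) for layer in hidden_states]


# Toy stand-in: 3 "layers", 2 frames each, 2 dims (the real tuple has 25 entries).
toy_hidden_states = [
    [[0.0, 2.0], [2.0, 4.0]],
    [[1.0, 1.0], [3.0, 5.0]],
    [[4.0, 0.0], [6.0, 2.0]],
]
pooled = probe_inputs(toy_hidden_states)
```

With the real outputs, which are PyTorch tensors of shape (batch, time, 1024), the equivalent pooling would be a mean over the time dimension. A linear classifier can then be fit on each layer's pooled vectors separately.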
Training
This model was trained using Data2Vec-style distillation:
- A frozen AuriStream teacher model generates target representations
- The student sees masked audio and learns to predict teacher representations
- Loss is computed only on masked positions
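The masked-position loss described above can be sketched as follows. This is a hedged illustration, not the training code: the card does not say which regression loss was used, so mean squared error is an assumption here, and `masked_mse` is a hypothetical name. Plain Python lists keep the example free of dependencies.

```python
from typing import Sequence


def masked_mse(
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
    mask: Sequence[bool],
) -> float:
    """Mean squared error averaged over masked frames only.

    student and teacher are (time, dim) sequences; mask[t] is True where the
    student's input was masked at frame t. MSE is an assumed loss choice.
    """
    total = 0.0
    count = 0
    for s_frame, t_frame, is_masked in zip(student, teacher, mask):
        if not is_masked:
            continue  # unmasked frames contribute nothing to the loss
        total += sum((s - t) ** 2 for s, t in zip(s_frame, t_frame)) / len(s_frame)
        count += 1
    return total / count if count else 0.0


# Toy example: 3 frames of 2 dims, with frames 0 and 2 masked.
student_out = [[1.0, 1.0], [0.0, 0.0], [2.0, 0.0]]
teacher_out = [[0.0, 0.0], [5.0, 5.0], [2.0, 2.0]]
loss = masked_mse(student_out, teacher_out, [True, False, True])
```

The large error on the unmasked middle frame does not affect the result, which is the point of restricting the loss to masked positions.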
Citation
If you use this model, please cite:
@misc{distilled_speech_encoder,
title={Distilled Speech Encoder},
author={TuKo Research},
year={2025},
url={https://huggingface.co/TuKoResearch/AuriStreamDistillLarge_100M40PredTeacher_bad}
}