---
license: apache-2.0
language:
- en
tags:
- speech
- audio
- data2vec
- distillation
- feature-extraction
library_name: transformers
pipeline_tag: feature-extraction
---
# Distilled Speech Encoder

A Data2Vec-style bidirectional speech encoder trained via knowledge distillation from AuriStream teacher models.

## Model Details

- **Architecture**: 12-layer Transformer with RoPE positional encoding
- **Hidden size**: 768
- **Attention heads**: 12
- **Parameters**: ~85M
- **Teacher model**: `TuKoResearch/AuriStream100M_40Pred_BigAudioDataset_500k`
- **Training step**: 100,000
- **Input**: 16 kHz raw audio waveform
- **Output**: 50 Hz contextualized representations (768-dim)
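
The 16 kHz input / 50 Hz output spec above implies roughly one output frame per 320 input samples. A minimal sanity check of that arithmetic (approximate, since the conv encoder's exact kernel sizes and padding are not documented here):

```python
# 16 kHz in, 50 Hz out -> one output frame per 16000 / 50 = 320 samples.
SAMPLE_RATE = 16_000   # Hz, model input rate
FRAME_RATE = 50        # Hz, model output rate
STRIDE = SAMPLE_RATE // FRAME_RATE  # 320 samples per output frame

def approx_num_frames(num_samples: int) -> int:
    """Approximate output sequence length for a raw waveform.

    The exact count can differ by a frame or two depending on the
    conv encoder's padding, which this sketch ignores.
    """
    return num_samples // STRIDE

print(approx_num_frames(16_000))  # 1 second of audio -> ~50 frames
```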

## Usage

```python
from transformers import AutoModel, Wav2Vec2FeatureExtractor
import torch

# Load the model and feature extractor
model = AutoModel.from_pretrained(
    "TuKoResearch/AuriStreamDistill_100M40PredTeacher_librispeech960",
    trust_remote_code=True,
)
model.eval()  # disable dropout and other train-time behavior
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(
    "TuKoResearch/AuriStreamDistill_100M40PredTeacher_librispeech960"
)

# Prepare audio (16 kHz, mono); random noise stands in for a real waveform here
audio = torch.randn(16000).numpy()  # 1 second of audio

# Extract features
inputs = feature_extractor(audio, return_tensors="pt", sampling_rate=16000)
with torch.no_grad():
    outputs = model(inputs.input_values, output_hidden_states=True)

# Get representations
last_hidden = outputs.last_hidden_state  # (1, 50, 768) for 1 second of audio
all_hidden = outputs.hidden_states       # tuple of 13 tensors
```

## Hidden States

When `output_hidden_states=True`, the model returns hidden states from all layers:

- `hidden_states[0]`: feature-projection output (after the conv encoder and projection)
- `hidden_states[1]` through `hidden_states[12]`: Transformer layer outputs
- `hidden_states[12]`: final layer output (identical to `last_hidden_state`)

This makes the model well suited to linear-probing experiments at different layers.
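
One way to run such a probe, sketched here with synthetic features standing in for a real `outputs.hidden_states[layer]` (the mean pooling, ridge penalty, and binary task are illustrative choices, not part of this model's training):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for one layer's hidden states: (n_clips, n_frames, 768).
# In practice, stack outputs.hidden_states[layer] across your dataset instead.
n_clips, n_frames, dim = 64, 50, 768
features = rng.normal(size=(n_clips, n_frames, dim))
labels = rng.integers(0, 2, size=n_clips)  # hypothetical binary labels

# Mean-pool over time, then fit a ridge-regularized linear probe in closed form.
X = features.mean(axis=1)       # (n_clips, 768)
y = 2.0 * labels - 1.0          # map {0, 1} labels to {-1, +1} targets
w = np.linalg.solve(X.T @ X + 1e2 * np.eye(dim), X.T @ y)

acc = np.mean((X @ w > 0) == (labels == 1))
print(f"probe train accuracy: {acc:.2f}")
```

Repeating this per layer and comparing probe accuracies shows where a given property is best represented in the network.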

## Training

This model was trained with Data2Vec-style distillation:

1. A frozen AuriStream teacher model generates target representations.
2. The student sees masked audio and learns to predict the teacher's representations.
3. The loss is computed only on masked positions.
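
The masked-positions-only objective of step 3 can be sketched as follows. Plain MSE is used here for simplicity, and the shapes and masking rate are illustrative; the actual training objective may differ in detail:

```python
import numpy as np

def masked_distillation_loss(student, teacher, mask):
    """Mean squared error between student and teacher representations,
    averaged over masked positions only.

    student, teacher: (batch, frames, dim) arrays
    mask: (batch, frames) boolean array, True where the audio was masked
    """
    sq_err = (student - teacher) ** 2   # (batch, frames, dim)
    per_frame = sq_err.mean(axis=-1)    # (batch, frames)
    return per_frame[mask].mean()       # average over masked frames only

# Toy example with random representations and a ~50% mask
rng = np.random.default_rng(0)
s = rng.normal(size=(2, 50, 768))
t = rng.normal(size=(2, 50, 768))
m = rng.random((2, 50)) < 0.5
print(masked_distillation_loss(s, t, m))
```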

## Citation

If you use this model, please cite:

```bibtex
@misc{distilled_speech_encoder,
  title={Distilled Speech Encoder},
  author={TuKo Research},
  year={2025},
  url={https://huggingface.co/TuKoResearch/AuriStreamDistill_100M40PredTeacher_librispeech960}
}
```