---
license: apache-2.0
language:
- en
tags:
- speech
- audio
- data2vec
- distillation
- feature-extraction
library_name: transformers
pipeline_tag: feature-extraction
---
# Distilled Speech Encoder

A Data2Vec-style bidirectional speech encoder trained via knowledge distillation from AuriStream teacher models.

## Model Details

- **Architecture**: 12-layer Transformer with RoPE positional encoding
- **Hidden size**: 768
- **Attention heads**: 12
- **Parameters**: ~85M
- **Teacher model**: `TuKoResearch/AuriStream100M_40Pred_BigAudioDataset_500k`
- **Training step**: 100,000
- **Input**: 16 kHz raw audio waveform
- **Output**: 50 Hz contextualized representations (768-dim)
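
The 16 kHz input / 50 Hz output spec above implies roughly one output frame per 320 input samples. A minimal sanity check of that arithmetic (approximate, since the conv encoder's exact kernel sizes and padding are not documented here):

```python
# 16 kHz in, 50 Hz out -> one output frame per 16000 / 50 = 320 samples.
SAMPLE_RATE = 16_000   # Hz, model input rate
FRAME_RATE = 50        # Hz, model output rate
STRIDE = SAMPLE_RATE // FRAME_RATE  # 320 samples per output frame

def approx_num_frames(num_samples: int) -> int:
    """Approximate output sequence length for a raw waveform.

    The exact count can differ by a frame or two depending on the
    conv encoder's padding, which this sketch ignores.
    """
    return num_samples // STRIDE

print(approx_num_frames(16_000))  # 1 second of audio -> ~50 frames
```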

## Usage

```python
from transformers import AutoModel, Wav2Vec2FeatureExtractor
import torch

# Load the model and feature extractor
model = AutoModel.from_pretrained(
    "TuKoResearch/AuriStreamDistill_100M40PredTeacher_librispeech960",
    trust_remote_code=True,
)
model.eval()  # disable dropout and other train-time behavior
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(
    "TuKoResearch/AuriStreamDistill_100M40PredTeacher_librispeech960"
)

# Prepare audio (16 kHz, mono); random noise stands in for a real waveform here
audio = torch.randn(16000).numpy()  # 1 second of audio

# Extract features
inputs = feature_extractor(audio, return_tensors="pt", sampling_rate=16000)
with torch.no_grad():
    outputs = model(inputs.input_values, output_hidden_states=True)

# Get representations
last_hidden = outputs.last_hidden_state  # (1, 50, 768) for 1 second of audio
all_hidden = outputs.hidden_states       # tuple of 13 tensors
```

## Hidden States

When `output_hidden_states=True`, the model returns hidden states from all layers:

- `hidden_states[0]`: feature-projection output (after the conv encoder and projection)
- `hidden_states[1]` through `hidden_states[12]`: Transformer layer outputs
- `hidden_states[12]`: final layer output (identical to `last_hidden_state`)

This makes the model well suited to linear-probing experiments at different layers.
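
One way to run such a probe, sketched here with synthetic features standing in for a real `outputs.hidden_states[layer]` (the mean pooling, ridge penalty, and binary task are illustrative choices, not part of this model's training):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for one layer's hidden states: (n_clips, n_frames, 768).
# In practice, stack outputs.hidden_states[layer] across your dataset instead.
n_clips, n_frames, dim = 64, 50, 768
features = rng.normal(size=(n_clips, n_frames, dim))
labels = rng.integers(0, 2, size=n_clips)  # hypothetical binary labels

# Mean-pool over time, then fit a ridge-regularized linear probe in closed form.
X = features.mean(axis=1)       # (n_clips, 768)
y = 2.0 * labels - 1.0          # map {0, 1} labels to {-1, +1} targets
w = np.linalg.solve(X.T @ X + 1e2 * np.eye(dim), X.T @ y)

acc = np.mean((X @ w > 0) == (labels == 1))
print(f"probe train accuracy: {acc:.2f}")
```

Repeating this per layer and comparing probe accuracies shows where a given property is best represented in the network.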

## Training

This model was trained with Data2Vec-style distillation:

1. A frozen AuriStream teacher model generates target representations.
2. The student sees masked audio and learns to predict the teacher's representations.
3. The loss is computed only on masked positions.
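
The masked-positions-only objective of step 3 can be sketched as follows. Plain MSE is used here for simplicity, and the shapes and masking rate are illustrative; the actual training objective may differ in detail:

```python
import numpy as np

def masked_distillation_loss(student, teacher, mask):
    """Mean squared error between student and teacher representations,
    averaged over masked positions only.

    student, teacher: (batch, frames, dim) arrays
    mask: (batch, frames) boolean array, True where the audio was masked
    """
    sq_err = (student - teacher) ** 2   # (batch, frames, dim)
    per_frame = sq_err.mean(axis=-1)    # (batch, frames)
    return per_frame[mask].mean()       # average over masked frames only

# Toy example with random representations and a ~50% mask
rng = np.random.default_rng(0)
s = rng.normal(size=(2, 50, 768))
t = rng.normal(size=(2, 50, 768))
m = rng.random((2, 50)) < 0.5
print(masked_distillation_loss(s, t, m))
```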

## Citation

If you use this model, please cite:

```bibtex
@misc{distilled_speech_encoder,
  title={Distilled Speech Encoder},
  author={TuKo Research},
  year={2025},
  url={https://huggingface.co/TuKoResearch/AuriStreamDistill_100M40PredTeacher_librispeech960}
}
```