---
license: apache-2.0
language:
- en
tags:
- speech
- audio
- data2vec
- distillation
- feature-extraction
library_name: transformers
pipeline_tag: feature-extraction
---

# Distilled Speech Encoder

A Data2Vec-style bidirectional speech encoder trained via distillation from AuriStream models.

## Model Details

- **Architecture**: 12-layer transformer with RoPE positional encoding
- **Hidden size**: 768
- **Attention heads**: 12
- **Parameters**: ~85M
- **Teacher model**: `TuKoResearch/AuriStream100M_40Pred_BigAudioDataset_500k`
- **Training step**: 100000
- **Input**: 16kHz raw audio waveform
- **Output**: 50Hz contextualized representations (768-dim)

## Usage

```python
from transformers import AutoModel, Wav2Vec2FeatureExtractor
import torch

# Load model and feature extractor
model = AutoModel.from_pretrained(
    "TuKoResearch/AuriStreamDistill_100M40PredTeacher_librispeech960",
    trust_remote_code=True,
)
model.eval()  # Important for inference!

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(
    "TuKoResearch/AuriStreamDistill_100M40PredTeacher_librispeech960"
)

# Prepare audio (16kHz, mono)
audio = torch.randn(16000).numpy()  # 1 second of audio

# Extract features
inputs = feature_extractor(audio, return_tensors="pt", sampling_rate=16000)
with torch.no_grad():
    outputs = model(inputs.input_values, output_hidden_states=True)

# Get representations
last_hidden = outputs.last_hidden_state  # (1, 50, 768) for 1 second
all_hidden = outputs.hidden_states  # Tuple of 13 tensors
```

## Hidden States

When `output_hidden_states=True`, the model returns hidden states from all layers:

- `hidden_states[0]`: Feature projection output (after conv encoder + projection)
- `hidden_states[1]` to `hidden_states[12]`: Transformer layer outputs
- `hidden_states[12]`: Final layer output (same as `last_hidden_state`)

This makes the model suitable for linear probing experiments at different layers.

## Training

This model was trained using Data2Vec-style distillation:

1. A frozen AuriStream teacher model generates target representations
2. The student sees masked audio and learns to predict teacher representations
3. Loss is computed only on masked positions

## Citation

If you use this model, please cite:

```bibtex
@misc{distilled_speech_encoder,
  title={Distilled Speech Encoder},
  author={TuKo Research},
  year={2025},
  url={https://huggingface.co/TuKoResearch/AuriStreamDistill_100M40PredTeacher_librispeech960}
}
```
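
## Appendix: Masked Loss Sketch

The Training section states that the loss is computed only on masked positions. The following is a minimal, dependency-free sketch of that idea, not the actual training code. The function name `masked_mse`, the choice of mean squared error, and the plain-list representation of frames are all assumptions made for illustration; this card does not state which loss function was used.

```python
def masked_mse(student, teacher, mask):
    """Mean squared error averaged over masked frames only.

    student, teacher: lists of frames, each frame a list of floats.
    mask: list of bools, True where the student's input frame was masked.
    NOTE: MSE is an assumed loss for illustration, not the documented one.
    """
    if not (len(student) == len(teacher) == len(mask)):
        raise ValueError("student, teacher and mask must have the same length")

    total, count = 0.0, 0
    for s_frame, t_frame, is_masked in zip(student, teacher, mask):
        if not is_masked:
            continue  # unmasked frames contribute nothing to the loss
        total += sum((s - t) ** 2 for s, t in zip(s_frame, t_frame))
        count += len(t_frame)

    if count == 0:
        raise ValueError("mask selects no positions")
    return total / count


# Toy example: 3 frames of dimension 2, only the middle frame is masked.
teacher = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
student = [[1.0, 2.0], [2.0, 4.0], [0.0, 0.0]]
mask = [False, True, False]

loss = masked_mse(student, teacher, mask)
print(loss)
```

In this toy example the large error on the last frame is ignored because that frame is not masked, so only the middle frame affects the result. A real training loop would apply the same selection to batched tensors rather than Python lists.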