WavLM-L3 (Large, layer 3 truncation)
A 4-layer truncation of WavLM-Large used as a frozen SSL prior in the FRAME zero-shot voice cloning system. Only the first 4 transformer layers are retained; the layer-3 output is exposed for frame-level reference attention.
Why a truncated WavLM?
The full WavLM-Large model is 316M parameters; the truncated 4-layer variant is 63.5M (~20% of the original). For zero-shot TTS where WavLM is needed only at voice registration (once per voice, output cached to a ~2 MB artifact), running a shallower variant cuts registration latency without measurable quality loss.
Layer 3 was chosen empirically as a sweet spot between acoustic richness (low layers) and phonetic abstraction (high layers); see the FRAME paper for the ablation.
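The repository does not include the export script, but a truncation like this can be produced with the standard `transformers` API by slicing the encoder's layer list and updating the config. The sketch below uses a small randomly initialised config so it runs without downloading weights; in practice you would load `microsoft/wavlm-large` instead (the output directory name is hypothetical):

```python
import torch.nn as nn
from transformers import WavLMConfig, WavLMModel

# Small stand-in config; in practice load the real checkpoint:
#   model = WavLMModel.from_pretrained("microsoft/wavlm-large")
cfg = WavLMConfig(hidden_size=32, num_hidden_layers=8, num_attention_heads=4,
                  intermediate_size=64, num_conv_pos_embeddings=16,
                  num_conv_pos_embedding_groups=4)
model = WavLMModel(cfg)

# Keep only the first 4 transformer blocks and record the new depth.
model.encoder.layers = nn.ModuleList(list(model.encoder.layers)[:4])
model.config.num_hidden_layers = 4
model.save_pretrained("wavlm_truncated")  # hypothetical output dir
```

After truncation, `output_hidden_states=True` returns 5 tensors: the CNN feature projection plus one per remaining transformer block.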
Usage
from transformers import WavLMModel
import torch

model = WavLMModel.from_pretrained("kdrkdrkdr/wavlm_l3")
model.eval()

# Input: mono 16 kHz waveform, shape [B, T]
wav = torch.randn(1, 16000 * 5)  # 5-second clip

with torch.no_grad():
    out = model(wav, output_hidden_states=True)
    # hidden_states[0] is the CNN feature projection; index k is the
    # output of transformer block k.
    features = out.hidden_states[3]  # layer-3 output, [B, T_frames, 1024]
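Since WavLM is only needed at voice registration, the layer-3 features can be computed once and cached. A back-of-envelope check of the ~2 MB figure, using a hypothetical cache path (`voice_ref.pt` is not from the FRAME repo): a 20-second reference at 50 Hz yields 1000 frames of 1024-dim features, and in float16 that is 1000 × 1024 × 2 bytes ≈ 2.0 MB.

```python
import torch

# Stand-in for out.hidden_states[3] from a 20 s reference clip:
# 20 s at 50 Hz -> 1000 frames, 1024 dims each.
features = torch.randn(1, 1000, 1024)

# Cache in half precision at registration time (hypothetical path).
torch.save(features.half(), "voice_ref.pt")

cache_bytes = features.numel() * 2  # 2 bytes per float16 element
print(cache_bytes / 1e6, "MB")      # ~2.0 MB, matching the stated artifact size
```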
Architecture
- Base model: microsoft/wavlm-large
- Layers retained: 0-3 (first 4 transformer blocks)
- Hidden size: 1024
- Frame rate: 50 Hz (20 ms hop at 16 kHz input)
- Parameters: ~63.5M
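The 50 Hz frame rate above follows directly from WavLM's wav2vec 2.0-style convolutional front end, whose seven conv layers have strides (5, 2, 2, 2, 2, 2, 2), i.e. one frame per 320 input samples:

```python
# Product of the conv-stack strides gives the hop size in samples.
strides = (5, 2, 2, 2, 2, 2, 2)
hop = 1
for s in strides:
    hop *= s
print(hop, 16000 // hop)  # 320 samples per frame -> 50 Hz at 16 kHz
```

At 16 kHz input this is a 20 ms hop, matching the frame rate listed above.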
Citation
If you use this in research, please cite the FRAME paper.
% (paper bibtex placeholder; will be filled after publication)
Original WavLM:
@article{chen2022wavlm,
title={WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing},
author={Chen, Sanyuan and Wang, Chengyi and Chen, Zhengyang and others},
journal={IEEE Journal of Selected Topics in Signal Processing},
year={2022},
}