WavLM-L3 (Large, layer 3 truncation)

A 4-layer truncation of WavLM-Large used as a frozen self-supervised (SSL) prior in the FRAME zero-shot voice cloning system. Only the first 4 transformer layers are retained, and the output of the third transformer layer is exposed for frame-level reference attention.

Why a truncated WavLM?

The full WavLM-Large model is 316M parameters; the truncated 4-layer variant is 63.5M (~20% of the original). For zero-shot TTS where WavLM is needed only at voice registration (once per voice, output cached to a ~2 MB artifact), running a shallower variant cuts registration latency without measurable quality loss.

Layer 3 was chosen empirically as a sweet spot between acoustic richness (low layers) and phonetic abstraction (high layers); see the FRAME paper for the ablation.

Usage

from transformers import WavLMModel
import torch

model = WavLMModel.from_pretrained("kdrkdrkdr/wavlm_l3")
model.eval()

# Input: mono 16 kHz waveform, shape [B, T].
# (Real audio should be zero-mean/unit-variance normalized, matching the
# upstream microsoft/wavlm-large feature extractor.)
wav = torch.randn(1, 16000 * 5)  # 5-second clip
with torch.no_grad():
    out = model(wav, output_hidden_states=True)

# hidden_states[0] is the pre-transformer (feature-projection) output, so
# index 3 is the output of the third transformer block.
features = out.hidden_states[3]  # layer-3 output, [B, T_frames, 1024] (~249 frames at 50 Hz)
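
Since WavLM runs only at registration, the layer-3 features can be cached once per voice and reloaded at synthesis time. A minimal sketch of that pattern, building on the snippet above; the file name and float16 storage are illustrative assumptions, not part of this repo:

# Registration time: cache the reference features once per voice.
# At 50 Hz and 1024 dims, float16 features cost ~0.1 MB per second of
# reference audio, in line with the ~2 MB artifact mentioned above.
torch.save(features.half().cpu(), "voice_ref_l3.pt")  # hypothetical filename

# Synthesis time: reload the cached artifact instead of rerunning WavLM.
ref = torch.load("voice_ref_l3.pt").float()  # [1, T_frames, 1024]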

Architecture

  • Base model: microsoft/wavlm-large
  • Layers retained: 0–3, i.e. the first 4 transformer blocks (a truncation sketch follows this list)
  • Hidden size: 1024
  • Frame rate: 50 Hz (20 ms hop at 16 kHz input)
  • Parameters: ~63.5M
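
A comparable truncation can be reproduced from the upstream checkpoint. One way to do it in Transformers, as a sketch (the exact export recipe behind this repo may differ):

import torch
from transformers import WavLMModel

# Load the full 24-layer WavLM-Large, keep only the first 4 transformer blocks.
model = WavLMModel.from_pretrained("microsoft/wavlm-large")
model.encoder.layers = model.encoder.layers[:4]  # nn.ModuleList supports slicing
model.config.num_hidden_layers = 4

# Sanity check: should print roughly 63.5 (million parameters).
print(sum(p.numel() for p in model.parameters()) / 1e6)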

Citation

If you use this in research, please cite the FRAME paper.

% (paper bibtex placeholder; will be filled after publication)

Original WavLM:

@article{chen2022wavlm,
  title={WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing},
  author={Chen, Sanyuan and Wang, Chengyi and Chen, Zhengyang and others},
  journal={IEEE Journal of Selected Topics in Signal Processing},
  year={2022},
}