TuKoResearch/GSLM-HuBERT200

klemenk committed on Jul 30, 2025

Commit 32e40f0 · verified · 1 Parent(s): 526a387

Update README.md

Files changed (1)

README.md +0 -96

@@ -1,96 +0,0 @@
----
-license: mit
-tags:
-- gslm
-- speech
-- language-model
-- hubert
-- fairseq
----
-# GSLM Unit Language Model - HuBERT 200
-This is a PyTorch implementation of the Unit Language Model (ULM) from the [Generative Spoken Language Modeling (GSLM)](https://arxiv.org/abs/2102.01192) paper, trained on HuBERT units with 200 clusters.
-## Model Details
-- **Architecture**: Transformer Language Model (transformer_lm_big)
-- **Parameters**: ~215M
-- **Vocab Size**: 204 (200 HuBERT units + special tokens)
-- **Embedding Dimension**: 1024
-- **Layers**: 12
-- **Attention Heads**: 16
-- **FFN Dimension**: 4096
-- **Max Sequence Length**: 3072
-## Usage
-```python
-import torch
-from safetensors.torch import load_file
-from gslm_ulm import TransformerLanguageModel
-# Load model
-model = TransformerLanguageModel(
-    vocab_size=204,
-    d_model=1024,
-    nhead=16,
-    num_layers=12,
-    dim_feedforward=4096,
-    max_seq_length=3072
-)
-# Load weights
-state_dict = load_file("gslm_hubert200_ulm.safetensors")
-model.load_state_dict(state_dict)
-model.eval()
-# Generate sequences
-prompt = torch.tensor([[1, 5, 10, 15]])  # Example HuBERT unit sequence
-generated = model.generate(
-    prompt,
-    max_length=100,
-    temperature=0.8,
-    top_k=50
-)
-```
-## Model Architecture
-The model follows the transformer_lm_big configuration from fairseq:
-- Pre-normalization (layer norm before attention/FFN)
-- Sinusoidal positional encoding
-- Shared input/output embeddings
-- Causal attention mask for autoregressive generation
-## Training Details
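-- Trained on unit sequences produced by HuBERT-Base features quantized to 200 clusters
-- Training data: 6,000 hours of unlabeled speech (the GSLM paper trains its unit LM on a 6k-hour subset of Libri-Light; LibriSpeech itself contains only about 960 hours)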
-- Trained as a causal language model on sequences of discrete units
-## Complete GSLM Pipeline
-This is the Unit Language Model component of GSLM. For the complete pipeline:
-1. **Speech2Unit**: Convert raw audio → discrete units (HuBERT + k-means)
-2. **Unit LM**: Generate/continue unit sequences (this model)
-3. **Unit2Speech**: Convert units → speech (Tacotron2 + WaveGlow)
-## Original Paper
-```bibtex
-@article{lakhotia2021generative,
-  title={On Generative Spoken Language Modeling from Raw Audio},
-  author={Lakhotia, Kushal and Kharitonov, Eugene and Hsu, Wei-Ning and Adi, Yossi and Polyak, Adam and Bolte, Benjamin and Nguyen, Tu-Anh and Copet, Jade and Baevski, Alexei and Mohamed, Abdelrahman and Dupoux, Emmanuel},
-  journal={Transactions of the Association for Computational Linguistics},
-  volume={9},
-  pages={1336--1354},
-  year={2021}
-}
-```
-## Notes
-- This model uses fixed sinusoidal positional encoding rather than learned positional embeddings, so the positional encoding adds no trainable parameters
-- The model expects discrete unit indices as input (not raw audio)
-- Unit indices range from 0 to 199; four special tokens follow (200: EOS, 201: BOS, 202: PAD, 203: UNK)
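
The token layout described in the Notes of the removed README (units 0 to 199, then EOS, BOS, PAD, UNK at 200 to 203) can be illustrated with a small sketch. This is not code from the repository: the helper names `encode_units` and `pad_batch` are hypothetical, and whether the model expects a BOS token before the prompt is an assumption, since the README's own usage example passes raw unit indices.

```python
from typing import List

# Token layout as stated in the README notes.
NUM_UNITS = 200
EOS, BOS, PAD, UNK = 200, 201, 202, 203


def encode_units(units: List[int], add_bos: bool = False, add_eos: bool = False) -> List[int]:
    """Map raw unit indices to token IDs; out-of-range values become UNK.

    Hypothetical helper. Prepending BOS or appending EOS is optional here
    because the README does not say what the model expects.
    """
    ids = [u if 0 <= u < NUM_UNITS else UNK for u in units]
    if add_bos:
        ids = [BOS] + ids
    if add_eos:
        ids = ids + [EOS]
    return ids


def pad_batch(seqs: List[List[int]]) -> List[List[int]]:
    """Right-pad every sequence with PAD to the length of the longest one."""
    width = max(len(s) for s in seqs)
    return [s + [PAD] * (width - len(s)) for s in seqs]


# Two prompts of different lengths; 250 is not a valid unit and maps to UNK.
batch = pad_batch([
    encode_units([1, 5, 10, 15], add_bos=True),
    encode_units([7, 250], add_bos=True),
])
```

The resulting nested list can be wrapped in `torch.tensor(batch)` before being passed to a model. Whether padded positions also need an attention mask depends on the implementation, which the README does not document.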