klemenk committed · Commit 32e40f0 · verified · Parent(s): 526a387

Update README.md (README.md: +0 −96)
---
license: mit
tags:
- gslm
- speech
- language-model
- hubert
- fairseq
---

# GSLM Unit Language Model - HuBERT 200

This is a PyTorch implementation of the Unit Language Model (ULM) from the [Generative Spoken Language Modeling (GSLM)](https://arxiv.org/abs/2102.01192) paper, trained on HuBERT units with 200 clusters.

## Model Details

- **Architecture**: Transformer Language Model (transformer_lm_big)
- **Parameters**: ~215M
- **Vocab Size**: 204 (200 HuBERT units + 4 special tokens)
- **Embedding Dimension**: 1024
- **Layers**: 12
- **Attention Heads**: 16
- **FFN Dimension**: 4096
- **Max Sequence Length**: 3072

## Usage

```python
import torch
from safetensors.torch import load_file
from gslm_ulm import TransformerLanguageModel

# Instantiate the model with the transformer_lm_big configuration
model = TransformerLanguageModel(
    vocab_size=204,
    d_model=1024,
    nhead=16,
    num_layers=12,
    dim_feedforward=4096,
    max_seq_length=3072
)

# Load weights
state_dict = load_file("gslm_hubert200_ulm.safetensors")
model.load_state_dict(state_dict)
model.eval()

# Generate a continuation of a unit sequence
prompt = torch.tensor([[1, 5, 10, 15]])  # Example HuBERT unit sequence
generated = model.generate(
    prompt,
    max_length=100,
    temperature=0.8,
    top_k=50
)
```

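The `generate` method above comes from the `gslm_ulm` module; for readers without it, a top-k/temperature sampling loop of the kind that call implies can be sketched as follows. This is a minimal sketch, not the repo's implementation: `generate_topk`, `model_step`, and the dummy model are illustrative names introduced here, assuming only that the model maps a `(1, T)` tensor of unit ids to `(1, T, vocab)` logits.

```python
import torch

def generate_topk(model_step, prompt, max_length=100, temperature=0.8, top_k=50):
    """Autoregressive top-k sampling: repeatedly score the sequence, keep the
    top_k most likely next units, and sample one from the renormalized set."""
    seq = prompt.clone()
    while seq.size(1) < max_length:
        logits = model_step(seq)[:, -1, :] / temperature   # last-position logits
        topk_vals, topk_idx = torch.topk(logits, top_k, dim=-1)
        probs = torch.softmax(topk_vals, dim=-1)           # renormalize over top-k
        next_unit = topk_idx.gather(-1, torch.multinomial(probs, 1))
        seq = torch.cat([seq, next_unit], dim=1)
    return seq

# Exercise the loop with a dummy uniform-logits "model" (illustrative only):
dummy = lambda ids: torch.zeros(ids.size(0), ids.size(1), 204)
continuation = generate_topk(dummy, torch.tensor([[1, 5, 10, 15]]), max_length=20)
assert continuation.shape == (1, 20)
```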
## Model Architecture

The model follows the transformer_lm_big configuration from fairseq:
- Pre-normalization (layer norm before attention/FFN)
- Sinusoidal positional encoding
- Shared input/output embeddings
- Causal attention mask for autoregressive generation

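Two of these components can be sketched in a few lines of standard PyTorch. This is a generic illustration of sinusoidal positional encoding and the causal mask, assuming the usual Transformer formulation; function names are introduced here and are not part of this repo.

```python
import math
import torch

def sinusoidal_positions(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed sine/cosine positional encodings, used here in place of
    learned position embeddings."""
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions
    return pe

def causal_mask(T: int) -> torch.Tensor:
    """Additive attention mask: position t may only attend to positions <= t
    (future positions get -inf before the softmax)."""
    return torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
```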
## Training Details

- Trained on LibriSpeech using HuBERT-Base features quantized to 200 clusters
- 6,000 hours of unlabeled speech data
- Trained as a causal language model on sequences of discrete units

## Complete GSLM Pipeline

This is the Unit Language Model component of GSLM. For the complete pipeline:
1. **Speech2Unit**: Convert raw audio → discrete units (HuBERT + k-means)
2. **Unit LM**: Generate/continue unit sequences (this model)
3. **Unit2Speech**: Convert units → speech (Tacotron2 + WaveGlow)

## Original Paper

```bibtex
@article{lakhotia2021generative,
  title={On Generative Spoken Language Modeling from Raw Audio},
  author={Lakhotia, Kushal and Kharitonov, Eugene and Hsu, Wei-Ning and Adi, Yossi and Polyak, Adam and Bolte, Benjamin and Nguyen, Tu-Anh and Copet, Jade and Baevski, Alexei and Mohamed, Abdelrahman and Dupoux, Emmanuel},
  journal={Transactions of the Association for Computational Linguistics},
  volume={9},
  pages={1336--1354},
  year={2021}
}
```

## Notes

- This model uses sinusoidal positional encoding instead of learned positional embeddings (functionally equivalent)
- The model expects discrete unit indices as input (not raw audio)
- Units range from 0-199, with additional special tokens (200: EOS, 201: BOS, 202: PAD, 203: UNK)
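The token layout above can be captured as constants, plus a small helper for dropping special tokens from a generated sequence (e.g. before passing units to a unit-to-speech stage). The constants mirror the mapping stated in the notes; the helper and its name are illustrative, not part of this repo.

```python
# Special token ids per the notes above; units 0-199 are HuBERT cluster ids.
EOS, BOS, PAD, UNK = 200, 201, 202, 203
SPECIALS = {EOS, BOS, PAD, UNK}

def strip_specials(units):
    """Remove special tokens, keeping only the 0-199 unit ids."""
    return [u for u in units if u not in SPECIALS]

# Example: BOS and EOS/PAD are stripped, cluster ids survive.
print(strip_specials([201, 1, 5, 200, 202]))  # → [1, 5]
```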