Tabahi committed on
Commit 2be2dc3 · verified · 1 Parent(s): 0eef6aa

Upload folder using huggingface_hub

Browse files
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ plots/ug20_multilingual_mswc38.png filter=lfs diff=lfs merge=lfs -text
+ plots/uh03b_confusion_probs_heatmap_libri_dev_en.png filter=lfs diff=lfs merge=lfs -text
+ plots/where_they_went_timeline.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,2 +1,91 @@
  # CUPE: Contextless Universal Phoneme Encoder
- pytorch model for contexless-phoneme prediction from speech audio
+
+ A PyTorch model for contextless phoneme prediction from speech audio. CUPE processes each 120 ms frame independently, so every frame's embeddings stay acoustically pure, unlike transformer models that mix context across frames.
+
+ ## Trained Models
+
+ Two 30.1M-parameter models are available in the [checkpoints directory](https://huggingface.co/Tabahi/CUPE-2i/resolve/main/ckpt).
+
+ ## Datasets
+
+ - **LibriSpeech ASR corpus (SLR12):** 960 hours of English speech from the train-100, train-360, and train-500 splits.
+ - **Multilingual LibriSpeech (MLS) (SLR94):** 800 hours total, 100 hours each for 8 languages (`pl`, `pt`, `it`, `es`, `fr`, `nl`, `de`, `en`), using the dataset's own train/test/val splits.
+ - **MSWC Multilingual Spoken Words Corpus:** 240 hours from 50 languages (max 10 hours per language).
+ - **Training:** 38 languages (`en`, `de`, `fr`, `ca`, `es`, `fa`, `it`, `ru`, `pl`, `eu`, `cy`, `eo`, `nl`, `pt`, `tt`, `cs`, `tr`, `et`, `ky`, `id`, `sv-SE`, `ar`, `el`, `ro`, `lv`, `sl`, `zh-CN`, `ga-IE`, `ta`, `vi`, `gn`, `or`)
+ - **Testing:** 6 languages (`lt`, `mt`, `ia`, `sk`, `ka`, `as`)
+
+ ## Metrics
+
+ **English ([en_libri1000_uj01d](https://huggingface.co/Tabahi/CUPE-2i/resolve/main/ckpt/en_libri1000_uj01d_e199_val_GER=0.2307.ckpt)):**
+ - **PER:** 0.25 (Phoneme Error Rate)
+ - **GER:** 0.23 (Phoneme Group Error Rate)
+
+ **Multilingual MLS ([multi_MLS8_uh02](https://huggingface.co/Tabahi/CUPE-2i/resolve/main/ckpt/multi_MLS8_uh02_e36_val_GER=0.2334.ckpt)):**
+ - **PER:** 0.31
+ - **GER:** 0.26
+
+ **Multilingual MSWC ([multi_mswc38_ug20](https://huggingface.co/Tabahi/CUPE-2i/resolve/main/ckpt/multi_mswc38_ug20_e59_val_GER=0.5611.ckpt)):**
+ - **PER:** 0.49
+ - **GER:** 0.39
+
+ ---
+
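PER is the phoneme-level edit (Levenshtein) distance between the predicted and reference phoneme sequences, divided by the reference length; GER is the same computation over phoneme groups. A minimal sketch of the PER arithmetic (the helper function and the example sequences are illustrative, not the repo's implementation):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,         # deletion
                        dp[j - 1] + 1,     # insertion
                        prev + (r != h))   # substitution (free if equal)
            prev = cur
    return dp[-1]

ref = ["k", "ae", "t"]   # reference phonemes
hyp = ["k", "ah", "t"]   # predicted phonemes
per = edit_distance(ref, hyp) / len(ref)
print(round(per, 2))  # 0.33, one substitution over three phonemes
```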
+ # Usage
+
+ See [run.py](https://huggingface.co/Tabahi/CUPE-2i/blob/main/run.py) for a complete example.
+
+ ```python
+ import torch
+ import torchaudio
+ from model2i import CUPEEmbeddingsExtractor  # main CUPE feature extractor
+ import windowing  # provides slice_windows, stich_window_predictions
+
+ cupe_ckpt_path = "./ckpt/en_libri1000_uj01d_e199_val_GER=0.2307.ckpt"
+ extractor = CUPEEmbeddingsExtractor(cupe_ckpt_path, device="cuda")
+
+ sample_rate = 16000      # CUPE operates on 16 kHz audio
+ window_size_ms = 120     # independent analysis window
+ stride_ms = 80           # hop between windows (example value)
+ max_wav_len = sample_rate * 10  # e.g. a 10 s clip
+
+ dummy_wav = torch.zeros(1, max_wav_len, dtype=torch.float32, device="cpu")
+ audio_batch = dummy_wav.unsqueeze(0)  # add batch dimension
+
+ # Window the audio
+ windowed_audio = windowing.slice_windows(
+     audio_batch.to("cuda"),
+     sample_rate,
+     window_size_ms,
+     stride_ms
+ )
+ batch_size, num_windows, window_size = windowed_audio.shape
+ windows_flat = windowed_audio.reshape(-1, window_size)
+
+ logits, _ = extractor.predict(windows_flat, return_embeddings=False, groups_only=False)
+ frames_per_window = logits.shape[1]  # frames produced per window
+
+ # Reshape and stitch window predictions back into one timeline
+ logits = logits.reshape(batch_size, num_windows, frames_per_window, -1)
+ logits = windowing.stich_window_predictions(
+     logits,
+     original_audio_length=audio_batch.size(2),
+     cnn_output_size=frames_per_window,
+     sample_rate=sample_rate,
+     window_size_ms=window_size_ms,
+     stride_ms=stride_ms
+ )
+
+ print(logits.shape)  # [B, T, 66]
+ ```
+
+ # Use Cases
+
+ - Timestamp alignment (examples coming soon)
+ - Speech analysis
+
+ ## Sample Probabilities Timeline
+
+ ![Sample output logits plot](plots/where_they_went_timeline.png)
+
+ ## Multilingual Confusion Plot (Counts)
+
+ ![Multilingual confusion plot (counts)](plots/ug20_multilingual_mswc38.png)
+
+ ## English-only Confusion Plot (Probabilities)
+
+ ![English-only confusion plot (probabilities)](plots/uh03b_confusion_probs_heatmap_libri_dev_en.png)
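Once stitched, the `[B, T, 66]` logits can be reduced to frame-level phoneme predictions with a plain argmax. The tensor below is a random stand-in for real model output, used only to show the shapes:

```python
import torch

# Random stand-in for the stitched [B, T, 66] logits from the usage example
logits = torch.randn(2, 120, 66)

probs = torch.softmax(logits, dim=-1)   # per-frame class probabilities
phoneme_ids = probs.argmax(dim=-1)      # most likely phoneme index per frame
print(phoneme_ids.shape)  # torch.Size([2, 120])
```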
ckpt/en_libri1000_uj01d_e199_val_GER=0.2307.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f9129933e707c5ad4da3213f832b81f7b0a57df8597ff57babfac25493bcf8a7
+ size 120485062
ckpt/en_libri1000_uj01d_e62_val_GER=0.2438.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7925c404640eae093b584c40123426ef28c5aedc85bddee478e0bc5d2db32c92
+ size 120485062
ckpt/multi_MLS8_uh02_e36_val_GER=0.2334.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bf4fb0a387074514db0a17871e059ef08e21fc057b8ee283d5229c14f9c4933a
+ size 120485126
ckpt/multi_mswc38_ug20_e59_val_GER=0.5611.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7f7a070462b4ae73be05f41ac7f7ee5734b1396da10ab1a2c6e7c72df4ee2f1a
+ size 120488006
plots/ug20_multilingual_mswc38.png ADDED

Git LFS Details

  • SHA256: 47c400994609a5190b02f698702235b794957a022e2efa094a6f1e942d88879e
  • Pointer size: 131 Bytes
  • Size of remote file: 534 kB
plots/uh03b_confusion_probs_heatmap_libri_dev_en.png ADDED

Git LFS Details

  • SHA256: 43fedf0588baef497cfdaf3b9697743b2c2b7e0faf892e4c6449c07b4d933f95
  • Pointer size: 131 Bytes
  • Size of remote file: 425 kB
plots/where_they_went_timeline.png ADDED

Git LFS Details

  • SHA256: b42ad98b7b7c6f6a9161f859c25e2c780eb5501894bd0f58e61e686b8c1ddaae
  • Pointer size: 131 Bytes
  • Size of remote file: 497 kB
run.py CHANGED
@@ -262,7 +262,7 @@ if __name__ == "__main__":
      torch.manual_seed(42)
 
- cupe_ckpt_path = "ckpt/m_uj01d_epoch=62_step=326088_val_GER=0.2438copy.ckpt"
+ cupe_ckpt_path = "ckpt/en_libri1000_uj01d_e199_val_GER=0.2307.ckpt"
  pipeline = EmbeddingsExtractionPipeline(cupe_ckpt_path, max_duration=10, device="cpu", verbose=False)
 
  audio_clip1_path = "samples/109867__timkahn__butterfly.wav.wav"