ONNX
niobures committed
Commit d13707b · verified · 1 Parent(s): e13bd1b

Deskpai CTC Forced Aligner

Files changed (3)
  1. 04ac86b67129634da93aea76e0147ef3.onnx +3 -0
  2. README.md +172 -0
  3. source.txt +1 -0
04ac86b67129634da93aea76e0147ef3.onnx ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:e8bad67fd3533b3d3c145b0ca31bb15383945c13384dd8975baaa7b73f7b61ac
size 1262421764
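The `.onnx` entry above is a Git LFS pointer file (version, oid, size), not the 1.2 GB model itself. As an illustration, a pointer like this can be parsed with a few lines of stdlib Python (the helper name is ours, not part of any tool):

```python
def parse_lfs_pointer(text: str) -> dict:
    # Each line of a Git LFS pointer is "key value".
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    fields["size"] = int(fields["size"])  # size is the byte count of the real file
    return fields

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:e8bad67fd3533b3d3c145b0ca31bb15383945c13384dd8975baaa7b73f7b61ac
size 1262421764
"""
info = parse_lfs_pointer(pointer)
print(info["size"])  # 1262421764 bytes, about 1.26 GB
```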
README.md ADDED
---
license: cc-by-nc-4.0
---

# 🎯 CTC Forced Aligner

We are open-sourcing the CTC forced aligner used in [Deskpai](https://www.deskpai.com).

Focused on production-ready model inference, it supports 18 alignment models, including multilingual models (German, English, Spanish, French, Italian, etc.), and provides SRT and WebVTT alignment and generation out of the box. It supports both ONNXRuntime and PyTorch for model serving.

[Home Page](https://github.com/deskpai/ctc_forced_aligner)

## 🚀 Installation

- CPU inference via ONNXRuntime

```bash
pip install ctc_forced_aligner
```

- GPU inference via ONNXRuntime

```bash
pip install ctc_forced_aligner[gpu]
```

- CPU/GPU inference via PyTorch

```bash
pip install ctc_forced_aligner[torch]
```

- Install all dependencies

```bash
pip install ctc_forced_aligner[all]
```

## 📝 Sample Inference Code

- CPU/GPU inference via ONNXRuntime

```python
from ctc_forced_aligner import AlignmentSingleton

alignment_service = AlignmentSingleton()
input_audio_path = "audio.mp3"
input_text_path = "input.txt"
output_srt_path = "output.srt"
ret = alignment_service.generate_srt(input_audio_path,
                                     input_text_path,
                                     output_srt_path)
if ret:
    print(f"Aligned SRT generated at {output_srt_path}")

output_vtt_path = "output.vtt"
ret = alignment_service.generate_webvtt(input_audio_path,
                                        input_text_path,
                                        output_vtt_path)
if ret:
    print(f"Aligned WebVTT generated at {output_vtt_path}")
```
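The two output formats differ mainly in framing: SRT separates milliseconds with a comma, while WebVTT uses a dot and begins with a `WEBVTT` header. A stdlib sketch of the two timestamp conventions (illustrative only, not the package's internal code):

```python
def srt_timestamp(seconds: float) -> str:
    # SRT timestamps look like HH:MM:SS,mmm (comma before milliseconds).
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def webvtt_timestamp(seconds: float) -> str:
    # WebVTT uses a dot instead: HH:MM:SS.mmm
    return srt_timestamp(seconds).replace(",", ".")

# Example: a cue starting 83.5 seconds into the audio
print(srt_timestamp(83.5))     # 00:01:23,500
print(webvtt_timestamp(83.5))  # 00:01:23.500
```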

- CPU/GPU inference via PyTorch

```python
from ctc_forced_aligner import AlignmentTorch

at = AlignmentTorch()
input_audio_path = "audio.mp3"
input_text_path = "input.txt"
output_srt_path = "output.srt"
output_vtt_path = "output.vtt"
ret = at.generate_srt(input_audio_path, input_text_path, output_srt_path)
if ret:
    print(f"Aligned SRT generated at {output_srt_path}")
ret = at.generate_webvtt(input_audio_path, input_text_path, output_vtt_path)
if ret:
    print(f"Aligned WebVTT generated at {output_vtt_path}")
```

- Inference with multiple models

```python
from ctc_forced_aligner import AlignmentTorch

at = AlignmentTorch()
input_audio_path = "audio.mp3"
input_text_path = "input.txt"
output_srt_path = "output.srt"
output_vtt_path = "output.vtt"
ret = at.generate_srt(input_audio_path, input_text_path, output_srt_path,
                      model_type='WAV2VEC2_ASR_BASE_960H')
if ret:
    print(f"Aligned SRT generated at {output_srt_path}")
ret = at.generate_webvtt(input_audio_path, input_text_path, output_vtt_path,
                         model_type='WAV2VEC2_ASR_BASE_960H')
if ret:
    print(f"Aligned WebVTT generated at {output_vtt_path}")
```

## Models Supported

### ✅ Wav2Vec2 Models

These are fine-tuned models with a **CTC-based ASR head**:
- `WAV2VEC2_ASR_BASE_960H`
- `WAV2VEC2_ASR_BASE_100H`
- `WAV2VEC2_ASR_BASE_10M`
- `WAV2VEC2_ASR_LARGE_10M`
- `WAV2VEC2_ASR_LARGE_100H`
- `WAV2VEC2_ASR_LARGE_960H`
- `WAV2VEC2_ASR_LARGE_LV60K_10M`
- `WAV2VEC2_ASR_LARGE_LV60K_100H`
- `WAV2VEC2_ASR_LARGE_LV60K_960H`

### ✅ VoxPopuli Models (Multilingual)

These models are fine-tuned for **specific languages** on the **VoxPopuli** speech corpus:
- `VOXPOPULI_ASR_BASE_10K_DE` (German ASR)
- `VOXPOPULI_ASR_BASE_10K_EN` (English ASR)
- `VOXPOPULI_ASR_BASE_10K_ES` (Spanish ASR)
- `VOXPOPULI_ASR_BASE_10K_FR` (French ASR)
- `VOXPOPULI_ASR_BASE_10K_IT` (Italian ASR)

### ✅ HuBERT Models

- `HUBERT_ASR_LARGE`
- `HUBERT_ASR_XLARGE`
119
+ ## πŸ’‘ Which One and How to Use?
120
+
121
+ **For PyTorch serving**, use `AlignmentTorch` or `AlignmentTorchSingleton`.
122
+
123
+ - **For English ASR** β†’ `WAV2VEC2_ASR_LARGE_960H` or `HUBERT_ASR_LARGE`
124
+ - **For multilingual ASR** β†’ `VOXPOPULI_ASR_BASE_10K_*`
125
+ - **For low-resource ASR** β†’ `WAV2VEC2_ASR_BASE_10M` (smallest model)
126
+ - **For best accuracy** β†’ `WAV2VEC2_ASR_LARGE_LV60K_960H` or `HUBERT_ASR_XLARGE`
127
+
128
+ **For ONNXRuntime serving** with minimum dependencies, use `Alignment` or `AlignmentSingleton`.
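
This guidance can be condensed into a small dispatch helper. The mapping below only restates the recommendations in this README; the function itself is hypothetical and not part of the package:

```python
# Hypothetical helper: pick a model_type string from this README's model list.
VOXPOPULI_BY_LANG = {
    "de": "VOXPOPULI_ASR_BASE_10K_DE",
    "en": "VOXPOPULI_ASR_BASE_10K_EN",
    "es": "VOXPOPULI_ASR_BASE_10K_ES",
    "fr": "VOXPOPULI_ASR_BASE_10K_FR",
    "it": "VOXPOPULI_ASR_BASE_10K_IT",
}

def pick_model(lang: str, accuracy_first: bool = False) -> str:
    """Return a model_type string following the guidance above."""
    if lang == "en":
        # English: large LV60K model for best accuracy, otherwise LARGE_960H.
        return ("WAV2VEC2_ASR_LARGE_LV60K_960H" if accuracy_first
                else "WAV2VEC2_ASR_LARGE_960H")
    # Other supported languages: VoxPopuli; fall back to the smallest model.
    return VOXPOPULI_BY_LANG.get(lang, "WAV2VEC2_ASR_BASE_10M")

print(pick_model("fr"))                       # VOXPOPULI_ASR_BASE_10K_FR
print(pick_model("en", accuracy_first=True))  # WAV2VEC2_ASR_LARGE_LV60K_960H
```

The returned string can then be passed as `model_type=` exactly as in the multi-model example above.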

Please contact [us](mailto:dev@deskpai.com) if you want to integrate your model into this package.

## 📄 License

### Code

- This project includes code from [pytorch/audio](https://github.com/pytorch/audio), licensed under the `BSD-2-Clause` license.
- This project includes code from [MahmoudAshraf97/ctc-forced-aligner](https://github.com/MahmoudAshraf97/ctc-forced-aligner), licensed under the `BSD` license. Its upstream notice states: "This project is licensed under the BSD License; note that the default model has a CC-BY-NC 4.0 License, so make sure to use a different model for commercial usage."
- Modifications and additional code are contributed by [Deskpai.com](https://www.deskpai.com) and licensed under the [DOSL-1.0 license](https://github.com/deskpai/deskpai/blob/main/LICENSE).

### Model

- The following models were developed by Meta AI (formerly Facebook AI) under the `MIT License` and are redistributed under the same license:
  - `WAV2VEC2_ASR_BASE_960H`
  - `WAV2VEC2_ASR_BASE_100H`
  - `WAV2VEC2_ASR_BASE_10M`
  - `WAV2VEC2_ASR_LARGE_10M`
  - `WAV2VEC2_ASR_LARGE_100H`
  - `WAV2VEC2_ASR_LARGE_960H`
  - `WAV2VEC2_ASR_LARGE_LV60K_10M`
  - `WAV2VEC2_ASR_LARGE_LV60K_100H`
  - `WAV2VEC2_ASR_LARGE_LV60K_960H`
- The VoxPopuli and HuBERT models were also developed by Meta AI and are generally released under the MIT License; the exact terms can be found in their respective repositories or documentation, so please verify them yourself:
  - `VOXPOPULI_ASR_BASE_10K_DE`
  - `VOXPOPULI_ASR_BASE_10K_EN`
  - `VOXPOPULI_ASR_BASE_10K_ES`
  - `VOXPOPULI_ASR_BASE_10K_FR`
  - `VOXPOPULI_ASR_BASE_10K_IT`
  - `HUBERT_ASR_LARGE`
  - `HUBERT_ASR_XLARGE`
- The model `MMS_FA` is published by the authors of *Scaling Speech Technology to 1,000+ Languages* (Pratap et al., 2023) under the `CC-BY-NC 4.0 License`.
- The ONNX model weights were created by [Deskpai.com](https://www.deskpai.com) based on [MahmoudAshraf/mms-300m-1130-forced-aligner](https://huggingface.co/MahmoudAshraf/mms-300m-1130-forced-aligner) and are under the `CC-BY-NC 4.0 License`.

📝 Note: It is essential to verify the licensing terms in the official repositories or documentation before using these models.

## 🙏 References

- [Less Peaky and More Accurate CTC Forced Alignment by Label Priors](https://arxiv.org/pdf/2406.02560)
- [Montreal Forced Aligner User Guide](https://montreal-forced-aligner.readthedocs.io/en/stable/user_guide/index.html)
- [Forced Alignment with Wav2Vec2](https://pytorch.org/audio/main/tutorials/forced_alignment_tutorial.html)
- [NeuFA: Neural Network Based End-to-End Forced Aligner](https://arxiv.org/abs/2203.16838)
- [Tradition or Innovation: A Comparison of Modern ASR Methods for Forced Alignment](https://arxiv.org/pdf/2406.19363v1)
source.txt ADDED
https://huggingface.co/deskpai/ctc_forced_aligner