hynt committed on
Commit d31b683 · verified · 1 Parent(s): 14497b0

Update README.md

Files changed (1): README.md (+17 -15)
README.md CHANGED
@@ -1,17 +1,17 @@
 ---
 license: cc-by-nc-nd-4.0
 ---
-# Vietnamese Speech-to-Text (ASR) — ZipFormer-30M-RNNT-6000h

 ## 🔍 Overview
-The **Vietnamese Speech-to-Text (ASR)** model is built on the **ZipFormer architecture** — an improved variant of the Conformer — featuring only **30 million parameters** yet achieving **exceptional performance** in both speed and accuracy.
-On CPU, the model can transcribe a **12-second audio clip in just 0.3 seconds**, significantly faster than most traditional ASR systems without requiring a GPU.

 ---

 ## 🚀 Online Demo

-You can try the model directly here:
 👉 https://huggingface.co/spaces/hynt/k2-automatic-speech-recognition-demo

 ---
@@ -20,7 +20,8 @@ You can try the model directly here:
 - **Architecture:** ZipFormer
 - **Parameters:** ~30M
 - **Language:** Vietnamese
-- **Loss Function:** RNN-Transducer (RNNT Loss)
 - **Framework:** PyTorch + k2
 - **Training strategy:** Carefully preprocess the data, apply an augmentation strategy based on the distribution of out-of-vocabulary (OOV) tokens, and refine the transcriptions using Whisper.
 - **Optimized for:** High-speed CPU inference
@@ -41,21 +42,21 @@ The model was trained on approximately **6000 hours of high-quality Vietnamese s

 ## 🧪 Evaluation Results

-| **Dataset** | **ZipFormer-30M-6000h** | **ChunkFormer-110M-3000h** | **PhoWhisper-Large-1.5B-800h** | **VietASR-ZipFormer-68M-70,000h** |
-|--------------|--------------------------|-----------------------------|--------------------------------|---------------------------------|
-| **VLSP2020-Test-T1** | **12.29** | 14.09 | 13.75 | 14.45 |
-| **VLSP2023-PublicTest** | **10.40** | 16.15 | 16.83 | 14.70 |
-| **VLSP2023-PrivateTest** | **11.10** | 17.12 | 17.10 | 15.07 |
-| **VLSP2025-PublicTest** | **7.97** | 15.55 | 16.14 | 13.55 |
-| **VLSP2025-PrivateTest** | **8.10** | 16.07 | 16.31 | 13.97 |
-| **GigaSpeech2-Test** | 7.56 | 10.35 | 10.00 | **6.88** |

 > Lower is better (WER %)

 ---

 ## 🏆 Achievements
-By training this model architecture on 4,000 hours of data, I **won First Place** in the **Vietnamese Language Speech Processing (VLSP)** competition **2025**.
 Comprehensive details about **training data**, **optimization strategies**, **architecture improvements**, and **evaluation methodologies** are available in the paper below:

 👉 [Read the full paper on Overleaf](https://www.overleaf.com/read/wjntrgchhbgv#48aa25)
@@ -76,9 +77,10 @@ Comprehensive details about **training data**, **optimization strategies**, **ar
 Please refer to the following guides for instructions on how to run and deploy this model:
 - **For Torch JIT Script:** [https://k2-fsa.github.io/sherpa/](https://k2-fsa.github.io/sherpa/)
 - **For ONNX:** [https://k2-fsa.github.io/sherpa/onnx/](https://k2-fsa.github.io/sherpa/onnx/)

 ## 💬 Summary
-The **ZipFormer-30M-RNNT-6000h** model demonstrates that a lightweight architecture can still achieve state-of-the-art accuracy for Vietnamese ASR.
 It is designed for **fast deployment on CPU-based systems**, making it ideal for **real-time speech recognition**, **callbots**, and **embedded speech interfaces**.

 ---
 
 ---
 license: cc-by-nc-nd-4.0
 ---
+# Vietnamese Streaming Speech-to-Text (ASR) — ZipFormer-30M-RNNT-Streaming-6000h

 ## 🔍 Overview
+The **Vietnamese Streaming Speech-to-Text (ASR)** model is built on the **ZipFormer architecture with chunk sizes 16, 32, and 64** — an improved variant of the Conformer — featuring only **30 million parameters** yet achieving **exceptional performance** in both speed and accuracy.
+On CPU, the non-streaming model can transcribe a **12-second audio clip in just 0.3 seconds**, significantly faster than most traditional ASR systems without requiring a GPU.

 ---

 ## 🚀 Online Demo

+You can test the streaming and non-streaming models directly here:
 👉 https://huggingface.co/spaces/hynt/k2-automatic-speech-recognition-demo

 ---
 
 - **Architecture:** ZipFormer
 - **Parameters:** ~30M
 - **Language:** Vietnamese
+- **Loss Function:** RNN-Transducer (RNNT Loss)
+- **Chunk Size:** 16, 32, 64
 - **Framework:** PyTorch + k2
 - **Training strategy:** Carefully preprocess the data, apply an augmentation strategy based on the distribution of out-of-vocabulary (OOV) tokens, and refine the transcriptions using Whisper.
 - **Optimized for:** High-speed CPU inference
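To illustrate what the chunk sizes above mean in a streaming setup, here is a minimal hypothetical sketch (not the actual k2/icefall API): the encoder consumes acoustic feature frames in fixed-size chunks and can emit partial hypotheses after each chunk instead of waiting for the full utterance:

```python
# Hypothetical illustration of chunk-wise streaming: the encoder sees
# `chunk_size` feature frames at a time rather than the whole utterance.
def iter_chunks(frames, chunk_size):
    """Yield consecutive chunks of `chunk_size` frames (last may be shorter)."""
    for start in range(0, len(frames), chunk_size):
        yield frames[start:start + chunk_size]

frames = list(range(100))  # stand-in for 100 acoustic feature frames
print([len(c) for c in iter_chunks(frames, 32)])  # [32, 32, 32, 4]
```

A smaller chunk (16) gives lower latency at some accuracy cost, while a larger chunk (64) trades latency for more acoustic context per decoding step.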
 
 ## 🧪 Evaluation Results

+| **Dataset** | **ZipFormer-30M-6000h** | **ZipFormer-30M-Streaming-6000h** | **ChunkFormer-110M-3000h** | **PhoWhisper-Large-1.5B-800h** | **VietASR-ZipFormer-68M-70,000h** |
+|--------------|--------------------------|------------------------------------|-----------------------------|--------------------------------|---------------------------------|
+| **VLSP2020-Test-T1** | **12.29** | -- | 14.09 | 13.75 | 14.45 |
+| **VLSP2023-PublicTest** | **10.40** | -- | 16.15 | 16.83 | 14.70 |
+| **VLSP2023-PrivateTest** | **11.10** | -- | 17.12 | 17.10 | 15.07 |
+| **VLSP2025-PublicTest** | **7.97** | -- | 15.55 | 16.14 | 13.55 |
+| **VLSP2025-PrivateTest** | **8.10** | -- | 16.07 | 16.31 | 13.97 |
+| **GigaSpeech2-Test** | 7.56 | -- | 10.35 | 10.00 | **6.88** |

 > Lower is better (WER %)
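For reference, the WER percentages in the table are word-level edit distance divided by the number of reference words. A minimal sketch of that metric (illustrative only, not the evaluation code used for these results):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four reference words -> 25% WER.
print(f"{100 * wer('xin chào các bạn', 'xin chao các bạn'):.2f}%")  # 25.00%
```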
 
 ---

 ## 🏆 Achievements
+By training this non-streaming model architecture on 4,000 hours of data, I **won First Place** in the **Vietnamese Language Speech Processing (VLSP)** competition **2025**.
 Comprehensive details about **training data**, **optimization strategies**, **architecture improvements**, and **evaluation methodologies** are available in the paper below:

 👉 [Read the full paper on Overleaf](https://www.overleaf.com/read/wjntrgchhbgv#48aa25)
 
 Please refer to the following guides for instructions on how to run and deploy this model:
 - **For Torch JIT Script:** [https://k2-fsa.github.io/sherpa/](https://k2-fsa.github.io/sherpa/)
 - **For ONNX:** [https://k2-fsa.github.io/sherpa/onnx/](https://k2-fsa.github.io/sherpa/onnx/)
+- **For Streaming Web Test:** [https://github.com/k2-fsa/sherpa/tree/master/sherpa/bin](https://github.com/k2-fsa/sherpa/tree/master/sherpa/bin)
 
 ## 💬 Summary
+The **ZipFormer-30M-RNNT-6000h** and **ZipFormer-30M-RNNT-Streaming-6000h** models demonstrate that a lightweight architecture can still achieve state-of-the-art accuracy for Vietnamese ASR.
 It is designed for **fast deployment on CPU-based systems**, making it ideal for **real-time speech recognition**, **callbots**, and **embedded speech interfaces**.

 ---