hynt/Zipformer-30M-RNNT-Streaming-6000h

hynt committed on Feb 2

Commit c122fdc · verified · 1 parent: bbb7e3c

Update README.md

Files changed (1)

README.md +3 -3

@@ -5,7 +5,7 @@ license: cc-by-nc-nd-4.0
 ## 🔍 Overview
 The **Vietnamese Streaming Speech-to-Text (ASR)** model is built on the **ZipFormer architecture with chunk size 16,32,64** — an improved variant of the Conformer — featuring only **30 million parameters** yet.
-On CPU, the model can transcribe a **12-second audio clip in just 0.3 seconds**, significantly faster than most traditional ASR systems without requiring a GPU.
+On CPU, the model can transcribe a **1-second audio chunk in just 0.05 seconds**, designed for streaming-based tasks with low latency requirements.
 ---
@@ -67,8 +67,8 @@ Comprehensive details about **training data**, **optimization strategies**, **ar
 | **Device** | **Audio Length** | **Inference Time** |
 |-------------|------------------|--------------------|
-| CPU (Hugging Face Basic) | 12 seconds | **0.3 s** |
-| GPU (RTX 3090) | 12 seconds | **< 0.1 s** |
+| CPU (Hugging Face Basic) | 1 seconds audio chunk | **0.05 s** |
+| GPU (RTX 3090) | 1 seconds audio chunk | **< 0.01 s** |
 ---
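The README states the chunk sizes (16, 32, 64) and the timings without giving units or a real-time factor. The Python sketch below assumes the chunk sizes are frame counts at a 50 Hz frame rate. That is the convention of the `--chunk-size` option in icefall's Zipformer recipes, but this commit does not confirm it for this model. Under that assumption, the sketch converts the chunk sizes to seconds. It also computes the real-time factor (RTF, processing time divided by audio duration) from the CPU figures before and after this commit. The helpers `chunk_seconds` and `real_time_factor` are hypothetical names used only for illustration.

```python
# Illustrative helpers only; nothing here comes from the model repository.

# Assumption: chunk sizes are counted at a 50 Hz frame rate, as in icefall's
# Zipformer `--chunk-size` option. Not confirmed for this specific model.
FRAME_RATE_HZ = 50


def chunk_seconds(chunk_frames: int, frame_rate_hz: int = FRAME_RATE_HZ) -> float:
    """Convert a chunk size given in frames to seconds of audio."""
    return chunk_frames / frame_rate_hz


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration; values below 1.0 keep up with live audio."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


if __name__ == "__main__":
    # Chunk sizes listed in the README overview.
    for frames in (16, 32, 64):
        print(f"chunk {frames} frames -> {chunk_seconds(frames):.2f} s")

    # CPU (Hugging Face Basic) figures: the removed line vs. the added line.
    print(f"old CPU figure (12 s clip): RTF {real_time_factor(0.3, 12.0):.3f}")
    print(f"new CPU figure (1 s chunk): RTF {real_time_factor(0.05, 1.0):.3f}")
```

The two CPU figures describe different workloads: a whole 12-second clip versus a single 1-second chunk. A higher per-chunk RTF therefore does not by itself show that the model became slower. For streaming, the relevant check is that each chunk is processed in less time than the chunk lasts.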