Commit 2f2f0dd · DataoceanAI · 0 parent(s)

Duplicate from DataoceanAI1/dolphin-cn-dialect-base-streaming

Files changed (7)
  1. .gitattributes +37 -0
  2. Dolphin-CN-Dialect.png +3 -0
  3. README.md +210 -0
  4. base.cn.streaming.pt +3 -0
  5. global_cmvn +1 -0
  6. train.yaml +113 -0
  7. units.txt +0 -0
.gitattributes ADDED
@@ -0,0 +1,37 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ dolphin_fangyan_feature_poster_v3.png filter=lfs diff=lfs merge=lfs -text
+ Dolphin-CN-Dialect.png filter=lfs diff=lfs merge=lfs -text
Dolphin-CN-Dialect.png ADDED

Git LFS Details

  • SHA256: d3be711351ad48bac444210e782bc074e1b3d0910a5c895defa9745af0dcfb8f
  • Pointer size: 131 Bytes
  • Size of remote file: 524 kB
README.md ADDED
@@ -0,0 +1,210 @@
+ ---
+ frameworks:
+ - ""
+ language:
+ - zh
+ license: apache-2.0
+ tags:
+ - speech
+ - asr
+ tasks: []
+ ---
+
+ # Dolphin-CN-Dialect
+
+ [Paper](https://arxiv.org/abs/2605.08961)
+ [GitHub](https://github.com/DataoceanAI/Dolphin)
+ [Hugging Face](https://huggingface.co/DataoceanAI)
+ [ModelScope](https://www.modelscope.cn/organization/DataoceanAI)
+
+ **Dolphin-CN-Dialect** is a multi-dialect ASR model developed by Dataocean AI and Tsinghua University, with a strong focus on Chinese dialect recognition and real-world deployment scenarios. Compared with the previous Dolphin series, Dolphin-CN-Dialect introduces significant improvements in tokenizer design, dialect-balanced training, streaming capability, hotword biasing, and deployment efficiency.
+
+ The model supports Mandarin Chinese and 22 Chinese dialects, while also maintaining multilingual ASR capability inherited from Dolphin. Dolphin-CN-Dialect supports both streaming and non-streaming inference, enabling practical deployment in latency-sensitive applications such as real-time transcription and industrial speech recognition systems.
+
+
+ ## Approach
+
+ Dolphin-CN-Dialect is built upon the Dolphin architecture and follows a joint CTC-Attention framework with:
+
+ * Encoder: E-Branchformer
+ * Decoder: Transformer Decoder
+ * Training Objective: Joint CTC + Attention loss
+
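Reviewer note on the training objective listed above: a joint CTC + attention loss is usually a weighted interpolation of the two terms. The sketch below assumes that standard form and uses the `ctc_weight: 0.3` value from the `train.yaml` in this commit; the function name and the example loss values are hypothetical, and the README does not spell out the exact formula.

```python
def joint_ctc_attention_loss(loss_ctc: float, loss_att: float,
                             ctc_weight: float = 0.3) -> float:
    """Interpolate CTC and attention losses (assumed standard hybrid form)."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    # ctc_weight scales the CTC term; the remainder goes to the attention term.
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att


# Hypothetical per-batch loss values, for illustration only.
example_loss = joint_ctc_attention_loss(loss_ctc=2.0, loss_att=1.0)
print(example_loss)
```

With `ctc_weight=0.3`, the attention decoder dominates the gradient while the CTC branch still constrains the encoder toward monotonic alignments.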
+ Compared to Dolphin, Dolphin-CN-Dialect introduces several important improvements:
+
+ * Temperature-based data sampling for balancing standard Mandarin and low-resource dialects
+ * Redesigned tokenizer with:
+   * character-level modeling for Chinese
+   * BPE-based subword modeling for English
+   * extensible dialect tokens
+ * Streaming ASR support
+ * Hotword-biased decoding, including:
+   * encoder-level contextual biasing
+   * prompt-based decoder biasing
+
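Reviewer note on the temperature-based sampling bullet above: the README does not give the formula, so this is a minimal sketch of one common formulation, in which each dataset is sampled with probability proportional to its size raised to `1 / T`. The function name, the dialect labels, and the hour counts are all hypothetical.

```python
def temperature_sampling_probs(sizes: dict, temperature: float) -> dict:
    """Return sampling probabilities proportional to size ** (1 / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    weights = {name: size ** (1.0 / temperature) for name, size in sizes.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical training hours per variety, for illustration only.
hours = {"mandarin": 10000.0, "sichuan": 500.0, "wenzhou": 50.0}

proportional = temperature_sampling_probs(hours, temperature=1.0)  # raw proportions
flattened = temperature_sampling_probs(hours, temperature=5.0)     # closer to uniform

for name in hours:
    print(f"{name}: T=1 -> {proportional[name]:.3f}, T=5 -> {flattened[name]:.3f}")
```

At `T = 1` sampling follows the raw data proportions; larger `T` moves the distribution toward uniform, so low-resource dialects are seen more often at the cost of repeating their data.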
+ Experimental results show that Dolphin-CN-Dialect achieves:
+
+ * 38% improvement in dialect recognition accuracy
+ * 16.3% relative CER reduction over Dolphin
+ * Competitive performance with recent large-scale ASR systems while maintaining a smaller model size
+
+ ![Dolphin-CN-Dialect feature poster](Dolphin-CN-Dialect.png)
+
+
+ See details in the [Paper](https://arxiv.org/abs/2503.20212).
+
+
+ ## Setup
+
+ Dolphin-CN-Dialect requires FFmpeg to convert audio files into WAV format. Please install FFmpeg first if it is not already installed on your system.
+
+ ```shell
+ # Ubuntu / Debian
+ sudo apt update && sudo apt install ffmpeg
+ # macOS
+ brew install ffmpeg
+ # Windows
+ choco install ffmpeg
+ ```
+
+ Install Dolphin with pip:
+
+ ```shell
+ pip install -U dolphin
+ ```
+
+ Alternatively, install from source:
+
+ ```shell
+ pip install git+https://github.com/DataoceanAI/Dolphin.git
+ ```
+
+ ## Available Models
+
+ Currently, Dolphin-CN-Dialect provides multiple model sizes optimized for different deployment scenarios.
+
+ | Model | Parameters | Hotwords |
+ |:------:|:----------:|:----------:|
+ | base.cn | 0.1 B | ❌ |
+ | base.cn.streaming | 0.1 B | ❌ |
+ | small.cn | 0.4 B | Encoder-biased Hotwords |
+ | small.cn.streaming | 0.4 B | Encoder-biased Hotwords |
+ | small.cn.prompt | 0.4 B | Prompt-based Hotwords |
+
+
+ ## Hotword Biasing
+
+ Dolphin-CN-Dialect supports two hotword biasing approaches.
+
+ **Encoder-Level Contextual Biasing**
+
+ * Supports both streaming and non-streaming models
+ * Integrates contextual embeddings into encoder representations
+ * Efficient adaptation without retraining the full model
+
+ **Prompt-Based Hotword Biasing**
+
+ * Designed for non-streaming models
+ * Injects hotwords directly into decoder prompts
+ * Particularly effective for long-tail and rare phrases
+
+ Experimental results show significant reductions in hotword error rates while maintaining strong overall ASR performance.
+
+
+
+ ## Supported Languages and Dialects
+
+ Dolphin-CN-Dialect primarily focuses on:
+
+ * Mandarin Chinese
+ * 22 Chinese dialects
+ * Regional accented Mandarin
+
+ Supported dialects include:
+
+ * Sichuan
+ * Wu
+ * Minnan
+ * Shanghai
+ * Gansu
+ * Guangdong
+ * Wenzhou
+ * Hunan
+ * Anhui
+ * Henan
+ * Fujian
+ * Hebei
+ * Liaoning
+ * Shaanxi
+ * Tianjin
+ * and more
+
+ For the complete language and dialect list, see [languages.md](./languages.md).
+
+ ## Supported Devices
+
+ | Device Type | Support Status |
+ |:-------------:|:----------------:|
+ | **CUDA** | ✅ Supported |
+ | **MPS (Apple)** | ✅ Supported |
+ | **CPU** | ✅ Supported |
+
+
+
+
+ ## Usage
+
+ ### Command-line usage
+
+ ```shell
+ dolphin audio.wav
+
+ # Download model and specify the model path
+ dolphin audio.wav --model small.cn --model_dir /data/models/dolphin/
+
+ # Specify language and region
+ dolphin audio.wav --model small.cn --model_dir /data/models/dolphin/ --lang_sym "zh" --region_sym "CN"
+
+ # Specify the hotwords file with the encoder-biased method
+ dolphin audio.wav --model small.cn --model_dir /data/models/dolphin/ --hotword_list_path hotwords.txt --use_deep_biasing true
+
+ # Use the prompt-based model
+ dolphin audio.wav --model small.cn.prompt --model_dir /data/models/dolphin/ --hotword_list_path hotwords.txt --use_prompt_hotword true --use_two_stage_filter true
+
+ ```
+
+ ### Python usage
+
+ ```python
+ import dolphin
+ from dolphin import transcribe
+
+ model_name = 'small.cn'
+ model = dolphin.load_model(model_name, device="cuda")
+
+ result = transcribe(model, 'audio.wav')
+ print(result.text)
+
+ # Specify language
+ result = transcribe(model, 'audio.wav', lang_sym="zh")
+ print(result.text)
+
+ # Specify language, region, and encoder-biased hotwords
+ result = transcribe(model, 'audio.wav', lang_sym="zh", region_sym="CN", hotwords=['诺香丹青牌科研胶囊'], use_deep_biasing=True, use_two_stage_filter=True)
+ print(result.text)
+
+ # Prompt-based hotwords
+
+ model_name = 'small.cn.prompt'
+ model = dolphin.load_model(model_name, device="cuda")
+
+ result = transcribe(model, 'audio.wav', hotwords=['诺香丹青牌科研胶囊'], use_prompt_hotword=True, use_two_stage_filter=True, decoding_method='attention')
+
+ print(result.text)
+
+ ```
+
+
+ ## License
+
+ Dolphin-CN-Dialect is released under the Apache 2.0 License.
base.cn.streaming.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:62e4c11fe1e0e42bd34e444172c5a05e792c4b5a03750f794fa3206fc0649cd7
+ size 447175723
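Reviewer note: the three added lines are a Git LFS pointer, not the checkpoint itself; the actual weights live in LFS storage. A minimal sketch of reading such a pointer follows, assuming the usual "key, space, value" layout of the LFS v1 pointer format. The helper name `parse_lfs_pointer` is hypothetical.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dict entry."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


# Pointer content copied from the diff above.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:62e4c11fe1e0e42bd34e444172c5a05e792c4b5a03750f794fa3206fc0649cd7
size 447175723
"""

info = parse_lfs_pointer(pointer_text)
hash_algo, _, digest = info["oid"].partition(":")
size_mb = int(info["size"]) / 1e6  # decimal megabytes

print(hash_algo, digest[:12], f"{size_mb:.1f} MB")
```

The `size` field puts the checkpoint at roughly 447 MB, and the `oid` digest can be compared against a SHA-256 of the downloaded file to confirm integrity.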
global_cmvn ADDED
@@ -0,0 +1 @@
+ {"mean_stat": [533749120.0, 537379776.0, 553561472.0, 587164544.0, 631869696.0, 662598848.0, 684377024.0, 695393728.0, 692471168.0, 679433984.0, 666123200.0, 656323712.0, 665752576.0, 678693440.0, 681920896.0, 679622080.0, 669891840.0, 656595136.0, 653838528.0, 637679232.0, 628412096.0, 644836864.0, 638840960.0, 646180608.0, 639724352.0, 642756992.0, 637471744.0, 642369856.0, 643414976.0, 647382848.0, 649348672.0, 649294336.0, 650233920.0, 654485056.0, 660473792.0, 667416512.0, 673158464.0, 675675200.0, 675123648.0, 668017536.0, 670060160.0, 662626240.0, 663143808.0, 662504064.0, 666413696.0, 672262080.0, 678483904.0, 685386048.0, 692572416.0, 699064000.0, 700785280.0, 701202688.0, 702666560.0, 705441664.0, 706070720.0, 705989248.0, 702842816.0, 699316416.0, 696090176.0, 687561152.0, 675279808.0, 663676352.0, 662962880.0, 664298944.0, 666095808.0, 671681664.0, 676652224.0, 680097152.0, 683811072.0, 688700992.0, 692082880.0, 695787904.0, 701085376.0, 706388736.0, 711491584.0, 717637248.0, 719691456.0, 715812736.0, 696362624.0, 604648448.0], "var_stat": [5413307392.0, 5559845888.0, 6150984704.0, 6921248256.0, 7999779840.0, 8789867520.0, 9405782016.0, 9768041472.0, 9759789056.0, 9430661120.0, 9090545664.0, 8873148416.0, 9155918848.0, 9542536192.0, 9653540864.0, 9593434112.0, 9316643840.0, 8959277056.0, 8863545344.0, 8450634752.0, 8211585536.0, 8587086336.0, 8432618496.0, 8583947264.0, 8401719808.0, 8439344640.0, 8293782528.0, 8401505280.0, 8427503104.0, 8525163520.0, 8577082880.0, 8575110656.0, 8594999296.0, 8701685760.0, 8854966272.0, 9029483520.0, 9168757760.0, 9221463040.0, 9194539008.0, 8997074944.0, 9024589824.0, 8819394560.0, 8807888896.0, 8777241600.0, 8869670912.0, 9017397248.0, 9173403648.0, 9345572864.0, 9530641408.0, 9701232640.0, 9748996096.0, 9762760704.0, 9801994240.0, 9874428928.0, 9883272192.0, 9873506304.0, 9780680704.0, 9672627200.0, 9569440768.0, 9321866240.0, 8968148992.0, 8646342656.0, 8616977408.0, 8648623104.0, 8702088192.0, 8859208704.0, 
8999405568.0, 9105936384.0, 9220425728.0, 9358615552.0, 9451428864.0, 9552728064.0, 9695461376.0, 9836660736.0, 9970957312.0, 10135880704.0, 10189387776.0, 10070480896.0, 9532967936.0, 7261238272.0], "frame_num": 54068199}
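Reviewer note: the file stores accumulated statistics rather than ready-to-use means and variances. The sketch below assumes the WeNet-style JSON CMVN convention (suggested by `is_json_cmvn: true` in `train.yaml`), where `mean_stat` holds per-bin sums, `var_stat` holds per-bin sums of squares, and `frame_num` is the frame count. That interpretation is an assumption, the helper name is hypothetical, and only the first two bins are copied here to keep the example short.

```python
import json
import math


def cmvn_from_stats(stats: dict, floor: float = 1e-20):
    """Convert accumulated sums into per-bin mean and inverse standard deviation."""
    frame_num = stats["frame_num"]
    means, inv_stds = [], []
    for total, total_sq in zip(stats["mean_stat"], stats["var_stat"]):
        mean = total / frame_num
        # Var[x] = E[x^2] - E[x]^2, floored to avoid division by zero.
        variance = max(total_sq / frame_num - mean * mean, floor)
        means.append(mean)
        inv_stds.append(1.0 / math.sqrt(variance))
    return means, inv_stds


# First two bins of the statistics shown in the diff above.
stats = json.loads(
    '{"mean_stat": [533749120.0, 537379776.0], '
    '"var_stat": [5413307392.0, 5559845888.0], "frame_num": 54068199}'
)
means, inv_stds = cmvn_from_stats(stats)

# Normalize a hypothetical two-bin feature frame.
frame = [10.0, 10.0]
normalized = [(x - m) * s for x, m, s in zip(frame, means, inv_stds)]
print(means, inv_stds, normalized)
```

Storing sums instead of final statistics lets the values be merged across data shards before the mean and variance are derived.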
train.yaml ADDED
@@ -0,0 +1,113 @@
+ accum_grad: 4
+ cmvn: global_cmvn
+ cmvn_conf:
+   cmvn_file: data/train/global_cmvn
+   is_json_cmvn: true
+ ctc: ctc
+ ctc_conf:
+   ctc_blank_id: 0
+ dataset: asr
+ dataset_conf:
+   batch_conf:
+     batch_size: 32
+     batch_type: static
+   ctc_label: true
+   cycle: 100
+   fbank_conf:
+     dither: 0.1
+     frame_length: 25
+     frame_shift: 10
+     num_mel_bins: 80
+   filter_conf:
+     max_length: 3000
+     min_length: 0
+     token_max_length: 200
+     token_min_length: 1
+   no_time_idx: 3
+   remove_punctuation: true
+   remove_timestamp: true
+   resample_conf:
+     resample_rate: 16000
+   shuffle: true
+   shuffle_conf:
+     shuffle_size: 4096
+   sort: true
+   sort_conf:
+     sort_size: 2048
+   spec_aug: true
+   spec_aug_conf:
+     max_f: 10
+     max_t: 50
+     num_f_mask: 2
+     num_t_mask: 2
+   spec_trim: true
+   spec_trim_conf:
+     max_t: 30
+   speed_perturb: true
+   time_apply_prob: 0.0
+ decoder: transformer
+ decoder_conf:
+   attention_heads: 8
+   dropout_rate: 0.1
+   linear_units: 2048
+   num_blocks: 6
+   positional_dropout_rate: 0.1
+   self_attention_dropout_rate: 0.1
+   src_attention_dropout_rate: 0.1
+   use_sdpa: true
+ dtype: fp32
+ encoder: e_branchformer
+ encoder_conf:
+   activation_type: swish
+   attention_dropout_rate: 0.1
+   attention_heads: 8
+   causal: true
+   cgmlp_conv_kernel: 31
+   cgmlp_linear_units: 2048
+   dropout_rate: 0.1
+   gate_activation: identity
+   input_layer: conv2d
+   linear_units: 2048
+   merge_conv_kernel: 31
+   num_blocks: 6
+   output_size: 512
+   pos_enc_layer_type: rel_pos
+   positional_dropout_rate: 0.1
+   selfattention_layer_type: rel_selfattn
+   use_dynamic_chunk: true
+   use_dynamic_left_chunk: false
+   use_linear_after_conv: false
+   use_sdpa: true
+ grad_clip: 5
+ input_dim: 80
+ log_interval: 200
+ max_epoch: 100
+ model: asr_model
+ model_conf:
+   ctc_weight: 0.3
+   length_normalized_loss: false
+   lsm_weight: 0.1
+ model_dir: exp/dolphin_ebf_base_streaming_v4.3
+ optim: adam
+ optim_conf:
+   lr: 0.0005
+ output_dim: 18173
+ save_interval: 2000
+ save_states: model_only
+ scheduler: warmuplr
+ scheduler_conf:
+   warmup_steps: 2048
+ stats_dialect: true
+ tokenizer: char
+ tokenizer_conf:
+   special_tokens:
+     <asr>: 4
+     <blank>: 0
+     <eos>: 3
+     <sos>: 2
+     <unk>: 1
+   split_with_space: false
+   symbol_table_path: data/dict/units.txt
+ train_engine: torch_ddp
+ use_amp: false
+ vocab_size: 18173
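Reviewer note on the optimizer settings in this config: `scheduler: warmuplr` with `warmup_steps: 2048` and `lr: 0.0005` most likely describes a Noam-style schedule, with a linear ramp to the peak rate followed by inverse-square-root decay. The sketch below assumes that form, as used by WeNet-style `WarmupLR` schedulers; the config alone does not confirm it, and the function name is hypothetical.

```python
def warmup_lr(step: int, base_lr: float = 0.0005, warmup_steps: int = 2048) -> float:
    """Learning rate at a given optimizer step (assumed Noam-style warmup)."""
    if step < 1:
        raise ValueError("step counts from 1")
    # Linear growth until warmup_steps, then decay proportional to step ** -0.5.
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return base_lr * warmup_steps ** 0.5 * scale


for step in (512, 2048, 8192):
    print(step, warmup_lr(step))
```

Under this assumption the rate peaks at `base_lr` exactly when `step == warmup_steps`, reaches a quarter of the peak at step 512, and falls to half the peak at step 8192.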
units.txt ADDED
The diff for this file is too large to render. See raw diff