---
license: agpl-3.0
base_model:
- Qwen/Qwen3-ASR-1.7B
pipeline_tag: automatic-speech-recognition
tags:
- rknn
- rkllm
- audio
- automatic-speech-recognition
---

# Qwen3-ASR-1.7B-RKLLM

The Qwen3-ASR series includes two models, Qwen3-ASR-1.7B and Qwen3-ASR-0.6B, supporting language identification and speech transcription across 52 languages and dialects. Both are built on large-scale speech training data and the strong audio understanding capabilities of their base model, Qwen3-Omni. Experiments show that the 1.7B version achieves leading performance among open-source ASR models and is on par with the strongest commercial proprietary APIs. Key highlights:

- Inference speed (RK3588):
  + Audio encoder (single NPU core, fp16): 89.6 ms per inference (processing 1 s of audio), i.e. ~11.2x real-time; ~33x with 3 NPU cores in parallel
  + LLM prefill (3 NPU cores, fp16): ~130 tps; each token corresponds to 1/13 s of audio, i.e. ~10x real-time
  + LLM decode (3 NPU cores, fp16): ~7.5 tps

- Approximate memory usage (RK3588): ~5 GB
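
The real-time factors quoted above follow directly from the per-unit timings. A quick back-of-the-envelope check (all input figures are taken from the measurements above):

```python
# Back-of-the-envelope real-time factors derived from the measurements above.

encoder_ms_per_1s_audio = 89.6           # single NPU core, fp16
encoder_rtf = 1000 / encoder_ms_per_1s_audio
print(f"encoder: ~{encoder_rtf:.1f}x real-time")               # ~11.2x

# With 3 NPU cores running encoder inferences in parallel:
print(f"encoder, 3 cores: ~{encoder_rtf * 3:.0f}x real-time")  # ~33x

prefill_tps = 130                        # 3 NPU cores, fp16
audio_tokens_per_sec = 13                # each token covers 1/13 s of audio
prefill_rtf = prefill_tps / audio_tokens_per_sec
print(f"prefill: ~{prefill_rtf:.0f}x real-time")               # ~10x
```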

## Usage

1. Clone the repository

2. Install dependencies

```bash
pip install numpy scipy soundfile tqdm transformers ztu-somemodelruntime-ez-rknn-async
```

3. Run

```bash
python run_qwen3_asr_e2e.py --audio-path ./long_test.wav
```

## Example Output

```log
> python run_qwen3_asr_e2e.py --audio-path ./long_test.wav
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from rknn/language_model.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
I rkllm: reset chat template:
I rkllm: system_prompt: <|im_start|>system\n<|im_end|>\n
I rkllm: prompt_prefix: <|im_start|>user\n
I rkllm: prompt_postfix: <|im_end|>\n<|im_start|>assistant\n
W rkllm: Calling rkllm_set_chat_template will disable the internal automatic chat template parsing, including enable_thinking. Make sure your custom prompt is complete and valid.
input_feature_len: 4031
audio_features: (524, 2048)
time_mel_sec: 2.532
time_rkllm_init_sec: 4.404
time_load_total_sec: 8.092
time_audio_encoder_sec: 1.391
language Chinese<asr_text>大家好呀!今天给大家分享的是在线一键语音生成网站的合集,能够更加方便大家选择自己想要生成的角色。进入网站可以看到所有的生成模型都在这里,选择你想要生成的角色,点击进入就来到了生成的页面,在文本框内输入你想要生成的内容,然后点击生成就好了。另外呢,因为每次的生成结果都会有一些不一样的地方,如果您觉得第一次的生成效果不好的话,可以尝试重新生成,也可以稍微调节一下相关的数值再生成试试。使用时一定要遵守法律法规,不可以损害刷人的形象哦!(finish)
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Model init time (ms)  3747.00
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Stage Total Time (ms) Tokens Time per Token (ms) Tokens per Second
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Prefill 4193.03 539 7.78 128.55
I rkllm: Generate 15643.47 118 132.57 7.54
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Peak Memory Usage (GB)
I rkllm: 5.03
I rkllm: --------------------------------------------------------------------------------------
time_generate_sec: 19.872
time_infer_total_sec: 23.673
time_total_sec: 31.765
```
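
The `time_*` lines at the end of the log are enough to estimate the overall real-time factor. A small sketch over an excerpt of the log above (it assumes, per the speed notes earlier, that each audio token covers 1/13 s of audio):

```python
import re

# Excerpt of the run log shown above.
log = """
audio_features: (524, 2048)
time_infer_total_sec: 23.673
time_total_sec: 31.765
"""

# Each LLM audio token covers 1/13 s of audio (see the speed notes above),
# so the token count gives the clip duration.
n_tokens = int(re.search(r"audio_features: \((\d+),", log).group(1))
audio_sec = n_tokens / 13
infer_sec = float(re.search(r"time_infer_total_sec: ([\d.]+)", log).group(1))
total_sec = float(re.search(r"time_total_sec: ([\d.]+)", log).group(1))

print(f"audio: {audio_sec:.1f} s")                                   # ~40.3 s
print(f"inference RTF: {audio_sec / infer_sec:.2f}x")                # ~1.70x
print(f"end-to-end RTF (incl. load): {audio_sec / total_sec:.2f}x")  # ~1.27x
```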

## Model Conversion

https://huggingface.co/happyme531/Qwen3-ASR-1.7B-RKLLM/blob/main/convert/README.md

## Known Issues

- Streaming inference is not yet implemented. You might want to check out [qzxyz/qwen3asr_rk](https://huggingface.co/qzxyz/qwen3asr_rk) for a streaming implementation.
- Quantization has not been done yet.

## References

- [Qwen/Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B)