|
|
--- |
|
|
license: other |
|
|
license_name: model-license |
|
|
license_link: https://github.com/modelscope/FunASR/blob/main/MODEL_LICENSE |
|
|
language: |
|
|
- en |
|
|
- zh |
|
|
- ja |
|
|
- ko |
|
|
library_name: funasr
|
|
--- |
|
|
|
|
|
([简体中文](./README_zh.md)|English|[日本語](./README_ja.md)) |
|
|
|
|
|
# Introduction |
|
|
|
|
|
GitHub repo: [FunAudioLLM/SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
|
|
|
|
|
SenseVoice is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED).
|
|
|
|
|
<img src="image/sensevoice2.png"> |
|
|
|
|
|
[//]: # (<div align="center"><img src="image/sensevoice.png" width="700"/> </div>) |
|
|
|
|
|
<div align="center"> |
|
|
<h4> |
|
|
<a href="https://fun-audio-llm.github.io/"> Homepage </a> |
|
|
|<a href="#What's News"> What's News </a> |
|
|
|<a href="#Benchmarks"> Benchmarks </a> |
|
|
|<a href="#Install"> Install </a> |
|
|
|<a href="#Usage"> Usage </a> |
|
|
|<a href="#Community"> Community </a> |
|
|
</h4> |
|
|
|
|
|
Model Zoo: [modelscope](https://www.modelscope.cn/models/iic/SenseVoiceSmall), [huggingface](https://huggingface.co/FunAudioLLM/SenseVoiceSmall)
|
|
|
|
|
Online Demo: [modelscope demo](https://www.modelscope.cn/studios/iic/SenseVoice), [huggingface space](https://huggingface.co/spaces/FunAudioLLM/SenseVoice)
|
|
|
|
|
|
|
|
</div> |
|
|
|
|
|
|
|
|
<a name="Highligts"></a> |
|
|
|
|
|
# Highlights 🎯 |
|
|
|
|
|
**SenseVoice** focuses on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection.
|
|
|
|
|
- **Multilingual Speech Recognition:** Trained on over 400,000 hours of data and supporting more than 50 languages, with recognition performance surpassing that of the Whisper model.
- **Rich Transcription:**
  - Excellent emotion recognition capabilities, matching and surpassing the effectiveness of the current best emotion recognition models on test data.
  - Sound event detection capabilities, supporting the detection of various common human-computer interaction events such as background music (BGM), applause, laughter, crying, coughing, and sneezing.
- **Efficient Inference:** The SenseVoice-Small model uses a non-autoregressive end-to-end framework, leading to exceptionally low inference latency. It requires only 70 ms to process 10 seconds of audio, which is 15 times faster than Whisper-Large.
- **Convenient Finetuning:** Provides convenient finetuning scripts and strategies, allowing users to easily address long-tail sample issues according to their business scenarios.
- **Service Deployment:** Offers a service deployment pipeline supporting multi-concurrent requests, with client-side languages including Python, C++, HTML, Java, and C#, among others.
|
|
|
|
|
<a name="What's News"></a> |
|
|
|
|
|
# What's New 🔥 |
|
|
|
|
|
- 2024/7: Added export features for [ONNX](https://github.com/FunAudioLLM/SenseVoice/blob/main/demo_onnx.py) and [libtorch](https://github.com/FunAudioLLM/SenseVoice/blob/main/demo_libtorch.py), as well as Python runtimes: [funasr-onnx-0.4.0](https://pypi.org/project/funasr-onnx/), [funasr-torch-0.1.1](https://pypi.org/project/funasr-torch/).
- 2024/7: The [SenseVoice-Small](https://www.modelscope.cn/models/iic/SenseVoiceSmall) voice understanding model is open-sourced, offering high-precision multilingual speech recognition, emotion recognition, and audio event detection for Mandarin, Cantonese, English, Japanese, and Korean, with exceptionally low inference latency.
- 2024/7: CosyVoice, a model for natural speech generation with multilingual, timbre, and emotion control, is open-sourced. CosyVoice excels in multilingual voice generation, zero-shot voice generation, cross-lingual voice cloning, and instruction-following capabilities. See the [CosyVoice repo](https://github.com/FunAudioLLM/CosyVoice) and [CosyVoice space](https://www.modelscope.cn/studios/iic/CosyVoice-300M).
- 2024/7: [FunASR](https://github.com/modelscope/FunASR) is a fundamental speech recognition toolkit that offers a variety of features, including speech recognition (ASR), voice activity detection (VAD), punctuation restoration, language models, speaker verification, speaker diarization, and multi-talker ASR.
|
|
|
|
|
<a name="Benchmarks"></a> |
|
|
|
|
|
# Benchmarks 📝 |
|
|
|
|
|
## Multilingual Speech Recognition |
|
|
|
|
|
We compared the multilingual speech recognition performance of SenseVoice and Whisper on open-source benchmark datasets, including AISHELL-1, AISHELL-2, WenetSpeech, LibriSpeech, and Common Voice. For Chinese and Cantonese recognition, the SenseVoice-Small model has advantages.
|
|
|
|
|
<div align="center"> |
|
|
<img src="image/asr_results1.png" width="400" /><img src="image/asr_results2.png" width="400" /> |
|
|
</div> |
|
|
|
|
|
## Speech Emotion Recognition |
|
|
|
|
|
Due to the current lack of widely used benchmarks and methods for speech emotion recognition, we conducted evaluations on multiple metrics across several test sets and performed a comprehensive comparison with numerous results from recent benchmarks. The selected test sets encompass data in both Chinese and English and include multiple styles such as performances, films, and natural conversations. Without finetuning on the target data, SenseVoice was able to match and exceed the performance of the current best speech emotion recognition models.
|
|
|
|
|
<div align="center"> |
|
|
<img src="image/ser_table.png" width="1000" /> |
|
|
</div> |
|
|
|
|
|
Furthermore, we compared multiple open-source speech emotion recognition models on the test sets. The results indicate that the SenseVoice-Large model achieved the best performance on nearly all datasets, while the SenseVoice-Small model also surpassed other open-source models on the majority of them.
|
|
|
|
|
<div align="center"> |
|
|
<img src="image/ser_figure.png" width="500" /> |
|
|
</div> |
|
|
|
|
|
## Audio Event Detection |
|
|
|
|
|
Although trained exclusively on speech data, SenseVoice can still function as a standalone event detection model. We compared its performance on the ESC-50 environmental sound classification dataset against the widely used industry models BEATS and PANN. The SenseVoice model achieved commendable results on these tasks. However, due to limitations in training data and methodology, its event classification performance still has gaps compared with specialized AED models.
|
|
|
|
|
<div align="center"> |
|
|
<img src="image/aed_figure.png" width="500" /> |
|
|
</div> |
|
|
|
|
|
## Computational Efficiency |
|
|
|
|
|
The SenseVoice-Small model adopts a non-autoregressive end-to-end architecture, resulting in extremely low inference latency. With a number of parameters similar to the Whisper-Small model, it infers more than 5 times faster than Whisper-Small and 15 times faster than Whisper-Large.
|
|
|
|
|
<div align="center"> |
|
|
<img src="image/inference.png" width="1000" /> |
|
|
</div> |
|
|
|
|
|
# Requirements |
|
|
|
|
|
```shell
pip install -r requirements.txt
```
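
If you only need to run the model through FunASR (rather than a full checkout of this repo), a minimal install is sketched below; treat the pinned versions in `requirements.txt` as authoritative.

```shell
# Minimal sketch: FunASR itself plus the hub clients used to fetch the weights.
pip install -U funasr
pip install -U modelscope huggingface_hub
```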
|
|
|
|
|
<a name="Usage"></a> |
|
|
|
|
|
# Usage |
|
|
|
|
|
## Inference |
|
|
|
|
|
Supports input of audio in any format and of any duration. |
|
|
|
|
|
```python
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "FunAudioLLM/SenseVoiceSmall"

model = AutoModel(
    model=model_dir,
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
    hub="hf",
)

# en
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,  # merge short segments produced by the VAD model
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)
```
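
Before postprocessing, the model's raw transcript carries inline tags for language, emotion, and audio events, which `rich_transcription_postprocess` resolves into plain readable text. A sketch of what to expect (the sample string below is illustrative, not a verbatim output):

```python
# Illustrative only: the exact tag strings come from the SenseVoice tokenizer.
raw = res[0]["text"]
print(raw)  # e.g. '<|en|><|NEUTRAL|><|Speech|><|withitn|>The weather is nice today.'
print(rich_transcription_postprocess(raw))  # tags resolved for display
```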
|
|
|
|
|
Parameter Description: |
|
|
|
|
|
- `model_dir`: The name of the model, or the path to the model on the local disk.
- `vad_model`: Enables VAD (voice activity detection). The purpose of VAD is to split long audio into shorter clips. In this case, the inference time includes the total consumption of both VAD and SenseVoice, and represents the end-to-end latency. If you wish to measure the SenseVoice model's inference time separately, the VAD model can be disabled.
- `vad_kwargs`: Specifies the configuration of the VAD model. `max_single_segment_time` denotes the maximum duration of an audio segment produced by the `vad_model`, in milliseconds (ms).
- `use_itn`: Whether the output includes punctuation and inverse text normalization.
- `batch_size_s`: Enables dynamic batching, where the total duration of audio in a batch is measured in seconds (s).
- `merge_vad`: Whether to merge short audio fragments segmented by the VAD model, with the merged length given by `merge_length_s`, in seconds (s).
|
|
|
|
|
If all inputs are short audio clips (less than 30 s) and batch inference is needed to speed things up, the VAD model can be removed and `batch_size` set accordingly.
|
|
|
|
|
```python
model = AutoModel(model=model_dir, device="cuda:0", hub="hf")

res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="zh",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=False,
    batch_size=64,
)
```
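
As noted in the parameter description, disabling the VAD model lets you measure the SenseVoice inference time on its own. A minimal wall-clock sketch around `generate`, reusing the model and example path from the snippet above:

```python
import time

start = time.time()
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="auto",
    use_itn=False,
    batch_size=64,
)
print(f"SenseVoice-only inference took {time.time() - start:.3f}s")
```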
|
|
|
|
|
For more usage, please refer to the [docs](https://github.com/modelscope/FunASR/blob/main/docs/tutorial/README.md).
|
|
|
|
|
### Direct Inference
|
|
|
|
|
Supports audio input in any format, with an input duration of 30 seconds or less.
|
|
|
|
|
```python
from model import SenseVoiceSmall
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "FunAudioLLM/SenseVoiceSmall"
m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir, device="cuda:0", hub="hf")
m.eval()

res = m.inference(
    data_in=f"{kwargs['model_path']}/example/en.mp3",
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=False,
    **kwargs,
)

text = rich_transcription_postprocess(res[0][0]["text"])
print(text)
```
|
|
|
|
|
### Export and Test (*Ongoing*)
|
|
|
|
|
Refer to [SenseVoice](https://github.com/FunAudioLLM/SenseVoice).
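
For the exported ONNX model, the `funasr-onnx` runtime listed in What's New can be used. The sketch below assumes the `SenseVoiceSmall` class and keyword arguments shown in the SenseVoice repo; check that repo for the authoritative version.

```python
# pip install -U funasr-onnx
from funasr_onnx import SenseVoiceSmall
from funasr_onnx.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"
# quantize=True selects the int8 ONNX weights; batch_size controls batching.
model = SenseVoiceSmall(model_dir, batch_size=10, quantize=True)

res = model(["example/en.mp3"], language="auto", use_itn=True)  # path is illustrative
print([rich_transcription_postprocess(t) for t in res])
```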
|
|
|
|
|
## Service |
|
|
|
|
|
Refer to [SenseVoice](https://github.com/FunAudioLLM/SenseVoice).
|
|
|
|
|
## Finetune |
|
|
|
|
|
Refer to [SenseVoice](https://github.com/FunAudioLLM/SenseVoice).
|
|
|
|
|
## WebUI |
|
|
|
|
|
```shell
python webui.py
```
|
|
|
|
|
<div align="center"><img src="image/webui.png" width="700"/> </div> |
|
|
|
|
|
<a name="Community"></a> |
|
|
|
|
|
# Community |
|
|
|
|
|
If you encounter problems in use, you can raise issues directly on the GitHub page.
|
|
|
|
|
You can also scan the following DingTalk group QR code to join the community group for communication and discussion. |
|
|
|
|
|
|                            FunAudioLLM                            |                          FunASR                           |
|:----------------------------------------------------------------:|:--------------------------------------------------------:|
| <div align="left"><img src="image/dingding_sv.png" width="250"/> | <img src="image/dingding_funasr.png" width="250"/></div> |