---
license: cc-by-nc-4.0
language:
- zh
- en
tags:
- speech
- asr
- sensevoice
- paralinguistic
- nonverbal-vocalization
datasets:
- NV-Bench
- amphion/Emilia-NV
- nonverbalspeech/nonverbalspeech38k
- deepvk/NonverbalTTS
- xunyi/SMIIP-NV
pipeline_tag: automatic-speech-recognition
metrics:
- cer
base_model:
- FunAudioLLM/SenseVoiceSmall
---

# Multi-lingual NVASR

**Multi-lingual Nonverbal Vocalization Automatic Speech Recognition**

[Project Page](https://nvbench.github.io) | [Dataset](https://huggingface.co/datasets/AnonyData/NV-Bench) | [Model](https://huggingface.co/AnonyData/Multilingual-NVASR)

Multi-lingual NVASR is a speech recognition model fine-tuned from [SenseVoice-Small](https://github.com/FunAudioLLM/SenseVoice) to transcribe both regular speech and **nonverbal vocalizations (NVVs)** using a unified paralinguistic label taxonomy. It is a core component of the [NV-Bench](https://nvbench.github.io) evaluation pipeline.

## Highlights

- 🗣️ **Multi-lingual Support** – Chinese (zh) and English (en)
- 🎯 **NVV-Aware Transcription** – Accurately transcribes nonverbal vocalizations (laughter, coughs, sighs, etc.) as structured tags within the text
- 📈 **High-Quality General ASR** – Maintains competitive CER on standard ASR benchmarks while significantly outperforming baselines on NVV-specific tasks
- 🏷️ **Unified Label Taxonomy** – Consistent paralinguistic labels across all supported languages

## NVV Taxonomy

NVVs are organized into three functional levels:

| Function | Categories |
|----------|------------|
| Vegetative | `[Cough]`, `[Sigh]`, `[Breathing]` |
| Affect Burst | `[Surprise-oh]`, `[Surprise-ah]`, `[Dissatisfaction-hnn]`, `[Laughter]` |
| Conversational Grunt | `[Uhm]`, `[Question-en/oh/ah/ei/huh]`, `[Confirmation-en]` |

> [!NOTE]
> Mandarin supports 13 NVV categories; English supports 7 categories.

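Since the model emits NVV labels inline as bracketed tags (as in the taxonomy above), downstream code often needs to separate the tags from the verbal content. A minimal sketch, assuming tags always take the bracketed `[Label]` form shown in the table (the tag pattern and `split_transcript` helper are illustrative, not part of the released tooling):

```python
import re

# Assumed tag format: a bracketed label such as [Laughter] or [Surprise-oh],
# optionally with a lowercase suffix like -en/oh/ah/ei/huh.
NVV_TAG = re.compile(r"\[[A-Za-z]+(?:-[a-z/]+)?\]")

def split_transcript(text):
    """Separate NVV tags from the verbal portion of a transcript."""
    tags = NVV_TAG.findall(text)
    speech = " ".join(NVV_TAG.sub(" ", text).split())
    return tags, speech

tags, speech = split_transcript("I did not expect that [Laughter] at all [Sigh]")
# tags   -> ['[Laughter]', '[Sigh]']
# speech -> 'I did not expect that at all'
```

The same split underlies the PCER/PWER vs. CER/WER distinction in the Evaluation Metrics section below.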
## Usage

### Quick Start with FunASR

```python
from funasr import AutoModel

model = AutoModel(model="path/to/Multi-lingual-NVASR")

# Single-file inference
res = model.generate(
    input="example/zh.mp3",
    language="auto",
    use_itn=True,
)
print(res[0]["text"])
```

## Evaluation Metrics

Multi-lingual NVASR supports the following evaluation metrics used in the NV-Bench pipeline:

| Metric | Description |
|--------|-------------|
| **OCER / OWER** | Overall Character/Word Error Rate (text + NVV tags) |
| **PCER / PWER** | Paralinguistic CER/WER (NVV tags only) |
| **CER / WER** | Text-only error rate (NVV tags removed) |

> Our NVASR model maintains high-quality general ASR while significantly outperforming baselines on NVV-specific tasks. – *NV-Bench*

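The three metric families differ only in which tokens are scored. A minimal word-level sketch of that relationship, assuming NVV tags are bracketed tokens as in the taxonomy above (this is an illustration, not the official NV-Bench scorer, which operates at character level for CER):

```python
import re

def edit_distance(ref, hyp):
    """Standard Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

NVV_TAG = re.compile(r"\[[^\]]+\]")

def tokens(text, keep):
    if keep == "tags":            # PWER-style: NVV tags only
        return NVV_TAG.findall(text)
    if keep == "text":            # WER-style: NVV tags removed
        text = NVV_TAG.sub(" ", text)
    return text.split()           # OWER-style: text + tags together

def error_rate(ref, hyp, keep="all"):
    r, h = tokens(ref, keep), tokens(hyp, keep)
    return edit_distance(r, h) / max(len(r), 1)

ref = "oh no [Cough] excuse me [Sigh]"
hyp = "oh no excuse me [Sigh]"      # hypothesis misses the [Cough] tag
print(error_rate(ref, hyp, "all"))  # 1/6: one deletion over six tokens
print(error_rate(ref, hyp, "tags")) # 0.5: one of two tags missed
print(error_rate(ref, hyp, "text")) # 0.0: verbal content is identical
```

A hypothesis that drops a tag is thus penalized by the overall and paralinguistic scores but not by the text-only score, which is why all three are reported side by side.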
## File Structure

```
Multi-lingual NVASR/
├── model.pt                       # Model weights (~2.8 GB)
├── config.yaml                    # Model architecture configuration
├── configuration.json             # FunASR pipeline configuration
├── am.mvn                         # Acoustic model mean-variance normalization
├── paralingustic_tokenizer.model  # SentencePiece tokenizer with NVV vocabulary
└── example/                       # Example audio files
    ├── zh.mp3                     # Chinese example
    └── en.mp3                     # English example
```

## Related Resources

- **NV-Bench Project Page**: [https://nvbench.github.io](https://nvbench.github.io)
- **NV-Bench Dataset**: [Hugging Face](https://huggingface.co/datasets/AnonyData/NV-Bench)
- **SenseVoice**: [GitHub](https://github.com/FunAudioLLM/SenseVoice)

## Citation

If you use this model, please cite:

```bibtex
Coming soon
```

## License

This project is licensed under the [CC BY-NC 4.0 License](LICENSE).