license: cc-by-nc-4.0
language:
  - zh
  - en
tags:
  - speech
  - asr
  - sensevoice
  - paralinguistic
  - nonverbal-vocalization
datasets:
  - NV-Bench
  - amphion/Emilia-NV
  - nonverbalspeech/nonverbalspeech38k
  - deepvk/NonverbalTTS
  - xunyi/SMIIP-NV
pipeline_tag: automatic-speech-recognition
metrics:
  - cer
base_model:
  - FunAudioLLM/SenseVoiceSmall

Multi-lingual NVASR

Multi-lingual Nonverbal Vocalization Automatic Speech Recognition

Multi-lingual NVASR is a speech recognition model fine-tuned from SenseVoice-Small for transcribing both regular speech and nonverbal vocalizations (NVVs) with a unified paralinguistic label taxonomy. It is a core component of the NV-Bench evaluation pipeline.

Highlights

🗣️ Multi-lingual Support — Chinese (zh), English (en)
🎯 NVV-Aware Transcription — Accurately transcribes nonverbal vocalizations (laughter, coughs, sighs, etc.) as structured tags within text
📊 High-Quality General ASR — Maintains competitive CER on standard ASR benchmarks while significantly outperforming baselines on NVV-specific tasks
🏷️ Unified Label Taxonomy — Consistent paralinguistic labels across all supported languages

NVV Taxonomy

NVVs are organized into three functional levels:

Function	Categories
Vegetative	`[Cough]`, `[Sigh]`, `[Breathing]`
Affect Burst	`[Surprise-oh]`, `[Surprise-ah]`, `[Dissatisfaction-hnn]`, `[Laughter]`
Conversational Grunt	`[Uhm]`, `[Question-en/oh/ah/ei/huh]`, `[Confirmation-en]`

Mandarin supports 13 NVV categories; English supports 7 categories.
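As a rough illustration of how the taxonomy above could be used downstream, the sketch below pulls bracketed NVV tags out of a transcript and maps each one to its functional level. The names `NVV_LEVELS`, `TAG_RE`, and `extract_nvv_tags` are hypothetical helpers, not part of the model or FunASR. Expanding `[Question-en/oh/ah/ei/huh]` into five separate tags is an assumption about how the shorthand in the table should be read.

```python
import re

# Functional level for each tag, copied from the taxonomy table.
NVV_LEVELS = {
    "Cough": "Vegetative",
    "Sigh": "Vegetative",
    "Breathing": "Vegetative",
    "Surprise-oh": "Affect Burst",
    "Surprise-ah": "Affect Burst",
    "Dissatisfaction-hnn": "Affect Burst",
    "Laughter": "Affect Burst",
    "Uhm": "Conversational Grunt",
    "Confirmation-en": "Conversational Grunt",
}
# Assumption: the table's "Question-en/oh/ah/ei/huh" stands for five tags.
for variant in ("en", "oh", "ah", "ei", "huh"):
    NVV_LEVELS[f"Question-{variant}"] = "Conversational Grunt"

# Matches tags such as [Laughter] or [Surprise-oh] and captures the label.
TAG_RE = re.compile(r"\[([A-Za-z]+(?:-[A-Za-z]+)?)\]")


def extract_nvv_tags(text):
    """Return (label, functional level) pairs in order of appearance."""
    return [
        (m.group(1), NVV_LEVELS.get(m.group(1), "Unknown"))
        for m in TAG_RE.finditer(text)
    ]


if __name__ == "__main__":
    print(extract_nvv_tags("[Laughter] that was great [Question-huh]"))
```

Labels that are not in the table are reported as `"Unknown"` rather than dropped, so unexpected tags stay visible.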

Usage

Quick Start with FunASR

from funasr import AutoModel

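# Load the fine-tuned checkpoint from a local directory
model = AutoModel(model="path/to/Multi-lingual-NVASR")

# Single file inference
res = model.generate(
    input="example/zh.mp3",
    language="auto",  # let the model detect the language
    use_itn=True,     # apply inverse text normalization to the output
)
# Each result is a dict; "text" holds the transcript
print(res[0]["text"])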
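To transcribe several files, one simple option is to repeat the Quick Start call per file. The sketch below assumes only the `generate(input=..., language=..., use_itn=...)` call shown above; `transcribe_files` is a hypothetical helper, not a FunASR function.

```python
from typing import Dict, Iterable


def transcribe_files(model, paths: Iterable[str], language: str = "auto") -> Dict[str, str]:
    """Run the Quick Start call once per file and collect the transcripts.

    `model` is expected to be the object returned by funasr.AutoModel.
    """
    results = {}
    for path in paths:
        res = model.generate(input=path, language=language, use_itn=True)
        results[path] = res[0]["text"]
    return results


# Example usage (requires funasr and the model files):
# from funasr import AutoModel
# model = AutoModel(model="path/to/Multi-lingual-NVASR")
# print(transcribe_files(model, ["example/zh.mp3", "example/en.mp3"]))
```

Calling the model one file at a time keeps the example close to the Quick Start snippet. It is not tuned for throughput.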

Evaluation Metrics

Transcripts from Multi-lingual NVASR can be scored with the following metrics, which are used in the NV-Bench pipeline:

Metric	Description
OCER / OWER	Overall Character/Word Error Rate (text + NVV tags)
PCER / PWER	Paralinguistic CER/WER (NVV tags only)
CER / WER	Text-only error rate (NVV tags removed)
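The three character-level metrics can be sketched as one edit-distance computation over different token views of the same transcript. This is a minimal illustration, not the NV-Bench scoring code. It assumes that each bracketed NVV tag counts as a single token, that whitespace is ignored, and that no further text normalization is applied. The function names are hypothetical.

```python
import re
from typing import Dict, List

TAG_RE = re.compile(r"\[[^\[\]]+\]")  # any bracketed tag, e.g. [Laughter]


def tokenize(text: str, keep_text: bool = True, keep_tags: bool = True) -> List[str]:
    """Split into single characters and whole NVV tags, dropping whitespace."""
    tokens, pos = [], 0
    for m in TAG_RE.finditer(text):
        if keep_text:
            tokens.extend(ch for ch in text[pos:m.start()] if not ch.isspace())
        if keep_tags:
            tokens.append(m.group(0))
        pos = m.end()
    if keep_text:
        tokens.extend(ch for ch in text[pos:] if not ch.isspace())
    return tokens


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def error_rate(ref: List[str], hyp: List[str]) -> float:
    """Edit distance divided by reference length."""
    if not ref:
        return 0.0 if not hyp else float("inf")
    return edit_distance(ref, hyp) / len(ref)


def nv_error_rates(ref: str, hyp: str) -> Dict[str, float]:
    """OCER (text + tags), PCER (tags only), CER (text only)."""
    return {
        "OCER": error_rate(tokenize(ref), tokenize(hyp)),
        "PCER": error_rate(tokenize(ref, keep_text=False), tokenize(hyp, keep_text=False)),
        "CER": error_rate(tokenize(ref, keep_tags=False), tokenize(hyp, keep_tags=False)),
    }


if __name__ == "__main__":
    print(nv_error_rates("[Laughter]你好吗", "[Sigh]你好"))
```

Under the same assumptions, the word-level variants (OWER, PWER, WER) would split the text on whitespace instead of into characters.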

Our NVASR model maintains high-quality general ASR while significantly outperforming baselines on NVV-specific tasks. — NV-Bench

File Structure

Multi-lingual NVASR/
├── model.pt                        # Model weights (~2.8 GB)
├── config.yaml                     # Model architecture configuration
├── configuration.json              # FunASR pipeline configuration
├── am.mvn                          # Acoustic model mean-variance normalization
├── paralingustic_tokenizer.model   # SentencePiece tokenizer with NVV vocabulary
└── example/                        # Example audio files
    ├── zh.mp3                      # Chinese example
    └── en.mp3                      # English example

Related Resources

NV-Bench Project Page: https://nvbench.github.io
NV-Bench Dataset: Hugging Face
SenseVoice: GitHub

Citation

If you use this model, please cite:

Coming soon

License

This project is licensed under the CC BY-NC 4.0 license.