liumaolin
committed on
Commit
·
59603db
1
Parent(s):
89f7f05
Refactor ASR module: introduce modular structure with ASR interface, implement FunASR and Whisper clients, add registry, and consolidate utility functions for enhanced maintainability and extensibility.
Browse files- src/VoiceDialogue/services/speech/asr/__init__.py +39 -0
- src/VoiceDialogue/services/speech/asr/manager.py +315 -0
- src/VoiceDialogue/services/speech/asr/models/__init__.py +21 -0
- src/VoiceDialogue/services/speech/asr/models/base.py +63 -0
- src/VoiceDialogue/services/speech/asr/models/funasr.py +63 -0
- src/VoiceDialogue/services/speech/asr/models/whisper.py +59 -0
- src/VoiceDialogue/services/speech/asr/utils.py +206 -0
- src/VoiceDialogue/services/speech/asr_service.py +3 -156
src/VoiceDialogue/services/speech/asr/__init__.py
ADDED
|
@@ -0,0 +1,39 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
ASR Module
|
| 3 |
+
|
| 4 |
+
提供自动语音识别(ASR)功能的完整解决方案,包括:
|
| 5 |
+
- ASR管理器和注册系统
|
| 6 |
+
- 多种ASR引擎支持
|
| 7 |
+
- 配置管理
|
| 8 |
+
- 运行时接口
|
| 9 |
+
"""
|
| 10 |
+
|
| 11 |
+
from .models import (
|
| 12 |
+
ASRInterface,
|
| 13 |
+
)
|
| 14 |
+
from .manager import (
|
| 15 |
+
ASRManager,
|
| 16 |
+
ASRRegistryTables,
|
| 17 |
+
asr_manager,
|
| 18 |
+
asr_tables,
|
| 19 |
+
register_all_asr
|
| 20 |
+
)
|
| 21 |
+
|
| 22 |
+
__version__ = "1.0.0"
|
| 23 |
+
|
| 24 |
+
__all__ = [
|
| 25 |
+
# 管理器和注册表
|
| 26 |
+
'ASRManager',
|
| 27 |
+
'ASRRegistryTables',
|
| 28 |
+
'asr_manager',
|
| 29 |
+
'asr_tables',
|
| 30 |
+
'register_all_asr',
|
| 31 |
+
|
| 32 |
+
# 配置模型
|
| 33 |
+
|
| 34 |
+
# 运行时接口
|
| 35 |
+
'ASRInterface',
|
| 36 |
+
]
|
| 37 |
+
|
| 38 |
+
# 模块初始化时自动注册所有ASR实现
|
| 39 |
+
# register_all_asr() 已在 asr_manager 模块中自动调用
|
src/VoiceDialogue/services/speech/asr/manager.py
ADDED
|
@@ -0,0 +1,315 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import inspect
|
| 2 |
+
import logging
|
| 3 |
+
import re
|
| 4 |
+
from dataclasses import dataclass
|
| 5 |
+
from typing import Dict, Type, List, Literal, Optional
|
| 6 |
+
|
| 7 |
+
from .models import ASRInterface
|
| 8 |
+
|
| 9 |
+
|
| 10 |
+
@dataclass
|
| 11 |
+
class ASRRegistryTables:
|
| 12 |
+
"""ASR注册表系统,用于管理不同的ASR实现"""
|
| 13 |
+
|
| 14 |
+
asr_classes: Dict[str, Type[ASRInterface]] = None
|
| 15 |
+
|
| 16 |
+
def __post_init__(self):
|
| 17 |
+
if self.asr_classes is None:
|
| 18 |
+
self.asr_classes = {}
|
| 19 |
+
|
| 20 |
+
def print(self, key: str = None) -> None:
|
| 21 |
+
"""打印已注册的ASR类"""
|
| 22 |
+
print("\nASR Registry Tables: \n")
|
| 23 |
+
headers = ["register name", "class name", "class location", "supported languages"]
|
| 24 |
+
|
| 25 |
+
if self.asr_classes and (key is None or "asr_classes" in key):
|
| 26 |
+
print(f"----------- ** asr_classes ** --------------")
|
| 27 |
+
metas = []
|
| 28 |
+
for register_key, asr_class in self.asr_classes.items():
|
| 29 |
+
class_file = inspect.getfile(asr_class)
|
| 30 |
+
class_line = inspect.getsourcelines(asr_class)[1]
|
| 31 |
+
# 简化路径显示
|
| 32 |
+
pattern = r"^.+/VoiceDialogue/"
|
| 33 |
+
class_file = re.sub(pattern, "VoiceDialogue/", class_file)
|
| 34 |
+
|
| 35 |
+
# 获取支持的语言
|
| 36 |
+
try:
|
| 37 |
+
supported_langs = asr_class.supported_langs
|
| 38 |
+
supported_langs_str = ', '.join(supported_langs) if supported_langs else 'unknown'
|
| 39 |
+
except:
|
| 40 |
+
supported_langs_str = 'unknown'
|
| 41 |
+
|
| 42 |
+
meta_data = [
|
| 43 |
+
register_key,
|
| 44 |
+
asr_class.__name__,
|
| 45 |
+
f"{class_file}:{class_line}",
|
| 46 |
+
supported_langs_str,
|
| 47 |
+
]
|
| 48 |
+
metas.append(meta_data)
|
| 49 |
+
|
| 50 |
+
metas.sort(key=lambda x: x[0])
|
| 51 |
+
data = [headers] + metas
|
| 52 |
+
col_widths = [max(len(str(item)) for item in col) for col in zip(*data)]
|
| 53 |
+
|
| 54 |
+
for row in data:
|
| 55 |
+
print(
|
| 56 |
+
"| "
|
| 57 |
+
+ " | ".join(str(item).ljust(width) for item, width in zip(row, col_widths))
|
| 58 |
+
+ " |"
|
| 59 |
+
)
|
| 60 |
+
print("\n")
|
| 61 |
+
|
| 62 |
+
def _get_asr_supported_languages(self, asr_key: str) -> List[str]:
|
| 63 |
+
"""获取特定ASR引擎支持的语言列表"""
|
| 64 |
+
# 根据ASR类型返回支持的语言
|
| 65 |
+
language_mapping = {
|
| 66 |
+
'funasr': ['zh', 'auto'],
|
| 67 |
+
'whisper': ['en', 'zh', 'auto'],
|
| 68 |
+
}
|
| 69 |
+
return language_mapping.get(asr_key, ['auto'])
|
| 70 |
+
|
| 71 |
+
def register(self, register_table_key: str, key: str = None) -> callable:
|
| 72 |
+
"""装饰器,用于注册ASR类"""
|
| 73 |
+
|
| 74 |
+
def decorator(target_class):
|
| 75 |
+
if not hasattr(self, register_table_key):
|
| 76 |
+
setattr(self, register_table_key, {})
|
| 77 |
+
logging.debug(f"New ASR registry table added: {register_table_key}")
|
| 78 |
+
|
| 79 |
+
registry = getattr(self, register_table_key)
|
| 80 |
+
registry_key = key if key is not None else target_class.__name__
|
| 81 |
+
|
| 82 |
+
if registry_key in registry:
|
| 83 |
+
logging.debug(
|
| 84 |
+
f"Key {registry_key} already exists in {register_table_key}, re-register"
|
| 85 |
+
)
|
| 86 |
+
|
| 87 |
+
registry[registry_key] = target_class
|
| 88 |
+
logging.info(f"Registered ASR class: {registry_key} -> {target_class.__name__}")
|
| 89 |
+
return target_class
|
| 90 |
+
|
| 91 |
+
return decorator
|
| 92 |
+
|
| 93 |
+
|
| 94 |
+
# 全局ASR注册表实例
|
| 95 |
+
asr_tables = ASRRegistryTables()
|
| 96 |
+
|
| 97 |
+
|
| 98 |
+
class ASRManager:
    """Creates, caches, and routes per-language requests to ASR engines."""

    def __init__(self):
        # Cache of live engine instances, keyed by "{asr_type}_{language}".
        self._asr_instances: Dict[str, ASRInterface] = {}
        # Preferred engine for each language code.
        self._language_to_asr_mapping = {
            'zh': 'funasr',     # Chinese: prefer FunASR
            'en': 'whisper',    # English: prefer Whisper
            'auto': 'whisper',  # Auto-detection: default to Whisper
        }

    def create_asr(self, language: Literal['auto', 'zh', 'en']) -> ASRInterface:
        """Build a new ASR instance for *language*.

        Args:
            language: Language code.

        Returns:
            ASRInterface: The freshly constructed ASR instance.

        Raises:
            ValueError: If the mapped ASR type is unregistered or the
                language is unsupported.
        """
        try:
            # Pick the engine configured for this language.
            asr_type = self._get_asr_type_for_language(language)

            if asr_type not in asr_tables.asr_classes:
                raise ValueError(f"ASR类型 '{asr_type}' 未注册")

            asr_class = asr_tables.asr_classes[asr_type]
            instance = asr_class()

            logging.info(f"成功创建ASR实例: {asr_type} for language: {language}")
            return instance

        except Exception as e:
            logging.error(f"创建ASR实例失败: {e}")
            raise

    def get_or_create_asr(self, language: Literal['auto', 'zh', 'en']) -> ASRInterface:
        """Return the cached ASR instance for *language*, creating it on first use.

        Args:
            language: Language code.

        Returns:
            ASRInterface: The cached (one-per-key) ASR instance.
        """
        asr_type = self._get_asr_type_for_language(language)
        instance_key = f"{asr_type}_{language}"

        if instance_key not in self._asr_instances:
            self._asr_instances[instance_key] = self.create_asr(language)

        return self._asr_instances[instance_key]

    def _get_asr_type_for_language(self, language: str) -> str:
        """Map a language code to its configured ASR type.

        Raises:
            ValueError: If no engine is mapped to *language*.
        """
        asr_type = self._language_to_asr_mapping.get(language)
        if not asr_type:
            raise ValueError(f"不支持的语言类型: {language}")
        return asr_type

    def set_language_mapping(self, language: str, asr_type: str) -> None:
        """Point *language* at a different registered ASR engine.

        Args:
            language: Language code.
            asr_type: Registered ASR engine key.

        Raises:
            ValueError: If *asr_type* has not been registered.
        """
        if asr_type not in asr_tables.asr_classes:
            raise ValueError(f"ASR类型 '{asr_type}' 未注册")

        self._language_to_asr_mapping[language] = asr_type
        logging.info(f"更新语言映射: {language} -> {asr_type}")

    def list_registered_asr(self) -> Dict[str, Type[ASRInterface]]:
        """Return a shallow copy of the registered ASR classes."""
        return asr_tables.asr_classes.copy()

    def is_asr_registered(self, asr_type: str) -> bool:
        """Return True when *asr_type* is present in the registry."""
        return asr_type in asr_tables.asr_classes

    def get_supported_languages(self) -> Dict[str, List[str]]:
        """Collect the supported-language list of every registered engine.

        Returns:
            Dict[str, List[str]]: Engine name -> language codes
                (['unknown'] when the lookup fails).
        """
        supported_languages = {}

        for asr_key in asr_tables.asr_classes.keys():
            try:
                languages = asr_tables._get_asr_supported_languages(asr_key)
                supported_languages[asr_key] = languages
            except Exception as e:
                logging.warning(f"获取ASR引擎 '{asr_key}' 支持的语言失败: {e}")
                supported_languages[asr_key] = ['unknown']

        return supported_languages

    def get_available_languages(self) -> List[str]:
        """Return the sorted union of languages across all engines.

        Returns:
            List[str]: Available language codes ('unknown' markers removed).
        """
        all_languages = set()
        # Only the language lists matter here, not the engine names.
        for languages in self.get_supported_languages().values():
            all_languages.update(languages)

        all_languages.discard('unknown')
        return sorted(all_languages)

    def validate_language_support(self, language: str) -> bool:
        """Return True when at least one registered engine supports *language*.

        Args:
            language: Language code.

        Returns:
            bool: Whether the language is supported.
        """
        return language in self.get_available_languages()

    def get_optimal_asr_for_language(self, language: str) -> Optional[str]:
        """Pick the best ASR engine for *language*.

        The explicit language mapping wins when its engine is registered;
        otherwise the first engine advertising the language is used.

        Args:
            language: Language code.

        Returns:
            Optional[str]: Engine name, or None when nothing supports it.
        """
        if language in self._language_to_asr_mapping:
            asr_type = self._language_to_asr_mapping[language]
            if self.is_asr_registered(asr_type):
                return asr_type

        supported_langs = self.get_supported_languages()
        for asr_key, languages in supported_langs.items():
            if language in languages:
                return asr_key

        return None

    def cleanup(self) -> None:
        """Drop every cached ASR instance."""
        logging.info("清理ASR实例...")
        self._asr_instances.clear()
        logging.info("ASR实例清理完成")

    def print_registry(self) -> None:
        """Print the registry tables to stdout."""
        asr_tables.print()

    def get_asr_statistics(self) -> Dict:
        """Summarize the manager state.

        Returns:
            Dict: Counts, available languages, and current mappings.
        """
        return {
            'registered_asr_count': len(asr_tables.asr_classes),
            'active_instances_count': len(self._asr_instances),
            'supported_languages': self.get_available_languages(),
            'language_mappings': self._language_to_asr_mapping.copy(),
            'registered_asr_types': list(asr_tables.asr_classes.keys())
        }


# Global ASR manager instance.
asr_manager = ASRManager()
|
| 287 |
+
|
| 288 |
+
|
| 289 |
+
def register_all_asr():
    """Discover and import every ASR implementation under models/.

    Importing a model module triggers its @asr_tables.register decorator,
    which is what actually populates the registry; the module object itself
    is not needed afterwards.
    """
    import importlib
    from pathlib import Path

    # Locate the sibling models/ directory next to this file.
    models_dir = Path(__file__).parent / "models"

    for py_file in models_dir.glob("*.py"):
        # base.py only defines the interface, and __init__.py must not be
        # re-imported through this path.
        if py_file.name in ["__init__.py", "base.py"]:
            continue

        module_name = py_file.stem
        try:
            # Resolve the relative import against this module's own package
            # instead of a hard-coded dotted path: the previous literal
            # "VoiceDialogue.services.speech.asr" silently failed whenever
            # the project was imported without the src/ prefix on sys.path
            # (the rest of the codebase uses absolute imports like
            # `from config import paths`, implying exactly that layout).
            importlib.import_module(f".models.{module_name}",
                                    package=__package__)
            logging.info(f"Successfully imported ASR module: {module_name}")
        except ImportError as e:
            logging.warning(f"Failed to import ASR module {module_name}: {e}")
        except Exception as e:
            logging.error(f"Unexpected error importing ASR module {module_name}: {e}")


# Register every ASR implementation as soon as this module is imported.
register_all_asr()
|
src/VoiceDialogue/services/speech/asr/models/__init__.py
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# The abstract interface is always exported; the concrete clients below are
# optional because their third-party backends may not be installed.
from .base import ASRInterface

__all__ = ['ASRInterface']

# FunASR client requires funasr_onnx; degrade gracefully (warn) when missing.
try:
    from .funasr import FunASRClient

    __all__.append('FunASRClient')
except ImportError as e:
    import logging

    logging.warning(f"Failed to import some FunASR implementations: {e}")

# Whisper client requires pywhispercpp; degrade gracefully (warn) when missing.
try:
    from .whisper import WhisperCppClient

    __all__.append('WhisperCppClient')
except ImportError as e:
    import logging

    logging.warning(f"Failed to import some Whisper implementations: {e}")
|
src/VoiceDialogue/services/speech/asr/models/base.py
ADDED
|
@@ -0,0 +1,63 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from abc import ABC, abstractmethod
|
| 2 |
+
from enum import Enum
|
| 3 |
+
|
| 4 |
+
import librosa
|
| 5 |
+
import numpy as np
|
| 6 |
+
|
| 7 |
+
from config import paths
|
| 8 |
+
|
| 9 |
+
|
| 10 |
+
class ASRConfigType(Enum):
    """Identifiers for the available ASR engine backends."""

    FUNASR = 'funasr'            # funasr_onnx Paraformer stack
    WHISPER_CPP = 'whisper_cpp'  # whisper.cpp (pywhispercpp) backend
|
| 14 |
+
|
| 15 |
+
|
| 16 |
+
class Language(Enum):
    """Language codes accepted by the ASR layer."""

    AUTO = 'auto'   # let the engine detect the spoken language
    CHINESE = 'zh'
    ENGLISH = 'en'
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
class ASRInterface(ABC):
    """Abstract contract that every concrete ASR backend implements."""

    # Language codes the concrete engine accepts; subclasses override this.
    supported_langs = []

    def __init__(self):
        # Load the bundled sample clip for warmup when it exists; otherwise
        # synthesize one second of quiet noise as a stand-in.
        sample_path = paths.RESOURCES_PATH / 'audio' / 'jfk.flac'
        if sample_path.exists():
            self.warmup_audiodata, _ = librosa.load(sample_path, sr=16000, mono=True)
        else:
            self.warmup_audiodata = np.random.randn(16000).astype(np.float32) * 0.1

    @abstractmethod
    def setup(self, **kwargs) -> None:
        """Initialize the ASR engine.

        Args:
            **kwargs: Engine-specific initialization options.
        """

    @abstractmethod
    def warmup(self) -> None:
        """Run a throwaway inference so the first real call is fast."""

    @abstractmethod
    def transcribe(self, audio_array: np.ndarray, language: str = None) -> str:
        """Convert audio samples to text.

        Args:
            audio_array: Audio data to recognize.
            language: Language override; None means use the configured language.

        Returns:
            str: The recognized text.
        """
|
src/VoiceDialogue/services/speech/asr/models/funasr.py
ADDED
|
@@ -0,0 +1,63 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import re
|
| 2 |
+
import typing
|
| 3 |
+
|
| 4 |
+
import numpy as np
|
| 5 |
+
from funasr_onnx import SeacoParaformer, CT_Transformer
|
| 6 |
+
|
| 7 |
+
from config import paths
|
| 8 |
+
from .base import ASRInterface
|
| 9 |
+
from ..manager import asr_tables
|
| 10 |
+
from ..utils import ensure_minimum_audio_duration
|
| 11 |
+
|
| 12 |
+
|
| 13 |
+
@asr_tables.register('asr_classes', 'funasr')
class FunASRClient(ASRInterface):
    """FunASR (funasr_onnx) speech-recognition client for Mandarin Chinese."""

    supported_langs = ['zh']

    def __init__(self):
        super().__init__()
        # Both models are loaded lazily in setup(), not at construction time.
        self.funasr_model: typing.Optional[SeacoParaformer] = None
        self.punc_model: typing.Optional[CT_Transformer] = None

    def setup(self, **kwargs) -> None:
        """Load the Paraformer ASR model and the punctuation-restoration model.

        Args:
            **kwargs: Unused; accepted for interface compatibility.
        """
        # Shared on-disk model cache directory.
        models_dir = paths.MODELS_PATH / "asr"
        asr_model_path = models_dir / "speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
        punc_model_path = models_dir / "punc_ct-transformer_cn-en-common-vocab471067-large"
        self.funasr_model = SeacoParaformer(asr_model_path, quantize=True)
        self.punc_model = CT_Transformer(punc_model_path, quantize=True)

    def warmup(self) -> None:
        """Run one transcription on the warmup clip so later calls are fast."""
        print('[INFO] Warming up FunASR model...')
        try:
            self.transcribe(self.warmup_audiodata)
            print('[INFO] FunASR model warmed up.')
        except Exception as e:
            print(f'[WARNING] FunASR model warmup failed: {e}')

    def _fix_spaced_uppercase(self, text: str) -> str:
        """Collapse spaced-out capitals such as " G N O M E " into "GNOME"."""
        # Two or more uppercase letters separated only by whitespace.
        pattern = r'([A-Z])\s+([A-Z](?:\s+[A-Z])*)'

        def replace_func(match):
            # Drop every space inside the matched run.
            return match.group(0).replace(' ', '')

        return re.sub(pattern, replace_func, text)

    def transcribe(self, audio_array: np.ndarray, language="auto"):
        """Transcribe *audio_array* into punctuated text.

        Args:
            audio_array: Mono audio samples (padded to a minimum duration
                before recognition).
            language: Ignored by this backend; kept for interface
                compatibility with ASRInterface.

        Returns:
            str: Space-joined transcription of all recognized segments.
        """
        audio_array = ensure_minimum_audio_duration(audio_array)

        segments = self.funasr_model(wav_content=audio_array, hotwords='')

        transcribed_texts = []
        for segment in segments:
            content = segment.get("preds", "")
            # Restore punctuation, then merge spaced-out acronyms.
            content, _ = self.punc_model(content)
            content = self._fix_spaced_uppercase(content)
            transcribed_texts.append(content)
        return " ".join(transcribed_texts)
|
src/VoiceDialogue/services/speech/asr/models/whisper.py
ADDED
|
@@ -0,0 +1,59 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import typing
|
| 2 |
+
|
| 3 |
+
import numpy as np
|
| 4 |
+
from pywhispercpp.model import Model
|
| 5 |
+
|
| 6 |
+
from config import paths
|
| 7 |
+
from .base import ASRInterface
|
| 8 |
+
from ..manager import asr_tables
|
| 9 |
+
from ..utils import ensure_minimum_audio_duration
|
| 10 |
+
|
| 11 |
+
|
| 12 |
+
@asr_tables.register('asr_classes', 'whisper')
class WhisperCppClient(ASRInterface):
    """whisper.cpp-backed speech-recognition client (pywhispercpp bindings)."""

    supported_langs = ['en', 'zh', 'auto']

    def __init__(self):
        super().__init__()
        # The model is created in setup(); default language is English.
        self.whisper: typing.Optional[Model] = None
        self.language = "en"

    def setup(self, **kwargs) -> None:
        """Instantiate the whisper.cpp model.

        Args:
            **kwargs: 'model' selects the checkpoint; "medium" maps to the
                quantized medium model, anything else to the large-v3 turbo.
        """
        requested = kwargs.get('model', 'medium')
        model_name = "medium-q5_0" if requested == "medium" else "large-v3-turbo-q5_0"

        models_dir = paths.MODELS_PATH / "asr"
        self.whisper = Model(model=model_name, models_dir=models_dir)

    def warmup(self) -> None:
        """Run one transcription on the warmup clip so later calls are fast."""
        print('[INFO] Warming up Whisper model...')
        try:
            self.transcribe(self.warmup_audiodata)
            print('[INFO] Whisper model warmed up.')
        except Exception as e:
            print(f'[WARNING] Whisper model warmup failed: {e}')

    def transcribe(self, audio_array: np.ndarray, language="en"):
        """Transcribe audio, steering decoding with a language-matched prompt.

        Args:
            audio_array: Audio samples (padded to a minimum duration first).
            language: Language code passed through to whisper.cpp.

        Returns:
            str: Space-joined text of all recognized segments.
        """
        prompt = (
            "以下是简体中文普通话的句子。"
            if language == "zh"
            else "The following is an English sentence."
        )

        audio_array = ensure_minimum_audio_duration(audio_array)

        segments = self.whisper.transcribe(
            audio_array, language=language, initial_prompt=prompt, print_progress=False
        )
        return " ".join(segment.text for segment in segments)
|
src/VoiceDialogue/services/speech/asr/utils.py
ADDED
|
@@ -0,0 +1,206 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
ASR模块的工具函数
|
| 3 |
+
包含音频预处理、格式转换等工具函数
|
| 4 |
+
"""
|
| 5 |
+
|
| 6 |
+
import numpy as np
|
| 7 |
+
|
| 8 |
+
|
| 9 |
+
def ensure_minimum_audio_duration(
    audio_array: np.ndarray, min_duration: float = 1.0, sample_rate: int = 16000
) -> np.ndarray:
    """Pad *audio_array* so it lasts at least *min_duration* seconds.

    Audio shorter than the minimum is extended via padding_silence (which,
    despite its name, appends a faint tone — see its docstring); audio that
    is already long enough is returned unchanged.

    Args:
        audio_array: Input audio samples.
        min_duration: Minimum required duration in seconds (default 1 s).
        sample_rate: Sampling rate in Hz (default 16000).

    Returns:
        The (possibly padded) audio array.
    """
    audio_duration = audio_array.shape[-1] / sample_rate

    if audio_duration < min_duration:
        padding_seconds = min_duration - audio_duration
        audio_array = padding_silence(audio_array, padding_seconds, sample_rate)

    return audio_array


def padding_silence(
    audio_data: np.ndarray, duration_seconds: float, sample_rate: int = 16000
) -> np.ndarray:
    """Append filler audio of *duration_seconds* (+0.1 s margin) to *audio_data*.

    NOTE(review): despite the name, the filler is NOT silence — it is a
    0.5-amplitude 440 Hz sine tone (the original docstring claimed silence).
    Presumably the tone keeps downstream ASR engines stable; confirm before
    replacing it with np.zeros.

    Args:
        audio_data: Original audio samples.
        duration_seconds: Amount of padding to add, in seconds.
        sample_rate: Sampling rate in Hz.

    Returns:
        The original audio with the filler tone appended.
    """
    frequency = 440.0
    # The extra 0.1 s guarantees the minimum-duration check passes despite
    # float rounding in the sample count.
    duration = duration_seconds + 0.1
    # NOTE(review): the time axis inherits the input dtype; an integer dtype
    # would truncate it to whole numbers — this assumes float audio input.
    t = np.linspace(
        0, duration, int(sample_rate * duration), endpoint=False, dtype=audio_data.dtype
    )
    silence = 0.5 * np.sin(2 * np.pi * frequency * t)
    audio_data = np.concatenate([audio_data, silence])
    return audio_data
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
def validate_audio_array(audio_array: np.ndarray) -> bool:
    """Report whether *audio_array* is a usable audio buffer.

    A valid buffer is a non-empty numpy array with at most two dimensions
    (mono or multi-channel). None and non-array objects are rejected.

    Args:
        audio_array: Candidate audio buffer.

    Returns:
        bool: True when the buffer is valid.
    """
    # isinstance also rejects None, covering the original explicit check.
    if not isinstance(audio_array, np.ndarray):
        return False
    return audio_array.size > 0 and audio_array.ndim <= 2
|
| 79 |
+
|
| 80 |
+
|
| 81 |
+
def normalize_audio(audio_array: np.ndarray, target_peak: float = 0.95) -> np.ndarray:
    """Scale *audio_array* so its absolute peak equals *target_peak*.

    Args:
        audio_array: Input audio samples.
        target_peak: Desired peak amplitude (default 0.95).

    Returns:
        The rescaled audio array; an all-zero input is returned untouched.

    Raises:
        ValueError: If the input is not a valid audio array.
    """
    if not validate_audio_array(audio_array):
        raise ValueError("Invalid audio array")

    peak = np.max(np.abs(audio_array))
    if peak == 0:
        # Silence cannot be normalized — return it as-is.
        return audio_array

    scale_factor = target_peak / peak
    return audio_array * scale_factor
|
| 108 |
+
|
| 109 |
+
|
| 110 |
+
def convert_sample_rate(
    audio_array: np.ndarray,
    source_rate: int,
    target_rate: int
) -> np.ndarray:
    """Resample *audio_array* from *source_rate* to *target_rate*.

    Uses librosa's resampler when available and falls back to simple linear
    interpolation otherwise. Equal rates are a no-op.

    Args:
        audio_array: Input audio samples.
        source_rate: Current sampling rate in Hz.
        target_rate: Desired sampling rate in Hz.

    Returns:
        The resampled audio array.
    """
    if source_rate == target_rate:
        return audio_array

    try:
        import librosa
    except ImportError:
        # librosa unavailable — approximate with linear interpolation.
        ratio = target_rate / source_rate
        resampled_len = int(len(audio_array) * ratio)
        sample_points = np.linspace(0, len(audio_array) - 1, resampled_len)
        return np.interp(sample_points, np.arange(len(audio_array)), audio_array)

    return librosa.resample(audio_array, orig_sr=source_rate, target_sr=target_rate)
|
| 138 |
+
|
| 139 |
+
|
| 140 |
+
def trim_silence(
    audio_array: np.ndarray,
    threshold: float = 0.01,
    sample_rate: int = 16000
) -> np.ndarray:
    """Strip leading and trailing silence from an audio buffer.

    Args:
        audio_array: Input audio samples.
        threshold: Absolute amplitude below which a sample counts as silence.
        sample_rate: Sampling rate in Hz (used for the all-silent fallback).

    Returns:
        The trimmed audio array; an invalid input is returned unchanged,
        and an entirely silent input is truncated to at most 100 ms.
    """
    if not validate_audio_array(audio_array):
        return audio_array

    # Boolean mask of samples loud enough to keep.
    loud_enough = np.abs(audio_array) > threshold

    if not np.any(loud_enough):
        # Entirely silent: keep at most 100 ms of it.
        min_samples = int(0.1 * sample_rate)
        if len(audio_array) > min_samples:
            return audio_array[:min_samples]
        return audio_array

    # First loud sample, and one past the last loud sample.
    first = np.argmax(loud_enough)
    last = len(loud_enough) - np.argmax(loud_enough[::-1])
    return audio_array[first:last]
|
| 175 |
+
|
| 176 |
+
|
| 177 |
+
def get_audio_duration(audio_array: np.ndarray, sample_rate: int = 16000) -> float:
    """Return the duration of *audio_array* in seconds.

    Args:
        audio_array: Audio samples.
        sample_rate: Sampling rate in Hz.

    Returns:
        Duration in seconds, or 0.0 for an invalid buffer.
    """
    if not validate_audio_array(audio_array):
        return 0.0
    # The last axis holds the samples (works for mono and multi-channel).
    sample_count = audio_array.shape[-1]
    return sample_count / sample_rate
|
| 192 |
+
|
| 193 |
+
|
| 194 |
+
def create_silence(duration_seconds: float, sample_rate: int = 16000) -> np.ndarray:
    """Build a zero-filled float32 buffer of the requested duration.

    Args:
        duration_seconds: Desired silence length in seconds.
        sample_rate: Sampling rate in Hz.

    Returns:
        A float32 numpy array of zeros.
    """
    sample_count = int(duration_seconds * sample_rate)
    return np.zeros(sample_count, dtype=np.float32)
|
src/VoiceDialogue/services/speech/asr_service.py
CHANGED
|
@@ -1,168 +1,14 @@
|
|
| 1 |
-
import re
|
| 2 |
import time
|
| 3 |
import typing
|
| 4 |
from queue import Queue
|
| 5 |
|
| 6 |
-
import librosa
|
| 7 |
import numpy as np
|
| 8 |
-
from funasr_onnx import SeacoParaformer, CT_Transformer
|
| 9 |
-
from pywhispercpp.model import Model
|
| 10 |
|
| 11 |
-
from config import paths
|
| 12 |
from models.voice_task import VoiceTask
|
| 13 |
from services.core.base import BaseThread
|
| 14 |
from services.core.constants import user_still_speaking_event, voice_state_manager, dropped_audio_cache
|
| 15 |
from utils.cache import LRUCacheDict
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
def ensure_minimum_audio_duration(
|
| 19 |
-
audio_array: np.ndarray, min_duration: float = 1.0, sample_rate: int = 16000
|
| 20 |
-
) -> np.ndarray:
|
| 21 |
-
"""
|
| 22 |
-
确保音频数组满足最小时长要求,如果不足则用静音填充
|
| 23 |
-
|
| 24 |
-
Args:
|
| 25 |
-
audio_array: 输入音频数组
|
| 26 |
-
min_duration: 最小时长要求(秒),默认1秒
|
| 27 |
-
sample_rate: 采样率,默认16000Hz
|
| 28 |
-
|
| 29 |
-
Returns:
|
| 30 |
-
处理后的音频数组
|
| 31 |
-
"""
|
| 32 |
-
audio_duration = audio_array.shape[-1] / sample_rate
|
| 33 |
-
|
| 34 |
-
if audio_duration < min_duration:
|
| 35 |
-
padding_seconds = min_duration - audio_duration
|
| 36 |
-
audio_array = padding_silence(audio_array, padding_seconds, sample_rate)
|
| 37 |
-
|
| 38 |
-
return audio_array
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
def padding_silence(
|
| 42 |
-
audio_data: np.ndarray, duration_seconds: float, sample_rate: int = 16000
|
| 43 |
-
) -> np.ndarray:
|
| 44 |
-
"""
|
| 45 |
-
为音频数据添加静音填充
|
| 46 |
-
|
| 47 |
-
Args:
|
| 48 |
-
audio_data: 原始音频数据
|
| 49 |
-
duration_seconds: 需要填充的时长(秒)
|
| 50 |
-
sample_rate: 采样率
|
| 51 |
-
|
| 52 |
-
Returns:
|
| 53 |
-
填充后的音频数据
|
| 54 |
-
"""
|
| 55 |
-
frequency = 440.0
|
| 56 |
-
duration = duration_seconds + 0.1
|
| 57 |
-
t = np.linspace(
|
| 58 |
-
0, duration, int(sample_rate * duration), endpoint=False, dtype=audio_data.dtype
|
| 59 |
-
)
|
| 60 |
-
silence = 0.5 * np.sin(2 * np.pi * frequency * t)
|
| 61 |
-
audio_data = np.concatenate([audio_data, silence])
|
| 62 |
-
return audio_data
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
class WhisperCppClient:
|
| 66 |
-
"""Whisper C++ API客户端"""
|
| 67 |
-
|
| 68 |
-
def __init__(self, model: typing.Literal["medium", "large"] = "medium"):
|
| 69 |
-
if model == "medium":
|
| 70 |
-
model = "medium-q5_0"
|
| 71 |
-
else:
|
| 72 |
-
model = "large-v3-turbo-q5_0"
|
| 73 |
-
|
| 74 |
-
models_dir = paths.MODELS_PATH / "asr"
|
| 75 |
-
self.whisper = Model(model=model, models_dir=models_dir)
|
| 76 |
-
|
| 77 |
-
def transcribe(self, audio_array: np.ndarray, language="en"):
|
| 78 |
-
if language == "zh":
|
| 79 |
-
prompt = "以下是简体中文普通话的句子。"
|
| 80 |
-
else:
|
| 81 |
-
prompt = "The following is an English sentence."
|
| 82 |
-
|
| 83 |
-
audio_array = ensure_minimum_audio_duration(audio_array)
|
| 84 |
-
|
| 85 |
-
# print('............... language:', language)
|
| 86 |
-
segments = self.whisper.transcribe(
|
| 87 |
-
audio_array, language=language, initial_prompt=prompt, print_progress=False
|
| 88 |
-
)
|
| 89 |
-
text = []
|
| 90 |
-
for segment in segments:
|
| 91 |
-
content = segment.text
|
| 92 |
-
# if not content.endswith(()):
|
| 93 |
-
# content += ','
|
| 94 |
-
text.append(content)
|
| 95 |
-
text = " ".join(text)
|
| 96 |
-
return text
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
class FunASRClient:
|
| 100 |
-
"""FunASR API客户端"""
|
| 101 |
-
|
| 102 |
-
def __init__(self):
|
| 103 |
-
# 设置模型缓存目录
|
| 104 |
-
models_dir = paths.MODELS_PATH / "asr"
|
| 105 |
-
asr_model_path = models_dir / "speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
|
| 106 |
-
punc_model_path = models_dir / "punc_ct-transformer_cn-en-common-vocab471067-large"
|
| 107 |
-
self.funasr_model = SeacoParaformer(asr_model_path, quantize=True)
|
| 108 |
-
self.punc_model = CT_Transformer(punc_model_path, quantize=True)
|
| 109 |
-
|
| 110 |
-
def _fix_spaced_uppercase(self, text: str) -> str:
|
| 111 |
-
"""
|
| 112 |
-
修复类似 " G N O M E " 这样的大写字母间有空格的字符串,将其替换为 "GNOME"
|
| 113 |
-
"""
|
| 114 |
-
# 匹配大写字母之间的空格模式,至少2个大写字母
|
| 115 |
-
pattern = r'([A-Z])\s+([A-Z](?:\s+[A-Z])*)'
|
| 116 |
-
|
| 117 |
-
def replace_func(match):
|
| 118 |
-
# 移除所有空格
|
| 119 |
-
return match.group(0).replace(' ', '')
|
| 120 |
-
|
| 121 |
-
return re.sub(pattern, replace_func, text)
|
| 122 |
-
|
| 123 |
-
def transcribe(self, audio_array: np.ndarray, language="auto"):
|
| 124 |
-
audio_array = ensure_minimum_audio_duration(audio_array)
|
| 125 |
-
|
| 126 |
-
segments = self.funasr_model(wav_content=audio_array, hotwords='')
|
| 127 |
-
|
| 128 |
-
transcibed_texts = []
|
| 129 |
-
for segment in segments:
|
| 130 |
-
content = segment.get("preds", "")
|
| 131 |
-
content, _ = self.punc_model(content)
|
| 132 |
-
content = self._fix_spaced_uppercase(content)
|
| 133 |
-
transcibed_texts.append(content)
|
| 134 |
-
return " ".join(transcibed_texts)
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
class UnifiedASRClient:
|
| 138 |
-
"""统一的语音识别客户端,根据语言自动选择FunASR或Whisper"""
|
| 139 |
-
|
| 140 |
-
def __init__(self, language: typing.Literal["auto", "zh", "en"] = "zh"):
|
| 141 |
-
self.language = language
|
| 142 |
-
|
| 143 |
-
if language == "zh":
|
| 144 |
-
self.client = FunASRClient()
|
| 145 |
-
else:
|
| 146 |
-
self.client = WhisperCppClient()
|
| 147 |
-
|
| 148 |
-
def warmup(self):
|
| 149 |
-
"""预热模型"""
|
| 150 |
-
print('[INFO] 预热语音识别模型...')
|
| 151 |
-
try:
|
| 152 |
-
warmup_audiofile = paths.RESOURCES_PATH / 'audio' / 'jfk.flac'
|
| 153 |
-
if warmup_audiofile.exists():
|
| 154 |
-
data, sr = librosa.load(warmup_audiofile, sr=16000, mono=True)
|
| 155 |
-
self.client.transcribe(data, language=self.language)
|
| 156 |
-
else:
|
| 157 |
-
# 创建测试音频
|
| 158 |
-
test_audio = np.random.randn(16000).astype(np.float32) * 0.1 # 1秒的噪声
|
| 159 |
-
self.client.transcribe(test_audio, language=self.language)
|
| 160 |
-
print('[INFO] ASR模型预热完成')
|
| 161 |
-
except Exception as e:
|
| 162 |
-
print(f'[WARNING] ASR模型预热失败: {e}')
|
| 163 |
-
|
| 164 |
-
def transcribe(self, audio_array: np.ndarray) -> str:
|
| 165 |
-
return self.client.transcribe(audio_array, language=self.language)
|
| 166 |
|
| 167 |
|
| 168 |
class ASRWorker(BaseThread):
|
|
@@ -179,7 +25,8 @@ class ASRWorker(BaseThread):
|
|
| 179 |
self.cached_user_questions = LRUCacheDict(maxsize=10)
|
| 180 |
|
| 181 |
def run(self):
|
| 182 |
-
self.client =
|
|
|
|
| 183 |
self.client.warmup()
|
| 184 |
|
| 185 |
self.is_ready = True
|
|
|
|
|
|
|
| 1 |
import time
|
| 2 |
import typing
|
| 3 |
from queue import Queue
|
| 4 |
|
|
|
|
| 5 |
import numpy as np
|
|
|
|
|
|
|
| 6 |
|
|
|
|
| 7 |
from models.voice_task import VoiceTask
|
| 8 |
from services.core.base import BaseThread
|
| 9 |
from services.core.constants import user_still_speaking_event, voice_state_manager, dropped_audio_cache
|
| 10 |
from utils.cache import LRUCacheDict
|
| 11 |
+
from .asr import asr_manager
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 12 |
|
| 13 |
|
| 14 |
class ASRWorker(BaseThread):
|
|
|
|
| 25 |
self.cached_user_questions = LRUCacheDict(maxsize=10)
|
| 26 |
|
| 27 |
def run(self):
|
| 28 |
+
self.client = asr_manager.create_asr(self.language)
|
| 29 |
+
self.client.setup()
|
| 30 |
self.client.warmup()
|
| 31 |
|
| 32 |
self.is_ready = True
|