liumaolin committed
Commit · 2988b10
1 Parent(s): ecc005d

Add multilingual support and optimize LLM pipeline configuration.
Changed files:
- models/llm/Qwen2.5-14B-Instruct.Q4_0.gguf +3 -0
- src/VoiceDialogue/main.py +3 -35
- src/VoiceDialogue/models/voice_task.py +1 -0
- src/VoiceDialogue/services/speech/asr_service.py +1 -0
- src/VoiceDialogue/services/text/{llm.py → langchain_llm.py} +3 -74
- src/VoiceDialogue/services/text/text_generator.py +43 -9
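
In short, this commit threads a language tag from speech recognition through to prompt selection, and moves LLM construction out of main.py into the generator thread. A condensed sketch of the new flow, using only names that appear in the diffs below (the surrounding class bodies are elided):

    # ASRWorker (asr_service.py): stamp each task with the recognizer's language
    voice_task.language = self.language

    # VoiceTask (voice_task.py): the tag travels with the task
    language: str = Field(default="zh")

    # LLMResponseGenerator (text_generator.py): pick the matching system prompt per task
    system_prompt = self._get_prompt_by_language(voice_task.language)
    pipeline = create_langchain_pipeline(self.model_instance, system_prompt, self.get_session_history)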
models/llm/Qwen2.5-14B-Instruct.Q4_0.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1c9d82509108e9f758ff9ad1034d2c23088c611c76eaec308ad246caf8decac5
+size 8517725952
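The three added lines are a Git LFS pointer (spec version, SHA-256 of the blob, and its byte size, roughly 8.5 GB), not the model weights themselves. A small sketch of how one could verify a locally downloaded copy against this pointer; the local path is illustrative:

    import hashlib
    import pathlib

    MODEL = pathlib.Path('models/llm/Qwen2.5-14B-Instruct.Q4_0.gguf')
    EXPECTED_OID = '1c9d82509108e9f758ff9ad1034d2c23088c611c76eaec308ad246caf8decac5'
    EXPECTED_SIZE = 8517725952

    # size check is cheap; do it first
    assert MODEL.stat().st_size == EXPECTED_SIZE

    # stream the file in 1 MiB blocks to hash 8.5 GB without loading it all
    digest = hashlib.sha256()
    with MODEL.open('rb') as fh:
        for block in iter(lambda: fh.read(1 << 20), b''):
            digest.update(block)
    assert digest.hexdigest() == EXPECTED_OID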
src/VoiceDialogue/main.py
CHANGED
@@ -6,7 +6,6 @@ from config.paths import load_third_party
 
 load_third_party()
 
-from models.language_model import language_model_registry
 from models.voice_model import voice_model_registry
 from services.audio.aec_audio_capture import EchoCancellingAudioCapture
 from services.audio.audio_answer import TTSAudioGenerator
@@ -15,16 +14,13 @@ from services.speech.speech_monitor import SpeechStateMonitor
 from services.speech.asr_service import ASRWorker
 from services.text.text_generator import LLMResponseGenerator
 
-
 HERE = Path(__file__).parent
 language: typing.Literal['zh', 'en'] = 'en'
 
 
 def launch_system(
         user_language: str,
-
-        tts_speaker: str,
-        llm: typing.Literal['7B', '14B'] = '14B'
+        tts_speaker: str
 ):
     audio_frames_queue = Queue()
     user_voice_queue = Queue()
@@ -53,31 +49,9 @@ def launch_system(
     whisper_worker.start()
     threads.append(whisper_worker)
 
-    if llm == '8B':
-        selected_llm_model = language_model_registry[-1]
-    elif llm == '7B':
-        selected_llm_model = language_model_registry[-3]
-    else:
-        selected_llm_model = language_model_registry[-2]
-
-    selected_llm_model.download_model()
-    default_llm_params = {
-        'streaming': True,
-        'n_gpu_layers': -1,
-        'n_batch': 512,
-        'n_ctx': 2048,
-        'f16_kv': True,
-        'temperature': 0.8,
-        # 'n_predict': -1,
-        'top_k': 50,
-        'top_p': 1.0,
-    }
     answer_generator_worker = LLMResponseGenerator(
         user_question_queue=transcribed_text_queue,
-        generated_answer_queue=generated_answer_queue,
-        local_model_path=selected_llm_model.pretrained_model_path,
-        model_params=default_llm_params,
-        prompt_template=system_prompt
+        generated_answer_queue=generated_answer_queue
     )
     answer_generator_worker.start()
     threads.append(answer_generator_worker)
@@ -114,16 +88,10 @@ def launch_system(
 def main():
     user_language: typing.Literal['zh', 'en'] = 'zh'
 
-    SYSTEM_PROMPT = ("你是善于模拟真实的思考过程的AI助手。"
-                     "回答时,必须首先生成一个不超过5个字的简短句子,"
-                     "比如:\"让我想一下\"、\"在我看来\"、\"稍等我理一理\"、\"不错的问题\"、\"稍等片刻\"等,然后再进行正式回答,"
-                     "保持中文口语化表达,禁用emoji和系统相关描述,确保衔接词与内容存在合理逻辑关联。")
     # '罗翔', '马保国', '沈逸', '杨幂', '周杰伦', '马云'
    tts_speaker = '沈逸'
-    # QWen2.5 7B or 14B
-    llm = '14B'
 
-    launch_system(user_language, tts_speaker, llm)
+    launch_system(user_language, tts_speaker)
 
 
 if __name__ == '__main__':
src/VoiceDialogue/models/voice_task.py
CHANGED
@@ -8,6 +8,7 @@ class VoiceTask(BaseModel):
     id: str
 
     session_id: str = Field(default="")
+    language: str = Field(default="zh")
     is_speaking_over_threshold: bool = Field(default=False)
     is_over_audio_frames_threshold: bool = Field(default=False)
     user_voice: np.array = Field(default=np.array([]))
src/VoiceDialogue/services/speech/asr_service.py
CHANGED
@@ -180,6 +180,7 @@ class ASRWorker(BaseThread):
 
         while not self.stopped():
             voice_task: VoiceTask = self.user_voice_queue.get()
+            voice_task.language = self.language
             voice_task.whisper_start_time = time.time()
             user_voice: np.array = voice_task.user_voice
             transcribed_text = self.client.transcribe(user_voice)
src/VoiceDialogue/services/text/{llm.py → langchain_llm.py}
RENAMED
@@ -1,13 +1,8 @@
-import hashlib
-import os
 import pathlib
-import threading
 import typing
-from collections import OrderedDict
 
 from langchain_community.chat_models.llamacpp import ChatLlamaCpp
 from langchain_core.callbacks import StreamingStdOutCallbackHandler, CallbackManager
-from langchain_core.language_models.llms import LLM
 from langchain_core.messages import SystemMessage
 from langchain_core.prompts import (
     ChatPromptTemplate, MessagesPlaceholder, HumanMessagePromptTemplate
@@ -16,74 +11,8 @@ from langchain_core.runnables import RunnableWithMessageHistory
 
 from utils.strings import remove_emojis, convert_comma_separated_numbers, convert_uppercase_words_to_lowercase
 
-default_llm_params = OrderedDict({
-    'streaming': True,
-    'n_gpu_layers': -1,
-    'n_batch': 512,
-    'n_ctx': 2048,
-    'f16_kv': True,
-    'temperature': 0.7,
-    'n_predict': -1,
-    'top_k': 50,
-    'top_p': 1.0,
-})
 
-
-singleton_chat_langchain_instance_uid: str = ''
-single_chat_instance_locker = threading.Lock()
-
-
-def setup_chat_langchain_pipeline(
-        local_model_path: str,
-        model_params: dict | None = None,
-        prompt_template: str = '',
-        get_session_history: typing.Callable = None
-):
-    model_path = pathlib.Path(local_model_path)
-    if not model_path.exists():
-        raise RuntimeError(f'Model path not exists: {model_path}')
-
-    if get_session_history is None:
-        raise RuntimeError(f'Function<get_session_history> can\'t be None.')
-
-    if not isinstance(model_params, dict):
-        model_params = default_llm_params
-
-    current_model_uid = generate_unique_id(model_path, model_params)
-
-    with single_chat_instance_locker:
-        global singleton_chat_langchain_instance_uid, singleton_chat_langchain_instance
-        if current_model_uid == singleton_chat_langchain_instance_uid:
-            instance = singleton_chat_langchain_instance
-            langchain_pipeline_is_warmup = True
-        else:
-            singleton_chat_langchain_instance_uid = current_model_uid
-            instance = setup_chat_llamacpp_langchain_instance(local_model_path, model_params)
-            singleton_chat_langchain_instance = instance
-            langchain_pipeline_is_warmup = False
-
-    pipeline = build_chat_langchain_pipeline(instance, prompt_template, get_session_history)
-
-    if not langchain_pipeline_is_warmup:
-        warmup_chat_langchain_pipeline(pipeline)
-
-    return pipeline
-
-
-def generate_unique_id(
-        model_path: str | os.PathLike,
-        model_params: dict,
-        multimodal_path: str | os.PathLike = ''
-):
-    model_uid_params = [f'llm_path={model_path}']
-    if multimodal_path:
-        model_uid_params.append(f'multimodal={multimodal_path}')
-    model_uid_params.extend(f'{k}:{v}' for k, v in model_params.items())
-    current_model_uid = hashlib.md5('&'.join(model_uid_params).encode()).hexdigest()
-    return current_model_uid
-
-
-def setup_chat_llamacpp_langchain_instance(
+def create_langchain_chat_llamacpp_instance(
         local_model_path: str,
         model_params: dict | None = None
 ) -> ChatLlamaCpp:
@@ -109,7 +38,7 @@ def setup_chat_llamacpp_langchain_instance(
     return llamacpp_langchain_instance
 
 
-def build_chat_langchain_pipeline(langchain_instance: LLM, system_prompt: str, get_session_history: typing.Callable):
+def create_langchain_pipeline(langchain_instance, system_prompt: str, get_session_history: typing.Callable):
     prompt = ChatPromptTemplate(messages=[
         SystemMessage(content=system_prompt),
         MessagesPlaceholder(variable_name="history"),
@@ -123,7 +52,7 @@ def build_chat_langchain_pipeline(langchain_instance: LLM, system_prompt: str, get_session_history: typing.Callable):
     return chain_with_history
 
 
-def warmup_chat_langchain_pipeline(pipeline):
+def warmup_langchain_pipeline(pipeline):
     print("Warmup chat pipeline...")
 
     user_input = 'Hello, this is warming up step, if you understand, output "Ok".'
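With the singleton cache, module-level lock, and generate_unique_id hashing removed, the renamed module exposes three plain factories and leaves instance reuse to the caller. A minimal composition sketch under that reading; the model path, params, and the get_history helper are illustrative, not taken from the repository:

    from langchain_core.chat_history import InMemoryChatMessageHistory

    from services.text.langchain_llm import (
        create_langchain_chat_llamacpp_instance,
        create_langchain_pipeline,
        warmup_langchain_pipeline,
    )

    _histories: dict[str, InMemoryChatMessageHistory] = {}

    def get_history(session_id: str) -> InMemoryChatMessageHistory:
        # hypothetical provider: one in-memory history per session
        return _histories.setdefault(session_id, InMemoryChatMessageHistory())

    model = create_langchain_chat_llamacpp_instance(
        local_model_path='models/llm/Qwen2.5-14B-Instruct.Q4_0.gguf',
        model_params={'streaming': True, 'n_ctx': 2048},
    )
    pipeline = create_langchain_pipeline(model, 'You are a helpful assistant.', get_history)
    warmup_langchain_pipeline(pipeline)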
src/VoiceDialogue/services/text/text_generator.py
CHANGED
@@ -5,10 +5,22 @@ from queue import Queue, Empty
 from langchain.memory import ConversationBufferWindowMemory
 from langchain_core.chat_history import InMemoryChatMessageHistory
 
+from config import paths
 from models.voice_task import VoiceTask
 from services.core.base import BaseThread
 from services.core.constants import chat_history_cache
-from services.text.llm import ...
+from services.text.langchain_llm import preprocess_sentence_text, \
+    create_langchain_chat_llamacpp_instance, create_langchain_pipeline, warmup_langchain_pipeline
+
+CHINESE_SYSTEM_PROMPT = ("你是善于模拟真实的思考过程的AI助手。"
+                         "回答时,必须首先生成一个不超过5个字的简短句子,"
+                         "比如:\"让我想一下\"、\"在我看来\"、\"稍等我理一理\"、\"不错的问题\"、\"稍等片刻\"等,然后再进行正式回答,"
+                         "保持中文口语化表达,禁用emoji和系统相关描述,确保衔接词与内容存在合理逻辑关联。")
+
+ENGLISH_SYSTEM_PROMPT = ("You are an AI assistant skilled at simulating authentic thinking processes. "
+                         "When responding, you must first generate a brief phrase of no more than 5 words, "
+                         "such as: 'Let me think', 'I see', 'Let me process this', 'Good question', 'One moment', etc., then proceed with your formal response. "
+                         "Maintain natural conversational English expression, avoid emojis and system-related descriptions, and ensure logical coherence between transitional phrases and content.")
 
 
 class LLMResponseGenerator(BaseThread):
@@ -16,18 +28,19 @@ class LLMResponseGenerator(BaseThread):
 
     def __init__(self, group=None, target=None, name=None, args=(), kwargs={}, *, daemon=None,
                  user_question_queue: Queue,
-                 generated_answer_queue: Queue,
-                 local_model_path: str,
-                 model_params: dict | None = None,
-                 prompt_template: str = ''):
+                 generated_answer_queue: Queue
+                 ):
         super().__init__(group, target, name, args, kwargs, daemon=daemon)
 
         self.user_question_queue = user_question_queue
         self.generated_answer_queue = generated_answer_queue
 
-        self.local_model_path = local_model_path
-        self.model_params = model_params
-        self.prompt_template = prompt_template
+    def _get_prompt_by_language(self, language: str) -> str:
+        """根据语言获取对应的 prompt"""
+        if language == "zh":
+            return CHINESE_SYSTEM_PROMPT
+        else:
+            return ENGLISH_SYSTEM_PROMPT
 
     def get_session_history(self, session_id: str) -> InMemoryChatMessageHistory:
         message_history = InMemoryChatMessageHistory()
@@ -104,10 +117,13 @@ class LLMResponseGenerator(BaseThread):
         print(f'用户问题: {user_question}')
         voice_task.llm_start_time = time.time()
 
+        system_prompt = self._get_prompt_by_language(voice_task.language)
+        pipeline = create_langchain_pipeline(self.model_instance, system_prompt, self.get_session_history)
+
         config = {"configurable": {"session_id": voice_task.session_id}}
 
         try:
-            for chunk in ...
+            for chunk in pipeline.stream(input={'input': user_question}, config=config):
                 chunk_content = f'{chunk.content.strip()}'
                 if not chunk_content:
                     continue
@@ -148,6 +164,24 @@ class LLMResponseGenerator(BaseThread):
             self._send_sentence_to_queue(voice_task, sentence, answer_index)
 
     def run(self):
+        model_path = paths.MODELS_PATH / 'llm' / 'Qwen2.5-14B-Instruct.Q4_0.gguf'
+        model_params = {
+            'streaming': True,
+            'n_gpu_layers': -1,
+            'n_batch': 512,
+            'n_ctx': 2048,
+            'f16_kv': True,
+            'temperature': 0.8,
+            # 'n_predict': -1,
+            'top_k': 50,
+            'top_p': 1.0,
+        }
+        self.model_instance = create_langchain_chat_llamacpp_instance(
+            local_model_path=model_path, model_params=model_params
+        )
+        pipeline = create_langchain_pipeline(self.model_instance, CHINESE_SYSTEM_PROMPT, self.get_session_history)
+        warmup_langchain_pipeline(pipeline)
+
        """主运行循环"""
         while not self.stopped():
             try:
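
Note how the cost is split in the new text_generator.py: the ChatLlamaCpp instance that holds the GGUF weights is built once in run(), while create_langchain_pipeline is re-invoked per task, which should be comparatively cheap since it only wraps the existing instance in a prompt template and history runnable. A condensed view of that per-task step, assuming the names from the diff above:

    # per task: rebuild only the lightweight prompt/history wrapper
    system_prompt = self._get_prompt_by_language(voice_task.language)
    pipeline = create_langchain_pipeline(self.model_instance, system_prompt, self.get_session_history)
    config = {"configurable": {"session_id": voice_task.session_id}}
    for chunk in pipeline.stream(input={'input': user_question}, config=config):
        ...  # chunks are cleaned and split into sentences for TTS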