Xin Zhang committed on Apr 24, 2025

Commit

43d0fbe

2 Parent(s): 70b1d55 418e265

Merge branch 'vad'


* vad:
[fix]: logging level.
add string replace
[fix]: parameter.
[fix]: hot words.
[fix]: update.
fix 'transcrible' named error
[fix]: hot words.
update some keywords
add speech start padding 100ms
[fix]: words.
fix max speech duration bug
remove time delay in loop
add DESIGN_TIME_THREHOLD
Add a hot-words file path setting and pass the hot-words parameter when building the model.
Disable FunASR pbar in Warmup.
update log level
remove unused codes
remove unused codes
add log to debug silence ms

# Conflicts:
# transcribe/pipelines/pipe_vad.py

Files changed (17)

api_model.py +2 -2
config.py +17 -24
main.py +1 -5
moyoyo_asr_models/hotwords.json +7 -0
moyoyo_asr_models/hotwords.txt +34 -0
tests/audio_utils.py +54 -0
tests/test_vad.ipynb +129 -0
transcribe/client.py +0 -677
transcribe/helpers/funasr.py +5 -8
transcribe/helpers/vadprocessor.py +8 -8
transcribe/pipelines/pipe_vad.py +5 -32
transcribe/server.py +0 -382
transcribe/strategy.py +0 -405
transcribe/transcription.py +0 -334
transcribe/translatepipes.py +3 -14
transcribe/utils.py +37 -12
transcribe/whisper_llm_serve.py +75 -162

api_model.py CHANGED

@@ -18,9 +18,9 @@ class TransResult(BaseModel):
 class DebugResult(BaseModel):
     # trans_pattern: str
     seg_id: int
-    transcrible_time: float
     translate_time:float
-    context: str = Field(alias="transcribleContent")
     from_: str = Field(alias="from")
     to: str
     tran_content: str = Field(alias="translateContent")

 class DebugResult(BaseModel):
     # trans_pattern: str
     seg_id: int
+    transcribe_time: float
     translate_time:float
+    context: str = Field(alias="transcribeContent")
     from_: str = Field(alias="from")
     to: str
     tran_content: str = Field(alias="translateContent")

config.py CHANGED

@@ -1,12 +1,15 @@
 import pathlib
 import re
 import logging
-DEBUG = True
 logging.getLogger("pywhispercpp").setLevel(logging.WARNING)
 logging.basicConfig(
-    level=logging.DEBUG if DEBUG else logging.INFO,
     format="%(asctime)s - %(levelname)s - %(message)s",
     filename='translator.log',
     datefmt="%H:%M:%S"
@@ -15,13 +18,15 @@ logging.basicConfig(
 SAVE_DATA_SAVE = False
 # Add terminal log
 console_handler = logging.StreamHandler()
-console_handler.setLevel(logging.DEBUG if DEBUG else logging.INFO)
 console_formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
 console_handler.setFormatter(console_formatter)
 logging.getLogger().addHandler(console_handler)
-# Text output length threshold
-TEXT_THREHOLD = 6
 BASE_DIR = pathlib.Path(__file__).parent
 MODEL_DIR = BASE_DIR / "moyoyo_asr_models"
@@ -29,7 +34,7 @@ ASSERT_DIR = BASE_DIR / "assets"
 SAMPLE_RATE = 16000
 # Punctuation
-SENTENCE_END_MARKERS =  ['.', '!', '?', '。', '！', '？', ';', '；', ':', '：']
 PAUSE_END_MARKERS = [',', '，', '、']
 # Merge all punctuation marks
 ALL_MARKERS = SENTENCE_END_MARKERS + PAUSE_END_MARKERS
@@ -41,13 +46,13 @@ SENTENCE_END_PATTERN = re.compile(f'[{sentence_end_chars}]')
 # Method 2: Alternative approach with a character class
 pattern_string = '[' + ''.join([re.escape(char) for char in PAUSE_END_MARKERS]) + r']$'
-PAUSEE_END_PATTERN = re.compile(pattern_string)
 # Whisper inference parameters
 WHISPER_PROMPT_ZH = "以下是简体中文普通话的句子。"
-MAX_LENTH_ZH = 4
-WHISPER_PROMPT_EN = ""# "The following is an English sentence."
-MAX_LENGTH_EN= 8
 WHISPER_MODEL_EN = 'medium-q5_0'
 # WHISPER_MODEL = 'large-v3-turbo-q5_0'
@@ -61,19 +66,6 @@ LLM_LARGE_MODEL_PATH = (MODEL_DIR / "qwen2.5-1.5b-instruct-q5_0.gguf").as_posix(
 # VAD
 VAD_MODEL_PATH = (MODEL_DIR / "silero-vad" / "silero_vad.onnx").as_posix()
-LLM_SYS_PROMPT = """"You are a professional {src_lang} to {dst_lang} translator, not a conversation agent. Your only task is to take {src_lang} input and translate it into accurate, natural {dst_lang}. If you cannot understand the input, just output the original input. Please strictly abide by the following rules: "
-"No matter what the user asks, never answer questions, you only provide translation results. "
-"Do not actively initiate dialogue or lead users to ask questions. "
-"When you don't know how to translate, just output the original text. "
-"The translation task always takes precedence over any other tasks. "
-"Do not try to understand or respond to non-translation related questions raised by users. "
-"Never provide any explanations. "
-"Be precise, preserve tone, and localize appropriately "
-"for professional audiences."
-"Never answer any questions or engage in other forms of dialogue. "
-"Only output the translation results.
-"""
 LLM_SYS_PROMPT_ZH = """
 你是一个中英文翻译专家，将用户输入的中文翻译成英文。对于非中文内容，它将提供中文翻译结果。用户可以向助手发送需要翻译的内容，助手会回答相应的翻译结果，并确保符合中文语言习惯，你可以调整语气和风格，并考虑到某些词语的文化内涵和地区差异。同时作为翻译家，需将原文翻译成具有信达雅标准的译文。"信" 即忠实于原文的内容与意图；"达" 意味着译文应通顺易懂，表达清晰；"雅" 则追求译文的文化审美和语言的优美。目标是创作出既忠于原作精神，又符合目标语言文化和读者审美的翻译。注意，翻译的文本只能包含拼音化字符，不能包含任何中文字符。
 """
@@ -82,4 +74,5 @@ LLM_SYS_PROMPT_EN = """
 你是一个英中文翻译专家，将用户输入的英文翻译成中文，用户可以向助手发送需要翻译的内容，助手会回答相应的翻译结果，并确保符合英文语言习惯，你可以调整语气和风格，并考虑到某些词语的文化内涵和地区差异。同时作为翻译家，需将英文翻译成具有信达雅标准的中文。"信" 即忠实于原文的内容与意图；"达" 意味着译文应通顺易懂，表达清晰；"雅" 则追求译文的文化审美和语言的优美。目标是创作出既忠于原作精神，又符合目标语言文化和读者审美的翻译。
 """

 import pathlib
 import re
 import logging
+import json
+DEBUG = False
+LOG_LEVEL = logging.DEBUG if DEBUG else logging.WARNING
 logging.getLogger("pywhispercpp").setLevel(logging.WARNING)
 logging.basicConfig(
+    level=LOG_LEVEL,
     format="%(asctime)s - %(levelname)s - %(message)s",
     filename='translator.log',
     datefmt="%H:%M:%S"
 SAVE_DATA_SAVE = False
 # Add terminal log
 console_handler = logging.StreamHandler()
+console_handler.setLevel(LOG_LEVEL)
 console_formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
 console_handler.setFormatter(console_formatter)
 logging.getLogger().addHandler(console_handler)
+# Decision time for an audio segment
+FRAME_SCOPE_TIME_THRESHOLD = 4
+# Maximum speech duration
+MAX_SPEECH_DURATION_S = 15
 BASE_DIR = pathlib.Path(__file__).parent
 MODEL_DIR = BASE_DIR / "moyoyo_asr_models"
 SAMPLE_RATE = 16000
 # Punctuation
+SENTENCE_END_MARKERS = ['.', '!', '?', '。', '！', '？', ';', '；', ':', '：']
 PAUSE_END_MARKERS = [',', '，', '、']
 # Merge all punctuation marks
 ALL_MARKERS = SENTENCE_END_MARKERS + PAUSE_END_MARKERS
 # Method 2: Alternative approach with a character class
 pattern_string = '[' + ''.join([re.escape(char) for char in PAUSE_END_MARKERS]) + r']$'
+PAUSE_END_PATTERN = re.compile(pattern_string)
 # Whisper inference parameters
 WHISPER_PROMPT_ZH = "以下是简体中文普通话的句子。"
+MAX_LENGTH_ZH = 4
+WHISPER_PROMPT_EN = ""  # "The following is an English sentence."
+MAX_LENGTH_EN = 8
 WHISPER_MODEL_EN = 'medium-q5_0'
 # WHISPER_MODEL = 'large-v3-turbo-q5_0'
 # VAD
 VAD_MODEL_PATH = (MODEL_DIR / "silero-vad" / "silero_vad.onnx").as_posix()
 LLM_SYS_PROMPT_ZH = """
 你是一个中英文翻译专家，将用户输入的中文翻译成英文。对于非中文内容，它将提供中文翻译结果。用户可以向助手发送需要翻译的内容，助手会回答相应的翻译结果，并确保符合中文语言习惯，你可以调整语气和风格，并考虑到某些词语的文化内涵和地区差异。同时作为翻译家，需将原文翻译成具有信达雅标准的译文。"信" 即忠实于原文的内容与意图；"达" 意味着译文应通顺易懂，表达清晰；"雅" 则追求译文的文化审美和语言的优美。目标是创作出既忠于原作精神，又符合目标语言文化和读者审美的翻译。注意，翻译的文本只能包含拼音化字符，不能包含任何中文字符。
 """
 你是一个英中文翻译专家，将用户输入的英文翻译成中文，用户可以向助手发送需要翻译的内容，助手会回答相应的翻译结果，并确保符合英文语言习惯，你可以调整语气和风格，并考虑到某些词语的文化内涵和地区差异。同时作为翻译家，需将英文翻译成具有信达雅标准的中文。"信" 即忠实于原文的内容与意图；"达" 意味着译文应通顺易懂，表达清晰；"雅" 则追求译文的文化审美和语言的优美。目标是创作出既忠于原作精神，又符合目标语言文化和读者审美的翻译。
 """
+hotwords_file = MODEL_DIR / 'hotwords.txt'
+hotwords_json = json.loads((MODEL_DIR / 'hotwords.json').read_text())
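
As a rough illustration of how the renamed `PAUSE_END_PATTERN` behaves, the sketch below rebuilds it from the marker lists in config.py. The helper names `ends_with_pause` and `strip_trailing_pause` are hypothetical and do not appear in this commit; they only show what the end-anchored character class matches.

```python
import re

# Values copied from config.py in this commit.
PAUSE_END_MARKERS = [',', '，', '、']

# Same construction as config.py: a character class anchored at the end of the string.
pattern_string = '[' + ''.join(re.escape(char) for char in PAUSE_END_MARKERS) + r']$'
PAUSE_END_PATTERN = re.compile(pattern_string)


def ends_with_pause(text: str) -> bool:
    """Hypothetical helper: True when the text ends in a pause marker."""
    return PAUSE_END_PATTERN.search(text) is not None


def strip_trailing_pause(text: str) -> str:
    """Hypothetical helper: drop a single trailing pause marker, if present."""
    return PAUSE_END_PATTERN.sub('', text)
```

Because of the `$` anchor, only a marker at the very end of the text matches; commas inside the sentence are left alone.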

main.py CHANGED

@@ -11,6 +11,7 @@ from fastapi.staticfiles import StaticFiles
 from fastapi.responses import RedirectResponse
 import os
 from transcribe.utils import pcm_bytes_to_np_array
 logger = getLogger(__name__)
@@ -39,9 +40,6 @@ async def lifespan(app:FastAPI):
     yield
-# Get the absolute path of the directory containing this file
-BASE_DIR = os.path.dirname(os.path.abspath(__file__))
-# Build the absolute path of the frontend directory
 FRONTEND_DIR = os.path.join(BASE_DIR, "frontend")
@@ -66,9 +64,7 @@ async def translate(websocket: WebSocket):
         client_uid=f"{uuid1()}",
     )
     if from_lang and to_lang and client:
-        client.set_language(from_lang, to_lang)
         logger.info(f"Source lange: {from_lang}  -> Dst lange: {to_lang}")
         await websocket.accept()
     try:

 from fastapi.responses import RedirectResponse
 import os
 from transcribe.utils import pcm_bytes_to_np_array
+from config import BASE_DIR
 logger = getLogger(__name__)
     yield
 FRONTEND_DIR = os.path.join(BASE_DIR, "frontend")
         client_uid=f"{uuid1()}",
     )
     if from_lang and to_lang and client:
         logger.info(f"Source lange: {from_lang}  -> Dst lange: {to_lang}")
         await websocket.accept()
     try:

moyoyo_asr_models/hotwords.json ADDED

	@@ -0,0 +1,7 @@

+{
+    "高斯姆": "GOSIM",
+    "GO SIM": "GOSIM",
+    "go sim": "GOSIM",
+    "GO SAME": "GOSIM",
+    "go same": "GOSIM"
+}
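
The commit message mentions "add string replace", and config.py now loads this file into `hotwords_json`. One plausible way to apply such a mapping is plain string replacement, sketched below. The function `replace_hotwords` is a hypothetical name, and the real replacement code is not shown in this part of the diff, so treat this as an assumption about the intent rather than the project's implementation.

```python
import json

# Mapping copied from moyoyo_asr_models/hotwords.json in this commit.
HOTWORDS_JSON = json.loads("""
{
    "高斯姆": "GOSIM",
    "GO SIM": "GOSIM",
    "go sim": "GOSIM",
    "GO SAME": "GOSIM",
    "go same": "GOSIM"
}
""")


def replace_hotwords(text: str, mapping: dict[str, str]) -> str:
    """Hypothetical helper: swap each misrecognized form for its canonical spelling."""
    # Longest keys first, so a longer variant is never shadowed by a shorter one.
    for wrong in sorted(mapping, key=len, reverse=True):
        text = text.replace(wrong, mapping[wrong])
    return text
```

The match is case-sensitive, which may be why the JSON file lists both upper-case and lower-case variants explicitly.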

moyoyo_asr_models/hotwords.txt ADDED

	@@ -0,0 +1,34 @@

+GOSIM
+CSDN
+Rust
+git
+lib
+HUAWEI
+Futurewei
+Cloud
+OpenAI
+PYTHON
+千问
+鸿蒙
+vLLM
+MiniCPM
+ChatGPT
+GPT
+GPT2
+GPT3
+GPT4
+Llama
+Llama2
+Llama3
+MISTRAL
+Large
+Mistral
+Small
+LoRA
+finetune
+quantization
+pruning
+MoXIN
+Function
+Func
+Lava

tests/audio_utils.py ADDED

	@@ -0,0 +1,54 @@

+import numpy as np
+import soundfile as sf
+import time
+def audio_stream_generator(audio_file_path, chunk_size=4096, simulate_realtime=True):
+    """
+    Audio stream generator: reads data from an audio file and yields it as a stream.
+    Args:
+        audio_file_path: path to the audio file
+        chunk_size: size of each chunk, in samples
+        simulate_realtime: whether to simulate the pace of real-time streaming
+    Yields:
+        numpy.ndarray: one np.float32 chunk of chunk_size samples per iteration
+    """
+    # Load the audio file
+    audio_data, sample_rate = sf.read(audio_file_path)
+    # Ensure the audio data is float32
+    if audio_data.dtype != np.float32:
+        audio_data = audio_data.astype(np.float32)
+    # If the audio is stereo, convert it to mono
+    if len(audio_data.shape) > 1 and audio_data.shape[1] > 1:
+        audio_data = audio_data.mean(axis=1)
+    print(f"已加载音频文件: {audio_file_path}")
+    print(f"采样率: {sample_rate} Hz")
+    print(f"音频长度: {len(audio_data)/sample_rate:.2f} 秒")
+    # Compute the duration of each chunk (seconds)
+    chunk_duration = chunk_size / sample_rate if simulate_realtime else 0
+    # Yield the data chunk by chunk
+    audio_len = len(audio_data)
+    for pos in range(0, audio_len, chunk_size):
+        # Get the current chunk
+        end_pos = min(pos + chunk_size, audio_len)
+        chunk = audio_data[pos:end_pos]
+        # If the chunk is short, pad it with zeros
+        if len(chunk) < chunk_size:
+            padded_chunk = np.zeros(chunk_size, dtype=np.float32)
+            padded_chunk[:len(chunk)] = chunk
+            chunk = padded_chunk
+        # Simulate the delay of real-time processing
+        if simulate_realtime:
+            time.sleep(chunk_duration)
+        yield chunk
+    print("音频流处理完成")
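
To make the chunking arithmetic in `audio_stream_generator` concrete, the following standard-library sketch computes how many chunks the generator would yield, how many zero samples pad the last chunk, and how long it sleeps per chunk when `simulate_realtime` is on. `chunk_plan` is a hypothetical helper written for this note and is not part of the commit.

```python
import math


def chunk_plan(num_samples: int, sample_rate: int, chunk_size: int = 4096):
    """Hypothetical helper mirroring the generator's chunking arithmetic.

    Returns (number of chunks, zero samples padded onto the last chunk,
    per-chunk duration in seconds).
    """
    num_chunks = math.ceil(num_samples / chunk_size)
    padding = num_chunks * chunk_size - num_samples
    chunk_duration = chunk_size / sample_rate
    return num_chunks, padding, chunk_duration


# A 64-second file at 16 kHz, as in the notebook run in this commit.
print(chunk_plan(64 * 16000, 16000))
```

With the default `chunk_size` of 4096 at 16 kHz, each chunk covers 0.256 s of audio, so real-time simulation takes roughly as long as the file itself.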

tests/test_vad.ipynb ADDED

	@@ -0,0 +1,129 @@

+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from audio_utils import audio_stream_generator\n",
+    "import  IPython.display as ipd\n",
+    "import sys\n",
+    "sys.path.append(\"..\")\n",
+    "from transcribe.helpers.vadprocessor import FixedVADIterator\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "vac = FixedVADIterator(\n",
+    "                threshold=0.5,\n",
+    "                sampling_rate=16000,\n",
+    "                # speech_pad_ms=10\n",
+    "                min_silence_duration_ms = 100,\n",
+    "                # speech_pad_ms = 30,\n",
+    "                max_speech_duration_s=5.0,\n",
+    "                )\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "SAMPLE_FILE_PATH = \"/Users/david/Samples/Audio/zh/liyongle.wav\"\n",
+    "SAMPLING_RATE = 16000\n",
+    "\n",
+    "chunks_generator =  audio_stream_generator(SAMPLE_FILE_PATH, chunk_size=4096)\n",
+    "vac.reset_states()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "已加载音频文件: /Users/david/Samples/Audio/zh/liyongle.wav\n",
+      "采样率: 16000 Hz\n",
+      "音频长度: 64.00 秒\n",
+      "{'start': 3616}\n",
+      "{'end': 83968}\n",
+      "{'end': 164352}\n",
+      "{'end': 244736}\n",
+      "{'end': 325120}\n",
+      "{'end': 405504}\n",
+      "{'end': 485888}\n",
+      "{'end': 566272}\n",
+      "{'end': 624608}\n",
+      "{'start': 631328}\n",
+      "{'end': 691168}\n",
+      "{'start': 698912}\n",
+      "{'end': 779264}\n",
+      "{'end': 800736}\n",
+      "{'start': 805920}\n",
+      "{'end': 846816}\n",
+      "{'start': 855072}\n",
+      "{'end': 862176}\n",
+      "{'start': 864288}\n",
+      "{'end': 890336}\n",
+      "{'start': 893984}\n",
+      "{'end': 912352}\n",
+      "{'start': 917536}\n",
+      "{'end': 932320}\n",
+      "{'start': 939040}\n",
+      "{'end': 966112}\n",
+      "{'start': 970784}\n",
+      "{'end': 1015264}\n",
+      "{'start': 1019424}\n",
+      "音频流处理完成\n"
+     ]
+    }
+   ],
+   "source": [
+    "for chunk in chunks_generator:\n",
+    "    # vad_iterator.reset_states()\n",
+    "    # audio_buffer = np.append(audio_buffer, chunk)\n",
+    "    \n",
+    "    speech_dict = vac(chunk, return_seconds=False)\n",
+    "    if speech_dict:\n",
+    "        print(speech_dict)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
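
The notebook prints VAD events as sample offsets (`return_seconds=False`). A small sketch for reading that output in seconds follows; `events_to_seconds` is a hypothetical helper, and the event list is copied from the first lines of the recorded output above.

```python
SAMPLING_RATE = 16000  # same value as in the notebook


def events_to_seconds(events, sampling_rate=SAMPLING_RATE):
    """Hypothetical helper: convert {'start'|'end': sample_offset} dicts to (kind, seconds)."""
    return [
        (kind, offset / sampling_rate)
        for event in events
        for kind, offset in event.items()
    ]


# First three events from the recorded notebook output.
events = [{'start': 3616}, {'end': 83968}, {'end': 164352}]
for kind, seconds in events_to_seconds(events):
    print(f"{kind:>5} at {seconds:.3f} s")
```

In the recorded output, consecutive `end` events during continuous speech are 80384 samples apart (about 5.02 s), which appears consistent with the `max_speech_duration_s=5.0` setting forcing a split.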

transcribe/client.py DELETED

@@ -1,677 +0,0 @@
-import json
-import os
-import shutil
-import threading
-import time
-import uuid
-import wave
-import av
-import numpy as np
-import pyaudio
-import websocket
-import transcribe.utils as utils
-class Client:
-    """
-    Handles communication with a server using WebSocket.
-    """
-    INSTANCES = {}
-    END_OF_AUDIO = "END_OF_AUDIO"
-    def __init__(
-            self,
-            host=None,
-            port=None,
-            lang=None,
-            log_transcription=True,
-            max_clients=4,
-            max_connection_time=600,
-            dst_lang='zh',
-    ):
-        """
-        Initializes a Client instance for audio recording and streaming to a server.
-        If host and port are not provided, the WebSocket connection will not be established.
-        the audio recording starts immediately upon initialization.
-        Args:
-            host (str): The hostname or IP address of the server.
-            port (int): The port number for the WebSocket server.
-            lang (str, optional): The selected language for transcription. Default is None.
-            log_transcription (bool, optional): Whether to log transcription output to the console. Default is True.
-            max_clients (int, optional): Maximum number of client connections allowed. Default is 4.
-            max_connection_time (int, optional): Maximum allowed connection time in seconds. Default is 600.
-        """
-        self.recording = False
-        self.uid = str(uuid.uuid4())
-        self.waiting = False
-        self.last_response_received = None
-        self.disconnect_if_no_response_for = 15
-        self.language = lang
-        self.server_error = False
-        self.last_segment = None
-        self.last_received_segment = None
-        self.log_transcription = log_transcription
-        self.max_clients = max_clients
-        self.max_connection_time = max_connection_time
-        self.dst_lang = dst_lang
-        self.audio_bytes = None
-        if host is not None and port is not None:
-            socket_url = f"ws://{host}:{port}?from={self.language}&to={self.dst_lang}"
-            self.client_socket = websocket.WebSocketApp(
-                socket_url,
-                on_open=lambda ws: self.on_open(ws),
-                on_message=lambda ws, message: self.on_message(ws, message),
-                on_error=lambda ws, error: self.on_error(ws, error),
-                on_close=lambda ws, close_status_code, close_msg: self.on_close(
-                    ws, close_status_code, close_msg
-                ),
-            )
-        else:
-            print("[ERROR]: No host or port specified.")
-            return
-        Client.INSTANCES[self.uid] = self
-        # start websocket client in a thread
-        self.ws_thread = threading.Thread(target=self.client_socket.run_forever)
-        self.ws_thread.daemon = True
-        self.ws_thread.start()
-        self.transcript = []
-        print("[INFO]: * recording")
-    def handle_status_messages(self, message_data):
-        """Handles server status messages."""
-        status = message_data["status"]
-        if status == "WAIT":
-            self.waiting = True
-            print(f"[INFO]: Server is full. Estimated wait time {round(message_data['message'])} minutes.")
-        elif status == "ERROR":
-            print(f"Message from Server: {message_data['message']}")
-            self.server_error = True
-        elif status == "WARNING":
-            print(f"Message from Server: {message_data['message']}")
-    def process_segments(self, segments):
-        """Processes transcript segments."""
-        text = []
-        for i, seg in enumerate(segments):
-            if not text or text[-1] != seg["text"]:
-                text.append(seg["text"])
-                if i == len(segments) - 1 and not seg.get("completed", False):
-                    self.last_segment = seg
-        # update last received segment and last valid response time
-        if self.last_received_segment is None or self.last_received_segment != segments[-1]["text"]:
-            self.last_response_received = time.time()
-            self.last_received_segment = segments[-1]["text"]
-        if self.log_transcription:
-            # Truncate to last 3 entries for brevity.
-            text = text[-3:]
-            utils.clear_screen()
-            utils.print_transcript(text)
-    def on_message(self, ws, message):
-        """
-        Callback function called when a message is received from the server.
-        It updates various attributes of the client based on the received message, including
-        recording status, language detection, and server messages. If a disconnect message
-        is received, it sets the recording status to False.
-        Args:
-            ws (websocket.WebSocketApp): The WebSocket client instance.
-            message (str): The received message from the server.
-        """
-        message = json.loads(message)
-        # if self.uid != message.get("uid"):
-        #     print("[ERROR]: invalid client uid")
-        #     return
-        if "status" in message.keys():
-            self.handle_status_messages(message)
-            return
-        if "message" in message.keys() and message["message"] == "DISCONNECT":
-            print("[INFO]: Server disconnected due to overtime.")
-            self.recording = False
-        if "message" in message.keys() and message["message"] == "SERVER_READY":
-            self.last_response_received = time.time()
-            self.recording = True
-            self.server_backend = message["backend"]
-            print(f"[INFO]: Server Running with backend {self.server_backend}")
-            return
-        if "language" in message.keys():
-            self.language = message.get("language")
-            lang_prob = message.get("language_prob")
-            print(
-                f"[INFO]: Server detected language {self.language} with probability {lang_prob}"
-            )
-            return
-        if "segments" in message.keys():
-            self.process_segments(message["segments"])
-    def on_error(self, ws, error):
-        print(f"[ERROR] WebSocket Error: {error}")
-        self.server_error = True
-        self.error_message = error
-    def on_close(self, ws, close_status_code, close_msg):
-        print(f"[INFO]: Websocket connection closed: {close_status_code}: {close_msg}")
-        self.recording = False
-        self.waiting = False
-    def on_open(self, ws):
-        """
-        Callback function called when the WebSocket connection is successfully opened.
-        Sends an initial configuration message to the server, including client UID,
-        language selection, and task type.
-        Args:
-            ws (websocket.WebSocketApp): The WebSocket client instance.
-        """
-        print("[INFO]: Opened connection")
-        ws.send(
-            json.dumps(
-                {
-                    "uid": self.uid,
-                    "language": self.language,
-                    "max_clients": self.max_clients,
-                    "max_connection_time": self.max_connection_time,
-                }
-            )
-        )
-    def send_packet_to_server(self, message):
-        """
-        Send an audio packet to the server using WebSocket.
-        Args:
-            message (bytes): The audio data packet in bytes to be sent to the server.
-        """
-        try:
-            self.client_socket.send(message, websocket.ABNF.OPCODE_BINARY)
-        except Exception as e:
-            print(e)
-    def close_websocket(self):
-        """
-        Close the WebSocket connection and join the WebSocket thread.
-        First attempts to close the WebSocket connection using `self.client_socket.close()`. After
-        closing the connection, it joins the WebSocket thread to ensure proper termination.
-        """
-        try:
-            self.client_socket.close()
-        except Exception as e:
-            print("[ERROR]: Error closing WebSocket:", e)
-        try:
-            self.ws_thread.join()
-        except Exception as e:
-            print("[ERROR:] Error joining WebSocket thread:", e)
-    def get_client_socket(self):
-        """
-        Get the WebSocket client socket instance.
-        Returns:
-            WebSocketApp: The WebSocket client socket instance currently in use by the client.
-        """
-        return self.client_socket
-    def wait_before_disconnect(self):
-        """Waits a bit before disconnecting in order to process pending responses."""
-        assert self.last_response_received
-        while time.time() - self.last_response_received < self.disconnect_if_no_response_for:
-            continue
-class TranscriptionTeeClient:
-    """
-    Client for handling audio recording, streaming, and transcription tasks via one or more
-    WebSocket connections.
-    Acts as a high-level client for audio transcription tasks using a WebSocket connection. It can be used
-    to send audio data for transcription to one or more servers, and receive transcribed text segments.
-    Args:
-        clients (list): one or more previously initialized Client instances
-    Attributes:
-        clients (list): the underlying Client instances responsible for handling WebSocket connections.
-    """
-    def __init__(self, clients, save_output_recording=False, output_recording_filename="./output_recording.wav",
-                 mute_audio_playback=False):
-        self.clients = clients
-        if not self.clients:
-            raise Exception("At least one client is required.")
-        self.chunk = 4096
-        self.format = pyaudio.paInt16
-        self.channels = 1
-        self.rate = 16000
-        self.record_seconds = 60000
-        self.save_output_recording = save_output_recording
-        self.output_recording_filename = output_recording_filename
-        self.mute_audio_playback = mute_audio_playback
-        self.frames = b""
-        self.p = pyaudio.PyAudio()
-        try:
-            self.stream = self.p.open(
-                format=self.format,
-                channels=self.channels,
-                rate=self.rate,
-                input=True,
-                frames_per_buffer=self.chunk,
-            )
-        except OSError as error:
-            print(f"[WARN]: Unable to access microphone. {error}")
-            self.stream = None
-    def __call__(self, audio=None, rtsp_url=None, hls_url=None, save_file=None):
-        """
-        Start the transcription process.
-        Initiates the transcription process by connecting to the server via a WebSocket. It waits for the server
-        to be ready to receive audio data and then sends audio for transcription. If an audio file is provided, it
-        will be played and streamed to the server; otherwise, it will perform live recording.
-        Args:
-            audio (str, optional): Path to an audio file for transcription. Default is None, which triggers live recording.
-        """
-        assert sum(
-            source is not None for source in [audio, rtsp_url, hls_url]
-        ) <= 1, 'You must provide only one selected source'
-        print("[INFO]: Waiting for server ready ...")
-        for client in self.clients:
-            while not client.recording:
-                if client.waiting or client.server_error:
-                    self.close_all_clients()
-                    return
-        print("[INFO]: Server Ready!")
-        if hls_url is not None:
-            self.process_hls_stream(hls_url, save_file)
-        elif audio is not None:
-            resampled_file = utils.resample(audio)
-            self.play_file(resampled_file)
-        elif rtsp_url is not None:
-            self.process_rtsp_stream(rtsp_url)
-        else:
-            self.record()
-    def close_all_clients(self):
-        """Closes all client websockets."""
-        for client in self.clients:
-            client.close_websocket()
-    def multicast_packet(self, packet, unconditional=False):
-        """
-        Sends an identical packet via all clients.
-        Args:
-            packet (bytes): The audio data packet in bytes to be sent.
-            unconditional (bool, optional): If true, send regardless of whether clients are recording.  Default is False.
-        """
-        for client in self.clients:
-            if (unconditional or client.recording):
-                client.send_packet_to_server(packet)
-    def play_file(self, filename):
-        """
-        Play an audio file and send it to the server for processing.
-        Reads an audio file, plays it through the audio output, and simultaneously sends
-        the audio data to the server for processing. It uses PyAudio to create an audio
-        stream for playback. The audio data is read from the file in chunks, converted to
-        floating-point format, and sent to the server using WebSocket communication.
-        This method is typically used when you want to process pre-recorded audio and send it
-        to the server in real-time.
-        Args:
-            filename (str): The path to the audio file to be played and sent to the server.
-        """
-        # read audio and create pyaudio stream
-        with wave.open(filename, "rb") as wavfile:
-            self.stream = self.p.open(
-                format=self.p.get_format_from_width(wavfile.getsampwidth()),
-                channels=wavfile.getnchannels(),
-                rate=wavfile.getframerate(),
-                input=True,
-                output=True,
-                frames_per_buffer=self.chunk,
-            )
-            chunk_duration = self.chunk / float(wavfile.getframerate())
-            try:
-                while any(client.recording for client in self.clients):
-                    data = wavfile.readframes(self.chunk)
-                    if data == b"":
-                        break
-                    audio_array = self.bytes_to_float_array(data)
-                    self.multicast_packet(audio_array.tobytes())
-                    if self.mute_audio_playback:
-                        time.sleep(chunk_duration)
-                    else:
-                        self.stream.write(data)
-                wavfile.close()
-                for client in self.clients:
-                    client.wait_before_disconnect()
-                self.multicast_packet(Client.END_OF_AUDIO.encode('utf-8'), True)
-                self.stream.close()
-                self.close_all_clients()
-            except KeyboardInterrupt:
-                wavfile.close()
-                self.stream.stop_stream()
-                self.stream.close()
-                self.p.terminate()
-                self.close_all_clients()
-                print("[INFO]: Keyboard interrupt.")
-    def process_rtsp_stream(self, rtsp_url):
-        """
-        Connect to an RTSP source, process the audio stream, and send it for transcription.
-        Args:
-            rtsp_url (str): The URL of the RTSP stream source.
-        """
-        print("[INFO]: Connecting to RTSP stream...")
-        try:
-            container = av.open(rtsp_url, format="rtsp", options={"rtsp_transport": "tcp"})
-            self.process_av_stream(container, stream_type="RTSP")
-        except Exception as e:
-            print(f"[ERROR]: Failed to process RTSP stream: {e}")
-        finally:
-            for client in self.clients:
-                client.wait_before_disconnect()
-            self.multicast_packet(Client.END_OF_AUDIO.encode('utf-8'), True)
-            self.close_all_clients()
-        print("[INFO]: RTSP stream processing finished.")
-    def process_hls_stream(self, hls_url, save_file=None):
-        """
-        Connect to an HLS source, process the audio stream, and send it for transcription.
-        Args:
-            hls_url (str): The URL of the HLS stream source.
-            save_file (str, optional): Local path to save the network stream.
-        """
-        print("[INFO]: Connecting to HLS stream...")
-        try:
-            container = av.open(hls_url, format="hls")
-            self.process_av_stream(container, stream_type="HLS", save_file=save_file)
-        except Exception as e:
-            print(f"[ERROR]: Failed to process HLS stream: {e}")
-        finally:
-            for client in self.clients:
-                client.wait_before_disconnect()
-            self.multicast_packet(Client.END_OF_AUDIO.encode('utf-8'), True)
-            self.close_all_clients()
-        print("[INFO]: HLS stream processing finished.")
-    def process_av_stream(self, container, stream_type, save_file=None):
-        """
-        Process an AV container stream and send audio packets to the server.
-        Args:
-            container (av.container.InputContainer): The input container to process.
-            stream_type (str): The type of stream being processed ("RTSP" or "HLS").
-            save_file (str, optional): Local path to save the stream. Default is None.
-        """
-        audio_stream = next((s for s in container.streams if s.type == "audio"), None)
-        if not audio_stream:
-            print(f"[ERROR]: No audio stream found in {stream_type} source.")
-            return
-        output_container = None
-        if save_file:
-            output_container = av.open(save_file, mode="w")
-            output_audio_stream = output_container.add_stream(codec_name="pcm_s16le", rate=self.rate)
-        try:
-            for packet in container.demux(audio_stream):
-                for frame in packet.decode():
-                    audio_data = frame.to_ndarray().tobytes()
-                    self.multicast_packet(audio_data)
-                    if save_file:
-                        output_container.mux(frame)
-        except Exception as e:
-            print(f"[ERROR]: Error during {stream_type} stream processing: {e}")
-        finally:
-            # Wait for server to send any leftover transcription.
-            time.sleep(5)
-            self.multicast_packet(Client.END_OF_AUDIO.encode('utf-8'), True)
-            if output_container:
-                output_container.close()
-            container.close()
-    def save_chunk(self, n_audio_file):
-        """
-        Saves the current audio frames to a WAV file in a separate thread.
-        Args:
-        n_audio_file (int): The index of the audio file which determines the filename.
-                            This helps in maintaining the order and uniqueness of each chunk.
-        """
-        t = threading.Thread(
-            target=self.write_audio_frames_to_file,
-            args=(self.frames[:], f"chunks/{n_audio_file}.wav",),
-        )
-        t.start()
-    def finalize_recording(self, n_audio_file):
-        """
-        Finalizes the recording process by saving any remaining audio frames,
-        closing the audio stream, and terminating the process.
-        Args:
-        n_audio_file (int): The file index to be used if there are remaining audio frames to be saved.
-                            This index is incremented before use if the last chunk is saved.
-        """
-        if self.save_output_recording and len(self.frames):
-            self.write_audio_frames_to_file(
-                self.frames[:], f"chunks/{n_audio_file}.wav"
-            )
-            n_audio_file += 1
-        self.stream.stop_stream()
-        self.stream.close()
-        self.p.terminate()
-        self.close_all_clients()
-        if self.save_output_recording:
-            self.write_output_recording(n_audio_file)
-    def record(self):
-        """
-        Record audio data from the input stream and save it to a WAV file.
-        Continuously records audio data from the input stream, sends it to the server via a WebSocket
-        connection, and simultaneously saves it to multiple WAV files in chunks. It stops recording when
-        the `RECORD_SECONDS` duration is reached or when the `RECORDING` flag is set to `False`.
-        Audio data is saved in chunks to the "chunks" directory. Each chunk is saved as a separate WAV file.
-        The recording will continue until the specified duration is reached or until the `RECORDING` flag is set to `False`.
-        The recording process can be interrupted by sending a KeyboardInterrupt (e.g., pressing Ctrl+C). After recording,
-        the method combines all the saved audio chunks into the specified `out_file`.
-        """
-        n_audio_file = 0
-        if self.save_output_recording:
-            if os.path.exists("chunks"):
-                shutil.rmtree("chunks")
-            os.makedirs("chunks")
-        try:
-            for _ in range(0, int(self.rate / self.chunk * self.record_seconds)):
-                if not any(client.recording for client in self.clients):
-                    break
-                data = self.stream.read(self.chunk, exception_on_overflow=False)
-                self.frames += data
-                audio_array = self.bytes_to_float_array(data)
-                self.multicast_packet(audio_array.tobytes())
-                # save frames if more than a minute
-                if len(self.frames) > 60 * self.rate:
-                    if self.save_output_recording:
-                        self.save_chunk(n_audio_file)
-                        n_audio_file += 1
-                    self.frames = b""
-        except KeyboardInterrupt:
-            self.finalize_recording(n_audio_file)
-    def write_audio_frames_to_file(self, frames, file_name):
-        """
-        Write audio frames to a WAV file.
-        The WAV file is created or overwritten with the specified name. The audio frames should be
-        in the correct format and match the specified channel, sample width, and sample rate.
-        Args:
-            frames (bytes): The audio frames to be written to the file.
-            file_name (str): The name of the WAV file to which the frames will be written.
-        """
-        with wave.open(file_name, "wb") as wavfile:
-            wavfile: wave.Wave_write
-            wavfile.setnchannels(self.channels)
-            wavfile.setsampwidth(2)
-            wavfile.setframerate(self.rate)
-            wavfile.writeframes(frames)
-    def write_output_recording(self, n_audio_file):
-        """
-        Combine and save recorded audio chunks into a single WAV file.
-        The individual audio chunk files are expected to be located in the "chunks" directory. Reads each chunk
-        file, appends its audio data to the final recording, and then deletes the chunk file. After combining
-        and saving, the final recording is stored in the specified `out_file`.
-        Args:
-            n_audio_file (int): The number of audio chunk files to combine.
-            out_file (str): The name of the output WAV file to save the final recording.
-        """
-        input_files = [
-            f"chunks/{i}.wav"
-            for i in range(n_audio_file)
-            if os.path.exists(f"chunks/{i}.wav")
-        ]
-        with wave.open(self.output_recording_filename, "wb") as wavfile:
-            wavfile: wave.Wave_write
-            wavfile.setnchannels(self.channels)
-            wavfile.setsampwidth(2)
-            wavfile.setframerate(self.rate)
-            for in_file in input_files:
-                with wave.open(in_file, "rb") as wav_in:
-                    while True:
-                        data = wav_in.readframes(self.chunk)
-                        if data == b"":
-                            break
-                        wavfile.writeframes(data)
-                # remove this file
-                os.remove(in_file)
-        wavfile.close()
-        # clean up temporary directory to store chunks
-        if os.path.exists("chunks"):
-            shutil.rmtree("chunks")
-    @staticmethod
-    def bytes_to_float_array(audio_bytes):
-        """
-        Convert audio data from bytes to a NumPy float array.
-        It assumes that the audio data is in 16-bit PCM format. The audio data is normalized to
-        have values between -1 and 1.
-        Args:
-            audio_bytes (bytes): Audio data in bytes.
-        Returns:
-            np.ndarray: A NumPy array containing the audio data as float values normalized between -1 and 1.
-        """
-        raw_data = np.frombuffer(buffer=audio_bytes, dtype=np.int16)
-        return raw_data.astype(np.float32) / 32768.0
-class TranscriptionClient(TranscriptionTeeClient):
-    """
-    Client for handling audio transcription tasks via a single WebSocket connection.
-    Acts as a high-level client for audio transcription tasks using a WebSocket connection. It can be used
-    to send audio data for transcription to a server and receive transcribed text segments.
-    Args:
-        host (str): The hostname or IP address of the server.
-        port (int): The port number to connect to on the server.
-        lang (str, optional): The primary language for transcription. Default is None, which defaults to English ('en').
-        save_output_recording (bool, optional): Whether to save the microphone recording. Default is False.
-        output_recording_filename (str, optional): Path to save the output recording WAV file. Default is "./output_recording.wav".
-        output_transcription_path (str, optional): File path to save the output transcription (SRT file). Default is "./output.srt".
-        log_transcription (bool, optional): Whether to log transcription output to the console. Default is True.
-        max_clients (int, optional): Maximum number of client connections allowed. Default is 4.
-        max_connection_time (int, optional): Maximum allowed connection time in seconds. Default is 600.
-        mute_audio_playback (bool, optional): If True, mutes audio playback during file playback. Default is False.
-    Attributes:
-        client (Client): An instance of the underlying Client class responsible for handling the WebSocket connection.
-    Example:
-        To create a TranscriptionClient and start transcription on microphone audio:
-        ```python
-        transcription_client = TranscriptionClient(host="localhost", port=9090)
-        transcription_client()
-        ```
-    """
-    def __init__(
-            self,
-            host,
-            port,
-            lang=None,
-            save_output_recording=False,
-            output_recording_filename="./output_recording.wav",
-            log_transcription=True,
-            max_clients=4,
-            max_connection_time=600,
-            mute_audio_playback=False,
-            dst_lang='en',
-    ):
-        self.client = Client(
-            host, port, lang, log_transcription=log_transcription, max_clients=max_clients,
-            max_connection_time=max_connection_time, dst_lang=dst_lang
-        )
-        if save_output_recording and not output_recording_filename.endswith(".wav"):
-            raise ValueError(f"Please provide a valid `output_recording_filename`: {output_recording_filename}")
-        TranscriptionTeeClient.__init__(
-            self,
-            [self.client],
-            save_output_recording=save_output_recording,
-            output_recording_filename=output_recording_filename,
-            mute_audio_playback=mute_audio_playback,
-        )
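
Note on the removed `bytes_to_float_array` helper: its docstring describes converting 16-bit PCM bytes to floats normalized between -1 and 1. The conversion can be sketched without NumPy as below. This is a standard-library illustration only, not the code the commit removed; the function name `pcm16_bytes_to_floats` is hypothetical, and the input is assumed to be in native byte order, as it is for `np.frombuffer` with `np.int16`.

```python
from array import array


def pcm16_bytes_to_floats(audio_bytes):
    """Convert 16-bit PCM bytes (native byte order) to floats in [-1.0, 1.0)."""
    samples = array("h")  # "h" = signed 16-bit integer
    samples.frombytes(audio_bytes)
    # Dividing by 32768 maps the int16 range [-32768, 32767] onto [-1.0, 1.0).
    return [sample / 32768.0 for sample in samples]


# Three samples: silence, half of full scale, and the most negative value.
pcm = array("h", [0, 16384, -32768]).tobytes()
print(pcm16_bytes_to_floats(pcm))  # [0.0, 0.5, -1.0]
```

Dividing by 32768 rather than 32767 keeps the most negative sample at exactly -1.0, at the cost of the largest positive sample falling just short of 1.0.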

transcribe/helpers/funasr.py CHANGED

@@ -1,14 +1,11 @@
-import time
-import uuid
-from logging import getLogger
 import numpy as np
 from funasr import AutoModel
-import soundfile as sf
 import config
-logger = getLogger(__name__)
 class FunASR:
@@ -16,7 +13,7 @@ class FunASR:
         self.source_lange = source_lange
         self.model = AutoModel(
-            model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc"
         )
         if warmup:
             self.warmup()
@@ -30,8 +27,8 @@ class FunASR:
         audio_frames = np.frombuffer(audio_buffer, dtype=np.float32)
         # sf.write(f'{config.ASSERT_DIR}/{time.time()}.wav', audio_frames, samplerate=16000)
         try:
-            output = self.model.generate(input=audio_frames, disable_pbar=True)
             return output
         except Exception as e:
-            logger.error(e)
             return []

+# from logging import getLogger
 import numpy as np
 from funasr import AutoModel
 import config
+# logger = getLogger(__name__)
 class FunASR:
         self.source_lange = source_lange
         self.model = AutoModel(
+            model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc", log_level="ERROR",
         )
         if warmup:
             self.warmup()
         audio_frames = np.frombuffer(audio_buffer, dtype=np.float32)
         # sf.write(f'{config.ASSERT_DIR}/{time.time()}.wav', audio_frames, samplerate=16000)
         try:
+            output = self.model.generate(input=audio_frames, disable_pbar=True, hotword=config.hotwords_file.as_posix())
             return output
         except Exception as e:
+            print(f"Error during transcription: {e}")
             return []

transcribe/helpers/vadprocessor.py CHANGED

@@ -36,7 +36,7 @@ class AdaptiveSilenceController:
             speed_factor = 0.5
         elif avg_speech < 600:
             speed_factor = 0.8
         # 3. Also take the trend of the silence duration into account
         adaptive = self.base * speed_factor + 0.3 * avg_silence
@@ -155,7 +155,7 @@ class VADIteratorOnnx:
             raise ValueError('VADIterator does not support sampling rates other than [8000, 16000]')
         self.min_silence_samples = sampling_rate * min_silence_duration_ms / 1000
-        self.max_speech_samples = int(sampling_rate * max_speech_duration_s)
         self.speech_pad_samples = sampling_rate * speech_pad_ms / 1000
         self.reset_states()
@@ -184,7 +184,7 @@ class VADIteratorOnnx:
         self.current_sample += window_size_samples
         speech_prob = self.model(x, self.sampling_rate)[0,0]
-        # print(f"{self.current_sample/self.sampling_rate:.2f}: {speech_prob}")
         if (speech_prob >= self.threshold) and self.temp_end:
             self.temp_end = 0
@@ -196,11 +196,11 @@ class VADIteratorOnnx:
             self.start = speech_start
             return {'start': int(speech_start) if not return_seconds else round(speech_start / self.sampling_rate, 1)}
-        if (speech_prob >= self.threshold) and self.current_sample - self.start >= self.max_speech_samples:
-            if self.temp_end:
-                self.temp_end = 0
-            self.start = self.current_sample
-            return {'end': int(self.current_sample) if not return_seconds else round(self.current_sample / self.sampling_rate, 1)}
         if (speech_prob < self.threshold - 0.15) and self.triggered:
             if not self.temp_end:

             speed_factor = 0.5
         elif avg_speech < 600:
             speed_factor = 0.8
+        logging.warning(f"Avg speech :{avg_speech}, Avg silence: {avg_silence}")
         # 3. Also take the trend of the silence duration into account
         adaptive = self.base * speed_factor + 0.3 * avg_silence
             raise ValueError('VADIterator does not support sampling rates other than [8000, 16000]')
         self.min_silence_samples = sampling_rate * min_silence_duration_ms / 1000
+        # self.max_speech_samples = int(sampling_rate * max_speech_duration_s)
         self.speech_pad_samples = sampling_rate * speech_pad_ms / 1000
         self.reset_states()
         self.current_sample += window_size_samples
         speech_prob = self.model(x, self.sampling_rate)[0,0]
         if (speech_prob >= self.threshold) and self.temp_end:
             self.temp_end = 0
             self.start = speech_start
             return {'start': int(speech_start) if not return_seconds else round(speech_start / self.sampling_rate, 1)}
+        # if (speech_prob >= self.threshold) and self.current_sample - self.start >= self.max_speech_samples:
+        #     if self.temp_end:
+        #         self.temp_end = 0
+        #     self.start = self.current_sample
+        #     return {'end': int(self.current_sample) if not return_seconds else round(self.current_sample / self.sampling_rate, 1)}
         if (speech_prob < self.threshold - 0.15) and self.triggered:
             if not self.temp_end:
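
Note on the thresholds in this hunk: `VADIteratorOnnx` stores its durations as sample counts. The arithmetic can be sketched as below, using the 80 ms `min_silence_duration_ms` and the 25.0 s `max_speech_duration_s` that appear in `pipe_vad.py` in this commit. The 16000 Hz rate is an assumption (one of the two rates the iterator accepts), and the helper name `ms_to_samples` is hypothetical.

```python
def ms_to_samples(duration_ms, sampling_rate):
    """Convert a duration in milliseconds to a (possibly fractional) sample count."""
    return sampling_rate * duration_ms / 1000


SAMPLING_RATE = 16000  # assumed; the iterator only accepts 8000 or 16000

# Silence needed before a speech segment is closed.
min_silence_samples = ms_to_samples(80, SAMPLING_RATE)
print(min_silence_samples)  # 1280.0

# The cap this commit comments out: a forced split after 25 s of speech.
max_speech_samples = int(SAMPLING_RATE * 25.0)
print(max_speech_samples)  # 400000
```

With the cap commented out, a segment ends only once the speech probability drops below the threshold for long enough, so a long uninterrupted utterance is no longer split at the 25-second mark.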

transcribe/pipelines/pipe_vad.py CHANGED

@@ -1,12 +1,10 @@
 from .base import MetaItem, BasePipe
-from ..helpers.vadprocessor import FixedVADIterator, AdaptiveSilenceController
 import numpy as np
-from silero_vad import get_speech_timestamps
-from typing import List
 import logging
-import time
 # import noisereduce as nr
@@ -18,15 +16,12 @@ class VadPipe(BasePipe):
         super().__init__(in_queue, out_queue)
         self._offset = 0 # offset of the frames processed so far
         self._status = 'END'
-        self.last_state_change_offset = 0
-        self.adaptive_ctrl = AdaptiveSilenceController()
     def reset(self):
         self._offset = 0
         self._status = 'END'
-        self.last_state_change_offset = 0
-        self.adaptive_ctrl = AdaptiveSilenceController()
         self.vac.reset_states()
     @classmethod
@@ -38,7 +33,6 @@ class VadPipe(BasePipe):
                 # speech_pad_ms=10
                 min_silence_duration_ms = 80,
                 # speech_pad_ms = 30,
-                max_speech_duration_s=25.0,
                 )
             cls.vac.reset_states()
@@ -55,15 +49,9 @@ class VadPipe(BasePipe):
             if start_frame:
                 relative_start_frame =start_frame - self._offset
             if end_frame:
-                relative_end_frame = max(0, end_frame - self._offset)
             return relative_start_frame, relative_end_frame
-    def update_silence_ms(self):
-        min_silence = self.adaptive_ctrl.get_adaptive_silence_ms()
-        min_silence_samples = self.sample_rate * min_silence / 1000
-        self.vac.min_silence_samples = min_silence_samples
-        logging.warning(f"🫠 update_silence_ms :{min_silence} => current: {self.vac.min_silence_samples} ")
     def process(self, in_data: MetaItem) -> MetaItem:
         if self._offset == 0:
             self.vac.reset_states()
@@ -73,34 +61,19 @@ class VadPipe(BasePipe):
         speech_data  = self._process_speech_chunk(source_audio)
         if speech_data: # an audio change point was detected
-            # self.update_silence_ms()
             rel_start_frame, rel_end_frame = speech_data
             if rel_start_frame is not None and rel_end_frame is None:
                 self._status = "START" # speech started
-                target_audio = source_audio[rel_start_frame:]
-                 # compute the length of the preceding silence
-                silence_len = (self._offset + rel_start_frame - self.last_state_change_offset) / self.sample_rate * 1000
-                self.adaptive_ctrl.update_silence(silence_len)
-                self.last_state_change_offset = self._offset + rel_start_frame
                 logging.debug("🫸 Speech start frame: {}".format(rel_start_frame))
             elif rel_start_frame is None and rel_end_frame is not None:
                 self._status = "END" # audio ended
                 target_audio = source_audio[:rel_end_frame]
-                speech_len = (rel_end_frame) / self.sample_rate * 1000
-                self.adaptive_ctrl.update_speech(speech_len)
-                self.last_state_change_offset = self._offset + rel_end_frame
                 logging.debug(" 🫷Speech ended, capturing audio up to frame: {}".format(rel_end_frame))
             else:
                 self._status = 'END'
                 target_audio = source_audio[rel_start_frame:rel_end_frame]
                 logging.debug(" 🔄 Speech segment captured from frame {} to frame {}".format(rel_start_frame, rel_end_frame))
-                seg_len = (rel_end_frame - rel_start_frame) / self.sample_rate * 1000
-                self.adaptive_ctrl.update_speech(seg_len)
-                self.last_state_change_offset = self._offset + rel_end_frame
                 # logging.debug("❌ No valid speech segment detected, setting status to END")
         else:
             if self._status == 'START':

 from .base import MetaItem, BasePipe
+from ..helpers.vadprocessor import FixedVADIterator
 import numpy as np
 import logging
 # import noisereduce as nr
         super().__init__(in_queue, out_queue)
         self._offset = 0 # offset of the frames processed so far
         self._status = 'END'
     def reset(self):
         self._offset = 0
         self._status = 'END'
         self.vac.reset_states()
     @classmethod
                 # speech_pad_ms=10
                 min_silence_duration_ms = 80,
                 # speech_pad_ms = 30,
                 )
             cls.vac.reset_states()
             if start_frame:
                 relative_start_frame =start_frame - self._offset
             if end_frame:
+                relative_end_frame = end_frame - self._offset
             return relative_start_frame, relative_end_frame
     def process(self, in_data: MetaItem) -> MetaItem:
         if self._offset == 0:
             self.vac.reset_states()
         speech_data  = self._process_speech_chunk(source_audio)
         if speech_data: # an audio change point was detected
             rel_start_frame, rel_end_frame = speech_data
             if rel_start_frame is not None and rel_end_frame is None:
                 self._status = "START" # speech started
+                target_audio = source_audio[max(rel_start_frame-100, 0):]
                 logging.debug("🫸 Speech start frame: {}".format(rel_start_frame))
             elif rel_start_frame is None and rel_end_frame is not None:
                 self._status = "END" # audio ended
                 target_audio = source_audio[:rel_end_frame]
                 logging.debug(" 🫷Speech ended, capturing audio up to frame: {}".format(rel_end_frame))
             else:
                 self._status = 'END'
                 target_audio = source_audio[rel_start_frame:rel_end_frame]
                 logging.debug(" 🔄 Speech segment captured from frame {} to frame {}".format(rel_start_frame, rel_end_frame))
                 # logging.debug("❌ No valid speech segment detected, setting status to END")
         else:
             if self._status == 'START':
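
Note on the new speech-start slice: the added line `source_audio[max(rel_start_frame-100, 0):]` keeps up to 100 samples before the detected start, so the onset of the first word is less likely to be clipped. The behaviour can be sketched as below; the function name, the `pad` parameter, and the 512-sample chunk are hypothetical and used only for illustration.

```python
def slice_from_speech_start(source_audio, rel_start_frame, pad=100):
    """Return the audio from `pad` samples before the speech start to the chunk end."""
    # Clamp at 0: a negative start index would make Python count from the
    # end of the chunk instead of starting at its beginning.
    start = max(rel_start_frame - pad, 0)
    return source_audio[start:]


chunk = list(range(512))  # stand-in for one chunk of audio samples

# Start detected at sample 300: the slice begins at sample 200.
print(len(slice_from_speech_start(chunk, 300)))  # 312

# Start detected at sample 40: the padding is clamped to the chunk start.
print(len(slice_from_speech_start(chunk, 40)))  # 512
```

If the pipeline runs at 16 kHz (an assumption here), 100 samples correspond to 6.25 ms of extra audio before the detected start.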

transcribe/server.py DELETED

@@ -1,382 +0,0 @@
-import json
-import logging
-import threading
-import time
-import config
-import librosa
-import numpy as np
-import soundfile
-from pywhispercpp.model import Model
-logging.basicConfig(level=logging.INFO)
-class ServeClientBase(object):
-    RATE = 16000
-    SERVER_READY = "SERVER_READY"
-    DISCONNECT = "DISCONNECT"
-    def __init__(self, client_uid, websocket):
-        self.client_uid = client_uid
-        self.websocket = websocket
-        self.frames = b""
-        self.timestamp_offset = 0.0
-        self.frames_np = None
-        self.frames_offset = 0.0
-        self.text = []
-        self.current_out = ''
-        self.prev_out = ''
-        self.t_start = None
-        self.exit = False
-        self.same_output_count = 0
-        self.show_prev_out_thresh = 5  # if pause(no output from whisper) show previous output for 5 seconds
-        self.add_pause_thresh = 3  # add a blank to segment list as a pause(no speech) for 3 seconds
-        self.transcript = []
-        self.send_last_n_segments = 10
-        # text formatting
-        self.pick_previous_segments = 2
-        # threading
-        self.lock = threading.Lock()
-    def speech_to_text(self):
-        raise NotImplementedError
-    def transcribe_audio(self):
-        raise NotImplementedError
-    def handle_transcription_output(self):
-        raise NotImplementedError
-    def add_frames(self, frame_np):
-        """
-        Add audio frames to the ongoing audio stream buffer.
-        This method is responsible for maintaining the audio stream buffer, allowing the continuous addition
-        of audio frames as they are received. It also ensures that the buffer does not exceed a specified size
-        to prevent excessive memory usage.
-        If the buffer size exceeds a threshold (45 seconds of audio data), it discards the oldest 30 seconds
-        of audio data to maintain a reasonable buffer size. If the buffer is empty, it initializes it with the provided
-        audio frame. The audio stream buffer is used for real-time processing of audio data for transcription.
-        Args:
-            frame_np (numpy.ndarray): The audio frame data as a NumPy array.
-        """
-        self.lock.acquire()
-        if self.frames_np is not None and self.frames_np.shape[0] > 45 * self.RATE:
-            self.frames_offset += 30.0
-            self.frames_np = self.frames_np[int(30 * self.RATE):]
-            # check timestamp offset(should be >= self.frame_offset)
-            # this basically means that there is no speech as timestamp offset hasnt updated
-            # and is less than frame_offset
-            if self.timestamp_offset < self.frames_offset:
-                self.timestamp_offset = self.frames_offset
-        if self.frames_np is None:
-            self.frames_np = frame_np.copy()
-        else:
-            self.frames_np = np.concatenate((self.frames_np, frame_np), axis=0)
-        self.lock.release()
-    def clip_audio_if_no_valid_segment(self):
-        """
-        Update the timestamp offset based on audio buffer status.
-        Clip audio if the current chunk exceeds 30 seconds, this basically implies that
-        no valid segment for the last 30 seconds from whisper
-        """
-        with self.lock:
-            if self.frames_np[int((self.timestamp_offset - self.frames_offset) * self.RATE):].shape[0] > 25 * self.RATE:
-                duration = self.frames_np.shape[0] / self.RATE
-                self.timestamp_offset = self.frames_offset + duration - 5
-    def get_audio_chunk_for_processing(self):
-        """
-        Retrieves the next chunk of audio data for processing based on the current offsets.
-        Calculates which part of the audio data should be processed next, based on
-        the difference between the current timestamp offset and the frame's offset, scaled by
-        the audio sample rate (RATE). It then returns this chunk of audio data along with its
-        duration in seconds.
-        Returns:
-            tuple: A tuple containing:
-                - input_bytes (np.ndarray): The next chunk of audio data to be processed.
-                - duration (float): The duration of the audio chunk in seconds.
-        """
-        with self.lock:
-            samples_take = max(0, (self.timestamp_offset - self.frames_offset) * self.RATE)
-            input_bytes = self.frames_np[int(samples_take):].copy()
-        duration = input_bytes.shape[0] / self.RATE
-        return input_bytes, duration
-    def prepare_segments(self, last_segment=None):
-        """
-        Prepares the segments of transcribed text to be sent to the client.
-        This method compiles the recent segments of transcribed text, ensuring that only the
-        specified number of the most recent segments are included. It also appends the most
-        recent segment of text if provided (which is considered incomplete because of the possibility
-        of the last word being truncated in the audio chunk).
-        Args:
-            last_segment (str, optional): The most recent segment of transcribed text to be added
-                                          to the list of segments. Defaults to None.
-        Returns:
-            list: A list of transcribed text segments to be sent to the client.
-        """
-        segments = []
-        if len(self.transcript) >= self.send_last_n_segments:
-            segments = self.transcript[-self.send_last_n_segments:].copy()
-        else:
-            segments = self.transcript.copy()
-        if last_segment is not None:
-            segments = segments + [last_segment]
-        logging.info(f"{segments}")
-        return segments
-    def get_audio_chunk_duration(self, input_bytes):
-        """
-        Calculates the duration of the provided audio chunk.
-        Args:
-            input_bytes (numpy.ndarray): The audio chunk for which to calculate the duration.
-        Returns:
-            float: The duration of the audio chunk in seconds.
-        """
-        return input_bytes.shape[0] / self.RATE
-    def send_transcription_to_client(self, segments):
-        """
-        Sends the specified transcription segments to the client over the websocket connection.
-        This method formats the transcription segments into a JSON object and attempts to send
-        this object to the client. If an error occurs during the send operation, it logs the error.
-        Returns:
-            segments (list): A list of transcription segments to be sent to the client.
-        """
-        try:
-            self.websocket.send(
-                json.dumps({
-                    "uid": self.client_uid,
-                    "segments": segments,
-                })
-            )
-        except Exception as e:
-            logging.error(f"[ERROR]: Sending data to client: {e}")
-    def disconnect(self):
-        """
-        Notify the client of disconnection and send a disconnect message.
-        This method sends a disconnect message to the client via the WebSocket connection to notify them
-        that the transcription service is disconnecting gracefully.
-        """
-        self.websocket.send(json.dumps({
-            "uid": self.client_uid,
-            "message": self.DISCONNECT
-        }))
-    def cleanup(self):
-        """
-        Perform cleanup tasks before exiting the transcription service.
-        This method performs necessary cleanup tasks, including stopping the transcription thread, marking
-        the exit flag to indicate the transcription thread should exit gracefully, and destroying resources
-        associated with the transcription process.
-        """
-        logging.info("Cleaning up.")
-        self.exit = True
-class ServeClientWhisperCPP(ServeClientBase):
-    SINGLE_MODEL = None
-    SINGLE_MODEL_LOCK = threading.Lock()
-    def __init__(self, websocket, language=None, client_uid=None,
-                 single_model=False):
-        """
-        Initialize a ServeClient instance.
-        The Whisper model is initialized based on the client's language and device availability.
-        The transcription thread is started upon initialization. A "SERVER_READY" message is sent
-        to the client to indicate that the server is ready.
-        Args:
-            websocket (WebSocket): The WebSocket connection for the client.
-            language (str, optional): The language for transcription. Defaults to None.
-            client_uid (str, optional): A unique identifier for the client. Defaults to None.
-            single_model (bool, optional): Whether to share one model instance across all client connections instead of creating a new one per client. Defaults to False.
-        """
-        super().__init__(client_uid, websocket)
-        self.language = language
-        self.eos = False
-        if single_model:
-            if ServeClientWhisperCPP.SINGLE_MODEL is None:
-                self.create_model()
-                ServeClientWhisperCPP.SINGLE_MODEL = self.transcriber
-            else:
-                self.transcriber = ServeClientWhisperCPP.SINGLE_MODEL
-        else:
-            self.create_model()
-        # threading
-        logging.info('Create a thread to process audio.')
-        self.trans_thread = threading.Thread(target=self.speech_to_text)
-        self.trans_thread.start()
-        self.websocket.send(json.dumps({
-            "uid": self.client_uid,
-            "message": self.SERVER_READY,
-            "backend": "pywhispercpp"
-        }))
-    def create_model(self, warmup=True):
-        """
-        Instantiates a new model, sets it as the transcriber and does warmup if desired.
-        """
-        self.transcriber = Model(model=config.WHISPER_MODEL, models_dir=config.MODEL_DIR)
-        if warmup:
-            self.warmup()
-    def warmup(self, warmup_steps=1):
-        """
-        Warm up the whisper.cpp engine, since the first few inferences are slow.
-        Args:
-            warmup_steps (int): Number of steps to warm up the model for.
-        """
-        logging.info("[INFO:] Warming up whisper.cpp engine..")
-        mel, _ = soundfile.read("assets/jfk.flac")
-        for i in range(warmup_steps):
-            self.transcriber.transcribe(mel, print_progress=False)
-    def set_eos(self, eos):
-        """
-        Sets the End of Speech (EOS) flag.
-        Args:
-            eos (bool): The value to set for the EOS flag.
-        """
-        with self.lock:
-            self.eos = eos
-    def handle_transcription_output(self, last_segment, duration):
-        """
-        Handle the transcription output, updating the transcript and sending data to the client.
-        Args:
-            last_segment (str): The last segment from the whisper output which is considered to be incomplete because
-                                of the possibility of word being truncated.
-            duration (float): Duration of the transcribed audio chunk.
-        """
-        segments = self.prepare_segments({"text": last_segment})
-        self.send_transcription_to_client(segments)
-        if self.eos:
-            self.update_timestamp_offset(last_segment, duration)
-    def transcribe_audio(self, input_bytes):
-        """
-        Transcribe the audio chunk and send the results to the client.
-        Args:
-            input_bytes (np.array): The audio chunk to transcribe.
-        """
-        # Decide once whether the shared-model lock is needed, so acquire and release always match.
-        use_shared_model = bool(ServeClientWhisperCPP.SINGLE_MODEL)
-        if use_shared_model:
-            ServeClientWhisperCPP.SINGLE_MODEL_LOCK.acquire()
-        try:
-            logging.info(f"[pywhispercpp:] Processing audio with duration: {input_bytes.shape[0] / self.RATE}")
-            mel = input_bytes
-            duration = librosa.get_duration(y=input_bytes, sr=self.RATE)
-            if self.language == "zh":
-                prompt = '以下是简体中文普通话的句子。'
-            else:
-                prompt = 'The following is an English sentence.'
-            segments = self.transcriber.transcribe(
-                mel,
-                language=self.language,
-                initial_prompt=prompt,
-                token_timestamps=True,
-                # max_len=max_len,
-                print_progress=False
-            )
-            text = []
-            for segment in segments:
-                content = segment.text
-                text.append(content)
-            last_segment = ' '.join(text)
-            logging.info(f"[pywhispercpp:] Last segment: {last_segment}")
-        finally:
-            # Release the lock even if transcription raises, so other clients are not blocked.
-            if use_shared_model:
-                ServeClientWhisperCPP.SINGLE_MODEL_LOCK.release()
-        if last_segment:
-            self.handle_transcription_output(last_segment, duration)
-    def update_timestamp_offset(self, last_segment, duration):
-        """
-        Update timestamp offset and transcript.
-        Args:
-            last_segment (str): Last transcribed audio from the whisper model.
-            duration (float): Duration of the last audio chunk.
-        """
-        if not len(self.transcript):
-            self.transcript.append({"text": last_segment + " "})
-        elif self.transcript[-1]["text"].strip() != last_segment:
-            self.transcript.append({"text": last_segment + " "})
-        logging.info(f'Transcript list context: {self.transcript}')
-        with self.lock:
-            self.timestamp_offset += duration
-    def speech_to_text(self):
-        """
-        Process an audio stream in an infinite loop, continuously transcribing the speech.
-        This method continuously receives audio frames, performs real-time transcription, and sends
-        transcribed segments to the client via a WebSocket connection.
-        If the client's language is not detected, it waits for 30 seconds of audio input to make a language prediction.
-        It utilizes the Whisper ASR model to transcribe the audio, continuously processing and streaming results. Segments
-        are sent to the client in real-time, and a history of segments is maintained to provide context. Pauses in speech
-        (no output from Whisper) are handled by showing the previous output for a set duration. A blank segment is added if
-        there is no speech for a specified duration to indicate a pause.
-        Raises:
-            Exception: If there is an issue with audio processing or WebSocket communication.
-        """
-        while True:
-            if self.exit:
-                logging.info("Exiting speech to text thread")
-                break
-            if self.frames_np is None:
-                time.sleep(0.02)  # wait for any audio to arrive
-                continue
-            self.clip_audio_if_no_valid_segment()
-            input_bytes, duration = self.get_audio_chunk_for_processing()
-            if duration < 1:
-                continue
-            try:
-                input_sample = input_bytes.copy()
-                logging.info(f"[pywhispercpp:] Processing audio with duration: {duration}")
-                self.transcribe_audio(input_sample)
-            except Exception as e:
-                logging.error(f"[ERROR]: {e}")

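The `single_model` path above shares one transcriber between clients and serializes access with a class-level lock. A minimal sketch of that pattern follows, assuming a stand-in `FakeModel` in place of the real whisper.cpp model; `FakeModel` and `SharedModelClient` are hypothetical names used only for illustration, not part of the deleted file. The `with` block is one way to make sure the lock is released even when transcription raises.

```python
import threading


class FakeModel:
    """Stand-in for a real speech model (hypothetical, for illustration only)."""

    def transcribe(self, audio):
        if audio is None:
            raise ValueError("no audio")
        return f"{len(audio)} samples"


class SharedModelClient:
    # Class-level slot and lock, mirroring SINGLE_MODEL / SINGLE_MODEL_LOCK in the diff.
    SINGLE_MODEL = None
    SINGLE_MODEL_LOCK = threading.Lock()

    def __init__(self, single_model=False):
        self.single_model = single_model
        if single_model:
            # Guard creation so two clients connecting at once build only one model.
            with SharedModelClient.SINGLE_MODEL_LOCK:
                if SharedModelClient.SINGLE_MODEL is None:
                    SharedModelClient.SINGLE_MODEL = FakeModel()
            self.transcriber = SharedModelClient.SINGLE_MODEL
        else:
            # Each client owns its model, so no locking is needed.
            self.transcriber = FakeModel()

    def transcribe_audio(self, audio):
        if not self.single_model:
            return self.transcriber.transcribe(audio)
        # The with-block releases the lock even if transcribe() raises.
        with SharedModelClient.SINGLE_MODEL_LOCK:
            return self.transcriber.transcribe(audio)
```

With this shape, two clients created with `single_model=True` hold the same transcriber object, and a failed call leaves the lock free for the next client.
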
transcribe/strategy.py DELETED

@@ -1,405 +0,0 @@
-import collections
-import logging
-from difflib import SequenceMatcher
-from itertools import chain
-from dataclasses import dataclass, field
-from typing import List, Tuple, Optional, Deque, Any, Iterator, Literal
-from config import SENTENCE_END_MARKERS, ALL_MARKERS, SENTENCE_END_PATTERN, REGEX_MARKERS, PAUSEE_END_PATTERN, SAMPLE_RATE
-from enum import Enum
-import wordninja
-import config
-import re
-logger = logging.getLogger("TranscriptionStrategy")
-class SplitMode(Enum):
-    PUNCTUATION = "punctuation"
-    PAUSE = "pause"
-    END = "end"
-@dataclass
-class TranscriptResult:
-    seg_id: int = 0
-    cut_index: int = 0
-    is_end_sentence: bool = False
-    context: str = ""
-    def partial(self):
-        return not self.is_end_sentence
-@dataclass
-class TranscriptToken:
-    """A single transcribed token with its text and timing information."""
-    text: str  # transcribed text
-    t0: int  # start time, in hundredths of a second
-    t1: int  # end time, in hundredths of a second
-    def is_punctuation(self):
-        """Check whether the text contains a punctuation mark."""
-        return REGEX_MARKERS.search(self.text.strip()) is not None
-    def is_end(self):
-        """Check whether the text is a sentence-end marker."""
-        return SENTENCE_END_PATTERN.search(self.text.strip()) is not None
-    def is_pause(self):
-        """Check whether the text is a pause marker."""
-        return PAUSEE_END_PATTERN.search(self.text.strip()) is not None
-    def buffer_index(self) -> int:
-        return int(self.t1 / 100 * SAMPLE_RATE)
-@dataclass
-class TranscriptChunk:
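-    """A group of transcript tokens that supports splitting and comparison."""
-    separator: str = ""  # separator used when joining tokens
-    items: list[TranscriptToken] = field(default_factory=list)  # list of transcript tokens
-    @staticmethod
-    def _calculate_similarity(text1: str, text2: str) -> float:
-        """Compute the similarity between two strings."""
-        return SequenceMatcher(None, text1, text2).ratio()
-    def split_by(self, mode: SplitMode) -> list['TranscriptChunk']:
-        """Split the token list at the markers selected by mode (punctuation, pause, or sentence end)."""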
-        if mode == SplitMode.PUNCTUATION:
-            indexes = [i for i, seg in enumerate(self.items) if seg.is_punctuation()]
-        elif mode == SplitMode.PAUSE:
-            indexes = [i for i, seg in enumerate(self.items) if seg.is_pause()]
-        elif mode == SplitMode.END:
-            indexes = [i for i, seg in enumerate(self.items) if seg.is_end()]
-        else:
-            raise ValueError(f"Unsupported mode: {mode}")
-        # Shift each cut point forward by one index so the separator stays with the preceding chunk.
-        cut_points = [0] + sorted(i + 1 for i in indexes) + [len(self.items)]
-        chunks = [
-            TranscriptChunk(items=self.items[start:end], separator=self.separator)
-            for start, end in zip(cut_points, cut_points[1:])
-        ]
-        return [
-            ck
-            for ck in chunks
-            if not ck.only_punctuation()
-        ]
-    def get_split_first_rest(self,  mode: SplitMode):
-        chunks = self.split_by(mode)
-        first_chunk = chunks[0] if chunks else self
-        rest_chunks = chunks[1:] if chunks else None
-        return first_chunk, rest_chunks
-    def puncation_numbers(self) -> int:
-        """Count the punctuation tokens in the chunk."""
-        return sum(1 for seg in self.items if seg.is_punctuation())
-    def length(self) -> int:
-        """Return the number of tokens in the chunk."""
-        return len(self.items)
-    def join(self) -> str:
-        """Join the tokens into a single string."""
-        return self.separator.join(seg.text for seg in self.items)
-    def compare(self, chunk: Optional['TranscriptChunk'] = None) -> float:
-        """Compare this chunk with another chunk and return their similarity."""
-        if not chunk:
-            return 0
-        score = self._calculate_similarity(self.join(), chunk.join())
-        # logger.debug(f"Compare: {self.join()} vs {chunk.join()} : {score}")
-        return score
-    def only_punctuation(self)->bool:
-        return all(seg.is_punctuation() for seg in self.items)
-    def has_punctuation(self) -> bool:
-        return any(seg.is_punctuation() for seg in self.items)
-    def get_buffer_index(self) -> int:
-        return self.items[-1].buffer_index()
-    def is_end_sentence(self) ->bool:
-        return self.items[-1].is_end()
-class TranscriptHistory:
-    """Keeps a history of transcript chunks."""
-    def __init__(self) -> None:
-        self.history = collections.deque(maxlen=2)  # holds the two most recent chunks
-    def add(self, chunk: TranscriptChunk):
-        """Add a new chunk to the history."""
-        self.history.appendleft(chunk)
-    def previous_chunk(self) -> Optional[TranscriptChunk]:
-        """Return the previous chunk, if there is one."""
-        return self.history[1] if len(self.history) == 2 else None
-    def lastest_chunk(self):
-        """Return the last chunk in the deque (the oldest one, since add() uses appendleft)."""
-        return self.history[-1]
-    def clear(self):
-        self.history.clear()
-class TranscriptBuffer:
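-    """
-    Manages the tiered structure of transcribed text: pending string -> short sentence -> complete paragraph
-    |-- confirmed text --|-- observation window --|-- new input --|
-    Handles the pending -> line -> paragraph buffering logic
-    """
-    def __init__(self, source_lang: str, separator: str):
-        self._segments: Deque[str] = collections.deque(maxlen=2)  # confirmed complete paragraphs
-        self._sentences: Deque[str] = collections.deque()  # short sentences in the current paragraph
-        self._buffer: str = ""  # text currently being buffered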
-        self._current_seg_id: int = 0
-        self.source_language = source_lang
-        self._separator = separator
-    def get_seg_id(self) -> int:
-        return self._current_seg_id
-    @property
-    def current_sentences_length(self) -> int:
-        count = 0
-        for item in self._sentences:
-            if self._separator:
-                count += len(item.split(self._separator))
-            else:
-                count += len(item)
-        return count
-    def update_pending_text(self, text: str) -> None:
-        """Update the pending buffer string."""
-        self._buffer = text
-    def commit_line(self) -> None:
-        """Commit the buffer string as a short sentence."""
-        if self._buffer:
-            self._sentences.append(self._buffer)
-            self._buffer = ""
-    def commit_paragraph(self) -> None:
-        """
-        Commit the current short sentences as a complete paragraph (for example, when a sentence ends).
-        """
-        count = 0
-        current_sentences = []
-        while len(self._sentences): # and count < 20:
-            item = self._sentences.popleft()
-            current_sentences.append(item)
-            if self._separator:
-                count += len(item.split(self._separator))
-            else:
-                count += len(item)
-        if current_sentences:
-            self._segments.append("".join(current_sentences))
-        logger.debug(f"=== count to paragraph ===")
-        logger.debug(f"push: {current_sentences}")
-        logger.debug(f"rest: {self._sentences}")
-        # if self._sentences:
-        #     self._segments.append("".join(self._sentences))
-        #     self._sentences.clear()
-    def rebuild(self, text):
-        output = self.split_and_join(
-                    text.replace(
-                        self._separator, ""))
-        logger.debug("==== rebuild string ====")
-        logger.debug(text)
-        logger.debug(output)
-        return output
-    @staticmethod
-    def split_and_join(text):
-        tokens = []
-        word_buf = ''
-        for char in text:
-            if char in ALL_MARKERS:
-                if word_buf:
-                    tokens.extend(wordninja.split(word_buf))
-                    word_buf = ''
-                tokens.append(char)
-            else:
-                word_buf += char
-        if word_buf:
-            tokens.extend(wordninja.split(word_buf))
-        output = ''
-        for i, token in enumerate(tokens):
-            if i == 0:
-                output += token
-            elif token in ALL_MARKERS:
-                output += (token + " ")
-            else:
-                output += ' ' + token
-        return output
-    def update_and_commit(self, stable_strings: List[str], remaining_strings:List[str], is_end_sentence=False):
-        if self.source_language == "en":
-            stable_strings = [self.rebuild(i) for i in stable_strings]
-            remaining_strings =[self.rebuild(i) for i in remaining_strings]
-        remaining_string = "".join(remaining_strings)
-        logger.debug(f"{self.__dict__}")
-        if is_end_sentence:
-            for stable_str in stable_strings:
-                self.update_pending_text(stable_str)
-                self.commit_line()
-            current_text_len = len(self.current_not_commit_text.split(self._separator)) if self._separator else len(self.current_not_commit_text)
-            # current_text_len = len(self.current_not_commit_text.split(self._separator))
-            self.update_pending_text(remaining_string)
-            if current_text_len >= config.TEXT_THREHOLD:
-                self.commit_paragraph()
-                self._current_seg_id += 1
-                return True
-        else:
-            for stable_str in stable_strings:
-                self.update_pending_text(stable_str)
-                self.commit_line()
-            self.update_pending_text(remaining_string)
-        return False
-    @property
-    def un_commit_paragraph(self) -> str:
-        """The current short sentences, joined together."""
-        return "".join([i for i in self._sentences])
-    @property
-    def pending_text(self) -> str:
-        """The current buffer contents."""
-        return self._buffer
-    @property
-    def latest_paragraph(self) -> str:
-        """The most recently confirmed paragraph."""
-        return self._segments[-1] if self._segments else ""
-    @property
-    def current_not_commit_text(self) -> str:
-        return self.un_commit_paragraph + self.pending_text
-class TranscriptStabilityAnalyzer:
-    def __init__(self, source_lang, separator) -> None:
-        self._transcript_buffer = TranscriptBuffer(source_lang=source_lang,separator=separator)
-        self._transcript_history = TranscriptHistory()
-        self._separator = separator
-        logger.debug(f"Current separator: {self._separator}")
-    def merge_chunks(self, chunks: List[TranscriptChunk]) -> List[str]:
-        if not chunks:
-            return [""]
-        output = list(r.join() for r in chunks if r)
-        return output
-    def analysis(self, current: List[TranscriptToken], buffer_duration: float) -> Iterator[TranscriptResult]:
-        current = TranscriptChunk(items=current, separator=self._separator)
-        self._transcript_history.add(current)
-        prev = self._transcript_history.previous_chunk()
-        self._transcript_buffer.update_pending_text(current.join())
-        if not prev:  # No history means this is a new utterance, so emit it directly.
-            yield TranscriptResult(
-                context=self._transcript_buffer.current_not_commit_text,
-                seg_id=self._transcript_buffer.get_seg_id()
-            )
-            return
-        # yield from self._handle_short_buffer(current, prev)
-        if buffer_duration <= 4:
-            yield from self._handle_short_buffer(current, prev)
-        else:
-            yield from self._handle_long_buffer(current)
-    def _handle_short_buffer(self, curr: TranscriptChunk, prev: TranscriptChunk) -> Iterator[TranscriptResult]:
-        curr_first, curr_rest = curr.get_split_first_rest(SplitMode.PUNCTUATION)
-        prev_first, _ = prev.get_split_first_rest(SplitMode.PUNCTUATION)
-        # logger.debug("==== Current cut item ====")
-        # logger.debug(f"{curr.join()} ")
-        # logger.debug(f"{prev.join()}")
-        # logger.debug("==========================")
-        if curr_first and prev_first:
-            score = curr_first.compare(prev_first)
-            has_punctuation = curr_first.has_punctuation()
-            if score >= 0.8 and has_punctuation:
-                yield from self._yield_commit_results(curr_first, curr_rest, curr_first.is_end_sentence())
-                return
-        yield TranscriptResult(
-            seg_id=self._transcript_buffer.get_seg_id(),
-            context=self._transcript_buffer.current_not_commit_text
-        )
-    def _handle_long_buffer(self, curr: TranscriptChunk) -> Iterator[TranscriptResult]:
-        chunks = curr.split_by(SplitMode.PUNCTUATION)
-        if len(chunks) > 1:
-            stable, remaining = chunks[:-1], chunks[-1:]
-            # stable_str = self.merge_chunks(stable)
-            # remaining_str = self.merge_chunks(remaining)
-            yield from self._yield_commit_results(
-                stable, remaining, is_end_sentence=True  # hard-coded to True for now
-            )
-        else:
-            yield TranscriptResult(
-                seg_id=self._transcript_buffer.get_seg_id(),
-                context=self._transcript_buffer.current_not_commit_text
-            )
-    def _yield_commit_results(self, stable_chunk, remaining_chunks, is_end_sentence: bool) -> Iterator[TranscriptResult]:
-        stable_str_list = [stable_chunk.join()] if hasattr(stable_chunk, "join") else self.merge_chunks(stable_chunk)
-        remaining_str_list = self.merge_chunks(remaining_chunks)
-        frame_cut_index = stable_chunk[-1].get_buffer_index() if isinstance(stable_chunk, list) else stable_chunk.get_buffer_index()
-        prev_seg_id = self._transcript_buffer.get_seg_id()
-        commit_paragraph = self._transcript_buffer.update_and_commit(stable_str_list, remaining_str_list, is_end_sentence)
-        logger.debug(f"current buffer: {self._transcript_buffer.__dict__}")
-        if commit_paragraph:
-            # A new paragraph was committed, so start a new line.
-            yield TranscriptResult(
-                seg_id=prev_seg_id,
-                cut_index=frame_cut_index,
-                context=self._transcript_buffer.latest_paragraph,
-                is_end_sentence=True
-            )
-            if (context := self._transcript_buffer.current_not_commit_text.strip()):
-                yield TranscriptResult(
-                    seg_id=self._transcript_buffer.get_seg_id(),
-                    context=context,
-                )
-        else:
-            yield TranscriptResult(
-                seg_id=self._transcript_buffer.get_seg_id(),
-                cut_index=frame_cut_index,
-                context=self._transcript_buffer.current_not_commit_text,
-            )

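Two pieces of arithmetic in `strategy.py` are easy to misread: `split_by` cuts one position past each marker so the marker stays with the chunk it closes, and `buffer_index` converts a token end time in hundredths of a second into a sample offset. The sketch below reproduces both on plain strings. It assumes a 16 kHz sample rate and a small hard-coded marker set; the real values come from `config`, which the diff does not show, and `split_after_markers` is a hypothetical helper name.

```python
from typing import List

SAMPLE_RATE = 16000  # assumption: config.SAMPLE_RATE is not shown in the diff
MARKERS = set(",.!?，。！？")  # assumption: stands in for REGEX_MARKERS


def split_after_markers(tokens: List[str]) -> List[List[str]]:
    """Split tokens so that each marker closes the chunk it follows."""
    indexes = [i for i, tok in enumerate(tokens) if tok.strip() in MARKERS]
    # Cut one position past each marker: the marker stays with the preceding chunk.
    cut_points = [0] + [i + 1 for i in indexes] + [len(tokens)]
    chunks = [tokens[start:end] for start, end in zip(cut_points, cut_points[1:])]
    # Drop empty chunks and chunks that contain nothing but markers.
    return [c for c in chunks if c and not all(t.strip() in MARKERS for t in c)]


def buffer_index(t1_centiseconds: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert a token end time in hundredths of a second to a sample offset."""
    return int(t1_centiseconds / 100 * sample_rate)
```

For example, `split_after_markers(["Hello", ",", " world", ".", " next"])` returns three chunks, the first two ending in their markers, and `buffer_index(250)` is 40000 under the assumed sample rate.
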
transcribe/transcription.py DELETED

@@ -1,334 +0,0 @@
-import logging
-import time
-import functools
-import json
-from enum import Enum
-from typing import List, Optional
-import numpy as np
-from .server import ServeClientBase
-from .whisper_llm_serve import PyWhiperCppServe
-from .vad import VoiceActivityDetector
-from urllib.parse import urlparse, parse_qsl
-from websockets.exceptions import ConnectionClosed
-from websockets.sync.server import serve
-from uuid import uuid1
-logging.basicConfig(level=logging.INFO)
-class ClientManager:
-    def __init__(self, max_clients=4, max_connection_time=600):
-        """
-        Initializes the ClientManager with specified limits on client connections and connection durations.
-        Args:
-            max_clients (int, optional): The maximum number of simultaneous client connections allowed. Defaults to 4.
-            max_connection_time (int, optional): The maximum duration (in seconds) a client can stay connected. Defaults
-                                                 to 600 seconds (10 minutes).
-        """
-        self.clients = {}
-        self.start_times = {}
-        self.max_clients = max_clients
-        self.max_connection_time = max_connection_time
-    def add_client(self, websocket, client):
-        """
-        Adds a client and their connection start time to the tracking dictionaries.
-        Args:
-            websocket: The websocket associated with the client to add.
-            client: The client object to be added and tracked.
-        """
-        self.clients[websocket] = client
-        self.start_times[websocket] = time.time()
-    def get_client(self, websocket):
-        """
-        Retrieves a client associated with the given websocket.
-        Args:
-            websocket: The websocket associated with the client to retrieve.
-        Returns:
-            The client object if found, False otherwise.
-        """
-        if websocket in self.clients:
-            return self.clients[websocket]
-        return False
-    def remove_client(self, websocket):
-        """
-        Removes a client and their connection start time from the tracking dictionaries. Performs cleanup on the
-        client if necessary.
-        Args:
-            websocket: The websocket associated with the client to be removed.
-        """
-        client = self.clients.pop(websocket, None)
-        if client:
-            client.cleanup()
-        self.start_times.pop(websocket, None)
-    def get_wait_time(self):
-        """
-        Calculates the estimated wait time for new clients based on the remaining connection times of current clients.
-        Returns:
-            The estimated wait time in minutes for new clients to connect. Returns 0 if there are available slots.
-        """
-        wait_time = None
-        for start_time in self.start_times.values():
-            current_client_time_remaining = self.max_connection_time - (time.time() - start_time)
-            if wait_time is None or current_client_time_remaining < wait_time:
-                wait_time = current_client_time_remaining
-        return wait_time / 60 if wait_time is not None else 0
-    def is_server_full(self, websocket, options):
-        """
-        Checks if the server is at its maximum client capacity and sends a wait message to the client if necessary.
-        Args:
-            websocket: The websocket of the client attempting to connect.
-            options: A dictionary of options that may include the client's unique identifier.
-        Returns:
-            True if the server is full, False otherwise.
-        """
-        if len(self.clients) >= self.max_clients:
-            wait_time = self.get_wait_time()
-            response = {"uid": options["uid"], "status": "WAIT", "message": wait_time}
-            websocket.send(json.dumps(response))
-            return True
-        return False
-    def is_client_timeout(self, websocket):
-        """
-        Checks if a client has exceeded the maximum allowed connection time and disconnects them if so, issuing a warning.
-        Args:
-            websocket: The websocket associated with the client to check.
-        Returns:
-            True if the client's connection time has exceeded the maximum limit, False otherwise.
-        """
-        elapsed_time = time.time() - self.start_times[websocket]
-        if elapsed_time >= self.max_connection_time:
-            self.clients[websocket].disconnect()
-            logging.warning(f"Client with uid '{self.clients[websocket].client_uid}' disconnected due to overtime.")
-            return True
-        return False
-class BackendType(Enum):
-    PYWHISPERCPP = "pywhispercpp"
-    @staticmethod
-    def valid_types() -> List[str]:
-        return [backend_type.value for backend_type in BackendType]
-    @staticmethod
-    def is_valid(backend: str) -> bool:
-        return backend in BackendType.valid_types()
-    def is_pywhispercpp(self) -> bool:
-        return self == BackendType.PYWHISPERCPP
-class TranscriptionServer:
-    RATE = 16000
-    def __init__(self):
-        self.client_manager = None
-        self.no_voice_activity_chunks = 0
-        self.single_model = False
-    def initialize_client(
-            self, websocket, options
-    ):
-        client: Optional[ServeClientBase] = None
-        if self.backend.is_pywhispercpp():
-            client = PyWhiperCppServe(
-                websocket,
-                language=options["language"],
-                client_uid=options["uid"],
-            )
-            logging.info("Running pywhispercpp backend.")
-        if client is None:
-            raise ValueError(f"Backend type {self.backend.value} not recognised or not handled.")
-        self.client_manager.add_client(websocket, client)
-    def get_audio_from_websocket(self, websocket):
-        """
-        Receives audio buffer from websocket and creates a numpy array out of it.
-        Args:
-            websocket: The websocket to receive audio from.
-        Returns:
-            A numpy array containing the audio.
-        """
-        frame_data = websocket.recv()
-        if frame_data == b"END_OF_AUDIO":
-            return False
-        return np.frombuffer(frame_data, dtype=np.int16).astype(np.float32) / 32768.0
-        # return np.frombuffer(frame_data, dtype=np.float32)
-    def handle_new_connection(self, websocket):
-        query_parameters_dict = dict(parse_qsl(urlparse(websocket.request.path).query))
-        from_lang, to_lang = query_parameters_dict.get('from'), query_parameters_dict.get('to')
-        try:
-            logging.info("New client connected")
-            options = websocket.recv()
-            try:
-                options = json.loads(options)
-            except Exception as e:
-                options = {"language": from_lang, "uid": str(uuid1())}
-            if self.client_manager is None:
-                max_clients = options.get('max_clients', 4)
-                max_connection_time = options.get('max_connection_time', 600)
-                self.client_manager = ClientManager(max_clients, max_connection_time)
-            if self.client_manager.is_server_full(websocket, options):
-                websocket.close()
-                return False  # Indicates that the connection should not continue
-            if self.backend.is_pywhispercpp():
-                self.vad_detector = VoiceActivityDetector(frame_rate=self.RATE)
-            self.initialize_client(websocket, options)
-            if from_lang and to_lang:
-                self.set_lang(websocket, from_lang, to_lang)
-                logging.info(f"Source language: {from_lang} -> Target language: {to_lang}")
-            return True
-        except json.JSONDecodeError:
-            logging.error("Failed to decode JSON from client")
-            return False
-        except ConnectionClosed:
-            logging.info("Connection closed by client")
-            return False
-        except Exception as e:
-            logging.error(f"Error during new connection initialization: {str(e)}")
-            return False
-    def process_audio_frames(self, websocket):
-        frame_np = self.get_audio_from_websocket(websocket)
-        client = self.client_manager.get_client(websocket)
-        # TODO: VAD is problematic here; it blocks the processing loop.
-        # if frame_np is False:
-        #     if self.backend.is_pywhispercpp():
-        #         client.set_eos(True)
-        #     return False
-        # if self.backend.is_pywhispercpp():
-        #     voice_active = self.voice_activity(websocket, frame_np)
-        #     if voice_active:
-        #         self.no_voice_activity_chunks = 0
-        #         client.set_eos(False)
-        #     if self.use_vad and not voice_active:
-        #         return True
-        client.add_frames(frame_np)
-        return True
-    def set_lang(self, websocket, src_lang, dst_lang):
-        client = self.client_manager.get_client(websocket)
-        if isinstance(client, PyWhiperCppServe):
-            client.set_lang(src_lang, dst_lang)
-    def recv_audio(self,
-                   websocket,
-                   backend: BackendType = BackendType.PYWHISPERCPP):
-        self.backend = backend
-        if not self.handle_new_connection(websocket):
-            return
-        try:
-            while not self.client_manager.is_client_timeout(websocket):
-                if not self.process_audio_frames(websocket):
-                    break
-        except ConnectionClosed:
-            logging.info("Connection closed by client")
-        except Exception as e:
-            logging.error(f"Unexpected error: {str(e)}")
-        finally:
-            if self.client_manager.get_client(websocket):
-                self.cleanup(websocket)
-                websocket.close()
-            del websocket
-    def run(self,
-            host,
-            port=9090,
-            backend="pywhispercpp"):
-        """
-        Run the transcription server.
-        Args:
-            host (str): The host address to bind the server.
-            port (int): The port number to bind the server.
-        """
-        if not BackendType.is_valid(backend):
-            raise ValueError(f"{backend} is not a valid backend type. Choose backend from {BackendType.valid_types()}")
-        with serve(
-                functools.partial(
-                    self.recv_audio,
-                    backend=BackendType(backend),
-                ),
-                host,
-                port
-        ) as server:
-            server.serve_forever()
-    def voice_activity(self, websocket, frame_np):
-        """
-        Evaluates the voice activity in a given audio frame and manages the state of voice activity detection.
-        This method uses the configured voice activity detection (VAD) model to assess whether the given audio frame
-        contains speech. If the VAD model detects no voice activity for more than three consecutive frames,
-        it sets an end-of-speech (EOS) flag for the associated client. This method aims to efficiently manage
-        speech detection to improve subsequent processing steps.
-        Args:
-            websocket: The websocket associated with the current client. Used to retrieve the client object
-                    from the client manager for state management.
-            frame_np (numpy.ndarray): The audio frame to be analyzed. This should be a NumPy array containing
-                                    the audio data for the current frame.
-        Returns:
-            bool: True if voice activity is detected in the current frame, False otherwise. When returning False
-                after detecting no voice activity for more than three consecutive frames, it also triggers the
-                end-of-speech (EOS) flag for the client.
-        """
-        if not self.vad_detector(frame_np):
-            self.no_voice_activity_chunks += 1
-            if self.no_voice_activity_chunks > 3:
-                client = self.client_manager.get_client(websocket)
-                if not client.eos:
-                    client.set_eos(True)
-                time.sleep(0.1)  # Sleep 100 ms; wait for voice activity.
-            return False
-        return True
-    def cleanup(self, websocket):
-        """
-        Cleans up resources associated with a given client's websocket.
-        Args:
-            websocket: The websocket associated with the client to be cleaned up.
-        """
-        if self.client_manager.get_client(websocket):
-            self.client_manager.remove_client(websocket)
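
The removed `voice_activity` docstring describes a consecutive-silence counter: once more than three frames in a row contain no speech, the client's end-of-speech flag is set. A minimal sketch of that idea follows. `SilenceTracker` and `energy_vad` are hypothetical names, the amplitude threshold stands in for the real VAD model, and resetting the counter on a voiced frame is an assumption, since the removed lines shown here do not include the reset.

```python
from typing import Callable, Sequence


def energy_vad(frame: Sequence[float], threshold: float = 0.01) -> bool:
    """Toy stand-in for a VAD model: speech if any sample exceeds the threshold."""
    return max((abs(sample) for sample in frame), default=0.0) > threshold


class SilenceTracker:
    """Hypothetical helper mirroring the counter described in the docstring."""

    def __init__(self,
                 is_speech: Callable[[Sequence[float]], bool],
                 max_silent_frames: int = 3) -> None:
        self.is_speech = is_speech
        self.max_silent_frames = max_silent_frames
        self.silent_frames = 0
        self.eos = False

    def update(self, frame: Sequence[float]) -> bool:
        """Return True for a voiced frame, False for a silent one."""
        if self.is_speech(frame):
            self.silent_frames = 0  # assumption: voiced frames reset the counter
            return True
        self.silent_frames += 1
        if self.silent_frames > self.max_silent_frames:
            self.eos = True  # more than three silent frames in a row
        return False
```

With the default limit, the flag is raised on the fourth consecutive silent frame, not the third, because the comparison is strictly greater than three.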

transcribe/translatepipes.py CHANGED Viewed

@@ -3,9 +3,7 @@ from transcribe.pipelines import WhisperPipe, MetaItem, WhisperChinese, Translat
 class TranslatePipes:
     def __init__(self) -> None:
-        # self.whisper_input_q = mp.Queue()
-        # self.translate_input_q = mp.Queue()
-        # self.result_queue = mp.Queue()
         self._process = []
         # whisper transcription
         self._whisper_pipe_en = self._launch_process(WhisperPipe())
@@ -14,13 +12,9 @@ class TranslatePipes:
         # LLM translation
         # self._translate_pipe = self._launch_process(TranslatePipe())
         self._translate_7b_pipe = self._launch_process(Translate7BPipe())
         # vad
         self._vad_pipe = self._launch_process(VadPipe())
-    # def reset(self):
-    #     self._vad_pipe.reset()
     def _launch_process(self, process_obj):
         process_obj.daemon = True
@@ -48,17 +42,12 @@ class TranslatePipes:
         self._translate_7b_pipe.input_queue.put(item)
         return self._translate_7b_pipe.output_queue.get()
-    def get_whisper_model(self, lang: str = 'en'):
-        if lang == 'zh':
-            return self._whisper_pipe_zh
-        return self._whisper_pipe_en
     def get_transcription_model(self, lang: str = 'en'):
         if lang == 'zh':
             return self._funasr_pipe
         return self._whisper_pipe_en
-    def transcrible(self, audio_buffer: bytes, src_lang: str) -> MetaItem:
         transcription_model = self.get_transcription_model(src_lang)
         item = MetaItem(audio=audio_buffer, source_language=src_lang)
         transcription_model.input_queue.put(item)
@@ -76,6 +65,6 @@ if __name__ == "__main__":
     tp = TranslatePipes()
     # result = tp.translate("你好，今天天气怎么样?", src_lang="zh", dst_lang="en")
     mel, _, = soundfile.read("assets/jfk.flac")
-    # result = tp.transcrible(mel, 'en')
     result = tp.voice_detect(mel)
     print(result)

 class TranslatePipes:
     def __init__(self) -> None:
         self._process = []
         # whisper transcription
         self._whisper_pipe_en = self._launch_process(WhisperPipe())
         # LLM translation
         # self._translate_pipe = self._launch_process(TranslatePipe())
         self._translate_7b_pipe = self._launch_process(Translate7BPipe())
         # vad
         self._vad_pipe = self._launch_process(VadPipe())
     def _launch_process(self, process_obj):
         process_obj.daemon = True
         self._translate_7b_pipe.input_queue.put(item)
         return self._translate_7b_pipe.output_queue.get()
     def get_transcription_model(self, lang: str = 'en'):
         if lang == 'zh':
             return self._funasr_pipe
         return self._whisper_pipe_en
+    def transcribe(self, audio_buffer: bytes, src_lang: str) -> MetaItem:
         transcription_model = self.get_transcription_model(src_lang)
         item = MetaItem(audio=audio_buffer, source_language=src_lang)
         transcription_model.input_queue.put(item)
     tp = TranslatePipes()
     # result = tp.translate("你好，今天天气怎么样?", src_lang="zh", dst_lang="en")
     mel, _, = soundfile.read("assets/jfk.flac")
+    # result = tp.transcribe(mel, 'en')
     result = tp.voice_detect(mel)
     print(result)
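
Each pipe in `TranslatePipes` is driven by putting an item on its `input_queue` and blocking on its `output_queue`. The sketch below shows that request/response pattern in isolation. It is an illustration under assumptions: `UppercasePipe` and `call_pipe` are hypothetical, and a daemon thread with `queue.Queue` stands in for the separate processes that the real pipes appear to use.

```python
import queue
import threading


class UppercasePipe(threading.Thread):
    """Hypothetical worker: reads from input_queue, writes to output_queue."""

    def __init__(self) -> None:
        super().__init__(daemon=True)
        self.input_queue: queue.Queue = queue.Queue()
        self.output_queue: queue.Queue = queue.Queue()

    def run(self) -> None:
        while True:
            item = self.input_queue.get()
            if item is None:  # sentinel: stop the worker
                break
            self.output_queue.put(item.upper())


def call_pipe(pipe: UppercasePipe, item: str, timeout: float = 5.0) -> str:
    """Send one request and wait for its response."""
    pipe.input_queue.put(item)
    return pipe.output_queue.get(timeout=timeout)


pipe = UppercasePipe()
pipe.start()
result = call_pipe(pipe, "hello")
pipe.input_queue.put(None)
```

The pattern pairs requests and responses only by order, so it is safe with one caller per pipe; with several concurrent callers a response could be picked up by the wrong one unless items carry an identifier.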

transcribe/utils.py CHANGED Viewed

@@ -8,6 +8,7 @@ import config
 import csv
 import av
 import re
 # Compile regex patterns once outside the loop for better performance
 p_pattern = re.compile(r"(\s*\[.*?\])")
@@ -18,43 +19,67 @@ p_end_pattern = re.compile(r"(\s*.*\])")
 def filter_words(res_word):
     """
     Filter words according to specific bracket patterns.
     Args:
         res_word: Iterable of word objects with a 'text' attribute
     Returns:
         List of filtered word objects
     """
     asr_results = []
     skip_word = False
     for word in res_word:
         # Skip words that completely match the pattern
         if p_pattern.match(word.text):
             continue
         # Mark the start of a section to skip
         if p_start_pattern.match(word.text):
             skip_word = True
             continue
         # Mark the end of a section to skip
         if p_end_pattern.match(word.text) and skip_word:
             skip_word = False
             continue
         # Skip words if we're in a skip section
         if skip_word:
             continue
         # Add the word to results if it passed all filters
         asr_results.append(word)
     return asr_results
 def log_block(key: str, value, unit=''):
     if config.DEBUG:
-        return
     """Format and print a log entry."""
     key_fmt = f"[  {key.ljust(25)}]"  # left-justify with padding
     val_fmt = f"{value} {unit}".strip()
@@ -157,8 +182,8 @@ class TestDataWriter:
     def __init__(self, file_path='test_data.csv'):
         self.file_path = file_path
         self.fieldnames = [
-            'seg_id', 'transcrible_time', 'translate_time',
-            'transcribleContent', 'from', 'to', 'translateContent', 'partial'
         ]
         self._ensure_file_has_header()
@@ -171,4 +196,4 @@ class TestDataWriter:
     def write(self, result: 'DebugResult'):
         with open(self.file_path, mode='a', newline='') as file:
             writer = csv.DictWriter(file, fieldnames=self.fieldnames)
-            writer.writerow(result.model_dump(by_alias=True))

 import csv
 import av
 import re
+import json
 # Compile regex patterns once outside the loop for better performance
 p_pattern = re.compile(r"(\s*\[.*?\])")
 def filter_words(res_word):
     """
     Filter words according to specific bracket patterns.
     Args:
         res_word: Iterable of word objects with a 'text' attribute
     Returns:
         List of filtered word objects
     """
     asr_results = []
     skip_word = False
     for word in res_word:
         # Skip words that completely match the pattern
         if p_pattern.match(word.text):
             continue
         # Mark the start of a section to skip
         if p_start_pattern.match(word.text):
             skip_word = True
             continue
         # Mark the end of a section to skip
         if p_end_pattern.match(word.text) and skip_word:
             skip_word = False
             continue
         # Skip words if we're in a skip section
         if skip_word:
             continue
+        word.text = replace_hotwords(word.text)
         # Add the word to results if it passed all filters
         asr_results.append(word)
     return asr_results
+def replace_hotwords(text: str) -> str:
+    """
+    Reads hotwords from a JSON file and replaces occurrences in the input text.
+    Args:
+        text: The input string to process.
+    Returns:
+        The string with hotwords replaced.
+    """
+    processed_text = text
+    # Iterate through the hotwords dictionary
+    for key, value in config.hotwords_json.items():
+        # Replace all occurrences of the key with the value in the text
+        processed_text = processed_text.replace(key, value)
+    logging.debug(f"Replace string: {text} => {processed_text}")
+    return processed_text
 def log_block(key: str, value, unit=''):
     if config.DEBUG:
+        return
     """Format and print a log entry."""
     key_fmt = f"[  {key.ljust(25)}]"  # left-justify with padding
     val_fmt = f"{value} {unit}".strip()
     def __init__(self, file_path='test_data.csv'):
         self.file_path = file_path
         self.fieldnames = [
+            'seg_id', 'transcribe_time', 'translate_time',
+            'transcribeContent', 'from', 'to', 'translateContent', 'partial'
         ]
         self._ensure_file_has_header()
     def write(self, result: 'DebugResult'):
         with open(self.file_path, mode='a', newline='') as file:
             writer = csv.DictWriter(file, fieldnames=self.fieldnames)
+            writer.writerow(result.model_dump(by_alias=True))
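
The new `replace_hotwords` helper applies every key/value pair from `config.hotwords_json` to the text with `str.replace`. The sketch below reproduces that loop with the mapping passed in explicitly so it can run on its own. The `HOTWORDS` entries are made-up examples, not the contents of the project's hotword file.

```python
from typing import Mapping

# Hypothetical hotword table; the real one is loaded into config.hotwords_json.
HOTWORDS = {
    "open ai": "OpenAI",
    "g p t": "GPT",
}


def replace_hotwords(text: str, hotwords: Mapping[str, str]) -> str:
    """Replace every occurrence of each hotword key with its value."""
    processed_text = text
    for key, value in hotwords.items():
        processed_text = processed_text.replace(key, value)
    return processed_text


fixed = replace_hotwords("we use open ai and g p t", HOTWORDS)
```

Two properties of this approach are worth keeping in mind: `str.replace` is case-sensitive, so `"Open AI"` is left untouched by the table above, and the replacements run in dictionary order, so an earlier substitution can change what a later key matches.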

transcribe/whisper_llm_serve.py CHANGED Viewed

@@ -1,10 +1,8 @@
-import asyncio
-import json
 import queue
 import threading
 import time
 from logging import getLogger
-from typing import List, Optional, Iterator, Tuple, Any
 import asyncio
 import numpy as np
 import config
@@ -13,16 +11,26 @@ from api_model import TransResult, Message, DebugResult
 from .utils import log_block, save_to_wave, TestDataWriter, filter_words
 from .translatepipes import TranslatePipes
-from .strategy import (
-    TranscriptStabilityAnalyzer, TranscriptToken)
-from transcribe.helpers.vadprocessor import VadProcessor
-# from transcribe.helpers.vad_dynamic import VadProcessor
-# from transcribe.helpers.vadprocessor import VadProcessor
 from transcribe.pipelines import MetaItem
 logger = getLogger("TranscriptionService")
 class WhisperTranscriptionService:
     """
     Whisper speech transcription service; handles transcription and translation of audio streams.
@@ -42,45 +50,35 @@ class WhisperTranscriptionService:
         self._translate_pipe = pipe
         # Audio processing
-        self.sample_rate = 16000
         self.lock = threading.Lock()
         # Text separator, chosen by language
-        self.text_separator = self._get_text_separator(language)
         self.loop = asyncio.get_event_loop()
-        # Send the ready state
         #  Raw audio queue
         self._frame_queue = queue.Queue()
         #  Audio buffer
-        self.frames_np = None
         #  Queue of complete audio segments
-        self.segments_queue = collections.deque()
-        self._temp_string = ""
-        self._transcrible_analysis = None
         # Start the processing threads
         self._translate_thread_stop = threading.Event()
         self._frame_processing_thread_stop = threading.Event()
-        self.translate_thread = self._start_thread(self._transcription_processing_loop)
-        self.frame_processing_thread = self._start_thread(self._frame_processing_loop)
-        # if language == "zh":
-        #     self._vad = VadProcessor(prob_threshold=0.8, silence_s=0.2, cache_s=0.15)
-        # else:
-        #     self._vad = VadProcessor(prob_threshold=0.7, silence_s=0.2, cache_s=0.15)
         self.row_number = 0
         # for test
-        self._transcrible_time_cost = 0.
         self._translate_time_cost = 0.
         if config.SAVE_DATA_SAVE:
             self._save_task_stop = threading.Event()
             self._save_queue = queue.Queue()
-            self._save_thread = self._start_thread(self.save_data_loop)
-        # self._c = 0
     def save_data_loop(self):
         writer = TestDataWriter()
@@ -88,33 +86,6 @@ class WhisperTranscriptionService:
             test_data = self._save_queue.get()
             writer.write(test_data)  # Save test_data to CSV
-    def _start_thread(self, target_function) -> threading.Thread:
-        """Start a daemon thread that runs the given function."""
-        thread = threading.Thread(target=target_function)
-        thread.daemon = True
-        thread.start()
-        return thread
-    def _get_text_separator(self, language: str) -> str:
-        """Return the appropriate text separator for the language."""
-        return "" if language == "zh" else " "
-    async def send_ready_state(self) -> None:
-        """Send the server-ready status message."""
-        await self.websocket.send(json.dumps({
-            "uid": self.client_uid,
-            "message": self.SERVER_READY,
-            "backend": "whisper_transcription"
-        }))
-    def set_language(self, source_lang: str, target_lang: str) -> None:
-        """Set the source and target languages."""
-        self.source_language = source_lang
-        self.target_language = target_lang
-        self.text_separator = self._get_text_separator(source_lang)
-        # self._transcrible_analysis = TranscriptStabilityAnalyzer(self.source_language, self.text_separator)
     def add_frames(self, frame_np: np.ndarray) -> None:
         """Add an audio frame to the processing queue."""
         self._frame_queue.put(frame_np)
@@ -126,100 +97,88 @@ class WhisperTranscriptionService:
         speech_status = processed_audio.speech_status
         return speech_audio, speech_status
     def _frame_processing_loop(self) -> None:
         """Take audio frames from the queue and merge them into the buffer."""
         while not self._frame_processing_thread_stop.is_set():
             try:
                 frame_np = self._frame_queue.get(timeout=0.1)
                 frame_np, speech_status = self._apply_voice_activity_detection(frame_np)
-                if frame_np is None or len(frame_np) == 0:
                     continue
                 with self.lock:
-                    if self.frames_np is None:
-                        self.frames_np = frame_np.copy()
-                    else:
-                        self.frames_np = np.append(self.frames_np, frame_np)
-                    if speech_status == "END" and len(self.frames_np) > 0:
-                        self.segments_queue.appendleft(self.frames_np.copy())
                         self.frames_np = np.array([], dtype=np.float32)
             except queue.Empty:
                 pass
-    def _process_transcription_results_2(self, seg_text:str,partial):
-        item =  TransResult(
-                seg_id=self.row_number,
-                context=seg_text,
-                from_=self.source_language,
-                to=self.target_language,
-                tran_content=self._translate_text_large(seg_text),
-                partial=partial
-            )
-        if partial == False:
-            self.row_number += 1
-        return item
     def _transcription_processing_loop(self) -> None:
         """Main transcription processing loop."""
         frame_epoch = 1
-        while not self._translate_thread_stop.is_set():
-            if self.frames_np is None:
-                time.sleep(0.01)
-                continue
-            if len(self.segments_queue) >0:
-                audio_buffer = self.segments_queue.pop()
-                partial = False
-            else:
-                with self.lock:
-                    audio_buffer = self.frames_np[:int(frame_epoch * 1.5 * self.sample_rate)].copy()  # take the first 1.5 s * frame_epoch of audio
-                partial = True
-            if len(audio_buffer) ==0:
                 time.sleep(0.01)
                 continue
             if len(audio_buffer) < int(self.sample_rate):
                 silence_audio = np.zeros(self.sample_rate, dtype=np.float32)
                 silence_audio[-len(audio_buffer):] = audio_buffer
                 audio_buffer = silence_audio
             logger.debug(f"audio buffer size: {len(audio_buffer) / self.sample_rate:.2f}s")
-            # try:
             meta_item = self._transcribe_audio(audio_buffer)
             segments = meta_item.segments
             logger.debug(f"Segments: {segments}")
             segments = filter_words(segments)
             if len(segments):
                 seg_text = self.text_separator.join(seg.text for seg in segments)
-                if self._temp_string:
-                    seg_text = self._temp_string + seg_text
-                if partial == False:
-                    if len(seg_text) < config.TEXT_THREHOLD:
-                        partial = True
-                        self._temp_string = seg_text
-                    else:
-                        self._temp_string = ""
-                result = self._process_transcription_results_2(seg_text, partial)
                 self._send_result_to_client(result)
-                time.sleep(0.1)
-                if partial == False:
                     frame_epoch = 1
                 else:
                     frame_epoch += 1
-            # Process the transcription results and send them to the client
-            # for result in self._process_transcription_results(segments, audio_buffer):
-            #     self._send_result_to_client(result)
-            # except Exception as e:
-            #     logger.error(f"Error processing audio: {e}")
     def _transcribe_audio(self, audio_buffer: np.ndarray)->MetaItem:
@@ -227,14 +186,13 @@ class WhisperTranscriptionService:
         log_block("Audio buffer length", f"{audio_buffer.shape[0]/self.sample_rate:.2f}", "s")
         start_time = time.perf_counter()
-        result = self._translate_pipe.transcrible(audio_buffer.tobytes(), self.source_language)
         segments = result.segments
         time_diff = (time.perf_counter() - start_time)
-        logger.debug(f"📝 Transcrible Segments: {segments} ")
-        # logger.debug(f"📝 Transcrible: {self.text_separator.join(seg.text for seg in segments)} ")
-        log_block("📝 Transcrible output", f"{self.text_separator.join(seg.text for seg in segments)}", "")
-        log_block("📝 Transcrible time", f"{time_diff:.3f}", "s")
-        self._transcrible_time_cost = round(time_diff, 3)
         return result
     def _translate_text(self, text: str) -> str:
@@ -270,51 +228,6 @@ class WhisperTranscriptionService:
         return translated_text
-    def _process_transcription_results(self, segments: List[TranscriptToken], audio_buffer: np.ndarray) -> Iterator[TransResult]:
-        """
-        Process the transcription results and produce translation results.
-        Returns:
-            An iterator of TransResult objects
-        """
-        if not segments:
-            return
-        start_time = time.perf_counter()
-        for ana_result in self._transcrible_analysis.analysis(segments, len(audio_buffer)/self.sample_rate):
-            if (cut_index :=ana_result.cut_index)>0:
-                # Update the audio buffer, removing the part already processed
-                self._update_audio_buffer(cut_index)
-            if ana_result.partial():
-                translated_context = self._translate_text(ana_result.context)
-            else:
-                translated_context = self._translate_text_large(ana_result.context)
-            yield TransResult(
-                seg_id=ana_result.seg_id,
-                context=ana_result.context,
-                from_=self.source_language,
-                to=self.target_language,
-                tran_content=translated_context,
-                partial=ana_result.partial()
-            )
-            current_time = time.perf_counter()
-            time_diff = current_time - start_time
-            if config.SAVE_DATA_SAVE:
-                self._save_queue.put(DebugResult(
-                    seg_id=ana_result.seg_id,
-                    transcrible_time=self._transcrible_time_cost,
-                    translate_time=self._translate_time_cost,
-                    context=ana_result.context,
-                    from_=self.source_language,
-                    to=self.target_language,
-                    tran_content=translated_context,
-                    partial=ana_result.partial()
-                ))
-            log_block("🚦 Traffic times diff", round(time_diff, 2), 's')
     def _send_result_to_client(self, result: TransResult) -> None:
         """Send the translation result to the client."""
         try:
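
The rewritten `_frame_processing_loop` decides when the growing buffer becomes a complete segment: either the buffer reaches `config.MAX_SPEECH_DURATION_S`, or the VAD reports `"END"` and at least `config.FRAME_SCOPE_TIME_THRESHOLD` seconds have passed since the segment started. The sketch below isolates that decision, together with the left-padding of short buffers to one second that the transcription loop performs. It is a simplified illustration: `SegmentBuffer` and `pad_to_one_second` are hypothetical names, plain lists replace NumPy arrays, the clock is injectable for testing, and the default durations are assumed values (the diff does not show the config constants).

```python
import collections
import time
from typing import Callable, Deque, List, Optional, Sequence


class SegmentBuffer:
    """Hypothetical, list-based version of the buffering rules in the diff."""

    def __init__(self, sample_rate: int = 16000,
                 max_speech_s: float = 20.0,   # assumed value
                 min_scope_s: float = 3.0,     # assumed value
                 clock: Callable[[], float] = time.time) -> None:
        self.max_samples = int(sample_rate * max_speech_s)
        self.min_scope_s = min_scope_s
        self.clock = clock
        self.frames: List[float] = []
        self.start_ts: Optional[float] = None
        self.full_segments: Deque[List[float]] = collections.deque()

    def _flush(self, next_start: Optional[float]) -> None:
        """Move the current buffer to the queue of complete segments."""
        self.full_segments.appendleft(self.frames)
        self.frames = []
        self.start_ts = next_start

    def push(self, frame: Sequence[float], status: str) -> None:
        if status == "START" and self.start_ts is None:
            self.start_ts = self.clock()
        self.frames.extend(frame)
        if len(self.frames) >= self.max_samples:
            # Buffer is full: cut here and start timing the next segment.
            self._flush(self.clock())
        elif status == "END" and self.frames and self.start_ts is not None:
            # Speech ended: cut only if the segment is long enough,
            # otherwise keep accumulating into the same buffer.
            if self.clock() - self.start_ts >= self.min_scope_s:
                self._flush(None)


def pad_to_one_second(samples: Sequence[float], sample_rate: int) -> List[float]:
    """Left-pad with silence so the result holds at least one second of audio."""
    missing = sample_rate - len(samples)
    if missing <= 0:
        return list(samples)
    return [0.0] * missing + list(samples)
```

Unlike the code in the diff, the sketch compares the start timestamp with `is not None` instead of relying on truthiness, so a clock that starts at `0.0` still works in tests.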

 import queue
 import threading
 import time
 from logging import getLogger
 import asyncio
 import numpy as np
 import config
 from .utils import log_block, save_to_wave, TestDataWriter, filter_words
 from .translatepipes import TranslatePipes
 from transcribe.pipelines import MetaItem
 logger = getLogger("TranscriptionService")
+def _get_text_separator(language: str) -> str:
+    """Return the appropriate text separator for the language."""
+    return "" if language == "zh" else " "
+def _start_thread(target_function) -> threading.Thread:
+    """Start a daemon thread that runs the given function."""
+    thread = threading.Thread(target=target_function)
+    thread.daemon = True
+    thread.start()
+    return thread
 class WhisperTranscriptionService:
     """
     Whisper speech transcription service; handles transcription and translation of audio streams.
         self._translate_pipe = pipe
         # Audio processing
+        self.sample_rate = config.SAMPLE_RATE
         self.lock = threading.Lock()
         # Text separator, chosen by language
+        self.text_separator = _get_text_separator(language)
         self.loop = asyncio.get_event_loop()
         #  Raw audio queue
         self._frame_queue = queue.Queue()
         #  Audio buffer
+        self.frames_np = np.array([], dtype=np.float32)
+        self.frames_np_start_timestamp = None
         #  Queue of complete audio segments
+        self.full_segments_queue = collections.deque()
         # Start the processing threads
         self._translate_thread_stop = threading.Event()
         self._frame_processing_thread_stop = threading.Event()
+        self.translate_thread = _start_thread(self._transcription_processing_loop)
+        self.frame_processing_thread = _start_thread(self._frame_processing_loop)
         self.row_number = 0
         # for test
+        self._transcribe_time_cost = 0.
         self._translate_time_cost = 0.
         if config.SAVE_DATA_SAVE:
             self._save_task_stop = threading.Event()
             self._save_queue = queue.Queue()
+            self._save_thread = _start_thread(self.save_data_loop)
     def save_data_loop(self):
         writer = TestDataWriter()
             test_data = self._save_queue.get()
             writer.write(test_data)  # Save test_data to CSV
     def add_frames(self, frame_np: np.ndarray) -> None:
         """Add an audio frame to the processing queue."""
         self._frame_queue.put(frame_np)
         speech_status = processed_audio.speech_status
         return speech_audio, speech_status
     def _frame_processing_loop(self) -> None:
         """Take audio frames from the queue and merge them into the buffer."""
         while not self._frame_processing_thread_stop.is_set():
             try:
                 frame_np = self._frame_queue.get(timeout=0.1)
                 frame_np, speech_status = self._apply_voice_activity_detection(frame_np)
+                if frame_np is None:
                     continue
                 with self.lock:
+                    if speech_status == "START" and self.frames_np_start_timestamp is None:
+                        self.frames_np_start_timestamp = time.time()
+                    # Append the audio to the buffer
+                    self.frames_np = np.append(self.frames_np, frame_np)
+                    if len(self.frames_np) >= self.sample_rate * config.MAX_SPEECH_DURATION_S:
+                        audio_array=self.frames_np.copy()
+                        self.full_segments_queue.appendleft(audio_array) # Merge audio chunks depending on whether the three-second duration is reached
+                        self.frames_np_start_timestamp = time.time()
                         self.frames_np = np.array([], dtype=np.float32)
+                    elif speech_status == "END" and len(self.frames_np) > 0 and self.frames_np_start_timestamp:
+                        time_diff = time.time() - self.frames_np_start_timestamp
+                        if time_diff >= config.FRAME_SCOPE_TIME_THRESHOLD:
+                            audio_array=self.frames_np.copy()
+                            self.full_segments_queue.appendleft(audio_array) # Merge audio chunks depending on whether the three-second duration is reached
+                            self.frames_np_start_timestamp = None
+                            self.frames_np = np.array([], dtype=np.float32)
+                        else:
+                            logger.debug(f"🥳 Time since the previous sentence: {time_diff:.2f}s, continuing to fill the buffer")
             except queue.Empty:
                 pass
     def _transcription_processing_loop(self) -> None:
         """Main transcription processing loop."""
         frame_epoch = 1
+        while not self._translate_thread_stop.is_set():
+            if len(self.frames_np) ==0:
                 time.sleep(0.01)
                 continue
+            with self.lock:
+                if len(self.full_segments_queue) > 0:
+                    audio_buffer = self.full_segments_queue.pop()
+                    partial = False
+                else:
+                    audio_buffer = self.frames_np[:int(frame_epoch * 1.5 * self.sample_rate)].copy()  # take the first 1.5 s * frame_epoch of audio
+                    partial = True
             if len(audio_buffer) < int(self.sample_rate):
                 silence_audio = np.zeros(self.sample_rate, dtype=np.float32)
                 silence_audio[-len(audio_buffer):] = audio_buffer
                 audio_buffer = silence_audio
             logger.debug(f"audio buffer size: {len(audio_buffer) / self.sample_rate:.2f}s")
             meta_item = self._transcribe_audio(audio_buffer)
             segments = meta_item.segments
             logger.debug(f"Segments: {segments}")
             segments = filter_words(segments)
             if len(segments):
                 seg_text = self.text_separator.join(seg.text for seg in segments)
+                result = TransResult(
+                    seg_id=self.row_number,
+                    context=seg_text,
+                    from_=self.source_language,
+                    to=self.target_language,
+                    tran_content=self._translate_text_large(seg_text),
+                    partial=partial
+                )
                 self._send_result_to_client(result)
+                if not partial:
+                    self.row_number += 1
                     frame_epoch = 1
                 else:
                     frame_epoch += 1
     def _transcribe_audio(self, audio_buffer: np.ndarray)->MetaItem:
         log_block("Audio buffer length", f"{audio_buffer.shape[0]/self.sample_rate:.2f}", "s")
         start_time = time.perf_counter()
+        result = self._translate_pipe.transcribe(audio_buffer.tobytes(), self.source_language)
         segments = result.segments
         time_diff = (time.perf_counter() - start_time)
+        logger.debug(f"📝 transcribe Segments: {segments} ")
+        log_block("📝 transcribe output", f"{self.text_separator.join(seg.text for seg in segments)}", "")
+        log_block("📝 transcribe time", f"{time_diff:.3f}", "s")
+        self._transcribe_time_cost = round(time_diff, 3)
         return result
     def _translate_text(self, text: str) -> str:
         return translated_text
     def _send_result_to_client(self, result: TransResult) -> None:
         """Send the translation result to the client."""
         try: