Merge branch 'vad'
Browse files* vad:
[fix]: update view.
update vad parameter
change to vad audio translate
add whisper fine tune for chinese
vad parameters v1 test
fix vad bug
[fix]: update installation.
fix vad buf
- config.py +6 -5
- frontend/assets/{index-eff0154e.css → index-2c7aa850.css} +1 -1
- frontend/assets/{index-0364c095.js → index-640e640f.js} +0 -0
- frontend/index.html +2 -2
- main.py +2 -0
- moyoyo_asr_models/ggml-small-encoder.mlmodelc/analytics/coremldata.bin +3 -0
- moyoyo_asr_models/ggml-small-encoder.mlmodelc/coremldata.bin +3 -0
- moyoyo_asr_models/ggml-small-encoder.mlmodelc/metadata.json +64 -0
- moyoyo_asr_models/ggml-small-encoder.mlmodelc/model.mil +0 -0
- moyoyo_asr_models/ggml-small-encoder.mlmodelc/weights/weight.bin +3 -0
- moyoyo_asr_models/ggml-small.bin +3 -0
- transcribe/helpers/vadprocessor.py +262 -242
- transcribe/helpers/whisper.py +8 -4
- transcribe/pipelines/__init__.py +1 -1
- transcribe/pipelines/pipe_translate.py +3 -0
- transcribe/pipelines/pipe_vad.py +21 -79
- transcribe/pipelines/pipe_whisper.py +12 -2
- transcribe/strategy.py +1 -1
- transcribe/translatepipes.py +22 -11
- transcribe/whisper_llm_serve.py +62 -103
config.py
CHANGED
|
@@ -2,7 +2,7 @@ import pathlib
|
|
| 2 |
import re
|
| 3 |
import logging
|
| 4 |
|
| 5 |
-
DEBUG =
|
| 6 |
TEST = False
|
| 7 |
logging.getLogger("pywhispercpp").setLevel(logging.WARNING)
|
| 8 |
|
|
@@ -10,7 +10,7 @@ logging.getLogger("pywhispercpp").setLevel(logging.WARNING)
|
|
| 10 |
logging.basicConfig(
|
| 11 |
level=logging.DEBUG if DEBUG else logging.INFO,
|
| 12 |
format="%(asctime)s - %(levelname)s - %(message)s",
|
| 13 |
-
filename='translator.log',
|
| 14 |
datefmt="%H:%M:%S"
|
| 15 |
)
|
| 16 |
|
|
@@ -50,9 +50,10 @@ MAX_LENTH_ZH = 4
|
|
| 50 |
WHISPER_PROMPT_EN = ""# "The following is an English sentence."
|
| 51 |
MAX_LENGTH_EN= 8
|
| 52 |
|
| 53 |
-
|
| 54 |
-
WHISPER_MODEL = 'large-v3-turbo-q5_0'
|
| 55 |
-
|
|
|
|
| 56 |
# LLM
|
| 57 |
LLM_MODEL_PATH = (MODEL_DIR / "qwen2.5-1.5b-instruct-q5_0.gguf").as_posix()
|
| 58 |
LLM_LARGE_MODEL_PATH = (MODEL_DIR / "qwen2.5-1.5b-instruct-q5_0.gguf").as_posix()
|
|
|
|
| 2 |
import re
|
| 3 |
import logging
|
| 4 |
|
| 5 |
+
DEBUG = True
|
| 6 |
TEST = False
|
| 7 |
logging.getLogger("pywhispercpp").setLevel(logging.WARNING)
|
| 8 |
|
|
|
|
| 10 |
logging.basicConfig(
|
| 11 |
level=logging.DEBUG if DEBUG else logging.INFO,
|
| 12 |
format="%(asctime)s - %(levelname)s - %(message)s",
|
| 13 |
+
filename='translator.log',
|
| 14 |
datefmt="%H:%M:%S"
|
| 15 |
)
|
| 16 |
|
|
|
|
| 50 |
WHISPER_PROMPT_EN = ""# "The following is an English sentence."
|
| 51 |
MAX_LENGTH_EN= 8
|
| 52 |
|
| 53 |
+
WHISPER_MODEL_EN = 'medium-q5_0'
|
| 54 |
+
# WHISPER_MODEL = 'large-v3-turbo-q5_0'
|
| 55 |
+
# WHISPER_MODEL_ZH = 'small'
|
| 56 |
+
WHISPER_MODEL_ZH = 'large-v3-turbo-q5_0'
|
| 57 |
# LLM
|
| 58 |
LLM_MODEL_PATH = (MODEL_DIR / "qwen2.5-1.5b-instruct-q5_0.gguf").as_posix()
|
| 59 |
LLM_LARGE_MODEL_PATH = (MODEL_DIR / "qwen2.5-1.5b-instruct-q5_0.gguf").as_posix()
|
frontend/assets/{index-eff0154e.css → index-2c7aa850.css}
RENAMED
|
@@ -1 +1 @@
|
|
| 1 |
-
html,body{width:100%;height:100%}input::-ms-clear,input::-ms-reveal{display:none}*,*:before,*:after{box-sizing:border-box}html{font-family:sans-serif;line-height:1.15;-webkit-text-size-adjust:100%;-ms-text-size-adjust:100%;-ms-overflow-style:scrollbar;-webkit-tap-highlight-color:rgba(0,0,0,0)}body{margin:0}[tabindex="-1"]:focus{outline:none}hr{box-sizing:content-box;height:0;overflow:visible}h1,h2,h3,h4,h5,h6{margin-top:0;margin-bottom:.5em;font-weight:500}p{margin-top:0;margin-bottom:1em}abbr[title],abbr[data-original-title]{-webkit-text-decoration:underline dotted;text-decoration:underline;text-decoration:underline dotted;border-bottom:0;cursor:help}address{margin-bottom:1em;font-style:normal;line-height:inherit}input[type=text],input[type=password],input[type=number],textarea{-webkit-appearance:none}ol,ul,dl{margin-top:0;margin-bottom:1em}ol ol,ul ul,ol ul,ul ol{margin-bottom:0}dt{font-weight:500}dd{margin-bottom:.5em;margin-left:0}blockquote{margin:0 0 1em}dfn{font-style:italic}b,strong{font-weight:bolder}small{font-size:80%}sub,sup{position:relative;font-size:75%;line-height:0;vertical-align:baseline}sub{bottom:-.25em}sup{top:-.5em}pre,code,kbd,samp{font-size:1em;font-family:SFMono-Regular,Consolas,Liberation Mono,Menlo,Courier,monospace}pre{margin-top:0;margin-bottom:1em;overflow:auto}figure{margin:0 0 1em}img{vertical-align:middle;border-style:none}a,area,button,[role=button],input:not([type=range]),label,select,summary,textarea{touch-action:manipulation}table{border-collapse:collapse}caption{padding-top:.75em;padding-bottom:.3em;text-align:left;caption-side:bottom}input,button,select,optgroup,textarea{margin:0;color:inherit;font-size:inherit;font-family:inherit;line-height:inherit}button,input{overflow:visible}button,select{text-transform:none}button,html 
[type=button],[type=reset],[type=submit]{-webkit-appearance:button}button::-moz-focus-inner,[type=button]::-moz-focus-inner,[type=reset]::-moz-focus-inner,[type=submit]::-moz-focus-inner{padding:0;border-style:none}input[type=radio],input[type=checkbox]{box-sizing:border-box;padding:0}input[type=date],input[type=time],input[type=datetime-local],input[type=month]{-webkit-appearance:listbox}textarea{overflow:auto;resize:vertical}fieldset{min-width:0;margin:0;padding:0;border:0}legend{display:block;width:100%;max-width:100%;margin-bottom:.5em;padding:0;color:inherit;font-size:1.5em;line-height:inherit;white-space:normal}progress{vertical-align:baseline}[type=number]::-webkit-inner-spin-button,[type=number]::-webkit-outer-spin-button{height:auto}[type=search]{outline-offset:-2px;-webkit-appearance:none}[type=search]::-webkit-search-cancel-button,[type=search]::-webkit-search-decoration{-webkit-appearance:none}::-webkit-file-upload-button{font:inherit;-webkit-appearance:button}output{display:inline-block}summary{display:list-item}template{display:none}[hidden]{display:none!important}mark{padding:.2em;background-color:#feffe6}:root{font-family:Inter,system-ui,Avenir,Helvetica,Arial,sans-serif;line-height:1.5;font-weight:400;color-scheme:light dark;color:#ffffffde;background-color:#242424;font-synthesis:none;text-rendering:optimizeLegibility;-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;-webkit-text-size-adjust:100%}a{font-weight:500;color:#646cff;text-decoration:inherit}a:hover{color:#535bf2}body{margin:0;display:flex;place-items:center;min-width:320px;height:auto;min-height:auto;color:#333;background:#fff}h1{font-size:3.2em;line-height:1.1}button{border-radius:8px;border:1px solid transparent;padding:.6em 1.2em;font-size:1em;font-weight:500;font-family:inherit;background-color:#1a1a1a;cursor:pointer;transition:border-color .25s}.card{border-bottom:solid 2px 
lightgray;align-items:center;justify-content:center;margin-top:40px;display:flex;max-width:1024px;width:100%}.seg-title{margin:24px 0;font-size:20px;font-weight:500}.seg-co{width:1022px;text-align:left;border-left:solid 6px midnightblue;padding-left:8px;margin-left:2px;margin-top:36px;line-height:24px}#app{margin:0 auto;padding:0;text-align:center;width:100%}.ant-btn{padding:4px 12px}@media (prefers-color-scheme: light){:root{color:#213547;background-color:#fff}a:hover{color:#747bff}button{background-color:#f9f9f9}}.ant-card{background:#f5f6fa}.ant-card .ant-card-actions{background-color:#e8e8f8cc!important}.ant-popover{max-width:800px!important}.ant-form-item{background:transparent;margin-bottom:40px!important}.ant-form-item .ant-form-item-explain-error{color:#ff4d4f;text-align:left!important}.ant-form-item-label label{font-size:18px!important;color:#1a1a1a!important;font-weight:500!important}.ant-tooltip{max-width:1022px!important}.ant-page-header-heading{width:1022px!important}.highlight{background:ghostwhite}.content[data-v-178d5f9f]{background-color:#fff;max-width:1280px;min-height:720px;margin:0 auto;display:flex;flex-direction:column;align-items:center;justify-content:space-between}.not-found-wrapper[data-v-aef52a59]{height:calc(100vh - 104px)}.view-wrapper[data-v-
|
|
|
|
| 1 |
+
html,body{width:100%;height:100%}input::-ms-clear,input::-ms-reveal{display:none}*,*:before,*:after{box-sizing:border-box}html{font-family:sans-serif;line-height:1.15;-webkit-text-size-adjust:100%;-ms-text-size-adjust:100%;-ms-overflow-style:scrollbar;-webkit-tap-highlight-color:rgba(0,0,0,0)}body{margin:0}[tabindex="-1"]:focus{outline:none}hr{box-sizing:content-box;height:0;overflow:visible}h1,h2,h3,h4,h5,h6{margin-top:0;margin-bottom:.5em;font-weight:500}p{margin-top:0;margin-bottom:1em}abbr[title],abbr[data-original-title]{-webkit-text-decoration:underline dotted;text-decoration:underline;text-decoration:underline dotted;border-bottom:0;cursor:help}address{margin-bottom:1em;font-style:normal;line-height:inherit}input[type=text],input[type=password],input[type=number],textarea{-webkit-appearance:none}ol,ul,dl{margin-top:0;margin-bottom:1em}ol ol,ul ul,ol ul,ul ol{margin-bottom:0}dt{font-weight:500}dd{margin-bottom:.5em;margin-left:0}blockquote{margin:0 0 1em}dfn{font-style:italic}b,strong{font-weight:bolder}small{font-size:80%}sub,sup{position:relative;font-size:75%;line-height:0;vertical-align:baseline}sub{bottom:-.25em}sup{top:-.5em}pre,code,kbd,samp{font-size:1em;font-family:SFMono-Regular,Consolas,Liberation Mono,Menlo,Courier,monospace}pre{margin-top:0;margin-bottom:1em;overflow:auto}figure{margin:0 0 1em}img{vertical-align:middle;border-style:none}a,area,button,[role=button],input:not([type=range]),label,select,summary,textarea{touch-action:manipulation}table{border-collapse:collapse}caption{padding-top:.75em;padding-bottom:.3em;text-align:left;caption-side:bottom}input,button,select,optgroup,textarea{margin:0;color:inherit;font-size:inherit;font-family:inherit;line-height:inherit}button,input{overflow:visible}button,select{text-transform:none}button,html 
[type=button],[type=reset],[type=submit]{-webkit-appearance:button}button::-moz-focus-inner,[type=button]::-moz-focus-inner,[type=reset]::-moz-focus-inner,[type=submit]::-moz-focus-inner{padding:0;border-style:none}input[type=radio],input[type=checkbox]{box-sizing:border-box;padding:0}input[type=date],input[type=time],input[type=datetime-local],input[type=month]{-webkit-appearance:listbox}textarea{overflow:auto;resize:vertical}fieldset{min-width:0;margin:0;padding:0;border:0}legend{display:block;width:100%;max-width:100%;margin-bottom:.5em;padding:0;color:inherit;font-size:1.5em;line-height:inherit;white-space:normal}progress{vertical-align:baseline}[type=number]::-webkit-inner-spin-button,[type=number]::-webkit-outer-spin-button{height:auto}[type=search]{outline-offset:-2px;-webkit-appearance:none}[type=search]::-webkit-search-cancel-button,[type=search]::-webkit-search-decoration{-webkit-appearance:none}::-webkit-file-upload-button{font:inherit;-webkit-appearance:button}output{display:inline-block}summary{display:list-item}template{display:none}[hidden]{display:none!important}mark{padding:.2em;background-color:#feffe6}:root{font-family:Inter,system-ui,Avenir,Helvetica,Arial,sans-serif;line-height:1.5;font-weight:400;color-scheme:light dark;color:#ffffffde;background-color:#242424;font-synthesis:none;text-rendering:optimizeLegibility;-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;-webkit-text-size-adjust:100%}a{font-weight:500;color:#646cff;text-decoration:inherit}a:hover{color:#535bf2}body{margin:0;display:flex;place-items:center;min-width:320px;height:auto;min-height:auto;color:#333;background:#fff}h1{font-size:3.2em;line-height:1.1}button{border-radius:8px;border:1px solid transparent;padding:.6em 1.2em;font-size:1em;font-weight:500;font-family:inherit;background-color:#1a1a1a;cursor:pointer;transition:border-color .25s}.card{border-bottom:solid 2px 
lightgray;align-items:center;justify-content:center;margin-top:40px;display:flex;max-width:1024px;width:100%}.seg-title{margin:24px 0;font-size:20px;font-weight:500}.seg-co{width:1022px;text-align:left;border-left:solid 6px midnightblue;padding-left:8px;margin-left:2px;margin-top:36px;line-height:24px}#app{margin:0 auto;padding:0;text-align:center;width:100%}.ant-btn{padding:4px 12px}@media (prefers-color-scheme: light){:root{color:#213547;background-color:#fff}a:hover{color:#747bff}button{background-color:#f9f9f9}}.ant-card{background:#f5f6fa}.ant-card .ant-card-actions{background-color:#e8e8f8cc!important}.ant-popover{max-width:800px!important}.ant-form-item{background:transparent;margin-bottom:40px!important}.ant-form-item .ant-form-item-explain-error{color:#ff4d4f;text-align:left!important}.ant-form-item-label label{font-size:18px!important;color:#1a1a1a!important;font-weight:500!important}.ant-tooltip{max-width:1022px!important}.ant-page-header-heading{width:1022px!important}.highlight{background:ghostwhite}.content[data-v-178d5f9f]{background-color:#fff;max-width:1280px;min-height:720px;margin:0 auto;display:flex;flex-direction:column;align-items:center;justify-content:space-between}.not-found-wrapper[data-v-aef52a59]{height:calc(100vh - 104px)}.config-content[data-v-ba23d083]{width:420px;margin:12px}.config-content .config-block[data-v-ba23d083]{margin:12px;padding-bottom:12px}.view-wrapper[data-v-ba23d083]{width:100%;height:100%;background-color:#fff}.view-wrapper .wrapper-width-fixed[data-v-ba23d083]{width:1280px}.view-wrapper .wrapper-width-auto[data-v-ba23d083]{width:100vw}.view-wrapper .content-wrapper[data-v-ba23d083]{text-align:left;max-width:100vw;min-width:320px;margin-bottom:64px;min-height:calc(100vh - 438px)}.view-wrapper .content-wrapper .chat-box[data-v-ba23d083]{width:100%;height:54vh;border-radius:4px;padding:12px;color:#2e2f33;font-size:18px}.view-wrapper .content-wrapper 
.chat-box-placeholder[data-v-ba23d083]{width:100%;height:58vh;border-radius:4px;padding:12px;font-size:18px;color:#a4a6ac}.view-wrapper .content-wrapper .actions-box[data-v-ba23d083]{display:flex;align-items:center;justify-content:space-between;margin:0 24px;height:48px}.view-wrapper .content-wrapper .actions-box .left-actions[data-v-ba23d083]{display:flex;align-items:center;justify-content:space-between;width:288px}.view-wrapper .content-wrapper .trans-list[data-v-ba23d083]{overflow-y:auto;width:100%;height:58vh;scrollbar-width:none;-ms-overflow-style:none}.view-wrapper .content-wrapper .trans-list[data-v-ba23d083]::-webkit-scrollbar{display:none}.view-wrapper .content-wrapper .trans-list .node[data-v-ba23d083]{margin-bottom:36px;width:100%!important;transition:all .3s ease}.view-wrapper .content-wrapper .trans-list .node .trans-time[data-v-ba23d083]{font-size:14px;color:#c4c6cc}.view-wrapper .content-wrapper .trans-list .node .trans-font-size-16[data-v-ba23d083]{font-size:16px}.view-wrapper .content-wrapper .trans-list .node .trans-font-size-18[data-v-ba23d083]{font-size:18px}.view-wrapper .content-wrapper .trans-list .node .trans-font-size-20[data-v-ba23d083]{font-size:20px}.view-wrapper .content-wrapper .trans-list .node .trans-font-size-22[data-v-ba23d083]{font-size:22px}.view-wrapper .content-wrapper .trans-list .node .trans-font-size-24[data-v-ba23d083]{font-size:24px}.view-wrapper .content-wrapper .trans-list .node .trans-src-lang[data-v-ba23d083]{color:#909299;font-weight:500}.view-wrapper .content-wrapper .trans-list .node .trans-dst-lang[data-v-ba23d083]{color:#2e2f33;font-weight:600}.view-wrapper .content-wrapper .trans-list .current_node[data-v-ba23d083]{background-color:#f0f1f7;padding:4px 8px}@keyframes highlight-ba23d083{0%{background-color:transparent}50%{background-color:#fff1ce80}to{background-color:transparent}}@keyframes 
slideIn-ba23d083{0%{opacity:0;transform:translateY(10px)}to{opacity:1;transform:translateY(0)}}.content-wrapper[data-v-c39ab0d6]{text-align:left;max-width:800px;min-width:320px;margin-bottom:64px;min-height:calc(100vh - 438px)}.content-wrapper .content-box[data-v-c39ab0d6]{padding:24px;height:240px;background-color:#e8e8e8;border-radius:16px;width:50%;margin:48px auto;min-width:300px}.content-wrapper .video-box[data-v-c39ab0d6]{max-width:800px;min-width:320px;width:90vw;height:auto}
|
frontend/assets/{index-0364c095.js → index-640e640f.js}
RENAMED
|
The diff for this file is too large to render.
See raw diff
|
|
|
frontend/index.html
CHANGED
|
@@ -5,8 +5,8 @@
|
|
| 5 |
<link rel="icon" type="image/svg+xml" href="./favicon.ico" />
|
| 6 |
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
| 7 |
<title>Translator</title>
|
| 8 |
-
<script type="module" crossorigin src="./assets/index-
|
| 9 |
-
<link rel="stylesheet" href="./assets/index-
|
| 10 |
</head>
|
| 11 |
<body>
|
| 12 |
<div id="app"></div>
|
|
|
|
| 5 |
<link rel="icon" type="image/svg+xml" href="./favicon.ico" />
|
| 6 |
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
| 7 |
<title>Translator</title>
|
| 8 |
+
<script type="module" crossorigin src="./assets/index-640e640f.js"></script>
|
| 9 |
+
<link rel="stylesheet" href="./assets/index-2c7aa850.css">
|
| 10 |
</head>
|
| 11 |
<body>
|
| 12 |
<div id="app"></div>
|
main.py
CHANGED
|
@@ -57,6 +57,7 @@ async def root():
|
|
| 57 |
async def translate(websocket: WebSocket):
|
| 58 |
query_parameters_dict = websocket.query_params
|
| 59 |
from_lang, to_lang = query_parameters_dict.get('from'), query_parameters_dict.get('to')
|
|
|
|
| 60 |
client = WhisperTranscriptionService(
|
| 61 |
websocket,
|
| 62 |
pipe,
|
|
@@ -64,6 +65,7 @@ async def translate(websocket: WebSocket):
|
|
| 64 |
client_uid=f"{uuid1()}",
|
| 65 |
)
|
| 66 |
|
|
|
|
| 67 |
if from_lang and to_lang:
|
| 68 |
client.set_language(from_lang, to_lang)
|
| 69 |
logger.info(f"Source lange: {from_lang} -> Dst lange: {to_lang}")
|
|
|
|
| 57 |
async def translate(websocket: WebSocket):
|
| 58 |
query_parameters_dict = websocket.query_params
|
| 59 |
from_lang, to_lang = query_parameters_dict.get('from'), query_parameters_dict.get('to')
|
| 60 |
+
|
| 61 |
client = WhisperTranscriptionService(
|
| 62 |
websocket,
|
| 63 |
pipe,
|
|
|
|
| 65 |
client_uid=f"{uuid1()}",
|
| 66 |
)
|
| 67 |
|
| 68 |
+
|
| 69 |
if from_lang and to_lang:
|
| 70 |
client.set_language(from_lang, to_lang)
|
| 71 |
logger.info(f"Source lange: {from_lang} -> Dst lange: {to_lang}")
|
moyoyo_asr_models/ggml-small-encoder.mlmodelc/analytics/coremldata.bin
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:18ad2072ae82872c2ba8a187071e1e7d6c1105253685e7aa95138adcf07874e0
|
| 3 |
+
size 207
|
moyoyo_asr_models/ggml-small-encoder.mlmodelc/coremldata.bin
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:05fe28591b40616fa0c34ad7b853133623f5300923ec812acb11459c411acf3b
|
| 3 |
+
size 149
|
moyoyo_asr_models/ggml-small-encoder.mlmodelc/metadata.json
ADDED
|
@@ -0,0 +1,64 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
[
|
| 2 |
+
{
|
| 3 |
+
"metadataOutputVersion" : "3.0",
|
| 4 |
+
"storagePrecision" : "Float16",
|
| 5 |
+
"outputSchema" : [
|
| 6 |
+
{
|
| 7 |
+
"hasShapeFlexibility" : "0",
|
| 8 |
+
"isOptional" : "0",
|
| 9 |
+
"dataType" : "Float32",
|
| 10 |
+
"formattedType" : "MultiArray (Float32)",
|
| 11 |
+
"shortDescription" : "",
|
| 12 |
+
"shape" : "[]",
|
| 13 |
+
"name" : "output",
|
| 14 |
+
"type" : "MultiArray"
|
| 15 |
+
}
|
| 16 |
+
],
|
| 17 |
+
"modelParameters" : [
|
| 18 |
+
|
| 19 |
+
],
|
| 20 |
+
"specificationVersion" : 6,
|
| 21 |
+
"mlProgramOperationTypeHistogram" : {
|
| 22 |
+
"Linear" : 72,
|
| 23 |
+
"Matmul" : 24,
|
| 24 |
+
"Cast" : 2,
|
| 25 |
+
"Conv" : 2,
|
| 26 |
+
"Softmax" : 12,
|
| 27 |
+
"Add" : 25,
|
| 28 |
+
"LayerNorm" : 25,
|
| 29 |
+
"Mul" : 24,
|
| 30 |
+
"Transpose" : 49,
|
| 31 |
+
"Gelu" : 14,
|
| 32 |
+
"Reshape" : 48
|
| 33 |
+
},
|
| 34 |
+
"computePrecision" : "Mixed (Float16, Float32, Int32)",
|
| 35 |
+
"isUpdatable" : "0",
|
| 36 |
+
"availability" : {
|
| 37 |
+
"macOS" : "12.0",
|
| 38 |
+
"tvOS" : "15.0",
|
| 39 |
+
"watchOS" : "8.0",
|
| 40 |
+
"iOS" : "15.0",
|
| 41 |
+
"macCatalyst" : "15.0"
|
| 42 |
+
},
|
| 43 |
+
"modelType" : {
|
| 44 |
+
"name" : "MLModelType_mlProgram"
|
| 45 |
+
},
|
| 46 |
+
"userDefinedMetadata" : {
|
| 47 |
+
|
| 48 |
+
},
|
| 49 |
+
"inputSchema" : [
|
| 50 |
+
{
|
| 51 |
+
"hasShapeFlexibility" : "0",
|
| 52 |
+
"isOptional" : "0",
|
| 53 |
+
"dataType" : "Float32",
|
| 54 |
+
"formattedType" : "MultiArray (Float32 1 × 80 × 3000)",
|
| 55 |
+
"shortDescription" : "",
|
| 56 |
+
"shape" : "[1, 80, 3000]",
|
| 57 |
+
"name" : "logmel_data",
|
| 58 |
+
"type" : "MultiArray"
|
| 59 |
+
}
|
| 60 |
+
],
|
| 61 |
+
"generatedClassName" : "coreml_encoder_small",
|
| 62 |
+
"method" : "predict"
|
| 63 |
+
}
|
| 64 |
+
]
|
moyoyo_asr_models/ggml-small-encoder.mlmodelc/model.mil
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
moyoyo_asr_models/ggml-small-encoder.mlmodelc/weights/weight.bin
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:87eed4ae76f11a2d4a50786bc7423d4b45c2d0d9ca05577a3bd2557452072eaf
|
| 3 |
+
size 176339456
|
moyoyo_asr_models/ggml-small.bin
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:3f6ef171491de375b741059400ba9a0aead023122b7a7db731b4943f9baa0f97
|
| 3 |
+
size 487601984
|
transcribe/helpers/vadprocessor.py
CHANGED
|
@@ -1,276 +1,296 @@
|
|
| 1 |
-
import
|
| 2 |
-
import
|
|
|
|
|
|
|
|
|
|
| 3 |
import numpy as np
|
| 4 |
import onnxruntime
|
| 5 |
-
from datetime import timedelta
|
| 6 |
-
from pydub import AudioSegment
|
| 7 |
-
from silero_vad import load_silero_vad, get_speech_timestamps, VADIterator
|
| 8 |
-
import os
|
| 9 |
-
import logging
|
| 10 |
-
|
| 11 |
-
class FixedVADIterator(VADIterator):
|
| 12 |
-
'''It fixes VADIterator by allowing to process any audio length, not only exactly 512 frames at once.
|
| 13 |
-
If audio to be processed at once is long and multiple voiced segments detected,
|
| 14 |
-
then __call__ returns the start of the first segment, and end (or middle, which means no end) of the last segment.
|
| 15 |
-
'''
|
| 16 |
|
| 17 |
-
|
| 18 |
-
super().reset_states()
|
| 19 |
-
self.buffer = np.array([],dtype=np.float32)
|
| 20 |
-
|
| 21 |
-
def __call__(self, x, return_seconds=False):
|
| 22 |
-
self.buffer = np.append(self.buffer, x)
|
| 23 |
-
ret = None
|
| 24 |
-
while len(self.buffer) >= 512:
|
| 25 |
-
r = super().__call__(self.buffer[:512], return_seconds=return_seconds)
|
| 26 |
-
self.buffer = self.buffer[512:]
|
| 27 |
-
if ret is None:
|
| 28 |
-
ret = r
|
| 29 |
-
elif r is not None:
|
| 30 |
-
if 'end' in r:
|
| 31 |
-
ret['end'] = r['end'] # the latter end
|
| 32 |
-
if 'start' in r and 'end' in ret: # there is an earlier start.
|
| 33 |
-
# Remove end, merging this segment with the previous one.
|
| 34 |
-
del ret['end']
|
| 35 |
-
return ret if ret != {} else None
|
| 36 |
-
|
| 37 |
-
class SileroVADProcessor:
|
| 38 |
-
"""
|
| 39 |
-
A class for processing audio files using Silero VAD to detect voice activity
|
| 40 |
-
and extract voice segments from audio files.
|
| 41 |
-
"""
|
| 42 |
|
| 43 |
-
def __init__(self,
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
max_speech_duration=20,
|
| 48 |
-
min_silence_duration=250,
|
| 49 |
-
sample_rate=16000,
|
| 50 |
-
ort_providers=None):
|
| 51 |
-
"""
|
| 52 |
-
Initialize the SileroVADProcessor.
|
| 53 |
-
|
| 54 |
-
Args:
|
| 55 |
-
activate_threshold (float): Threshold for voice activity detection
|
| 56 |
-
fusion_threshold (float): Threshold for merging close speech segments (seconds)
|
| 57 |
-
min_speech_duration (float): Minimum duration of speech to be considered valid (seconds)
|
| 58 |
-
max_speech_duration (float): Maximum duration of speech (seconds)
|
| 59 |
-
min_silence_duration (int): Minimum silence duration (ms)
|
| 60 |
-
sample_rate (int): Sample rate of the audio (8000 or 16000 Hz)
|
| 61 |
-
ort_providers (list): ONNX Runtime providers for acceleration
|
| 62 |
-
"""
|
| 63 |
-
# VAD parameters
|
| 64 |
-
self.activate_threshold = activate_threshold
|
| 65 |
-
self.fusion_threshold = fusion_threshold
|
| 66 |
-
self.min_speech_duration = min_speech_duration
|
| 67 |
-
self.max_speech_duration = max_speech_duration
|
| 68 |
-
self.min_silence_duration = min_silence_duration
|
| 69 |
-
self.sample_rate = sample_rate
|
| 70 |
-
self.ort_providers = ort_providers if ort_providers else []
|
| 71 |
-
|
| 72 |
-
# Initialize logger
|
| 73 |
-
self.logger = logging.getLogger(__name__)
|
| 74 |
-
|
| 75 |
-
# Load Silero VAD model
|
| 76 |
-
self._init_onnx_session()
|
| 77 |
-
self.silero_vad = load_silero_vad(onnx=True)
|
| 78 |
-
|
| 79 |
-
def _init_onnx_session(self):
|
| 80 |
-
"""Initialize ONNX Runtime session with appropriate settings."""
|
| 81 |
-
session_opts = onnxruntime.SessionOptions()
|
| 82 |
-
session_opts.log_severity_level = 3
|
| 83 |
-
session_opts.inter_op_num_threads = 0
|
| 84 |
-
session_opts.intra_op_num_threads = 0
|
| 85 |
-
session_opts.enable_cpu_mem_arena = True
|
| 86 |
-
session_opts.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
|
| 87 |
-
session_opts.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
|
| 88 |
-
|
| 89 |
-
session_opts.add_session_config_entry("session.intra_op.allow_spinning", "1")
|
| 90 |
-
session_opts.add_session_config_entry("session.inter_op.allow_spinning", "1")
|
| 91 |
-
session_opts.add_session_config_entry("session.set_denormal_as_zero", "1")
|
| 92 |
-
|
| 93 |
-
# Set the session_opts to be used by silero_vad
|
| 94 |
-
# onnxruntime.capi._pybind_state.get_default_session_options(session_opts)
|
| 95 |
-
|
| 96 |
-
def load_audio(self, audio_path):
|
| 97 |
-
"""
|
| 98 |
-
Load audio file and prepare it for VAD processing.
|
| 99 |
|
| 100 |
-
|
| 101 |
-
|
|
|
|
|
|
|
| 102 |
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
"""
|
| 106 |
-
self.logger.info(f"Loading audio from {audio_path}")
|
| 107 |
-
audio_segment = AudioSegment.from_file(audio_path)
|
| 108 |
-
audio_segment = audio_segment.set_channels(1).set_frame_rate(self.sample_rate)
|
| 109 |
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
|
|
|
|
|
|
| 113 |
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
def model(self):
|
| 119 |
-
return self.silero_vad
|
| 120 |
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
|
|
|
| 124 |
|
| 125 |
-
|
| 126 |
-
timestamps (list): List of (start, end) tuples
|
| 127 |
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
if (end - start) >= self.min_speech_duration]
|
| 134 |
-
|
| 135 |
-
# Fuse timestamps in two passes for better merging
|
| 136 |
-
fused_timestamps_1st = []
|
| 137 |
-
for start, end in filtered_timestamps:
|
| 138 |
-
if fused_timestamps_1st and (start - fused_timestamps_1st[-1][1] <= self.fusion_threshold):
|
| 139 |
-
fused_timestamps_1st[-1] = (fused_timestamps_1st[-1][0], end)
|
| 140 |
-
else:
|
| 141 |
-
fused_timestamps_1st.append((start, end))
|
| 142 |
|
| 143 |
-
|
| 144 |
-
for start, end in fused_timestamps_1st:
|
| 145 |
-
if fused_timestamps_2nd and (start - fused_timestamps_2nd[-1][1] <= self.fusion_threshold):
|
| 146 |
-
fused_timestamps_2nd[-1] = (fused_timestamps_2nd[-1][0], end)
|
| 147 |
-
else:
|
| 148 |
-
fused_timestamps_2nd.append((start, end))
|
| 149 |
|
| 150 |
-
|
|
|
|
| 151 |
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
| 155 |
|
| 156 |
-
|
| 157 |
-
|
| 158 |
|
| 159 |
-
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
|
| 164 |
-
|
| 165 |
-
milliseconds = int((td_sec - total_seconds) * 1000)
|
| 166 |
-
hours = total_seconds // 3600
|
| 167 |
-
minutes = (total_seconds % 3600) // 60
|
| 168 |
-
seconds = total_seconds % 60
|
| 169 |
-
return f"{hours:02}:{minutes:02}:{seconds:02}.{milliseconds:03}"
|
| 170 |
-
|
| 171 |
-
def detect_speech(self, audio:np.array):
|
| 172 |
-
"""
|
| 173 |
-
Run VAD on the audio file to detect speech segments.
|
| 174 |
|
| 175 |
-
|
| 176 |
-
|
| 177 |
|
| 178 |
-
|
| 179 |
-
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
|
| 186 |
-
model=self.silero_vad,
|
| 187 |
-
threshold=self.activate_threshold,
|
| 188 |
-
max_speech_duration_s=self.max_speech_duration,
|
| 189 |
-
min_speech_duration_ms=int(self.min_speech_duration * 1000),
|
| 190 |
-
min_silence_duration_ms=self.min_silence_duration,
|
| 191 |
-
return_seconds=True
|
| 192 |
-
)
|
| 193 |
-
|
| 194 |
-
# Convert to simple format and process
|
| 195 |
-
timestamps = [(item['start'], item['end']) for item in raw_timestamps]
|
| 196 |
-
processed_timestamps = self.process_timestamps(timestamps)
|
| 197 |
-
|
| 198 |
-
# Clean up
|
| 199 |
-
del audio
|
| 200 |
-
gc.collect()
|
| 201 |
-
|
| 202 |
-
self.logger.info(f"VAD completed in {time.time() - start_time:.3f} seconds")
|
| 203 |
-
return processed_timestamps
|
| 204 |
|
| 205 |
-
|
| 206 |
-
|
|
|
|
| 207 |
|
| 208 |
-
|
| 209 |
-
|
| 210 |
-
output_prefix (str): Prefix for output files
|
| 211 |
-
"""
|
| 212 |
-
# Save timestamps in seconds (VTT format)
|
| 213 |
-
seconds_path = f"{output_prefix}_timestamps_second.txt"
|
| 214 |
-
with open(seconds_path, "w", encoding='UTF-8') as file:
|
| 215 |
-
self.logger.info("Saving timestamps in seconds format")
|
| 216 |
-
for start, end in timestamps:
|
| 217 |
-
s_time = self.format_time(start)
|
| 218 |
-
e_time = self.format_time(end)
|
| 219 |
-
line = f"{s_time} --> {e_time}\n"
|
| 220 |
-
file.write(line)
|
| 221 |
-
|
| 222 |
-
# Save timestamps in sample indices
|
| 223 |
-
indices_path = f"{output_prefix}_timestamps_indices.txt"
|
| 224 |
-
with open(indices_path, "w", encoding='UTF-8') as file:
|
| 225 |
-
self.logger.info("Saving timestamps in indices format")
|
| 226 |
-
for start, end in timestamps:
|
| 227 |
-
line = f"{int(start * self.sample_rate)} --> {int(end * self.sample_rate)}\n"
|
| 228 |
-
file.write(line)
|
| 229 |
-
|
| 230 |
-
self.logger.info(f"Timestamps saved to {seconds_path} and {indices_path}")
|
| 231 |
-
|
| 232 |
-
def extract_speech_segments(self, audio_segment, timestamps):
|
| 233 |
-
"""
|
| 234 |
-
Extract speech segments from the audio and combine them into a single audio file.
|
| 235 |
|
| 236 |
-
|
| 237 |
-
|
|
|
|
|
|
|
|
|
|
| 238 |
|
| 239 |
-
|
| 240 |
-
|
| 241 |
-
|
| 242 |
-
|
| 243 |
-
|
|
|
|
|
|
|
|
|
|
| 244 |
|
| 245 |
-
|
| 246 |
-
|
| 247 |
-
# Convert seconds to milliseconds for pydub
|
| 248 |
-
start_ms = int(start * 1000)
|
| 249 |
-
end_ms = int(end * 1000)
|
| 250 |
|
| 251 |
-
# Ensure the end time does not exceed the length of the audio segment
|
| 252 |
-
if end_ms > len(audio_segment):
|
| 253 |
-
end_ms = len(audio_segment)
|
| 254 |
|
| 255 |
-
|
| 256 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 257 |
|
| 258 |
-
|
| 259 |
-
combined_speech = np.append(combined_speech, segment)
|
| 260 |
|
| 261 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 262 |
|
| 263 |
-
def
|
| 264 |
"""
|
| 265 |
-
|
|
|
|
| 266 |
|
| 267 |
-
|
| 268 |
-
|
| 269 |
"""
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 270 |
|
| 271 |
-
|
| 272 |
-
|
| 273 |
|
| 274 |
-
combined_speech = self.extract_speech_segments(audio_array, timestamps)
|
| 275 |
|
| 276 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from copy import deepcopy
|
| 2 |
+
from queue import Queue, Empty
|
| 3 |
+
from time import time
|
| 4 |
+
from config import VAD_MODEL_PATH
|
| 5 |
+
# from silero_vad import load_silero_vad
|
| 6 |
import numpy as np
|
| 7 |
import onnxruntime
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 8 |
|
| 9 |
+
class OnnxWrapper():
    """Thin wrapper around a VAD ONNX model (silero-style — see the commented
    ``from silero_vad import load_silero_vad`` import at the top of this file).

    Keeps the recurrent model state (``_state``), the trailing context samples
    (``_context``) and the last-seen sample rate / batch size between calls so
    the model can be driven chunk by chunk over a live stream.
    """

    def __init__(self, path, force_onnx_cpu=False):
        opts = onnxruntime.SessionOptions()
        # Single-threaded inference: each call processes one tiny window
        # (512 samples), so threading overhead would dominate.
        opts.inter_op_num_threads = 1
        opts.intra_op_num_threads = 1

        if force_onnx_cpu and 'CPUExecutionProvider' in onnxruntime.get_available_providers():
            self.session = onnxruntime.InferenceSession(path, providers=['CPUExecutionProvider'], sess_options=opts)
        else:
            self.session = onnxruntime.InferenceSession(path, sess_options=opts)

        self.reset_states()
        self.sample_rates = [16000]

    def _validate_input(self, x: np.ndarray, sr: int):
        """Normalize input to shape (batch, samples) at a supported rate.

        Mono 1-D input gets a batch axis; integer multiples of 16 kHz are
        decimated down to 16 kHz by striding. Raises ValueError for >2-D
        input, unsupported rates, or chunks shorter than 32 ms.
        """
        if x.ndim == 1:
            x = x[None]
        if x.ndim > 2:
            raise ValueError(f"Too many dimensions for input audio chunk {x.ndim}")

        if sr != 16000 and (sr % 16000 == 0):
            step = sr // 16000
            x = x[:, ::step]
            sr = 16000

        if sr not in self.sample_rates:
            raise ValueError(f"Supported sampling rates: {self.sample_rates} (or multiply of 16000)")
        # sr / len > 31.25  <=>  len < sr / 31.25 (i.e. under 32 ms of audio).
        if sr / x.shape[1] > 31.25:
            raise ValueError("Input audio chunk is too short")

        return x, sr

    def reset_states(self, batch_size=1):
        """Clear recurrent state before a new stream / batch size."""
        self._state = np.zeros((2, batch_size, 128)).astype(np.float32)
        # Lazily re-shaped to (batch, context_size) on the first __call__.
        self._context = np.zeros(0)
        self._last_sr = 0
        self._last_batch_size = 0

    def __call__(self, x, sr: int):
        """Run one fixed-size window through the model; returns the raw
        model output (speech probabilities, one row per batch item).

        The window must be exactly 512 samples at 16 kHz (256 at 8 kHz).
        State is reset automatically whenever the sample rate or batch
        size changes between calls.
        """
        x, sr = self._validate_input(x, sr)
        num_samples = 512 if sr == 16000 else 256

        if x.shape[-1] != num_samples:
            raise ValueError(
                f"Provided number of samples is {x.shape[-1]} (Supported values: 256 for 8000 sample rate, 512 for 16000)")

        batch_size = x.shape[0]
        context_size = 64 if sr == 16000 else 32

        if not self._last_batch_size:
            self.reset_states(batch_size)
        if (self._last_sr) and (self._last_sr != sr):
            self.reset_states(batch_size)
        if (self._last_batch_size) and (self._last_batch_size != batch_size):
            self.reset_states(batch_size)

        if not len(self._context):
            self._context = np.zeros((batch_size, context_size)).astype(np.float32)

        # Prepend the tail of the previous window so the model sees
        # continuous audio across call boundaries.
        x = np.concatenate([self._context, x], axis=1)
        if sr in [8000, 16000]:
            ort_inputs = {'input': x, 'state': self._state, 'sr': np.array(sr, dtype='int64')}
            ort_outs = self.session.run(None, ort_inputs)
            out, state = ort_outs
            self._state = state
        else:
            raise ValueError()

        # Remember the last context_size samples for the next call.
        self._context = x[..., -context_size:]
        self._last_sr = sr
        self._last_batch_size = batch_size

        # out = torch.from_numpy(out)
        return out

    def audio_forward(self, audio: np.ndarray, sr: int):
        """Run a whole utterance through the model window by window.

        Pads the tail with zeros to a multiple of the window size and
        returns the per-window outputs concatenated along axis 1.
        """
        outs = []
        x, sr = self._validate_input(audio, sr)
        self.reset_states()
        num_samples = 512 if sr == 16000 else 256

        if x.shape[1] % num_samples:
            pad_num = num_samples - (x.shape[1] % num_samples)
            x = np.pad(x, ((0, 0), (0, pad_num)), 'constant', constant_values=(0.0, 0.0))

        for i in range(0, x.shape[1], num_samples):
            wavs_batch = x[:, i:i + num_samples]
            out_chunk = self.__call__(wavs_batch, sr)
            outs.append(out_chunk)

        stacked = np.concatenate(outs, axis=1)
        return stacked
|
|
|
|
|
|
|
|
|
|
| 103 |
|
|
|
|
|
|
|
|
|
|
| 104 |
|
| 105 |
+
class VADIteratorOnnx:
    """Streaming voice-activity iterator over fixed-size windows.

    Feed consecutive audio windows through ``__call__``; it returns
    ``{'start': sample}`` when speech begins, ``{'end': sample}`` when speech
    ends (after ``min_silence_duration_ms`` of silence, or when the segment
    exceeds ``max_speech_duration_s``), and ``None`` otherwise.
    """

    def __init__(self,
                 threshold: float = 0.5,
                 sampling_rate: int = 16000,
                 min_silence_duration_ms: int = 100,
                 max_speech_duration_s: float = float('inf'),
                 ):
        self.model = OnnxWrapper(VAD_MODEL_PATH, True)
        self.threshold = threshold
        self.sampling_rate = sampling_rate

        if sampling_rate not in [8000, 16000]:
            raise ValueError('VADIterator does not support sampling rates other than [8000, 16000]')

        self.min_silence_samples = sampling_rate * min_silence_duration_ms / 1000
        self.max_speech_samples = int(sampling_rate * max_speech_duration_s)
        # self.speech_pad_samples = sampling_rate * speech_pad_ms / 1000
        self.reset_states()

    def reset_states(self):
        """Clear the model state and all detection bookkeeping."""
        self.model.reset_states()
        self.triggered = False
        self.temp_end = 0
        self.current_sample = 0
        self.start = 0

    def __call__(self, x: np.ndarray, return_seconds=False):
        """
        x: np.ndarray
            audio chunk (see examples in repo)

        return_seconds: bool (default - False)
            whether return timestamps in seconds (default - samples)
        """

        window_size_samples = 512 if self.sampling_rate == 16000 else 256
        x = x[:window_size_samples]
        if len(x) < window_size_samples:
            # BUG FIX: x is 1-D at this point, so the pad width must be 1-D
            # too. The previous 2-D spec ((0, 0), (0, n)) raised ValueError
            # for every short trailing chunk.
            x = np.pad(x, (0, window_size_samples - len(x)), 'constant', constant_values=0.0)

        self.current_sample += window_size_samples

        speech_prob = self.model(x, self.sampling_rate)[0, 0]
        # print(f"{self.current_sample/self.sampling_rate:.2f}: {speech_prob}")

        # Speech resumed before the silence run was long enough: cancel the
        # tentative end mark.
        if (speech_prob >= self.threshold) and self.temp_end:
            self.temp_end = 0

        if (speech_prob >= self.threshold) and not self.triggered:
            self.triggered = True
            speech_start = max(0, self.current_sample - window_size_samples)
            self.start = speech_start
            return {'start': int(speech_start) if not return_seconds else round(speech_start / self.sampling_rate, 1)}

        # Segment grew past the configured maximum: force an end and start a
        # new segment from here (triggered stays True).
        if (speech_prob >= self.threshold) and self.current_sample - self.start >= self.max_speech_samples:
            if self.temp_end:
                self.temp_end = 0
            self.start = self.current_sample
            return {'end': int(self.current_sample) if not return_seconds else round(self.current_sample / self.sampling_rate, 1)}

        # Hysteresis: only treat it as silence once the probability drops
        # clearly below the activation threshold.
        if (speech_prob < self.threshold - 0.15) and self.triggered:
            if not self.temp_end:
                self.temp_end = self.current_sample
            if self.current_sample - self.temp_end < self.min_silence_samples:
                return None
            else:
                speech_end = self.temp_end - window_size_samples
                self.temp_end = 0
                self.triggered = False
                return {'end': int(speech_end) if not return_seconds else round(speech_end / self.sampling_rate, 1)}

        return None
|
| 178 |
+
|
| 179 |
|
|
|
|
| 180 |
|
| 181 |
+
class VadV2:
    """Streaming VAD with sample buffering and segment padding.

    Wraps ``VADIteratorOnnx`` and keeps a rolling ``audio_buffer`` of raw
    samples so that, when a complete speech segment is detected, the segment
    audio (plus ``speech_pad_ms`` of leading/trailing pad) can be returned
    directly.

    ``offset`` is the absolute sample index of ``audio_buffer[0]`` within the
    whole stream; ``start``/``end`` are absolute sample indices reported by
    the iterator. Calling the instance with ``x=None`` flushes any
    in-progress segment.
    """

    def __init__(self,
                 threshold: float = 0.5,
                 sampling_rate: int = 16000,
                 min_silence_duration_ms: int = 100,
                 speech_pad_ms: int = 30,
                 max_speech_duration_s: float = float('inf')):
        # self.vad_iterator = VADIterator(threshold, sampling_rate, min_silence_duration_ms)
        self.vad_iterator = VADIteratorOnnx(threshold, sampling_rate, min_silence_duration_ms, max_speech_duration_s)
        self.speech_pad_samples = int(sampling_rate * speech_pad_ms / 1000)
        self.sampling_rate = sampling_rate
        self.audio_buffer = np.array([], dtype=np.float32)
        self.start = 0
        self.end = 0
        self.offset = 0
        # The pad must fit inside the guaranteed silence gap, otherwise one
        # segment's trailing pad could overlap the next segment's audio.
        assert speech_pad_ms <= min_silence_duration_ms, "speech_pad_ms should be less than min_silence_duration_ms"
        self.max_speech_samples = int(sampling_rate * max_speech_duration_s)

        # Consecutive silent chunks counter; the threshold equals the number
        # of 512-sample chunks in 60 seconds, i.e. the buffer is trimmed
        # after ~60 s of uninterrupted silence to bound memory.
        self.silence_chunk_size = 0
        self.silence_chunk_threshold = 60 / (512 / self.sampling_rate)

    def reset(self):
        """Drop all buffered audio and detection state."""
        self.audio_buffer = np.array([], dtype=np.float32)
        self.start = 0
        self.end = 0
        self.offset = 0
        self.vad_iterator.reset_states()

    def __call__(self, x: np.ndarray = None):
        """Feed one audio chunk; return a finished speech segment or None.

        Each call is expected to carry one iterator window (512 samples at
        16 kHz — assumption based on silence_chunk_threshold; TODO confirm
        callers always pass that size). ``None`` flushes the currently open
        segment (if any) and resets the detector.

        A returned segment is ``{'start': sec, 'end': sec, 'audio': ndarray}``.
        """
        if x is None:
            # Flush: emit whatever is buffered for the open segment.
            if self.start:
                start = max(self.offset, self.start - self.speech_pad_samples)
                end = self.offset + len(self.audio_buffer)
                start_ts = round(start / self.sampling_rate, 1)
                end_ts = round(end / self.sampling_rate, 1)
                audio_data = self.audio_buffer[start - self.offset: end - self.offset]
                result = {
                    "start": start_ts,
                    "end": end_ts,
                    "audio": audio_data,
                }
            else:
                result = None
            self.reset()
            return result

        self.audio_buffer = np.append(self.audio_buffer, deepcopy(x))

        result = self.vad_iterator(x)
        if result is not None:
            # self.start = result.get('start', self.start)
            # self.end = result.get('end', self.end)
            self.silence_chunk_size = 0

            if 'start' in result:
                self.start = result['start']
            if 'end' in result:
                self.end = result['end']
        else:
            self.silence_chunk_size += 1

        # While no speech has started, keep only the last speech_pad_samples
        # so a future segment still gets its full leading pad.
        if self.start == 0 and len(self.audio_buffer) > self.speech_pad_samples:
            self.offset += len(self.audio_buffer) - self.speech_pad_samples
            self.audio_buffer = self.audio_buffer[-self.speech_pad_samples:]

        # Long-running silence (~60 s of silent chunks): trim the buffer the
        # same way to bound memory.
        if self.silence_chunk_size >= self.silence_chunk_threshold:
            self.offset += len(self.audio_buffer) - self.speech_pad_samples
            self.audio_buffer = self.audio_buffer[-self.speech_pad_samples:]
            self.silence_chunk_size = 0

        # A complete [start, end] segment is available: slice it out with pad
        # on both sides (numpy slicing clamps an end past the buffer), drop
        # the consumed samples, and report it.
        if self.end > self.start:
            start = max(self.offset, self.start - self.speech_pad_samples)
            end = self.end + self.speech_pad_samples
            start_ts = round(start / self.sampling_rate, 1)
            end_ts = round(end / self.sampling_rate, 1)
            audio_data = self.audio_buffer[start - self.offset: end - self.offset]
            self.audio_buffer = self.audio_buffer[self.end - self.offset:]
            self.offset = self.end
            self.start = self.end
            # self.start = 0
            self.end = 0
            result = {
                "start": start_ts,
                "end": end_ts,
                "audio": audio_data,
            }

            return result
        return None
|
| 270 |
+
|
| 271 |
+
|
| 272 |
+
|
| 273 |
+
class VadProcessor:
    """Batch front-end for VadV2.

    Slices an audio buffer into 512-sample windows, feeds them through the
    streaming detector and glues the detected speech audio back together.
    """

    def __init__(
            self,
            prob_threshold=0.5,
            silence_s=0.3,
            cache_s=0.25,
            sr=16000
    ):
        self.prob_thres = prob_threshold
        self.cache_s = cache_s
        self.sr = sr
        self.silence_s = silence_s

        # VadV2 takes its silence / pad durations in milliseconds.
        self.vad = VadV2(self.prob_thres, self.sr, self.silence_s * 1000, self.cache_s * 1000, max_speech_duration_s=15)

    def process_audio(self, audio_buffer: np.ndarray):
        """Return the concatenation of all speech samples found in audio_buffer."""
        window = 512
        # Seed with an empty float32 array so an all-silence buffer yields
        # an empty float32 result.
        pieces = [np.array([], np.float32)]
        for pos in range(0, len(audio_buffer), window):
            detection = self.vad(audio_buffer[pos:pos + window])
            if detection:
                pieces.append(detection['audio'])
        return np.concatenate(pieces)
|
transcribe/helpers/whisper.py
CHANGED
|
@@ -9,10 +9,14 @@ logger = getLogger(__name__)
|
|
| 9 |
|
| 10 |
class WhisperCPP:
|
| 11 |
|
| 12 |
-
def __init__(self, warmup=True) -> None:
|
| 13 |
models_dir = config.MODEL_DIR.as_posix()
|
|
|
|
|
|
|
|
|
|
|
|
|
| 14 |
self.model = Model(
|
| 15 |
-
model=
|
| 16 |
models_dir=models_dir,
|
| 17 |
print_realtime=False,
|
| 18 |
print_progress=False,
|
|
@@ -47,9 +51,9 @@ class WhisperCPP:
|
|
| 47 |
audio_buffer,
|
| 48 |
initial_prompt=prompt,
|
| 49 |
language=language,
|
| 50 |
-
token_timestamps=True,
|
| 51 |
# split_on_word=True,
|
| 52 |
-
max_len=max_len
|
| 53 |
)
|
| 54 |
return output
|
| 55 |
except Exception as e:
|
|
|
|
| 9 |
|
| 10 |
class WhisperCPP:
|
| 11 |
|
| 12 |
+
def __init__(self, source_lange: str='en', warmup=True) -> None:
|
| 13 |
models_dir = config.MODEL_DIR.as_posix()
|
| 14 |
+
if source_lange == "zh":
|
| 15 |
+
whisper_model = config.WHISPER_MODEL_ZH
|
| 16 |
+
else:
|
| 17 |
+
whisper_model = config.WHISPER_MODEL_EN
|
| 18 |
self.model = Model(
|
| 19 |
+
model=whisper_model,
|
| 20 |
models_dir=models_dir,
|
| 21 |
print_realtime=False,
|
| 22 |
print_progress=False,
|
|
|
|
| 51 |
audio_buffer,
|
| 52 |
initial_prompt=prompt,
|
| 53 |
language=language,
|
| 54 |
+
# token_timestamps=True,
|
| 55 |
# split_on_word=True,
|
| 56 |
+
# max_len=max_len
|
| 57 |
)
|
| 58 |
return output
|
| 59 |
except Exception as e:
|
transcribe/pipelines/__init__.py
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
|
| 2 |
from .pipe_translate import TranslatePipe, Translate7BPipe
|
| 3 |
-
from .pipe_whisper import WhisperPipe
|
| 4 |
from .pipe_vad import VadPipe
|
| 5 |
from .base import MetaItem
|
|
|
|
| 1 |
|
| 2 |
from .pipe_translate import TranslatePipe, Translate7BPipe
|
| 3 |
+
from .pipe_whisper import WhisperPipe, WhisperChinese
|
| 4 |
from .pipe_vad import VadPipe
|
| 5 |
from .base import MetaItem
|
transcribe/pipelines/pipe_translate.py
CHANGED
|
@@ -35,3 +35,6 @@ class Translate7BPipe(TranslatePipe):
|
|
| 35 |
if cls.translator is None:
|
| 36 |
cls.translator = QwenTranslator(LLM_LARGE_MODEL_PATH, LLM_SYS_PROMPT_EN, LLM_SYS_PROMPT_ZH)
|
| 37 |
|
|
|
|
|
|
|
|
|
|
|
|
| 35 |
if cls.translator is None:
|
| 36 |
cls.translator = QwenTranslator(LLM_LARGE_MODEL_PATH, LLM_SYS_PROMPT_EN, LLM_SYS_PROMPT_ZH)
|
| 37 |
|
| 38 |
+
|
| 39 |
+
|
| 40 |
+
|
transcribe/pipelines/pipe_vad.py
CHANGED
|
@@ -1,99 +1,41 @@
|
|
| 1 |
|
| 2 |
from .base import MetaItem, BasePipe
|
| 3 |
-
from ..helpers.vadprocessor import
|
| 4 |
import numpy as np
|
| 5 |
from silero_vad import get_speech_timestamps
|
| 6 |
-
import torch
|
| 7 |
from typing import List
|
| 8 |
-
|
| 9 |
|
| 10 |
-
|
| 11 |
-
chunks = []
|
| 12 |
-
silent_samples = int(0.3 * sample_rate) # 300ms 的静音样本数
|
| 13 |
-
silence = torch.zeros(silent_samples) # 创建300ms的静音
|
| 14 |
-
|
| 15 |
-
for i in range(len(tss)):
|
| 16 |
-
# 先添加当前语音片段
|
| 17 |
-
chunks.append(wav[tss[i]['start']: tss[i]['end']])
|
| 18 |
-
|
| 19 |
-
# 如果不是最后一个片段,且与下一个片段间隔大于100ms,则添加静音
|
| 20 |
-
if i < len(tss) - 1:
|
| 21 |
-
gap = tss[i+1]['start'] - tss[i]['end']
|
| 22 |
-
if gap > 0.1 * sample_rate: # 判断间隔是否大于100ms
|
| 23 |
-
chunks.append(silence) # 添加300ms静音
|
| 24 |
-
|
| 25 |
-
return torch.cat(chunks)
|
| 26 |
|
| 27 |
-
def collect_chunks_improved(tss: List[dict], wav: torch.Tensor, sample_rate: int = 16000):
|
| 28 |
-
chunks = []
|
| 29 |
-
silent_samples = int(0.3 * sample_rate) # 300ms 的静音样本数
|
| 30 |
-
silence = torch.zeros(silent_samples) # 创建300ms的静音
|
| 31 |
-
min_gap_samples = int(0.1 * sample_rate) # 最小间隔阈值 (100ms)
|
| 32 |
-
|
| 33 |
-
# 对时间戳进行简单的平滑处理
|
| 34 |
-
smoothed_tss = []
|
| 35 |
-
for i, ts in enumerate(tss):
|
| 36 |
-
if i > 0 and ts['start'] - tss[i-1]['end'] < 0.02 * sample_rate: # 如果间隔小于20ms,认为是连续的
|
| 37 |
-
smoothed_tss[-1]['end'] = ts['end'] # 合并到前一个片段
|
| 38 |
-
else:
|
| 39 |
-
smoothed_tss.append(ts)
|
| 40 |
-
|
| 41 |
-
for i in range(len(smoothed_tss)):
|
| 42 |
-
# 添加当前语音片段
|
| 43 |
-
chunks.append(wav[smoothed_tss[i]['start']: smoothed_tss[i]['end']])
|
| 44 |
-
|
| 45 |
-
# 如果不是最后一个片段,且与下一个片段间隔大于阈值,则添加静音
|
| 46 |
-
if i < len(smoothed_tss) - 1:
|
| 47 |
-
gap = smoothed_tss[i+1]['start'] - smoothed_tss[i]['end']
|
| 48 |
-
if gap > min_gap_samples:
|
| 49 |
-
# 根据间隔大小动态调整静音长度,但最大不超过300ms
|
| 50 |
-
silence_length = min(gap // 2, silent_samples)
|
| 51 |
-
chunks.append(torch.zeros(silence_length))
|
| 52 |
-
|
| 53 |
-
return torch.cat(chunks)
|
| 54 |
|
| 55 |
class VadPipe(BasePipe):
|
| 56 |
-
|
| 57 |
sample_rate = 16000
|
| 58 |
window_size_samples = 512
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 59 |
|
| 60 |
|
| 61 |
@classmethod
|
| 62 |
def init(cls):
|
| 63 |
-
if cls.
|
| 64 |
-
cls.
|
| 65 |
-
activate_threshold=0.45, # 降低以捕获更多音频
|
| 66 |
-
fusion_threshold=0.45, # 提高以更好地融合语音片段
|
| 67 |
-
min_speech_duration=0.2, # 略微降低以捕获短音节
|
| 68 |
-
max_speech_duration=20, # 保持不变
|
| 69 |
-
min_silence_duration=300, # 增加到300毫秒,允许说话间的自然停顿
|
| 70 |
-
sample_rate=cls.sample_rate # 采样率,音频信号的采样频率
|
| 71 |
-
)
|
| 72 |
-
cls.vac = FixedVADIterator(cls.model.silero_vad, sampling_rate=cls.sample_rate,)
|
| 73 |
-
cls.vac.reset_states()
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
def get_previous_buffer(self):
|
| 77 |
-
if len(self.previous_buffer) == 2:
|
| 78 |
-
return self.previous_buffer[-1]
|
| 79 |
-
return np.array([], dtype=np.float32)
|
| 80 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 81 |
|
| 82 |
# def reduce_noise(self, data):
|
| 83 |
# return nr.reduce_noise(y=data, sr=self.sample_rate)
|
| 84 |
-
|
| 85 |
|
| 86 |
-
|
| 87 |
-
source_audio = in_data.source_audio
|
| 88 |
-
source_audio = np.frombuffer(source_audio, dtype=np.float32)
|
| 89 |
-
# source_audio = self.reduce_noise(source_audio)
|
| 90 |
-
send_audio = b""
|
| 91 |
-
speech_timestamps = get_speech_timestamps(torch.Tensor(source_audio), self.model.silero_vad, sampling_rate=16000)
|
| 92 |
-
|
| 93 |
-
if speech_timestamps:
|
| 94 |
-
send_audio = collect_chunks_improved(speech_timestamps, torch.Tensor(source_audio))
|
| 95 |
-
send_audio = send_audio.numpy()
|
| 96 |
-
in_data.audio = send_audio
|
| 97 |
-
# send_audio = self.reduce_noise(send_audio).tobytes()
|
| 98 |
-
in_data.source_audio = b""
|
| 99 |
-
return in_data
|
|
|
|
| 1 |
|
| 2 |
from .base import MetaItem, BasePipe
|
| 3 |
+
from ..helpers.vadprocessor import VadV2
|
| 4 |
import numpy as np
|
| 5 |
from silero_vad import get_speech_timestamps
|
|
|
|
| 6 |
from typing import List
|
| 7 |
+
import logging
|
| 8 |
|
| 9 |
+
# import noisereduce as nr
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 10 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 11 |
|
| 12 |
class VadPipe(BasePipe):
    """Pipeline stage that runs the streaming VAD over incoming audio chunks."""
    vac = None
    sample_rate = 16000
    window_size_samples = 512
    chunk_size = 512
    # BUG FIX: these were previously written with trailing commas, which made
    # them 1-tuples ((0.5,), (0.5,), (0.25,)) — VadV2 then received tuples
    # instead of numbers for its threshold/duration arguments.
    prob_threshold = 0.5
    silence_s = 0.5
    cache_s = 0.25

    @classmethod
    def init(cls):
        # Lazily build one shared detector per process.
        if cls.vac is None:
            cls.vac = VadV2(cls.prob_threshold, cls.sample_rate, cls.silence_s * 1000, cls.cache_s * 1000, max_speech_duration_s=15)

    def process(self, in_data: MetaItem) -> MetaItem:
        """Run VAD on the raw source audio; put detected speech in in_data.audio.

        BUG FIX: np.frombuffer defaults to float64, but the stream carries raw
        float32 PCM (downstream code decodes it with dtype=np.float32), so the
        dtype must be given explicitly.
        """
        audio_buffer = np.frombuffer(in_data.source_audio, dtype=np.float32)
        vad_audio = self.vac(audio_buffer)
        if vad_audio:
            in_data.audio = vad_audio['audio']
        else:
            # No finished segment yet: pass empty audio downstream.
            in_data.audio = b""
        return in_data

    # def reduce_noise(self, data):
    #     return nr.reduce_noise(y=data, sr=self.sample_rate)
|
|
|
|
| 40 |
|
| 41 |
+
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
transcribe/pipelines/pipe_whisper.py
CHANGED
|
@@ -7,16 +7,18 @@ class WhisperPipe(BasePipe):
|
|
| 7 |
whisper = None
|
| 8 |
|
| 9 |
|
|
|
|
| 10 |
@classmethod
|
| 11 |
def init(cls):
|
| 12 |
if cls.whisper is None:
|
|
|
|
| 13 |
cls.whisper = WhisperCPP()
|
| 14 |
-
|
| 15 |
|
| 16 |
def process(self, in_data: MetaItem) -> MetaItem:
|
| 17 |
audio_data = in_data.audio
|
| 18 |
source_language = in_data.source_language
|
| 19 |
-
segments = self.whisper.transcribe(audio_data, source_language)
|
| 20 |
texts = "".join([s.text for s in segments])
|
| 21 |
in_data.segments = [Segment(t0=s.t0, t1=s.t1, text=self.filter_chinese_printable(s.text)) for s in segments]
|
| 22 |
in_data.transcribe_content = texts
|
|
@@ -30,3 +32,11 @@ class WhisperPipe(BasePipe):
|
|
| 30 |
if unicodedata.category(char) != 'Cc': # 不可打印字符的分类为 'Cc'
|
| 31 |
printable.append(char)
|
| 32 |
return ''.join(printable).strip()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 7 |
whisper = None
|
| 8 |
|
| 9 |
|
| 10 |
+
|
| 11 |
@classmethod
|
| 12 |
def init(cls):
|
| 13 |
if cls.whisper is None:
|
| 14 |
+
# cls.zh_whisper = WhisperCPP(source_lange='zh')
|
| 15 |
cls.whisper = WhisperCPP()
|
| 16 |
+
|
| 17 |
|
| 18 |
def process(self, in_data: MetaItem) -> MetaItem:
|
| 19 |
audio_data = in_data.audio
|
| 20 |
source_language = in_data.source_language
|
| 21 |
+
segments = self.whisper.transcribe(audio_data, source_language)
|
| 22 |
texts = "".join([s.text for s in segments])
|
| 23 |
in_data.segments = [Segment(t0=s.t0, t1=s.t1, text=self.filter_chinese_printable(s.text)) for s in segments]
|
| 24 |
in_data.transcribe_content = texts
|
|
|
|
| 32 |
if unicodedata.category(char) != 'Cc': # 不可打印字符的分类为 'Cc'
|
| 33 |
printable.append(char)
|
| 34 |
return ''.join(printable).strip()
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
class WhisperChinese(WhisperPipe):
    """WhisperPipe variant that loads the Chinese fine-tuned whisper model."""
    @classmethod
    def init(cls):
        if cls.whisper is None:
            # `source_lange` (sic) matches the actual parameter name of
            # WhisperCPP.__init__ — NOTE(review): rename to source_language
            # in both places in a coordinated change.
            cls.whisper = WhisperCPP(source_lange='zh')
|
transcribe/strategy.py
CHANGED
|
@@ -111,7 +111,7 @@ class TranscriptChunk:
|
|
| 111 |
return 0
|
| 112 |
|
| 113 |
score = self._calculate_similarity(self.join(), chunk.join())
|
| 114 |
-
logger.debug(f"Compare: {self.join()} vs {chunk.join()} : {score}")
|
| 115 |
return score
|
| 116 |
|
| 117 |
def only_punctuation(self)->bool:
|
|
|
|
| 111 |
return 0
|
| 112 |
|
| 113 |
score = self._calculate_similarity(self.join(), chunk.join())
|
| 114 |
+
# logger.debug(f"Compare: {self.join()} vs {chunk.join()} : {score}")
|
| 115 |
return score
|
| 116 |
|
| 117 |
def only_punctuation(self)->bool:
|
transcribe/translatepipes.py
CHANGED
|
@@ -1,4 +1,4 @@
|
|
| 1 |
-
from transcribe.pipelines import WhisperPipe, TranslatePipe, MetaItem,
|
| 2 |
import multiprocessing as mp
|
| 3 |
import config
|
| 4 |
|
|
@@ -11,14 +11,18 @@ class TranslatePipes:
|
|
| 11 |
# self.result_queue = mp.Queue()
|
| 12 |
|
| 13 |
# whisper 转录
|
| 14 |
-
self.
|
|
|
|
| 15 |
|
| 16 |
# llm 翻译
|
| 17 |
-
self._translate_pipe = self._launch_process(TranslatePipe())
|
| 18 |
|
| 19 |
self._translate_7b_pipe = self._launch_process(Translate7BPipe())
|
| 20 |
# vad
|
| 21 |
-
self._vad_pipe = self._launch_process(VadPipe())
|
|
|
|
|
|
|
|
|
|
| 22 |
|
| 23 |
def _launch_process(self, process_obj):
|
| 24 |
process_obj.daemon = True
|
|
@@ -26,9 +30,10 @@ class TranslatePipes:
|
|
| 26 |
return process_obj
|
| 27 |
|
| 28 |
def wait_ready(self):
|
| 29 |
-
self.
|
| 30 |
-
self.
|
| 31 |
-
self.
|
|
|
|
| 32 |
self._translate_7b_pipe.wait()
|
| 33 |
|
| 34 |
def translate(self, text, src_lang, dst_lang) -> MetaItem:
|
|
@@ -45,14 +50,20 @@ class TranslatePipes:
|
|
| 45 |
transcribe_content=text,
|
| 46 |
source_language=src_lang,
|
| 47 |
destination_language=dst_lang)
|
| 48 |
-
self.
|
| 49 |
-
return self.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 50 |
|
| 51 |
|
| 52 |
def transcrible(self, audio_buffer:bytes, src_lang: str) -> MetaItem:
|
|
|
|
| 53 |
item = MetaItem(audio=audio_buffer, source_language=src_lang)
|
| 54 |
-
|
| 55 |
-
return
|
| 56 |
|
| 57 |
def voice_detect(self, audio_buffer:bytes) -> MetaItem:
|
| 58 |
item = MetaItem(source_audio=audio_buffer)
|
|
|
|
| 1 |
+
from transcribe.pipelines import WhisperPipe, TranslatePipe, MetaItem, WhisperChinese, Translate7BPipe
|
| 2 |
import multiprocessing as mp
|
| 3 |
import config
|
| 4 |
|
|
|
|
| 11 |
# self.result_queue = mp.Queue()
|
| 12 |
|
| 13 |
# whisper 转录
|
| 14 |
+
self._whisper_pipe_en = self._launch_process(WhisperPipe())
|
| 15 |
+
self._whisper_pipe_zh = self._launch_process(WhisperChinese())
|
| 16 |
|
| 17 |
# llm 翻译
|
| 18 |
+
# self._translate_pipe = self._launch_process(TranslatePipe())
|
| 19 |
|
| 20 |
self._translate_7b_pipe = self._launch_process(Translate7BPipe())
|
| 21 |
# vad
|
| 22 |
+
# self._vad_pipe = self._launch_process(VadPipe())
|
| 23 |
+
|
| 24 |
+
# def reset(self):
|
| 25 |
+
# self._vad_pipe.reset()
|
| 26 |
|
| 27 |
def _launch_process(self, process_obj):
|
| 28 |
process_obj.daemon = True
|
|
|
|
| 30 |
return process_obj
|
| 31 |
|
| 32 |
def wait_ready(self):
|
| 33 |
+
self._whisper_pipe_zh.wait()
|
| 34 |
+
self._whisper_pipe_en.wait()
|
| 35 |
+
# self._translate_pipe.wait()
|
| 36 |
+
# self._vad_pipe.wait()
|
| 37 |
self._translate_7b_pipe.wait()
|
| 38 |
|
| 39 |
def translate(self, text, src_lang, dst_lang) -> MetaItem:
|
|
|
|
| 50 |
transcribe_content=text,
|
| 51 |
source_language=src_lang,
|
| 52 |
destination_language=dst_lang)
|
| 53 |
+
self._translate_7b_pipe.input_queue.put(item)
|
| 54 |
+
return self._translate_7b_pipe.output_queue.get()
|
| 55 |
+
|
| 56 |
+
def get_whisper_model(self, lang:str='en'):
|
| 57 |
+
if lang == 'zh':
|
| 58 |
+
return self._whisper_pipe_zh
|
| 59 |
+
return self._whisper_pipe_en
|
| 60 |
|
| 61 |
|
| 62 |
def transcrible(self, audio_buffer:bytes, src_lang: str) -> MetaItem:
|
| 63 |
+
whisper_model = self.get_whisper_model(src_lang)
|
| 64 |
item = MetaItem(audio=audio_buffer, source_language=src_lang)
|
| 65 |
+
whisper_model.input_queue.put(item)
|
| 66 |
+
return whisper_model.output_queue.get()
|
| 67 |
|
| 68 |
def voice_detect(self, audio_buffer:bytes) -> MetaItem:
|
| 69 |
item = MetaItem(source_audio=audio_buffer)
|
transcribe/whisper_llm_serve.py
CHANGED
|
@@ -8,44 +8,48 @@ from typing import List, Optional, Iterator, Tuple, Any
|
|
| 8 |
import asyncio
|
| 9 |
import numpy as np
|
| 10 |
import config
|
| 11 |
-
|
| 12 |
from api_model import TransResult, Message, DebugResult
|
| 13 |
-
|
| 14 |
from .utils import log_block, save_to_wave, TestDataWriter
|
| 15 |
from .translatepipes import TranslatePipes
|
| 16 |
from .strategy import (
|
| 17 |
TranscriptStabilityAnalyzer, TranscriptToken)
|
| 18 |
-
import
|
|
|
|
| 19 |
|
| 20 |
logger = getLogger("TranscriptionService")
|
| 21 |
|
| 22 |
|
| 23 |
-
class WhisperTranscriptionService
|
| 24 |
"""
|
| 25 |
Whisper语音转录服务类,处理音频流转录和翻译
|
| 26 |
"""
|
| 27 |
|
|
|
|
|
|
|
|
|
|
| 28 |
def __init__(self, websocket, pipe: TranslatePipes, language=None, dst_lang=None, client_uid=None):
|
| 29 |
-
|
| 30 |
self.source_language = language # 源语言
|
| 31 |
self.target_language = dst_lang # 目标翻译语言
|
| 32 |
-
|
| 33 |
# 转录结果稳定性管理
|
| 34 |
-
|
| 35 |
self._translate_pipe = pipe
|
| 36 |
|
| 37 |
# 音频处理相关
|
| 38 |
self.sample_rate = 16000
|
| 39 |
-
|
| 40 |
self.lock = threading.Lock()
|
| 41 |
self._frame_queue = queue.Queue()
|
| 42 |
-
|
| 43 |
|
| 44 |
# 文本分隔符,根据语言设置
|
| 45 |
self.text_separator = self._get_text_separator(language)
|
| 46 |
self.loop = asyncio.get_event_loop()
|
| 47 |
# 发送就绪状态
|
| 48 |
-
|
| 49 |
self._transcrible_analysis = None
|
| 50 |
# 启动处理线程
|
| 51 |
self._translate_thread_stop = threading.Event()
|
|
@@ -53,7 +57,8 @@ class WhisperTranscriptionService(ServeClientBase):
|
|
| 53 |
|
| 54 |
self.translate_thread = self._start_thread(self._transcription_processing_loop)
|
| 55 |
self.frame_processing_thread = self._start_thread(self._frame_processing_loop)
|
| 56 |
-
|
|
|
|
| 57 |
# for test
|
| 58 |
self._transcrible_time_cost = 0.
|
| 59 |
self._translate_time_cost = 0.
|
|
@@ -82,9 +87,9 @@ class WhisperTranscriptionService(ServeClientBase):
|
|
| 82 |
"""根据语言返回适当的文本分隔符"""
|
| 83 |
return "" if language == "zh" else " "
|
| 84 |
|
| 85 |
-
def send_ready_state(self) -> None:
|
| 86 |
"""发送服务就绪状态消息"""
|
| 87 |
-
self.websocket.send(json.dumps({
|
| 88 |
"uid": self.client_uid,
|
| 89 |
"message": self.SERVER_READY,
|
| 90 |
"backend": "whisper_transcription"
|
|
@@ -94,10 +99,10 @@ class WhisperTranscriptionService(ServeClientBase):
|
|
| 94 |
"""设置源语言和目标语言"""
|
| 95 |
self.source_language = source_lang
|
| 96 |
self.target_language = target_lang
|
| 97 |
-
self.text_separator = self._get_text_separator(source_lang)
|
| 98 |
-
self._transcrible_analysis = TranscriptStabilityAnalyzer(self.source_language, self.text_separator)
|
| 99 |
|
| 100 |
-
def
|
| 101 |
"""添加音频帧到处理队列"""
|
| 102 |
self._frame_queue.put(frame_np)
|
| 103 |
|
|
@@ -105,68 +110,21 @@ class WhisperTranscriptionService(ServeClientBase):
|
|
| 105 |
"""从队列获取音频帧并合并到缓冲区"""
|
| 106 |
while not self._frame_processing_thread_stop.is_set():
|
| 107 |
try:
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
|
|
|
|
|
|
| 116 |
except queue.Empty:
|
| 117 |
pass
|
| 118 |
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
with self.lock:
|
| 122 |
-
if self.frames_np is not None:
|
| 123 |
-
# self._c+= 1
|
| 124 |
-
frame = self.frames_np.copy()
|
| 125 |
-
processed_audio = self._translate_pipe.voice_detect(frame.tobytes())
|
| 126 |
-
self.frames_np = np.frombuffer(processed_audio.audio, dtype=np.float32).copy()
|
| 127 |
-
return self.frames_np.copy()
|
| 128 |
-
# if len(frame) > self.sample_rate:
|
| 129 |
-
# save_to_wave(f"{self._c}-org.wav", frame)
|
| 130 |
-
# save_to_wave(f"{self._c}-vad.wav", self.frames_np)
|
| 131 |
-
|
| 132 |
-
def _update_audio_buffer(self, offset: int) -> None:
|
| 133 |
-
"""从音频缓冲区中移除已处理的部分"""
|
| 134 |
-
with self.lock:
|
| 135 |
-
if self.frames_np is not None and offset > 0:
|
| 136 |
-
# self._c += 1
|
| 137 |
-
# before = self.frames_np.copy()
|
| 138 |
-
self.frames_np = self.frames_np[offset:]
|
| 139 |
-
# after = self.frames_np.copy()
|
| 140 |
-
# save_to_wave(f"./tests/{self._c}_before_cut_{offset}.wav", before)
|
| 141 |
-
# save_to_wave(f"./tests/{self._c}_cut.wav", before[:offset])
|
| 142 |
-
# save_to_wave(f"./tests/{self._c}_after_cut.wav", after)
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
def _get_audio_for_processing(self) -> Optional[np.ndarray]:
|
| 146 |
-
"""准备用于处理的音频块"""
|
| 147 |
-
# 应用VAD处理
|
| 148 |
-
frame_np = self._apply_voice_activity_detection()
|
| 149 |
-
# frame_np = self.frames_np.copy()
|
| 150 |
-
# 没有音频帧
|
| 151 |
-
if frame_np is None:
|
| 152 |
-
return None
|
| 153 |
-
|
| 154 |
-
frames = frame_np.copy()
|
| 155 |
-
|
| 156 |
-
# 音频过短时的处理
|
| 157 |
-
if len(frames) <= 10:
|
| 158 |
-
# 极短音频段,清空并返回None
|
| 159 |
-
# self._update_audio_buffer(len(frames))
|
| 160 |
-
return None
|
| 161 |
-
if len(frames) < self.sample_rate:
|
| 162 |
-
# 不足一秒的音频,补充静音
|
| 163 |
-
silence_audio = np.zeros((self.sample_rate + 1000,), dtype=np.float32)
|
| 164 |
-
silence_audio[-len(frames):] = frames
|
| 165 |
-
return silence_audio.copy()
|
| 166 |
-
|
| 167 |
-
return frames
|
| 168 |
-
|
| 169 |
-
def _transcribe_audio(self, audio_buffer: np.ndarray) -> List[TranscriptToken]:
|
| 170 |
"""转录音频并返回转录片段"""
|
| 171 |
log_block("Audio buffer length", f"{audio_buffer.shape[0]/self.sample_rate:.2f}", "s")
|
| 172 |
start_time = time.perf_counter()
|
|
@@ -175,14 +133,11 @@ class WhisperTranscriptionService(ServeClientBase):
|
|
| 175 |
segments = result.segments
|
| 176 |
time_diff = (time.perf_counter() - start_time)
|
| 177 |
logger.debug(f"📝 Transcrible Segments: {segments} ")
|
| 178 |
-
logger.debug(f"📝 Transcrible: {self.text_separator.join(seg.text for seg in segments)} ")
|
| 179 |
log_block("📝 Transcrible output", f"{self.text_separator.join(seg.text for seg in segments)}", "")
|
| 180 |
log_block("📝 Transcrible time", f"{time_diff:.3f}", "s")
|
| 181 |
self._transcrible_time_cost = round(time_diff, 3)
|
| 182 |
-
return
|
| 183 |
-
TranscriptToken(text=s.text, t0=s.t0, t1=s.t1)
|
| 184 |
-
for s in segments
|
| 185 |
-
]
|
| 186 |
|
| 187 |
def _translate_text(self, text: str) -> str:
|
| 188 |
"""将文本翻译为目标语言"""
|
|
@@ -216,40 +171,44 @@ class WhisperTranscriptionService(ServeClientBase):
|
|
| 216 |
self._translate_time_cost = round(time_diff, 3)
|
| 217 |
return translated_text
|
| 218 |
|
| 219 |
-
|
| 220 |
-
|
| 221 |
def _transcription_processing_loop(self) -> None:
|
| 222 |
"""主转录处理循环"""
|
| 223 |
-
c = 0
|
| 224 |
-
while not self._translate_thread_stop.is_set():
|
| 225 |
-
if self.exit:
|
| 226 |
-
logger.info("Exiting transcription thread")
|
| 227 |
-
break
|
| 228 |
|
| 229 |
-
|
| 230 |
-
|
| 231 |
-
|
| 232 |
-
logger.info("Waiting for audio data...")
|
| 233 |
-
continue
|
| 234 |
-
|
| 235 |
-
# 获取音频块进行处理
|
| 236 |
-
audio_buffer = self._get_audio_for_processing()
|
| 237 |
-
if audio_buffer is None:
|
| 238 |
time.sleep(0.2)
|
| 239 |
continue
|
| 240 |
-
|
| 241 |
-
|
| 242 |
-
|
| 243 |
# try:
|
| 244 |
-
|
| 245 |
-
|
| 246 |
-
|
| 247 |
-
|
|
|
|
| 248 |
self._send_result_to_client(result)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 249 |
|
| 250 |
# except Exception as e:
|
| 251 |
# logger.error(f"Error processing audio: {e}")
|
| 252 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 253 |
def _process_transcription_results(self, segments: List[TranscriptToken], audio_buffer: np.ndarray) -> Iterator[TransResult]:
|
| 254 |
"""
|
| 255 |
处理转录结果,生成翻译结果
|
|
|
|
| 8 |
import asyncio
|
| 9 |
import numpy as np
|
| 10 |
import config
|
| 11 |
+
|
| 12 |
from api_model import TransResult, Message, DebugResult
|
| 13 |
+
|
| 14 |
from .utils import log_block, save_to_wave, TestDataWriter
|
| 15 |
from .translatepipes import TranslatePipes
|
| 16 |
from .strategy import (
|
| 17 |
TranscriptStabilityAnalyzer, TranscriptToken)
|
| 18 |
+
from transcribe.helpers.vadprocessor import VadProcessor
|
| 19 |
+
from transcribe.pipelines import MetaItem
|
| 20 |
|
| 21 |
logger = getLogger("TranscriptionService")
|
| 22 |
|
| 23 |
|
| 24 |
+
class WhisperTranscriptionService:
|
| 25 |
"""
|
| 26 |
Whisper语音转录服务类,处理音频流转录和翻译
|
| 27 |
"""
|
| 28 |
|
| 29 |
+
SERVER_READY = "SERVER_READY"
|
| 30 |
+
DISCONNECT = "DISCONNECT"
|
| 31 |
+
|
| 32 |
def __init__(self, websocket, pipe: TranslatePipes, language=None, dst_lang=None, client_uid=None):
|
| 33 |
+
|
| 34 |
self.source_language = language # 源语言
|
| 35 |
self.target_language = dst_lang # 目标翻译语言
|
| 36 |
+
self.client_uid = client_uid
|
| 37 |
# 转录结果稳定性管理
|
| 38 |
+
self.websocket = websocket
|
| 39 |
self._translate_pipe = pipe
|
| 40 |
|
| 41 |
# 音频处理相关
|
| 42 |
self.sample_rate = 16000
|
| 43 |
+
|
| 44 |
self.lock = threading.Lock()
|
| 45 |
self._frame_queue = queue.Queue()
|
| 46 |
+
self._vad_frame_queue = queue.Queue()
|
| 47 |
|
| 48 |
# 文本分隔符,根据语言设置
|
| 49 |
self.text_separator = self._get_text_separator(language)
|
| 50 |
self.loop = asyncio.get_event_loop()
|
| 51 |
# 发送就绪状态
|
| 52 |
+
|
| 53 |
self._transcrible_analysis = None
|
| 54 |
# 启动处理线程
|
| 55 |
self._translate_thread_stop = threading.Event()
|
|
|
|
| 57 |
|
| 58 |
self.translate_thread = self._start_thread(self._transcription_processing_loop)
|
| 59 |
self.frame_processing_thread = self._start_thread(self._frame_processing_loop)
|
| 60 |
+
self._vad = VadProcessor()
|
| 61 |
+
self.row_number = 0
|
| 62 |
# for test
|
| 63 |
self._transcrible_time_cost = 0.
|
| 64 |
self._translate_time_cost = 0.
|
|
|
|
| 87 |
"""根据语言返回适当的文本分隔符"""
|
| 88 |
return "" if language == "zh" else " "
|
| 89 |
|
| 90 |
+
async def send_ready_state(self) -> None:
|
| 91 |
"""发送服务就绪状态消息"""
|
| 92 |
+
await self.websocket.send(json.dumps({
|
| 93 |
"uid": self.client_uid,
|
| 94 |
"message": self.SERVER_READY,
|
| 95 |
"backend": "whisper_transcription"
|
|
|
|
| 99 |
"""设置源语言和目标语言"""
|
| 100 |
self.source_language = source_lang
|
| 101 |
self.target_language = target_lang
|
| 102 |
+
# self.text_separator = self._get_text_separator(source_lang)
|
| 103 |
+
# self._transcrible_analysis = TranscriptStabilityAnalyzer(self.source_language, self.text_separator)
|
| 104 |
|
| 105 |
+
def add_frames(self, frame_np: np.ndarray) -> None:
|
| 106 |
"""添加音频帧到处理队列"""
|
| 107 |
self._frame_queue.put(frame_np)
|
| 108 |
|
|
|
|
| 110 |
"""从队列获取音频帧并合并到缓冲区"""
|
| 111 |
while not self._frame_processing_thread_stop.is_set():
|
| 112 |
try:
|
| 113 |
+
audio = self._frame_queue.get(timeout=0.1)
|
| 114 |
+
# save_to_wave(f"{self._c}_before_vad.wav", audio)
|
| 115 |
+
processed_audio = self._vad.process_audio(audio)
|
| 116 |
+
if processed_audio.shape[0] > 0:
|
| 117 |
+
# vad_processed_audio = processed_audio
|
| 118 |
+
# save_to_wave(f"{self._c}_after_vad.wav", processed_audio)
|
| 119 |
+
# vad_frame_obj = np.frombuffer(processed_audio.audio, dtype=np.float32)
|
| 120 |
+
logger.debug(f"Vad frame: {processed_audio.shape[0]/self.sample_rate:.2f}")
|
| 121 |
+
# apply vad speech check:
|
| 122 |
+
self._vad_frame_queue.put(processed_audio)
|
| 123 |
except queue.Empty:
|
| 124 |
pass
|
| 125 |
|
| 126 |
+
|
| 127 |
+
def _transcribe_audio(self, audio_buffer: np.ndarray)->MetaItem:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 128 |
"""转录音频并返回转录片段"""
|
| 129 |
log_block("Audio buffer length", f"{audio_buffer.shape[0]/self.sample_rate:.2f}", "s")
|
| 130 |
start_time = time.perf_counter()
|
|
|
|
| 133 |
segments = result.segments
|
| 134 |
time_diff = (time.perf_counter() - start_time)
|
| 135 |
logger.debug(f"📝 Transcrible Segments: {segments} ")
|
| 136 |
+
# logger.debug(f"📝 Transcrible: {self.text_separator.join(seg.text for seg in segments)} ")
|
| 137 |
log_block("📝 Transcrible output", f"{self.text_separator.join(seg.text for seg in segments)}", "")
|
| 138 |
log_block("📝 Transcrible time", f"{time_diff:.3f}", "s")
|
| 139 |
self._transcrible_time_cost = round(time_diff, 3)
|
| 140 |
+
return result
|
|
|
|
|
|
|
|
|
|
| 141 |
|
| 142 |
def _translate_text(self, text: str) -> str:
|
| 143 |
"""将文本翻译为目标语言"""
|
|
|
|
| 171 |
self._translate_time_cost = round(time_diff, 3)
|
| 172 |
return translated_text
|
| 173 |
|
|
|
|
|
|
|
| 174 |
def _transcription_processing_loop(self) -> None:
|
| 175 |
"""主转录处理循环"""
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 176 |
|
| 177 |
+
while not self._translate_thread_stop.is_set():
|
| 178 |
+
audio_buffer = self._vad_frame_queue.get()
|
| 179 |
+
if audio_buffer is None or len(audio_buffer) < int(self.sample_rate):
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 180 |
time.sleep(0.2)
|
| 181 |
continue
|
| 182 |
+
|
| 183 |
+
logger.debug(f"audio buffer size: {len(audio_buffer) / self.sample_rate:.2f}s")
|
|
|
|
| 184 |
# try:
|
| 185 |
+
meta_item = self._transcribe_audio(audio_buffer)
|
| 186 |
+
segments = meta_item.segments
|
| 187 |
+
logger.debug(f"Segments: {segments}")
|
| 188 |
+
if len(segments):
|
| 189 |
+
result = self._process_transcription_results_2(segments)
|
| 190 |
self._send_result_to_client(result)
|
| 191 |
+
time.sleep(0.1)
|
| 192 |
+
# 处理转录结果并发送到客户端
|
| 193 |
+
# for result in self._process_transcription_results(segments, audio_buffer):
|
| 194 |
+
# self._send_result_to_client(result)
|
| 195 |
|
| 196 |
# except Exception as e:
|
| 197 |
# logger.error(f"Error processing audio: {e}")
|
| 198 |
|
| 199 |
+
def _process_transcription_results_2(self, segments: List[TranscriptToken],):
|
| 200 |
+
seg = segments[0]
|
| 201 |
+
item = TransResult(
|
| 202 |
+
seg_id=self.row_number,
|
| 203 |
+
context=seg.text,
|
| 204 |
+
from_=self.source_language,
|
| 205 |
+
to=self.target_language,
|
| 206 |
+
tran_content=self._translate_text_large(seg.text),
|
| 207 |
+
partial=False
|
| 208 |
+
)
|
| 209 |
+
self.row_number += 1
|
| 210 |
+
return item
|
| 211 |
+
|
| 212 |
def _process_transcription_results(self, segments: List[TranscriptToken], audio_buffer: np.ndarray) -> Iterator[TransResult]:
|
| 213 |
"""
|
| 214 |
处理转录结果,生成翻译结果
|