starkprince committed on
Commit
778d4b8
·
verified ·
1 Parent(s): a45d18c

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes. See raw diff.
Files changed (50)
  1. .claude/agents/project-manager-backlog.md +193 -0
  2. .claude/settings.local.json +31 -0
  3. .crossnote/config.js +15 -0
  4. .crossnote/head.html +6 -0
  5. .crossnote/parser.js +12 -0
  6. .crossnote/style.less +8 -0
  7. .cursorrules +215 -0
  8. .gitattributes +15 -0
  9. 2505.02625v1.txt +1065 -0
  10. CLAUDE.md +215 -0
  11. COSYVOICE2_CHANGES.md +87 -0
  12. GEMINI.md +215 -0
  13. LLaMA-Omni2-3B/README.md +155 -0
  14. LLaMA-Omni2-3B/added_tokens.json +25 -0
  15. LLaMA-Omni2-3B/config.json +65 -0
  16. LLaMA-Omni2-3B/generation_config.json +15 -0
  17. LLaMA-Omni2-3B/merges.txt +0 -0
  18. LLaMA-Omni2-3B/model-00001-of-00002.safetensors +3 -0
  19. LLaMA-Omni2-3B/model-00002-of-00002.safetensors +3 -0
  20. LLaMA-Omni2-3B/model.safetensors.index.json +0 -0
  21. LLaMA-Omni2-3B/special_tokens_map.json +25 -0
  22. LLaMA-Omni2-3B/tokenizer_config.json +216 -0
  23. LLaMA-Omni2-3B/tts_tokenizer/added_tokens.json +0 -0
  24. LLaMA-Omni2-3B/tts_tokenizer/merges.txt +0 -0
  25. LLaMA-Omni2-3B/tts_tokenizer/special_tokens_map.json +25 -0
  26. LLaMA-Omni2-3B/tts_tokenizer/tokenizer_config.json +0 -0
  27. LLaMA-Omni2-3B/tts_tokenizer/vocab.json +0 -0
  28. LLaMA-Omni2-3B/vocab.json +0 -0
  29. README.md +124 -0
  30. SETUP_GUIDE.md +274 -0
  31. controller.log.2025-08-16 +6 -0
  32. cosyvoice/__init__.py +0 -0
  33. cosyvoice/bin/average_model.py +92 -0
  34. cosyvoice/bin/export_jit.py +74 -0
  35. cosyvoice/bin/export_onnx.py +112 -0
  36. cosyvoice/bin/export_trt.sh +9 -0
  37. cosyvoice/bin/inference.py +115 -0
  38. cosyvoice/bin/train.py +170 -0
  39. cosyvoice/cli/__init__.py +0 -0
  40. cosyvoice/cli/cosyvoice.py +170 -0
  41. cosyvoice/cli/frontend.py +217 -0
  42. cosyvoice/cli/model.py +421 -0
  43. cosyvoice/dataset/__init__.py +0 -0
  44. cosyvoice/dataset/dataset.py +164 -0
  45. cosyvoice/dataset/processor.py +431 -0
  46. cosyvoice/flow/decoder.py +301 -0
  47. cosyvoice/flow/flow.py +237 -0
  48. cosyvoice/flow/flow_matching.py +239 -0
  49. cosyvoice/flow/length_regulator.py +69 -0
  50. cosyvoice/hifigan/discriminator.py +140 -0
.claude/agents/project-manager-backlog.md ADDED
@@ -0,0 +1,193 @@
1
+ ---
2
+ name: project-manager-backlog
3
+ description: Use this agent when you need to manage project tasks using the backlog.md CLI tool. This includes creating new tasks, editing tasks, ensuring tasks follow the proper format and guidelines, breaking down large tasks into atomic units, and maintaining the project's task management workflow. Examples: <example>Context: User wants to create a new task for adding a feature. user: "I need to add a new authentication system to the project" assistant: "I'll use the project-manager-backlog agent that will use backlog cli to create a properly structured task for this feature." <commentary>Since the user needs to create a task for the project, use the Task tool to launch the project-manager-backlog agent to ensure the task follows backlog.md guidelines.</commentary></example> <example>Context: User has multiple related features to implement. user: "We need to implement user profiles, settings page, and notification preferences" assistant: "Let me use the project-manager-backlog agent to break these down into atomic, independent tasks." <commentary>The user has a complex set of features that need to be broken down into proper atomic tasks following backlog.md structure.</commentary></example> <example>Context: User wants to review if their task description is properly formatted. user: "Can you check if this task follows our guidelines: 'task-123 - Implement user login'" assistant: "I'll use the project-manager-backlog agent to review this task against our backlog.md standards." <commentary>The user needs task review, so use the project-manager-backlog agent to ensure compliance with project guidelines.</commentary></example>
4
+ color: blue
5
+ ---
6
+
7
+ You are an expert project manager specializing in the backlog.md task management system. You have deep expertise in creating well-structured, atomic, and testable tasks that follow software development best practices.
8
+
9
+ ## Backlog.md CLI Tool
10
+
11
+ **IMPORTANT: Backlog.md uses standard CLI commands, NOT slash commands.**
12
+
13
+ You use the `backlog` CLI tool to manage project tasks. This tool allows you to create, edit, and manage tasks in a structured way using Markdown files. You will never create tasks manually; instead, you will use the CLI commands to ensure all tasks are properly formatted and adhere to the project's guidelines.
14
+
15
+ The backlog CLI is installed globally and available in the PATH. Here are the exact commands you should use:
16
+
17
+ ### Creating Tasks
18
+ ```bash
19
+ backlog task create "Task title" -d "Description" --ac "First criteria,Second criteria" -l label1,label2
20
+ ```
21
+
22
+ ### Editing Tasks
23
+ ```bash
24
+ backlog task edit 123 -s "In Progress" -a @claude
25
+ ```
26
+
27
+ ### Listing Tasks
28
+ ```bash
29
+ backlog task list --plain
30
+ ```
31
+
32
+ **NEVER use slash commands like `/create-task` or `/edit`. These do not exist in Backlog.md.**
33
+ **ALWAYS use the standard CLI format: `backlog task create` (without any slash prefix).**
34
+
35
+ ### Example Usage
36
+
37
+ When a user asks you to create a task, here's exactly what you should do:
38
+
39
+ **User**: "Create a task to add user authentication"
40
+ **You should run**:
41
+ ```bash
42
+ backlog task create "Add user authentication system" -d "Implement a secure authentication system to allow users to register and login" --ac "Users can register with email and password,Users can login with valid credentials,Invalid login attempts show appropriate error messages" -l authentication,backend
43
+ ```
44
+
45
+ **NOT**: `/create-task "Add user authentication"` ❌ (This is wrong - slash commands don't exist)
46
+
47
+ ## Your Core Responsibilities
48
+
49
+ 1. **Task Creation**: You create tasks strictly through the backlog.md CLI commands, never manually. Use the available `task create` parameters to ensure tasks are properly structured and follow the guidelines.
50
+ 2. **Task Review**: You ensure all tasks meet the quality standards for atomicity, testability, and independence, and follow the task anatomy described below.
51
+ 3. **Task Breakdown**: You expertly decompose large features into smaller, manageable tasks
52
+ 4. **Context understanding**: You analyze user requests against the project codebase and existing tasks to ensure relevance and accuracy
53
+ 5. **Handling ambiguity**: You clarify vague or ambiguous requests by asking targeted questions to the user to gather necessary details
54
+
55
+ ## Task Creation Guidelines
56
+
57
+ ### **Title (one liner)**
58
+
59
+ Use a clear brief title that summarizes the task.
60
+
61
+ ### **Description**: (The **"why"**)
62
+
63
+ Provide a concise summary of the task purpose and its goal. Do not add implementation details here. It
64
+ should explain the purpose, the scope and context of the task. Code snippets should be avoided.
65
+
66
+ ### **Acceptance Criteria**: (The **"what"**)
67
+
68
+ List specific, measurable outcomes that define what it means to reach the goal from the description. Use checkboxes (`- [ ]`) for tracking.
69
+ When defining `## Acceptance Criteria` for a task, focus on **outcomes, behaviors, and verifiable requirements** rather
70
+ than step-by-step implementation details.
71
+ Acceptance Criteria (AC) define *what* conditions must be met for the task to be considered complete.
72
+ They should be testable and confirm that the core purpose of the task is achieved.
73
+ **Key Principles for Good ACs:**
74
+
75
+ - **Outcome-Oriented:** Focus on the result, not the method.
76
+ - **Testable/Verifiable:** Each criterion should be something that can be objectively tested or verified.
77
+ - **Clear and Concise:** Unambiguous language.
78
+ - **Complete:** Collectively, ACs should cover the scope of the task.
79
+ - **User-Focused (where applicable):** Frame ACs from the perspective of the end-user or the system's external behavior.
80
+
81
+ - *Good Example:* "- [ ] User can successfully log in with valid credentials."
82
+ - *Good Example:* "- [ ] System processes 1000 requests per second without errors."
83
+ - *Bad Example (Implementation Step):* "- [ ] Add a new function `handleLogin()` in `auth.ts`."
84
+
85
+ ### Task file
86
+
87
+ Once a task is created using the backlog CLI, it is stored in the `backlog/tasks/` directory as a Markdown file with the format
88
+ `task-<id> - <title>.md` (e.g. `task-42 - Add GraphQL resolver.md`).
89
+
90
+ ## Task Breakdown Strategy
91
+
92
+ When breaking down features:
93
+ 1. Identify the foundational components first
94
+ 2. Create tasks in dependency order (foundations before features)
95
+ 3. Ensure each task delivers value independently
96
+ 4. Avoid creating tasks that block each other
97
+
98
+ ### Additional task requirements
99
+
100
+ - Tasks must be **atomic** and **testable**. If a task is too large, break it down into smaller subtasks.
101
+ Each task should represent a single unit of work that can be completed in a single PR.
102
+
103
+ - **Never** reference tasks that are to be done in the future or that are not yet created. You can only reference
104
+ previous tasks (id < current task id).
105
+
106
+ - When creating multiple tasks, ensure they are **independent** and they do not depend on future tasks.
107
+ Example of correct tasks splitting: task 1: "Add system for handling API requests", task 2: "Add user model and DB
108
+ schema", task 3: "Add API endpoint for user data".
109
+ Example of wrong tasks splitting: task 1: "Add API endpoint for user data", task 2: "Define the user model and DB
110
+ schema".
111
+
112
+ ## Recommended Task Anatomy
113
+
114
+ ```markdown
115
+ # task‑42 - Add GraphQL resolver
116
+
117
+ ## Description (the why)
118
+
119
+ Short, imperative explanation of the goal of the task and why it is needed.
120
+
121
+ ## Acceptance Criteria (the what)
122
+
123
+ - [ ] Resolver returns correct data for happy path
124
+ - [ ] Error response matches REST
125
+ - [ ] P95 latency ≤ 50 ms under 100 RPS
126
+
127
+ ## Implementation Plan (the how) (added after putting the task in progress but before implementing any code change)
128
+
129
+ 1. Research existing GraphQL resolver patterns
130
+ 2. Implement basic resolver with error handling
131
+ 3. Add performance monitoring
132
+ 4. Write unit and integration tests
133
+ 5. Benchmark performance under load
134
+
135
+ ## Implementation Notes (for reviewers) (only added after finishing the code implementation of a task)
136
+
137
+ - Approach taken
138
+ - Features implemented or modified
139
+ - Technical decisions and trade-offs
140
+ - Modified or added files
141
+ ```
142
+
143
+ ## Quality Checks
144
+
145
+ Before finalizing any task creation, verify:
146
+ - [ ] Title is clear and brief
147
+ - [ ] Description explains WHY without HOW
148
+ - [ ] Each AC is outcome-focused and testable
149
+ - [ ] Task is atomic (single PR scope)
150
+ - [ ] No dependencies on future tasks
151
+
152
+ You are meticulous about these standards and will guide users to create high-quality tasks that enhance project productivity and maintainability.
153
+
154
+ ## Self reflection
155
+ When creating a task, always think from the perspective of an AI Agent that will have to work with this task in the future.
156
+ Ensure that the task is structured in a way that it can be easily understood and processed by AI coding agents.
157
+
158
+ ## Handy CLI Commands
159
+
160
+ | Action | Example |
161
+ |-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
162
+ | Create task | `backlog task create "Add OAuth System"` |
163
+ | Create with description | `backlog task create "Feature" -d "Add authentication system"` |
164
+ | Create with assignee | `backlog task create "Feature" -a @sara` |
165
+ | Create with status | `backlog task create "Feature" -s "In Progress"` |
166
+ | Create with labels | `backlog task create "Feature" -l auth,backend` |
167
+ | Create with priority | `backlog task create "Feature" --priority high` |
168
+ | Create with plan | `backlog task create "Feature" --plan "1. Research\n2. Implement"` |
169
+ | Create with AC | `backlog task create "Feature" --ac "Must work,Must be tested"` |
170
+ | Create with notes | `backlog task create "Feature" --notes "Started initial research"` |
171
+ | Create with deps | `backlog task create "Feature" --dep task-1,task-2` |
172
+ | Create sub task | `backlog task create -p 14 "Add Login with Google"` |
173
+ | Create (all options) | `backlog task create "Feature" -d "Description" -a @sara -s "To Do" -l auth --priority high --ac "Must work" --notes "Initial setup done" --dep task-1 -p 14` |
174
+ | List tasks | `backlog task list [-s <status>] [-a <assignee>] [-p <parent>]` |
175
+ | List by parent | `backlog task list --parent 42` or `backlog task list -p task-42` |
176
+ | View detail | `backlog task 7` (interactive UI, press 'E' to edit in editor) |
177
+ | View (AI mode) | `backlog task 7 --plain` |
178
+ | Edit | `backlog task edit 7 -a @sara -l auth,backend` |
179
+ | Add plan | `backlog task edit 7 --plan "Implementation approach"` |
180
+ | Add AC | `backlog task edit 7 --ac "New criterion,Another one"` |
181
+ | Add notes | `backlog task edit 7 --notes "Completed X, working on Y"` |
182
+ | Add deps | `backlog task edit 7 --dep task-1 --dep task-2` |
183
+ | Archive | `backlog task archive 7` |
184
+ | Create draft | `backlog task create "Feature" --draft` |
185
+ | Draft flow | `backlog draft create "Spike GraphQL"` → `backlog draft promote 3.1` |
186
+ | Demote to draft | `backlog task demote <id>` |
187
+
188
+ Full help: `backlog --help`
189
+
190
+ ## Tips for AI Agents
191
+
192
+ - **Always use `--plain` flag** when listing or viewing tasks for AI-friendly text output instead of using Backlog.md
193
+ interactive UI.
.claude/settings.local.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "permissions": {
3
+ "allow": [
4
+ "Bash(backlog task list:*)",
5
+ "Bash(backlog task:*)",
6
+ "Bash(cat:*)",
7
+ "Bash(find:*)",
8
+ "Bash(timeout:*)",
9
+ "Bash(curl:*)",
10
+ "Bash(grep:*)",
11
+ "Bash(pkill:*)",
12
+ "Bash(sudo ufw:*)",
13
+ "Bash(sudo:*)",
14
+ "Bash(mv:*)",
15
+ "Bash(git add:*)",
16
+ "Bash(huggingface-cli:*)",
17
+ "Bash(git config:*)",
18
+ "Bash(python:*)",
19
+ "Bash(git push:*)",
20
+ "Bash(git lfs track:*)",
21
+ "Bash(git commit:*)"
22
+ ],
23
+ "deny": [],
24
+ "ask": [],
25
+ "additionalDirectories": [
26
+ "C:\\opt",
27
+ "C:\\data",
28
+ "/data/huggingface"
29
+ ]
30
+ }
31
+ }
.crossnote/config.js ADDED
@@ -0,0 +1,15 @@
1
+ ({
2
+ katexConfig: {
3
+ "macros": {}
4
+ },
5
+
6
+ mathjaxConfig: {
7
+ "tex": {},
8
+ "options": {},
9
+ "loader": {}
10
+ },
11
+
12
+ mermaidConfig: {
13
+ "startOnLoad": false
14
+ },
15
+ })
.crossnote/head.html ADDED
@@ -0,0 +1,6 @@
1
+ <!-- The content below will be included at the end of the <head> element. -->
2
+ <script type="text/javascript">
3
+ document.addEventListener("DOMContentLoaded", function () {
4
+ // your code here
5
+ });
6
+ </script>
.crossnote/parser.js ADDED
@@ -0,0 +1,12 @@
1
+ ({
2
+ // Please visit the URL below for more information:
3
+ // https://shd101wyy.github.io/markdown-preview-enhanced/#/extend-parser
4
+
5
+ onWillParseMarkdown: async function(markdown) {
6
+ return markdown;
7
+ },
8
+
9
+ onDidParseMarkdown: async function(html) {
10
+ return html;
11
+ },
12
+ })
.crossnote/style.less ADDED
@@ -0,0 +1,8 @@
1
+
2
+ /* Please visit the URL below for more information: */
3
+ /* https://shd101wyy.github.io/markdown-preview-enhanced/#/customize-css */
4
+
5
+ .markdown-preview.markdown-preview {
6
+ // modify your style here
7
+ // eg: background-color: blue;
8
+ }
.cursorrules ADDED
@@ -0,0 +1,215 @@
1
+
2
+ # === BACKLOG.MD GUIDELINES START ===
3
+ # Instructions for using the Backlog.md CLI Tool
4
+
5
+ ## 1. Source of Truth
6
+
7
+ - Tasks live under **`backlog/tasks/`** (drafts under **`backlog/drafts/`**).
8
+ - Every implementation decision starts with reading the corresponding Markdown task file.
9
+ - Project documentation is in **`backlog/docs/`**.
10
+ - Project decisions are in **`backlog/decisions/`**.
11
+
12
+ ## 2. Defining Tasks
13
+
14
+ ### Understand the Scope and the purpose
15
+
16
+ Ask questions to the user if something is not clear or ambiguous.
17
+ Break down the task into smaller, manageable parts if it is too large or complex.
18
+
19
+ ### **Title (one liner)**
20
+
21
+ Use a clear brief title that summarizes the task.
22
+
23
+ ### **Description**: (The **"why"**)
24
+
25
+ Provide a concise summary of the task purpose and its goal. Do not add implementation details here. It
26
+ should explain the purpose and context of the task. Code snippets should be avoided.
27
+
28
+ ### **Acceptance Criteria**: (The **"what"**)
29
+
30
+ List specific, measurable outcomes that define what it means to reach the goal from the description. Use checkboxes (
31
+ `- [ ]`) for tracking.
32
+ When defining `## Acceptance Criteria` for a task, focus on **outcomes, behaviors, and verifiable requirements** rather
33
+ than step-by-step implementation details.
34
+ Acceptance Criteria (AC) define *what* conditions must be met for the task to be considered complete.
35
+ They should be testable and confirm that the core purpose of the task is achieved.
36
+ **Key Principles for Good ACs:**
37
+
38
+ - **Outcome-Oriented:** Focus on the result, not the method.
39
+ - **Testable/Verifiable:** Each criterion should be something that can be objectively tested or verified.
40
+ - **Clear and Concise:** Unambiguous language.
41
+ - **Complete:** Collectively, ACs should cover the scope of the task.
42
+ - **User-Focused (where applicable):** Frame ACs from the perspective of the end-user or the system's external behavior.
43
+
44
+ - *Good Example:* "- [ ] User can successfully log in with valid credentials."
45
+ - *Good Example:* "- [ ] System processes 1000 requests per second without errors."
46
+ - *Bad Example (Implementation Step):* "- [ ] Add a new function `handleLogin()` in `auth.ts`."
47
+
48
+ ### Task file
49
+
50
+ Once a task is created it will be stored in `backlog/tasks/` directory as a Markdown file with the format
51
+ `task-<id> - <title>.md` (e.g. `task-42 - Add GraphQL resolver.md`).
52
+
53
+ ### Task Breakdown Strategy
54
+
55
+ When breaking down features:
56
+
57
+ 1. Identify the foundational components first
58
+ 2. Create tasks in dependency order (foundations before features)
59
+ 3. Ensure each task delivers value independently
60
+ 4. Avoid creating tasks that block each other
61
+
62
+ ### Additional task requirements
63
+
64
+ - Tasks must be **atomic** and **testable**. If a task is too large, break it down into smaller subtasks.
65
+ Each task should represent a single unit of work that can be completed in a single PR.
66
+
67
+ - **Never** reference tasks that are to be done in the future or that are not yet created. You can only reference
68
+ previous
69
+ tasks (id < current task id).
70
+
71
+ - When creating multiple tasks, ensure they are **independent** and they do not depend on future tasks.
72
+ Example of wrong tasks splitting: task 1: "Add API endpoint for user data", task 2: "Define the user model and DB
73
+ schema".
74
+ Example of correct tasks splitting: task 1: "Add system for handling API requests", task 2: "Add user model and DB
75
+ schema", task 3: "Add API endpoint for user data".
76
+
77
+ ## 3. Recommended Task Anatomy
78
+
79
+ ```markdown
80
+ # task‑42 - Add GraphQL resolver
81
+
82
+ ## Description (the why)
83
+
84
+ Short, imperative explanation of the goal of the task and why it is needed.
85
+
86
+ ## Acceptance Criteria (the what)
87
+
88
+ - [ ] Resolver returns correct data for happy path
89
+ - [ ] Error response matches REST
90
+ - [ ] P95 latency ≤ 50 ms under 100 RPS
91
+
92
+ ## Implementation Plan (the how) (added after putting the task in progress but before implementing any code change)
93
+
94
+ 1. Research existing GraphQL resolver patterns
95
+ 2. Implement basic resolver with error handling
96
+ 3. Add performance monitoring
97
+ 4. Write unit and integration tests
98
+ 5. Benchmark performance under load
99
+
100
+ ## Implementation Notes (imagine this is the PR description) (only added after finishing the code implementation of a task)
101
+
102
+ - Approach taken
103
+ - Features implemented or modified
104
+ - Technical decisions and trade-offs
105
+ - Modified or added files
106
+ ```
107
+
108
+ ## 4. Implementing Tasks
109
+
110
+ Mandatory sections for every task:
111
+
112
+ - **Implementation Plan**: (The **"how"**) Outline the steps to achieve the task. Because the implementation details may
113
+ change after the task is created, **the implementation plan must be added only after putting the task in progress**
114
+ and before starting working on the task.
115
+ - **Implementation Notes**: Start with a brief summary of what has been implemented. Document your approach, decisions, challenges, and any deviations from the plan. This
116
+ section is added after you are done working on the task. It should summarize what you did and why you did it. Keep it
117
+ concise but informative. Imagine this is the PR description. Make it brief, explain the core changes and assume that
118
+ others will read the code to understand the details.
119
+
120
+ **IMPORTANT**: Do not implement anything else that deviates from the **Acceptance Criteria**. If you need to
121
+ implement something that is not in the AC, update the AC first and then implement it or create a new task for it.
122
+
123
+ ## 5. Typical Workflow
124
+
125
+ ```bash
126
+ # 1 Identify work
127
+ backlog task list -s "To Do" --plain
128
+
129
+ # 2 Read details & documentation
130
+ backlog task 42 --plain
131
+ # Read also all documentation files in `backlog/docs/` directory.
132
+ # Read also all decision files in `backlog/decisions/` directory.
133
+
134
+ # 3 Start work: assign yourself & move column
135
+ backlog task edit 42 -a @{yourself} -s "In Progress"
136
+
137
+ # 4 Add implementation plan before starting
138
+ backlog task edit 42 --plan "1. Analyze current implementation\n2. Identify bottlenecks\n3. Refactor in phases"
139
+
140
+ # 5 Break work down if needed by creating subtasks or additional tasks
141
+ backlog task create "Refactor DB layer" -p 42 -a @{yourself} -d "Description" --ac "Tests pass,Performance improved"
142
+
143
+ # 6 Complete and mark Done
144
+ backlog task edit 42 -s Done --notes "Implemented GraphQL resolver with error handling and performance monitoring"
145
+ ```
146
+
147
+ ## 6. Final Steps Before Marking a Task as Done
148
+
149
+ Always ensure you have:
150
+
151
+ 1. ✅ Marked all acceptance criteria as completed (change `- [ ]` to `- [x]`)
152
+ 2. ✅ Added an `## Implementation Notes` section documenting your approach
153
+ 3. ✅ Run all tests and linting checks
154
+ 4. ✅ Updated relevant documentation
155
+
156
+ ## 7. Definition of Done (DoD)
157
+
158
+ A task is **Done** only when **ALL** of the following are complete:
159
+
160
+ 1. **Acceptance criteria** checklist in the task file is fully checked (all `- [ ]` changed to `- [x]`).
161
+ 2. **Implementation plan** was followed or deviations were documented in Implementation Notes.
162
+ 3. **Automated tests** (unit + integration) cover new logic.
163
+ 4. **Static analysis**: linter & formatter succeed.
164
+ 5. **Documentation**:
165
+ - All relevant docs updated (any relevant README file, backlog/docs, backlog/decisions, etc.).
166
+ - Task file **MUST** have an `## Implementation Notes` section added summarising:
167
+ - Approach taken
168
+ - Features implemented or modified
169
+ - Technical decisions and trade-offs
170
+ - Modified or added files
171
+ 6. **Review**: self review code.
172
+ 7. **Task hygiene**: status set to **Done** via CLI (`backlog task edit <id> -s Done`).
173
+ 8. **No regressions**: performance, security and licence checks green.
174
+
175
+ ⚠️ **IMPORTANT**: Never mark a task as Done without completing ALL items above.
176
+
177
+ ## 8. Handy CLI Commands
178
+
179
+ | Action | Example |
180
+ |-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
181
+ | Create task | `backlog task create "Add OAuth System"` |
182
+ | Create with description | `backlog task create "Feature" -d "Add authentication system"` |
183
+ | Create with assignee | `backlog task create "Feature" -a @sara` |
184
+ | Create with status | `backlog task create "Feature" -s "In Progress"` |
185
+ | Create with labels | `backlog task create "Feature" -l auth,backend` |
186
+ | Create with priority | `backlog task create "Feature" --priority high` |
187
+ | Create with plan | `backlog task create "Feature" --plan "1. Research\n2. Implement"` |
188
+ | Create with AC | `backlog task create "Feature" --ac "Must work,Must be tested"` |
189
+ | Create with notes | `backlog task create "Feature" --notes "Started initial research"` |
190
+ | Create with deps | `backlog task create "Feature" --dep task-1,task-2` |
191
+ | Create sub task | `backlog task create -p 14 "Add Login with Google"` |
192
+ | Create (all options) | `backlog task create "Feature" -d "Description" -a @sara -s "To Do" -l auth --priority high --ac "Must work" --notes "Initial setup done" --dep task-1 -p 14` |
193
+ | List tasks | `backlog task list [-s <status>] [-a <assignee>] [-p <parent>]` |
194
+ | List by parent | `backlog task list --parent 42` or `backlog task list -p task-42` |
195
+ | View detail | `backlog task 7` (interactive UI, press 'E' to edit in editor) |
196
+ | View (AI mode) | `backlog task 7 --plain` |
197
+ | Edit | `backlog task edit 7 -a @sara -l auth,backend` |
198
+ | Add plan | `backlog task edit 7 --plan "Implementation approach"` |
199
+ | Add AC | `backlog task edit 7 --ac "New criterion,Another one"` |
200
+ | Add notes | `backlog task edit 7 --notes "Completed X, working on Y"` |
201
+ | Add deps | `backlog task edit 7 --dep task-1 --dep task-2` |
202
+ | Archive | `backlog task archive 7` |
203
+ | Create draft | `backlog task create "Feature" --draft` |
204
+ | Draft flow | `backlog draft create "Spike GraphQL"` → `backlog draft promote 3.1` |
205
+ | Demote to draft | `backlog task demote <id>` |
206
+
207
+ Full help: `backlog --help`
208
+
209
+ ## 9. Tips for AI Agents
210
+
211
+ - **Always use `--plain` flag** when listing or viewing tasks for AI-friendly text output instead of using Backlog.md
212
+ interactive UI.
213
+ - When users ask to create a task, they mean creating a task using the Backlog.md CLI tool.
214
+
215
+ # === BACKLOG.MD GUIDELINES END ===
.gitattributes CHANGED
@@ -33,3 +33,18 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ examples/wav/helpful_base_0.wav filter=lfs diff=lfs merge=lfs -text
37
+ examples/wav/helpful_base_1.wav filter=lfs diff=lfs merge=lfs -text
38
+ examples/wav/helpful_base_2.wav filter=lfs diff=lfs merge=lfs -text
39
+ examples/wav/helpful_base_3.wav filter=lfs diff=lfs merge=lfs -text
40
+ examples/wav/helpful_base_4.wav filter=lfs diff=lfs merge=lfs -text
41
+ examples/wav/helpful_base_5.wav filter=lfs diff=lfs merge=lfs -text
42
+ examples/wav/helpful_base_6.wav filter=lfs diff=lfs merge=lfs -text
43
+ examples/wav/helpful_base_7.wav filter=lfs diff=lfs merge=lfs -text
44
+ examples/wav/helpful_base_8.wav filter=lfs diff=lfs merge=lfs -text
45
+ examples/wav/helpful_base_9.wav filter=lfs diff=lfs merge=lfs -text
46
+ images/llama-omni2.png filter=lfs diff=lfs merge=lfs -text
47
+ llama_omni2/inference/prompt_en.wav filter=lfs diff=lfs merge=lfs -text
48
+ llama_omni2/inference/prompt_zh.wav filter=lfs diff=lfs merge=lfs -text
49
+ models/Llama-3.1-8B-Omni/images/model.png filter=lfs diff=lfs merge=lfs -text
50
+ tmp/e5fd5a073117d600c1ed49bd412158449e0e001ade31bc971dc1dcb45631c170/Tuesday[[:space:]]at[[:space:]]20-06.wav filter=lfs diff=lfs merge=lfs -text
2505.02625v1.txt ADDED
@@ -0,0 +1,1065 @@
+ LLaMA-Omni 2: LLM-based Real-time Spoken Chatbot with
+ Autoregressive Streaming Speech Synthesis
+ Qingkai Fang^{1,3}, Yan Zhou^{1,3}, Shoutao Guo^{1,3}, Shaolei Zhang^{1,3}, Yang Feng^{1,2,3}*
+ 1 Key Laboratory of Intelligent Information Processing,
+ Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)
+ 2 Key Laboratory of AI Safety, Chinese Academy of Sciences
+ 3 University of Chinese Academy of Sciences, Beijing, China
+ {fangqingkai21b,fengyang}@ict.ac.cn
+
+ arXiv:2505.02625v1 [cs.CL] 5 May 2025
+
+ Abstract
+
+ Real-time, intelligent, and natural speech interaction is an essential part of the next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achieving high-quality real-time speech interaction. LLaMA-Omni 2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong performance on several spoken question answering and speech instruction following benchmarks, surpassing previous state-of-the-art SpeechLMs like GLM-4-Voice, which was trained on millions of hours of speech data.[1]
+
+ 1 Introduction
+
+ Speech, as a critical interface for human-computer interaction, can significantly enhance both interaction efficiency and user experience (Clark et al., 2019). In recent years, as large language models (LLMs) like ChatGPT (OpenAI, 2022) have demonstrated outstanding performance across various fields, speech interactions with LLMs have attracted widespread attention from both academia and industry. For instance, GPT-4o (OpenAI, 2024) enables real-time, intelligent, and natural speech interaction between users and LLMs, heralding the advent of a new generation of human-computer interaction paradigms.
+ To develop a spoken chatbot similar to GPT-4o, the traditional approach typically employs a cascaded pipeline comprising an automatic speech recognition (ASR) model, an LLM, and a text-to-speech (TTS) model. While this method is relatively straightforward to implement, it suffers from several notable limitations. First, errors can accumulate across the different stages of the pipeline. Second, the overall response latency tends to be high due to the sequential processing of multiple models. Third, the system struggles to capture paralinguistic information present in the input speech. To address these limitations, end-to-end speech language models (SpeechLMs) have gradually gained more attention, using a single unified model to handle the entire process from speech input to output. Overall, end-to-end SpeechLMs can be categorized into two types: native and modular. Native SpeechLMs typically discretize speech into tokens and employ a GPT-style decoder-only Transformer (Radford, 2018) to model both speech and text within a unified language model (Zhang et al., 2023; Rubenstein et al., 2023; Hassid et al., 2024a). A key advantage of this architecture is its ability to leverage vast amounts of unsupervised speech data for pretraining, making it easier to scale up in terms of model parameters and data size. This can potentially result in emergent capabilities, such as more human-like speech expressiveness (Zeng et al., 2024a; Open-Moss, 2025). However, native SpeechLMs typically require large-scale speech datasets (e.g., millions of hours) for pretraining (Zeng et al., 2024b; Défossez et al., 2024), which presents challenges in data collection and training costs, and may also lead to catastrophic forgetting of the model's text capabilities. In contrast, modular SpeechLMs incorporate a speech encoder and a speech decoder around the LLM to handle speech understanding and generation (Fang et al., 2025; Wang et al., 2024). The advantage of this approach is its ability to leverage the inherent capabilities of each module, requiring only small-scale fine-tuning (e.g., a few hundred or thousand hours of speech data) to align the modules. This enables the model to acquire speech interaction capabilities at a relatively low cost, while retaining most of its original capability. Moreover, modular SpeechLMs can typically generate speech guided by textual output, ensuring the intelligence of the generated speech.
+
+ * Corresponding author: Yang Feng.
+ [1] Code: https://github.com/ictnlp/LLaMA-Omni2
+ Audio Samples: https://llama-omni2.github.io/
82
+ In addition to the intelligence of speech, real-time responsiveness and naturalness are also crucial characteristics of spoken chatbots. LLaMA-Omni (Fang et al., 2025) uses a non-autoregressive
83
+ (NAR) streaming speech decoder to enable synchronized generation of speech and text, ensuring
84
+ extremely low response latency. However, due
85
+ to the limitations of non-autoregressive models in
86
+ modeling capacity, the generated speech is often
87
+ less natural and fluent. Freeze-Omni (Wang et al.,
88
+ 2024) combines both NAR and autoregressive (AR)
89
+ models for speech generation, resulting in higher
90
+ naturalness of the generated speech. However, it
91
+ can only achieve sentence-level streaming speech
92
+ generation through a simple sentence-split strategy,
93
+ which prevents it from achieving very low response
94
+ latency. To address these challenges, in this paper,
95
+ we introduce LLaMA-Omni 2, a series of modular
96
+ SpeechLMs ranging from 0.5B to 14B. LLaMA-Omni 2 adopts Qwen2.5-0.5B/1.5B/3B/7B/14B-Instruct models (Team, 2024) as the base LLM,
97
+ and uses Whisper’s encoder (Radford et al., 2023)
98
+ as the speech encoder. For the speech decoder,
99
+ inspired by the state-of-the-art streaming speech
100
+ synthesis model CosyVoice 2 (Du et al., 2024), it
101
+ first includes an autoregressive text-to-speech language model initialized with Qwen2.5-0.5B, which
102
+ generates speech tokens from the LLM output and
103
+ achieves streaming generation through alternating
104
+ read and write operations. The speech tokens are
105
+ then passed through a chunk-aware causal flow
106
+ matching model (Lipman et al., 2023) to generate the mel spectrogram in a streaming manner.
107
+ To train the model, we synthesize 200K multi-turn speech-to-speech dialogue samples with diverse input voices and a uniform output voice.
108
+ Experimental results show that LLaMA-Omni 2
109
+ achieves outstanding performance on spoken question answering and speech instruction following
110
+ tasks in both speech-to-text and speech-to-speech
111
+ settings, outperforming both LLaMA-Omni and
112
+ the native SpeechLM GLM-4-Voice (Zeng et al.,
113
+ 2024a), which was trained on millions of hours
114
+ of speech data. We also conducted detailed ablation studies on factors such as LLM parameter size,
115
+ training data scale, speech decoder pretraining, and
116
+
117
+ read-write strategy, to better understand the impact
118
+ of these factors on the overall system performance.
+
+ 2 Model: LLaMA-Omni 2
123
+
124
+ In this section, we introduce the model architecture
125
+ of LLaMA-Omni 2. As shown in Figure 1, the
126
+ core of LLaMA-Omni 2 is an LLM, for which we
127
+ use the Qwen2.5 series models (Team, 2024) due
128
+ to their strong performance across various benchmarks. Next, we will describe how we equip the
129
+ LLM with speech understanding and streaming
130
+ speech generation capabilities. In the following,
131
+ we use M_LLM to denote the LLM. For a single-turn instruction-response pair, we denote the speech instruction as X, and the text and speech responses as Y^T and Y^S, respectively.
134
+
+ 2.1 Speech Understanding
137
+
138
+ To enable speech understanding, we incorporate
139
+ a speech encoder and a speech adapter before the
140
+ LLM, similar to LLaMA-Omni (Fang et al., 2025).
141
+ Specifically, we use the encoder of Whisper-large-v3 (Radford et al., 2023) as the speech encoder,
142
+ which converts the input speech into a sequence of
143
+ representations. The encoded representations are
144
+ then passed into the speech adapter, which consists
145
+ of a downsampling module and a feed-forward network (FFN). The downsampling module concatenates every k consecutive frames along the feature
146
+ dimension, and the concatenated representations
147
+ are further encoded by the FFN. The final output
148
+ representation is then input into the LLM.
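As an illustration of this adapter, here is a minimal PyTorch sketch (ours, not the released code): the 5x downsampling factor and the 2048-dimensional FFN follow Section 4.1, while the module name, the ReLU activation, and the example dimensions are our own assumptions.

```python
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Concatenate every k consecutive encoder frames, then map them into
    the LLM embedding space with a feed-forward network (sketch)."""

    def __init__(self, enc_dim: int, llm_dim: int, k: int = 5, ffn_dim: int = 2048):
        super().__init__()
        self.k = k
        self.ffn = nn.Sequential(            # activation choice is an assumption
            nn.Linear(enc_dim * k, ffn_dim),
            nn.ReLU(),
            nn.Linear(ffn_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, enc_dim) representations from the Whisper encoder
        b, t, d = x.shape
        t = t - t % self.k                                 # drop incomplete tail group
        x = x[:, :t].reshape(b, t // self.k, d * self.k)   # k-fold fewer, wider frames
        return self.ffn(x)                                 # (batch, T // k, llm_dim)

# Whisper-large-v3 encoder outputs 1280-dim frames; llm_dim here is illustrative.
out = SpeechAdapter(enc_dim=1280, llm_dim=2048)(torch.randn(1, 100, 1280))
print(out.shape)  # torch.Size([1, 20, 2048])
```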
149
+
+ 2.2 Streaming Speech Generation
152
+
153
+ To equip the model with streaming speech generation capabilities, we adopt a paradigm similar to
154
+ CosyVoice 2 (Du et al., 2024). First, the speech
155
+ response is converted into discrete tokens using a
156
+ supervised semantic speech tokenizer. Then, an
157
+ autoregressive text-to-speech language model is
158
+ employed to model the streaming generation from
159
+ the LLM output to speech tokens. Finally, a causal
160
+ flow matching model converts speech tokens into
161
+ the mel spectrogram in a streaming manner.
162
+ Speech Tokenizer The speech tokenizer is implemented by inserting a finite scalar quantization
163
+ (FSQ) module (Mentzer et al., 2024) into the encoder of SenseVoice-Large ASR model (An et al.,
164
+ 2024). This module first projects the intermediate
165
+ representations to a low-rank space and discretizes
166
+ them through a rounding operation. Ultimately,
+ the speech response Y^S is converted into a token sequence Y^U = [y_1^U, ..., y_M^U], with 25 tokens per second, where each token y_i^U ∈ {K ∈ N | 0 ≤ K < 6561}. We use the pretrained speech tokenizer in CosyVoice 2.
+
+ [Figure 1 appears here in the original PDF; its textual residue is omitted.]
+ Figure 1: Left: Model architecture of LLaMA-Omni 2. Right: Illustration of the two-stage training strategy.
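A rough sketch of the FSQ discretization described above. Note that 6561 = 3^8, so we assume an 8-dimensional low-rank space with 3 levels per dimension; the actual tokenizer configuration in CosyVoice 2 may differ, and the projection here is untrained.

```python
import torch

def fsq_quantize(h: torch.Tensor, dims: int = 8, levels: int = 3) -> torch.Tensor:
    """Finite scalar quantization sketch: project encoder states to a low-rank
    space, round each coordinate to one of `levels` values, and pack the digits
    into a single token id in [0, levels**dims) = [0, 6561)."""
    proj = torch.nn.Linear(h.shape[-1], dims)            # low-rank projection
    z = torch.tanh(proj(h))                              # bound to (-1, 1)
    q = torch.round((z + 1) / 2 * (levels - 1)).long()   # rounding -> {0, 1, 2}
    powers = levels ** torch.arange(dims)                # base-3 place values
    return (q * powers).sum(-1)                          # token ids

tokens = fsq_quantize(torch.randn(1, 25, 512))  # 25 frames ~ 1 second of speech
print(tokens.shape, int(tokens.max()) < 6561)   # torch.Size([1, 25]) True
```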
246
+
+ Text-to-Speech Language Model After converting the speech response into discrete tokens, we use a decoder-only Transformer (Vaswani, 2017) to model the conditional language model from the LLM output to the speech tokens, denoted as M_TTS. It is initialized with Qwen2.5-0.5B, and its vocabulary is extended as V' = V ∪ {<i> | i ∈ N, 0 ≤ i < 6561}, where V is the original vocabulary. This extension enables the model to generate speech tokens.
+ The input to M_TTS comes from the output of the LLM. Specifically, the LLM output consists of two parts: continuous hidden states and text tokens sampled from the hidden states. The former contains contextual information, while the latter provides precise textual content. We aim to use both as inputs to the text-to-speech language model. This allows the model to both consider the current context and ensure better alignment with the text response when generating speech tokens. During training, the LLM is trained with teacher forcing, so its output hidden states are denoted as H = [h_1, ..., h_N], where h_i = M_LLM(X, Y^T_{<i}). The corresponding text is the ground truth Y^T = [y_1^T, ..., y_N^T]. We first use a 2-layer feed-forward network (FFN) to map the hidden states to the embedding dimension of M_TTS, while also obtaining the text embeddings:
+
+ e_i^{hidden} = FFN(h_i),    (1)
+ e_i^{emb} = Emb(y_i^T),    (2)
+
+ where Emb(·) is the embedding layer of M_TTS. Afterward, we use an element-wise gate fusion mechanism to combine both representations. Specifically, we compute the gate g_i as follows:
+
+ g_i = σ(W_g [e_i^{hidden} ∥ e_i^{emb}] + b_g),    (3)
+
+ where ∥ denotes concatenation, σ is the sigmoid function, W_g ∈ R^{2d×d} and b_g ∈ R^d are the weight and bias parameters of the gate, and d is the embedding size of M_TTS. Finally, the fused representation is computed as:
+
+ c_i = g_i ⊙ e_i^{hidden} + (1 − g_i) ⊙ e_i^{emb},    (4)
+
+ where ⊙ denotes element-wise multiplication. These fused representations C = [c_1, ..., c_N] are then passed to M_TTS for generating speech tokens.
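The gate fusion of Eqs. (1)-(4) as a small PyTorch sketch (our illustration, not the authors' code; class and variable names are hypothetical, and the example dimensions correspond to Qwen2.5-7B hidden states fused into a Qwen2.5-0.5B-sized M_TTS):

```python
import torch
import torch.nn as nn

class GateFusion(nn.Module):
    """Fuse LLM hidden states with text-token embeddings, Eqs. (1)-(4)."""

    def __init__(self, llm_dim: int, tts_dim: int):
        super().__init__()
        self.ffn = nn.Sequential(                    # Eq. (1): 2-layer FFN
            nn.Linear(llm_dim, tts_dim), nn.ReLU(), nn.Linear(tts_dim, tts_dim)
        )
        self.gate = nn.Linear(2 * tts_dim, tts_dim)  # W_g, b_g in Eq. (3)

    def forward(self, h: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        e_hidden = self.ffn(h)                       # Eq. (1)
        e_emb = text_emb                             # Eq. (2): Emb(y_i^T), precomputed
        g = torch.sigmoid(self.gate(torch.cat([e_hidden, e_emb], dim=-1)))  # Eq. (3)
        return g * e_hidden + (1 - g) * e_emb        # Eq. (4)

fusion = GateFusion(llm_dim=3584, tts_dim=896)  # e.g., Qwen2.5-7B -> Qwen2.5-0.5B dims
c = fusion(torch.randn(1, 4, 3584), torch.randn(1, 4, 896))
print(c.shape)  # torch.Size([1, 4, 896])
```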
316
+ To achieve streaming generation, i.e., to generate speech tokens simultaneously during the LLM's output process, we adopt a "Read-R-Write-W" strategy, similar to CosyVoice 2. Specifically, we mix the fused representations C and the speech tokens Y^U at a predefined ratio R : W. For every R fused representations read in, the model generates W speech tokens. Once all fused representations are read, the model continues to generate the remaining speech tokens until completion. During training, cross-entropy loss is computed only for the generated speech tokens as follows:
+
+ L_TTS = − Σ_{i=1}^{M} log P(y_i^U | C_{≤ min((⌊(i−1)/W⌋+1)·R, N)}, Y_{<i}^U),    (5)
+
+ where C_{≤ min((⌊(i−1)/W⌋+1)·R, N)} denotes the fused representations that have already been read.
+ Flow Matching Model The speech tokens generated by M_TTS are further processed by a chunk-aware causal flow matching model (Lipman et al., 2023) to synthesize the mel spectrogram in a streaming manner. Every time W speech tokens are generated, they are treated as a chunk for mel spectrogram synthesis. The synthesized mel spectrogram is then passed through a HiFi-GAN vocoder (Kong et al., 2020) to generate the final waveform. We use the pretrained flow matching model and vocoder in CosyVoice 2.
+
+ 2.3 Training
+
+ The training of LLaMA-Omni 2 relies solely on the 200K multi-turn speech-to-speech dialogue data (we will describe how this is synthesized in Section 3) and does not use any ASR or TTS data. We find that it is sufficient to achieve excellent performance while minimizing training costs. Specifically, the training process consists of two stages, as shown in Figure 1.
+ Stage I In Stage I training, we train the speech-to-text and text-to-speech components separately. The training data consists of <speech instruction, text response> pairs and <text response, speech response> pairs from the multi-turn speech-to-speech dialogue data. Specifically, for the speech-to-text part (Stage I(a)), we freeze the speech encoder and train the speech adapter and LLM with cross-entropy loss. For the text-to-speech part (Stage I(b)), we train the text-to-speech language model with cross-entropy loss. Note that during this stage, the gate fusion module is not trained, and only text embeddings are input into M_TTS.
+ Stage II In Stage II, we train the model's speech-to-speech generation capability with speech-to-speech dialogue data. During this stage, we freeze the speech encoder, speech adapter, and LLM, and only train the gate fusion module and M_TTS.
+
+ 2.4 Inference
+
+ During inference, the LLM autoregressively generates the text response based on the speech instruction. After generating R text tokens, its hidden states and the corresponding decoded text are fed into the gate fusion module and M_TTS to generate W speech tokens, which are then passed through the flow matching model and the vocoder to synthesize a speech chunk. In this way, text and speech responses can be generated simultaneously. The response latency for the first synthesized speech chunk can be calculated as:
+
+ T_total = T_LLM(R) + T_TTS(W) + T_FM(W) + T_Voc(2W),    (6)
+
+ where T_LLM(R) and T_TTS(W) represent the time required by the M_LLM and M_TTS models to generate R and W tokens, respectively. T_FM(W) and T_Voc(2W) represent the decoding times of the flow matching model and vocoder when the inputs are W and 2W tokens[2], respectively.
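To make the read-write schedule of Eq. (5) and the latency decomposition of Eq. (6) concrete, here is a small Python sketch (ours, not the released code; function names are hypothetical), using the paper's R = 3 and W = 10:

```python
import math

def visible_reads(i: int, R: int = 3, W: int = 10, N: int = 12) -> int:
    """Number of fused representations already read when generating speech
    token i (1-indexed), i.e. min((floor((i-1)/W)+1)*R, N) from Eq. (5)."""
    return min((math.floor((i - 1) / W) + 1) * R, N)

# With R=3, W=10 and a 12-token text response: tokens 1-10 condition on 3 reads,
# tokens 11-20 on 6 reads, and so on until all N=12 representations are read.
print([visible_reads(i) for i in (1, 10, 11, 21, 31, 41)])  # [3, 3, 6, 9, 12, 12]

def first_chunk_latency(t_llm, t_tts, t_fm, t_voc, R: int = 3, W: int = 10) -> float:
    """Eq. (6): latency of the first speech chunk. Each t_* is a callable
    mapping a token count to that module's decoding time (supplied by the user)."""
    return t_llm(R) + t_tts(W) + t_fm(W) + t_voc(2 * W)
```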
+
+ [2] The length of the mel spectrogram is twice that of the speech tokens (50 Hz vs. 25 Hz).
+ [3] https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
+ [4] https://huggingface.co/fishaudio/
+
+ 3 Data Construction
+
+ In this section, we introduce the process of constructing multi-turn speech-to-speech dialogue data. Our data is an extension of the InstructS2S-200K dataset introduced in Fang et al. (2025), which contains 200K single-turn instruction-following samples designed for speech interaction scenarios. These samples are derived from the Alpaca (Taori et al., 2023) and UltraChat (Ding et al., 2023) datasets through rewriting using LLMs. Specifically, for each sample, we first sample the number of turns from a Poisson distribution: N ∼ Poisson(λ = 2), then clip N to the range of 1 to 5. Next, we use the Llama-3.3-70B-Instruct[3] (Dubey et al., 2024) model to iteratively generate the dialog. For the i-th turn, the instruction and response are generated based on the dialogue history of the previous i − 1 turns. In this way, we obtain 200K multi-turn text dialog samples.
+ Next, we need to convert the text dialogue into speech. To simulate real-world applications, we aim to have varied voices for the instruction, while maintaining a consistent voice for the response. For each multi-turn dialogue, we first use the fish-speech-1.5[4] model (Liao et al., 2024) to synthesize a short prompt (e.g., "This is a randomly generated voice") with a random voice. Then, we use the synthesized speech as the prompt for the CosyVoice2-0.5B[5] model, which synthesizes the instruction into speech while simultaneously cloning the voice. This ensures consistency in the voice across different turns of the dialogue, while maintaining diversity across dialogues. For all responses, we use a uniform voice as the prompt and then synthesize the speech using the CosyVoice2-0.5B model.
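A sketch of the turn-count sampling described above (illustrative only; the LLM rewriting and TTS synthesis steps are elided, and names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_num_turns(lam: float = 2.0, lo: int = 1, hi: int = 5) -> int:
    """Draw the number of dialogue turns: N ~ Poisson(lambda=2), clipped to [1, 5]."""
    return int(np.clip(rng.poisson(lam), lo, hi))

turns = [sample_num_turns() for _ in range(200_000)]
# Most dialogues end up with 1-3 turns; 5 turns is rare because of the clipping.
print(min(turns), max(turns))  # 1 5
```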
434
+
435
+ 4
436
+
437
+ Experiments
438
+
439
+ 4.1
440
+
441
+ Experimental Setups
442
+
443
+ Evaluation
444
+
445
+ Our evaluation includes two tasks: spoken question answering and speech instruction following.
446
+ For both tasks, we evaluate the model’s speech-totext and speech-to-speech capabilities. The speechto-speech evaluation is done by transcribing the
447
+ speech response into text using the Whisper-largev3 model, and then applying the same evaluation
448
+ method as used for speech-to-text evaluation. In
449
+ all experiments, we use greedy search for the LLM
450
+ to ensure stable results. For the text-to-speech language model, we use sampling with temperature set
451
+ to 1.0, as we find that using greedy search causes
452
+ the model to fall into repetition.
453
+
454
+ Model Configuration We use the encoder of
455
+ Whisper-large-v3 as the speech encoder. The
456
+ speech adapter first performs a 5× downsampling, followed by a FFN with an intermediate
457
+ dimension of 2048. For the LLM, we select
458
+ the Qwen2.5 series models, including Qwen2.50.5B/1.5B/3B/7B/14B-Instruct models. We refer
459
+ to the corresponding models as LLaMA-Omni20.5B/1.5B/3B/7B/14B in the following sections.
460
+ For the text-to-speech language model, we initialize it with the Qwen2.5-0.5B model and set the
461
+ read-write strategy with R = 3 and W = 10.
462
+ We will discuss the impact of these hyperparameters on speech quality and response latency later.
463
+ The speech tokenizer, flow matching model, and
464
+ vocoder are directly taken from CosyVoice 2.
465
+
466
+ Spoken Question Answering The speech question answering (SpokenQA) task involves asking
467
+ the model spoken questions, then checking whether
468
+ the reference answer appears in the model’s response, and calculating the accuracy. We evaluate our model on two benchmarks: Llama Questions6 (Nachmani et al., 2024) and Web Questions7 (Berant et al., 2013). Since the questions in
469
+ the Web Questions dataset are in text form, we use
470
+ CosyVoice2-0.5B to synthesize them into speech.
471
+ Speech Instruction Following For the speech
472
+ instruction following task, we follow the settings
473
+ in Fang et al. (2025), selecting the helpful_base
474
+ and vicuna subsets from the Alpaca-Eval8 (Li et al.,
475
+ 2023) dataset, excluding math and code-related instructions. The remaining 199 instructions are then
476
+ synthesized into speech for evaluation. Following Fang et al. (2025), we evaluate the model using
477
+ the following metrics:
478
+ ChatGPT Score: To evaluate the model’s ability to follow instructions, we use GPT-4o (OpenAI,
479
+ 2024) to score the model’s responses. It considers
480
+ factors such as helpfulness, relevance, fluency, and
481
+ suitability for speech interaction scenarios, and assigns a single score between 1 and 5. The detailed
482
+ prompt can be found in Appendix A.
483
+ ASR-WER: To assess the consistency between the model's text and speech responses, we use Whisper-large-v3 to transcribe the speech response into text, and calculate the word error rate (WER) between the transcribed text and the text response. We perform text normalization9 before calculating the WER.
+ UTMOS: To evaluate the naturalness of the generated speech, we use the UTMOS model10 (Saeki et al., 2022) to predict the mean opinion score (MOS) of the generated speech.
+ Latency: We measure the time from receiving the speech instruction to generating the first speech chunk on a single NVIDIA L40 GPU.
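+ For concreteness, a minimal sketch of the ASR-WER computation described above, assuming the jiwer package for WER and the English text normalizer shipped with Whisper (tooling choices that are assumptions, not a statement of the authors' exact scripts):
+
+ import jiwer
+ from whisper.normalizers import EnglishTextNormalizer
+
+ normalize = EnglishTextNormalizer()
+
+ def asr_wer(text_response, transcribed_speech):
+     # Normalize casing, numbers, and punctuation on both sides before scoring.
+     return jiwer.wer(normalize(text_response), normalize(transcribed_speech))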
503
+
+ Table 1: Results on speech question answering and speech instruction following benchmarks. S2T and S2S represent speech-to-text and speech-to-speech, respectively. We set R = 3 and W = 10 for all LLaMA-Omni2 series models.
+
+ Training Details We use the 200K multi-turn speech-to-speech dialogue data from Section 3 for two-stage training. In Stage I(a), we freeze the speech encoder and train all parameters of the speech adaptor and LLM. The batch size is 32, and we train for 3 epochs with a peak learning rate of 5e-5. In Stage I(b), we train the text-to-speech language model with a batch size of 32 for 5 epochs and a peak learning rate of 5e-4. In Stage II, we freeze the speech encoder, speech adaptor, and LLM, and train the remaining components with a batch size of 32 for 1 epoch and a peak learning rate of 1e-3. For all stages, we use a warmup strategy for the first 3% of steps and a cosine annealing learning rate scheduler. The LLaMA-Omni2-14B model is trained on 4 NVIDIA H800 GPUs, while other models are trained on 4 NVIDIA L40 GPUs.
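+ The learning-rate schedule above (linear warmup over the first 3% of steps, then cosine annealing) can be written as a small function; the decay-to-zero floor is an assumption:
+
+ import math
+
+ def lr_at(step, total_steps, peak_lr, warmup_frac=0.03):
+     warmup_steps = max(1, int(total_steps * warmup_frac))
+     if step < warmup_steps:
+         return peak_lr * step / warmup_steps  # linear warmup to the peak
+     progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
+     return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))  # cosine annealing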
515
+ 4.2 Baseline Systems
+
+ We primarily compare LLaMA-Omni 2 with the following baseline systems:
+ LLaMA-Omni (Fang et al., 2025): One of the earliest SpeechLMs that achieves real-time speech interaction, by using a CTC-based (Graves et al., 2006) streaming speech decoder to simultaneously generate text and speech units. The generated units are fed into the vocoder for streaming synthesis in fixed-size chunks. We set the chunk size Ω = 40.
+ GLM-4-Voice (Zeng et al., 2024a): The current state-of-the-art native SpeechLM, pretrained on millions of hours of speech data. It enables real-time speech interaction by alternately generating text and speech tokens in a fixed ratio of 13:26. The generated speech tokens are input into a flow matching model with a fixed chunk size.
+ In addition, we also borrow some results from Zeng et al. (2024a), including results of TWIST (Hassid et al., 2024b), SpeechGPT (Zhang et al., 2023), Spectron (Nachmani et al., 2024), and Moshi (Défossez et al., 2024).
+
+ 5 Results and Analysis
+
+ 5.1 Main Results
+
+ Table 1 presents the main results on the speech question answering and speech instruction following benchmarks.
+ Spoken Question Answering For the SpokenQA task, we observe that: (1) For models with similar parameter sizes, LLaMA-Omni2-7B outperforms both GLM-4-Voice and LLaMA-Omni in both S2T and S2S settings. Notably, our model significantly reduces the gap between S2T and S2S performance. For example, on the Web Questions benchmark, GLM-4-Voice drops by 16.3 (32.2→15.9), LLaMA-Omni drops by 9.7 (33.4→23.7), while LLaMA-Omni2-7B only drops by 3.2 (34.5→31.3), demonstrating that our approach largely improves speech generation capabilities. (2) For models with varying parameter sizes, we observe that accuracy increases as the LLM size grows, indicating that LLaMA-Omni 2 effectively leverages the LLM's inherent capabilities. For smaller models, LLaMA-Omni2-1.5B/3B exceeds the accuracy of GLM-4-Voice and LLaMA-Omni in the S2S setting, making them suitable choices for edge devices. For larger models, we observe a significant accuracy improvement with LLaMA-Omni2-14B compared to LLaMA-Omni2-7B, highlighting the potential of our approach for scaling to larger models.
+ Speech Instruction Following For the speech instruction following task, we observe that: (1) LLaMA-Omni2-3B/7B/14B outperforms both GLM-4-Voice and LLaMA-Omni in the S2T and S2S settings, demonstrating the strong instruction-following capabilities of our models. (2) Similar to the results on the SpokenQA benchmarks, we observe that model performance improves as the LLM size increases, with LLaMA-Omni2-14B achieving significantly better performance. (3) The models' ASR-WER is generally low, significantly lower than previous models, proving that our models maintain strong consistency between the text and speech responses. (4) Regarding speech quality, thanks to CosyVoice 2's strong causal flow matching model, our models achieve good UTMOS scores under streaming synthesis, significantly outperforming the baseline models. (5) The latency of LLaMA-Omni 2 is around 600 ms. Although it is slightly higher than LLaMA-Omni, it still meets the requirements for real-time interaction and is significantly lower than that of GLM-4-Voice.
+
+ 9 https://github.com/openai/whisper/blob/main/whisper/normalizers/english.py
+ 10 https://github.com/tarepan/SpeechMOS
581
+
+ 5.2 Ablation Studies
+
+ To understand the impact of different factors on overall performance, we conduct a series of ablation studies on the LLaMA-Omni2-7B model.
+ Gate Fusion Module Table 2 shows the ablation study on the gate fusion module. The gate fusion module allows the model to adaptively fuse LLM hidden states and text embeddings, considering both contextual information and textual content. When the gate fusion module is removed and the two components are simply added together (e_i^hidden + e_i^emb) as input to the text-to-speech language model, we observe a decrease in performance. Further removing the text embedding and only inputting the hidden states (e_i^hidden) results in a further performance decline. This validates the effectiveness of adding text embeddings as input and adaptively fusing them with the gate fusion module.
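+ A minimal sketch of one plausible gate fusion layer is shown below; the exact parameterization is not spelled out in this section, so the single linear gate is an assumption:
+
+ import torch
+ import torch.nn as nn
+
+ class GateFusion(nn.Module):
+     """Adaptively fuse LLM hidden states h with text embeddings e (sketch)."""
+     def __init__(self, dim):
+         super().__init__()
+         self.gate = nn.Linear(2 * dim, dim)
+
+     def forward(self, h, e):  # h, e: (batch, seq, dim)
+         g = torch.sigmoid(self.gate(torch.cat([h, e], dim=-1)))  # per-dimension gate in (0, 1)
+         return g * h + (1.0 - g) * e  # the ablations instead use h + e, or h alone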
605
+
+ Table 3: Ablation study on different TTS pretraining strategies with LLaMA-Omni2-7B.
+
+ TTS Pretraining Our text-to-speech language model is initialized with the Qwen2.5-0.5B model and undergoes streaming TTS pretraining using text-speech pairs from the speech dialogue data in Stage I(b) (R = 3, W = 10). We also explore several other strategies, as shown in Table 3. "Offline TTS" refers to pretraining with the offline TTS task on top of Qwen2.5-0.5B, which shows a slight performance drop compared to the streaming TTS pretraining. "Text Pretrained" refers to directly initializing with Qwen2.5-0.5B (with the extended vocabulary including speech tokens), and we observe a significant performance decline. "Scratch" refers to a randomly initialized model, whose loss fails to converge within a short period. These experiments demonstrate the importance of pretraining for the TTS language model.
+
+ Table 4: Ablation study on the read/write strategy with LLaMA-Omni2-7B. "Offline" means generating speech tokens only after receiving the complete input, and then synthesizing all speech tokens into waveform at once.
+ [Only the R/W settings of Table 4 survived extraction: (R=1, W=5), (R=2, W=10), (R=3, W=10), (R=3, W=15), (R=4, W=15), (R=5, W=20), and Offline; the metric values are not recoverable.]
+
+ Read/Write Strategy The read/write strategy of the TTS language model is a key factor influencing performance, primarily affecting the speech quality and system response latency. As shown in Table 4, we explore different combinations of R and W. First, we observe that when R = 3 and W = 10, the ASR-WER is the lowest, indicating the best alignment between speech and text responses. As for the UTMOS score, we find that it is primarily determined by W, as W represents the chunk size of speech tokens input to the flow matching model, with larger chunk sizes leading to better speech quality. Regarding response latency, it is jointly determined by R and W, as shown in Equation 6. Without any engineering optimizations, LLaMA-Omni2-7B can achieve a latency below 500 ms. We choose R = 3 and W = 10 in our main experiments because it provides a good trade-off across all aspects.
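+ To make the R/W notation concrete, here is a small sketch of the alternating schedule: read R text tokens from the LLM, then write a chunk of W speech tokens for the flow matching model. It is a simplification that ignores end-of-sequence handling, and tts_write_step is a hypothetical callable:
+
+ def read_write_schedule(text_tokens, tts_write_step, R=3, W=10):
+     """Yield a W-token speech chunk after every R text tokens read (sketch)."""
+     read_buffer = []
+     for tok in text_tokens:  # text tokens stream in from the LLM
+         read_buffer.append(tok)
+         if len(read_buffer) == R:
+             yield tts_write_step(read_buffer, n_tokens=W)  # write W speech tokens
+             read_buffer = []
+     if read_buffer:  # flush any remaining text at the end of the response
+         yield tts_write_step(read_buffer, n_tokens=W)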
668
+
669
+ Table 5: Results under different training data sizes with LLaMA-Omni2-7B.
670
+
671
+ 5.3 Effects of the Training Data Sizes
+
+ We explore the impact of different training data sizes on performance. As shown in Table 5, we first observe that, with the same number of training samples, multi-turn dialogue data consistently achieves better results across all benchmarks compared to single-turn dialogue data, highlighting the effectiveness of multi-turn dialogue data for training. Additionally, for different training data sizes, we observe that as the data size increases, the model's performance improves, gradually stabilizing at 200K training samples. This indicates that our 200K multi-turn dialogue data is generally sufficient while ensuring efficient training.
+
+ 6 Related Work
+
+ With the rapid development of LLMs, SpeechLMs have gained widespread attention in recent years (Cui et al., 2024; Ji et al., 2024), aiming to endow LLMs with the ability to understand or generate speech. Generally speaking, SpeechLMs can be divided into two categories: native SpeechLMs and modular SpeechLMs. Native SpeechLMs refer to decoder-only Transformer models capable of directly inputting and outputting speech tokens. Some early works include SpeechGPT (Zhang et al., 2023, 2024a), AudioPaLM (Rubenstein et al., 2023), and TWIST (Hassid et al., 2024a). These models first convert speech into discrete tokens, then extend the vocabulary of pretrained LLMs to include these tokens, and finally train the LLMs using a large amount of speech or speech-text pair data. Spirit-LM (Nguyen et al., 2024) and GLM-4-Voice (Zeng et al., 2025, 2024a) propose training models using speech-text interleaved data to encourage cross-modal knowledge transfer. Moshi (Défossez et al., 2024), OmniFlatten (Zhang et al., 2024b) and LSLM (Ma et al., 2024a) propose models capable of full-duplex conversations. IntrinsicVoice (Zhang et al., 2024c) proposes a GroupFormer architecture to shorten speech length to be closer to that of text. In contrast to native SpeechLMs, modular SpeechLMs add speech-related modules on top of LLMs. Early works achieve speech understanding tasks by combining speech encoders with LLMs, but are unable to perform speech generation (Wu et al., 2023; Wang et al., 2023; Chu et al., 2023; Yu et al., 2024; Ma et al., 2024b; Hono et al., 2024; Chen et al., 2024b; Tang et al., 2024; Chu et al., 2024; Fathullah et al., 2024). To achieve speech generation, LLaMA-Omni (Fang et al., 2025), Freeze-Omni (Wang et al., 2024), and OpenOmni (Luo et al., 2025) add a speech decoder after LLMs. Mini-Omni (Xie and Wu, 2024) and SLAM-Omni (Chen et al., 2024a) enable LLMs to generate speech tokens simultaneously while generating text tokens. The most related work to ours is the concurrent work Minmo (Chen et al., 2025), which also adopts an autoregressive streaming speech decoder similar to CosyVoice 2. In comparison, Minmo is trained on 1.4M hours of data, while we train on only a few thousand hours of data, providing a more efficient training solution. Additionally, we conduct detailed ablation studies on LLM sizes, read-write strategies, and model architecture to offer a more comprehensive understanding of the model.
732
+
733
+ 7 Conclusion
+
+ In this paper, we introduce LLaMA-Omni 2, a series of speech language models ranging from 0.5B to 14B parameters, designed to enable real-time, high-quality speech interaction. LLaMA-Omni 2 achieves streaming speech generation by integrating an autoregressive text-to-speech language model and a causal flow matching model. Experimental results on spoken question answering and speech instruction following tasks show that LLaMA-Omni 2 outperforms previous state-of-the-art speech language models, including LLaMA-Omni and GLM-4-Voice. Additionally, LLaMA-Omni 2 can achieve latency under 600 ms, meeting real-time interaction requirements. We also conduct detailed ablation studies to understand the impact of various factors on overall performance. In the future, we will explore enhancing LLaMA-Omni 2 to generate more human-like speech, incorporating features such as emotion and dialects.
749
+
750
+ Limitations
+
+ One limitation of our model is that currently it cannot generate speech responses with different styles (such as emotion or speech rate) based on the content of the input speech or underlying paralinguistic information, as we have only trained on conventional speech-to-speech dialogue data. However, we believe this functionality can be achieved through a data-driven approach, as our model is end-to-end trained and could acquire this capability after further training with suitable data. We plan to explore this in the future.
+
+ Ethical Considerations
+
+ Since LLaMA-Omni 2 is built on LLMs, it carries some of the same risks as LLMs, such as the potential for factual errors or other hallucination issues in its outputs. We recommend that the model's outputs be checked in practical use to ensure they comply with the required standards.
+
+ References
+
+ Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, et al. 2024. Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms. arXiv preprint arXiv:2407.04051.
+ Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA. Association for Computational Linguistics.
+ Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, et al. 2025. Minmo: A multimodal large language model for seamless voice interaction. arXiv preprint arXiv:2501.06282.
+ Wenxi Chen, Ziyang Ma, Ruiqi Yan, Yuzhe Liang, Xiquan Li, Ruiyang Xu, Zhikang Niu, Yanqiao Zhu, Yifan Yang, Zhanxun Liu, et al. 2024a. Slam-omni: Timbre-controllable voice interaction system with single-stage training. arXiv preprint arXiv:2412.15649.
+ Xi Chen, Songyang Zhang, Qibing Bai, Kai Chen, and Satoshi Nakamura. 2024b. LLaST: Improved end-to-end speech translation system leveraged by large language models. In Findings of the Association for Computational Linguistics ACL 2024, pages 6976–6987, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
+ Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. 2023. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919.
+ Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou. 2024. Qwen2-audio technical report. arXiv preprint arXiv:2407.10759.
+ Leigh Clark, Philip Doyle, Diego Garaialde, Emer Gilmartin, Stephan Schlögl, Jens Edlund, Matthew Aylett, João Cabral, Cosmin Munteanu, Justin Edwards, and Benjamin R Cowan. 2019. The state of speech in HCI: Trends, themes and challenges. Interacting with Computers, 31(4):349–371.
+ Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, and Irwin King. 2024. Recent advances in speech language models: A survey. arXiv preprint arXiv:2410.03751.
+ Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. 2024. Moshi: a speech-text foundation model for real-time dialogue. Technical report.
+ Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.
+ Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. 2024. Cosyvoice 2: Scalable streaming speech synthesis with large language models. arXiv preprint arXiv:2412.10117.
+ Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
+ Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. 2025. LLaMA-omni: Seamless speech interaction with large language models. In The Thirteenth International Conference on Learning Representations.
+ Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Ke Li, Junteng Jia, Yuan Shangguan, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, and Mike Seltzer. 2024. Audiochatllama: Towards general-purpose speech abilities for llms. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5522–5532.
+ Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, page 369–376, New York, NY, USA. Association for Computing Machinery.
+ Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux, et al. 2024a. Textually pretrained speech language models. Advances in Neural Information Processing Systems, 36.
+ Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux, et al. 2024b. Textually pretrained speech language models. Advances in Neural Information Processing Systems, 36.
+ Yukiya Hono, Koh Mitsuda, Tianyu Zhao, Kentaro Mitsui, Toshiaki Wakatsuki, and Kei Sawada. 2024. Integrating pre-trained speech and language models for end-to-end speech recognition. In Findings of the Association for Computational Linguistics ACL 2024, pages 13289–13305, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
+ Shengpeng Ji, Yifu Chen, Minghui Fang, Jialong Zuo, Jingyu Lu, Hanting Wang, Ziyue Jiang, Long Zhou, Shujie Liu, Xize Cheng, et al. 2024. Wavchat: A survey of spoken dialogue models. arXiv preprint arXiv:2411.13577.
+ Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. In Advances in Neural Information Processing Systems, volume 33, pages 17022–17033. Curran Associates, Inc.
+ Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.
+ Shijia Liao, Yuxuan Wang, Tianyu Li, Yifan Cheng, Ruoyi Zhang, Rongzhi Zhou, and Yijin Xing. 2024. Fish-speech: Leveraging large language models for advanced multilingual text-to-speech synthesis. Preprint, arXiv:2411.01156.
+ Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. 2023. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations.
+ Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, et al. 2025. Openomni: Large language models pivot zero-shot omnimodal alignment across language with real-time self-aware emotional speech synthesis. arXiv preprint arXiv:2501.04561.
+ Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, and Xie Chen. 2024a. Language model can listen while speaking. arXiv preprint arXiv:2408.02622.
+ Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, et al. 2024b. An embarrassingly simple approach for llm with strong asr capacity. arXiv preprint arXiv:2402.08846.
+ Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. 2024. Finite scalar quantization: VQ-VAE made simple. In The Twelfth International Conference on Learning Representations.
+ Eliya Nachmani, Alon Levkovitch, Roy Hirsch, Julian Salazar, Chulayuth Asawaroengchai, Soroosh Mariooryad, Ehud Rivlin, RJ Skerry-Ryan, and Michelle Tadmor Ramanovich. 2024. Spoken question answering and speech continuation using spectrogram-powered LLM. In The Twelfth International Conference on Learning Representations.
+ Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R Costa-Jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, et al. 2024. Spirit-lm: Interleaved spoken and written language model. arXiv preprint arXiv:2402.05755.
+ Open-Moss. 2025. Speechgpt 2.0-preview. https://github.com/OpenMOSS/SpeechGPT-2.0-preview.
+ OpenAI. 2022. Introducing chatgpt.
+ OpenAI. 2024. Hello gpt-4o.
+ Qwen Team. 2024. Qwen2.5: A party of foundation models.
+ Alec Radford. 2018. Improving language understanding by generative pre-training.
+ Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR.
+ Paul K Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, et al. 2023. Audiopalm: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925.
+ Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari. 2022. Utmos: Utokyo-sarulab system for voicemos challenge 2022. In Interspeech 2022, pages 4521–4525.
+ Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. 2024. SALMONN: Towards generic hearing abilities for large language models. In The Twelfth International Conference on Learning Representations.
+ Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
+ A Vaswani. 2017. Attention is all you need. Advances in Neural Information Processing Systems.
+ Chen Wang, Minpeng Liao, Zhongqiang Huang, Jinliang Lu, Junhong Wu, Yuchen Liu, Chengqing Zong, and Jiajun Zhang. 2023. Blsp: Bootstrapping language-speech pre-training via behavior alignment of continuation writing. arXiv preprint arXiv:2309.00916.
+ Xiong Wang, Yangze Li, Chaoyou Fu, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, and Long Ma. 2024. Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm. arXiv preprint arXiv:2411.00774.
+ Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, et al. 2023. On decoder-only architecture for speech-to-text and large language model integration. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8. IEEE.
+ Zhifei Xie and Changqiao Wu. 2024. Mini-omni: Language models can hear, talk while thinking in streaming. arXiv preprint arXiv:2408.16725.
+ Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. 2024. Connecting speech encoder and large language model for asr. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12637–12641. IEEE.
+ Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. 2024a. Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot. Preprint, arXiv:2412.02612.
+ Aohan Zeng, Zhengxiao Du, Mingdao Liu, Lei Zhang, Shengmin Jiang, Yuxiao Dong, and Jie Tang. 2024b. Scaling speech-text pre-training with synthetic interleaved data. Preprint, arXiv:2411.17607.
+ Aohan Zeng, Zhengxiao Du, Mingdao Liu, Lei Zhang, Shengmin Jiang, Yuxiao Dong, and Jie Tang. 2025. Scaling speech-text pre-training with synthetic interleaved data. In The Thirteenth International Conference on Learning Representations.
+ Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. 2023. SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15757–15773, Singapore. Association for Computational Linguistics.
+ Dong Zhang, Xin Zhang, Jun Zhan, Shimin Li, Yaqian Zhou, and Xipeng Qiu. 2024a. Speechgpt-gen: Scaling chain-of-information speech generation. arXiv preprint arXiv:2401.13527.
+ Qinglin Zhang, Luyao Cheng, Chong Deng, Qian Chen, Wen Wang, Siqi Zheng, Jiaqing Liu, Hai Yu, Chaohong Tan, Zhihao Du, et al. 2024b. Omniflatten: An end-to-end gpt model for seamless voice conversation. arXiv preprint arXiv:2410.17799.
+ Xin Zhang, Xiang Lyu, Zhihao Du, Qian Chen, Dong Zhang, Hangrui Hu, Chaohong Tan, Tianyu Zhao, Yuxuan Wang, Bin Zhang, et al. 2024c. Intrinsicvoice: Empowering llms with intrinsic real-time voice interaction abilities. arXiv preprint arXiv:2410.08035.
1034
+
1035
+ A Prompt
1038
+
1039
+ Prompt for ChatGPT Scoring (Model: GPT-4o)
1040
+ I need your help to evaluate the performance of several
1041
+ models in a speech interaction scenario. The models receive the user’s speech input and respond with speech
1042
+ output. For evaluation purposes, both the user’s speech
1043
+ input and the model’s speech response have been transcribed into text using Automatic Speech Recognition
1044
+ (ASR). Your task is to rate the model’s responses based
1045
+ on the provided user input transcription [Instruction] and
1046
+ the model’s output transcription [Response]. Please consider factors such as helpfulness, relevance, fluency, and
1047
+ suitability for speech interaction in your evaluation, and
1048
+ provide a single score on a scale from 1 to 5.
1049
+ Below are the transcription of user’s instruction and models’ response:
1050
+ ### [Instruction]: {instruction}
1051
+ ### [Response]: {response}
1052
+ After evaluating, please output the scores in JSON format:
1053
+ {score: ...}. You don’t need to provide any explanations.
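+ A minimal sketch of how this prompt might be applied with the OpenAI Python SDK follows; the message framing, the placeholder substitution, and the JSON parsing are illustrative assumptions:
+
+ import json
+ from openai import OpenAI
+
+ client = OpenAI()  # reads OPENAI_API_KEY from the environment
+
+ def chatgpt_score(prompt_template, instruction, response):
+     content = (prompt_template
+                .replace("{instruction}", instruction)
+                .replace("{response}", response))
+     reply = client.chat.completions.create(
+         model="gpt-4o",
+         messages=[{"role": "user", "content": content}],
+     )
+     # Assumes the model follows the instruction and returns JSON like {"score": 4}.
+     return float(json.loads(reply.choices[0].message.content)["score"])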
1054
+
1055
+ B Detailed Latency
1058
+
1059
+ We list the detailed latency at different stages of
1060
+ the model in Table 6. “LLM” refers to the latency
1061
+ for generating the first R text tokens, “TTS” refers
1062
+ to the latency for generating the first W speech
1063
+ tokens, and “FM+Voc” refers to the latency for
1064
+ generating the first speech chunk using the flow
1065
+ matching model and vocoder.
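+ As a sketch of this decomposition, the first-chunk latency is the sum of the three stages; the three stage functions below are hypothetical placeholders, not APIs from this codebase:
+
+ import time
+
+ def first_chunk_latency(run_llm_first_r_tokens, run_tts_first_w_tokens, run_fm_and_vocoder):
+     """Time each stage of producing the first speech chunk (sketch)."""
+     stages = {}
+     t = time.perf_counter()
+     text = run_llm_first_r_tokens()
+     stages["LLM"] = time.perf_counter() - t
+     t = time.perf_counter()
+     speech_tokens = run_tts_first_w_tokens(text)
+     stages["TTS"] = time.perf_counter() - t
+     t = time.perf_counter()
+     chunk = run_fm_and_vocoder(speech_tokens)
+     stages["FM+Voc"] = time.perf_counter() - t
+     stages["Total"] = stages["LLM"] + stages["TTS"] + stages["FM+Voc"]
+     return stages, chunk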
CLAUDE.md ADDED
@@ -0,0 +1,215 @@
1
+
2
+ <!-- BACKLOG.MD GUIDELINES START -->
3
+ # Instructions for the usage of Backlog.md CLI Tool
4
+
5
+ ## 1. Source of Truth
6
+
7
+ - Tasks live under **`backlog/tasks/`** (drafts under **`backlog/drafts/`**).
8
+ - Every implementation decision starts with reading the corresponding Markdown task file.
9
+ - Project documentation is in **`backlog/docs/`**.
10
+ - Project decisions are in **`backlog/decisions/`**.
11
+
12
+ ## 2. Defining Tasks
13
+
14
+ ### Understand the Scope and the purpose
15
+
16
+ Ask questions to the user if something is not clear or ambiguous.
17
+ Break down the task into smaller, manageable parts if it is too large or complex.
18
+
19
+ ### **Title (one liner)**
20
+
21
+ Use a clear brief title that summarizes the task.
22
+
23
+ ### **Description**: (The **"why"**)
24
+
25
+ Provide a concise summary of the task purpose and its goal. Do not add implementation details here. It
26
+ should explain the purpose and context of the task. Code snippets should be avoided.
27
+
28
+ ### **Acceptance Criteria**: (The **"what"**)
29
+
30
+ List specific, measurable outcomes that define what it means to reach the goal from the description. Use checkboxes (
31
+ `- [ ]`) for tracking.
32
+ When defining `## Acceptance Criteria` for a task, focus on **outcomes, behaviors, and verifiable requirements** rather
33
+ than step-by-step implementation details.
34
+ Acceptance Criteria (AC) define *what* conditions must be met for the task to be considered complete.
35
+ They should be testable and confirm that the core purpose of the task is achieved.
36
+ **Key Principles for Good ACs:**
37
+
38
+ - **Outcome-Oriented:** Focus on the result, not the method.
39
+ - **Testable/Verifiable:** Each criterion should be something that can be objectively tested or verified.
40
+ - **Clear and Concise:** Unambiguous language.
41
+ - **Complete:** Collectively, ACs should cover the scope of the task.
42
+ - **User-Focused (where applicable):** Frame ACs from the perspective of the end-user or the system's external behavior.
43
+
44
+ - *Good Example:* "- [ ] User can successfully log in with valid credentials."
45
+ - *Good Example:* "- [ ] System processes 1000 requests per second without errors."
46
+ - *Bad Example (Implementation Step):* "- [ ] Add a new function `handleLogin()` in `auth.ts`."
47
+
48
+ ### Task file
49
+
50
+ Once a task is created it will be stored in `backlog/tasks/` directory as a Markdown file with the format
51
+ `task-<id> - <title>.md` (e.g. `task-42 - Add GraphQL resolver.md`).
52
+
53
+ ### Task Breakdown Strategy
54
+
55
+ When breaking down features:
56
+
57
+ 1. Identify the foundational components first
58
+ 2. Create tasks in dependency order (foundations before features)
59
+ 3. Ensure each task delivers value independently
60
+ 4. Avoid creating tasks that block each other
61
+
62
+ ### Additional task requirements
63
+
64
+ - Tasks must be **atomic** and **testable**. If a task is too large, break it down into smaller subtasks.
65
+ Each task should represent a single unit of work that can be completed in a single PR.
66
+
67
+ - **Never** reference tasks that are to be done in the future or that are not yet created. You can only reference
68
+ previous
69
+ tasks (id < current task id).
70
+
71
+ - When creating multiple tasks, ensure they are **independent** and they do not depend on future tasks.
72
+ Example of wrong tasks splitting: task 1: "Add API endpoint for user data", task 2: "Define the user model and DB
73
+ schema".
74
+ Example of correct tasks splitting: task 1: "Add system for handling API requests", task 2: "Add user model and DB
75
+ schema", task 3: "Add API endpoint for user data".
76
+
77
+ ## 3. Recommended Task Anatomy
78
+
79
+ ```markdown
80
+ # task‑42 - Add GraphQL resolver
81
+
82
+ ## Description (the why)
83
+
84
+ Short, imperative explanation of the goal of the task and why it is needed.
85
+
86
+ ## Acceptance Criteria (the what)
87
+
88
+ - [ ] Resolver returns correct data for happy path
89
+ - [ ] Error response matches REST
90
+ - [ ] P95 latency ≤ 50 ms under 100 RPS
91
+
92
+ ## Implementation Plan (the how) (added after putting the task in progress but before implementing any code change)
93
+
94
+ 1. Research existing GraphQL resolver patterns
95
+ 2. Implement basic resolver with error handling
96
+ 3. Add performance monitoring
97
+ 4. Write unit and integration tests
98
+ 5. Benchmark performance under load
99
+
100
+ ## Implementation Notes (imagine this is the PR description) (only added after finishing the code implementation of a task)
101
+
102
+ - Approach taken
103
+ - Features implemented or modified
104
+ - Technical decisions and trade-offs
105
+ - Modified or added files
106
+ ```
107
+
108
+ ## 4. Implementing Tasks
109
+
110
+ Mandatory sections for every task:
111
+
112
+ - **Implementation Plan**: (The **"how"**) Outline the steps to achieve the task. Because the implementation details may
113
+ change after the task is created, **the implementation plan must be added only after putting the task in progress**
114
+ and before starting working on the task.
115
+ - **Implementation Notes**: Start with a brief summary of what has been implemented. Document your approach, decisions, challenges, and any deviations from the plan. This
116
+ section is added after you are done working on the task. It should summarize what you did and why you did it. Keep it
117
+ concise but informative. Imagine this is the PR description. Make it brief, explain the core changes and assume that
118
+ others will read the code to understand the details.
119
+
120
+ **IMPORTANT**: Do not implement anything else that deviates from the **Acceptance Criteria**. If you need to
121
+ implement something that is not in the AC, update the AC first and then implement it or create a new task for it.
122
+
123
+ ## 5. Typical Workflow
124
+
125
+ ```bash
126
+ # 1 Identify work
127
+ backlog task list -s "To Do" --plain
128
+
129
+ # 2 Read details & documentation
130
+ backlog task 42 --plain
131
+ # Read also all documentation files in `backlog/docs/` directory.
132
+ # Read also all decision files in `backlog/decisions/` directory.
133
+
134
+ # 3 Start work: assign yourself & move column
135
+ backlog task edit 42 -a @{yourself} -s "In Progress"
136
+
137
+ # 4 Add implementation plan before starting
138
+ backlog task edit 42 --plan "1. Analyze current implementation\n2. Identify bottlenecks\n3. Refactor in phases"
139
+
140
+ # 5 Break work down if needed by creating subtasks or additional tasks
141
+ backlog task create "Refactor DB layer" -p 42 -a @{yourself} -d "Description" --ac "Tests pass,Performance improved"
142
+
143
+ # 6 Complete and mark Done
144
+ backlog task edit 42 -s Done --notes "Implemented GraphQL resolver with error handling and performance monitoring"
145
+ ```
146
+
147
+ ## 6. Final Steps Before Marking a Task as Done
148
+
149
+ Always ensure you have:
150
+
151
+ 1. ✅ Marked all acceptance criteria as completed (change `- [ ]` to `- [x]`)
152
+ 2. ✅ Added an `## Implementation Notes` section documenting your approach
153
+ 3. ✅ Run all tests and linting checks
154
+ 4. ✅ Updated relevant documentation
155
+
156
+ ## 7. Definition of Done (DoD)
157
+
158
+ A task is **Done** only when **ALL** of the following are complete:
159
+
160
+ 1. **Acceptance criteria** checklist in the task file is fully checked (all `- [ ]` changed to `- [x]`).
161
+ 2. **Implementation plan** was followed or deviations were documented in Implementation Notes.
162
+ 3. **Automated tests** (unit + integration) cover new logic.
163
+ 4. **Static analysis**: linter & formatter succeed.
164
+ 5. **Documentation**:
165
+ - All relevant docs updated (any relevant README file, backlog/docs, backlog/decisions, etc.).
166
+ - Task file **MUST** have an `## Implementation Notes` section added summarising:
167
+ - Approach taken
168
+ - Features implemented or modified
169
+ - Technical decisions and trade-offs
170
+ - Modified or added files
171
+ 6. **Review**: self review code.
172
+ 7. **Task hygiene**: status set to **Done** via CLI (`backlog task edit <id> -s Done`).
173
+ 8. **No regressions**: performance, security and licence checks green.
174
+
175
+ ⚠️ **IMPORTANT**: Never mark a task as Done without completing ALL items above.
176
+
177
+ ## 8. Handy CLI Commands
178
+
179
+ | Action | Example |
180
+ |-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
181
+ | Create task | `backlog task create "Add OAuth System"` |
182
+ | Create with description | `backlog task create "Feature" -d "Add authentication system"` |
183
+ | Create with assignee | `backlog task create "Feature" -a @sara` |
184
+ | Create with status | `backlog task create "Feature" -s "In Progress"` |
185
+ | Create with labels | `backlog task create "Feature" -l auth,backend` |
186
+ | Create with priority | `backlog task create "Feature" --priority high` |
187
+ | Create with plan | `backlog task create "Feature" --plan "1. Research\n2. Implement"` |
188
+ | Create with AC | `backlog task create "Feature" --ac "Must work,Must be tested"` |
189
+ | Create with notes | `backlog task create "Feature" --notes "Started initial research"` |
190
+ | Create with deps | `backlog task create "Feature" --dep task-1,task-2` |
191
+ | Create sub task | `backlog task create -p 14 "Add Login with Google"` |
192
+ | Create (all options) | `backlog task create "Feature" -d "Description" -a @sara -s "To Do" -l auth --priority high --ac "Must work" --notes "Initial setup done" --dep task-1 -p 14` |
193
+ | List tasks | `backlog task list [-s <status>] [-a <assignee>] [-p <parent>]` |
194
+ | List by parent | `backlog task list --parent 42` or `backlog task list -p task-42` |
195
+ | View detail | `backlog task 7` (interactive UI, press 'E' to edit in editor) |
196
+ | View (AI mode) | `backlog task 7 --plain` |
197
+ | Edit | `backlog task edit 7 -a @sara -l auth,backend` |
198
+ | Add plan | `backlog task edit 7 --plan "Implementation approach"` |
199
+ | Add AC | `backlog task edit 7 --ac "New criterion,Another one"` |
200
+ | Add notes | `backlog task edit 7 --notes "Completed X, working on Y"` |
201
+ | Add deps | `backlog task edit 7 --dep task-1 --dep task-2` |
202
+ | Archive | `backlog task archive 7` |
203
+ | Create draft | `backlog task create "Feature" --draft` |
204
+ | Draft flow | `backlog draft create "Spike GraphQL"` → `backlog draft promote 3.1` |
205
+ | Demote to draft | `backlog task demote <id>` |
206
+
207
+ Full help: `backlog --help`
208
+
209
+ ## 9. Tips for AI Agents
210
+
211
+ - **Always use `--plain` flag** when listing or viewing tasks for AI-friendly text output instead of using Backlog.md
212
+ interactive UI.
213
+ - When users mention creating a task, they mean creating it with the Backlog.md CLI tool.
214
+
215
+ <!-- BACKLOG.MD GUIDELINES END -->
COSYVOICE2_CHANGES.md ADDED
@@ -0,0 +1,87 @@
1
+ # CosyVoice2 Model Changes Documentation
2
+
3
+ ## Overview
4
+ This document captures the modifications made to the CosyVoice2 model integration for the LLaMA-Omni2 voice assistant system.
5
+
6
+ ## Key Changes
7
+
8
+ ### 1. Configuration Files
9
+ - **cosyvoice.yaml**: Primary configuration file used by the model
10
+ - **cosyvoice2.yaml**: Original CosyVoice2 configuration
11
+ - **cosyvoice_fixed.yaml**: Configuration with `mix_ratio` parameter removed to fix compatibility issues
12
+
13
+ ### 2. Model Files Structure
14
+ ```
15
+ models/cosyvoice2/
16
+ ├── CosyVoice-BlankEN/ # English tokenizer model
17
+ ├── campplus.onnx # Speaker embedding model
18
+ ├── flow.decoder.estimator.fp32.onnx # Flow decoder
19
+ ├── flow.pt # Flow model weights
20
+ ├── hift.pt # HiFi-GAN vocoder weights
21
+ ├── llm.pt # Language model weights
22
+ ├── speech_tokenizer_v1.onnx # Speech tokenizer v1
23
+ └── speech_tokenizer_v2.onnx # Speech tokenizer v2 (new addition)
24
+ ```
25
+
26
+ ### 3. Code Modifications
27
+
28
+ #### cosyvoice/flow/flow.py
29
+ - Modified to handle CosyVoice2 model architecture
30
+ - Updated MaskedDiffWithXvec class for compatibility
31
+ - Adjusted decoder configuration parameters
32
+
33
+ #### llama_omni2/serve/flow_inference.py
34
+ - Updated SpeechDecoder class to properly load CosyVoice2 models
35
+ - Changed configuration loading to use 'cosyvoice.yaml' instead of fallback logic
36
+ - Added support for speech_tokenizer_v2.onnx
37
+
38
+ ### 4. Integration Points
39
+ - **Model Path**: `models/cosyvoice2/` or `models/cosy2_decoder/`
40
+ - **Frontend**: CosyVoiceFrontEnd handles tokenization and feature extraction
41
+ - **Vocoder**: The CosyVoice2 model serves as the vocoder for gradio_web_server.py, selected via the `--vocoder-dir` flag
42
+
43
+ ## Setup Requirements
44
+
45
+ ### Model Download
46
+ ```bash
47
+ # Download CosyVoice2 model from HuggingFace
48
+ python -c "
49
+ from huggingface_hub import snapshot_download
50
+ snapshot_download(
51
+ repo_id='FunAudioLLM/CosyVoice2-0.5B',
52
+ local_dir='models/cosyvoice2',
53
+ local_dir_use_symlinks=False
54
+ )"
55
+ ```
56
+
57
+ ### Configuration Fix
58
+ The original cosyvoice2.yaml may contain a `mix_ratio` parameter that causes issues. This is fixed by:
59
+ 1. Copying cosyvoice2.yaml to cosyvoice.yaml
60
+ 2. Removing the mix_ratio parameter
61
+
62
+ ```bash
63
+ cp models/cosyvoice2/cosyvoice2.yaml models/cosyvoice2/cosyvoice.yaml
64
+ grep -v "mix_ratio" models/cosyvoice2/cosyvoice.yaml > models/cosyvoice2/cosyvoice_fixed.yaml
65
+ mv models/cosyvoice2/cosyvoice_fixed.yaml models/cosyvoice2/cosyvoice.yaml
66
+ ```
67
+
68
+ ## Usage in LLaMA-Omni2
69
+
70
+ Start the Gradio server with CosyVoice2 as vocoder:
71
+ ```bash
72
+ python -m llama_omni2.serve.gradio_web_server \
73
+ --controller http://localhost:10000 \
74
+ --port 8000 \
75
+ --vocoder-dir models/cosyvoice2
76
+ ```
77
+
78
+ ## Known Issues and Solutions
79
+
80
+ 1. **mix_ratio parameter error**: Remove from configuration file
81
+ 2. **Missing cosyvoice.yaml**: Copy from cosyvoice2.yaml
82
+ 3. **Tokenizer compatibility**: Ensure both v1 and v2 tokenizers are present
83
+
84
+ ## Performance Notes
85
+ - CosyVoice2-0.5B is optimized for faster inference
86
+ - Supports both Chinese and English text-to-speech
87
+ - Compatible with streaming generation for real-time applications
GEMINI.md ADDED
@@ -0,0 +1,215 @@
1
+
2
+ <!-- BACKLOG.MD GUIDELINES START -->
3
+ # Instructions for the usage of Backlog.md CLI Tool
4
+
5
+ ## 1. Source of Truth
6
+
7
+ - Tasks live under **`backlog/tasks/`** (drafts under **`backlog/drafts/`**).
8
+ - Every implementation decision starts with reading the corresponding Markdown task file.
9
+ - Project documentation is in **`backlog/docs/`**.
10
+ - Project decisions are in **`backlog/decisions/`**.
11
+
12
+ ## 2. Defining Tasks
13
+
14
+ ### Understand the Scope and the purpose
15
+
16
+ Ask questions to the user if something is not clear or ambiguous.
17
+ Break down the task into smaller, manageable parts if it is too large or complex.
18
+
19
+ ### **Title (one liner)**
20
+
21
+ Use a clear brief title that summarizes the task.
22
+
23
+ ### **Description**: (The **"why"**)
24
+
25
+ Provide a concise summary of the task purpose and its goal. Do not add implementation details here. It
26
+ should explain the purpose and context of the task. Code snippets should be avoided.
27
+
28
+ ### **Acceptance Criteria**: (The **"what"**)
29
+
30
+ List specific, measurable outcomes that define what it means to reach the goal from the description. Use checkboxes (
31
+ `- [ ]`) for tracking.
32
+ When defining `## Acceptance Criteria` for a task, focus on **outcomes, behaviors, and verifiable requirements** rather
33
+ than step-by-step implementation details.
34
+ Acceptance Criteria (AC) define *what* conditions must be met for the task to be considered complete.
35
+ They should be testable and confirm that the core purpose of the task is achieved.
36
+ **Key Principles for Good ACs:**
37
+
38
+ - **Outcome-Oriented:** Focus on the result, not the method.
39
+ - **Testable/Verifiable:** Each criterion should be something that can be objectively tested or verified.
40
+ - **Clear and Concise:** Unambiguous language.
41
+ - **Complete:** Collectively, ACs should cover the scope of the task.
42
+ - **User-Focused (where applicable):** Frame ACs from the perspective of the end-user or the system's external behavior.
43
+
44
+ - *Good Example:* "- [ ] User can successfully log in with valid credentials."
45
+ - *Good Example:* "- [ ] System processes 1000 requests per second without errors."
46
+ - *Bad Example (Implementation Step):* "- [ ] Add a new function `handleLogin()` in `auth.ts`."
47
+
48
+ ### Task file
49
+
50
+ Once a task is created it will be stored in `backlog/tasks/` directory as a Markdown file with the format
51
+ `task-<id> - <title>.md` (e.g. `task-42 - Add GraphQL resolver.md`).
52
+
53
+ ### Task Breakdown Strategy
54
+
55
+ When breaking down features:
56
+
57
+ 1. Identify the foundational components first
58
+ 2. Create tasks in dependency order (foundations before features)
59
+ 3. Ensure each task delivers value independently
60
+ 4. Avoid creating tasks that block each other
61
+
62
+ ### Additional task requirements
63
+
64
+ - Tasks must be **atomic** and **testable**. If a task is too large, break it down into smaller subtasks.
65
+ Each task should represent a single unit of work that can be completed in a single PR.
66
+
67
+ - **Never** reference tasks that are to be done in the future or that are not yet created. You can only reference
68
+ previous
69
+ tasks (id < current task id).
70
+
71
+ - When creating multiple tasks, ensure they are **independent** and they do not depend on future tasks.
72
+ Example of wrong tasks splitting: task 1: "Add API endpoint for user data", task 2: "Define the user model and DB
73
+ schema".
74
+ Example of correct tasks splitting: task 1: "Add system for handling API requests", task 2: "Add user model and DB
75
+ schema", task 3: "Add API endpoint for user data".
76
+
77
+ ## 3. Recommended Task Anatomy
78
+
79
+ ```markdown
80
+ # task‑42 - Add GraphQL resolver
81
+
82
+ ## Description (the why)
83
+
84
+ Short, imperative explanation of the goal of the task and why it is needed.
85
+
86
+ ## Acceptance Criteria (the what)
87
+
88
+ - [ ] Resolver returns correct data for happy path
89
+ - [ ] Error response matches REST
90
+ - [ ] P95 latency ≤ 50 ms under 100 RPS
91
+
92
+ ## Implementation Plan (the how) (added after putting the task in progress but before implementing any code change)
93
+
94
+ 1. Research existing GraphQL resolver patterns
95
+ 2. Implement basic resolver with error handling
96
+ 3. Add performance monitoring
97
+ 4. Write unit and integration tests
98
+ 5. Benchmark performance under load
99
+
100
+ ## Implementation Notes (imagine this is the PR description) (only added after finishing the code implementation of a task)
101
+
102
+ - Approach taken
103
+ - Features implemented or modified
104
+ - Technical decisions and trade-offs
105
+ - Modified or added files
106
+ ```
107
+
108
+ ## 4. Implementing Tasks
109
+
110
+ Mandatory sections for every task:
111
+
112
+ - **Implementation Plan**: (The **"how"**) Outline the steps to achieve the task. Because the implementation details may
113
+ change after the task is created, **the implementation plan must be added only after putting the task in progress**
114
+ and before starting working on the task.
115
+ - **Implementation Notes**: Start with a brief summary of what has been implemented. Document your approach, decisions, challenges, and any deviations from the plan. This
116
+ section is added after you are done working on the task. It should summarize what you did and why you did it. Keep it
117
+ concise but informative. Imagine this is the PR description. Make it brief, explain the core changes and assume that
118
+ others will read the code to understand the details.
119
+
120
+ **IMPORTANT**: Do not implement anything else that deviates from the **Acceptance Criteria**. If you need to
121
+ implement something that is not in the AC, update the AC first and then implement it or create a new task for it.
122
+
123
+ ## 5. Typical Workflow
124
+
125
+ ```bash
126
+ # 1 Identify work
127
+ backlog task list -s "To Do" --plain
128
+
129
+ # 2 Read details & documentation
130
+ backlog task 42 --plain
131
+ # Read also all documentation files in `backlog/docs/` directory.
132
+ # Read also all decision files in `backlog/decisions/` directory.
133
+
134
+ # 3 Start work: assign yourself & move column
135
+ backlog task edit 42 -a @{yourself} -s "In Progress"
136
+
137
+ # 4 Add implementation plan before starting
138
+ backlog task edit 42 --plan "1. Analyze current implementation\n2. Identify bottlenecks\n3. Refactor in phases"
139
+
140
+ # 5 Break work down if needed by creating subtasks or additional tasks
141
+ backlog task create "Refactor DB layer" -p 42 -a @{yourself} -d "Description" --ac "Tests pass,Performance improved"
142
+
143
+ # 6 Complete and mark Done
144
+ backlog task edit 42 -s Done --notes "Implemented GraphQL resolver with error handling and performance monitoring"
145
+ ```
146
+
147
+ ## 6. Final Steps Before Marking a Task as Done
148
+
149
+ Always ensure you have:
150
+
151
+ 1. ✅ Marked all acceptance criteria as completed (change `- [ ]` to `- [x]`)
152
+ 2. ✅ Added an `## Implementation Notes` section documenting your approach
153
+ 3. ✅ Run all tests and linting checks
154
+ 4. ✅ Updated relevant documentation
155
+
156
+ ## 7. Definition of Done (DoD)
157
+
158
+ A task is **Done** only when **ALL** of the following are complete:
159
+
160
+ 1. **Acceptance criteria** checklist in the task file is fully checked (all `- [ ]` changed to `- [x]`).
161
+ 2. **Implementation plan** was followed or deviations were documented in Implementation Notes.
162
+ 3. **Automated tests** (unit + integration) cover new logic.
163
+ 4. **Static analysis**: linter & formatter succeed.
164
+ 5. **Documentation**:
165
+ - All relevant docs updated (any relevant README file, backlog/docs, backlog/decisions, etc.).
166
+ - Task file **MUST** have an `## Implementation Notes` section added summarising:
167
+ - Approach taken
168
+ - Features implemented or modified
169
+ - Technical decisions and trade-offs
170
+ - Modified or added files
171
+ 6. **Review**: self review code.
172
+ 7. **Task hygiene**: status set to **Done** via CLI (`backlog task edit <id> -s Done`).
173
+ 8. **No regressions**: performance, security and licence checks green.
174
+
175
+ ⚠️ **IMPORTANT**: Never mark a task as Done without completing ALL items above.
176
+
177
+ ## 8. Handy CLI Commands
178
+
179
+ | Action | Example |
180
+ |-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
181
+ | Create task | `backlog task create "Add OAuth System"` |
182
+ | Create with description | `backlog task create "Feature" -d "Add authentication system"` |
183
+ | Create with assignee | `backlog task create "Feature" -a @sara` |
184
+ | Create with status | `backlog task create "Feature" -s "In Progress"` |
185
+ | Create with labels | `backlog task create "Feature" -l auth,backend` |
186
+ | Create with priority | `backlog task create "Feature" --priority high` |
187
+ | Create with plan | `backlog task create "Feature" --plan "1. Research\n2. Implement"` |
188
+ | Create with AC | `backlog task create "Feature" --ac "Must work,Must be tested"` |
189
+ | Create with notes | `backlog task create "Feature" --notes "Started initial research"` |
190
+ | Create with deps | `backlog task create "Feature" --dep task-1,task-2` |
191
+ | Create sub task | `backlog task create -p 14 "Add Login with Google"` |
192
+ | Create (all options) | `backlog task create "Feature" -d "Description" -a @sara -s "To Do" -l auth --priority high --ac "Must work" --notes "Initial setup done" --dep task-1 -p 14` |
193
+ | List tasks | `backlog task list [-s <status>] [-a <assignee>] [-p <parent>]` |
194
+ | List by parent | `backlog task list --parent 42` or `backlog task list -p task-42` |
195
+ | View detail | `backlog task 7` (interactive UI, press 'E' to edit in editor) |
196
+ | View (AI mode) | `backlog task 7 --plain` |
197
+ | Edit | `backlog task edit 7 -a @sara -l auth,backend` |
198
+ | Add plan | `backlog task edit 7 --plan "Implementation approach"` |
199
+ | Add AC | `backlog task edit 7 --ac "New criterion,Another one"` |
200
+ | Add notes | `backlog task edit 7 --notes "Completed X, working on Y"` |
201
+ | Add deps | `backlog task edit 7 --dep task-1 --dep task-2` |
202
+ | Archive | `backlog task archive 7` |
203
+ | Create draft | `backlog task create "Feature" --draft` |
204
+ | Draft flow | `backlog draft create "Spike GraphQL"` → `backlog draft promote 3.1` |
205
+ | Demote to draft | `backlog task demote <id>` |
206
+
207
+ Full help: `backlog --help`
208
+
209
+ ## 9. Tips for AI Agents
210
+
211
+ - **Always use `--plain` flag** when listing or viewing tasks for AI-friendly text output instead of using Backlog.md
212
+ interactive UI.
213
+ - When users mention creating a task, they mean creating it with the Backlog.md CLI tool.
214
+
215
+ <!-- BACKLOG.MD GUIDELINES END -->
LLaMA-Omni2-3B/README.md ADDED
@@ -0,0 +1,155 @@
+ # 🦙🎧 LLaMA-Omni 2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis
+
+ > **Authors: [Qingkai Fang](https://fangqingkai.github.io/), [Yan Zhou](https://zhouyan19.github.io/zhouyan/), [Shoutao Guo](https://scholar.google.com/citations?hl=en&user=XwHtPyAAAAAJ), [Shaolei Zhang](https://zhangshaolei1998.github.io/), [Yang Feng*](https://people.ucas.edu.cn/~yangfeng?language=en)**
+
+ [![arXiv](https://img.shields.io/badge/arXiv-2505.02625-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2505.02625)
+ [![code](https://img.shields.io/badge/Github-Code-keygen.svg?logo=github)](https://github.com/ictnlp/LLaMA-Omni2)
+ [![models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging_Face-Models-blue.svg)](https://huggingface.co/collections/ICTNLP/llama-omni-67fdfb852c60470175e36e9c)
+ [![dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging_Face-Dataset-blue.svg)](https://huggingface.co/datasets/ICTNLP/Multiturn-Speech-Conversations)
+
+ LLaMA-Omni 2 is a series of speech-language models built on the Qwen2.5-0.5B/1.5B/3B/7B/14B/32B-Instruct models. Like [LLaMA-Omni](https://github.com/ictnlp/LLaMA-Omni), it generates text and speech responses simultaneously, enabling high-quality, low-latency speech interaction. With the newly introduced streaming autoregressive speech decoder, LLaMA-Omni 2 achieves higher speech quality than LLaMA-Omni.
+
+ <div align="center"><img src="images/llama-omni2.png" width="75%"/></div>
+
+ ## 🔥 News
+
+ - [25/05] LLaMA-Omni 2 is accepted at the ACL 2025 main conference!
+
+ ## Install
+
+ 1. Clone this repository.
+
+ ```shell
+ git clone https://github.com/ictnlp/LLaMA-Omni2
+ cd LLaMA-Omni2
+ ```
+
+ 2. Install packages.
+
+ ```shell
+ conda create -n llama-omni2 python=3.10
+ conda activate llama-omni2
+ pip install -e .
+ ```
+
+ ## Quick Start
+
+ 1. Download the `Whisper-large-v3` model. Note that this snippet is Python, not shell; run it in a Python interpreter.
+
+ ```python
+ import whisper
+ model = whisper.load_model("large-v3", download_root="models/speech_encoder/")
+ ```
+
+ 2. Download the flow-matching model and vocoder of `CosyVoice 2`.
+
+ ```shell
+ huggingface-cli download --resume-download ICTNLP/cosy2_decoder --local-dir models/cosy2_decoder
+ ```
+
+ > [!Tip]
+ > If you’re experiencing unstable connections to Hugging Face from within China, you can try setting the following in your command line:
+ >
+ > ```shell
+ > export HF_ENDPOINT=https://hf-mirror.com
+ > ```
+
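+ The same download can also be driven from Python via `huggingface_hub` — a minimal sketch, not part of the original instructions, assuming `huggingface_hub` is installed:
+
+ ```python
+ # Minimal sketch: mirror the huggingface-cli command above from Python.
+ from huggingface_hub import snapshot_download
+
+ snapshot_download(
+     repo_id="ICTNLP/cosy2_decoder",   # same repo as the CLI command above
+     local_dir="models/cosy2_decoder",
+ )
+ ```
+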
+ 3. Download the LLaMA-Omni2 series models from Hugging Face. `LLaMA-Omni2-0.5B/1.5B/3B/7B/14B` support **English only**, while `LLaMA-Omni2-0.5B/1.5B/3B/7B/14B/32B-Bilingual` support **both English and Chinese**.
+
+ ```shell
+ model_name=LLaMA-Omni2-7B-Bilingual
+ huggingface-cli download --resume-download ICTNLP/$model_name --local-dir models/$model_name
+ ```
+
+ | LLaMA-Omni2 | LLaMA-Omni2-Bilingual |
+ |---|---|
+ | 🤗 [LLaMA-Omni2-0.5B](https://huggingface.co/ICTNLP/LLaMA-Omni2-0.5B) | 🤗 [LLaMA-Omni2-0.5B-Bilingual](https://huggingface.co/ICTNLP/LLaMA-Omni2-0.5B-Bilingual) |
+ | 🤗 [LLaMA-Omni2-1.5B](https://huggingface.co/ICTNLP/LLaMA-Omni2-1.5B) | 🤗 [LLaMA-Omni2-1.5B-Bilingual](https://huggingface.co/ICTNLP/LLaMA-Omni2-1.5B-Bilingual) |
+ | 🤗 [LLaMA-Omni2-3B](https://huggingface.co/ICTNLP/LLaMA-Omni2-3B) | 🤗 [LLaMA-Omni2-3B-Bilingual](https://huggingface.co/ICTNLP/LLaMA-Omni2-3B-Bilingual) |
+ | 🤗 [LLaMA-Omni2-7B](https://huggingface.co/ICTNLP/LLaMA-Omni2-7B) | 🤗 [LLaMA-Omni2-7B-Bilingual](https://huggingface.co/ICTNLP/LLaMA-Omni2-7B-Bilingual) |
+ | 🤗 [LLaMA-Omni2-14B](https://huggingface.co/ICTNLP/LLaMA-Omni2-14B) | 🤗 [LLaMA-Omni2-14B-Bilingual](https://huggingface.co/ICTNLP/LLaMA-Omni2-14B-Bilingual) |
+ | - | 🤗 [LLaMA-Omni2-32B-Bilingual](https://huggingface.co/ICTNLP/LLaMA-Omni2-32B-Bilingual) |
+
+ ## Gradio Demo
+
+ 1. Launch a controller.
+
+ ```shell
+ python -m llama_omni2.serve.controller --host 0.0.0.0 --port 10000
+ ```
+
+ 2. Launch a Gradio web server.
+
+ ```shell
+ python -m llama_omni2.serve.gradio_web_server --controller http://localhost:10000 --port 8000 --vocoder-dir models/cosy2_decoder
+ ```
+
+ 3. Launch a model worker.
+
+ ```shell
+ python -m llama_omni2.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path models/$model_name --model-name $model_name
+ ```
+
+ 4. Visit [http://localhost:8000/](http://localhost:8000/) and interact with LLaMA-Omni2!
+
+ ## Local Inference
+
+ ```shell
+ output_dir=examples/$model_name
+ mkdir -p $output_dir
+
+ python llama_omni2/inference/run_llama_omni2.py \
+     --model_path models/$model_name \
+     --question_file examples/questions.json \
+     --answer_file $output_dir/answers.jsonl \
+     --temperature 0 \
+     --s2s
+
+ python llama_omni2/inference/run_cosy2_decoder.py \
+     --input-path $output_dir/answers.jsonl \
+     --output-dir $output_dir/wav \
+     --lang en
+ ```
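+
+ To inspect the text answers before decoding them to speech, the JSONL output can be read line by line — a minimal sketch, assuming one JSON object per line (the exact field names are not documented here; check the keys actually written by `run_llama_omni2.py`):
+
+ ```python
+ # Minimal sketch: print the records in answers.jsonl.
+ import json
+
+ with open("examples/LLaMA-Omni2-7B-Bilingual/answers.jsonl") as f:
+     for line in f:
+         print(json.loads(line))  # inspect the available keys first
+ ```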
+
+ ## LICENSE
+
+ Our code is released under the Apache-2.0 License. Our model is intended for academic research purposes only and may **NOT** be used for commercial purposes.
+
+ You are free to use, modify, and distribute this model in academic settings, provided that the following conditions are met:
+
+ - **Non-commercial use**: The model may not be used for any commercial purposes.
+ - **Citation**: If you use this model in your research, please cite the original work.
+
+ ### Commercial Use Restriction
+
+ For any commercial use inquiries or to obtain a commercial license, please contact `fengyang@ict.ac.cn`.
+
+ ## Acknowledgements
+
+ - [CosyVoice 2](https://github.com/FunAudioLLM/CosyVoice): We use the pretrained speech tokenizer, flow-matching model, and vocoder of CosyVoice 2.
+ - [SLAM-LLM](https://github.com/X-LANCE/SLAM-LLM): We borrow some code for the speech encoder and speech adaptor.
+
+ ## Citation
+
+ If you have any questions, please feel free to submit an issue or contact `fangqingkai21b@ict.ac.cn`.
+
+ If our work is useful for you, please cite as:
+
+ ```
+ @inproceedings{
+ fang2025llamaomni2,
+ title={{LL}a{MA}-{O}mni 2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis},
+ author={Fang, Qingkai and Zhou, Yan and Guo, Shoutao and Zhang, Shaolei and Feng, Yang},
+ booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
+ year={2025}
+ }
+
+ @inproceedings{
+ fang2025llamaomni,
+ title={{LL}a{MA}-{O}mni: Seamless Speech Interaction with Large Language Models},
+ author={Qingkai Fang and Shoutao Guo and Yan Zhou and Zhengrui Ma and Shaolei Zhang and Yang Feng},
+ booktitle={The Thirteenth International Conference on Learning Representations},
+ year={2025},
+ url={https://openreview.net/forum?id=PYmrUQmMEw}
+ }
+ ```
LLaMA-Omni2-3B/added_tokens.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "</tool_call>": 151658,
+   "<speech>": 151665,
+   "<tool_call>": 151657,
+   "<|box_end|>": 151649,
+   "<|box_start|>": 151648,
+   "<|endoftext|>": 151643,
+   "<|file_sep|>": 151664,
+   "<|fim_middle|>": 151660,
+   "<|fim_pad|>": 151662,
+   "<|fim_prefix|>": 151659,
+   "<|fim_suffix|>": 151661,
+   "<|im_end|>": 151645,
+   "<|im_start|>": 151644,
+   "<|image_pad|>": 151655,
+   "<|object_ref_end|>": 151647,
+   "<|object_ref_start|>": 151646,
+   "<|quad_end|>": 151651,
+   "<|quad_start|>": 151650,
+   "<|repo_name|>": 151663,
+   "<|video_pad|>": 151656,
+   "<|vision_end|>": 151653,
+   "<|vision_pad|>": 151654,
+   "<|vision_start|>": 151652
+ }
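The `<speech>` entry at id 151665 is presumably the placeholder swapped for audio features at inference time. A quick sanity check that the tokenizer resolves it as expected — a minimal sketch, assuming `transformers` is installed and the files above live in `models/LLaMA-Omni2-3B`:

```python
# Minimal sketch: confirm <speech> maps to the id listed in added_tokens.json.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("models/LLaMA-Omni2-3B")
assert tok.convert_tokens_to_ids("<speech>") == 151665
```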
LLaMA-Omni2-3B/config.json ADDED
@@ -0,0 +1,65 @@
+ {
+   "_name_or_path": "LLaMA-Omni2-3B",
+   "architectures": [
+     "Omni2Speech2SQwen2ForCausalLM"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 151643,
+   "eos_token_id": 151645,
+   "hidden_act": "silu",
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "intermediate_size": 11008,
+   "max_position_embeddings": 32768,
+   "max_window_layers": 70,
+   "model_type": "omni2_speech2s_qwen2",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 36,
+   "num_key_value_heads": 2,
+   "rms_norm_eps": 1e-06,
+   "rope_theta": 1000000.0,
+   "sliding_window": null,
+   "speech_encoder": "models/speech_encoder/large-v3.pt",
+   "speech_encoder_ds_rate": 5,
+   "speech_encoder_hidden_size": 1280,
+   "speech_encoder_type": "whisper",
+   "speech_generator": {
+     "architectures": [
+       "Qwen2ForCausalLM"
+     ],
+     "attention_dropout": 0.0,
+     "bos_token_id": 151643,
+     "eos_token_id": 151643,
+     "hidden_act": "silu",
+     "hidden_size": 896,
+     "initializer_range": 0.02,
+     "intermediate_size": 4864,
+     "max_position_embeddings": 32768,
+     "max_window_layers": 24,
+     "model_type": "qwen2",
+     "num_attention_heads": 14,
+     "num_hidden_layers": 24,
+     "num_key_value_heads": 2,
+     "rms_norm_eps": 1e-06,
+     "rope_theta": 1000000.0,
+     "sliding_window": null,
+     "tie_word_embeddings": true,
+     "torch_dtype": "bfloat16",
+     "transformers_version": "4.43.4",
+     "use_cache": true,
+     "use_mrope": false,
+     "use_sliding_window": false,
+     "vocab_size": 158227
+   },
+   "speech_projector_type": "linear",
+   "stream_params": "(3,10)",
+   "tie_word_embeddings": true,
+   "tokenizer_model_max_length": 4096,
+   "tokenizer_padding_side": "right",
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.43.4",
+   "unit_vocab_size": 6561,
+   "use_cache": true,
+   "use_sliding_window": false,
+   "vocab_size": 151936
+ }
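A note on `config.json`: it nests a second, smaller Qwen2 config under `speech_generator` (the streaming TTS decoder). A minimal sketch for inspecting both levels with the standard library, assuming the file sits at the path used elsewhere in this repo:

```python
# Minimal sketch: compare the main LM config with the nested TTS decoder config.
import json

with open("models/LLaMA-Omni2-3B/config.json") as f:
    cfg = json.load(f)

print(cfg["hidden_size"], cfg["num_hidden_layers"])  # main LM: 2048, 36
gen = cfg["speech_generator"]
print(gen["hidden_size"], gen["num_hidden_layers"])  # speech decoder: 896, 24
print(cfg["unit_vocab_size"])                        # 6561 speech units; stream_params is parsed by the modeling code
```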
LLaMA-Omni2-3B/generation_config.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "attn_implementation": "flash_attention_2",
+   "bos_token_id": 151643,
+   "do_sample": true,
+   "eos_token_id": [
+     151645,
+     151643
+   ],
+   "pad_token_id": 151643,
+   "repetition_penalty": 1.05,
+   "temperature": 0.7,
+   "top_k": 20,
+   "top_p": 0.8,
+   "transformers_version": "4.43.4"
+ }
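Since this is a standard Transformers `generation_config.json`, it can be loaded directly — a minimal sketch, assuming `transformers` is installed and the same local model directory as above:

```python
# Minimal sketch: read the sampling defaults shipped with the model.
from transformers import GenerationConfig

gen_cfg = GenerationConfig.from_pretrained("models/LLaMA-Omni2-3B")
print(gen_cfg.temperature, gen_cfg.top_p, gen_cfg.top_k)  # 0.7, 0.8, 20
print(gen_cfg.eos_token_id)                               # [151645, 151643]
```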
LLaMA-Omni2-3B/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
LLaMA-Omni2-3B/model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cc4b7fda5d470353f675e0410724af00479eb09b3c81c5648bad35ac97904665
+ size 4957560304
LLaMA-Omni2-3B/model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:60dd465c6e6ceac492af19fbdbe11dd9fb4104b2a71c3825fba38a8a0427ed94
+ size 4455567096
LLaMA-Omni2-3B/model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
LLaMA-Omni2-3B/special_tokens_map.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "eos_token": {
+     "content": "<|im_end|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<|endoftext|>"
+ }
LLaMA-Omni2-3B/tokenizer_config.json ADDED
@@ -0,0 +1,216 @@
+ {
+   "add_bos_token": false,
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "151643": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151644": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151645": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151646": {
+       "content": "<|object_ref_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151647": {
+       "content": "<|object_ref_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151648": {
+       "content": "<|box_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151649": {
+       "content": "<|box_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151650": {
+       "content": "<|quad_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151651": {
+       "content": "<|quad_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151652": {
+       "content": "<|vision_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151653": {
+       "content": "<|vision_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151654": {
+       "content": "<|vision_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151655": {
+       "content": "<|image_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151656": {
+       "content": "<|video_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151657": {
+       "content": "<tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151658": {
+       "content": "</tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151659": {
+       "content": "<|fim_prefix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151660": {
+       "content": "<|fim_middle|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151661": {
+       "content": "<|fim_suffix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151662": {
+       "content": "<|fim_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151663": {
+       "content": "<|repo_name|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151664": {
+       "content": "<|file_sep|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151665": {
+       "content": "<speech>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     }
+   },
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "bos_token": null,
+   "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "model_max_length": 4096,
+   "pad_token": "<|endoftext|>",
+   "padding_side": "right",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null
+ }
LLaMA-Omni2-3B/tts_tokenizer/added_tokens.json ADDED
The diff for this file is too large to render. See raw diff
 
LLaMA-Omni2-3B/tts_tokenizer/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
LLaMA-Omni2-3B/tts_tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "eos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<|endoftext|>"
+ }
LLaMA-Omni2-3B/tts_tokenizer/tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff
 
LLaMA-Omni2-3B/tts_tokenizer/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
LLaMA-Omni2-3B/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
README.md ADDED
@@ -0,0 +1,124 @@
+ # 🎙️🤖 Goodspace Voice Agent: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis
+
+ > **Powered by advanced speech-language models and streaming synthesis technology**
+
+ [![code](https://img.shields.io/badge/Github-Code-keygen.svg?logo=github)](https://github.com/goodspace/voice-agent)
+ [![models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging_Face-Models-blue.svg)](https://huggingface.co/collections/goodspace/voice-agent)
+ [![dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging_Face-Dataset-blue.svg)](https://huggingface.co/datasets/goodspace/speech-conversations)
+
+ Goodspace Voice Agent is a series of speech-language models built on the Qwen2.5-0.5B/1.5B/3B/7B/14B/32B-Instruct models. It generates text and speech responses simultaneously, enabling high-quality, low-latency speech interaction. With the streaming autoregressive speech decoder, Goodspace Voice Agent achieves high speech quality and natural conversation flow.
+
+ <div align="center"><img src="images/llama-omni2.png" width="75%"/></div>
+
+ ## 🔥 News
+
+ - Goodspace Voice Agent - Advanced real-time voice interaction system now available!
+
+ ## Install
+
+ 1. Clone this repository.
+
+ ```shell
+ git clone https://github.com/goodspace/voice-agent
+ cd voice-agent
+ ```
+
+ 2. Install packages.
+
+ ```shell
+ conda create -n goodspace-voice python=3.10
+ conda activate goodspace-voice
+ pip install -e .
+ ```
+
+ ## Quick Start
+
+ 1. Download the `Whisper-large-v3` model. Note that this snippet is Python, not shell; run it in a Python interpreter.
+
+ ```python
+ import whisper
+ model = whisper.load_model("large-v3", download_root="models/speech_encoder/")
+ ```
+
+ 2. Download the flow-matching model and vocoder of `CosyVoice 2`.
+
+ ```shell
+ huggingface-cli download --resume-download goodspace/cosy2_decoder --local-dir models/cosy2_decoder
+ ```
+
+ > [!Tip]
+ > If you’re experiencing unstable connections to Hugging Face from within China, you can try setting the following in your command line:
+ >
+ > ```shell
+ > export HF_ENDPOINT=https://hf-mirror.com
+ > ```
+
+ 3. Download the Goodspace Voice Agent models from Hugging Face. `GoodspaceVoice-0.5B/1.5B/3B/7B/14B` support **English only**, while `GoodspaceVoice-0.5B/1.5B/3B/7B/14B/32B-Bilingual` support **both English and Chinese**.
+
+ ```shell
+ model_name=GoodspaceVoice-7B-Bilingual
+ huggingface-cli download --resume-download goodspace/$model_name --local-dir models/$model_name
+ ```
+
+ ## Gradio Demo
+
+ 1. Launch a controller.
+
+ ```shell
+ python -m goodspace_voice.serve.controller --host 0.0.0.0 --port 10000
+ ```
+
+ 2. Launch a Gradio web server.
+
+ ```shell
+ python -m goodspace_voice.serve.gradio_web_server --controller http://localhost:10000 --port 8000 --vocoder-dir models/cosy2_decoder
+ ```
+
+ 3. Launch a model worker.
+
+ ```shell
+ python -m goodspace_voice.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path models/$model_name --model-name $model_name
+ ```
+
+ 4. Visit [http://localhost:8000/](http://localhost:8000/) and interact with GoodspaceVoice!
+
+ ## Local Inference
+
+ ```shell
+ output_dir=examples/$model_name
+ mkdir -p $output_dir
+
+ python goodspace_voice/inference/run_goodspace_voice.py \
+     --model_path models/$model_name \
+     --question_file examples/questions.json \
+     --answer_file $output_dir/answers.jsonl \
+     --temperature 0 \
+     --s2s
+
+ python goodspace_voice/inference/run_cosy2_decoder.py \
+     --input-path $output_dir/answers.jsonl \
+     --output-dir $output_dir/wav \
+     --lang en
+ ```
+
+ ## LICENSE
+
+ The Goodspace Voice Agent is released under the Apache-2.0 License.
+
+ ### Commercial Use
+
+ For commercial use inquiries or licensing information, please contact the Goodspace team.
+
+ ## Acknowledgements
+
+ - [CosyVoice 2](https://github.com/FunAudioLLM/CosyVoice): We use the pretrained speech tokenizer, flow-matching model, and vocoder of CosyVoice 2.
+ - [SLAM-LLM](https://github.com/X-LANCE/SLAM-LLM): We borrow some code for the speech encoder and speech adaptor.
+ - Based on the research from the LLaMA-Omni2 paper.
+
+ ## Support
+
+ If you have any questions or issues, please feel free to submit an issue on our GitHub repository.
+
+ ## Contributing
+
+ We welcome contributions! Please see our contributing guidelines for more information.
SETUP_GUIDE.md ADDED
@@ -0,0 +1,274 @@
+ # LLaMA-Omni2 Voice Assistant Setup Guide
+
+ This guide provides comprehensive instructions for reproducing the exact environment and setup for the LLaMA-Omni2 voice assistant with CosyVoice2 integration.
+
+ ## Prerequisites
+
+ - Ubuntu/Linux system with a CUDA-capable GPU
+ - CUDA 12.1 or higher installed
+ - Miniconda or Anaconda installed
+ - At least 16GB RAM and 20GB free disk space
+ - Python 3.10
+
+ ## Environment Setup Options
+
+ ### Option 1: Using Conda Environment File (Recommended)
+
+ ```bash
+ # Create environment from comprehensive yml file
+ conda env create -f environment-comprehensive.yml
+
+ # Activate the environment
+ conda activate gsva-python310
+ ```
+
+ ### Option 2: Using Frozen Requirements
+
+ ```bash
+ # Create a new conda environment
+ conda create -n gsva-python310 python=3.10 -y
+ conda activate gsva-python310
+
+ # Install from frozen requirements
+ pip install -r requirements-frozen-new.txt
+ ```
+
+ ### Option 3: Manual Setup Using Script
+
+ ```bash
+ # Run the complete setup script
+ bash script.sh
+ ```
+
+ ## Detailed Manual Setup
+
+ ### 1. Create and Activate Conda Environment
+
+ ```bash
+ source /home/azureuser/miniconda3/etc/profile.d/conda.sh
+ conda create -n gsva-python310 python=3.10 -y
+ conda activate gsva-python310
+ ```
+
+ ### 2. Install Basic Dependencies
+
+ ```bash
+ pip install Cython numpy==1.26.4
+ pip install packaging wheel setuptools==69.5.1
+ ```
+
+ ### 3. Install the Package
+
+ ```bash
+ # Install in development mode
+ pip install -e .
+ ```
+
+ ### 4. Install Core Dependencies
+
+ ```bash
+ # Essential packages
+ pip install huggingface_hub==0.25.1
+ pip install uvicorn openai-whisper fastapi
+ pip install hf_transfer ninja
+
+ # Gradio for web interface
+ pip install gradio==5.3.0 gradio_client==1.4.2
+ ```
+
+ ### 5. Set Up CUDA Environment
+
+ ```bash
+ # Link the CUDA installation (12.6 in this setup; any 12.1+ install works the same way)
+ sudo rm -rf /usr/local/cuda
+ sudo ln -s /usr/local/cuda-12.6 /usr/local/cuda
+ export PATH=/usr/local/cuda/bin:$PATH
+ export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
+ ```
+
+ ### 6. Install PyTorch with CUDA Support
+
+ ```bash
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
+ ```
+
+ ### 7. Install Flash Attention
+
+ ```bash
+ MAX_JOBS=4 pip install flash-attn --no-build-isolation
+ ```
+
+ ### 8. Install Transformers and Audio Libraries
+
+ ```bash
+ # Specific version for LLaMA-Omni2 compatibility
+ pip install transformers==4.43.4
+
+ # Audio processing libraries
+ pip install matcha-tts --no-build-isolation
+ pip install git+https://github.com/FunAudioLLM/CosyVoice.git
+
+ # Additional dependencies
+ pip install conformer onnxruntime hyperpyyaml==1.2.2 ruamel.yaml
+ ```
+
+ ## Model Downloads
+
+ ### 1. Download LLaMA-Omni2 Model
+
+ ```bash
+ mkdir -p models
+ huggingface-cli download ICTNLP/LLaMA-Omni2-3B --local-dir models/LLaMA-Omni2-3B
+ ```
+
+ ### 2. Download CosyVoice2 Model
+
+ ```bash
+ mkdir -p models/cosyvoice2
+ python -c "
+ from huggingface_hub import snapshot_download
+ import os
+ os.makedirs('models/cosyvoice2', exist_ok=True)
+ snapshot_download(
+     repo_id='FunAudioLLM/CosyVoice2-0.5B',
+     local_dir='models/cosyvoice2',
+     local_dir_use_symlinks=False
+ )
+ "
+ ```
+
+ ### 3. Fix CosyVoice Configuration
+
+ ```bash
+ # Create backup
+ cp models/cosyvoice2/cosyvoice2.yaml models/cosyvoice2/cosyvoice2.yaml.backup
+
+ # Copy to expected filename
+ cp models/cosyvoice2/cosyvoice2.yaml models/cosyvoice2/cosyvoice.yaml
+
+ # Remove problematic parameter
+ grep -v "mix_ratio" models/cosyvoice2/cosyvoice.yaml > models/cosyvoice2/cosyvoice_fixed.yaml
+ mv models/cosyvoice2/cosyvoice_fixed.yaml models/cosyvoice2/cosyvoice.yaml
+ ```
154
+ ## Running the Services
155
+
156
+ ### 1. Start Controller
157
+
158
+ ```bash
159
+ nohup python -m llama_omni2.serve.controller \
160
+ --host 0.0.0.0 \
161
+ --port 10000 > controller.log 2>&1 &
162
+ ```
163
+
164
+ ### 2. Start Model Worker
165
+
166
+ ```bash
167
+ nohup python -m llama_omni2.serve.model_worker \
168
+ --host 0.0.0.0 \
169
+ --controller http://localhost:10000 \
170
+ --port 40000 \
171
+ --worker http://localhost:40000 \
172
+ --model-path models/LLaMA-Omni2-3B \
173
+ --model-name LLaMA-Omni2-3B > worker.log 2>&1 &
174
+ ```
175
+
176
+ ### 3. Start Gradio Web Server
177
+
178
+ With CosyVoice2 vocoder:
179
+ ```bash
180
+ python -m llama_omni2.serve.gradio_web_server \
181
+ --controller http://localhost:10000 \
182
+ --port 8000 \
183
+ --vocoder-dir models/cosyvoice2
184
+ ```
185
+
186
+ Without vocoder (fallback):
187
+ ```bash
188
+ python -m llama_omni2.serve.gradio_web_server \
189
+ --controller http://localhost:10000 \
190
+ --port 8000
191
+ ```
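+ Before opening the UI, you can verify that each service's port is actually listening — a minimal sketch using only the standard library (ports taken from the commands above):
+
+ ```python
+ # Minimal sketch: check that the controller, web server, and worker ports are open.
+ import socket
+
+ for name, port in [("controller", 10000), ("web server", 8000), ("worker", 40000)]:
+     with socket.socket() as s:
+         s.settimeout(1.0)
+         ok = s.connect_ex(("localhost", port)) == 0
+     print(f"{name} on port {port}: {'up' if ok else 'down'}")
+ ```
+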
192
+
193
+ ## Monitoring Services
194
+
195
+ ```bash
196
+ # Check controller logs
197
+ tail -f controller.log
198
+
199
+ # Check model worker logs
200
+ tail -f worker.log
201
+
202
+ # Access web UI
203
+ # Open browser at http://localhost:8000
204
+ ```
205
+
206
+ ## Troubleshooting
207
+
208
+ ### Common Issues
209
+
210
+ 1. **CUDA not found**: Ensure CUDA paths are exported correctly
211
+ 2. **Flash attention build fails**: Use `MAX_JOBS=4` to limit parallel compilation
212
+ 3. **CosyVoice mix_ratio error**: Follow the configuration fix steps above
213
+ 4. **Port already in use**: Kill existing processes or use different ports
214
+
215
+ ### Killing Services
216
+
217
+ ```bash
218
+ # Find and kill Python processes
219
+ ps aux | grep python | grep -E "(controller|model_worker|gradio_web_server)" | awk '{print $2}' | xargs -r kill
220
+ ```
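+
+ If `psutil` is available, the same cleanup can be done from Python with a bit more safety, since you can inspect each command line before killing — a minimal sketch, assuming `pip install psutil`:
+
+ ```python
+ # Minimal sketch: terminate the three LLaMA-Omni2 service processes.
+ import psutil
+
+ targets = ("controller", "model_worker", "gradio_web_server")
+ for proc in psutil.process_iter(["pid", "cmdline"]):
+     cmd = " ".join(proc.info["cmdline"] or [])
+     if "python" in cmd and any(t in cmd for t in targets):
+         print("killing", proc.info["pid"], cmd)
+         proc.terminate()
+ ```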
+
+ ## Project Structure
+
+ ```
+ voiceagents/
+ ├── llama_omni2/                   # Main application code
+ ├── cosyvoice/                     # CosyVoice integration
+ ├── models/                        # Downloaded models
+ │   ├── LLaMA-Omni2-3B/
+ │   └── cosyvoice2/
+ ├── examples/                      # Sample audio files
+ ├── script.sh                      # Setup script
+ ├── pyproject.toml                 # Project configuration
+ ├── requirements-frozen-new.txt    # Frozen dependencies
+ ├── environment-comprehensive.yml  # Conda environment
+ └── SETUP_GUIDE.md                 # This file
+ ```
+
+ ## Environment Variables
+
+ Set these in your `.bashrc` or `.zshrc`:
+
+ ```bash
+ export PATH=/usr/local/cuda/bin:$PATH
+ export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
+ export HF_HUB_ENABLE_HF_TRANSFER=1
+ export HF_HOME=~/.cache/huggingface
+ export TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6;8.9;9.0"
+ export MAX_JOBS=4
+ ```
+
+ ## Version Information
+
+ - Python: 3.10
+ - PyTorch: 2.3.1
+ - Transformers: 4.43.4
+ - Gradio: 5.3.0
+ - CUDA: 12.1+
+ - CosyVoice2: 0.5B model
+
+ ## Additional Notes
+
+ - The setup has been tested on Ubuntu with NVIDIA GPUs
+ - Ensure sufficient GPU memory (8GB+ recommended)
+ - For production deployment, consider using systemd services
+ - Regular backups of models and configurations are recommended
+
+ ## Support
+
+ For issues or questions:
+ - Check the logs in controller.log and worker.log
+ - Ensure all dependencies are correctly installed
+ - Verify CUDA is properly configured
+ - Review COSYVOICE2_CHANGES.md for model-specific details
controller.log.2025-08-16 ADDED
@@ -0,0 +1,6 @@
+ 2025-08-16 15:21:01 | INFO | controller | args: Namespace(host='0.0.0.0', port=10000, dispatch_method='shortest_queue')
+ 2025-08-16 15:21:01 | INFO | controller | Init controller
+ 2025-08-16 15:21:01 | ERROR | stderr | INFO: Started server process [32029]
+ 2025-08-16 15:21:01 | ERROR | stderr | INFO: Waiting for application startup.
+ 2025-08-16 15:21:01 | ERROR | stderr | INFO: Application startup complete.
+ 2025-08-16 15:21:01 | ERROR | stderr | INFO: Uvicorn running on http://0.0.0.0:10000 (Press CTRL+C to quit)
cosyvoice/__init__.py ADDED
File without changes
cosyvoice/bin/average_model.py ADDED
@@ -0,0 +1,92 @@
+ # Copyright (c) 2020 Mobvoi Inc (Di Wu)
+ # Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ import os
+ import argparse
+ import glob
+
+ import yaml
+ import torch
+
+
+ def get_args():
+     parser = argparse.ArgumentParser(description='average model')
+     parser.add_argument('--dst_model', required=True, help='averaged model')
+     parser.add_argument('--src_path',
+                         required=True,
+                         help='src model path for average')
+     parser.add_argument('--val_best',
+                         action="store_true",
+                         help='averaged model')
+     parser.add_argument('--num',
+                         default=5,
+                         type=int,
+                         help='nums for averaged model')
+
+     args = parser.parse_args()
+     print(args)
+     return args
+
+
+ def main():
+     args = get_args()
+     val_scores = []
+     if args.val_best:
+         yamls = glob.glob('{}/*.yaml'.format(args.src_path))
+         yamls = [
+             f for f in yamls
+             if not (os.path.basename(f).startswith('train')
+                     or os.path.basename(f).startswith('init'))
+         ]
+         for y in yamls:
+             with open(y, 'r') as f:
+                 dic_yaml = yaml.load(f, Loader=yaml.BaseLoader)
+                 loss = float(dic_yaml['loss_dict']['loss'])
+                 epoch = int(dic_yaml['epoch'])
+                 step = int(dic_yaml['step'])
+                 tag = dic_yaml['tag']
+                 val_scores += [[epoch, step, loss, tag]]
+         sorted_val_scores = sorted(val_scores,
+                                    key=lambda x: x[2],
+                                    reverse=False)
+         print("best val (epoch, step, loss, tag) = " +
+               str(sorted_val_scores[:args.num]))
+         path_list = [
+             args.src_path + '/epoch_{}_whole.pt'.format(score[0])
+             for score in sorted_val_scores[:args.num]
+         ]
+         print(path_list)
+     avg = {}
+     num = args.num
+     assert num == len(path_list)
+     for path in path_list:
+         print('Processing {}'.format(path))
+         states = torch.load(path, map_location=torch.device('cpu'))
+         for k in states.keys():
+             if k not in avg.keys():
+                 avg[k] = states[k].clone()
+             else:
+                 avg[k] += states[k]
+     # average
+     for k in avg.keys():
+         if avg[k] is not None:
+             # pytorch 1.6 use true_divide instead of /=
+             avg[k] = torch.true_divide(avg[k], num)
+     print('Saving to {}'.format(args.dst_model))
+     torch.save(avg, args.dst_model)
+
+
+ if __name__ == '__main__':
+     main()
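Usage note: `average_model.py` implements uniform state-dict averaging over the best-scoring checkpoints. A minimal standalone sketch of the same idea, with hypothetical checkpoint paths:

```python
# Minimal sketch of the uniform checkpoint averaging done above.
# The paths are hypothetical; any list of compatible state dicts works.
import torch

paths = ["epoch_3_whole.pt", "epoch_4_whole.pt"]
avg = {}
for p in paths:
    for k, v in torch.load(p, map_location="cpu").items():
        avg[k] = v.clone() if k not in avg else avg[k] + v
avg = {k: torch.true_divide(v, len(paths)) for k, v in avg.items()}
torch.save(avg, "averaged.pt")
```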
cosyvoice/bin/export_jit.py ADDED
@@ -0,0 +1,74 @@
+ # Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ from __future__ import print_function
+
+ import argparse
+ import logging
+ logging.getLogger('matplotlib').setLevel(logging.WARNING)
+ import os
+ import sys
+ import torch
+ ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
+ sys.path.append('{}/../..'.format(ROOT_DIR))
+ sys.path.append('{}/../../third_party/Matcha-TTS'.format(ROOT_DIR))
+ from cosyvoice.cli.cosyvoice import CosyVoice
+
+
+ def get_args():
+     parser = argparse.ArgumentParser(description='export your model for deployment')
+     parser.add_argument('--model_dir',
+                         type=str,
+                         default='pretrained_models/CosyVoice-300M',
+                         help='local path')
+     args = parser.parse_args()
+     print(args)
+     return args
+
+
+ def main():
+     args = get_args()
+     logging.basicConfig(level=logging.DEBUG,
+                         format='%(asctime)s %(levelname)s %(message)s')
+
+     torch._C._jit_set_fusion_strategy([('STATIC', 1)])
+     torch._C._jit_set_profiling_mode(False)
+     torch._C._jit_set_profiling_executor(False)
+
+     cosyvoice = CosyVoice(args.model_dir, load_jit=False, load_onnx=False)
+
+     # 1. export llm text_encoder
+     llm_text_encoder = cosyvoice.model.llm.text_encoder.half()
+     script = torch.jit.script(llm_text_encoder)
+     script = torch.jit.freeze(script)
+     script = torch.jit.optimize_for_inference(script)
+     script.save('{}/llm.text_encoder.fp16.zip'.format(args.model_dir))
+
+     # 2. export llm llm
+     llm_llm = cosyvoice.model.llm.llm.half()
+     script = torch.jit.script(llm_llm)
+     script = torch.jit.freeze(script, preserved_attrs=['forward_chunk'])
+     script = torch.jit.optimize_for_inference(script)
+     script.save('{}/llm.llm.fp16.zip'.format(args.model_dir))
+
+     # 3. export flow encoder
+     flow_encoder = cosyvoice.model.flow.encoder
+     script = torch.jit.script(flow_encoder)
+     script = torch.jit.freeze(script)
+     script = torch.jit.optimize_for_inference(script)
+     script.save('{}/flow.encoder.fp32.zip'.format(args.model_dir))
+
+
+ if __name__ == '__main__':
+     main()
cosyvoice/bin/export_onnx.py ADDED
@@ -0,0 +1,112 @@
+ # Copyright (c) 2024 Antgroup Inc (authors: Zhoubofan, hexisyztem@icloud.com)
+ # Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ from __future__ import print_function
+
+ import argparse
+ import logging
+ logging.getLogger('matplotlib').setLevel(logging.WARNING)
+ import os
+ import sys
+ import onnxruntime
+ import random
+ import torch
+ from tqdm import tqdm
+ ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
+ sys.path.append('{}/../..'.format(ROOT_DIR))
+ sys.path.append('{}/../../third_party/Matcha-TTS'.format(ROOT_DIR))
+ from cosyvoice.cli.cosyvoice import CosyVoice
+
+
+ def get_dummy_input(batch_size, seq_len, out_channels, device):
+     x = torch.rand((batch_size, out_channels, seq_len), dtype=torch.float32, device=device)
+     mask = torch.ones((batch_size, 1, seq_len), dtype=torch.float32, device=device)
+     mu = torch.rand((batch_size, out_channels, seq_len), dtype=torch.float32, device=device)
+     t = torch.rand((batch_size), dtype=torch.float32, device=device)
+     spks = torch.rand((batch_size, out_channels), dtype=torch.float32, device=device)
+     cond = torch.rand((batch_size, out_channels, seq_len), dtype=torch.float32, device=device)
+     return x, mask, mu, t, spks, cond
+
+
+ def get_args():
+     parser = argparse.ArgumentParser(description='export your model for deployment')
+     parser.add_argument('--model_dir',
+                         type=str,
+                         default='pretrained_models/CosyVoice-300M',
+                         help='local path')
+     args = parser.parse_args()
+     print(args)
+     return args
+
+
+ def main():
+     args = get_args()
+     logging.basicConfig(level=logging.DEBUG,
+                         format='%(asctime)s %(levelname)s %(message)s')
+
+     cosyvoice = CosyVoice(args.model_dir, load_jit=False, load_onnx=False)
+
+     # 1. export flow decoder estimator
+     estimator = cosyvoice.model.flow.decoder.estimator
+
+     device = cosyvoice.model.device
+     batch_size, seq_len = 1, 256
+     out_channels = cosyvoice.model.flow.decoder.estimator.out_channels
+     x, mask, mu, t, spks, cond = get_dummy_input(batch_size, seq_len, out_channels, device)
+     torch.onnx.export(
+         estimator,
+         (x, mask, mu, t, spks, cond),
+         '{}/flow.decoder.estimator.fp32.onnx'.format(args.model_dir),
+         export_params=True,
+         opset_version=18,
+         do_constant_folding=True,
+         input_names=['x', 'mask', 'mu', 't', 'spks', 'cond'],
+         output_names=['estimator_out'],
+         dynamic_axes={
+             'x': {0: 'batch_size', 2: 'seq_len'},
+             'mask': {0: 'batch_size', 2: 'seq_len'},
+             'mu': {0: 'batch_size', 2: 'seq_len'},
+             'cond': {0: 'batch_size', 2: 'seq_len'},
+             't': {0: 'batch_size'},
+             'spks': {0: 'batch_size'},
+             'estimator_out': {0: 'batch_size', 2: 'seq_len'},
+         }
+     )
+
+     # 2. test computation consistency
+     option = onnxruntime.SessionOptions()
+     option.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
+     option.intra_op_num_threads = 1
+     providers = ['CUDAExecutionProvider' if torch.cuda.is_available() else 'CPUExecutionProvider']
+     estimator_onnx = onnxruntime.InferenceSession('{}/flow.decoder.estimator.fp32.onnx'.format(args.model_dir),
+                                                   sess_options=option, providers=providers)
+
+     for _ in tqdm(range(10)):
+         x, mask, mu, t, spks, cond = get_dummy_input(random.randint(1, 6), random.randint(16, 512), out_channels, device)
+         output_pytorch = estimator(x, mask, mu, t, spks, cond)
+         ort_inputs = {
+             'x': x.cpu().numpy(),
+             'mask': mask.cpu().numpy(),
+             'mu': mu.cpu().numpy(),
+             't': t.cpu().numpy(),
+             'spks': spks.cpu().numpy(),
+             'cond': cond.cpu().numpy()
+         }
+         output_onnx = estimator_onnx.run(None, ort_inputs)[0]
+         torch.testing.assert_allclose(output_pytorch, torch.from_numpy(output_onnx).to(device), rtol=1e-2, atol=1e-4)
+
+
+ if __name__ == "__main__":
+     main()
cosyvoice/bin/export_trt.sh ADDED
@@ -0,0 +1,9 @@
+ #!/bin/bash
+ # Copyright 2024 Alibaba Inc. All Rights Reserved.
+ # Download TensorRT from https://developer.nvidia.com/tensorrt/download/10x; check your system and CUDA version for compatibility.
+ # For example, for Linux + CUDA 12.4 you can download https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.1/tars/TensorRT-10.0.1.6.Linux.x86_64-gnu.cuda-12.4.tar.gz
+ TRT_DIR=<YOUR_TRT_DIR>
+ MODEL_DIR=<COSYVOICE2_MODEL_DIR>
+
+ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$TRT_DIR/lib:/usr/local/cuda/lib64
+ $TRT_DIR/bin/trtexec --onnx=$MODEL_DIR/flow.decoder.estimator.fp32.onnx --saveEngine=$MODEL_DIR/flow.decoder.estimator.fp16.mygpu.plan --fp16 --minShapes=x:2x80x4,mask:2x1x4,mu:2x80x4,cond:2x80x4 --optShapes=x:2x80x193,mask:2x1x193,mu:2x80x193,cond:2x80x193 --maxShapes=x:2x80x6800,mask:2x1x6800,mu:2x80x6800,cond:2x80x6800 --inputIOFormats=fp16:chw,fp16:chw,fp16:chw,fp16:chw,fp16:chw,fp16:chw --outputIOFormats=fp16:chw
cosyvoice/bin/inference.py ADDED
@@ -0,0 +1,115 @@
+ # Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ from __future__ import print_function
+
+ import argparse
+ import logging
+ logging.getLogger('matplotlib').setLevel(logging.WARNING)
+ import os
+ import torch
+ from torch.utils.data import DataLoader
+ import torchaudio
+ from hyperpyyaml import load_hyperpyyaml
+ from tqdm import tqdm
+ from cosyvoice.cli.model import CosyVoiceModel
+ from cosyvoice.dataset.dataset import Dataset
+
+
+ def get_args():
+     parser = argparse.ArgumentParser(description='inference with your model')
+     parser.add_argument('--config', required=True, help='config file')
+     parser.add_argument('--prompt_data', required=True, help='prompt data file')
+     parser.add_argument('--prompt_utt2data', required=True, help='prompt data file')
+     parser.add_argument('--tts_text', required=True, help='tts input file')
+     parser.add_argument('--llm_model', required=True, help='llm model file')
+     parser.add_argument('--flow_model', required=True, help='flow model file')
+     parser.add_argument('--hifigan_model', required=True, help='hifigan model file')
+     parser.add_argument('--gpu',
+                         type=int,
+                         default=-1,
+                         help='gpu id for this rank, -1 for cpu')
+     parser.add_argument('--mode',
+                         default='sft',
+                         choices=['sft', 'zero_shot'],
+                         help='inference mode')
+     parser.add_argument('--result_dir', required=True, help='asr result file')
+     args = parser.parse_args()
+     print(args)
+     return args
+
+
+ def main():
+     args = get_args()
+     logging.basicConfig(level=logging.DEBUG,
+                         format='%(asctime)s %(levelname)s %(message)s')
+     os.environ['CUDA_VISIBLE_DEVICES'] = str(args.gpu)
+
+     # Init cosyvoice models from configs
+     use_cuda = args.gpu >= 0 and torch.cuda.is_available()
+     device = torch.device('cuda' if use_cuda else 'cpu')
+     with open(args.config, 'r') as f:
+         configs = load_hyperpyyaml(f)
+
+     model = CosyVoiceModel(configs['llm'], configs['flow'], configs['hift'])
+     model.load(args.llm_model, args.flow_model, args.hifigan_model)
+
+     test_dataset = Dataset(args.prompt_data, data_pipeline=configs['data_pipeline'], mode='inference', shuffle=False, partition=False,
+                            tts_file=args.tts_text, prompt_utt2data=args.prompt_utt2data)
+     test_data_loader = DataLoader(test_dataset, batch_size=None, num_workers=0)
+
+     del configs
+     os.makedirs(args.result_dir, exist_ok=True)
+     fn = os.path.join(args.result_dir, 'wav.scp')
+     f = open(fn, 'w')
+     with torch.no_grad():
+         for _, batch in tqdm(enumerate(test_data_loader)):
+             utts = batch["utts"]
+             assert len(utts) == 1, "inference mode only support batchsize 1"
+             text_token = batch["text_token"].to(device)
+             text_token_len = batch["text_token_len"].to(device)
+             tts_index = batch["tts_index"]
+             tts_text_token = batch["tts_text_token"].to(device)
+             tts_text_token_len = batch["tts_text_token_len"].to(device)
+             speech_token = batch["speech_token"].to(device)
+             speech_token_len = batch["speech_token_len"].to(device)
+             speech_feat = batch["speech_feat"].to(device)
+             speech_feat_len = batch["speech_feat_len"].to(device)
+             utt_embedding = batch["utt_embedding"].to(device)
+             spk_embedding = batch["spk_embedding"].to(device)
+             if args.mode == 'sft':
+                 model_input = {'text': tts_text_token, 'text_len': tts_text_token_len,
+                                'llm_embedding': spk_embedding, 'flow_embedding': spk_embedding}
+             else:
+                 model_input = {'text': tts_text_token, 'text_len': tts_text_token_len,
+                                'prompt_text': text_token, 'prompt_text_len': text_token_len,
+                                'llm_prompt_speech_token': speech_token, 'llm_prompt_speech_token_len': speech_token_len,
+                                'flow_prompt_speech_token': speech_token, 'flow_prompt_speech_token_len': speech_token_len,
+                                'prompt_speech_feat': speech_feat, 'prompt_speech_feat_len': speech_feat_len,
+                                'llm_embedding': utt_embedding, 'flow_embedding': utt_embedding}
+             tts_speeches = []
+             for model_output in model.tts(**model_input):
+                 tts_speeches.append(model_output['tts_speech'])
+             tts_speeches = torch.concat(tts_speeches, dim=1)
+             tts_key = '{}_{}'.format(utts[0], tts_index[0])
+             tts_fn = os.path.join(args.result_dir, '{}.wav'.format(tts_key))
+             torchaudio.save(tts_fn, tts_speeches, sample_rate=22050)
+             f.write('{} {}\n'.format(tts_key, tts_fn))
+             f.flush()
+     f.close()
+     logging.info('Result wav.scp saved in {}'.format(fn))
+
+
+ if __name__ == '__main__':
+     main()
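Usage note: `inference.py` writes a Kaldi-style `wav.scp` with one `key path` pair per line (per the `f.write` above). A minimal sketch for loading the results back, assuming `torchaudio` is installed and the directory matches the `--result_dir` used:

```python
# Minimal sketch: read the wav.scp written by inference.py and load each wav.
import torchaudio

with open("result_dir/wav.scp") as f:
    for line in f:
        key, path = line.strip().split(maxsplit=1)
        wav, sr = torchaudio.load(path)  # sr is 22050, per torchaudio.save above
        print(key, wav.shape, sr)
```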
cosyvoice/bin/train.py ADDED
@@ -0,0 +1,170 @@
1
+ # Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ from __future__ import print_function
+ import argparse
+ import datetime
+ import logging
+ logging.getLogger('matplotlib').setLevel(logging.WARNING)
+ from copy import deepcopy
+ import os
+ import torch
+ import torch.distributed as dist
+ import deepspeed
+
+ from hyperpyyaml import load_hyperpyyaml
+
+ from torch.distributed.elastic.multiprocessing.errors import record
+
+ from cosyvoice.utils.executor import Executor
+ from cosyvoice.utils.train_utils import (
+     init_distributed,
+     init_dataset_and_dataloader,
+     init_optimizer_and_scheduler,
+     init_summarywriter, save_model,
+     wrap_cuda_model, check_modify_and_save_config)
+
+
+ def get_args():
+     parser = argparse.ArgumentParser(description='training your network')
+     parser.add_argument('--train_engine',
+                         default='torch_ddp',
+                         choices=['torch_ddp', 'deepspeed'],
+                         help='Engine for parallel training')
+     parser.add_argument('--model', required=True, help='model which will be trained')
+     parser.add_argument('--config', required=True, help='config file')
+     parser.add_argument('--train_data', required=True, help='train data file')
+     parser.add_argument('--cv_data', required=True, help='cv data file')
+     parser.add_argument('--checkpoint', help='checkpoint model')
+     parser.add_argument('--model_dir', required=True, help='save model dir')
+     parser.add_argument('--tensorboard_dir',
+                         default='tensorboard',
+                         help='tensorboard log dir')
+     parser.add_argument('--ddp.dist_backend',
+                         dest='dist_backend',
+                         default='nccl',
+                         choices=['nccl', 'gloo'],
+                         help='distributed backend')
+     parser.add_argument('--num_workers',
+                         default=0,
+                         type=int,
+                         help='num of subprocess workers for reading')
+     parser.add_argument('--prefetch',
+                         default=100,
+                         type=int,
+                         help='prefetch number')
+     parser.add_argument('--pin_memory',
+                         action='store_true',
+                         default=False,
+                         help='Use pinned memory buffers for reading')
+     parser.add_argument('--use_amp',
+                         action='store_true',
+                         default=False,
+                         help='Use automatic mixed precision training')
+     parser.add_argument('--deepspeed.save_states',
+                         dest='save_states',
+                         default='model_only',
+                         choices=['model_only', 'model+optimizer'],
+                         help='save model/optimizer states')
+     parser.add_argument('--timeout',
+                         default=60,
+                         type=int,
+                         help='timeout (in seconds) of cosyvoice_join.')
+     parser = deepspeed.add_config_arguments(parser)
+     args = parser.parse_args()
+     return args
+
+
+ @record
+ def main():
+     args = get_args()
+     logging.basicConfig(level=logging.DEBUG,
+                         format='%(asctime)s %(levelname)s %(message)s')
+     # GAN training has some special initialization logic
+     gan = True if args.model == 'hifigan' else False
+
+     override_dict = {k: None for k in ['llm', 'flow', 'hift', 'hifigan'] if k != args.model}
+     if gan is True:
+         override_dict.pop('hift')
+     with open(args.config, 'r') as f:
+         configs = load_hyperpyyaml(f, overrides=override_dict)
+     if gan is True:
+         configs['train_conf'] = configs['train_conf_gan']
+     configs['train_conf'].update(vars(args))
+
+     # Init env for ddp
+     init_distributed(args)
+
+     # Get dataset & dataloader
+     train_dataset, cv_dataset, train_data_loader, cv_data_loader = \
+         init_dataset_and_dataloader(args, configs, gan)
+
+     # Do some sanity checks and save config to args.model_dir
+     configs = check_modify_and_save_config(args, configs)
+
+     # Tensorboard summary
+     writer = init_summarywriter(args)
+
+     # load checkpoint
+     model = configs[args.model]
+     start_step, start_epoch = 0, -1
+     if args.checkpoint is not None:
+         if os.path.exists(args.checkpoint):
+             state_dict = torch.load(args.checkpoint, map_location='cpu')
+             model.load_state_dict(state_dict, strict=False)
+             if 'step' in state_dict:
+                 start_step = state_dict['step']
+             if 'epoch' in state_dict:
+                 start_epoch = state_dict['epoch']
+         else:
+             logging.warning('checkpoint {} does not exist!'.format(args.checkpoint))
+
+     # Dispatch model from cpu to gpu
+     model = wrap_cuda_model(args, model)
+
+     # Get optimizer & scheduler
+     model, optimizer, scheduler, optimizer_d, scheduler_d = init_optimizer_and_scheduler(args, configs, model, gan)
+     scheduler.set_step(start_step)
+     if scheduler_d is not None:
+         scheduler_d.set_step(start_step)
+
+     # Save init checkpoints
+     info_dict = deepcopy(configs['train_conf'])
+     info_dict['step'] = start_step
+     info_dict['epoch'] = start_epoch
+     save_model(model, 'init', info_dict)
+
+     # Get executor
+     executor = Executor(gan=gan)
+     executor.step = start_step
+
+     # Init scaler, used for pytorch amp mixed precision training
+     scaler = torch.cuda.amp.GradScaler() if args.use_amp else None
+     print('start step {} start epoch {}'.format(start_step, start_epoch))
+     # Start training loop
+     for epoch in range(start_epoch + 1, info_dict['max_epoch']):
+         executor.epoch = epoch
+         train_dataset.set_epoch(epoch)
+         dist.barrier()
+         group_join = dist.new_group(backend="gloo", timeout=datetime.timedelta(seconds=args.timeout))
+         if gan is True:
+             executor.train_one_epoc_gan(model, optimizer, scheduler, optimizer_d, scheduler_d, train_data_loader, cv_data_loader,
+                                         writer, info_dict, scaler, group_join)
+         else:
+             executor.train_one_epoc(model, optimizer, scheduler, train_data_loader, cv_data_loader, writer, info_dict, scaler, group_join)
+         dist.destroy_process_group(group_join)
+
+
+ if __name__ == '__main__':
+     main()
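
Note: the resume logic above expects the checkpoint to be an ordinary state_dict that may additionally carry 'step' and 'epoch' entries; since it loads with strict=False, the extra keys are simply ignored by load_state_dict. A minimal sketch of writing a compatible checkpoint (the Linear module is a hypothetical stand-in for one of the CosyVoice sub-models):

import torch
import torch.nn as nn

model = nn.Linear(4, 4)          # hypothetical stand-in model
ckpt = dict(model.state_dict())  # plain weight dict
ckpt['step'] = 1000              # picked up as start_step on resume
ckpt['epoch'] = 2                # picked up as start_epoch on resume
torch.save(ckpt, 'checkpoint.pt')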
cosyvoice/cli/__init__.py ADDED
File without changes
cosyvoice/cli/cosyvoice.py ADDED
@@ -0,0 +1,170 @@
+ # Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ import os
+ import time
+ from tqdm import tqdm
+ from hyperpyyaml import load_hyperpyyaml
+ from modelscope import snapshot_download
+ import torch
+ from cosyvoice.cli.frontend import CosyVoiceFrontEnd
+ from cosyvoice.cli.model import CosyVoiceModel, CosyVoice2Model
+ from cosyvoice.utils.file_utils import logging
+
+
+ class CosyVoice:
+
+     def __init__(self, model_dir, load_jit=True, load_onnx=False, fp16=True):
+         instruct = True if '-Instruct' in model_dir else False
+         self.model_dir = model_dir
+         if not os.path.exists(model_dir):
+             model_dir = snapshot_download(model_dir)
+         with open('{}/cosyvoice.yaml'.format(model_dir), 'r') as f:
+             configs = load_hyperpyyaml(f)
+         self.frontend = CosyVoiceFrontEnd(configs['get_tokenizer'],
+                                           configs['feat_extractor'],
+                                           '{}/campplus.onnx'.format(model_dir),
+                                           '{}/speech_tokenizer_v1.onnx'.format(model_dir),
+                                           '{}/spk2info.pt'.format(model_dir),
+                                           instruct,
+                                           configs['allowed_special'])
+         self.sample_rate = configs['sample_rate']
+         if torch.cuda.is_available() is False and (fp16 is True or load_jit is True):
+             load_jit = False
+             fp16 = False
+             logging.warning('CPU does not support fp16 and JIT, forcing both to False')
+         self.model = CosyVoiceModel(configs['llm'], configs['flow'], configs['hift'], fp16)
+         self.model.load('{}/llm.pt'.format(model_dir),
+                         '{}/flow.pt'.format(model_dir),
+                         '{}/hift.pt'.format(model_dir))
+         if load_jit:
+             self.model.load_jit('{}/llm.text_encoder.fp16.zip'.format(model_dir),
+                                 '{}/llm.llm.fp16.zip'.format(model_dir),
+                                 '{}/flow.encoder.fp32.zip'.format(model_dir))
+         if load_onnx:
+             self.model.load_onnx('{}/flow.decoder.estimator.fp32.onnx'.format(model_dir))
+         del configs
+
+     def list_avaliable_spks(self):
+         spks = list(self.frontend.spk2info.keys())
+         return spks
+
+     def inference_sft(self, tts_text, spk_id, stream=False, speed=1.0):
+         for i in tqdm(self.frontend.text_normalize(tts_text, split=True)):
+             model_input = self.frontend.frontend_sft(i, spk_id)
+             start_time = time.time()
+             logging.info('synthesis text {}'.format(i))
+             for model_output in self.model.tts(**model_input, stream=stream, speed=speed):
+                 speech_len = model_output['tts_speech'].shape[1] / self.sample_rate
+                 logging.info('yield speech len {}, rtf {}'.format(speech_len, (time.time() - start_time) / speech_len))
+                 yield model_output
+                 start_time = time.time()
+
+     def inference_zero_shot(self, tts_text, prompt_text, prompt_speech_16k, stream=False, speed=1.0):
+         prompt_text = self.frontend.text_normalize(prompt_text, split=False)
+         for i in tqdm(self.frontend.text_normalize(tts_text, split=True)):
+             if len(i) < 0.5 * len(prompt_text):
+                 logging.warning('synthesis text {} is much shorter than prompt text {}, this may lead to bad performance'.format(i, prompt_text))
+             model_input = self.frontend.frontend_zero_shot(i, prompt_text, prompt_speech_16k, self.sample_rate)
+             start_time = time.time()
+             logging.info('synthesis text {}'.format(i))
+             for model_output in self.model.tts(**model_input, stream=stream, speed=speed):
+                 speech_len = model_output['tts_speech'].shape[1] / self.sample_rate
+                 logging.info('yield speech len {}, rtf {}'.format(speech_len, (time.time() - start_time) / speech_len))
+                 yield model_output
+                 start_time = time.time()
+
+     def inference_cross_lingual(self, tts_text, prompt_speech_16k, stream=False, speed=1.0):
+         if self.frontend.instruct is True and isinstance(self.model, CosyVoiceModel):
+             raise ValueError('{} does not support cross_lingual inference'.format(self.model_dir))
+         for i in tqdm(self.frontend.text_normalize(tts_text, split=True)):
+             model_input = self.frontend.frontend_cross_lingual(i, prompt_speech_16k, self.sample_rate)
+             start_time = time.time()
+             logging.info('synthesis text {}'.format(i))
+             for model_output in self.model.tts(**model_input, stream=stream, speed=speed):
+                 speech_len = model_output['tts_speech'].shape[1] / self.sample_rate
+                 logging.info('yield speech len {}, rtf {}'.format(speech_len, (time.time() - start_time) / speech_len))
+                 yield model_output
+                 start_time = time.time()
+
+     def inference_instruct(self, tts_text, spk_id, instruct_text, stream=False, speed=1.0):
+         assert isinstance(self.model, CosyVoiceModel)
+         if self.frontend.instruct is False:
+             raise ValueError('{} does not support instruct inference'.format(self.model_dir))
+         instruct_text = self.frontend.text_normalize(instruct_text, split=False)
+         for i in tqdm(self.frontend.text_normalize(tts_text, split=True)):
+             model_input = self.frontend.frontend_instruct(i, spk_id, instruct_text)
+             start_time = time.time()
+             logging.info('synthesis text {}'.format(i))
+             for model_output in self.model.tts(**model_input, stream=stream, speed=speed):
+                 speech_len = model_output['tts_speech'].shape[1] / self.sample_rate
+                 logging.info('yield speech len {}, rtf {}'.format(speech_len, (time.time() - start_time) / speech_len))
+                 yield model_output
+                 start_time = time.time()
+
+     def inference_instruct2(self, tts_text, instruct_text, prompt_speech_16k, stream=False, speed=1.0):
+         assert isinstance(self.model, CosyVoice2Model)
+         for i in tqdm(self.frontend.text_normalize(tts_text, split=True)):
+             model_input = self.frontend.frontend_instruct2(i, instruct_text, prompt_speech_16k, self.sample_rate)
+             start_time = time.time()
+             logging.info('synthesis text {}'.format(i))
+             for model_output in self.model.tts(**model_input, stream=stream, speed=speed):
+                 speech_len = model_output['tts_speech'].shape[1] / self.sample_rate
+                 logging.info('yield speech len {}, rtf {}'.format(speech_len, (time.time() - start_time) / speech_len))
+                 yield model_output
+                 start_time = time.time()
+
+     def inference_vc(self, source_speech_16k, prompt_speech_16k, stream=False, speed=1.0):
+         model_input = self.frontend.frontend_vc(source_speech_16k, prompt_speech_16k, self.sample_rate)
+         start_time = time.time()
+         for model_output in self.model.vc(**model_input, stream=stream, speed=speed):
+             speech_len = model_output['tts_speech'].shape[1] / self.sample_rate
+             logging.info('yield speech len {}, rtf {}'.format(speech_len, (time.time() - start_time) / speech_len))
+             yield model_output
+             start_time = time.time()
+
+
+ class CosyVoice2(CosyVoice):
+
+     def __init__(self, model_dir, load_jit=False, load_onnx=False, load_trt=False):
+         instruct = True if '-Instruct' in model_dir else False
+         self.model_dir = model_dir
+         if not os.path.exists(model_dir):
+             model_dir = snapshot_download(model_dir)
+         with open('{}/cosyvoice.yaml'.format(model_dir), 'r') as f:
+             configs = load_hyperpyyaml(f, overrides={'qwen_pretrain_path': os.path.join(model_dir, 'CosyVoice-BlankEN')})
+         self.frontend = CosyVoiceFrontEnd(configs['get_tokenizer'],
+                                           configs['feat_extractor'],
+                                           '{}/campplus.onnx'.format(model_dir),
+                                           '{}/speech_tokenizer_v2.onnx'.format(model_dir),
+                                           '{}/spk2info.pt'.format(model_dir),
+                                           instruct,
+                                           configs['allowed_special'])
+         self.sample_rate = configs['sample_rate']
+         if torch.cuda.is_available() is False and load_jit is True:
+             load_jit = False
+             logging.warning('CPU does not support JIT, forcing load_jit to False')
+         self.model = CosyVoice2Model(configs['llm'], configs['flow'], configs['hift'])
+         self.model.load('{}/llm.pt'.format(model_dir),
+                         '{}/flow.pt'.format(model_dir),
+                         '{}/hift.pt'.format(model_dir))
+         if load_jit:
+             self.model.load_jit('{}/flow.encoder.fp32.zip'.format(model_dir))
+         if load_trt is True and load_onnx is True:
+             load_onnx = False
+             logging.warning('cannot set both load_trt and load_onnx to True, forcing load_onnx to False')
+         if load_onnx:
+             self.model.load_onnx('{}/flow.decoder.estimator.fp32.onnx'.format(model_dir))
+         if load_trt:
+             self.model.load_trt('{}/flow.decoder.estimator.fp16.Volta.plan'.format(model_dir))
+         del configs
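
Note: all inference_* methods above are generators, yielding one audio chunk per normalized text segment. A usage sketch (the model directory and speaker id are placeholders; substitute a pretrained model you actually have):

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice

# placeholder model dir; it is fetched via modelscope snapshot_download if absent locally
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT', load_jit=False, fp16=False)
print(cosyvoice.list_avaliable_spks())
# each yielded dict carries a (1, num_samples) 'tts_speech' waveform tensor
for i, out in enumerate(cosyvoice.inference_sft('Hello there.', '中文女', stream=False)):
    torchaudio.save('sft_{}.wav'.format(i), out['tts_speech'], cosyvoice.sample_rate)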
cosyvoice/cli/frontend.py ADDED
@@ -0,0 +1,217 @@
+ # Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ from functools import partial
+ import json
+ import onnxruntime
+ import torch
+ import numpy as np
+ import whisper
+ from typing import Callable
+ import torchaudio.compliance.kaldi as kaldi
+ import torchaudio
+ import os
+ import re
+ import inflect
+ try:
+     import ttsfrd
+     use_ttsfrd = True
+ except ImportError:
+     print("failed to import ttsfrd, using WeTextProcessing instead")
+     from tn.chinese.normalizer import Normalizer as ZhNormalizer
+     from tn.english.normalizer import Normalizer as EnNormalizer
+     use_ttsfrd = False
+ from cosyvoice.utils.frontend_utils import contains_chinese, replace_blank, replace_corner_mark, remove_bracket, spell_out_number, split_paragraph
+
+
+ class CosyVoiceFrontEnd:
+
+     def __init__(self,
+                  get_tokenizer: Callable,
+                  feat_extractor: Callable,
+                  campplus_model: str,
+                  speech_tokenizer_model: str,
+                  spk2info: str = '',
+                  instruct: bool = False,
+                  allowed_special: str = 'all'):
+         self.tokenizer = get_tokenizer()
+         self.feat_extractor = feat_extractor
+         self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+         option = onnxruntime.SessionOptions()
+         option.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
+         option.intra_op_num_threads = 1
+         self.campplus_session = onnxruntime.InferenceSession(campplus_model, sess_options=option, providers=["CPUExecutionProvider"])
+         self.speech_tokenizer_session = onnxruntime.InferenceSession(speech_tokenizer_model, sess_options=option,
+                                                                      providers=["CUDAExecutionProvider" if torch.cuda.is_available() else
+                                                                                 "CPUExecutionProvider"])
+         if os.path.exists(spk2info):
+             self.spk2info = torch.load(spk2info, map_location=self.device)
+         else:
+             self.spk2info = {}
+         self.instruct = instruct
+         self.allowed_special = allowed_special
+         self.inflect_parser = inflect.engine()
+         self.use_ttsfrd = use_ttsfrd
+         if self.use_ttsfrd:
+             self.frd = ttsfrd.TtsFrontendEngine()
+             ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
+             assert self.frd.initialize('{}/../../pretrained_models/CosyVoice-ttsfrd/resource'.format(ROOT_DIR)) is True, \
+                 'failed to initialize ttsfrd resource'
+             self.frd.set_lang_type('pinyinvg')
+         else:
+             self.zh_tn_model = ZhNormalizer(remove_erhua=False, full_to_half=False)
+             self.en_tn_model = EnNormalizer()
+
+     def _extract_text_token(self, text):
+         text_token = self.tokenizer.encode(text, allowed_special=self.allowed_special)
+         text_token = torch.tensor([text_token], dtype=torch.int32).to(self.device)
+         text_token_len = torch.tensor([text_token.shape[1]], dtype=torch.int32).to(self.device)
+         return text_token, text_token_len
+
+     def _extract_speech_token(self, speech):
+         assert speech.shape[1] / 16000 <= 30, 'extracting speech tokens from audio longer than 30s is not supported'
+         feat = whisper.log_mel_spectrogram(speech, n_mels=128)
+         speech_token = self.speech_tokenizer_session.run(None,
+                                                          {self.speech_tokenizer_session.get_inputs()[0].name:
+                                                           feat.detach().cpu().numpy(),
+                                                           self.speech_tokenizer_session.get_inputs()[1].name:
+                                                           np.array([feat.shape[2]], dtype=np.int32)})[0].flatten().tolist()
+         speech_token = torch.tensor([speech_token], dtype=torch.int32).to(self.device)
+         speech_token_len = torch.tensor([speech_token.shape[1]], dtype=torch.int32).to(self.device)
+         return speech_token, speech_token_len
+
+     def _extract_spk_embedding(self, speech):
+         feat = kaldi.fbank(speech,
+                            num_mel_bins=80,
+                            dither=0,
+                            sample_frequency=16000)
+         feat = feat - feat.mean(dim=0, keepdim=True)
+         embedding = self.campplus_session.run(None,
+                                               {self.campplus_session.get_inputs()[0].name: feat.unsqueeze(dim=0).cpu().numpy()})[0].flatten().tolist()
+         embedding = torch.tensor([embedding]).to(self.device)
+         return embedding
+
+     def _extract_speech_feat(self, speech):
+         speech_feat = self.feat_extractor(speech).squeeze(dim=0).transpose(0, 1).to(self.device)
+         speech_feat = speech_feat.unsqueeze(dim=0)
+         speech_feat_len = torch.tensor([speech_feat.shape[1]], dtype=torch.int32).to(self.device)
+         return speech_feat, speech_feat_len
+
+     def text_normalize(self, text, split=True):
+         text = text.strip()
+         # NOTE(lyuxiang.lx) move this judgement into ttsfrd in the future
+         for token in self.tokenizer.special_tokens['additional_special_tokens']:
+             if token in text:
+                 return text if split is False else [text]
+         if contains_chinese(text):
+             if self.use_ttsfrd:
+                 texts = [i["text"] for i in json.loads(self.frd.do_voicegen_frd(text))["sentences"]]
+                 text = ''.join(texts)
+             else:
+                 text = self.zh_tn_model.normalize(text)
+                 text = text.replace("\n", "")
+                 text = replace_blank(text)
+                 text = replace_corner_mark(text)
+                 text = text.replace(".", "。")
+                 text = text.replace(" - ", ",")
+                 text = remove_bracket(text)
+                 text = re.sub(r'[,,、]+$', '。', text)
+             texts = list(split_paragraph(text, partial(self.tokenizer.encode, allowed_special=self.allowed_special), "zh", token_max_n=80,
+                                          token_min_n=60, merge_len=20, comma_split=False))
+         else:
+             if self.use_ttsfrd:
+                 texts = [i["text"] for i in json.loads(self.frd.do_voicegen_frd(text))["sentences"]]
+                 text = ''.join(texts)
+             else:
+                 text = self.en_tn_model.normalize(text)
+                 text = spell_out_number(text, self.inflect_parser)
+             texts = list(split_paragraph(text, partial(self.tokenizer.encode, allowed_special=self.allowed_special), "en", token_max_n=80,
+                                          token_min_n=60, merge_len=20, comma_split=False))
+         if split is False:
+             return text
+         return texts
+
+     def frontend_sft(self, tts_text, spk_id):
+         tts_text_token, tts_text_token_len = self._extract_text_token(tts_text)
+         embedding = self.spk2info[spk_id]['embedding']
+         model_input = {'text': tts_text_token, 'text_len': tts_text_token_len, 'llm_embedding': embedding, 'flow_embedding': embedding}
+         return model_input
+
+     def frontend_zero_shot(self, tts_text, prompt_text, prompt_speech_16k, resample_rate):
+         tts_text_token, tts_text_token_len = self._extract_text_token(tts_text)
+         prompt_text_token, prompt_text_token_len = self._extract_text_token(prompt_text)
+         prompt_speech_resample = torchaudio.transforms.Resample(orig_freq=16000, new_freq=resample_rate)(prompt_speech_16k)
+         speech_feat, speech_feat_len = self._extract_speech_feat(prompt_speech_resample)
+         speech_token, speech_token_len = self._extract_speech_token(prompt_speech_16k)
+         if resample_rate == 24000:
+             # cosyvoice2: force len(speech_feat) to be exactly 2 * len(speech_token)
+             token_len = min(int(speech_feat.shape[1] / 2), speech_token.shape[1])
+             speech_feat, speech_feat_len[:] = speech_feat[:, :2 * token_len], 2 * token_len
+             speech_token, speech_token_len[:] = speech_token[:, :token_len], token_len
+         embedding = self._extract_spk_embedding(prompt_speech_16k)
+         model_input = {'text': tts_text_token, 'text_len': tts_text_token_len,
+                        'prompt_text': prompt_text_token, 'prompt_text_len': prompt_text_token_len,
+                        'llm_prompt_speech_token': speech_token, 'llm_prompt_speech_token_len': speech_token_len,
+                        'flow_prompt_speech_token': speech_token, 'flow_prompt_speech_token_len': speech_token_len,
+                        'prompt_speech_feat': speech_feat, 'prompt_speech_feat_len': speech_feat_len,
+                        'llm_embedding': embedding, 'flow_embedding': embedding}
+         return model_input
+
+     def frontend_cross_lingual(self, tts_text, prompt_speech_16k, resample_rate):
+         model_input = self.frontend_zero_shot(tts_text, '', prompt_speech_16k, resample_rate)
+         # in cross lingual mode, we remove the prompt in the llm
+         del model_input['prompt_text']
+         del model_input['prompt_text_len']
+         del model_input['llm_prompt_speech_token']
+         del model_input['llm_prompt_speech_token_len']
+         return model_input
+
+     def frontend_instruct(self, tts_text, spk_id, instruct_text):
+         model_input = self.frontend_sft(tts_text, spk_id)
+         # in instruct mode, we remove spk_embedding in the llm due to information leakage
+         del model_input['llm_embedding']
+         instruct_text_token, instruct_text_token_len = self._extract_text_token(instruct_text + '<endofprompt>')
+         model_input['prompt_text'] = instruct_text_token
+         model_input['prompt_text_len'] = instruct_text_token_len
+         return model_input
+
+     def frontend_instruct2(self, tts_text, instruct_text, prompt_speech_16k, resample_rate):
+         tts_text_token, tts_text_token_len = self._extract_text_token(tts_text)
+         prompt_text_token, prompt_text_token_len = self._extract_text_token(instruct_text + '<|endofprompt|>')
+         prompt_speech_resample = torchaudio.transforms.Resample(orig_freq=16000, new_freq=resample_rate)(prompt_speech_16k)
+         speech_feat, speech_feat_len = self._extract_speech_feat(prompt_speech_resample)
+         speech_token, speech_token_len = self._extract_speech_token(prompt_speech_16k)
+         if resample_rate == 24000:
+             # cosyvoice2: force len(speech_feat) to be exactly 2 * len(speech_token)
+             token_len = min(int(speech_feat.shape[1] / 2), speech_token.shape[1])
+             speech_feat, speech_feat_len[:] = speech_feat[:, :2 * token_len], 2 * token_len
+             speech_token, speech_token_len[:] = speech_token[:, :token_len], token_len
+         embedding = self._extract_spk_embedding(prompt_speech_16k)
+         model_input = {'text': tts_text_token, 'text_len': tts_text_token_len,
+                        'prompt_text': prompt_text_token, 'prompt_text_len': prompt_text_token_len,
+                        'flow_prompt_speech_token': speech_token, 'flow_prompt_speech_token_len': speech_token_len,
+                        'prompt_speech_feat': speech_feat, 'prompt_speech_feat_len': speech_feat_len,
+                        'llm_embedding': embedding, 'flow_embedding': embedding}
+         return model_input
+
+     def frontend_vc(self, source_speech_16k, prompt_speech_16k, resample_rate):
+         prompt_speech_token, prompt_speech_token_len = self._extract_speech_token(prompt_speech_16k)
+         prompt_speech_resample = torchaudio.transforms.Resample(orig_freq=16000, new_freq=resample_rate)(prompt_speech_16k)
+         prompt_speech_feat, prompt_speech_feat_len = self._extract_speech_feat(prompt_speech_resample)
+         embedding = self._extract_spk_embedding(prompt_speech_16k)
+         source_speech_token, source_speech_token_len = self._extract_speech_token(source_speech_16k)
+         model_input = {'source_speech_token': source_speech_token, 'source_speech_token_len': source_speech_token_len,
+                        'flow_prompt_speech_token': prompt_speech_token, 'flow_prompt_speech_token_len': prompt_speech_token_len,
+                        'prompt_speech_feat': prompt_speech_feat, 'prompt_speech_feat_len': prompt_speech_feat_len,
+                        'flow_embedding': embedding}
+         return model_input
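
Note: every frontend_* method above expects prompt audio as a 16 kHz mono tensor no longer than 30 s (see the assertion in _extract_speech_token). A small helper sketch for preparing such a prompt (the wav path is a placeholder):

import torchaudio

def load_prompt_16k(path):
    # load an arbitrary wav, downmix to mono, and resample to the 16 kHz the frontend expects
    speech, sr = torchaudio.load(path)
    speech = speech.mean(dim=0, keepdim=True)
    if sr != 16000:
        speech = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(speech)
    return speech

# prompt_speech_16k = load_prompt_16k('prompt.wav')  # placeholder path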
cosyvoice/cli/model.py ADDED
@@ -0,0 +1,421 @@
+ # Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ import torch
+ import numpy as np
+ import threading
+ import time
+ from torch.nn import functional as F
+ from contextlib import nullcontext
+ import uuid
+ from cosyvoice.utils.common import fade_in_out
+
+
+ class CosyVoiceModel:
+
+     def __init__(self,
+                  llm: torch.nn.Module,
+                  flow: torch.nn.Module,
+                  hift: torch.nn.Module,
+                  fp16: bool):
+         self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+         self.llm = llm
+         self.flow = flow
+         self.hift = hift
+         self.fp16 = fp16
+         self.token_min_hop_len = 2 * self.flow.input_frame_rate
+         self.token_max_hop_len = 4 * self.flow.input_frame_rate
+         self.token_overlap_len = 20
+         # mel fade in out
+         self.mel_overlap_len = int(self.token_overlap_len / self.flow.input_frame_rate * 22050 / 256)
+         self.mel_window = np.hamming(2 * self.mel_overlap_len)
+         # hift cache
+         self.mel_cache_len = 20
+         self.source_cache_len = int(self.mel_cache_len * 256)
+         # speech fade in out
+         self.speech_window = np.hamming(2 * self.source_cache_len)
+         # rtf and decoding related
+         self.stream_scale_factor = 1
+         assert self.stream_scale_factor >= 1, 'stream_scale_factor should be no less than 1, change it according to your actual rtf'
+         self.llm_context = torch.cuda.stream(torch.cuda.Stream(self.device)) if torch.cuda.is_available() else nullcontext()
+         self.lock = threading.Lock()
+         # dicts used to store session related variables
+         self.tts_speech_token_dict = {}
+         self.llm_end_dict = {}
+         self.mel_overlap_dict = {}
+         self.flow_cache_dict = {}
+         self.hift_cache_dict = {}
+
+     def load(self, llm_model, flow_model, hift_model):
+         self.llm.load_state_dict(torch.load(llm_model, map_location=self.device), strict=True)
+         self.llm.to(self.device).eval()
+         if self.fp16 is True:
+             self.llm.half()
+         self.flow.load_state_dict(torch.load(flow_model, map_location=self.device), strict=True)
+         self.flow.to(self.device).eval()
+         # in case hift_model is a hifigan model
+         hift_state_dict = {k.replace('generator.', ''): v for k, v in torch.load(hift_model, map_location=self.device).items()}
+         self.hift.load_state_dict(hift_state_dict, strict=True)
+         self.hift.to(self.device).eval()
+
+     def load_jit(self, llm_text_encoder_model, llm_llm_model, flow_encoder_model):
+         assert self.fp16 is True, "we only provide fp16 jit models, set fp16=True if you want to use a jit model"
+         llm_text_encoder = torch.jit.load(llm_text_encoder_model, map_location=self.device)
+         self.llm.text_encoder = llm_text_encoder
+         llm_llm = torch.jit.load(llm_llm_model, map_location=self.device)
+         self.llm.llm = llm_llm
+         flow_encoder = torch.jit.load(flow_encoder_model, map_location=self.device)
+         self.flow.encoder = flow_encoder
+
+     def load_onnx(self, flow_decoder_estimator_model):
+         import onnxruntime
+         option = onnxruntime.SessionOptions()
+         option.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
+         option.intra_op_num_threads = 1
+         providers = ['CUDAExecutionProvider' if torch.cuda.is_available() else 'CPUExecutionProvider']
+         del self.flow.decoder.estimator
+         self.flow.decoder.estimator = onnxruntime.InferenceSession(flow_decoder_estimator_model, sess_options=option, providers=providers)
+
+     def llm_job(self, text, prompt_text, llm_prompt_speech_token, llm_embedding, uuid):
+         if self.fp16 is True:
+             llm_embedding = llm_embedding.half()
+         with self.llm_context:
+             for i in self.llm.inference(text=text.to(self.device),
+                                         text_len=torch.tensor([text.shape[1]], dtype=torch.int32).to(self.device),
+                                         prompt_text=prompt_text.to(self.device),
+                                         prompt_text_len=torch.tensor([prompt_text.shape[1]], dtype=torch.int32).to(self.device),
+                                         prompt_speech_token=llm_prompt_speech_token.to(self.device),
+                                         prompt_speech_token_len=torch.tensor([llm_prompt_speech_token.shape[1]], dtype=torch.int32).to(self.device),
+                                         embedding=llm_embedding.to(self.device)):
+                 self.tts_speech_token_dict[uuid].append(i)
+         self.llm_end_dict[uuid] = True
+
+     def token2wav(self, token, prompt_token, prompt_feat, embedding, uuid, finalize=False, speed=1.0):
+         tts_mel, flow_cache = self.flow.inference(token=token.to(self.device),
+                                                   token_len=torch.tensor([token.shape[1]], dtype=torch.int32).to(self.device),
+                                                   prompt_token=prompt_token.to(self.device),
+                                                   prompt_token_len=torch.tensor([prompt_token.shape[1]], dtype=torch.int32).to(self.device),
+                                                   prompt_feat=prompt_feat.to(self.device),
+                                                   prompt_feat_len=torch.tensor([prompt_feat.shape[1]], dtype=torch.int32).to(self.device),
+                                                   embedding=embedding.to(self.device),
+                                                   flow_cache=self.flow_cache_dict[uuid])
+         self.flow_cache_dict[uuid] = flow_cache
+
+         # mel overlap fade in out
+         if self.mel_overlap_dict[uuid].shape[2] != 0:
+             tts_mel = fade_in_out(tts_mel, self.mel_overlap_dict[uuid], self.mel_window)
+         # append hift cache
+         if self.hift_cache_dict[uuid] is not None:
+             hift_cache_mel, hift_cache_source = self.hift_cache_dict[uuid]['mel'], self.hift_cache_dict[uuid]['source']
+             tts_mel = torch.concat([hift_cache_mel, tts_mel], dim=2)
+         else:
+             hift_cache_source = torch.zeros(1, 1, 0)
+         # keep overlap mel and hift cache
+         if finalize is False:
+             self.mel_overlap_dict[uuid] = tts_mel[:, :, -self.mel_overlap_len:]
+             tts_mel = tts_mel[:, :, :-self.mel_overlap_len]
+             tts_speech, tts_source = self.hift.inference(speech_feat=tts_mel, cache_source=hift_cache_source)
+             if self.hift_cache_dict[uuid] is not None:
+                 tts_speech = fade_in_out(tts_speech, self.hift_cache_dict[uuid]['speech'], self.speech_window)
+             self.hift_cache_dict[uuid] = {'mel': tts_mel[:, :, -self.mel_cache_len:],
+                                           'source': tts_source[:, :, -self.source_cache_len:],
+                                           'speech': tts_speech[:, -self.source_cache_len:]}
+             tts_speech = tts_speech[:, :-self.source_cache_len]
+         else:
+             if speed != 1.0:
+                 assert self.hift_cache_dict[uuid] is None, 'speed change is only supported in non-stream inference mode'
+                 tts_mel = F.interpolate(tts_mel, size=int(tts_mel.shape[2] / speed), mode='linear')
+             tts_speech, tts_source = self.hift.inference(speech_feat=tts_mel, cache_source=hift_cache_source)
+             if self.hift_cache_dict[uuid] is not None:
+                 tts_speech = fade_in_out(tts_speech, self.hift_cache_dict[uuid]['speech'], self.speech_window)
+         return tts_speech
+
+     def tts(self, text, flow_embedding, llm_embedding=torch.zeros(0, 192),
+             prompt_text=torch.zeros(1, 0, dtype=torch.int32),
+             llm_prompt_speech_token=torch.zeros(1, 0, dtype=torch.int32),
+             flow_prompt_speech_token=torch.zeros(1, 0, dtype=torch.int32),
+             prompt_speech_feat=torch.zeros(1, 0, 80), stream=False, speed=1.0, **kwargs):
+         # this_uuid is used to track variables related to this inference thread
+         this_uuid = str(uuid.uuid1())
+         with self.lock:
+             self.tts_speech_token_dict[this_uuid], self.llm_end_dict[this_uuid] = [], False
+             self.hift_cache_dict[this_uuid] = None
+             self.mel_overlap_dict[this_uuid] = torch.zeros(1, 80, 0)
+             self.flow_cache_dict[this_uuid] = torch.zeros(1, 80, 0, 2)
+         p = threading.Thread(target=self.llm_job, args=(text, prompt_text, llm_prompt_speech_token, llm_embedding, this_uuid))
+         p.start()
+         if stream is True:
+             token_hop_len = self.token_min_hop_len
+             while True:
+                 time.sleep(0.1)
+                 if len(self.tts_speech_token_dict[this_uuid]) >= token_hop_len + self.token_overlap_len:
+                     this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid][:token_hop_len + self.token_overlap_len]) \
+                         .unsqueeze(dim=0)
+                     this_tts_speech = self.token2wav(token=this_tts_speech_token,
+                                                      prompt_token=flow_prompt_speech_token,
+                                                      prompt_feat=prompt_speech_feat,
+                                                      embedding=flow_embedding,
+                                                      uuid=this_uuid,
+                                                      finalize=False)
+                     yield {'tts_speech': this_tts_speech.cpu()}
+                     with self.lock:
+                         self.tts_speech_token_dict[this_uuid] = self.tts_speech_token_dict[this_uuid][token_hop_len:]
+                     # increase token_hop_len for better speech quality
+                     token_hop_len = min(self.token_max_hop_len, int(token_hop_len * self.stream_scale_factor))
+                 if self.llm_end_dict[this_uuid] is True and len(self.tts_speech_token_dict[this_uuid]) < token_hop_len + self.token_overlap_len:
+                     break
+             p.join()
+             # deal with remaining tokens; make sure the remaining token len equals token_hop_len when cache_speech is not None
+             this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid]).unsqueeze(dim=0)
+             this_tts_speech = self.token2wav(token=this_tts_speech_token,
+                                              prompt_token=flow_prompt_speech_token,
+                                              prompt_feat=prompt_speech_feat,
+                                              embedding=flow_embedding,
+                                              uuid=this_uuid,
+                                              finalize=True)
+             yield {'tts_speech': this_tts_speech.cpu()}
+         else:
+             # deal with all tokens
+             p.join()
+             this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid]).unsqueeze(dim=0)
+             this_tts_speech = self.token2wav(token=this_tts_speech_token,
+                                              prompt_token=flow_prompt_speech_token,
+                                              prompt_feat=prompt_speech_feat,
+                                              embedding=flow_embedding,
+                                              uuid=this_uuid,
+                                              finalize=True,
+                                              speed=speed)
+             yield {'tts_speech': this_tts_speech.cpu()}
+         with self.lock:
+             self.tts_speech_token_dict.pop(this_uuid)
+             self.llm_end_dict.pop(this_uuid)
+             self.mel_overlap_dict.pop(this_uuid)
+             self.hift_cache_dict.pop(this_uuid)
+             self.flow_cache_dict.pop(this_uuid)
+
+     def vc(self, source_speech_token, flow_prompt_speech_token, prompt_speech_feat, flow_embedding, stream=False, speed=1.0, **kwargs):
+         # this_uuid is used to track variables related to this inference thread
+         this_uuid = str(uuid.uuid1())
+         with self.lock:
+             self.tts_speech_token_dict[this_uuid], self.llm_end_dict[this_uuid] = source_speech_token.flatten().tolist(), True
+             self.hift_cache_dict[this_uuid] = None
+             self.mel_overlap_dict[this_uuid] = torch.zeros(1, 80, 0)
+             self.flow_cache_dict[this_uuid] = torch.zeros(1, 80, 0, 2)
+         if stream is True:
+             token_hop_len = self.token_min_hop_len
+             while True:
+                 if len(self.tts_speech_token_dict[this_uuid]) >= token_hop_len + self.token_overlap_len:
+                     this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid][:token_hop_len + self.token_overlap_len]) \
+                         .unsqueeze(dim=0)
+                     this_tts_speech = self.token2wav(token=this_tts_speech_token,
+                                                      prompt_token=flow_prompt_speech_token,
+                                                      prompt_feat=prompt_speech_feat,
+                                                      embedding=flow_embedding,
+                                                      uuid=this_uuid,
+                                                      finalize=False)
+                     yield {'tts_speech': this_tts_speech.cpu()}
+                     with self.lock:
+                         self.tts_speech_token_dict[this_uuid] = self.tts_speech_token_dict[this_uuid][token_hop_len:]
+                     # increase token_hop_len for better speech quality
+                     token_hop_len = min(self.token_max_hop_len, int(token_hop_len * self.stream_scale_factor))
+                 if self.llm_end_dict[this_uuid] is True and len(self.tts_speech_token_dict[this_uuid]) < token_hop_len + self.token_overlap_len:
+                     break
+             # deal with remaining tokens; make sure the remaining token len equals token_hop_len when cache_speech is not None
+             this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid]).unsqueeze(dim=0)
+             this_tts_speech = self.token2wav(token=this_tts_speech_token,
+                                              prompt_token=flow_prompt_speech_token,
+                                              prompt_feat=prompt_speech_feat,
+                                              embedding=flow_embedding,
+                                              uuid=this_uuid,
+                                              finalize=True)
+             yield {'tts_speech': this_tts_speech.cpu()}
+         else:
+             # deal with all tokens
+             this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid]).unsqueeze(dim=0)
+             this_tts_speech = self.token2wav(token=this_tts_speech_token,
+                                              prompt_token=flow_prompt_speech_token,
+                                              prompt_feat=prompt_speech_feat,
+                                              embedding=flow_embedding,
+                                              uuid=this_uuid,
+                                              finalize=True,
+                                              speed=speed)
+             yield {'tts_speech': this_tts_speech.cpu()}
+         with self.lock:
+             self.tts_speech_token_dict.pop(this_uuid)
+             self.llm_end_dict.pop(this_uuid)
+             self.mel_overlap_dict.pop(this_uuid)
+             self.hift_cache_dict.pop(this_uuid)
+             self.flow_cache_dict.pop(this_uuid)  # bugfix: release the flow cache for this session as well
+
+
+ class CosyVoice2Model:
+
+     def __init__(self,
+                  llm: torch.nn.Module,
+                  flow: torch.nn.Module,
+                  hift: torch.nn.Module):
+         self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+         self.llm = llm
+         self.flow = flow
+         self.hift = hift
+         self.token_hop_len = 2 * self.flow.input_frame_rate
+         # here we fix the flow encoder/decoder decoding_chunk_size; in the future we will pass it as an argument, or use a cache
+         self.flow.encoder.static_chunk_size = 2 * self.flow.input_frame_rate
+         self.flow.decoder.estimator.static_chunk_size = 2 * self.flow.input_frame_rate * self.flow.token_mel_ratio
+         # hift cache
+         self.mel_cache_len = 8
+         self.source_cache_len = int(self.mel_cache_len * 480)
+         # speech fade in out
+         self.speech_window = np.hamming(2 * self.source_cache_len)
+         # rtf and decoding related
+         self.stream_scale_factor = 1
+         self.llm_context = torch.cuda.stream(torch.cuda.Stream(self.device)) if torch.cuda.is_available() else nullcontext()
+         self.lock = threading.Lock()
+         # dicts used to store session related variables
+         self.tts_speech_token_dict = {}
+         self.llm_end_dict = {}
+         self.hift_cache_dict = {}
+
+     def load(self, llm_model, flow_model, hift_model):
+         self.llm.load_state_dict(torch.load(llm_model, map_location=self.device), strict=True)
+         self.llm.to(self.device).eval()
+         self.flow.load_state_dict(torch.load(flow_model, map_location=self.device), strict=True)
+         self.flow.to(self.device).eval()
+         self.flow.decoder.fp16 = False
+         # in case hift_model is a hifigan model
+         hift_state_dict = {k.replace('generator.', ''): v for k, v in torch.load(hift_model, map_location=self.device).items()}
+         self.hift.load_state_dict(hift_state_dict, strict=True)
+         self.hift.to(self.device).eval()
+
+     def load_jit(self, flow_encoder_model):
+         flow_encoder = torch.jit.load(flow_encoder_model, map_location=self.device)
+         self.flow.encoder = flow_encoder
+
+     def load_onnx(self, flow_decoder_estimator_model):
+         import onnxruntime
+         option = onnxruntime.SessionOptions()
+         option.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
+         option.intra_op_num_threads = 1
+         providers = ['CUDAExecutionProvider' if torch.cuda.is_available() else 'CPUExecutionProvider']
+         del self.flow.decoder.estimator
+         self.flow.decoder.estimator = onnxruntime.InferenceSession(flow_decoder_estimator_model, sess_options=option, providers=providers)
+
+     def load_trt(self, flow_decoder_estimator_model):
+         del self.flow.decoder.estimator
+         import tensorrt as trt
+         with open(flow_decoder_estimator_model, 'rb') as f:
+             self.flow.decoder.estimator_engine = trt.Runtime(trt.Logger(trt.Logger.INFO)).deserialize_cuda_engine(f.read())
+         self.flow.decoder.estimator = self.flow.decoder.estimator_engine.create_execution_context()
+         self.flow.decoder.fp16 = True
+
+     def llm_job(self, text, prompt_text, llm_prompt_speech_token, llm_embedding, uuid):
+         with self.llm_context:
+             for i in self.llm.inference(text=text.to(self.device),
+                                         text_len=torch.tensor([text.shape[1]], dtype=torch.int32).to(self.device),
+                                         prompt_text=prompt_text.to(self.device),
+                                         prompt_text_len=torch.tensor([prompt_text.shape[1]], dtype=torch.int32).to(self.device),
+                                         prompt_speech_token=llm_prompt_speech_token.to(self.device),
+                                         prompt_speech_token_len=torch.tensor([llm_prompt_speech_token.shape[1]], dtype=torch.int32).to(self.device),
+                                         embedding=llm_embedding.to(self.device)):
+                 self.tts_speech_token_dict[uuid].append(i)
+         self.llm_end_dict[uuid] = True
+
+     def token2wav(self, token, prompt_token, prompt_feat, embedding, uuid, token_offset, finalize=False, speed=1.0):
+         tts_mel, _ = self.flow.inference(token=token.to(self.device),
+                                          token_len=torch.tensor([token.shape[1]], dtype=torch.int32).to(self.device),
+                                          prompt_token=prompt_token.to(self.device),
+                                          prompt_token_len=torch.tensor([prompt_token.shape[1]], dtype=torch.int32).to(self.device),
+                                          prompt_feat=prompt_feat.to(self.device),
+                                          prompt_feat_len=torch.tensor([prompt_feat.shape[1]], dtype=torch.int32).to(self.device),
+                                          embedding=embedding.to(self.device),
+                                          finalize=finalize)
+         tts_mel = tts_mel[:, :, token_offset * self.flow.token_mel_ratio:]
+         # append hift cache
+         if self.hift_cache_dict[uuid] is not None:
+             hift_cache_mel, hift_cache_source = self.hift_cache_dict[uuid]['mel'], self.hift_cache_dict[uuid]['source']
+             tts_mel = torch.concat([hift_cache_mel, tts_mel], dim=2)
+         else:
+             hift_cache_source = torch.zeros(1, 1, 0)
+         # keep overlap mel and hift cache
+         if finalize is False:
+             tts_speech, tts_source = self.hift.inference(speech_feat=tts_mel, cache_source=hift_cache_source)
+             if self.hift_cache_dict[uuid] is not None:
+                 tts_speech = fade_in_out(tts_speech, self.hift_cache_dict[uuid]['speech'], self.speech_window)
+             self.hift_cache_dict[uuid] = {'mel': tts_mel[:, :, -self.mel_cache_len:],
+                                           'source': tts_source[:, :, -self.source_cache_len:],
+                                           'speech': tts_speech[:, -self.source_cache_len:]}
+             tts_speech = tts_speech[:, :-self.source_cache_len]
+         else:
+             if speed != 1.0:
+                 assert self.hift_cache_dict[uuid] is None, 'speed change is only supported in non-stream inference mode'
+                 tts_mel = F.interpolate(tts_mel, size=int(tts_mel.shape[2] / speed), mode='linear')
+             tts_speech, tts_source = self.hift.inference(speech_feat=tts_mel, cache_source=hift_cache_source)
+             if self.hift_cache_dict[uuid] is not None:
+                 tts_speech = fade_in_out(tts_speech, self.hift_cache_dict[uuid]['speech'], self.speech_window)
+         return tts_speech
+
+     def tts(self, text, flow_embedding, llm_embedding=torch.zeros(0, 192),
+             prompt_text=torch.zeros(1, 0, dtype=torch.int32),
+             llm_prompt_speech_token=torch.zeros(1, 0, dtype=torch.int32),
+             flow_prompt_speech_token=torch.zeros(1, 0, dtype=torch.int32),
+             prompt_speech_feat=torch.zeros(1, 0, 80), stream=False, speed=1.0, **kwargs):
+         # this_uuid is used to track variables related to this inference thread
+         this_uuid = str(uuid.uuid1())
+         with self.lock:
+             self.tts_speech_token_dict[this_uuid], self.llm_end_dict[this_uuid] = [], False
+             self.hift_cache_dict[this_uuid] = None
+         p = threading.Thread(target=self.llm_job, args=(text, prompt_text, llm_prompt_speech_token, llm_embedding, this_uuid))
+         p.start()
+         if stream is True:
+             token_offset = 0
+             while True:
+                 time.sleep(0.1)
+                 if len(self.tts_speech_token_dict[this_uuid]) - token_offset >= self.token_hop_len + self.flow.pre_lookahead_len:
+                     this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid][:token_offset + self.token_hop_len + self.flow.pre_lookahead_len]).unsqueeze(dim=0)
+                     this_tts_speech = self.token2wav(token=this_tts_speech_token,
+                                                      prompt_token=flow_prompt_speech_token,
+                                                      prompt_feat=prompt_speech_feat,
+                                                      embedding=flow_embedding,
+                                                      uuid=this_uuid,
+                                                      token_offset=token_offset,
+                                                      finalize=False)
+                     token_offset += self.token_hop_len
+                     yield {'tts_speech': this_tts_speech.cpu()}
+                 if self.llm_end_dict[this_uuid] is True and len(self.tts_speech_token_dict[this_uuid]) - token_offset < self.token_hop_len + self.flow.pre_lookahead_len:
+                     break
+             p.join()
+             # deal with remaining tokens; make sure the remaining token len equals token_hop_len when cache_speech is not None
+             this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid]).unsqueeze(dim=0)
+             this_tts_speech = self.token2wav(token=this_tts_speech_token,
+                                              prompt_token=flow_prompt_speech_token,
+                                              prompt_feat=prompt_speech_feat,
+                                              embedding=flow_embedding,
+                                              uuid=this_uuid,
+                                              token_offset=token_offset,
+                                              finalize=True)
+             yield {'tts_speech': this_tts_speech.cpu()}
+         else:
+             # deal with all tokens
+             p.join()
+             this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid]).unsqueeze(dim=0)
+             this_tts_speech = self.token2wav(token=this_tts_speech_token,
+                                              prompt_token=flow_prompt_speech_token,
+                                              prompt_feat=prompt_speech_feat,
+                                              embedding=flow_embedding,
+                                              uuid=this_uuid,
+                                              token_offset=0,
+                                              finalize=True,
+                                              speed=speed)
+             yield {'tts_speech': this_tts_speech.cpu()}
+         with self.lock:
+             self.tts_speech_token_dict.pop(this_uuid)
+             self.llm_end_dict.pop(this_uuid)
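
Note: tts() above runs the LLM in a background thread and, in stream mode, yields audio as soon as enough speech tokens have accumulated; chunk boundaries are already cross-faded inside token2wav, so a consumer only has to concatenate. A minimal sketch (cosyvoice is assumed to be a wrapper instance from cosyvoice.py above):

import torch

def collect_stream(generator):
    # gather the (1, n) waveform chunks yielded by tts()/vc() into one tensor
    chunks = [out['tts_speech'] for out in generator]
    return torch.concat(chunks, dim=1)

# speech = collect_stream(cosyvoice.inference_sft('A long sentence...', '中文女', stream=True))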
cosyvoice/dataset/__init__.py ADDED
File without changes
cosyvoice/dataset/dataset.py ADDED
@@ -0,0 +1,164 @@
+ # Copyright (c) 2021 Mobvoi Inc. (authors: Binbin Zhang)
+ #               2024 Alibaba Inc (authors: Xiang Lyu)
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ import random
+ import json
+ import math
+ from functools import partial
+
+ import torch
+ import torch.distributed as dist
+ from torch.utils.data import IterableDataset
+ from cosyvoice.utils.file_utils import read_lists, read_json_lists
+
+
+ class Processor(IterableDataset):
+
+     def __init__(self, source, f, *args, **kw):
+         assert callable(f)
+         self.source = source
+         self.f = f
+         self.args = args
+         self.kw = kw
+
+     def set_epoch(self, epoch):
+         self.source.set_epoch(epoch)
+
+     def __iter__(self):
+         """ Return an iterator over the source dataset processed by the
+             given processor.
+         """
+         assert self.source is not None
+         assert callable(self.f)
+         return self.f(iter(self.source), *self.args, **self.kw)
+
+     def apply(self, f):
+         assert callable(f)
+         return Processor(self, f, *self.args, **self.kw)
+
+
+ class DistributedSampler:
+
+     def __init__(self, shuffle=True, partition=True):
+         self.epoch = -1
+         self.update()
+         self.shuffle = shuffle
+         self.partition = partition
+
+     def update(self):
+         assert dist.is_available()
+         if dist.is_initialized():
+             self.rank = dist.get_rank()
+             self.world_size = dist.get_world_size()
+         else:
+             self.rank = 0
+             self.world_size = 1
+         worker_info = torch.utils.data.get_worker_info()
+         if worker_info is None:
+             self.worker_id = 0
+             self.num_workers = 1
+         else:
+             self.worker_id = worker_info.id
+             self.num_workers = worker_info.num_workers
+         return dict(rank=self.rank,
+                     world_size=self.world_size,
+                     worker_id=self.worker_id,
+                     num_workers=self.num_workers)
+
+     def set_epoch(self, epoch):
+         self.epoch = epoch
+
+     def sample(self, data):
+         """ Sample data according to rank/world_size/num_workers
+
+         Args:
+             data(List): input data list
+
+         Returns:
+             List: data list after sampling
+         """
+         data = list(range(len(data)))
+         # force datalist even
+         if self.partition:
+             if self.shuffle:
+                 random.Random(self.epoch).shuffle(data)
+             if len(data) < self.world_size:
+                 data = data * math.ceil(self.world_size / len(data))
+                 data = data[:self.world_size]
+             data = data[self.rank::self.world_size]
+         if len(data) < self.num_workers:
+             data = data * math.ceil(self.num_workers / len(data))
+             data = data[:self.num_workers]
+         data = data[self.worker_id::self.num_workers]
+         return data
+
+
+ class DataList(IterableDataset):
+
+     def __init__(self, lists, shuffle=True, partition=True):
+         self.lists = lists
+         self.sampler = DistributedSampler(shuffle, partition)
+
+     def set_epoch(self, epoch):
+         self.sampler.set_epoch(epoch)
+
+     def __iter__(self):
+         sampler_info = self.sampler.update()
+         indexes = self.sampler.sample(self.lists)
+         for index in indexes:
+             data = dict(src=self.lists[index])
+             data.update(sampler_info)
+             yield data
+
+
+ def Dataset(data_list_file,
+             data_pipeline,
+             mode='train',
+             gan=False,
+             shuffle=True,
+             partition=True,
+             tts_file='',
+             prompt_utt2data=''):
+     """ Construct dataset from arguments
+
+     We have two shuffle stages in the Dataset. The first is a global
+     shuffle at the shard tar/raw file level. The second is a shuffle
+     at the training-sample level.
+
+     Args:
+         data_list_file(str): file listing the data shards
+         data_pipeline(list): processor functions applied in order
+         mode(str): 'train' or 'inference'
+         partition(bool): whether to do data partition in terms of rank
+     """
+     assert mode in ['train', 'inference']
+     lists = read_lists(data_list_file)
+     if mode == 'inference':
+         with open(tts_file) as f:
+             tts_data = json.load(f)
+         utt2lists = read_json_lists(prompt_utt2data)
+         # filter unnecessary files in inference mode
+         lists = list({utt2lists[utt] for utt in tts_data.keys() if utt2lists[utt] in lists})
+     dataset = DataList(lists,
+                        shuffle=shuffle,
+                        partition=partition)
+     if mode == 'inference':
+         # map partial arg to parquet_opener func in inference mode
+         data_pipeline[0] = partial(data_pipeline[0], tts_data=tts_data)
+     if gan is True:
+         # map partial arg to padding func in gan mode
+         data_pipeline[-1] = partial(data_pipeline[-1], gan=gan)
+     for func in data_pipeline:
+         dataset = Processor(dataset, func, mode=mode)
+     return dataset
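
Note: data_pipeline is just an ordered list of generator functions such as those defined in processor.py below; Dataset() wraps each one around the previous iterator via Processor. A schematic sketch (the partial arguments are illustrative, not the project's actual training configuration, which also chains tokenization, fbank extraction, shuffling, sorting, batching and padding):

from functools import partial
from cosyvoice.dataset.dataset import Dataset
from cosyvoice.dataset.processor import parquet_opener, filter, resample

data_pipeline = [
    parquet_opener,                                      # read samples out of parquet shards
    partial(filter, max_length=40960, min_length=100),   # drop out-of-range utterances
    partial(resample, resample_rate=22050),              # unify the sample rate
]
# dataset = Dataset('data.list', data_pipeline, mode='train', gan=False)  # placeholder list file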
cosyvoice/dataset/processor.py ADDED
@@ -0,0 +1,431 @@
+ # Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ import logging
+ import random
+
+ import pyarrow.parquet as pq
+ from io import BytesIO
+ import torch
+ import torchaudio
+ from torch.nn.utils.rnn import pad_sequence
+ import torch.nn.functional as F
+
+ torchaudio.set_audio_backend('soundfile')
+
+ AUDIO_FORMAT_SETS = {'flac', 'mp3', 'm4a', 'ogg', 'opus', 'wav', 'wma'}
+
+
+ def parquet_opener(data, mode='train', tts_data={}):
+     """ Given a url or local file, return a file descriptor.
+     Inplace operation.
+
+     Args:
+         data(Iterable[str]): url or local file list
+
+     Returns:
+         Iterable[{src, stream}]
+     """
+     for sample in data:
+         assert 'src' in sample
+         url = sample['src']
+         try:
+             for df in pq.ParquetFile(url).iter_batches(batch_size=64):
+                 df = df.to_pandas()
+                 for i in range(len(df)):
+                     if mode == 'inference' and df.loc[i, 'utt'] not in tts_data:
+                         continue
+                     sample.update(dict(df.loc[i]))
+                     if mode == 'train':
+                         # NOTE do not return sample directly, must initialize a new dict
+                         yield {**sample}
+                     else:
+                         for index, text in enumerate(tts_data[df.loc[i, 'utt']]):
+                             yield {**sample, 'tts_index': index, 'tts_text': text}
+         except Exception as ex:
+             logging.warning('Failed to open {}, ex info {}'.format(url, ex))
+
+
+ def filter(data,
+            max_length=10240,
+            min_length=10,
+            token_max_length=200,
+            token_min_length=1,
+            min_output_input_ratio=0.0005,
+            max_output_input_ratio=1,
+            mode='train'):
+     """ Filter samples according to feature and label length.
+     Inplace operation.
+
+     Args:
+         data: Iterable[{key, wav, label, sample_rate}]
+         max_length: drop utterances longer than max_length (in 10ms frames)
+         min_length: drop utterances shorter than min_length (in 10ms frames)
+         token_max_length: drop utterances with more than token_max_length
+             tokens, especially when char units are used for English modeling
+         token_min_length: drop utterances with fewer than token_min_length tokens
+         min_output_input_ratio: minimal ratio of
+             token_length / feats_length (in 10ms frames)
+         max_output_input_ratio: maximum ratio of
+             token_length / feats_length (in 10ms frames)
+
+     Returns:
+         Iterable[{key, wav, label, sample_rate}]
+     """
+     for sample in data:
+         sample['speech'], sample['sample_rate'] = torchaudio.load(BytesIO(sample['audio_data']))
+         sample['speech'] = sample['speech'].mean(dim=0, keepdim=True)
+         del sample['audio_data']
+         # sample['speech'] is a torch.Tensor; we count 100 frames per second
+         num_frames = sample['speech'].size(1) / sample['sample_rate'] * 100
+         if num_frames < min_length:
+             continue
+         if num_frames > max_length:
+             continue
+         if len(sample['text_token']) < token_min_length:
+             continue
+         if len(sample['text_token']) > token_max_length:
+             continue
+         if len(sample['speech_token']) == 0:
+             continue
+         if num_frames != 0:
+             if len(sample['text_token']) / num_frames < min_output_input_ratio:
+                 continue
+             if len(sample['text_token']) / num_frames > max_output_input_ratio:
+                 continue
+         yield sample
+
+
+ def resample(data, resample_rate=22050, min_sample_rate=16000, mode='train'):
+     """ Resample data.
+     Inplace operation.
+
+     Args:
+         data: Iterable[{key, wav, label, sample_rate}]
+         resample_rate: target resample rate
+
+     Returns:
+         Iterable[{key, wav, label, sample_rate}]
+     """
+     for sample in data:
+         assert 'sample_rate' in sample
+         assert 'speech' in sample
+         sample_rate = sample['sample_rate']
+         waveform = sample['speech']
+         if sample_rate != resample_rate:
+             if sample_rate < min_sample_rate:
+                 continue
+             sample['sample_rate'] = resample_rate
+             sample['speech'] = torchaudio.transforms.Resample(
+                 orig_freq=sample_rate, new_freq=resample_rate)(waveform)
+         max_val = sample['speech'].abs().max()
+         if max_val > 1:
+             sample['speech'] /= max_val
+         yield sample
+
+
+ def truncate(data, truncate_length=24576, mode='train'):
+     """ Truncate data.
+
+     Args:
+         data: Iterable[{key, wav, label, sample_rate}]
+         truncate_length: truncate length
+
+     Returns:
+         Iterable[{key, wav, label, sample_rate}]
+     """
+     for sample in data:
+         waveform = sample['speech']
+         if waveform.shape[1] > truncate_length:
+             start = random.randint(0, waveform.shape[1] - truncate_length)
+             waveform = waveform[:, start: start + truncate_length]
+         else:
+             waveform = torch.concat([waveform, torch.zeros(1, truncate_length - waveform.shape[1])], dim=1)
+         sample['speech'] = waveform
+         yield sample
+
+
+ def compute_fbank(data,
+                   feat_extractor,
+                   mode='train'):
+     """ Extract fbank features.
+
+     Args:
+         data: Iterable[{key, wav, label, sample_rate}]
+
+     Returns:
+         Iterable[{key, feat, label}]
+     """
+     for sample in data:
+         assert 'sample_rate' in sample
+         assert 'speech' in sample
+         assert 'utt' in sample
+         assert 'text_token' in sample
+         waveform = sample['speech']
+         mat = feat_extractor(waveform).squeeze(dim=0).transpose(0, 1)
+         sample['speech_feat'] = mat
+         yield sample
+
+
+ def compute_f0(data, pitch_extractor, mode='train'):
+     """ Extract f0.
+
+     Args:
+         data: Iterable[{key, wav, label, sample_rate}]
+
+     Returns:
+         Iterable[{key, feat, label}]
+     """
+     for sample in data:
+         assert 'sample_rate' in sample
+         assert 'speech' in sample
+         assert 'utt' in sample
+         assert 'text_token' in sample
+         waveform = sample['speech']
+         mat = pitch_extractor(waveform).transpose(1, 2)
+         mat = F.interpolate(mat, size=sample['speech_feat'].shape[0], mode='linear')
+         sample['pitch_feat'] = mat[0, 0]
+         yield sample
+
+
+ def parse_embedding(data, normalize, mode='train'):
+     """ Parse utt_embedding/spk_embedding.
+
+     Args:
+         data: Iterable[{key, wav, label, sample_rate}]
+
+     Returns:
+         Iterable[{key, feat, label}]
+     """
+     for sample in data:
+         sample['utt_embedding'] = torch.tensor(sample['utt_embedding'], dtype=torch.float32)
+         sample['spk_embedding'] = torch.tensor(sample['spk_embedding'], dtype=torch.float32)
+         if normalize:
+             sample['utt_embedding'] = F.normalize(sample['utt_embedding'], dim=0)
+             sample['spk_embedding'] = F.normalize(sample['spk_embedding'], dim=0)
+         yield sample
+
+
+ def tokenize(data, get_tokenizer, allowed_special, mode='train'):
222
+ """ Decode text to chars or BPE
223
+ Inplace operation
224
+
225
+ Args:
226
+ data: Iterable[{key, wav, txt, sample_rate}]
227
+
228
+ Returns:
229
+ Iterable[{key, wav, txt, tokens, label, sample_rate}]
230
+ """
231
+ tokenizer = get_tokenizer()
232
+ for sample in data:
233
+ assert 'text' in sample
234
+ sample['text_token'] = tokenizer.encode(sample['text'], allowed_special=allowed_special)
235
+ if mode == 'inference':
236
+ sample['tts_text_token'] = tokenizer.encode(sample['tts_text'], allowed_special=allowed_special)
237
+ yield sample
238
+
239
+
240
+ def shuffle(data, shuffle_size=10000, mode='train'):
241
+ """ Local shuffle the data
242
+
243
+ Args:
244
+ data: Iterable[{key, feat, label}]
245
+ shuffle_size: buffer size for shuffle
246
+
247
+ Returns:
248
+ Iterable[{key, feat, label}]
249
+ """
250
+ buf = []
251
+ for sample in data:
252
+ buf.append(sample)
253
+ if len(buf) >= shuffle_size:
254
+ random.shuffle(buf)
255
+ for x in buf:
256
+ yield x
257
+ buf = []
258
+ # The sample left over
259
+ random.shuffle(buf)
260
+ for x in buf:
261
+ yield x
262
+
263
+
264
+ def sort(data, sort_size=500, mode='train'):
265
+ """ Sort the data by feature length.
266
+ Sort is used after shuffle and before batch, so we can group
267
+ utts with similar lengths into a batch, and `sort_size` should
268
+ be less than `shuffle_size`
269
+
270
+ Args:
271
+ data: Iterable[{key, feat, label}]
272
+ sort_size: buffer size for sort
273
+
274
+ Returns:
275
+ Iterable[{key, feat, label}]
276
+ """
277
+
278
+ buf = []
279
+ for sample in data:
280
+ buf.append(sample)
281
+ if len(buf) >= sort_size:
282
+ buf.sort(key=lambda x: x['speech_feat'].size(0))
283
+ for x in buf:
284
+ yield x
285
+ buf = []
286
+ # The sample left over
287
+ buf.sort(key=lambda x: x['speech_feat'].size(0))
288
+ for x in buf:
289
+ yield x
290
+
291
+
292
+ def static_batch(data, batch_size=16):
293
+ """ Static batch the data by `batch_size`
294
+
295
+ Args:
296
+ data: Iterable[{key, feat, label}]
297
+ batch_size: batch size
298
+
299
+ Returns:
300
+ Iterable[List[{key, feat, label}]]
301
+ """
302
+ buf = []
303
+ for sample in data:
304
+ buf.append(sample)
305
+ if len(buf) >= batch_size:
306
+ yield buf
307
+ buf = []
308
+ if len(buf) > 0:
309
+ yield buf
310
+
311
+
312
+ def dynamic_batch(data, max_frames_in_batch=12000, mode='train'):
313
+ """ Dynamic batch the data until the total frames in batch
314
+ reach `max_frames_in_batch`
315
+
316
+ Args:
317
+ data: Iterable[{key, feat, label}]
318
+ max_frames_in_batch: max_frames in one batch
319
+
320
+ Returns:
321
+ Iterable[List[{key, feat, label}]]
322
+ """
323
+ buf = []
324
+ longest_frames = 0
325
+ for sample in data:
326
+ assert 'speech_feat' in sample
327
+ assert isinstance(sample['speech_feat'], torch.Tensor)
328
+ new_sample_frames = sample['speech_feat'].size(0)
329
+ longest_frames = max(longest_frames, new_sample_frames)
330
+ frames_after_padding = longest_frames * (len(buf) + 1)
331
+ if frames_after_padding > max_frames_in_batch:
332
+ yield buf
333
+ buf = [sample]
334
+ longest_frames = new_sample_frames
335
+ else:
336
+ buf.append(sample)
337
+ if len(buf) > 0:
338
+ yield buf
339
+
340
+
341
+ def batch(data, batch_type='static', batch_size=16, max_frames_in_batch=12000, mode='train'):
342
+ """ Wrapper for static/dynamic batch
343
+ """
344
+ if mode == 'inference':
345
+ return static_batch(data, 1)
346
+ else:
347
+ if batch_type == 'static':
348
+ return static_batch(data, batch_size)
349
+ elif batch_type == 'dynamic':
350
+ return dynamic_batch(data, max_frames_in_batch)
351
+ else:
352
+ logging.fatal('Unsupported batch type {}'.format(batch_type))
353
+
354
+
355
+ def padding(data, use_spk_embedding, mode='train', gan=False):
356
+ """ Padding the data into training data
357
+
358
+ Args:
359
+ data: Iterable[List[{key, feat, label}]]
360
+
361
+ Returns:
362
+ Iterable[Tuple(keys, feats, labels, feats lengths, label lengths)]
363
+ """
364
+ for sample in data:
365
+ assert isinstance(sample, list)
366
+ speech_feat_len = torch.tensor([x['speech_feat'].size(1) for x in sample],
367
+ dtype=torch.int32)
368
+ order = torch.argsort(speech_feat_len, descending=True)
369
+
370
+ utts = [sample[i]['utt'] for i in order]
371
+ speech = [sample[i]['speech'].squeeze(dim=0) for i in order]
372
+ speech_len = torch.tensor([i.size(0) for i in speech], dtype=torch.int32)
373
+ speech = pad_sequence(speech, batch_first=True, padding_value=0)
374
+ speech_token = [torch.tensor(sample[i]['speech_token']) for i in order]
375
+ speech_token_len = torch.tensor([i.size(0) for i in speech_token], dtype=torch.int32)
376
+ speech_token = pad_sequence(speech_token,
377
+ batch_first=True,
378
+ padding_value=0)
379
+ speech_feat = [sample[i]['speech_feat'] for i in order]
380
+ speech_feat_len = torch.tensor([i.size(0) for i in speech_feat], dtype=torch.int32)
381
+ speech_feat = pad_sequence(speech_feat,
382
+ batch_first=True,
383
+ padding_value=0)
384
+ text = [sample[i]['text'] for i in order]
385
+ text_token = [torch.tensor(sample[i]['text_token']) for i in order]
386
+ text_token_len = torch.tensor([i.size(0) for i in text_token], dtype=torch.int32)
387
+ text_token = pad_sequence(text_token, batch_first=True, padding_value=0)
388
+ utt_embedding = torch.stack([sample[i]['utt_embedding'] for i in order], dim=0)
389
+ spk_embedding = torch.stack([sample[i]['spk_embedding'] for i in order], dim=0)
390
+ batch = {
391
+ "utts": utts,
392
+ "speech": speech,
393
+ "speech_len": speech_len,
394
+ "speech_token": speech_token,
395
+ "speech_token_len": speech_token_len,
396
+ "speech_feat": speech_feat,
397
+ "speech_feat_len": speech_feat_len,
398
+ "text": text,
399
+ "text_token": text_token,
400
+ "text_token_len": text_token_len,
401
+ "utt_embedding": utt_embedding,
402
+ "spk_embedding": spk_embedding,
403
+ }
404
+ if gan is True:
405
+ # in gan train, we need pitch_feat
406
+ pitch_feat = [sample[i]['pitch_feat'] for i in order]
407
+ pitch_feat_len = torch.tensor([i.size(0) for i in pitch_feat], dtype=torch.int32)
408
+ pitch_feat = pad_sequence(pitch_feat,
409
+ batch_first=True,
410
+ padding_value=0)
411
+ batch["pitch_feat"] = pitch_feat
412
+ batch["pitch_feat_len"] = pitch_feat_len
413
+ else:
414
+ # only gan train needs speech, delete it to save memory
415
+ del batch["speech"]
416
+ del batch["speech_len"]
417
+ if mode == 'inference':
418
+ tts_text = [sample[i]['tts_text'] for i in order]
419
+ tts_index = [sample[i]['tts_index'] for i in order]
420
+ tts_text_token = [torch.tensor(sample[i]['tts_text_token']) for i in order]
421
+ tts_text_token_len = torch.tensor([i.size(0) for i in tts_text_token], dtype=torch.int32)
422
+ tts_text_token = pad_sequence(tts_text_token, batch_first=True, padding_value=-1)
423
+ batch.update({'tts_text': tts_text,
424
+ 'tts_index': tts_index,
425
+ 'tts_text_token': tts_text_token,
426
+ 'tts_text_token_len': tts_text_token_len})
427
+ if use_spk_embedding is True:
428
+ batch["embedding"] = batch["spk_embedding"]
429
+ else:
430
+ batch["embedding"] = batch["utt_embedding"]
431
+ yield batch
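Each function above is a lazy generator transform over an iterable of sample dicts, so a full pipeline is built by nesting the stages. A minimal sketch of that composition; `shards`, `tokenizer_factory`, and `fbank` are hypothetical stand-ins for the objects the real dataset config wires in, and the stage order mirrors the tokenize -> filter -> resample -> fbank -> embedding -> shuffle -> sort -> batch -> padding flow implied by the field names each stage consumes:

def build_pipeline(shards, tokenizer_factory, fbank):
    data = ({'src': s} for s in shards)      # Iterable[{src}]
    data = parquet_opener(data)              # read rows from parquet shards
    data = tokenize(data, tokenizer_factory, allowed_special='all')
    data = filter(data)                      # loads audio, drops bad lengths
    data = resample(data, resample_rate=22050)
    data = compute_fbank(data, fbank)
    data = parse_embedding(data, normalize=True)
    data = shuffle(data, shuffle_size=1000)
    data = sort(data, sort_size=500)         # group similar lengths
    data = batch(data, batch_type='dynamic', max_frames_in_batch=2000)
    data = padding(data, use_spk_embedding=False)
    return data                              # yields padded batch dicts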
cosyvoice/flow/decoder.py ADDED
@@ -0,0 +1,301 @@
+ # Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu, Zhihao Du)
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from einops import pack, rearrange, repeat
+ from cosyvoice.utils.common import mask_to_bias
+ from cosyvoice.utils.mask import add_optional_chunk_mask
+ from matcha.models.components.decoder import SinusoidalPosEmb, Block1D, ResnetBlock1D, Downsample1D, TimestepEmbedding, Upsample1D
+ from matcha.models.components.transformer import BasicTransformerBlock
+
+
+ class Transpose(torch.nn.Module):
+     def __init__(self, dim0: int, dim1: int):
+         super().__init__()
+         self.dim0 = dim0
+         self.dim1 = dim1
+
+     def forward(self, x: torch.Tensor):
+         x = torch.transpose(x, self.dim0, self.dim1)
+         return x
+
+
+ class CausalBlock1D(Block1D):
+     def __init__(self, dim: int, dim_out: int):
+         super(CausalBlock1D, self).__init__(dim, dim_out)
+         self.block = torch.nn.Sequential(
+             CausalConv1d(dim, dim_out, 3),
+             Transpose(1, 2),
+             nn.LayerNorm(dim_out),
+             Transpose(1, 2),
+             nn.Mish(),
+         )
+
+     def forward(self, x: torch.Tensor, mask: torch.Tensor):
+         output = self.block(x * mask)
+         return output * mask
+
+
+ class CausalResnetBlock1D(ResnetBlock1D):
+     def __init__(self, dim: int, dim_out: int, time_emb_dim: int, groups: int = 8):
+         super(CausalResnetBlock1D, self).__init__(dim, dim_out, time_emb_dim, groups)
+         self.block1 = CausalBlock1D(dim, dim_out)
+         self.block2 = CausalBlock1D(dim_out, dim_out)
+
+
+ class CausalConv1d(torch.nn.Conv1d):
+     def __init__(
+         self,
+         in_channels: int,
+         out_channels: int,
+         kernel_size: int,
+         stride: int = 1,
+         dilation: int = 1,
+         groups: int = 1,
+         bias: bool = True,
+         padding_mode: str = 'zeros',
+         device=None,
+         dtype=None
+     ) -> None:
+         super(CausalConv1d, self).__init__(in_channels, out_channels,
+                                            kernel_size, stride,
+                                            padding=0, dilation=dilation,
+                                            groups=groups, bias=bias,
+                                            padding_mode=padding_mode,
+                                            device=device, dtype=dtype)
+         assert stride == 1
+         self.causal_padding = (kernel_size - 1, 0)
+
+     def forward(self, x: torch.Tensor):
+         x = F.pad(x, self.causal_padding)
+         x = super(CausalConv1d, self).forward(x)
+         return x
+
+
+ class ConditionalDecoder(nn.Module):
+     def __init__(
+         self,
+         in_channels,
+         out_channels,
+         causal=False,
+         channels=(256, 256),
+         dropout=0.05,
+         attention_head_dim=64,
+         n_blocks=1,
+         num_mid_blocks=2,
+         num_heads=4,
+         act_fn="snake",
+     ):
+         """
+         This decoder requires an input with the same shape as the target. So, if your text content
+         is shorter or longer than the output, please resample it before feeding it to the decoder.
+         """
+         super().__init__()
+         channels = tuple(channels)
+         self.in_channels = in_channels
+         self.out_channels = out_channels
+         self.causal = causal
+         # NOTE: forward() reads self.static_chunk_size when building attention
+         # masks; default to 0 (full-context attention) so the attribute always
+         # exists. Streaming setups are expected to overwrite it.
+         self.static_chunk_size = 0
+         self.time_embeddings = SinusoidalPosEmb(in_channels)
+         time_embed_dim = channels[0] * 4
+         self.time_mlp = TimestepEmbedding(
+             in_channels=in_channels,
+             time_embed_dim=time_embed_dim,
+             act_fn="silu",
+         )
+         self.down_blocks = nn.ModuleList([])
+         self.mid_blocks = nn.ModuleList([])
+         self.up_blocks = nn.ModuleList([])
+
+         output_channel = in_channels
+         for i in range(len(channels)):  # pylint: disable=consider-using-enumerate
+             input_channel = output_channel
+             output_channel = channels[i]
+             is_last = i == len(channels) - 1
+             resnet = CausalResnetBlock1D(dim=input_channel, dim_out=output_channel, time_emb_dim=time_embed_dim) if self.causal else \
+                 ResnetBlock1D(dim=input_channel, dim_out=output_channel, time_emb_dim=time_embed_dim)
+             transformer_blocks = nn.ModuleList(
+                 [
+                     BasicTransformerBlock(
+                         dim=output_channel,
+                         num_attention_heads=num_heads,
+                         attention_head_dim=attention_head_dim,
+                         dropout=dropout,
+                         activation_fn=act_fn,
+                     )
+                     for _ in range(n_blocks)
+                 ]
+             )
+             downsample = (
+                 Downsample1D(output_channel) if not is_last else
+                 CausalConv1d(output_channel, output_channel, 3) if self.causal else nn.Conv1d(output_channel, output_channel, 3, padding=1)
+             )
+             self.down_blocks.append(nn.ModuleList([resnet, transformer_blocks, downsample]))
+
+         for _ in range(num_mid_blocks):
+             input_channel = channels[-1]
+             out_channels = channels[-1]
+             resnet = CausalResnetBlock1D(dim=input_channel, dim_out=output_channel, time_emb_dim=time_embed_dim) if self.causal else \
+                 ResnetBlock1D(dim=input_channel, dim_out=output_channel, time_emb_dim=time_embed_dim)
+
+             transformer_blocks = nn.ModuleList(
+                 [
+                     BasicTransformerBlock(
+                         dim=output_channel,
+                         num_attention_heads=num_heads,
+                         attention_head_dim=attention_head_dim,
+                         dropout=dropout,
+                         activation_fn=act_fn,
+                     )
+                     for _ in range(n_blocks)
+                 ]
+             )
+
+             self.mid_blocks.append(nn.ModuleList([resnet, transformer_blocks]))
+
+         channels = channels[::-1] + (channels[0],)
+         for i in range(len(channels) - 1):
+             input_channel = channels[i] * 2
+             output_channel = channels[i + 1]
+             is_last = i == len(channels) - 2
+             resnet = CausalResnetBlock1D(
+                 dim=input_channel,
+                 dim_out=output_channel,
+                 time_emb_dim=time_embed_dim,
+             ) if self.causal else ResnetBlock1D(
+                 dim=input_channel,
+                 dim_out=output_channel,
+                 time_emb_dim=time_embed_dim,
+             )
+             transformer_blocks = nn.ModuleList(
+                 [
+                     BasicTransformerBlock(
+                         dim=output_channel,
+                         num_attention_heads=num_heads,
+                         attention_head_dim=attention_head_dim,
+                         dropout=dropout,
+                         activation_fn=act_fn,
+                     )
+                     for _ in range(n_blocks)
+                 ]
+             )
+             upsample = (
+                 Upsample1D(output_channel, use_conv_transpose=True)
+                 if not is_last
+                 else CausalConv1d(output_channel, output_channel, 3) if self.causal else nn.Conv1d(output_channel, output_channel, 3, padding=1)
+             )
+             self.up_blocks.append(nn.ModuleList([resnet, transformer_blocks, upsample]))
+         self.final_block = CausalBlock1D(channels[-1], channels[-1]) if self.causal else Block1D(channels[-1], channels[-1])
+         self.final_proj = nn.Conv1d(channels[-1], self.out_channels, 1)
+         self.initialize_weights()
+
+     def initialize_weights(self):
+         for m in self.modules():
+             if isinstance(m, nn.Conv1d):
+                 nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
+                 if m.bias is not None:
+                     nn.init.constant_(m.bias, 0)
+             elif isinstance(m, nn.GroupNorm):
+                 nn.init.constant_(m.weight, 1)
+                 nn.init.constant_(m.bias, 0)
+             elif isinstance(m, nn.Linear):
+                 nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
+                 if m.bias is not None:
+                     nn.init.constant_(m.bias, 0)
+
+     def forward(self, x, mask, mu, t, spks=None, cond=None):
+         """Forward pass of the UNet1DConditional model.
+
+         Args:
+             x (torch.Tensor): shape (batch_size, in_channels, time)
+             mask (torch.Tensor): shape (batch_size, 1, time)
+             mu (torch.Tensor): encoder output, shape (batch_size, in_channels, time)
+             t (torch.Tensor): shape (batch_size)
+             spks (torch.Tensor, optional): shape (batch_size, condition_channels). Defaults to None.
+             cond (torch.Tensor, optional): extra conditioning features, packed with x when given. Defaults to None.
+
+         Returns:
+             torch.Tensor: shape (batch_size, out_channels, time)
+         """
+
+         t = self.time_embeddings(t).to(t.dtype)
+         t = self.time_mlp(t)
+
+         x = pack([x, mu], "b * t")[0]
+
+         if spks is not None:
+             spks = repeat(spks, "b c -> b c t", t=x.shape[-1])
+             x = pack([x, spks], "b * t")[0]
+         if cond is not None:
+             x = pack([x, cond], "b * t")[0]
+
+         hiddens = []
+         masks = [mask]
+         for resnet, transformer_blocks, downsample in self.down_blocks:
+             mask_down = masks[-1]
+             x = resnet(x, mask_down, t)
+             x = rearrange(x, "b c t -> b t c").contiguous()
+             # attn_mask = torch.matmul(mask_down.transpose(1, 2).contiguous(), mask_down)
+             attn_mask = add_optional_chunk_mask(x, mask_down.bool(), False, False, 0, self.static_chunk_size, -1)
+             attn_mask = mask_to_bias(attn_mask == 1, x.dtype)
+             for transformer_block in transformer_blocks:
+                 x = transformer_block(
+                     hidden_states=x,
+                     attention_mask=attn_mask,
+                     timestep=t,
+                 )
+             x = rearrange(x, "b t c -> b c t").contiguous()
+             hiddens.append(x)  # Save hidden states for skip connections
+             x = downsample(x * mask_down)
+             masks.append(mask_down[:, :, ::2])
+         masks = masks[:-1]
+         mask_mid = masks[-1]
+
+         for resnet, transformer_blocks in self.mid_blocks:
+             x = resnet(x, mask_mid, t)
+             x = rearrange(x, "b c t -> b t c").contiguous()
+             # attn_mask = torch.matmul(mask_mid.transpose(1, 2).contiguous(), mask_mid)
+             attn_mask = add_optional_chunk_mask(x, mask_mid.bool(), False, False, 0, self.static_chunk_size, -1)
+             attn_mask = mask_to_bias(attn_mask == 1, x.dtype)
+             for transformer_block in transformer_blocks:
+                 x = transformer_block(
+                     hidden_states=x,
+                     attention_mask=attn_mask,
+                     timestep=t,
+                 )
+             x = rearrange(x, "b t c -> b c t").contiguous()
+
+         for resnet, transformer_blocks, upsample in self.up_blocks:
+             mask_up = masks.pop()
+             skip = hiddens.pop()
+             x = pack([x[:, :, :skip.shape[-1]], skip], "b * t")[0]
+             x = resnet(x, mask_up, t)
+             x = rearrange(x, "b c t -> b t c").contiguous()
+             # attn_mask = torch.matmul(mask_up.transpose(1, 2).contiguous(), mask_up)
+             attn_mask = add_optional_chunk_mask(x, mask_up.bool(), False, False, 0, self.static_chunk_size, -1)
+             attn_mask = mask_to_bias(attn_mask == 1, x.dtype)
+             for transformer_block in transformer_blocks:
+                 x = transformer_block(
+                     hidden_states=x,
+                     attention_mask=attn_mask,
+                     timestep=t,
+                 )
+             x = rearrange(x, "b t c -> b c t").contiguous()
+             x = upsample(x * mask_up)
+         x = self.final_block(x, mask_up)
+         output = self.final_proj(x * mask_up)
+         return output * mask
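A quick property check of CausalConv1d (a sketch, not part of the library): because padding is applied only on the left, the output length matches the input length and frame t never depends on frames after t, so zeroing out the future must leave earlier outputs unchanged.

import torch

conv = CausalConv1d(4, 4, kernel_size=3)
x = torch.randn(1, 4, 10)
y1 = conv(x)
x2 = x.clone()
x2[:, :, 7:] = 0.0                 # perturb only frames 7..9
y2 = conv(x2)
assert y1.shape == x.shape         # left-only padding preserves length
assert torch.allclose(y1[:, :, :7], y2[:, :, :7])  # frames 0..6 unaffected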
cosyvoice/flow/flow.py ADDED
@@ -0,0 +1,237 @@
+ # Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu, Zhihao Du)
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ import logging
+ import random
+ from typing import Dict, Optional
+ import torch
+ import torch.nn as nn
+ from torch.nn import functional as F
+ from omegaconf import DictConfig
+ from cosyvoice.utils.mask import make_pad_mask
+
+
+ class MaskedDiffWithXvec(torch.nn.Module):
+     def __init__(self,
+                  input_size: int = 512,
+                  output_size: int = 80,
+                  spk_embed_dim: int = 192,
+                  output_type: str = "mel",
+                  vocab_size: int = 4096,
+                  input_frame_rate: int = 50,
+                  only_mask_loss: bool = True,
+                  encoder: torch.nn.Module = None,
+                  length_regulator: torch.nn.Module = None,
+                  decoder: torch.nn.Module = None,
+                  decoder_conf: Dict = {'in_channels': 240, 'out_channel': 80, 'spk_emb_dim': 80, 'n_spks': 1,
+                                        'cfm_params': DictConfig({'sigma_min': 1e-06, 'solver': 'euler', 't_scheduler': 'cosine',
+                                                                  'training_cfg_rate': 0.2, 'inference_cfg_rate': 0.7, 'reg_loss_type': 'l1'}),
+                                        'decoder_params': {'channels': [256, 256], 'dropout': 0.0, 'attention_head_dim': 64,
+                                                           'n_blocks': 4, 'num_mid_blocks': 12, 'num_heads': 8, 'act_fn': 'gelu'}},
+                  mel_feat_conf: Dict = {'n_fft': 1024, 'num_mels': 80, 'sampling_rate': 22050,
+                                         'hop_size': 256, 'win_size': 1024, 'fmin': 0, 'fmax': 8000}):
+         super().__init__()
+         self.input_size = input_size
+         self.output_size = output_size
+         self.decoder_conf = decoder_conf
+         self.mel_feat_conf = mel_feat_conf
+         self.vocab_size = vocab_size
+         self.output_type = output_type
+         self.input_frame_rate = input_frame_rate
+         logging.info(f"input frame rate={self.input_frame_rate}")
+         self.input_embedding = nn.Embedding(vocab_size, input_size)
+         self.spk_embed_affine_layer = torch.nn.Linear(spk_embed_dim, output_size)
+         self.encoder = encoder
+         self.encoder_proj = torch.nn.Linear(self.encoder.output_size(), output_size)
+         self.decoder = decoder
+         self.length_regulator = length_regulator
+         self.only_mask_loss = only_mask_loss
+
+     def forward(
+             self,
+             batch: dict,
+             device: torch.device,
+     ) -> Dict[str, Optional[torch.Tensor]]:
+         token = batch['speech_token'].to(device)
+         token_len = batch['speech_token_len'].to(device)
+         feat = batch['speech_feat'].to(device)
+         feat_len = batch['speech_feat_len'].to(device)
+         embedding = batch['embedding'].to(device)
+
+         # xvec projection
+         embedding = F.normalize(embedding, dim=1)
+         embedding = self.spk_embed_affine_layer(embedding)
+
+         # mask and embed speech tokens
+         mask = (~make_pad_mask(token_len)).float().unsqueeze(-1).to(device)
+         # Clamp tokens to valid vocabulary range
+         token = torch.clamp(token, min=0, max=self.vocab_size - 1)
+         token = self.input_embedding(token) * mask
+
+         # token encode
+         h, h_lengths = self.encoder(token, token_len)
+         h = self.encoder_proj(h)
+         h, h_lengths = self.length_regulator(h, feat_len)
+
+         # get conditions
+         conds = torch.zeros(feat.shape, device=token.device)
+         for i, j in enumerate(feat_len):
+             if random.random() < 0.5:
+                 continue
+             index = random.randint(0, int(0.3 * j))
+             conds[i, :index] = feat[i, :index]
+         conds = conds.transpose(1, 2)
+
+         mask = (~make_pad_mask(feat_len)).to(h)
+         feat = F.interpolate(feat.unsqueeze(dim=1), size=h.shape[1:], mode="nearest").squeeze(dim=1)
+         loss, _ = self.decoder.compute_loss(
+             feat.transpose(1, 2).contiguous(),
+             mask.unsqueeze(1),
+             h.transpose(1, 2).contiguous(),
+             embedding,
+             cond=conds
+         )
+         return {'loss': loss}
+
+     @torch.inference_mode()
+     def inference(self,
+                   token,
+                   token_len,
+                   prompt_token,
+                   prompt_token_len,
+                   prompt_feat,
+                   prompt_feat_len,
+                   embedding,
+                   flow_cache):
+         assert token.shape[0] == 1
+         # xvec projection
+         embedding = F.normalize(embedding, dim=1)
+         embedding = self.spk_embed_affine_layer(embedding)
+
+         # concat prompt and target speech tokens
+         token_len1, token_len2 = prompt_token.shape[1], token.shape[1]
+         token, token_len = torch.concat([prompt_token, token], dim=1), prompt_token_len + token_len
+         mask = (~make_pad_mask(token_len)).unsqueeze(-1).to(embedding)
+         # Clamp tokens to valid vocabulary range
+         token = torch.clamp(token, min=0, max=self.vocab_size - 1)
+         token = self.input_embedding(token) * mask
+
+         # token encode
+         h, h_lengths = self.encoder(token, token_len)
+         h = self.encoder_proj(h)
+         mel_len1, mel_len2 = prompt_feat.shape[1], int(token_len2 / self.input_frame_rate * 22050 / 256)
+         h, h_lengths = self.length_regulator.inference(h[:, :token_len1], h[:, token_len1:], mel_len1, mel_len2, self.input_frame_rate)
+
+         # get conditions
+         conds = torch.zeros([1, mel_len1 + mel_len2, self.output_size], device=token.device)
+         conds[:, :mel_len1] = prompt_feat
+         conds = conds.transpose(1, 2)
+
+         mask = (~make_pad_mask(torch.tensor([mel_len1 + mel_len2]))).to(h)
+         feat, flow_cache = self.decoder(
+             mu=h.transpose(1, 2).contiguous(),
+             mask=mask.unsqueeze(1),
+             spks=embedding,
+             cond=conds,
+             n_timesteps=10,
+             prompt_len=mel_len1,
+             flow_cache=flow_cache
+         )
+         feat = feat[:, :, mel_len1:]
+         assert feat.shape[2] == mel_len2
+         return feat, flow_cache
+
+
+ class CausalMaskedDiffWithXvec(torch.nn.Module):
+     def __init__(self,
+                  input_size: int = 512,
+                  output_size: int = 80,
+                  spk_embed_dim: int = 192,
+                  output_type: str = "mel",
+                  vocab_size: int = 4096,
+                  input_frame_rate: int = 50,
+                  only_mask_loss: bool = True,
+                  token_mel_ratio: int = 2,
+                  pre_lookahead_len: int = 3,
+                  encoder: torch.nn.Module = None,
+                  decoder: torch.nn.Module = None,
+                  decoder_conf: Dict = {'in_channels': 240, 'out_channel': 80, 'spk_emb_dim': 80, 'n_spks': 1,
+                                        'cfm_params': DictConfig({'sigma_min': 1e-06, 'solver': 'euler', 't_scheduler': 'cosine',
+                                                                  'training_cfg_rate': 0.2, 'inference_cfg_rate': 0.7, 'reg_loss_type': 'l1'}),
+                                        'decoder_params': {'channels': [256, 256], 'dropout': 0.0, 'attention_head_dim': 64,
+                                                           'n_blocks': 4, 'num_mid_blocks': 12, 'num_heads': 8, 'act_fn': 'gelu'}},
+                  mel_feat_conf: Dict = {'n_fft': 1024, 'num_mels': 80, 'sampling_rate': 22050,
+                                         'hop_size': 256, 'win_size': 1024, 'fmin': 0, 'fmax': 8000}):
+         super().__init__()
+         self.input_size = input_size
+         self.output_size = output_size
+         self.decoder_conf = decoder_conf
+         self.mel_feat_conf = mel_feat_conf
+         self.vocab_size = vocab_size
+         self.output_type = output_type
+         self.input_frame_rate = input_frame_rate
+         logging.info(f"input frame rate={self.input_frame_rate}")
+         self.input_embedding = nn.Embedding(vocab_size, input_size)
+         self.spk_embed_affine_layer = torch.nn.Linear(spk_embed_dim, output_size)
+         self.encoder = encoder
+         self.encoder_proj = torch.nn.Linear(self.encoder.output_size(), output_size)
+         self.decoder = decoder
+         self.only_mask_loss = only_mask_loss
+         self.token_mel_ratio = token_mel_ratio
+         self.pre_lookahead_len = pre_lookahead_len
+
+     @torch.inference_mode()
+     def inference(self,
+                   token,
+                   token_len,
+                   prompt_token,
+                   prompt_token_len,
+                   prompt_feat,
+                   prompt_feat_len,
+                   embedding,
+                   finalize):
+         assert token.shape[0] == 1
+         # xvec projection
+         embedding = F.normalize(embedding, dim=1)
+         embedding = self.spk_embed_affine_layer(embedding)
+
+         # concat prompt and target speech tokens
+         token, token_len = torch.concat([prompt_token, token], dim=1), prompt_token_len + token_len
+         mask = (~make_pad_mask(token_len)).unsqueeze(-1).to(embedding)
+         # Clamp tokens to valid vocabulary range
+         token = torch.clamp(token, min=0, max=self.vocab_size - 1)
+         token = self.input_embedding(token) * mask
+
+         # token encode
+         h, h_lengths = self.encoder(token, token_len)
+         if finalize is False:
+             h = h[:, :-self.pre_lookahead_len * self.token_mel_ratio]
+         mel_len1, mel_len2 = prompt_feat.shape[1], h.shape[1] - prompt_feat.shape[1]
+         h = self.encoder_proj(h)
+
+         # get conditions
+         conds = torch.zeros([1, mel_len1 + mel_len2, self.output_size], device=token.device)
+         conds[:, :mel_len1] = prompt_feat
+         conds = conds.transpose(1, 2)
+
+         mask = (~make_pad_mask(torch.tensor([mel_len1 + mel_len2]))).to(h)
+         feat, _ = self.decoder(
+             mu=h.transpose(1, 2).contiguous(),
+             mask=mask.unsqueeze(1),
+             spks=embedding,
+             cond=conds,
+             n_timesteps=10
+         )
+         feat = feat[:, :, mel_len1:]
+         assert feat.shape[2] == mel_len2
+         return feat, None
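A worked example of the token-to-mel length arithmetic in MaskedDiffWithXvec.inference(): speech tokens arrive at input_frame_rate (50 Hz) while mels are produced at sampling_rate / hop_size = 22050 / 256 ≈ 86.13 frames per second, so mel_len2 = int(token_len2 / 50 * 22050 / 256).

token_len2 = 100                          # 100 tokens at 50 Hz = 2.0 s of speech
mel_len2 = int(token_len2 / 50 * 22050 / 256)
assert mel_len2 == 172                    # ~2.0 s * 86.13 mel frames/s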
cosyvoice/flow/flow_matching.py ADDED
@@ -0,0 +1,239 @@
+ # Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu, Zhihao Du)
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ import onnxruntime
+ import torch
+ import torch.nn.functional as F
+ from matcha.models.components.flow_matching import BASECFM
+
+
+ class ConditionalCFM(BASECFM):
+     def __init__(self, in_channels, cfm_params, n_spks=1, spk_emb_dim=64, estimator: torch.nn.Module = None):
+         super().__init__(
+             n_feats=in_channels,
+             cfm_params=cfm_params,
+             n_spks=n_spks,
+             spk_emb_dim=spk_emb_dim,
+         )
+         self.t_scheduler = cfm_params.t_scheduler
+         self.training_cfg_rate = cfm_params.training_cfg_rate
+         self.inference_cfg_rate = cfm_params.inference_cfg_rate
+         in_channels = in_channels + (spk_emb_dim if n_spks > 0 else 0)
+         # Just change the architecture of the estimator here
+         self.estimator = estimator
+
+     @torch.inference_mode()
+     def forward(self, mu, mask, n_timesteps, temperature=1.0, spks=None, cond=None, prompt_len=0, flow_cache=torch.zeros(1, 80, 0, 2)):
+         """Forward diffusion
+
+         Args:
+             mu (torch.Tensor): output of encoder
+                 shape: (batch_size, n_feats, mel_timesteps)
+             mask (torch.Tensor): output_mask
+                 shape: (batch_size, 1, mel_timesteps)
+             n_timesteps (int): number of diffusion steps
+             temperature (float, optional): temperature for scaling noise. Defaults to 1.0.
+             spks (torch.Tensor, optional): speaker embedding. Defaults to None.
+                 shape: (batch_size, spk_emb_dim)
+             cond: Not used but kept for future purposes
+
+         Returns:
+             sample: generated mel-spectrogram
+                 shape: (batch_size, n_feats, mel_timesteps)
+         """
+
+         z = torch.randn_like(mu) * temperature
+         # Handle None flow_cache
+         if flow_cache is not None:
+             cache_size = flow_cache.shape[2]
+             # fix prompt and overlap part mu and z
+             if cache_size != 0:
+                 z[:, :, :cache_size] = flow_cache[:, :, :, 0]
+                 mu[:, :, :cache_size] = flow_cache[:, :, :, 1]
+         else:
+             cache_size = 0
+         z_cache = torch.concat([z[:, :, :prompt_len], z[:, :, -34:]], dim=2)
+         mu_cache = torch.concat([mu[:, :, :prompt_len], mu[:, :, -34:]], dim=2)
+         flow_cache = torch.stack([z_cache, mu_cache], dim=-1)
+
+         t_span = torch.linspace(0, 1, n_timesteps + 1, device=mu.device, dtype=mu.dtype)
+         if self.t_scheduler == 'cosine':
+             t_span = 1 - torch.cos(t_span * 0.5 * torch.pi)
+         return self.solve_euler(z, t_span=t_span, mu=mu, mask=mask, spks=spks, cond=cond), flow_cache
+
+     def solve_euler(self, x, t_span, mu, mask, spks, cond):
+         """
+         Fixed-step Euler solver for ODEs.
+         Args:
+             x (torch.Tensor): random noise
+             t_span (torch.Tensor): n_timesteps interpolated
+                 shape: (n_timesteps + 1,)
+             mu (torch.Tensor): output of encoder
+                 shape: (batch_size, n_feats, mel_timesteps)
+             mask (torch.Tensor): output_mask
+                 shape: (batch_size, 1, mel_timesteps)
+             spks (torch.Tensor, optional): speaker embedding. Defaults to None.
+                 shape: (batch_size, spk_emb_dim)
+             cond: Not used but kept for future purposes
+         """
+         t, _, dt = t_span[0], t_span[-1], t_span[1] - t_span[0]
+         t = t.unsqueeze(dim=0)
+
+         # I am storing this because I can later plot it by putting a debugger here and saving it to a file
+         # Or in future might add like a return_all_steps flag
+         sol = []
+
+         if self.inference_cfg_rate > 0:
+             # Do not use concat, it may cause memory format changed and trt infer with wrong results!
+             x_in = torch.zeros([2, 80, x.size(2)], device=x.device, dtype=x.dtype)
+             mask_in = torch.zeros([2, 1, x.size(2)], device=x.device, dtype=x.dtype)
+             mu_in = torch.zeros([2, 80, x.size(2)], device=x.device, dtype=x.dtype)
+             t_in = torch.zeros([2], device=x.device, dtype=x.dtype)
+             spks_in = torch.zeros([2, 80], device=x.device, dtype=x.dtype)
+             cond_in = torch.zeros([2, 80, x.size(2)], device=x.device, dtype=x.dtype)
+         else:
+             x_in, mask_in, mu_in, t_in, spks_in, cond_in = x, mask, mu, t, spks, cond
+         for step in range(1, len(t_span)):
+             # Classifier-Free Guidance inference introduced in VoiceBox
+             if self.inference_cfg_rate > 0:
+                 x_in[:] = x
+                 mask_in[:] = mask
+                 mu_in[0] = mu
+                 t_in[:] = t.unsqueeze(0)
+                 spks_in[0] = spks
+                 cond_in[0] = cond
+             else:
+                 x_in, mask_in, mu_in, t_in, spks_in, cond_in = x, mask, mu, t, spks, cond
+             dphi_dt = self.forward_estimator(
+                 x_in, mask_in,
+                 mu_in, t_in,
+                 spks_in,
+                 cond_in
+             )
+             if self.inference_cfg_rate > 0:
+                 dphi_dt, cfg_dphi_dt = torch.split(dphi_dt, [x.size(0), x.size(0)], dim=0)
+                 dphi_dt = ((1.0 + self.inference_cfg_rate) * dphi_dt - self.inference_cfg_rate * cfg_dphi_dt)
+             x = x + dt * dphi_dt
+             t = t + dt
+             sol.append(x)
+             if step < len(t_span) - 1:
+                 dt = t_span[step + 1] - t
+
+         return sol[-1].float()
+
+     def forward_estimator(self, x, mask, mu, t, spks, cond):
+         if isinstance(self.estimator, torch.nn.Module):
+             return self.estimator.forward(x, mask, mu, t, spks, cond)
+         elif isinstance(self.estimator, onnxruntime.InferenceSession):
+             ort_inputs = {
+                 'x': x.cpu().numpy(),
+                 'mask': mask.cpu().numpy(),
+                 'mu': mu.cpu().numpy(),
+                 't': t.cpu().numpy(),
+                 'spks': spks.cpu().numpy(),
+                 'cond': cond.cpu().numpy()
+             }
+             output = self.estimator.run(None, ort_inputs)[0]
+             return torch.tensor(output, dtype=x.dtype, device=x.device)
+         else:
+             self.estimator.set_input_shape('x', (2, 80, x.size(2)))
+             self.estimator.set_input_shape('mask', (2, 1, x.size(2)))
+             self.estimator.set_input_shape('mu', (2, 80, x.size(2)))
+             self.estimator.set_input_shape('t', (2,))
+             self.estimator.set_input_shape('spks', (2, 80))
+             self.estimator.set_input_shape('cond', (2, 80, x.size(2)))
+             # run trt engine
+             self.estimator.execute_v2([x.contiguous().data_ptr(),
+                                        mask.contiguous().data_ptr(),
+                                        mu.contiguous().data_ptr(),
+                                        t.contiguous().data_ptr(),
+                                        spks.contiguous().data_ptr(),
+                                        cond.contiguous().data_ptr(),
+                                        x.data_ptr()])
+             return x
+
+     def compute_loss(self, x1, mask, mu, spks=None, cond=None):
+         """Computes diffusion loss
+
+         Args:
+             x1 (torch.Tensor): Target
+                 shape: (batch_size, n_feats, mel_timesteps)
+             mask (torch.Tensor): target mask
+                 shape: (batch_size, 1, mel_timesteps)
+             mu (torch.Tensor): output of encoder
+                 shape: (batch_size, n_feats, mel_timesteps)
+             spks (torch.Tensor, optional): speaker embedding. Defaults to None.
+                 shape: (batch_size, spk_emb_dim)
+
+         Returns:
+             loss: conditional flow matching loss
+             y: conditional flow
+                 shape: (batch_size, n_feats, mel_timesteps)
+         """
+         b, _, t = mu.shape
+
+         # random timestep
+         t = torch.rand([b, 1, 1], device=mu.device, dtype=mu.dtype)
+         if self.t_scheduler == 'cosine':
+             t = 1 - torch.cos(t * 0.5 * torch.pi)
+         # sample noise p(x_0)
+         z = torch.randn_like(x1)
+
+         y = (1 - (1 - self.sigma_min) * t) * z + t * x1
+         u = x1 - (1 - self.sigma_min) * z
+
+         # during training, we randomly drop condition to trade off mode coverage and sample fidelity
+         if self.training_cfg_rate > 0:
+             cfg_mask = torch.rand(b, device=x1.device) > self.training_cfg_rate
+             mu = mu * cfg_mask.view(-1, 1, 1)
+             spks = spks * cfg_mask.view(-1, 1)
+             cond = cond * cfg_mask.view(-1, 1, 1)
+
+         pred = self.estimator(y, mask, mu, t.squeeze(), spks, cond)
+         loss = F.mse_loss(pred * mask, u * mask, reduction="sum") / (torch.sum(mask) * u.shape[1])
+         return loss, y
+
+
+ class CausalConditionalCFM(ConditionalCFM):
+     def __init__(self, in_channels, cfm_params, n_spks=1, spk_emb_dim=64, estimator: torch.nn.Module = None):
+         super().__init__(in_channels, cfm_params, n_spks, spk_emb_dim, estimator)
+         self.rand_noise = torch.randn([1, 80, 50 * 300])
+
+     @torch.inference_mode()
+     def forward(self, mu, mask, n_timesteps, temperature=1.0, spks=None, cond=None):
+         """Forward diffusion
+
+         Args:
+             mu (torch.Tensor): output of encoder
+                 shape: (batch_size, n_feats, mel_timesteps)
+             mask (torch.Tensor): output_mask
+                 shape: (batch_size, 1, mel_timesteps)
+             n_timesteps (int): number of diffusion steps
+             temperature (float, optional): temperature for scaling noise. Defaults to 1.0.
+             spks (torch.Tensor, optional): speaker embedding. Defaults to None.
+                 shape: (batch_size, spk_emb_dim)
+             cond: Not used but kept for future purposes
+
+         Returns:
+             sample: generated mel-spectrogram
+                 shape: (batch_size, n_feats, mel_timesteps)
+         """
+
+         z = self.rand_noise[:, :, :mu.size(2)].to(mu.device) * temperature
+         # The fp16 flag is set on this module by the runtime; default to False
+         # if it was never assigned (defensive assumption).
+         if getattr(self, 'fp16', False) is True:
+             z = z.half()
+         # fix prompt and overlap part mu and z
+         t_span = torch.linspace(0, 1, n_timesteps + 1, device=mu.device, dtype=mu.dtype)
+         if self.t_scheduler == 'cosine':
+             t_span = 1 - torch.cos(t_span * 0.5 * torch.pi)
+         return self.solve_euler(z, t_span=t_span, mu=mu, mask=mask, spks=spks, cond=cond), None
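Two small pieces of solve_euler() can be verified in isolation: the cosine t-scheduler warps uniform step boundaries toward t = 0 (finer steps early in the ODE), and classifier-free guidance combines the conditional and unconditional velocity estimates. A sketch using the defaults above (inference_cfg_rate = 0.7):

import torch

n_timesteps, w = 10, 0.7
t_span = torch.linspace(0, 1, n_timesteps + 1)
t_span = 1 - torch.cos(t_span * 0.5 * torch.pi)   # cosine schedule
assert t_span[0].item() == 0.0
assert abs(t_span[-1].item() - 1.0) < 1e-6
assert t_span[1].item() < 1.0 / n_timesteps       # first step finer than uniform

v_cond, v_uncond = torch.tensor(2.0), torch.tensor(0.5)
v = (1 + w) * v_cond - w * v_uncond               # CFG-combined velocity
# each Euler update is then x <- x + dt * v, with dt = t_span[k + 1] - t_span[k]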
cosyvoice/flow/length_regulator.py ADDED
@@ -0,0 +1,69 @@
+ # Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu, Zhihao Du)
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ from typing import Tuple
+ import torch.nn as nn
+ import torch
+ from torch.nn import functional as F
+ from cosyvoice.utils.mask import make_pad_mask
+
+
+ class InterpolateRegulator(nn.Module):
+     def __init__(
+         self,
+         channels: int,
+         sampling_ratios: Tuple,
+         out_channels: int = None,
+         groups: int = 1,
+     ):
+         super().__init__()
+         self.sampling_ratios = sampling_ratios
+         out_channels = out_channels or channels
+         model = nn.ModuleList([])
+         if len(sampling_ratios) > 0:
+             for _ in sampling_ratios:
+                 module = nn.Conv1d(channels, channels, 3, 1, 1)
+                 norm = nn.GroupNorm(groups, channels)
+                 act = nn.Mish()
+                 model.extend([module, norm, act])
+         model.append(
+             nn.Conv1d(channels, out_channels, 1, 1)
+         )
+         self.model = nn.Sequential(*model)
+
+     def forward(self, x, ylens=None):
+         # x in (B, T, D)
+         mask = (~make_pad_mask(ylens)).to(x).unsqueeze(-1)
+         x = F.interpolate(x.transpose(1, 2).contiguous(), size=ylens.max(), mode='linear')
+         out = self.model(x).transpose(1, 2).contiguous()
+         olens = ylens
+         return out * mask, olens
+
+     def inference(self, x1, x2, mel_len1, mel_len2, input_frame_rate=50):
+         # In inference mode, interpolate the prompt tokens and the target tokens
+         # (head/mid/tail) separately, so we get a clean separation point in the mel.
+         # x in (B, T, D)
+         if x2.shape[1] > 40:
+             x2_head = F.interpolate(x2[:, :20].transpose(1, 2).contiguous(), size=int(20 / input_frame_rate * 22050 / 256), mode='linear')
+             x2_mid = F.interpolate(x2[:, 20:-20].transpose(1, 2).contiguous(), size=mel_len2 - int(20 / input_frame_rate * 22050 / 256) * 2,
+                                    mode='linear')
+             x2_tail = F.interpolate(x2[:, -20:].transpose(1, 2).contiguous(), size=int(20 / input_frame_rate * 22050 / 256), mode='linear')
+             x2 = torch.concat([x2_head, x2_mid, x2_tail], dim=2)
+         else:
+             x2 = F.interpolate(x2.transpose(1, 2).contiguous(), size=mel_len2, mode='linear')
+         if x1.shape[1] != 0:
+             x1 = F.interpolate(x1.transpose(1, 2).contiguous(), size=mel_len1, mode='linear')
+             x = torch.concat([x1, x2], dim=2)
+         else:
+             x = x2
+         out = self.model(x).transpose(1, 2).contiguous()
+         return out, mel_len1 + mel_len2
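A worked example of the head/mid/tail split in InterpolateRegulator.inference(): with input_frame_rate = 50 and a 22050 Hz / 256-hop mel, a 20-token edge always maps to int(20 / 50 * 22050 / 256) = 34 mel frames, so only the middle chunk is stretched to absorb rounding, which keeps the prompt/target boundary time-aligned.

mel_len2 = 172                       # e.g. from 100 tokens, as in flow.py
edge = int(20 / 50 * 22050 / 256)    # 34 frames each for head and tail
mid = mel_len2 - 2 * edge            # 104 frames for tokens 20..-20
assert (edge, mid) == (34, 104)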
cosyvoice/hifigan/discriminator.py ADDED
@@ -0,0 +1,140 @@
+ import torch
+ import torch.nn as nn
+ from torch.nn.utils import weight_norm
+ from typing import List, Optional, Tuple
+ from einops import rearrange
+ from torchaudio.transforms import Spectrogram
+
+
+ class MultipleDiscriminator(nn.Module):
+     def __init__(
+         self, mpd: nn.Module, mrd: nn.Module
+     ):
+         super().__init__()
+         self.mpd = mpd
+         self.mrd = mrd
+
+     def forward(self, y: torch.Tensor, y_hat: torch.Tensor):
+         y_d_rs, y_d_gs, fmap_rs, fmap_gs = [], [], [], []
+         this_y_d_rs, this_y_d_gs, this_fmap_rs, this_fmap_gs = self.mpd(y.unsqueeze(dim=1), y_hat.unsqueeze(dim=1))
+         y_d_rs += this_y_d_rs
+         y_d_gs += this_y_d_gs
+         fmap_rs += this_fmap_rs
+         fmap_gs += this_fmap_gs
+         this_y_d_rs, this_y_d_gs, this_fmap_rs, this_fmap_gs = self.mrd(y, y_hat)
+         y_d_rs += this_y_d_rs
+         y_d_gs += this_y_d_gs
+         fmap_rs += this_fmap_rs
+         fmap_gs += this_fmap_gs
+         return y_d_rs, y_d_gs, fmap_rs, fmap_gs
+
+
+ class MultiResolutionDiscriminator(nn.Module):
+     def __init__(
+         self,
+         fft_sizes: Tuple[int, ...] = (2048, 1024, 512),
+         num_embeddings: Optional[int] = None,
+     ):
+         """
+         Multi-Resolution Discriminator module adapted from https://github.com/descriptinc/descript-audio-codec.
+         Additionally, it allows incorporating conditional information with a learned embeddings table.
+
+         Args:
+             fft_sizes (tuple[int]): Tuple of window lengths for FFT. Defaults to (2048, 1024, 512).
+             num_embeddings (int, optional): Number of embeddings. None means non-conditional discriminator.
+                 Defaults to None.
+         """
+
+         super().__init__()
+         self.discriminators = nn.ModuleList(
+             [DiscriminatorR(window_length=w, num_embeddings=num_embeddings) for w in fft_sizes]
+         )
+
+     def forward(
+         self, y: torch.Tensor, y_hat: torch.Tensor, bandwidth_id: torch.Tensor = None
+     ) -> Tuple[List[torch.Tensor], List[torch.Tensor], List[List[torch.Tensor]], List[List[torch.Tensor]]]:
+         y_d_rs = []
+         y_d_gs = []
+         fmap_rs = []
+         fmap_gs = []
+
+         for d in self.discriminators:
+             y_d_r, fmap_r = d(x=y, cond_embedding_id=bandwidth_id)
+             y_d_g, fmap_g = d(x=y_hat, cond_embedding_id=bandwidth_id)
+             y_d_rs.append(y_d_r)
+             fmap_rs.append(fmap_r)
+             y_d_gs.append(y_d_g)
+             fmap_gs.append(fmap_g)
+
+         return y_d_rs, y_d_gs, fmap_rs, fmap_gs
+
+
+ class DiscriminatorR(nn.Module):
+     def __init__(
+         self,
+         window_length: int,
+         num_embeddings: Optional[int] = None,
+         channels: int = 32,
+         hop_factor: float = 0.25,
+         bands: Tuple[Tuple[float, float], ...] = ((0.0, 0.1), (0.1, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)),
+     ):
+         super().__init__()
+         self.window_length = window_length
+         self.hop_factor = hop_factor
+         self.spec_fn = Spectrogram(
+             n_fft=window_length, hop_length=int(window_length * hop_factor), win_length=window_length, power=None
+         )
+         n_fft = window_length // 2 + 1
+         bands = [(int(b[0] * n_fft), int(b[1] * n_fft)) for b in bands]
+         self.bands = bands
+         convs = lambda: nn.ModuleList(
+             [
+                 weight_norm(nn.Conv2d(2, channels, (3, 9), (1, 1), padding=(1, 4))),
+                 weight_norm(nn.Conv2d(channels, channels, (3, 9), (1, 2), padding=(1, 4))),
+                 weight_norm(nn.Conv2d(channels, channels, (3, 9), (1, 2), padding=(1, 4))),
+                 weight_norm(nn.Conv2d(channels, channels, (3, 9), (1, 2), padding=(1, 4))),
+                 weight_norm(nn.Conv2d(channels, channels, (3, 3), (1, 1), padding=(1, 1))),
+             ]
+         )
+         self.band_convs = nn.ModuleList([convs() for _ in range(len(self.bands))])
+
+         if num_embeddings is not None:
+             self.emb = torch.nn.Embedding(num_embeddings=num_embeddings, embedding_dim=channels)
+             torch.nn.init.zeros_(self.emb.weight)
+
+         self.conv_post = weight_norm(nn.Conv2d(channels, 1, (3, 3), (1, 1), padding=(1, 1)))
+
+     def spectrogram(self, x):
+         # Remove DC offset
+         x = x - x.mean(dim=-1, keepdims=True)
+         # Peak normalize the volume of input audio
+         x = 0.8 * x / (x.abs().max(dim=-1, keepdim=True)[0] + 1e-9)
+         x = self.spec_fn(x)
+         x = torch.view_as_real(x)
+         x = rearrange(x, "b f t c -> b c t f")
+         # Split into bands
+         x_bands = [x[..., b[0]: b[1]] for b in self.bands]
+         return x_bands
+
+     def forward(self, x: torch.Tensor, cond_embedding_id: torch.Tensor = None):
+         x_bands = self.spectrogram(x)
+         fmap = []
+         x = []
+         for band, stack in zip(x_bands, self.band_convs):
+             for i, layer in enumerate(stack):
+                 band = layer(band)
+                 band = torch.nn.functional.leaky_relu(band, 0.1)
+                 if i > 0:
+                     fmap.append(band)
+             x.append(band)
+         x = torch.cat(x, dim=-1)
+         if cond_embedding_id is not None:
+             emb = self.emb(cond_embedding_id)
+             h = (emb.view(1, -1, 1, 1) * x).sum(dim=1, keepdims=True)
+         else:
+             h = 0
+         x = self.conv_post(x)
+         fmap.append(x)
+         x += h
+
+         return x, fmap
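A worked example of DiscriminatorR's band split: for window_length = 2048 the complex spectrogram has 2048 // 2 + 1 = 1025 frequency bins, and the fractional band edges are scaled to bin indices before slicing, so each of the five convolutional stacks sees a disjoint frequency range.

n_fft = 2048 // 2 + 1                                    # 1025 bins
bands = ((0.0, 0.1), (0.1, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0))
bins = [(int(lo * n_fft), int(hi * n_fft)) for lo, hi in bands]
assert bins == [(0, 102), (102, 256), (256, 512), (512, 768), (768, 1025)]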