flrtemis commited on
Commit
d0f0efe
·
verified ·
1 Parent(s): 56e7e0e

Upload folder using huggingface_hub

Browse files
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ audio/bria.mp3 filter=lfs diff=lfs merge=lfs -text
37
+ audio/sample_fr_hibiki_crepes.mp3 filter=lfs diff=lfs merge=lfs -text
.github/ISSUE_TEMPLATE/bug.yml ADDED
@@ -0,0 +1,83 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ name: Bug Report
2
+ description: You found a bug.
3
+ labels: ["bug", "triage"]
4
+ body:
5
+ - type: markdown
6
+ attributes:
7
+ value: |
8
+ Please first check the [FAQ](https://github.com/kyutai-labs/delayed-streams-modeling/blob/main/FAQ.md).
9
+ - type: dropdown
10
+ id: backend
11
+ attributes:
12
+ label: Backend impacted
13
+ description: Which backend is concerned with your bug report?
14
+ options:
15
+ - The PyTorch implementation
16
+ - The MLX implementation
17
+ - The Rust implementation
18
+ - Other / All
19
+ default: 0
20
+ validations:
21
+ required: true
22
+ - type: dropdown
23
+ id: os
24
+ attributes:
25
+ label: Operating system
26
+ description: What is your operating system?
27
+ options:
28
+ - Linux
29
+ - Mac OS X
30
+ - Windows (unsupported)
31
+ default: 0
32
+ validations:
33
+ required: true
34
+ - type: dropdown
35
+ id: hardware
36
+ attributes:
37
+ label: Hardware
38
+ description: What hardware are you using?
39
+ options:
40
+ - CPU
41
+ - GPU with CUDA
42
+ - Metal with MLX
43
+ default: 0
44
+ validations:
45
+ required: true
46
+ - type: textarea
47
+ id: description
48
+ attributes:
49
+ label: Description
50
+ description: Provide a detailed description of your bug.
51
+ placeholder:
52
+ value:
53
+ validations:
54
+ required: true
55
+ - type: textarea
56
+ id: more_info
57
+ attributes:
58
+ label: Extra information
59
+ description: Please provide any other relevant information, such as log extracts, code etc.
60
+ placeholder:
61
+ value:
62
+ validations:
63
+ required: true
64
+ - type: textarea
65
+ id: env
66
+ attributes:
67
+ label: Environment
68
+ description: Please provide any other relevant information, such as log extracts, code etc.
69
+ placeholder:
70
+ value: |
71
+ Fill in the following information on your system.
72
+ - Operating system version:
73
+
74
+ If the backend impacted is PyTorch:
75
+ - Python version:
76
+ - PyTorch version:
77
+ - CUDA version (run `python -c 'import torch; print(torch.version.cuda)'`):
78
+ - GPU model and memory:
79
+
80
+ If the backend is MLX:
81
+ - Mac model:
82
+ validations:
83
+ required: true
.github/ISSUE_TEMPLATE/question.yml ADDED
@@ -0,0 +1,40 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ name: Question
2
+ description: You have a question about the codebase, the paper, or the implementation.
3
+ labels: ["question", "triage"]
4
+ body:
5
+ - type: markdown
6
+ attributes:
7
+ value: |
8
+ Please first check the [FAQ](https://github.com/kyutai-labs/delayed-streams-modeling/blob/main/FAQ.md).
9
+ - type: checkboxes
10
+ id: terms
11
+ attributes:
12
+ label: Due diligence
13
+ description: Have you searched the existing issues / FAQ / Google / asked ChatGPT?
14
+ options:
15
+ - label: I have done my due diligence in trying to find the answer myself.
16
+ required: true
17
+
18
+ - type: dropdown
19
+ id: backend
20
+ attributes:
21
+ label: Topic
22
+ description: What is your question about?
23
+ options:
24
+ - The paper
25
+ - The PyTorch implementation
26
+ - The MLX implementation
27
+ - The Rust implementation
28
+ - Other / All
29
+ default: 0
30
+ validations:
31
+ required: true
32
+ - type: textarea
33
+ id: question
34
+ attributes:
35
+ label: Question
36
+ description: What is your question?
37
+ placeholder: Your question. Please make sure this is directly related to our codebase. We will not provide support for installing PyTorch, CUDA, Rust etc.
38
+ value:
39
+ validations:
40
+ required: true
.github/PULL_REQUEST_TEMPLATE.md ADDED
@@ -0,0 +1,9 @@
 
 
 
 
 
 
 
 
 
 
1
+ ## Checklist
2
+
3
+ - [ ] Read CONTRIBUTING.md, and accept the CLA by including the provided snippet. We will not accept PR without this.
4
+ - [ ] Run pre-commit hook.
5
+ - [ ] If you changed Rust code, run `cargo check`, `cargo clippy`, `cargo test`.
6
+
7
+ ## PR Description
8
+
9
+ <!-- Description for the PR -->
.github/actions/moshi_build/action.yml ADDED
@@ -0,0 +1,32 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ name: moshi_build
2
+ description: 'Build env.'
3
+ runs:
4
+ using: "composite"
5
+ steps:
6
+ - uses: actions/setup-python@v2
7
+ with:
8
+ python-version: '3.10.14'
9
+ - uses: actions/cache@v3
10
+ id: cache
11
+ with:
12
+ path: env
13
+ key: env-${{ hashFiles('moshi/pyproject.toml') }}
14
+ - name: Install dependencies
15
+ if: steps.cache.outputs.cache-hit != 'true'
16
+ shell: bash
17
+ run: |
18
+ python3 -m venv env
19
+ . env/bin/activate
20
+ python -m pip install --upgrade pip
21
+ pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cpu
22
+ pip install moshi==0.2.7
23
+ pip install pre-commit
24
+ - name: Setup env
25
+ shell: bash
26
+ run: |
27
+ source env/bin/activate
28
+ pre-commit install
29
+ - name: Install uv
30
+ uses: astral-sh/setup-uv@v6
31
+ with:
32
+ version: "0.8.13"
.github/workflows/precommit.yml ADDED
@@ -0,0 +1,17 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ name: precommit
2
+ on:
3
+ push:
4
+ branches: [ main ]
5
+ pull_request:
6
+ branches: [ main ]
7
+
8
+ jobs:
9
+ run_precommit:
10
+ name: Run precommit
11
+ runs-on: ubuntu-latest
12
+ steps:
13
+ - uses: actions/checkout@v2
14
+ - uses: ./.github/actions/moshi_build
15
+ - run: |
16
+ source env/bin/activate
17
+ pre-commit run --all-files
.gitignore ADDED
@@ -0,0 +1,195 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Byte-compiled / optimized / DLL files
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+
6
+ # C extensions
7
+ *.so
8
+
9
+ # Distribution / packaging
10
+ .Python
11
+ build/
12
+ develop-eggs/
13
+ dist/
14
+ downloads/
15
+ eggs/
16
+ .eggs/
17
+ lib/
18
+ lib64/
19
+ parts/
20
+ sdist/
21
+ var/
22
+ wheels/
23
+ share/python-wheels/
24
+ *.egg-info/
25
+ .installed.cfg
26
+ *.egg
27
+ MANIFEST
28
+
29
+ # PyInstaller
30
+ # Usually these files are written by a python script from a template
31
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
32
+ *.manifest
33
+ *.spec
34
+
35
+ # Installer logs
36
+ pip-log.txt
37
+ pip-delete-this-directory.txt
38
+
39
+ # Unit test / coverage reports
40
+ htmlcov/
41
+ .tox/
42
+ .nox/
43
+ .coverage
44
+ .coverage.*
45
+ .cache
46
+ nosetests.xml
47
+ coverage.xml
48
+ *.cover
49
+ *.py,cover
50
+ .hypothesis/
51
+ .pytest_cache/
52
+ cover/
53
+
54
+ # Translations
55
+ *.mo
56
+ *.pot
57
+
58
+ # Django stuff:
59
+ *.log
60
+ local_settings.py
61
+ db.sqlite3
62
+ db.sqlite3-journal
63
+
64
+ # Flask stuff:
65
+ instance/
66
+ .webassets-cache
67
+
68
+ # Scrapy stuff:
69
+ .scrapy
70
+
71
+ # Sphinx documentation
72
+ docs/_build/
73
+
74
+ # PyBuilder
75
+ .pybuilder/
76
+ target/
77
+
78
+ # Jupyter Notebook
79
+ .ipynb_checkpoints
80
+
81
+ # IPython
82
+ profile_default/
83
+ ipython_config.py
84
+
85
+ # pyenv
86
+ # For a library or package, you might want to ignore these files since the code is
87
+ # intended to run in multiple environments; otherwise, check them in:
88
+ # .python-version
89
+
90
+ # pipenv
91
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
92
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
93
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
94
+ # install all needed dependencies.
95
+ #Pipfile.lock
96
+
97
+ # UV
98
+ # Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
99
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
100
+ # commonly ignored for libraries.
101
+ #uv.lock
102
+
103
+ # poetry
104
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
105
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
106
+ # commonly ignored for libraries.
107
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
108
+ #poetry.lock
109
+
110
+ # pdm
111
+ # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
112
+ #pdm.lock
113
+ # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
114
+ # in version control.
115
+ # https://pdm.fming.dev/latest/usage/project/#working-with-version-control
116
+ .pdm.toml
117
+ .pdm-python
118
+ .pdm-build/
119
+
120
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
121
+ __pypackages__/
122
+
123
+ # Celery stuff
124
+ celerybeat-schedule
125
+ celerybeat.pid
126
+
127
+ # SageMath parsed files
128
+ *.sage.py
129
+
130
+ # Environments
131
+ .env
132
+ .venv
133
+ env/
134
+ venv/
135
+ ENV/
136
+ env.bak/
137
+ venv.bak/
138
+
139
+ # Spyder project settings
140
+ .spyderproject
141
+ .spyproject
142
+
143
+ # Rope project settings
144
+ .ropeproject
145
+
146
+ # mkdocs documentation
147
+ /site
148
+
149
+ # mypy
150
+ .mypy_cache/
151
+ .dmypy.json
152
+ dmypy.json
153
+
154
+ # Pyre type checker
155
+ .pyre/
156
+
157
+ # pytype static type analyzer
158
+ .pytype/
159
+
160
+ # Cython debug symbols
161
+ cython_debug/
162
+
163
+ # PyCharm
164
+ # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
165
+ # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
166
+ # and can be added to the global gitignore or merged into this file. For a more nuclear
167
+ # option (not recommended) you can uncomment the following to ignore the entire idea folder.
168
+ #.idea/
169
+
170
+ # Abstra
171
+ # Abstra is an AI-powered process automation framework.
172
+ # Ignore directories containing user credentials, local state, and settings.
173
+ # Learn more at https://abstra.io/docs
174
+ .abstra/
175
+
176
+ # Visual Studio Code
177
+ # Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore
178
+ # that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
179
+ # and can be added to the global gitignore or merged into this file. However, if you prefer,
180
+ # you could uncomment the following to ignore the enitre vscode folder
181
+ # .vscode/
182
+
183
+ # Ruff stuff:
184
+ .ruff_cache/
185
+
186
+ # PyPI configuration file
187
+ .pypirc
188
+
189
+ # Cursor
190
+ # Cursor is an AI-powered code editor. `.cursorignore` specifies files/directories to
191
+ # exclude from AI features like autocomplete and code analysis. Recommended for sensitive data
192
+ # refer to https://docs.cursor.com/context/ignore-files
193
+ .cursorignore
194
+ .cursorindexingignore
195
+ out*.wav
.pre-commit-config.yaml ADDED
@@ -0,0 +1,25 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ repos:
2
+ - repo: local
3
+ hooks:
4
+ - id: ruff
5
+ name: ruff
6
+ language: system
7
+ entry: bash -c 'uvx ruff check'
8
+ pass_filenames: false
9
+ always_run: true
10
+ - id: ruff-format
11
+ name: ruff format
12
+ language: system
13
+ entry: bash -c 'uvx ruff format --check'
14
+ pass_filenames: false
15
+ always_run: true
16
+ # Get rid of Jupyter Notebook output because we don't want to keep it in Git
17
+ - repo: https://github.com/kynan/nbstripout
18
+ rev: 0.8.1
19
+ hooks:
20
+ - id: nbstripout
21
+ - repo: https://github.com/pre-commit/pre-commit-hooks
22
+ rev: v5.0.0
23
+ hooks:
24
+ - id: check-added-large-files
25
+ args: ["--maxkb=2048"]
.vscode/settings.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ {
2
+ "python.analysis.typeCheckingMode": "standard"
3
+ }
CONTRIBUTING.md ADDED
@@ -0,0 +1,58 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Contributing to Delayed-Streams-Modeling
2
+
3
+ ## Pull Requests
4
+
5
+ Delayed-Streams-Modeling is the implementation of a research paper.
6
+ Therefore, we do not plan on accepting many pull requests for new features.
7
+ However, we certainly welcome them for bug fixes.
8
+
9
+ 1. Fork the repo and create your branch from `main`.
10
+ 2. If you have changed APIs, update the documentation accordingly.
11
+ 3. Ensure pre-commit hooks pass properly, in particular the linting and typing.
12
+ 4. When changing the Rust code, run `cargo check`, `cargo clippy`, `cargo test`.
13
+ 5. Accept the Contributor License Agreement (see after).
14
+
15
+ Note that in general, we will not accept refactoring of the code.
16
+
17
+
18
+ ## Contributor License Agreement ("CLA")
19
+
20
+ In order to accept your pull request, we need you to submit a Contributor License Agreement.
21
+
22
+ If you agree with the full CLA provided in the next paragraph, copy the following statement in your PR, changing your Github Handle:
23
+
24
+ > I, {your GitHub handle}, confirm that I have read and understood the terms of the CLA of Kyutai-labs, as outlined in the repository's CONTRIBUTING.md, and I agree to be bound by these terms.
25
+
26
+ The full CLA is provided as follows:
27
+
28
+ > I, {your GitHub handle}, hereby grant to Kyutai-labs a perpetual, worldwide, non-exclusive, royalty-free,
29
+ > irrevocable license to use, modify, distribute, and sublicense my Contributions.
30
+
31
+ > I understand and accept that Contributions are limited to modifications, improvements, or changes
32
+ > to the project’s source code submitted via pull requests. I accept that Kyutai-labs has full discretion to
33
+ > review, accept, reject, or request changes to any Contributions I submit, and that submitting
34
+ > a pull request does not guarantee its inclusion in the project.
35
+
36
+ > By submitting a Contribution, I grant Kyutai-labs a perpetual, worldwide license to use, modify,
37
+ > reproduce, distribute, and create derivative works based on my Contributions.
38
+ > I also agree to assign all patent rights for any inventions or improvements that arise from my Contributions,
39
+ > giving the Kyutai-labs full rights to file for and enforce patents.
40
+ > I understand that the Kyutai-labs may commercialize, relicense, or exploit the project and my Contributions without further notice or obligation to me.
41
+ > I confirm that my Contributions are original and that I have the legal right to grant this license.
42
+ > If my Contributions include third-party materials, I will ensure that I have the necessary permissions
43
+ > and will disclose this information. I accept that once my Contributions are integrated, they may be altered or removed at the Kyutai-labs’s discretion.
44
+
45
+ > I acknowledge that I am making these Contributions voluntarily and will not receive any compensation.
46
+ > Furthermore, I understand that all Contributions, including mine, are provided on an "as-is" basis, with no warranties.
47
+ > By submitting a pull request, I agree to be bound by these terms.
48
+
49
+ ## Issues
50
+
51
+ Please submit issues on our Github repository.
52
+
53
+ ## License
54
+
55
+ By contributing to Delayed-Streams-Modeling, you agree that your contributions
56
+ will be licensed under the LICENSE-* files in the root directory of this source
57
+ tree. In particular, the rust code is licensed under APACHE, and the python code
58
+ under MIT.
FAQ.md ADDED
@@ -0,0 +1,56 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # FAQ
2
+
3
+ Here is the answer to a number of frequently asked questions.
4
+
5
+ ### Torch compilation issues
6
+
7
+ With some PyTorch/triton versions, one might encounter compilation errors
8
+ like the following:
9
+ ```
10
+ Traceback (most recent call last):
11
+ ...
12
+ File "site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1153, in make_launcher
13
+ "launch_enter_hook": binary.__class__.launch_enter_hook,
14
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
15
+ torch._inductor.exc.InductorError: AttributeError: type object 'CompiledKernel' has no attribute 'launch_enter_hook'
16
+ ```
17
+
18
+ If that's the case, you can disable torch compilation by setting the following
19
+ environment variable.
20
+ ```bash
21
+ export NO_TORCH_COMPILE=1
22
+ ```
23
+
24
+ ### Issues installing the sentencepiece dependency
25
+
26
+ On some linux distributions (arch) or on macos, the local version of cmake can
27
+ be too recent for the sentencepiece dependency.
28
+
29
+ ```
30
+ CMake Error at CMakeLists.txt:15 (cmake_minimum_required):
31
+ Compatibility with CMake < 3.5 has been removed from CMake.
32
+ ```
33
+
34
+ You can either downgrade your cmake version, e.g. 3.31.0 on arch works or try
35
+ setting `CMAKE_POLICY_VERSION_MINIMUM=3.5`.
36
+
37
+ If you run into some errors when compiling the sentencepiece rust bindings,
38
+ these could also be due to gcc being too recent, e.g. gcc 15. You can get
39
+ around this by using gcc-13, e.g. by setting the following after installing
40
+ the proper gcc packages.
41
+ ```bash
42
+ export CMAKE_C_COMPILER=/usr/bin/gcc-13
43
+ export CMAKE_CXX_COMPILER=/usr/bin/g++-13
44
+ CC=gcc-13 CXX=g++-13 cargo build --release
45
+ ```
46
+
47
+ Alternatively you can set `CXXFLAGS="-include cstdint"`, see this
48
+ [issue](https://github.com/google/sentencepiece/issues/1108).
49
+
50
+ ### Will you release training code?
51
+
52
+ Some finetuning code can be found in the [kyutai-labs/moshi-finetune repo](https://github.com/kyutai-labs/moshi-finetune).
53
+ This code has not been adapted to the Speech-To-Text and Text-To-Speech models
54
+ yet, but it should be a good starting point.
55
+
56
+
LICENSE-APACHE ADDED
@@ -0,0 +1,201 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright [yyyy] [name of copyright owner]
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
LICENSE-MIT ADDED
@@ -0,0 +1,23 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Permission is hereby granted, free of charge, to any
2
+ person obtaining a copy of this software and associated
3
+ documentation files (the "Software"), to deal in the
4
+ Software without restriction, including without
5
+ limitation the rights to use, copy, modify, merge,
6
+ publish, distribute, sublicense, and/or sell copies of
7
+ the Software, and to permit persons to whom the Software
8
+ is furnished to do so, subject to the following
9
+ conditions:
10
+
11
+ The above copyright notice and this permission notice
12
+ shall be included in all copies or substantial portions
13
+ of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF
16
+ ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
17
+ TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
18
+ PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
19
+ SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
20
+ CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
22
+ IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
23
+ DEALINGS IN THE SOFTWARE.
README.md CHANGED
@@ -1,10 +1,345 @@
1
- ---
2
- title: Https Huggingface Co Spaces Ftrtemis Moshi
3
- emoji: 🔥
4
- colorFrom: indigo
5
- colorTo: green
6
- sdk: docker
7
- pinned: false
8
- ---
9
-
10
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Delayed Streams Modeling: Kyutai STT & TTS
2
+
3
+ This repo contains instructions and examples of how to run
4
+ [Kyutai Speech-To-Text](#kyutai-speech-to-text)
5
+ and [Kyutai Text-To-Speech](#kyutai-text-to-speech) models.
6
+ See also [Unmute](https://github.com/kyutai-labs/unmute), a voice AI system built using Kyutai STT and Kyutai TTS.
7
+
8
+ But wait, what is "Delayed Streams Modeling"? It is a technique for solving many streaming X-to-Y tasks (with X, Y in `{speech, text}`)
9
+ that formalizes the approach we had with Moshi and Hibiki. See our [pre-print about DSM](https://arxiv.org/abs/2509.08753).
10
+
11
+ ## Kyutai Speech-To-Text
12
+
13
+ <a href="https://huggingface.co/collections/kyutai/speech-to-text-685403682cf8a23ab9466886" target="_blank" style="margin: 2px;">
14
+ <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-KyutaiSTT-blue" style="display: inline-block; vertical-align: middle;"/>
15
+ </a>
16
+ <a target="_blank" href="https://colab.research.google.com/github/kyutai-labs/delayed-streams-modeling/blob/main/stt_pytorch.ipynb">
17
+ <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
18
+ </a>
19
+
20
+ **More details can be found on the [project page](https://kyutai.org/next/stt).**
21
+
22
+ Kyutai STT models are optimized for real-time usage, can be batched for efficiency, and return word level timestamps.
23
+ We provide two models:
24
+ - `kyutai/stt-1b-en_fr`, an English and French model with ~1B parameters, a 0.5 second delay, and a [semantic VAD](https://kyutai.org/next/stt#semantic-vad).
25
+ - `kyutai/stt-2.6b-en`, an English-only model with ~2.6B parameters and a 2.5 second delay.
26
+
27
+ These speech-to-text models have several advantages:
28
+ - Streaming inference: the models can process audio in chunks, which allows
29
+ for real-time transcription, and is great for interactive applications.
30
+ - Easy batching for maximum efficiency: a H100 can process 400 streams in
31
+ real-time.
32
+ - They return word-level timestamps.
33
+ - The 1B model has a semantic Voice Activity Detection (VAD) component that
34
+ can be used to detect when the user is speaking. This is especially useful
35
+ for building voice agents.
36
+
37
+ ### Implementations overview
38
+
39
+ We provide different implementations of Kyutai STT for different use cases.
40
+ Here is how to choose which one to use:
41
+
42
+ - **PyTorch: for research and tinkering.**
43
+ If you want to call the model from Python for research or experimentation, use our PyTorch implementation.
44
+ - **Rust: for production.**
45
+ If you want to serve Kyutai STT in a production setting, use our Rust server.
46
+ Our robust Rust server provides streaming access to the model over websockets.
47
+ We use this server to run [Unmute](https://unmute.sh/); on a L40S GPU, we can serve 64 simultaneous connections at a real-time factor of 3x.
48
+ - **MLX: for on-device inference on iPhone and Mac.**
49
+ MLX is Apple's ML framework that allows you to use hardware acceleration on Apple silicon.
50
+ If you want to run the model on a Mac or an iPhone, choose the MLX implementation.
51
+
52
+ <details>
53
+ <summary>PyTorch implementation</summary>
54
+ <a href="https://huggingface.co/kyutai/stt-2.6b-en" target="_blank" style="margin: 2px;">
55
+ <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
56
+ </a>
57
+ <a target="_blank" href="https://colab.research.google.com/github/kyutai-labs/delayed-streams-modeling/blob/main/stt_pytorch.ipynb">
58
+ <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
59
+ </a>
60
+
61
+ For an example of how to use the model in a way where you can directly stream in PyTorch tensors,
62
+ [see our Colab notebook](https://colab.research.google.com/github/kyutai-labs/delayed-streams-modeling/blob/main/stt_pytorch.ipynb).
63
+
64
+ This requires the [moshi package](https://pypi.org/project/moshi/)
65
+ with version 0.2.6 or later, which can be installed via pip.
66
+
67
+ If you just want to run the model on a file, you can use `moshi.run_inference`.
68
+
69
+ ```bash
70
+ python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en audio/bria.mp3
71
+ ```
72
+
73
+ If you have [uv](https://docs.astral.sh/uv/) installed, you can skip the installation step
74
+ and just prefix the command above with `uvx --with moshi`.
75
+
76
+ Additionally, we provide two scripts that highlight different usage scenarios. The first script illustrates how to extract word-level timestamps from the model's outputs:
77
+
78
+ ```bash
79
+ uv run \
80
+ scripts/stt_from_file_pytorch.py \
81
+ --hf-repo kyutai/stt-2.6b-en \
82
+ audio/bria.mp3
83
+ ```
84
+
85
+ The second script can be used to run a model on an existing Hugging Face dataset and calculate its performance metrics:
86
+ ```bash
87
+ uv run scripts/evaluate_on_dataset.py \
88
+ --dataset meanwhile \
89
+ --hf-repo kyutai/stt-2.6b-en
90
+ ```
91
+
92
+ Another example shows how one can provide a text-, audio-, or text-audio prompt to our STT model:
93
+ ```bash
94
+ uv run scripts/stt_from_file_pytorch_with_prompt.py \
95
+ --hf-repo kyutai/stt-2.6b-en \
96
+ --file audio/bria.mp3 \
97
+ --prompt_file ./audio/loona.mp3 \
98
+ --prompt_text "Loonah" \
99
+ --cut-prompt-transcript
100
+ ```
101
+ Produces the transcript of `bria.mp3` using the `Loonah` spelling for the name, instead of the `Luna` used without any prompt:
102
+ ```
103
+ In the heart of an ancient forest, where the trees whispered secrets of the past, there lived a peculiar rabbit named Loonah (...)
104
+ ```
105
+
106
+ Apart from nudging the model for a specific spelling of a word, other potential use-cases include speaker adaptation and steering the model towards a specific formatting style or even a language.
107
+ However, please bear in mind that this is an experimental feature and its behavior is very sensitive to the prompt provided.
108
+ </details>
109
+
110
+ <details>
111
+ <summary>Rust server</summary>
112
+
113
+ <a href="https://huggingface.co/kyutai/stt-2.6b-en-candle" target="_blank" style="margin: 2px;">
114
+ <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
115
+ </a>
116
+
117
+ The Rust implementation provides a server that can process multiple streaming
118
+ queries in parallel. Depending on the amount of memory on your GPU, you may
119
+ have to adjust the batch size from the config file. For a L40S GPU, a batch size
120
+ of 64 works well and requests can be processed at 3x real-time speed.
121
+
122
+ In order to run the server, install the [moshi-server
123
+ crate](https://crates.io/crates/moshi-server) via the following command. The
124
+ server code can be found in the
125
+ [kyutai-labs/moshi](https://github.com/kyutai-labs/moshi/tree/main/rust/moshi-server)
126
+ repository.
127
+ ```bash
128
+ cargo install --features cuda moshi-server
129
+ ```
130
+
131
+ Then the server can be started via the following command using the config file
132
+ from this repository.
133
+ For `kyutai/stt-1b-en_fr`, use `configs/config-stt-en_fr-hf.toml`,
134
+ and for `kyutai/stt-2.6b-en`, use `configs/config-stt-en-hf.toml`.
135
+
136
+ ```bash
137
+ moshi-server worker --config configs/config-stt-en_fr-hf.toml
138
+ ```
139
+
140
+ Once the server has started you can transcribe audio from your microphone with the following script.
141
+ ```bash
142
+ uv run scripts/stt_from_mic_rust_server.py
143
+ ```
144
+
145
+ We also provide a script for transcribing from an audio file.
146
+ ```bash
147
+ uv run scripts/stt_from_file_rust_server.py audio/bria.mp3
148
+ ```
149
+
150
+ The script limits the decoding speed to simulate real-time processing of the audio.
151
+ Faster processing can be triggered by setting
152
+ the real-time factor, e.g. `--rtf 1000` will process
153
+ the data as fast as possible.
154
+ </details>
155
+
156
+ <details>
157
+ <summary>Rust standalone</summary>
158
+ <a href="https://huggingface.co/kyutai/stt-2.6b-en-candle" target="_blank" style="margin: 2px;">
159
+ <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
160
+ </a>
161
+
162
+ A standalone Rust example script is provided in the `stt-rs` directory in this repo.
163
+ This can be used as follows:
164
+ ```bash
165
+ cd stt-rs
166
+ cargo run --features cuda -r -- ../audio/bria.mp3
167
+ ```
168
+ You can get the timestamps by adding the `--timestamps` flag, and see the output
169
+ of the semantic VAD by adding the `--vad` flag.
170
+ </details>
171
+
172
+ <details>
173
+ <summary>MLX implementation</summary>
174
+ <a href="https://huggingface.co/kyutai/stt-2.6b-en-mlx" target="_blank" style="margin: 2px;">
175
+ <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
176
+ </a>
177
+
178
+ [MLX](https://ml-explore.github.io/mlx/build/html/index.html) is Apple's ML framework that allows you to use
179
+ hardware acceleration on Apple silicon.
180
+
181
+ This requires the [moshi-mlx package](https://pypi.org/project/moshi-mlx/)
182
+ with version 0.2.6 or later, which can be installed via pip.
183
+
184
+ If you just want to run the model on a file, you can use `moshi_mlx.run_inference`:
185
+
186
+ ```bash
187
+ python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx audio/bria.mp3 --temp 0
188
+ ```
189
+
190
+ If you have [uv](https://docs.astral.sh/uv/) installed, you can skip the installation step
191
+ and just prefix the command above with `uvx --with moshi-mlx`.
192
+
193
+ If you want to transcribe audio from your microphone, use:
194
+
195
+ ```bash
196
+ python scripts/stt_from_mic_mlx.py
197
+ ```
198
+
199
+ The MLX models can also be used in swift using the [moshi-swift
200
+ codebase](https://github.com/kyutai-labs/moshi-swift), the 1b model has been
201
+ tested to work fine on an iPhone 16 Pro.
202
+ </details>
203
+
204
+ ## Kyutai Text-to-Speech
205
+
206
+ <a href="https://huggingface.co/collections/kyutai/text-to-speech-6866192e7e004ed04fd39e29" target="_blank" style="margin: 2px;">
207
+ <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-KyutaiTTS-blue" style="display: inline-block; vertical-align: middle;"/>
208
+ </a>
209
+ <a target="_blank" href="https://colab.research.google.com/github/kyutai-labs/delayed-streams-modeling/blob/main/tts_pytorch.ipynb">
210
+ <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
211
+ </a>
212
+
213
+ **More details can be found on the [project page](https://kyutai.org/next/tts).**
214
+
215
+ We provide different implementations of Kyutai TTS for different use cases. Here is how to choose which one to use:
216
+
217
+ - PyTorch: for research and tinkering. If you want to call the model from Python for research or experimentation, use our PyTorch implementation.
218
+ - Rust: for production. If you want to serve Kyutai TTS in a production setting, use our Rust server. Our robust Rust server provides streaming access to the model over websockets. We use this server to run Unmute.
219
+ - MLX: for on-device inference on iPhone and Mac. MLX is Apple's ML framework that allows you to use hardware acceleration on Apple silicon. If you want to run the model on a Mac or an iPhone, choose the MLX implementation.
220
+
221
+ <details>
222
+ <summary>PyTorch implementation</summary>
223
+
224
+ <a target="_blank" href="https://colab.research.google.com/github/kyutai-labs/delayed-streams-modeling/blob/main/tts_pytorch.ipynb">
225
+ <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
226
+ </a>
227
+
228
+ Check out our [Colab notebook](https://colab.research.google.com/github/kyutai-labs/delayed-streams-modeling/blob/main/tts_pytorch.ipynb) or use the script:
229
+
230
+ ```bash
231
+ # From stdin, plays audio immediately
232
+ echo "Hey, how are you?" | python scripts/tts_pytorch.py - -
233
+
234
+ # From text file to audio file
235
+ python scripts/tts_pytorch.py text_to_say.txt audio_output.wav
236
+ ```
237
+
238
+ The `tts_pytorch.py` script waits for all the text to be available before
239
+ starting the audio generation. A fully streaming implementation is available in
240
+ the `tts_pytorch_streaming.py` script, which can be used as follows:
241
+
242
+ ```bash
243
+ echo "Hey, how are you?" | python scripts/tts_pytorch_streaming.py audio_output.wav
244
+ ```
245
+
246
+ This requires the [moshi package](https://pypi.org/project/moshi/), which can be installed via pip.
247
+ If you have [uv](https://docs.astral.sh/uv/) installed, you can skip the installation step
248
+ and just prefix the command above with `uvx --with moshi`.
249
+ </details>
250
+
251
+ <details>
252
+ <summary>Rust server</summary>
253
+
254
+
255
+ The Rust implementation provides a server that can process multiple streaming
256
+ queries in parallel.
257
+
258
+ Installing the Rust server is a bit tricky because it uses our Python implementation under the hood,
259
+ which also requires installing the Python dependencies.
260
+ Use the [start_tts.sh](https://github.com/kyutai-labs/unmute/blob/main/dockerless/start_tts.sh) script to properly install the Rust server.
261
+ If you already installed the `moshi-server` crate before and it's not working, you might need to force a reinstall by running `cargo uninstall moshi-server` first.
262
+ Feel free to open an issue if the installation is still broken.
263
+
264
+ Once installed, the server can be started via the following command using the config file
265
+ from this repository.
266
+
267
+ ```bash
268
+ moshi-server worker --config configs/config-tts.toml
269
+ ```
270
+
271
+ Once the server has started you can connect to it using our script as follows:
272
+ ```bash
273
+ # From stdin, plays audio immediately
274
+ echo "Hey, how are you?" | python scripts/tts_rust_server.py - -
275
+
276
+ # From text file to audio file
277
+ python scripts/tts_rust_server.py text_to_say.txt audio_output.wav
278
+ ```
279
+
280
+ You can configure the server by modifying `configs/config-tts.toml`. See comments in that file to see what options are available.
281
+ </details>
282
+
283
+ <details>
284
+ <summary>MLX implementation</summary>
285
+
286
+ [MLX](https://ml-explore.github.io/mlx/build/html/index.html) is Apple's ML framework that allows you to use
287
+ hardware acceleration on Apple silicon.
288
+
289
+ Use our example script to run Kyutai TTS on MLX.
290
+ The script takes text from stdin or a file and can output to a file or stream the resulting audio.
291
+ When streaming the output, if the model is not fast enough to keep up with
292
+ real-time, you can use the `--quantize 8` or `--quantize 4` flags to quantize
293
+ the model resulting in faster inference.
294
+
295
+ ```bash
296
+ # From stdin, plays audio immediately
297
+ echo "Hey, how are you?" | python scripts/tts_mlx.py - - --quantize 8
298
+
299
+ # From text file to audio file
300
+ python scripts/tts_mlx.py text_to_say.txt audio_output.wav
301
+ ```
302
+
303
+ This requires the [moshi-mlx package](https://pypi.org/project/moshi-mlx/), which can be installed via pip.
304
+ If you have [uv](https://docs.astral.sh/uv/) installed, you can skip the installation step
305
+ and just prefix the command above with `uvx --with moshi-mlx`.
306
+ </details>
307
+
308
+ ## FAQ
309
+
310
+ Check out the [Frequently Asked Questions](FAQ.md) section before opening an issue.
311
+
312
+ ## License
313
+
314
+ The present code is provided under the MIT license for the Python parts, and Apache license for the Rust backend.
315
+ The web client code is provided under the MIT license.
316
+ Note that parts of this code are based on [AudioCraft](https://github.com/facebookresearch/audiocraft), released under
317
+ the MIT license.
318
+
319
+ The weights for the speech-to-text models are released under the CC-BY 4.0 license.
320
+
321
+ ## Developing
322
+
323
+ Install the [pre-commit hooks](https://pre-commit.com/) by running:
324
+
325
+ ```bash
326
+ pip install pre-commit
327
+ pre-commit install
328
+ ```
329
+
330
+ If you're using `uv`, you can replace the two commands with `uvx pre-commit install`.
331
+
332
+ ## Citation
333
+
334
+ Please cite the following paper.
335
+ ```
336
+ @techreport{kyutai2025streaming,
337
+ title={Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling},
338
+ author={Neil Zeghidour and Eugene Kharitonov and Manu Orsini and Václav Volhejn and Gabriel de Marmiesse and Edouard Grave and Patrick Pérez and Laurent Mazaré and Alexandre Défossez},
339
+ year={2025},
340
+ eprint={2509.08753},
341
+ archivePrefix={arXiv},
342
+ primaryClass={cs.CL},
343
+ url={https://arxiv.org/abs/2509.08753},
344
+ }
345
+ ```
audio/bria.mp3 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f3fa2f70249b671746ad032a1ca701c8c91c22dfeda45b9c5ecb6d453275a85c
3
+ size 717635
audio/loona.mp3 ADDED
Binary file (9 kB). View file
 
audio/sample_fr_hibiki_crepes.mp3 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:47d47d0143a55847a27beb61a990cf91c07ac83548796b853b441d3041e635ee
3
+ size 759450
configs/config-stt-en-hf.toml ADDED
@@ -0,0 +1,42 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ static_dir = "./static/"
2
+ log_dir = "$HOME/tmp/tts-logs"
3
+ instance_name = "tts"
4
+ authorized_ids = ["public_token"]
5
+
6
+ [modules.asr]
7
+ path = "/api/asr-streaming"
8
+ type = "BatchedAsr"
9
+ lm_model_file = "hf://kyutai/stt-2.6b-en-candle/model.safetensors"
10
+ text_tokenizer_file = "hf://kyutai/stt-2.6b-en-candle/tokenizer_en_audio_4000.model"
11
+ audio_tokenizer_file = "hf://kyutai/stt-2.6b-en-candle/mimi-pytorch-e351c8d8@125.safetensors"
12
+ asr_delay_in_tokens = 32
13
+ batch_size = 16
14
+ conditioning_learnt_padding = true
15
+ temperature = 0
16
+
17
+ [modules.asr.model]
18
+ audio_vocab_size = 2049
19
+ text_in_vocab_size = 4001
20
+ text_out_vocab_size = 4000
21
+ audio_codebooks = 32
22
+
23
+ [modules.asr.model.transformer]
24
+ d_model = 2048
25
+ num_heads = 32
26
+ num_layers = 48
27
+ dim_feedforward = 8192
28
+ causal = true
29
+ norm_first = true
30
+ bias_ff = false
31
+ bias_attn = false
32
+ context = 375
33
+ max_period = 100000
34
+ use_conv_block = false
35
+ use_conv_bias = true
36
+ gating = "silu"
37
+ norm = "RmsNorm"
38
+ positional_embedding = "Rope"
39
+ conv_layout = false
40
+ conv_kernel_size = 3
41
+ kv_repeat = 1
42
+ max_seq_len = 40960
configs/config-stt-en_fr-hf.toml ADDED
@@ -0,0 +1,46 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ static_dir = "./static/"
2
+ log_dir = "$HOME/tmp/tts-logs"
3
+ instance_name = "tts"
4
+ authorized_ids = ["public_token"]
5
+
6
+ [modules.asr]
7
+ path = "/api/asr-streaming"
8
+ type = "BatchedAsr"
9
+ lm_model_file = "hf://kyutai/stt-1b-en_fr-candle/model.safetensors"
10
+ text_tokenizer_file = "hf://kyutai/stt-1b-en_fr-candle/tokenizer_en_fr_audio_8000.model"
11
+ audio_tokenizer_file = "hf://kyutai/stt-1b-en_fr-candle/mimi-pytorch-e351c8d8@125.safetensors"
12
+ asr_delay_in_tokens = 6
13
+ batch_size = 64
14
+ conditioning_learnt_padding = true
15
+ temperature = 0.0
16
+
17
+ [modules.asr.model]
18
+ audio_vocab_size = 2049
19
+ text_in_vocab_size = 8001
20
+ text_out_vocab_size = 8000
21
+ audio_codebooks = 32
22
+
23
+ [modules.asr.model.transformer]
24
+ d_model = 2048
25
+ num_heads = 16
26
+ num_layers = 16
27
+ dim_feedforward = 8192
28
+ causal = true
29
+ norm_first = true
30
+ bias_ff = false
31
+ bias_attn = false
32
+ context = 750
33
+ max_period = 100000
34
+ use_conv_block = false
35
+ use_conv_bias = true
36
+ gating = "silu"
37
+ norm = "RmsNorm"
38
+ positional_embedding = "Rope"
39
+ conv_layout = false
40
+ conv_kernel_size = 3
41
+ kv_repeat = 1
42
+ max_seq_len = 40960
43
+
44
+ [modules.asr.model.extra_heads]
45
+ num_heads = 4
46
+ dim = 6
configs/config-tts.toml ADDED
@@ -0,0 +1,60 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ static_dir = "./static/"
2
+ log_dir = "$HOME/tmp/tts-logs"
3
+ # Used to identify the server when logging.
4
+ instance_name = "tts"
5
+ # Simple security: require clients to provide an auth token when connecting.
6
+ # It can be set by setting auth_id to the query string, e.g.
7
+ # "localhost:8089/api/tts_streaming?auth_id=public_token"
8
+ # or by setting the kyutai-api-key HTTP header, see the tts_rust_server.py example.
9
+ authorized_ids = ["public_token"]
10
+
11
+ [modules.tts_py]
12
+ type = "Py"
13
+ # Under which path should the TTS be available? This is relevant because the server
14
+ # can run STT at the same time.
15
+ path = "/api/tts_streaming"
16
+ text_tokenizer_file = "hf://kyutai/tts-1.6b-en_fr/tokenizer_spm_8k_en_fr_audio.model"
17
+ # Batch size determines how many parallel connections can the server handle.
18
+ # Higher values mean slower inference. Adjust to your GPU memory capacity.
19
+ batch_size = 8
20
+ text_bos_token = 1
21
+
22
+ [modules.tts_py.py]
23
+ log_folder = "$HOME/tmp/moshi-server-logs"
24
+ # The folder to read voices from. Can be a local directory, or a Hugging Face repo
25
+ # using the "hf-snapshot://" prefix. We use a glob to only download the .safetensors files
26
+ # with voice embeddings since the repo also contains .wav files we don't need.
27
+ voice_folder = "hf-snapshot://kyutai/tts-voices/**/*.safetensors"
28
+ # This voice will be used if the user doesn't specify one, or selects a non-existent one.
29
+ # This usually means something is wrong, so here we set it to a strange voice to make it clear
30
+ # that something is off.
31
+ # Relative to the voice folder.
32
+ default_voice = "unmute-prod-website/default_voice.wav"
33
+
34
+ # Classifier-free guidance coefficient (see https://arxiv.org/abs/2207.12598).
35
+ # TLDR: A higher CFG value makes the model adhere to the voice more closely,
36
+ # but it can affect audio quality and make it more likely to make mistakes
37
+ # like inserting words that aren't in the script.
38
+ # Technical details:
39
+ # CFG has the disadvantage of increasing inference time, because you need to run the model
40
+ # twice for each step (once with the voice embedding, once without).
41
+ # The default model, "tts-1.6b-en_fr", is trained with CFG distillation, which means it learns
42
+ # to mimic CFG with different coefs during training, without actually using CFG at inference time.
43
+ # There is only a fixed set of CFG coefs the model was trained with, so using a different value
44
+ # will not work. The recommended value for this model is 2.0.
45
+ cfg_coef = 2.0
46
+
47
+ # Whether the unconditioned branch of the CFG should still have text conditioning or not.
48
+ # Typically, no need to touch this.
49
+ cfg_is_no_text = true
50
+
51
+ # Number of padding frames to force between words. Will make the model articulate
52
+ # a bit better with values such as 1.
53
+ padding_between = 1
54
+ # Number of quantization levels for the residual vector quantizer.
55
+ # Higher means better sounding audio but longer inference.
56
+ # The maximum is typically 32, reasonable values are 8-32.
57
+ n_q = 24
58
+ # Make the model speak faster or slower by changing how likely it is to sample the padding token.
59
+ # Should be between -2 and 2, with positive values leading to slower speech.
60
+ padding_bonus = 0
scripts/stt_evaluate_on_dataset.py ADDED
@@ -0,0 +1,387 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # /// script
2
+ # requires-python = ">=3.12"
3
+ # dependencies = [
4
+ # "datasets",
5
+ # "jiwer==3.1.0",
6
+ # "julius",
7
+ # "librosa",
8
+ # "moshi",
9
+ # "openai-whisper",
10
+ # "soundfile",
11
+ # ]
12
+ # ///
13
+ """
14
+ Example implementation of the streaming STT example. Here we group
15
+ test utterances in batches (pre- and post-padded with silence) and
16
+ and then feed these batches into the streaming STT model frame-by-frame.
17
+ """
18
+
19
+ # The outputs I get on my H100 using this code with the 2.6B model,
20
+ # bsz 32:
21
+
22
+ # LibriVox === cer: 4.09% wer: 7.33% corpus_wer: 6.78% RTF = 52.72
23
+ # Ami === cer: 15.99% wer: 18.78% corpus_wer: 12.20% RTF = 28.37
24
+ # LibriSpeech other === cer: 2.31% wer: 5.24% corpus_wer: 4.33% RTF = 44.76
25
+ # LibriSpeech clean === cer: 0.67% wer: 1.95% corpus_wer: 1.69% RTF = 68.19
26
+ # Tedlium (short) === cer: 2.15% wer: 3.65% corpus_wer: 3.33% RTF = 67.44
27
+ # spgispeech === cer: 0.99% wer: 2.00% corpus_wer: 2.03% RTF = 78.64
28
+ # gigaspeech === cer: 6.80% wer: 11.31% corpus_wer: 9.81% RTF = 64.04
29
+ # earnings22 (short) === cer: 12.63% wer: 15.70% corpus_wer: 11.02% RTF = 50.13
30
+
31
+ # Meanwhile === cer: 2.02% wer: 5.50% corpus_wer: 5.60% RTF = 69.19
32
+ # Tedlium (long) == cer: 1.53% wer: 2.56% corpus_wer: 2.97% RTF = 33.92
33
+ # Rev16 === cer: 6.57% wer: 10.08% corpus_wer: 11.43% RTF = 40.34
34
+ # Earnings21 === cer: 5.73% wer: 9.84% corpus_wer: 10.38% RTF = 73.15
35
+
36
+ import argparse
37
+ import dataclasses
38
+ import time
39
+
40
+ import jiwer
41
+ import julius
42
+ import moshi.models
43
+ import torch
44
+ import tqdm
45
+ from datasets import Dataset, load_dataset
46
+ from whisper.normalizers import EnglishTextNormalizer
47
+
48
+ _NORMALIZER = EnglishTextNormalizer()
49
+
50
+
51
+ def get_text(sample):
52
+ possible_keys = [
53
+ "text",
54
+ "sentence",
55
+ "normalized_text",
56
+ "transcript",
57
+ "transcription",
58
+ ]
59
+ for key in possible_keys:
60
+ if key in sample:
61
+ return sample[key]
62
+ raise ValueError(
63
+ f"Expected transcript column of either {possible_keys}."
64
+ f"Got sample with keys: {', '.join(sample.keys())}. Ensure a text column name is present in the dataset."
65
+ )
66
+
67
+
68
+ # The two functions below are adapted from https://github.com/huggingface/open_asr_leaderboard/blob/main/normalizer/data_utils.py
69
+
70
+
71
+ def normalize(batch):
72
+ batch["original_text"] = get_text(batch)
73
+ batch["norm_text"] = _NORMALIZER(batch["original_text"])
74
+ return batch
75
+
76
+
77
+ def is_target_text_in_range(ref):
78
+ if ref.strip() == "ignore time segment in scoring":
79
+ return False
80
+ else:
81
+ return ref.strip() != ""
82
+
83
+
84
+ # End of the adapted part
85
+
86
+
87
+ class AsrMetrics:
88
+ def __init__(self):
89
+ self.cer_sum = 0.0
90
+ self.wer_sum = 0.0
91
+ self.errors_sum = 0.0
92
+ self.total_words_sum = 0.0
93
+ self.num_sequences = 0.0
94
+
95
+ def update(self, hyp: str, ref: str) -> None:
96
+ normalized_ref = _NORMALIZER(ref)
97
+ normalized_hyp = _NORMALIZER(hyp)
98
+
99
+ this_wer = jiwer.wer(normalized_ref, normalized_hyp)
100
+ this_cer = jiwer.cer(normalized_ref, normalized_hyp)
101
+ measures = jiwer.compute_measures(normalized_ref, normalized_hyp)
102
+
103
+ self.wer_sum += this_wer
104
+ self.cer_sum += this_cer
105
+ self.errors_sum += (
106
+ measures["substitutions"] + measures["deletions"] + measures["insertions"]
107
+ )
108
+ self.total_words_sum += (
109
+ measures["substitutions"] + measures["deletions"] + measures["hits"]
110
+ )
111
+ self.num_sequences += 1
112
+
113
+ def compute(self) -> dict:
114
+ assert self.num_sequences > 0, (
115
+ "Unable to compute with total number of comparisons <= 0"
116
+ ) # type: ignore
117
+ return {
118
+ "cer": (self.cer_sum / self.num_sequences),
119
+ "wer": (self.wer_sum / self.num_sequences),
120
+ "corpus_wer": (self.errors_sum / self.total_words_sum),
121
+ }
122
+
123
+ def __str__(self) -> str:
124
+ result = self.compute()
125
+ return " ".join(f"{k}: {100 * v:.2f}%" for k, v in result.items())
126
+
127
+
128
+ class Timer:
129
+ def __init__(self):
130
+ self.total = 0
131
+ self._start_time = None
132
+
133
+ def __enter__(self):
134
+ self._start_time = time.perf_counter()
135
+ return self
136
+
137
+ def __exit__(self, *_):
138
+ self.total += time.perf_counter() - self._start_time
139
+ self._start_time = None
140
+
141
+
142
+ @dataclasses.dataclass
143
+ class _DatasetInfo:
144
+ alias: str
145
+
146
+ name: str
147
+ config: str
148
+ split: str = "test"
149
+
150
+
151
+ _DATASETS = [
152
+ # Long-form datasets from distil-whisper
153
+ _DatasetInfo("rev16", "distil-whisper/rev16", "whisper_subset"),
154
+ _DatasetInfo("earnings21", "distil-whisper/earnings21", "full"),
155
+ _DatasetInfo("earnings22", "distil-whisper/earnings22", "full"),
156
+ _DatasetInfo("tedlium", "distil-whisper/tedlium-long-form", None),
157
+ _DatasetInfo("meanwhile", "distil-whisper/meanwhile", None),
158
+ # Short-form datasets from OpenASR leaderboard
159
+ _DatasetInfo("ami", "hf-audio/esb-datasets-test-only-sorted", "ami"),
160
+ _DatasetInfo(
161
+ "librispeech.clean",
162
+ "hf-audio/esb-datasets-test-only-sorted",
163
+ "librispeech",
164
+ split="test.clean",
165
+ ),
166
+ _DatasetInfo(
167
+ "librispeech.other",
168
+ "hf-audio/esb-datasets-test-only-sorted",
169
+ "librispeech",
170
+ split="test.other",
171
+ ),
172
+ _DatasetInfo("voxpopuli", "hf-audio/esb-datasets-test-only-sorted", "voxpopuli"),
173
+ _DatasetInfo("spgispeech", "hf-audio/esb-datasets-test-only-sorted", "spgispeech"),
174
+ _DatasetInfo("gigaspeech", "hf-audio/esb-datasets-test-only-sorted", "gigaspeech"),
175
+ _DatasetInfo("tedlium-short", "hf-audio/esb-datasets-test-only-sorted", "tedlium"),
176
+ _DatasetInfo(
177
+ "earnings22-short", "hf-audio/esb-datasets-test-only-sorted", "earnings22"
178
+ ),
179
+ ]
180
+ DATASET_MAP = {dataset.alias: dataset for dataset in _DATASETS}
181
+
182
+
183
+ def get_dataset(args) -> Dataset:
184
+ if args.dataset not in DATASET_MAP:
185
+ raise RuntimeError(f"Unknown dataset: {args.dataset}")
186
+
187
+ info = DATASET_MAP[args.dataset]
188
+
189
+ dataset = load_dataset(
190
+ info.name,
191
+ info.config,
192
+ split=info.split,
193
+ cache_dir=args.hf_cache_dir,
194
+ streaming=False,
195
+ token=True,
196
+ )
197
+ dataset = dataset.map(normalize)
198
+ dataset = dataset.filter(is_target_text_in_range, input_columns=["norm_text"])
199
+
200
+ return dataset
201
+
202
+
203
+ @torch.no_grad
204
+ def get_padded_batch(
205
+ audios: list[tuple[torch.Tensor, int]],
206
+ before_padding: float,
207
+ after_padding: float,
208
+ audio_encoder,
209
+ ):
210
+ sample_rate = audio_encoder.sample_rate
211
+
212
+ max_len = 0
213
+ batch = []
214
+ durations = []
215
+ for audio, sr in audios:
216
+ durations.append(audio.shape[-1] / sr)
217
+ audio = julius.resample_frac(audio, int(sr), int(sample_rate))
218
+ audio = torch.nn.functional.pad(
219
+ audio, (int(before_padding * sample_rate), int(after_padding * sample_rate))
220
+ )
221
+ max_len = max(max_len, audio.shape[-1])
222
+ batch.append(audio)
223
+
224
+ target = max_len
225
+ if target % audio_encoder.frame_size != 0:
226
+ target = target + (
227
+ audio_encoder.frame_size - max_len % audio_encoder.frame_size
228
+ )
229
+ padded_batch = torch.stack(
230
+ [
231
+ torch.nn.functional.pad(audio, (0, target - audio.shape[-1]))
232
+ for audio in batch
233
+ ]
234
+ )
235
+ return padded_batch
236
+
237
+
238
+ @torch.no_grad
239
+ def streaming_transcribe(
240
+ padded_batch: torch.Tensor,
241
+ mimi,
242
+ lm_gen,
243
+ ):
244
+ bsz = padded_batch.shape[0]
245
+
246
+ text_tokens_acc = []
247
+
248
+ with mimi.streaming(bsz), lm_gen.streaming(bsz):
249
+ for offset in range(0, padded_batch.shape[-1], mimi.frame_size):
250
+ audio_chunk = padded_batch[:, offset : offset + mimi.frame_size]
251
+ audio_chunk = audio_chunk[:, None, :]
252
+
253
+ audio_tokens = mimi.encode(audio_chunk)
254
+ text_tokens = lm_gen.step(audio_tokens)
255
+ if text_tokens is not None:
256
+ text_tokens_acc.append(text_tokens)
257
+
258
+ return torch.concat(text_tokens_acc, axis=-1)
259
+
260
+
261
+ def run_inference(
262
+ dataset,
263
+ mimi,
264
+ lm_gen,
265
+ tokenizer,
266
+ padding_token_id,
267
+ before_padding_sec,
268
+ after_padding_sec,
269
+ ):
270
+ metrics = AsrMetrics()
271
+ audio_time = 0.0
272
+ inference_timer = Timer()
273
+
274
+ for batch in tqdm.tqdm(dataset.iter(args.batch_size)):
275
+ audio_data = list(
276
+ zip(
277
+ [torch.tensor(x["array"]).float() for x in batch["audio"]],
278
+ [x["sampling_rate"] for x in batch["audio"]],
279
+ )
280
+ )
281
+
282
+ audio_time += sum(audio.shape[-1] / sr for (audio, sr) in audio_data)
283
+
284
+ gt_transcripts = batch["original_text"]
285
+
286
+ padded_batch = get_padded_batch(
287
+ audio_data,
288
+ before_padding=before_padding_sec,
289
+ after_padding=after_padding_sec,
290
+ audio_encoder=mimi,
291
+ )
292
+ padded_batch = padded_batch.cuda()
293
+
294
+ with inference_timer:
295
+ text_tokens = streaming_transcribe(
296
+ padded_batch,
297
+ mimi=mimi,
298
+ lm_gen=lm_gen,
299
+ )
300
+
301
+ for batch_index in range(text_tokens.shape[0]):
302
+ utterance_tokens = text_tokens[batch_index, ...]
303
+ utterance_tokens = utterance_tokens[utterance_tokens > padding_token_id]
304
+ text = tokenizer.decode(utterance_tokens.cpu().numpy().tolist())
305
+ metrics.update(hyp=text, ref=gt_transcripts[batch_index])
306
+
307
+ return metrics, inference_timer.total, audio_time
308
+
309
+
310
+ def main(args):
311
+ torch.set_float32_matmul_precision("high")
312
+
313
+ info = moshi.models.loaders.CheckpointInfo.from_hf_repo(
314
+ args.hf_repo,
315
+ moshi_weights=args.moshi_weight,
316
+ mimi_weights=args.mimi_weight,
317
+ tokenizer=args.tokenizer,
318
+ config_path=args.config_path,
319
+ )
320
+
321
+ mimi = info.get_mimi(device=args.device)
322
+ tokenizer = info.get_text_tokenizer()
323
+ lm = info.get_moshi(
324
+ device=args.device,
325
+ dtype=torch.bfloat16,
326
+ )
327
+ lm_gen = moshi.models.LMGen(lm, temp=0, temp_text=0.0)
328
+ dataset = get_dataset(args)
329
+
330
+ padding_token_id = info.raw_config.get("text_padding_token_id", 3)
331
+ # Putting in some conservative defaults
332
+ audio_silence_prefix_seconds = info.stt_config.get(
333
+ "audio_silence_prefix_seconds", 1.0
334
+ )
335
+ audio_delay_seconds = info.stt_config.get("audio_delay_seconds", 5.0)
336
+
337
+ wer_metric, inference_time, audio_time = run_inference(
338
+ dataset,
339
+ mimi,
340
+ lm_gen,
341
+ tokenizer,
342
+ padding_token_id,
343
+ audio_silence_prefix_seconds,
344
+ audio_delay_seconds + 0.5,
345
+ )
346
+
347
+ print(wer_metric, f"RTF = {audio_time / inference_time:.2f}")
348
+
349
+
350
+ if __name__ == "__main__":
351
+ parser = argparse.ArgumentParser(description="Example streaming STT inference.")
352
+ parser.add_argument(
353
+ "--dataset",
354
+ required=True,
355
+ choices=DATASET_MAP.keys(),
356
+ help="Dataset to run inference on.",
357
+ )
358
+
359
+ parser.add_argument(
360
+ "--hf-repo", type=str, help="HF repo to load the STT model from."
361
+ )
362
+ parser.add_argument("--tokenizer", type=str, help="Path to a local tokenizer file.")
363
+ parser.add_argument(
364
+ "--moshi-weight", type=str, help="Path to a local checkpoint file."
365
+ )
366
+ parser.add_argument(
367
+ "--mimi-weight", type=str, help="Path to a local checkpoint file for Mimi."
368
+ )
369
+ parser.add_argument(
370
+ "--config-path", type=str, help="Path to a local config file.", default=None
371
+ )
372
+ parser.add_argument(
373
+ "--batch-size",
374
+ type=int,
375
+ help="Batch size.",
376
+ default=32,
377
+ )
378
+ parser.add_argument(
379
+ "--device",
380
+ type=str,
381
+ default="cuda",
382
+ help="Device on which to run, defaults to 'cuda'.",
383
+ )
384
+ parser.add_argument("--hf-cache-dir", type=str, help="HuggingFace cache folder.")
385
+ args = parser.parse_args()
386
+
387
+ main(args)
scripts/stt_from_file_mlx.py ADDED
@@ -0,0 +1,100 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # /// script
2
+ # requires-python = ">=3.12"
3
+ # dependencies = [
4
+ # "huggingface_hub",
5
+ # "moshi_mlx==0.2.12",
6
+ # "numpy",
7
+ # "sentencepiece",
8
+ # "sounddevice",
9
+ # "sphn",
10
+ # ]
11
+ # ///
12
+
13
+ import argparse
14
+ import json
15
+
16
+ import mlx.core as mx
17
+ import mlx.nn as nn
18
+ import sentencepiece
19
+ import sphn
20
+ from huggingface_hub import hf_hub_download
21
+ from moshi_mlx import models, utils
22
+
23
+ if __name__ == "__main__":
24
+ parser = argparse.ArgumentParser()
25
+ parser.add_argument("in_file", help="The file to transcribe.")
26
+ parser.add_argument("--max-steps", default=4096)
27
+ parser.add_argument("--hf-repo")
28
+ parser.add_argument(
29
+ "--vad", action="store_true", help="Enable VAD (Voice Activity Detection)."
30
+ )
31
+ args = parser.parse_args()
32
+
33
+ audio, _ = sphn.read(args.in_file, sample_rate=24000)
34
+ if args.hf_repo is None:
35
+ if args.vad:
36
+ args.hf_repo = "kyutai/stt-1b-en_fr-candle"
37
+ else:
38
+ args.hf_repo = "kyutai/stt-1b-en_fr-mlx"
39
+ lm_config = hf_hub_download(args.hf_repo, "config.json")
40
+ with open(lm_config, "r") as fobj:
41
+ lm_config = json.load(fobj)
42
+ mimi_weights = hf_hub_download(args.hf_repo, lm_config["mimi_name"])
43
+ moshi_name = lm_config.get("moshi_name", "model.safetensors")
44
+ moshi_weights = hf_hub_download(args.hf_repo, moshi_name)
45
+ text_tokenizer = hf_hub_download(args.hf_repo, lm_config["tokenizer_name"])
46
+
47
+ lm_config = models.LmConfig.from_config_dict(lm_config)
48
+ model = models.Lm(lm_config)
49
+ model.set_dtype(mx.bfloat16)
50
+ if moshi_weights.endswith(".q4.safetensors"):
51
+ nn.quantize(model, bits=4, group_size=32)
52
+ elif moshi_weights.endswith(".q8.safetensors"):
53
+ nn.quantize(model, bits=8, group_size=64)
54
+
55
+ print(f"loading model weights from {moshi_weights}")
56
+ if args.hf_repo.endswith("-candle"):
57
+ model.load_pytorch_weights(moshi_weights, lm_config, strict=True)
58
+ else:
59
+ model.load_weights(moshi_weights, strict=True)
60
+
61
+ print(f"loading the text tokenizer from {text_tokenizer}")
62
+ text_tokenizer = sentencepiece.SentencePieceProcessor(text_tokenizer) # type: ignore
63
+
64
+ print(f"loading the audio tokenizer {mimi_weights}")
65
+ audio_tokenizer = models.mimi.Mimi(models.mimi_202407(32))
66
+ audio_tokenizer.load_pytorch_weights(str(mimi_weights), strict=True)
67
+ print("warming up the model")
68
+ model.warmup()
69
+ gen = models.LmGen(
70
+ model=model,
71
+ max_steps=args.max_steps,
72
+ text_sampler=utils.Sampler(top_k=25, temp=0),
73
+ audio_sampler=utils.Sampler(top_k=250, temp=0.8),
74
+ check=False,
75
+ )
76
+
77
+ print(f"starting inference {audio.shape}")
78
+ audio = mx.concat([mx.array(audio), mx.zeros((1, 48000))], axis=-1)
79
+ last_print_was_vad = False
80
+ for start_idx in range(0, audio.shape[-1] // 1920 * 1920, 1920):
81
+ block = audio[:, None, start_idx : start_idx + 1920]
82
+ other_audio_tokens = audio_tokenizer.encode_step(block).transpose(0, 2, 1)
83
+ if args.vad:
84
+ text_token, vad_heads = gen.step_with_extra_heads(other_audio_tokens[0])
85
+ if vad_heads:
86
+ pr_vad = vad_heads[2][0, 0, 0].item()
87
+ if pr_vad > 0.5 and not last_print_was_vad:
88
+ print(" [end of turn detected]")
89
+ last_print_was_vad = True
90
+ else:
91
+ text_token = gen.step(other_audio_tokens[0])
92
+ text_token = text_token[0].item()
93
+ audio_tokens = gen.last_audio_tokens()
94
+ _text = None
95
+ if text_token not in (0, 3):
96
+ _text = text_tokenizer.id_to_piece(text_token) # type: ignore
97
+ _text = _text.replace("▁", " ")
98
+ print(_text, end="", flush=True)
99
+ last_print_was_vad = False
100
+ print()
scripts/stt_from_file_pytorch.py ADDED
@@ -0,0 +1,247 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # /// script
2
+ # requires-python = ">=3.12"
3
+ # dependencies = [
4
+ # "julius",
5
+ # "librosa",
6
+ # "soundfile",
7
+ # "moshi==0.2.11",
8
+ # ]
9
+ # ///
10
+
11
+ """An example script that illustrates how one can get per-word timestamps from
12
+ Kyutai STT models.
13
+ """
14
+
15
+ import argparse
16
+ import dataclasses
17
+ import itertools
18
+ import math
19
+
20
+ import julius
21
+ import moshi.models
22
+ import sphn
23
+ import time
24
+ import torch
25
+
26
+
27
+ @dataclasses.dataclass
28
+ class TimestampedText:
29
+ text: str
30
+ timestamp: tuple[float, float]
31
+
32
+ def __str__(self):
33
+ return f"{self.text} ({self.timestamp[0]:.2f}:{self.timestamp[1]:.2f})"
34
+
35
+
36
+ def tokens_to_timestamped_text(
37
+ text_tokens,
38
+ tokenizer,
39
+ frame_rate,
40
+ end_of_padding_id,
41
+ padding_token_id,
42
+ offset_seconds,
43
+ ) -> list[TimestampedText]:
44
+ text_tokens = text_tokens.cpu().view(-1)
45
+
46
+ # Normally `end_of_padding` tokens indicate word boundaries.
47
+ # Everything between them should be a single word;
48
+ # the time offset of the those tokens correspond to word start and
49
+ # end timestamps (minus silence prefix and audio delay).
50
+ #
51
+ # However, in rare cases some complexities could arise. Firstly,
52
+ # for words that are said quickly but are represented with
53
+ # multiple tokens, the boundary might be omitted. Secondly,
54
+ # for the very last word the end boundary might not happen.
55
+ # Below is a code snippet that handles those situations a bit
56
+ # more carefully.
57
+
58
+ sequence_timestamps = []
59
+
60
+ def _tstmp(start_position, end_position):
61
+ return (
62
+ max(0, start_position / frame_rate - offset_seconds),
63
+ max(0, end_position / frame_rate - offset_seconds),
64
+ )
65
+
66
+ def _decode(t):
67
+ t = t[t > padding_token_id]
68
+ return tokenizer.decode(t.numpy().tolist())
69
+
70
+ def _decode_segment(start, end):
71
+ nonlocal text_tokens
72
+ nonlocal sequence_timestamps
73
+
74
+ text = _decode(text_tokens[start:end])
75
+ words_inside_segment = text.split()
76
+
77
+ if len(words_inside_segment) == 0:
78
+ return
79
+ if len(words_inside_segment) == 1:
80
+ # Single word within the boundaries, the general case
81
+ sequence_timestamps.append(
82
+ TimestampedText(text=text, timestamp=_tstmp(start, end))
83
+ )
84
+ else:
85
+ # We're in a rare situation where multiple words are so close they are not separated by `end_of_padding`.
86
+ # We tokenize words one-by-one; each word is assigned with as many frames as much tokens it has.
87
+ for adjacent_word in words_inside_segment[:-1]:
88
+ n_tokens = len(tokenizer.encode(adjacent_word))
89
+ sequence_timestamps.append(
90
+ TimestampedText(
91
+ text=adjacent_word, timestamp=_tstmp(start, start + n_tokens)
92
+ )
93
+ )
94
+ start += n_tokens
95
+
96
+ # The last word takes everything until the boundary
97
+ adjacent_word = words_inside_segment[-1]
98
+ sequence_timestamps.append(
99
+ TimestampedText(text=adjacent_word, timestamp=_tstmp(start, end))
100
+ )
101
+
102
+ (segment_boundaries,) = torch.where(text_tokens == end_of_padding_id)
103
+
104
+ if not segment_boundaries.numel():
105
+ return []
106
+
107
+ for i in range(len(segment_boundaries) - 1):
108
+ segment_start = int(segment_boundaries[i]) + 1
109
+ segment_end = int(segment_boundaries[i + 1])
110
+
111
+ _decode_segment(segment_start, segment_end)
112
+
113
+ last_segment_start = segment_boundaries[-1] + 1
114
+
115
+ boundary_token = torch.tensor([tokenizer.eos_id()])
116
+ (end_of_last_segment,) = torch.where(
117
+ torch.isin(text_tokens[last_segment_start:], boundary_token)
118
+ )
119
+
120
+ if not end_of_last_segment.numel():
121
+ # upper-bound either end of the audio or 1 second duration, whicher is smaller
122
+ last_segment_end = min(text_tokens.shape[-1], last_segment_start + frame_rate)
123
+ else:
124
+ last_segment_end = last_segment_start + end_of_last_segment[0]
125
+ _decode_segment(last_segment_start, last_segment_end)
126
+
127
+ return sequence_timestamps
128
+
129
+
130
+ def main(args):
131
+ if args.vad and args.hf_repo is None:
132
+ args.hf_repo = "kyutai/stt-1b-en_fr-candle"
133
+
134
+ info = moshi.models.loaders.CheckpointInfo.from_hf_repo(
135
+ args.hf_repo,
136
+ moshi_weights=args.moshi_weight,
137
+ mimi_weights=args.mimi_weight,
138
+ tokenizer=args.tokenizer,
139
+ config_path=args.config_path,
140
+ )
141
+
142
+ mimi = info.get_mimi(device=args.device)
143
+ tokenizer = info.get_text_tokenizer()
144
+ lm = info.get_moshi(
145
+ device=args.device,
146
+ dtype=torch.bfloat16,
147
+ )
148
+ lm_gen = moshi.models.LMGen(lm, temp=0, temp_text=0.0)
149
+
150
+ audio_silence_prefix_seconds = info.stt_config.get(
151
+ "audio_silence_prefix_seconds", 1.0
152
+ )
153
+ audio_delay_seconds = info.stt_config.get("audio_delay_seconds", 5.0)
154
+ padding_token_id = info.raw_config.get("text_padding_token_id", 3)
155
+
156
+ audio, input_sample_rate = sphn.read(args.in_file)
157
+ audio = torch.from_numpy(audio).to(args.device)
158
+ audio = julius.resample_frac(audio, input_sample_rate, mimi.sample_rate)
159
+ if audio.shape[-1] % mimi.frame_size != 0:
160
+ to_pad = mimi.frame_size - audio.shape[-1] % mimi.frame_size
161
+ audio = torch.nn.functional.pad(audio, (0, to_pad))
162
+
163
+ text_tokens_accum = []
164
+
165
+ n_prefix_chunks = math.ceil(audio_silence_prefix_seconds * mimi.frame_rate)
166
+ n_suffix_chunks = math.ceil(audio_delay_seconds * mimi.frame_rate)
167
+ silence_chunk = torch.zeros(
168
+ (1, 1, mimi.frame_size), dtype=torch.float32, device=args.device
169
+ )
170
+
171
+ chunks = itertools.chain(
172
+ itertools.repeat(silence_chunk, n_prefix_chunks),
173
+ torch.split(audio[:, None], mimi.frame_size, dim=-1),
174
+ itertools.repeat(silence_chunk, n_suffix_chunks),
175
+ )
176
+
177
+ start_time = time.time()
178
+ nchunks = 0
179
+ last_print_was_vad = False
180
+ with mimi.streaming(1), lm_gen.streaming(1):
181
+ for audio_chunk in chunks:
182
+ nchunks += 1
183
+ audio_tokens = mimi.encode(audio_chunk)
184
+ if args.vad:
185
+ text_tokens, vad_heads = lm_gen.step_with_extra_heads(audio_tokens)
186
+ if vad_heads:
187
+ pr_vad = vad_heads[2][0, 0, 0].cpu().item()
188
+ if pr_vad > 0.5 and not last_print_was_vad:
189
+ print(" [end of turn detected]")
190
+ last_print_was_vad = True
191
+ else:
192
+ text_tokens = lm_gen.step(audio_tokens)
193
+ text_token = text_tokens[0, 0, 0].cpu().item()
194
+ if text_token not in (0, 3):
195
+ _text = tokenizer.id_to_piece(text_tokens[0, 0, 0].cpu().item()) # type: ignore
196
+ _text = _text.replace("▁", " ")
197
+ print(_text, end="", flush=True)
198
+ last_print_was_vad = False
199
+ text_tokens_accum.append(text_tokens)
200
+
201
+ utterance_tokens = torch.concat(text_tokens_accum, dim=-1)
202
+ dt = time.time() - start_time
203
+ print(
204
+ f"\nprocessed {nchunks} chunks in {dt:.2f} seconds, steps per second: {nchunks / dt:.2f}"
205
+ )
206
+ timed_text = tokens_to_timestamped_text(
207
+ utterance_tokens,
208
+ tokenizer,
209
+ mimi.frame_rate,
210
+ end_of_padding_id=0,
211
+ padding_token_id=padding_token_id,
212
+ offset_seconds=int(n_prefix_chunks / mimi.frame_rate) + audio_delay_seconds,
213
+ )
214
+
215
+ decoded = " ".join([str(t) for t in timed_text])
216
+ print(decoded)
217
+
218
+
219
+ if __name__ == "__main__":
220
+ parser = argparse.ArgumentParser(description="Example streaming STT w/ timestamps.")
221
+ parser.add_argument("in_file", help="The file to transcribe.")
222
+
223
+ parser.add_argument(
224
+ "--hf-repo", type=str, help="HF repo to load the STT model from. "
225
+ )
226
+ parser.add_argument("--tokenizer", type=str, help="Path to a local tokenizer file.")
227
+ parser.add_argument(
228
+ "--moshi-weight", type=str, help="Path to a local checkpoint file."
229
+ )
230
+ parser.add_argument(
231
+ "--mimi-weight", type=str, help="Path to a local checkpoint file for Mimi."
232
+ )
233
+ parser.add_argument(
234
+ "--config-path", type=str, help="Path to a local config file.", default=None
235
+ )
236
+ parser.add_argument(
237
+ "--vad", action="store_true", help="Enable VAD (Voice Activity Detection)."
238
+ )
239
+ parser.add_argument(
240
+ "--device",
241
+ type=str,
242
+ default="cuda",
243
+ help="Device on which to run, defaults to 'cuda'.",
244
+ )
245
+ args = parser.parse_args()
246
+
247
+ main(args)
scripts/stt_from_file_rust_server.py ADDED
@@ -0,0 +1,135 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # /// script
2
+ # requires-python = ">=3.12"
3
+ # dependencies = [
4
+ # "msgpack",
5
+ # "numpy",
6
+ # "sphn",
7
+ # "websockets",
8
+ # ]
9
+ # ///
10
+ import argparse
11
+ import asyncio
12
+ import time
13
+
14
+ import msgpack
15
+ import numpy as np
16
+ import sphn
17
+ import websockets
18
+
19
+ SAMPLE_RATE = 24000
20
+ FRAME_SIZE = 1920 # Send data in chunks
21
+
22
+
23
+ def load_and_process_audio(file_path):
24
+ """Load an MP3 file, resample to 24kHz, convert to mono, and extract PCM float32 data."""
25
+ pcm_data, _ = sphn.read(file_path, sample_rate=SAMPLE_RATE)
26
+ return pcm_data[0]
27
+
28
+
29
+ async def receive_messages(websocket):
30
+ transcript = []
31
+
32
+ async for message in websocket:
33
+ data = msgpack.unpackb(message, raw=False)
34
+ if data["type"] == "Step":
35
+ # This message contains the signal from the semantic VAD, and tells us how
36
+ # much audio the server has already processed. We don't use either here.
37
+ continue
38
+ if data["type"] == "Word":
39
+ print(data["text"], end=" ", flush=True)
40
+ transcript.append(
41
+ {
42
+ "text": data["text"],
43
+ "timestamp": [data["start_time"], data["start_time"]],
44
+ }
45
+ )
46
+ if data["type"] == "EndWord":
47
+ if len(transcript) > 0:
48
+ transcript[-1]["timestamp"][1] = data["stop_time"]
49
+ if data["type"] == "Marker":
50
+ # Received marker, stopping stream
51
+ break
52
+
53
+ return transcript
54
+
55
+
56
+ async def send_messages(websocket, rtf: float):
57
+ audio_data = load_and_process_audio(args.in_file)
58
+
59
+ async def send_audio(audio: np.ndarray):
60
+ await websocket.send(
61
+ msgpack.packb(
62
+ {"type": "Audio", "pcm": [float(x) for x in audio]},
63
+ use_single_float=True,
64
+ )
65
+ )
66
+
67
+ # Start with a second of silence.
68
+ # This is needed for the 2.6B model for technical reasons.
69
+ await send_audio([0.0] * SAMPLE_RATE)
70
+
71
+ start_time = time.time()
72
+ for i in range(0, len(audio_data), FRAME_SIZE):
73
+ await send_audio(audio_data[i : i + FRAME_SIZE])
74
+
75
+ expected_send_time = start_time + (i + 1) / SAMPLE_RATE / rtf
76
+ current_time = time.time()
77
+ if current_time < expected_send_time:
78
+ await asyncio.sleep(expected_send_time - current_time)
79
+ else:
80
+ await asyncio.sleep(0.001)
81
+
82
+ for _ in range(5):
83
+ await send_audio([0.0] * SAMPLE_RATE)
84
+
85
+ # Send a marker to indicate the end of the stream.
86
+ await websocket.send(
87
+ msgpack.packb({"type": "Marker", "id": 0}, use_single_float=True)
88
+ )
89
+
90
+ # We'll get back the marker once the corresponding audio has been transcribed,
91
+ # accounting for the delay of the model. That's why we need to send some silence
92
+ # after the marker, because the model will not return the marker immediately.
93
+ for _ in range(35):
94
+ await send_audio([0.0] * SAMPLE_RATE)
95
+
96
+
97
+ async def stream_audio(url: str, api_key: str, rtf: float):
98
+ """Stream audio data to a WebSocket server."""
99
+ headers = {"kyutai-api-key": api_key}
100
+
101
+ # Instead of using the header, you can authenticate by adding `?auth_id={api_key}` to the URL
102
+ async with websockets.connect(url, additional_headers=headers) as websocket:
103
+ send_task = asyncio.create_task(send_messages(websocket, rtf))
104
+ receive_task = asyncio.create_task(receive_messages(websocket))
105
+ _, transcript = await asyncio.gather(send_task, receive_task)
106
+
107
+ return transcript
108
+
109
+
110
+ if __name__ == "__main__":
111
+ parser = argparse.ArgumentParser()
112
+ parser.add_argument("in_file")
113
+ parser.add_argument(
114
+ "--url",
115
+ help="The url of the server to which to send the audio",
116
+ default="ws://127.0.0.1:8080",
117
+ )
118
+ parser.add_argument("--api-key", default="public_token")
119
+ parser.add_argument(
120
+ "--rtf",
121
+ type=float,
122
+ default=1.01,
123
+ help="The real-time factor of how fast to feed in the audio.",
124
+ )
125
+ args = parser.parse_args()
126
+
127
+ url = f"{args.url}/api/asr-streaming"
128
+ transcript = asyncio.run(stream_audio(url, args.api_key, args.rtf))
129
+
130
+ print()
131
+ print()
132
+ for word in transcript:
133
+ print(
134
+ f"{word['timestamp'][0]:7.2f} -{word['timestamp'][1]:7.2f} {word['text']}"
135
+ )
scripts/stt_from_file_with_prompt_pytorch.py ADDED
@@ -0,0 +1,187 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """An example script that illustrates how one can prompt Kyutai STT models."""
2
+
3
+ import argparse
4
+ import itertools
5
+ import math
6
+ from collections import deque
7
+
8
+ import julius
9
+ import moshi.models
10
+ import sphn
11
+ import torch
12
+ import tqdm
13
+
14
+
15
+ class PromptHook:
16
+ def __init__(self, tokenizer, prefix, padding_tokens=(0, 3)):
17
+ self.tokenizer = tokenizer
18
+ self.prefix_enforce = deque(self.tokenizer.encode(prefix))
19
+ self.padding_tokens = padding_tokens
20
+
21
+ def on_token(self, token):
22
+ if not self.prefix_enforce:
23
+ return
24
+
25
+ token = token.item()
26
+
27
+ if token in self.padding_tokens:
28
+ pass
29
+ elif token == self.prefix_enforce[0]:
30
+ self.prefix_enforce.popleft()
31
+ else:
32
+ assert False
33
+
34
+ def on_logits(self, logits):
35
+ if not self.prefix_enforce:
36
+ return
37
+
38
+ mask = torch.zeros_like(logits, dtype=torch.bool)
39
+ for t in self.padding_tokens:
40
+ mask[..., t] = True
41
+ mask[..., self.prefix_enforce[0]] = True
42
+
43
+ logits[:] = torch.where(mask, logits, float("-inf"))
44
+
45
+
46
+ def main(args):
47
+ info = moshi.models.loaders.CheckpointInfo.from_hf_repo(
48
+ args.hf_repo,
49
+ moshi_weights=args.moshi_weight,
50
+ mimi_weights=args.mimi_weight,
51
+ tokenizer=args.tokenizer,
52
+ config_path=args.config_path,
53
+ )
54
+
55
+ mimi = info.get_mimi(device=args.device)
56
+ tokenizer = info.get_text_tokenizer()
57
+ lm = info.get_moshi(
58
+ device=args.device,
59
+ dtype=torch.bfloat16,
60
+ )
61
+
62
+ if args.prompt_text:
63
+ prompt_hook = PromptHook(tokenizer, args.prompt_text)
64
+ lm_gen = moshi.models.LMGen(
65
+ lm,
66
+ temp=0,
67
+ temp_text=0.0,
68
+ on_text_hook=prompt_hook.on_token,
69
+ on_text_logits_hook=prompt_hook.on_logits,
70
+ )
71
+ else:
72
+ lm_gen = moshi.models.LMGen(lm, temp=0, temp_text=0.0)
73
+
74
+ audio_silence_prefix_seconds = info.stt_config.get(
75
+ "audio_silence_prefix_seconds", 1.0
76
+ )
77
+ audio_delay_seconds = info.stt_config.get("audio_delay_seconds", 5.0)
78
+ padding_token_id = info.raw_config.get("text_padding_token_id", 3)
79
+
80
+ def _load_and_process(path):
81
+ audio, input_sample_rate = sphn.read(path)
82
+ audio = torch.from_numpy(audio).to(args.device).mean(axis=0, keepdim=True)
83
+ audio = julius.resample_frac(audio, input_sample_rate, mimi.sample_rate)
84
+ if audio.shape[-1] % mimi.frame_size != 0:
85
+ to_pad = mimi.frame_size - audio.shape[-1] % mimi.frame_size
86
+ audio = torch.nn.functional.pad(audio, (0, to_pad))
87
+ return audio
88
+
89
+ n_prefix_chunks = math.ceil(audio_silence_prefix_seconds * mimi.frame_rate)
90
+ n_suffix_chunks = math.ceil(audio_delay_seconds * mimi.frame_rate)
91
+ silence_chunk = torch.zeros(
92
+ (1, 1, mimi.frame_size), dtype=torch.float32, device=args.device
93
+ )
94
+
95
+ audio = _load_and_process(args.file)
96
+ if args.prompt_file:
97
+ audio_prompt = _load_and_process(args.prompt_file)
98
+ else:
99
+ audio_prompt = None
100
+
101
+ chain = [itertools.repeat(silence_chunk, n_prefix_chunks)]
102
+
103
+ if audio_prompt is not None:
104
+ chain.append(torch.split(audio_prompt[:, None, :], mimi.frame_size, dim=-1))
105
+ # adding a bit (0.8s) of silence to separate prompt and the actual audio
106
+ chain.append(itertools.repeat(silence_chunk, 10))
107
+
108
+ chain += [
109
+ torch.split(audio[:, None, :], mimi.frame_size, dim=-1),
110
+ itertools.repeat(silence_chunk, n_suffix_chunks),
111
+ ]
112
+
113
+ chunks = itertools.chain(*chain)
114
+
115
+ text_tokens_accum = []
116
+ with mimi.streaming(1), lm_gen.streaming(1):
117
+ for audio_chunk in tqdm.tqdm(chunks):
118
+ audio_tokens = mimi.encode(audio_chunk)
119
+ text_tokens = lm_gen.step(audio_tokens)
120
+ if text_tokens is not None:
121
+ text_tokens_accum.append(text_tokens)
122
+
123
+ utterance_tokens = torch.concat(text_tokens_accum, dim=-1)
124
+ text_tokens = utterance_tokens.cpu().view(-1)
125
+
126
+ # if we have an audio prompt and we don't want to have it in the transcript,
127
+ # we should cut the corresponding number of frames from the output tokens.
128
+ # However, there is also some amount of padding that happens before it
129
+ # due to silence_prefix and audio_delay. Normally it is ignored in detokenization,
130
+ # but now we should account for it to find the position of the prompt transcript.
131
+ if args.cut_prompt_transcript and audio_prompt is not None:
132
+ prompt_frames = audio_prompt.shape[1] // mimi.frame_size
133
+ no_prompt_offset_seconds = audio_delay_seconds + audio_silence_prefix_seconds
134
+ no_prompt_offset = int(no_prompt_offset_seconds * mimi.frame_rate)
135
+ text_tokens = text_tokens[prompt_frames + no_prompt_offset :]
136
+
137
+ text = tokenizer.decode(
138
+ text_tokens[text_tokens > padding_token_id].numpy().tolist()
139
+ )
140
+
141
+ print(text)
142
+
143
+
144
+ if __name__ == "__main__":
145
+ parser = argparse.ArgumentParser(description="Example streaming STT w/ a prompt.")
146
+ parser.add_argument(
147
+ "--file",
148
+ required=True,
149
+ help="File to transcribe.",
150
+ )
151
+ parser.add_argument(
152
+ "--prompt_file",
153
+ required=False,
154
+ help="Audio of the prompt.",
155
+ )
156
+ parser.add_argument(
157
+ "--prompt_text",
158
+ required=False,
159
+ help="Text of the prompt.",
160
+ )
161
+ parser.add_argument(
162
+ "--cut-prompt-transcript",
163
+ action="store_true",
164
+ help="Cut the prompt from the output transcript",
165
+ )
166
+ parser.add_argument(
167
+ "--hf-repo", type=str, help="HF repo to load the STT model from. "
168
+ )
169
+ parser.add_argument("--tokenizer", type=str, help="Path to a local tokenizer file.")
170
+ parser.add_argument(
171
+ "--moshi-weight", type=str, help="Path to a local checkpoint file."
172
+ )
173
+ parser.add_argument(
174
+ "--mimi-weight", type=str, help="Path to a local checkpoint file for Mimi."
175
+ )
176
+ parser.add_argument(
177
+ "--config-path", type=str, help="Path to a local config file.", default=None
178
+ )
179
+ parser.add_argument(
180
+ "--device",
181
+ type=str,
182
+ default="cuda",
183
+ help="Device on which to run, defaults to 'cuda'.",
184
+ )
185
+ args = parser.parse_args()
186
+
187
+ main(args)
scripts/stt_from_mic_mlx.py ADDED
@@ -0,0 +1,116 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # /// script
2
+ # requires-python = ">=3.12"
3
+ # dependencies = [
4
+ # "huggingface_hub",
5
+ # "moshi_mlx==0.2.12",
6
+ # "numpy",
7
+ # "rustymimi",
8
+ # "sentencepiece",
9
+ # "sounddevice",
10
+ # ]
11
+ # ///
12
+
13
+ import argparse
14
+ import json
15
+ import queue
16
+
17
+ import mlx.core as mx
18
+ import mlx.nn as nn
19
+ import rustymimi
20
+ import sentencepiece
21
+ import sounddevice as sd
22
+ from huggingface_hub import hf_hub_download
23
+ from moshi_mlx import models, utils
24
+
25
+ if __name__ == "__main__":
26
+ parser = argparse.ArgumentParser()
27
+ parser.add_argument("--max-steps", default=4096)
28
+ parser.add_argument("--hf-repo")
29
+ parser.add_argument(
30
+ "--vad", action="store_true", help="Enable VAD (Voice Activity Detection)."
31
+ )
32
+ args = parser.parse_args()
33
+
34
+ if args.hf_repo is None:
35
+ if args.vad:
36
+ args.hf_repo = "kyutai/stt-1b-en_fr-candle"
37
+ else:
38
+ args.hf_repo = "kyutai/stt-1b-en_fr-mlx"
39
+ lm_config = hf_hub_download(args.hf_repo, "config.json")
40
+ with open(lm_config, "r") as fobj:
41
+ lm_config = json.load(fobj)
42
+ mimi_weights = hf_hub_download(args.hf_repo, lm_config["mimi_name"])
43
+ moshi_name = lm_config.get("moshi_name", "model.safetensors")
44
+ moshi_weights = hf_hub_download(args.hf_repo, moshi_name)
45
+ tokenizer = hf_hub_download(args.hf_repo, lm_config["tokenizer_name"])
46
+
47
+ lm_config = models.LmConfig.from_config_dict(lm_config)
48
+ model = models.Lm(lm_config)
49
+ model.set_dtype(mx.bfloat16)
50
+ if moshi_weights.endswith(".q4.safetensors"):
51
+ nn.quantize(model, bits=4, group_size=32)
52
+ elif moshi_weights.endswith(".q8.safetensors"):
53
+ nn.quantize(model, bits=8, group_size=64)
54
+
55
+ print(f"loading model weights from {moshi_weights}")
56
+ if args.hf_repo.endswith("-candle"):
57
+ model.load_pytorch_weights(moshi_weights, lm_config, strict=True)
58
+ else:
59
+ model.load_weights(moshi_weights, strict=True)
60
+
61
+ print(f"loading the text tokenizer from {tokenizer}")
62
+ text_tokenizer = sentencepiece.SentencePieceProcessor(tokenizer) # type: ignore
63
+
64
+ print(f"loading the audio tokenizer {mimi_weights}")
65
+ generated_codebooks = lm_config.generated_codebooks
66
+ other_codebooks = lm_config.other_codebooks
67
+ mimi_codebooks = max(generated_codebooks, other_codebooks)
68
+ audio_tokenizer = rustymimi.Tokenizer(mimi_weights, num_codebooks=mimi_codebooks) # type: ignore
69
+ print("warming up the model")
70
+ model.warmup()
71
+ gen = models.LmGen(
72
+ model=model,
73
+ max_steps=args.max_steps,
74
+ text_sampler=utils.Sampler(top_k=25, temp=0),
75
+ audio_sampler=utils.Sampler(top_k=250, temp=0.8),
76
+ check=False,
77
+ )
78
+
79
+ block_queue = queue.Queue()
80
+
81
+ def audio_callback(indata, _frames, _time, _status):
82
+ block_queue.put(indata.copy())
83
+
84
+ print("recording audio from microphone, speak to get your words transcribed")
85
+ last_print_was_vad = False
86
+ with sd.InputStream(
87
+ channels=1,
88
+ dtype="float32",
89
+ samplerate=24000,
90
+ blocksize=1920,
91
+ callback=audio_callback,
92
+ ):
93
+ while True:
94
+ block = block_queue.get()
95
+ block = block[None, :, 0]
96
+ other_audio_tokens = audio_tokenizer.encode_step(block[None, 0:1])
97
+ other_audio_tokens = mx.array(other_audio_tokens).transpose(0, 2, 1)[
98
+ :, :, :other_codebooks
99
+ ]
100
+ if args.vad:
101
+ text_token, vad_heads = gen.step_with_extra_heads(other_audio_tokens[0])
102
+ if vad_heads:
103
+ pr_vad = vad_heads[2][0, 0, 0].item()
104
+ if pr_vad > 0.5 and not last_print_was_vad:
105
+ print(" [end of turn detected]")
106
+ last_print_was_vad = True
107
+ else:
108
+ text_token = gen.step(other_audio_tokens[0])
109
+ text_token = text_token[0].item()
110
+ audio_tokens = gen.last_audio_tokens()
111
+ _text = None
112
+ if text_token not in (0, 3):
113
+ _text = text_tokenizer.id_to_piece(text_token) # type: ignore
114
+ _text = _text.replace("▁", " ")
115
+ print(_text, end="", flush=True)
116
+ last_print_was_vad = False
scripts/stt_from_mic_rust_server.py ADDED
@@ -0,0 +1,135 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # /// script
2
+ # requires-python = ">=3.12"
3
+ # dependencies = [
4
+ # "msgpack",
5
+ # "numpy",
6
+ # "sounddevice",
7
+ # "websockets",
8
+ # ]
9
+ # ///
10
+ import argparse
11
+ import asyncio
12
+ import signal
13
+
14
+ import msgpack
15
+ import numpy as np
16
+ import sounddevice as sd
17
+ import websockets
18
+
19
+ SAMPLE_RATE = 24000
20
+
21
+ # The VAD has several prediction heads, each of which tries to determine whether there
22
+ # has been a pause of a given length. The lengths are 0.5, 1.0, 2.0, and 3.0 seconds.
23
+ # Lower indices predict pauses more aggressively. In Unmute, we use 2.0 seconds = index 2.
24
+ PAUSE_PREDICTION_HEAD_INDEX = 2
25
+
26
+
27
+ async def receive_messages(websocket, show_vad: bool = False):
28
+ """Receive and process messages from the WebSocket server."""
29
+ try:
30
+ speech_started = False
31
+ async for message in websocket:
32
+ data = msgpack.unpackb(message, raw=False)
33
+
34
+ # The Step message only gets sent if the model has semantic VAD available
35
+ if data["type"] == "Step" and show_vad:
36
+ pause_prediction = data["prs"][PAUSE_PREDICTION_HEAD_INDEX]
37
+ if pause_prediction > 0.5 and speech_started:
38
+ print("| ", end="", flush=True)
39
+ speech_started = False
40
+
41
+ elif data["type"] == "Word":
42
+ print(data["text"], end=" ", flush=True)
43
+ speech_started = True
44
+ except websockets.ConnectionClosed:
45
+ print("Connection closed while receiving messages.")
46
+
47
+
48
+ async def send_messages(websocket, audio_queue):
49
+ """Send audio data from microphone to WebSocket server."""
50
+ try:
51
+ # Start by draining the queue to avoid lags
52
+ while not audio_queue.empty():
53
+ await audio_queue.get()
54
+
55
+ print("Starting the transcription")
56
+
57
+ while True:
58
+ audio_data = await audio_queue.get()
59
+ chunk = {"type": "Audio", "pcm": [float(x) for x in audio_data]}
60
+ msg = msgpack.packb(chunk, use_bin_type=True, use_single_float=True)
61
+ await websocket.send(msg)
62
+
63
+ except websockets.ConnectionClosed:
64
+ print("Connection closed while sending messages.")
65
+
66
+
67
+ async def stream_audio(url: str, api_key: str, show_vad: bool):
68
+ """Stream audio data to a WebSocket server."""
69
+ print("Starting microphone recording...")
70
+ print("Press Ctrl+C to stop recording")
71
+ audio_queue = asyncio.Queue()
72
+
73
+ loop = asyncio.get_event_loop()
74
+
75
+ def audio_callback(indata, frames, time, status):
76
+ loop.call_soon_threadsafe(
77
+ audio_queue.put_nowait, indata[:, 0].astype(np.float32).copy()
78
+ )
79
+
80
+ # Start audio stream
81
+ with sd.InputStream(
82
+ samplerate=SAMPLE_RATE,
83
+ channels=1,
84
+ dtype="float32",
85
+ callback=audio_callback,
86
+ blocksize=1920, # 80ms blocks
87
+ ):
88
+ headers = {"kyutai-api-key": api_key}
89
+ # Instead of using the header, you can authenticate by adding `?auth_id={api_key}` to the URL
90
+ async with websockets.connect(url, additional_headers=headers) as websocket:
91
+ send_task = asyncio.create_task(send_messages(websocket, audio_queue))
92
+ receive_task = asyncio.create_task(
93
+ receive_messages(websocket, show_vad=show_vad)
94
+ )
95
+ await asyncio.gather(send_task, receive_task)
96
+
97
+
98
+ if __name__ == "__main__":
99
+ parser = argparse.ArgumentParser(description="Real-time microphone transcription")
100
+ parser.add_argument(
101
+ "--url",
102
+ help="The URL of the server to which to send the audio",
103
+ default="ws://127.0.0.1:8080",
104
+ )
105
+ parser.add_argument("--api-key", default="public_token")
106
+ parser.add_argument(
107
+ "--list-devices", action="store_true", help="List available audio devices"
108
+ )
109
+ parser.add_argument(
110
+ "--device", type=int, help="Input device ID (use --list-devices to see options)"
111
+ )
112
+ parser.add_argument(
113
+ "--show-vad",
114
+ action="store_true",
115
+ help="Visualize the predictions of the semantic voice activity detector with a '|' symbol",
116
+ )
117
+
118
+ args = parser.parse_args()
119
+
120
+ def handle_sigint(signum, frame):
121
+ print("Interrupted by user") # Don't complain about KeyboardInterrupt
122
+ exit(0)
123
+
124
+ signal.signal(signal.SIGINT, handle_sigint)
125
+
126
+ if args.list_devices:
127
+ print("Available audio devices:")
128
+ print(sd.query_devices())
129
+ exit(0)
130
+
131
+ if args.device is not None:
132
+ sd.default.device[0] = args.device # Set input device
133
+
134
+ url = f"{args.url}/api/asr-streaming"
135
+ asyncio.run(stream_audio(url, args.api_key, args.show_vad))
scripts/tts_mlx.py ADDED
@@ -0,0 +1,210 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # /// script
2
+ # requires-python = ">=3.12"
3
+ # dependencies = [
4
+ # "huggingface_hub",
5
+ # "moshi_mlx==0.2.12",
6
+ # "numpy",
7
+ # "sounddevice",
8
+ # ]
9
+ # ///
10
+
11
+ import argparse
12
+ import json
13
+ import queue
14
+ import sys
15
+ import time
16
+
17
+ import mlx.core as mx
18
+ import mlx.nn as nn
19
+ import numpy as np
20
+ import sentencepiece
21
+ import sounddevice as sd
22
+ import sphn
23
+ from moshi_mlx import models
24
+ from moshi_mlx.client_utils import make_log
25
+ from moshi_mlx.models.tts import (
26
+ DEFAULT_DSM_TTS_REPO,
27
+ DEFAULT_DSM_TTS_VOICE_REPO,
28
+ TTSModel,
29
+ )
30
+ from moshi_mlx.utils.loaders import hf_get
31
+
32
+
33
+ def log(level: str, msg: str):
34
+ print(make_log(level, msg))
35
+
36
+
37
+ def main():
38
+ parser = argparse.ArgumentParser(
39
+ description="Run Kyutai TTS using the MLX implementation"
40
+ )
41
+ parser.add_argument("inp", type=str, help="Input file, use - for stdin")
42
+ parser.add_argument(
43
+ "out", type=str, help="Output file to generate, use - for playing the audio"
44
+ )
45
+ parser.add_argument(
46
+ "--hf-repo",
47
+ type=str,
48
+ default=DEFAULT_DSM_TTS_REPO,
49
+ help="HF repo in which to look for the pretrained models.",
50
+ )
51
+ parser.add_argument(
52
+ "--voice-repo",
53
+ default=DEFAULT_DSM_TTS_VOICE_REPO,
54
+ help="HF repo in which to look for pre-computed voice embeddings.",
55
+ )
56
+ parser.add_argument(
57
+ "--voice", default="expresso/ex03-ex01_happy_001_channel1_334s.wav"
58
+ )
59
+ parser.add_argument(
60
+ "--quantize",
61
+ type=int,
62
+ help="The quantization to be applied, e.g. 8 for 8 bits.",
63
+ )
64
+ args = parser.parse_args()
65
+
66
+ mx.random.seed(299792458)
67
+
68
+ log("info", "retrieving checkpoints")
69
+
70
+ raw_config = hf_get("config.json", args.hf_repo)
71
+ with open(hf_get(raw_config), "r") as fobj:
72
+ raw_config = json.load(fobj)
73
+
74
+ mimi_weights = hf_get(raw_config["mimi_name"], args.hf_repo)
75
+ moshi_name = raw_config.get("moshi_name", "model.safetensors")
76
+ moshi_weights = hf_get(moshi_name, args.hf_repo)
77
+ tokenizer = hf_get(raw_config["tokenizer_name"], args.hf_repo)
78
+ lm_config = models.LmConfig.from_config_dict(raw_config)
79
+ # There is a bug in moshi_mlx <= 0.3.0 handling of the ring kv cache.
80
+ # The following line gets around it for now.
81
+ lm_config.transformer.max_seq_len = lm_config.transformer.context
82
+ model = models.Lm(lm_config)
83
+ model.set_dtype(mx.bfloat16)
84
+
85
+ log("info", f"loading model weights from {moshi_weights}")
86
+ model.load_pytorch_weights(str(moshi_weights), lm_config, strict=True)
87
+
88
+ if args.quantize is not None:
89
+ log("info", f"quantizing model to {args.quantize} bits")
90
+ nn.quantize(model.depformer, bits=args.quantize)
91
+ for layer in model.transformer.layers:
92
+ nn.quantize(layer.self_attn, bits=args.quantize)
93
+ nn.quantize(layer.gating, bits=args.quantize)
94
+
95
+ log("info", f"loading the text tokenizer from {tokenizer}")
96
+ text_tokenizer = sentencepiece.SentencePieceProcessor(str(tokenizer)) # type: ignore
97
+
98
+ log("info", f"loading the audio tokenizer {mimi_weights}")
99
+ generated_codebooks = lm_config.generated_codebooks
100
+ audio_tokenizer = models.mimi.Mimi(models.mimi_202407(generated_codebooks))
101
+ audio_tokenizer.load_pytorch_weights(str(mimi_weights), strict=True)
102
+
103
+ cfg_coef_conditioning = None
104
+ tts_model = TTSModel(
105
+ model,
106
+ audio_tokenizer,
107
+ text_tokenizer,
108
+ voice_repo=args.voice_repo,
109
+ temp=0.6,
110
+ cfg_coef=1,
111
+ max_padding=8,
112
+ initial_padding=2,
113
+ final_padding=2,
114
+ padding_bonus=0,
115
+ raw_config=raw_config,
116
+ )
117
+ if tts_model.valid_cfg_conditionings:
118
+ # Model was trained with CFG distillation.
119
+ cfg_coef_conditioning = tts_model.cfg_coef
120
+ tts_model.cfg_coef = 1.0
121
+ cfg_is_no_text = False
122
+ cfg_is_no_prefix = False
123
+ else:
124
+ cfg_is_no_text = True
125
+ cfg_is_no_prefix = True
126
+ mimi = tts_model.mimi
127
+
128
+ log("info", f"reading input from {args.inp}")
129
+ if args.inp == "-":
130
+ if sys.stdin.isatty(): # Interactive
131
+ print("Enter text to synthesize (Ctrl+D to end input):")
132
+ text_to_tts = sys.stdin.read().strip()
133
+ else:
134
+ with open(args.inp, "r", encoding="utf-8") as fobj:
135
+ text_to_tts = fobj.read().strip()
136
+
137
+ all_entries = [tts_model.prepare_script([text_to_tts])]
138
+ if tts_model.multi_speaker:
139
+ voices = [tts_model.get_voice_path(args.voice)]
140
+ else:
141
+ voices = []
142
+ all_attributes = [
143
+ tts_model.make_condition_attributes(voices, cfg_coef_conditioning)
144
+ ]
145
+
146
+ wav_frames = queue.Queue()
147
+ _frames_cnt = 0
148
+
149
+ def _on_frame(frame):
150
+ nonlocal _frames_cnt
151
+ if (frame == -1).any():
152
+ return
153
+ _pcm = tts_model.mimi.decode_step(frame[:, :, None])
154
+ _pcm = np.array(mx.clip(_pcm[0, 0], -1, 1))
155
+ wav_frames.put_nowait(_pcm)
156
+ _frames_cnt += 1
157
+ print(f"generated {_frames_cnt / 12.5:.2f}s", end="\r", flush=True)
158
+
159
+ def run():
160
+ log("info", "starting the inference loop")
161
+ begin = time.time()
162
+ result = tts_model.generate(
163
+ all_entries,
164
+ all_attributes,
165
+ cfg_is_no_prefix=cfg_is_no_prefix,
166
+ cfg_is_no_text=cfg_is_no_text,
167
+ on_frame=_on_frame,
168
+ )
169
+ frames = mx.concat(result.frames, axis=-1)
170
+ total_duration = frames.shape[0] * frames.shape[-1] / mimi.frame_rate
171
+ time_taken = time.time() - begin
172
+ total_speed = total_duration / time_taken
173
+ log("info", f"[LM] took {time_taken:.2f}s, total speed {total_speed:.2f}x")
174
+ return result
175
+
176
+ if args.out == "-":
177
+
178
+ def audio_callback(outdata, _a, _b, _c):
179
+ try:
180
+ pcm_data = wav_frames.get(block=False)
181
+ outdata[:, 0] = pcm_data
182
+ except queue.Empty:
183
+ outdata[:] = 0
184
+
185
+ with sd.OutputStream(
186
+ samplerate=mimi.sample_rate,
187
+ blocksize=1920,
188
+ channels=1,
189
+ callback=audio_callback,
190
+ ):
191
+ run()
192
+ time.sleep(3)
193
+ while True:
194
+ if wav_frames.qsize() == 0:
195
+ break
196
+ time.sleep(1)
197
+ else:
198
+ run()
199
+ frames = []
200
+ while True:
201
+ try:
202
+ frames.append(wav_frames.get_nowait())
203
+ except queue.Empty:
204
+ break
205
+ wav = np.concat(frames, -1)
206
+ sphn.write_wav(args.out, wav, mimi.sample_rate)
207
+
208
+
209
+ if __name__ == "__main__":
210
+ main()
scripts/tts_mlx_streaming.py ADDED
@@ -0,0 +1,317 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # /// script
2
+ # requires-python = ">=3.12"
3
+ # dependencies = [
4
+ # "huggingface_hub",
5
+ # "moshi_mlx==0.2.12",
6
+ # "numpy",
7
+ # "sounddevice",
8
+ # ]
9
+ # ///
10
+
11
+ import argparse
12
+ from dataclasses import dataclass
13
+ import json
14
+ import queue
15
+ import sys
16
+ import time
17
+
18
+ import mlx.core as mx
19
+ import mlx.nn as nn
20
+ import numpy as np
21
+ import sentencepiece
22
+ import sounddevice as sd
23
+ import sphn
24
+ import typing as tp
25
+ from moshi_mlx import models
26
+ from moshi_mlx.models.generate import LmGen
27
+ from moshi_mlx.client_utils import make_log
28
+ from moshi_mlx.modules.conditioner import (
29
+ ConditionAttributes,
30
+ ConditionTensor,
31
+ dropout_all_conditions,
32
+ )
33
+ from moshi_mlx.utils.sampling import Sampler
34
+ from moshi_mlx.models.tts import (
35
+ Entry,
36
+ DEFAULT_DSM_TTS_REPO,
37
+ DEFAULT_DSM_TTS_VOICE_REPO,
38
+ TTSModel,
39
+ script_to_entries,
40
+ )
41
+ from moshi_mlx.utils.loaders import hf_get
42
+
43
+
44
+ def prepare_script(model: TTSModel, script: str, first_turn: bool) -> list[Entry]:
45
+ multi_speaker = first_turn and model.multi_speaker
46
+ return script_to_entries(
47
+ model.tokenizer,
48
+ model.machine.token_ids,
49
+ model.mimi.frame_rate,
50
+ [script],
51
+ multi_speaker=multi_speaker,
52
+ padding_between=1,
53
+ )
54
+
55
+
56
+ def _make_null(
57
+ all_attributes: tp.Sequence[ConditionAttributes],
58
+ ) -> list[ConditionAttributes]:
59
+ # When using CFG, returns the null conditions.
60
+ return dropout_all_conditions(all_attributes)
61
+
62
+
63
+ @dataclass
64
+ class TTSGen:
65
+ tts_model: TTSModel
66
+ attributes: tp.Sequence[ConditionAttributes]
67
+ on_frame: tp.Optional[tp.Callable[[mx.array], None]] = None
68
+
69
+ def __post_init__(self):
70
+ tts_model = self.tts_model
71
+ attributes = self.attributes
72
+ self.offset = 0
73
+ self.state = self.tts_model.machine.new_state([])
74
+
75
+ if tts_model.cfg_coef != 1.0:
76
+ if tts_model.valid_cfg_conditionings:
77
+ raise ValueError(
78
+ "This model does not support direct CFG, but was trained with "
79
+ "CFG distillation. Pass instead `cfg_coef` to `make_condition_attributes`."
80
+ )
81
+ nulled = _make_null(attributes)
82
+ attributes = list(attributes) + nulled
83
+
84
+ assert tts_model.lm.condition_provider is not None
85
+ self.ct = None
86
+ self.cross_attention_src = None
87
+ for _attr in attributes:
88
+ for _key, _value in _attr.text.items():
89
+ _ct = tts_model.lm.condition_provider.condition_tensor(_key, _value)
90
+ if self.ct is None:
91
+ self.ct = _ct
92
+ else:
93
+ self.ct = ConditionTensor(self.ct.tensor + _ct.tensor)
94
+ for _key, _value in _attr.tensor.items():
95
+ _conditioner = tts_model.lm.condition_provider.conditioners[_key]
96
+ _ca_src = _conditioner.condition(_value)
97
+ if self.cross_attention_src is None:
98
+ self.cross_attention_src = _ca_src
99
+ else:
100
+ raise ValueError("multiple cross-attention conditioners")
101
+
102
+ def _on_audio_hook(audio_tokens):
103
+ delays = tts_model.lm.delays
104
+ for q in range(audio_tokens.shape[0]):
105
+ delay = delays[q]
106
+ if self.offset < delay + tts_model.delay_steps:
107
+ audio_tokens[q] = tts_model.machine.token_ids.zero
108
+
109
+ def _on_text_hook(text_tokens):
110
+ tokens = text_tokens.tolist()
111
+ out_tokens = []
112
+ for token in tokens:
113
+ out_token, _ = tts_model.machine.process(self.offset, self.state, token)
114
+ out_tokens.append(out_token)
115
+ text_tokens[:] = mx.array(out_tokens, dtype=mx.int64)
116
+
117
+ self.lm_gen = LmGen(
118
+ tts_model.lm,
119
+ max_steps=tts_model.max_gen_length,
120
+ text_sampler=Sampler(temp=tts_model.temp),
121
+ audio_sampler=Sampler(temp=tts_model.temp),
122
+ cfg_coef=tts_model.cfg_coef,
123
+ on_text_hook=_on_text_hook,
124
+ on_audio_hook=_on_audio_hook,
125
+ # TODO(laurent):
126
+ # cfg_is_masked_until=cfg_is_masked_until,
127
+ # cfg_is_no_text=cfg_is_no_text,
128
+ )
129
+
130
+ def process_last(self):
131
+ while len(self.state.entries) > 0 or self.state.end_step is not None:
132
+ self._step()
133
+ additional_steps = (
134
+ self.tts_model.delay_steps + max(self.tts_model.lm.delays) + 8
135
+ )
136
+ for _ in range(additional_steps):
137
+ self._step()
138
+
139
+ def process(self):
140
+ while len(self.state.entries) > self.tts_model.machine.second_stream_ahead:
141
+ self._step()
142
+
143
+ def _step(self):
144
+ missing = self.tts_model.lm.n_q - self.tts_model.lm.dep_q
145
+ missing = self.tts_model.lm.n_q - self.tts_model.lm.dep_q
146
+ input_tokens = (
147
+ mx.ones((1, missing), dtype=mx.int64)
148
+ * self.tts_model.machine.token_ids.zero
149
+ )
150
+ self.lm_gen.step(
151
+ input_tokens, ct=self.ct, cross_attention_src=self.cross_attention_src
152
+ )
153
+ frame = self.lm_gen.last_audio_tokens()
154
+ self.offset += 1
155
+ if frame is not None:
156
+ if self.on_frame is not None:
157
+ self.on_frame(frame)
158
+
159
+ def append_entry(self, entry):
160
+ self.state.entries.append(entry)
161
+
162
+
163
+ def log(level: str, msg: str):
164
+ print(make_log(level, msg))
165
+
166
+
167
+ def main():
168
+ parser = argparse.ArgumentParser(
169
+ description="Run Kyutai TTS using the MLX implementation"
170
+ )
171
+ parser.add_argument(
172
+ "out", type=str, help="Output file to generate, use - for playing the audio"
173
+ )
174
+ parser.add_argument(
175
+ "--hf-repo",
176
+ type=str,
177
+ default=DEFAULT_DSM_TTS_REPO,
178
+ help="HF repo in which to look for the pretrained models.",
179
+ )
180
+ parser.add_argument(
181
+ "--voice-repo",
182
+ default=DEFAULT_DSM_TTS_VOICE_REPO,
183
+ help="HF repo in which to look for pre-computed voice embeddings.",
184
+ )
185
+ parser.add_argument(
186
+ "--voice", default="expresso/ex03-ex01_happy_001_channel1_334s.wav"
187
+ )
188
+ parser.add_argument(
189
+ "--quantize",
190
+ type=int,
191
+ help="The quantization to be applied, e.g. 8 for 8 bits.",
192
+ )
193
+ args = parser.parse_args()
194
+
195
+ mx.random.seed(299792458)
196
+
197
+ log("info", "retrieving checkpoints")
198
+
199
+ raw_config = hf_get("config.json", args.hf_repo)
200
+ with open(hf_get(raw_config), "r") as fobj:
201
+ raw_config = json.load(fobj)
202
+
203
+ mimi_weights = hf_get(raw_config["mimi_name"], args.hf_repo)
204
+ moshi_name = raw_config.get("moshi_name", "model.safetensors")
205
+ moshi_weights = hf_get(moshi_name, args.hf_repo)
206
+ tokenizer = hf_get(raw_config["tokenizer_name"], args.hf_repo)
207
+ lm_config = models.LmConfig.from_config_dict(raw_config)
208
+ # There is a bug in moshi_mlx <= 0.3.0 handling of the ring kv cache.
209
+ # The following line gets around it for now.
210
+ lm_config.transformer.max_seq_len = lm_config.transformer.context
211
+ model = models.Lm(lm_config)
212
+ model.set_dtype(mx.bfloat16)
213
+
214
+ log("info", f"loading model weights from {moshi_weights}")
215
+ model.load_pytorch_weights(str(moshi_weights), lm_config, strict=True)
216
+
217
+ if args.quantize is not None:
218
+ log("info", f"quantizing model to {args.quantize} bits")
219
+ nn.quantize(model.depformer, bits=args.quantize)
220
+ for layer in model.transformer.layers:
221
+ nn.quantize(layer.self_attn, bits=args.quantize)
222
+ nn.quantize(layer.gating, bits=args.quantize)
223
+
224
+ log("info", f"loading the text tokenizer from {tokenizer}")
225
+ text_tokenizer = sentencepiece.SentencePieceProcessor(str(tokenizer)) # type: ignore
226
+
227
+ log("info", f"loading the audio tokenizer {mimi_weights}")
228
+ generated_codebooks = lm_config.generated_codebooks
229
+ audio_tokenizer = models.mimi.Mimi(models.mimi_202407(generated_codebooks))
230
+ audio_tokenizer.load_pytorch_weights(str(mimi_weights), strict=True)
231
+
232
+ cfg_coef_conditioning = None
233
+ tts_model = TTSModel(
234
+ model,
235
+ audio_tokenizer,
236
+ text_tokenizer,
237
+ voice_repo=args.voice_repo,
238
+ temp=0.6,
239
+ cfg_coef=1,
240
+ max_padding=8,
241
+ initial_padding=2,
242
+ final_padding=2,
243
+ padding_bonus=0,
244
+ raw_config=raw_config,
245
+ )
246
+ if tts_model.valid_cfg_conditionings:
247
+ # Model was trained with CFG distillation.
248
+ cfg_coef_conditioning = tts_model.cfg_coef
249
+ tts_model.cfg_coef = 1.0
250
+ mimi = tts_model.mimi
251
+
252
+ log("info", "reading input from stdin")
253
+
254
+ if tts_model.multi_speaker:
255
+ voices = [tts_model.get_voice_path(args.voice)]
256
+ else:
257
+ voices = []
258
+ all_attributes = [
259
+ tts_model.make_condition_attributes(voices, cfg_coef_conditioning)
260
+ ]
261
+
262
+ wav_frames = queue.Queue()
263
+
264
+ def _on_frame(frame):
265
+ if (frame == -1).any():
266
+ return
267
+ _pcm = tts_model.mimi.decode_step(frame[:, :, None])
268
+ _pcm = np.array(mx.clip(_pcm[0, 0], -1, 1))
269
+ wav_frames.put_nowait(_pcm)
270
+
271
+ gen = TTSGen(tts_model, all_attributes, on_frame=_on_frame)
272
+
273
+ def run():
274
+ log("info", "starting the inference loop")
275
+ first_turn = True
276
+ for line in sys.stdin:
277
+ entries = prepare_script(tts_model, line.strip(), first_turn=first_turn)
278
+ first_turn = False
279
+ for entry in entries:
280
+ gen.append_entry(entry)
281
+ gen.process()
282
+ gen.process_last()
283
+
284
+ if args.out == "-":
285
+
286
+ def audio_callback(outdata, _a, _b, _c):
287
+ try:
288
+ pcm_data = wav_frames.get(block=False)
289
+ outdata[:, 0] = pcm_data
290
+ except queue.Empty:
291
+ outdata[:] = 0
292
+
293
+ with sd.OutputStream(
294
+ samplerate=mimi.sample_rate,
295
+ blocksize=1920,
296
+ channels=1,
297
+ callback=audio_callback,
298
+ ):
299
+ run()
300
+ while True:
301
+ if wav_frames.qsize() == 0:
302
+ break
303
+ time.sleep(1)
304
+ else:
305
+ run()
306
+ frames = []
307
+ while True:
308
+ try:
309
+ frames.append(wav_frames.get_nowait())
310
+ except queue.Empty:
311
+ break
312
+ wav = np.concat(frames, -1)
313
+ sphn.write_wav(args.out, wav, mimi.sample_rate)
314
+
315
+
316
+ if __name__ == "__main__":
317
+ main()
scripts/tts_pytorch.py ADDED
@@ -0,0 +1,142 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # /// script
2
+ # requires-python = ">=3.12"
3
+ # dependencies = [
4
+ # "moshi==0.2.11",
5
+ # "torch",
6
+ # "sphn",
7
+ # "sounddevice",
8
+ # ]
9
+ # ///
10
+ import argparse
11
+ import sys
12
+
13
+ import numpy as np
14
+ import queue
15
+ import sphn
16
+ import time
17
+ import torch
18
+ from moshi.models.loaders import CheckpointInfo
19
+ from moshi.models.tts import DEFAULT_DSM_TTS_REPO, DEFAULT_DSM_TTS_VOICE_REPO, TTSModel
20
+
21
+
22
+ def main():
23
+ parser = argparse.ArgumentParser(
24
+ description="Run Kyutai TTS using the PyTorch implementation"
25
+ )
26
+ parser.add_argument("inp", type=str, help="Input file, use - for stdin.")
27
+ parser.add_argument(
28
+ "out", type=str, help="Output file to generate, use - for playing the audio"
29
+ )
30
+ parser.add_argument(
31
+ "--hf-repo",
32
+ type=str,
33
+ default=DEFAULT_DSM_TTS_REPO,
34
+ help="HF repo in which to look for the pretrained models.",
35
+ )
36
+ parser.add_argument(
37
+ "--voice-repo",
38
+ default=DEFAULT_DSM_TTS_VOICE_REPO,
39
+ help="HF repo in which to look for pre-computed voice embeddings.",
40
+ )
41
+ parser.add_argument(
42
+ "--voice",
43
+ default="expresso/ex03-ex01_happy_001_channel1_334s.wav",
44
+ help="The voice to use, relative to the voice repo root. "
45
+ f"See {DEFAULT_DSM_TTS_VOICE_REPO}",
46
+ )
47
+ parser.add_argument(
48
+ "--device",
49
+ type=str,
50
+ default="cuda",
51
+ help="Device on which to run, defaults to 'cuda'.",
52
+ )
53
+ args = parser.parse_args()
54
+
55
+ print("Loading model...")
56
+ checkpoint_info = CheckpointInfo.from_hf_repo(args.hf_repo)
57
+ tts_model = TTSModel.from_checkpoint_info(
58
+ checkpoint_info, n_q=32, temp=0.6, device=args.device
59
+ )
60
+
61
+ if args.inp == "-":
62
+ if sys.stdin.isatty(): # Interactive
63
+ print("Enter text to synthesize (Ctrl+D to end input):")
64
+ text = sys.stdin.read().strip()
65
+ else:
66
+ with open(args.inp, "r", encoding="utf-8") as fobj:
67
+ text = fobj.read().strip()
68
+
69
+ # If you want to make a dialog, you can pass more than one turn [text_speaker_1, text_speaker_2, text_2_speaker_1, ...]
70
+ entries = tts_model.prepare_script([text], padding_between=1)
71
+ if args.voice.endswith(".safetensors"):
72
+ voice_path = args.voice
73
+ else:
74
+ voice_path = tts_model.get_voice_path(args.voice)
75
+ # CFG coef goes here because the model was trained with CFG distillation,
76
+ # so it's not _actually_ doing CFG at inference time.
77
+ # Also, if you are generating a dialog, you should have two voices in the list.
78
+ condition_attributes = tts_model.make_condition_attributes(
79
+ [voice_path], cfg_coef=2.0
80
+ )
81
+ _frames_cnt = 0
82
+
83
+ if args.out == "-":
84
+ # Stream the audio to the speakers using sounddevice.
85
+ import sounddevice as sd
86
+
87
+ pcms = queue.Queue()
88
+
89
+ def _on_frame(frame):
90
+ nonlocal _frames_cnt
91
+ if (frame != -1).all():
92
+ pcm = tts_model.mimi.decode(frame[:, 1:, :]).cpu().numpy()
93
+ pcms.put_nowait(np.clip(pcm[0, 0], -1, 1))
94
+ _frames_cnt += 1
95
+ print(f"generated {_frames_cnt / 12.5:.2f}s", end="\r", flush=True)
96
+
97
+ def audio_callback(outdata, _a, _b, _c):
98
+ try:
99
+ pcm_data = pcms.get(block=False)
100
+ outdata[:, 0] = pcm_data
101
+ except queue.Empty:
102
+ outdata[:] = 0
103
+
104
+ with sd.OutputStream(
105
+ samplerate=tts_model.mimi.sample_rate,
106
+ blocksize=1920,
107
+ channels=1,
108
+ callback=audio_callback,
109
+ ):
110
+ with tts_model.mimi.streaming(1):
111
+ tts_model.generate(
112
+ [entries], [condition_attributes], on_frame=_on_frame
113
+ )
114
+ time.sleep(3)
115
+ while True:
116
+ if pcms.qsize() == 0:
117
+ break
118
+ time.sleep(1)
119
+ else:
120
+
121
+ def _on_frame(frame):
122
+ nonlocal _frames_cnt
123
+ if (frame != -1).all():
124
+ _frames_cnt += 1
125
+ print(f"generated {_frames_cnt / 12.5:.2f}s", end="\r", flush=True)
126
+
127
+ start_time = time.time()
128
+ result = tts_model.generate(
129
+ [entries], [condition_attributes], on_frame=_on_frame
130
+ )
131
+ print(f"\nTotal time: {time.time() - start_time:.2f}s")
132
+ with tts_model.mimi.streaming(1), torch.no_grad():
133
+ pcms = []
134
+ for frame in result.frames[tts_model.delay_steps :]:
135
+ pcm = tts_model.mimi.decode(frame[:, 1:, :]).cpu().numpy()
136
+ pcms.append(np.clip(pcm[0, 0], -1, 1))
137
+ pcm = np.concatenate(pcms, axis=-1)
138
+ sphn.write_wav(args.out, pcm, tts_model.mimi.sample_rate)
139
+
140
+
141
+ if __name__ == "__main__":
142
+ main()
scripts/tts_pytorch_streaming.py ADDED
@@ -0,0 +1,261 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # /// script
2
+ # requires-python = ">=3.12"
3
+ # dependencies = [
4
+ # "moshi==0.2.11",
5
+ # "torch",
6
+ # "sphn",
7
+ # "sounddevice",
8
+ # ]
9
+ # ///
10
+ import argparse
11
+ from dataclasses import dataclass
12
+ import sys
13
+
14
+ import numpy as np
15
+ import queue
16
+ import sphn
17
+ import time
18
+ import torch
19
+ import typing as tp
20
+ from moshi.models.loaders import CheckpointInfo
21
+ from moshi.conditioners import dropout_all_conditions
22
+ from moshi.models.lm import LMGen
23
+ from moshi.models.tts import (
24
+ Entry,
25
+ DEFAULT_DSM_TTS_REPO,
26
+ DEFAULT_DSM_TTS_VOICE_REPO,
27
+ TTSModel,
28
+ ConditionAttributes,
29
+ script_to_entries,
30
+ )
31
+
32
+
33
+ def prepare_script(model: TTSModel, script: str, first_turn: bool) -> list[Entry]:
34
+ multi_speaker = first_turn and model.multi_speaker
35
+ return script_to_entries(
36
+ model.tokenizer,
37
+ model.machine.token_ids,
38
+ model.mimi.frame_rate,
39
+ [script],
40
+ multi_speaker=multi_speaker,
41
+ padding_between=1,
42
+ )
43
+
44
+
45
+ def _make_null(
46
+ all_attributes: tp.Sequence[ConditionAttributes],
47
+ ) -> list[ConditionAttributes]:
48
+ # When using CFG, returns the null conditions.
49
+ return dropout_all_conditions(all_attributes)
50
+
51
+
52
+ @dataclass
53
+ class TTSGen:
54
+ tts_model: TTSModel
55
+ attributes: tp.Sequence[ConditionAttributes]
56
+ on_frame: tp.Optional[tp.Callable[[torch.Tensor], None]] = None
57
+
58
+ def __post_init__(self):
59
+ tts_model = self.tts_model
60
+ attributes = self.attributes
61
+ self.offset = 0
62
+ self.state = self.tts_model.machine.new_state([])
63
+ if tts_model.cfg_coef != 1.0:
64
+ if tts_model.valid_cfg_conditionings:
65
+ raise ValueError(
66
+ "This model does not support direct CFG, but was trained with "
67
+ "CFG distillation. Pass instead `cfg_coef` to `make_condition_attributes`."
68
+ )
69
+ nulled = _make_null(attributes)
70
+ attributes = list(attributes) + nulled
71
+
72
+ assert tts_model.lm.condition_provider is not None
73
+ prepared = tts_model.lm.condition_provider.prepare(attributes)
74
+ condition_tensors = tts_model.lm.condition_provider(prepared)
75
+
76
+ def _on_text_logits_hook(text_logits):
77
+ if tts_model.padding_bonus:
78
+ text_logits[..., tts_model.machine.token_ids.pad] += (
79
+ tts_model.padding_bonus
80
+ )
81
+ return text_logits
82
+
83
+ def _on_audio_hook(audio_tokens):
84
+ audio_offset = tts_model.lm.audio_offset
85
+ delays = tts_model.lm.delays
86
+ for q in range(audio_tokens.shape[1]):
87
+ delay = delays[q + audio_offset]
88
+ if self.offset < delay + tts_model.delay_steps:
89
+ audio_tokens[:, q] = tts_model.machine.token_ids.zero
90
+
91
+ def _on_text_hook(text_tokens):
92
+ tokens = text_tokens.tolist()
93
+ out_tokens = []
94
+ for token in tokens:
95
+ out_token, _ = tts_model.machine.process(self.offset, self.state, token)
96
+ out_tokens.append(out_token)
97
+ text_tokens[:] = torch.tensor(
98
+ out_tokens, dtype=torch.long, device=text_tokens.device
99
+ )
100
+
101
+ tts_model.lm.dep_q = tts_model.n_q
102
+ self.lm_gen = LMGen(
103
+ tts_model.lm,
104
+ temp=tts_model.temp,
105
+ temp_text=tts_model.temp,
106
+ cfg_coef=tts_model.cfg_coef,
107
+ condition_tensors=condition_tensors,
108
+ on_text_logits_hook=_on_text_logits_hook,
109
+ on_text_hook=_on_text_hook,
110
+ on_audio_hook=_on_audio_hook,
111
+ cfg_is_masked_until=None,
112
+ cfg_is_no_text=True,
113
+ )
114
+ self.lm_gen.streaming_forever(1)
115
+
116
+ def process_last(self):
117
+ while len(self.state.entries) > 0 or self.state.end_step is not None:
118
+ self._step()
119
+ additional_steps = (
120
+ self.tts_model.delay_steps + max(self.tts_model.lm.delays) + 8
121
+ )
122
+ for _ in range(additional_steps):
123
+ self._step()
124
+
125
+ def process(self):
126
+ while len(self.state.entries) > self.tts_model.machine.second_stream_ahead:
127
+ self._step()
128
+
129
+ def _step(self):
130
+ missing = self.tts_model.lm.n_q - self.tts_model.lm.dep_q
131
+ input_tokens = torch.full(
132
+ (1, missing, 1),
133
+ self.tts_model.machine.token_ids.zero,
134
+ dtype=torch.long,
135
+ device=self.tts_model.lm.device,
136
+ )
137
+ frame = self.lm_gen.step(input_tokens)
138
+ self.offset += 1
139
+ if frame is not None:
140
+ if self.on_frame is not None:
141
+ self.on_frame(frame)
142
+
143
    def append_entry(self, entry):
        # Queue a script entry; it is consumed later by process()/_step().
        self.state.entries.append(entry)
145
+
146
+
147
@torch.no_grad()
def main():
    """CLI entry point for the PyTorch Kyutai TTS implementation.

    Reads text lines from stdin and either streams the synthesized audio to
    the speakers (``out == "-"``) or writes the concatenated PCM to a wav
    file.
    """
    parser = argparse.ArgumentParser(
        description="Run Kyutai TTS using the PyTorch implementation"
    )
    parser.add_argument(
        "out", type=str, help="Output file to generate, use - for playing the audio"
    )
    parser.add_argument(
        "--hf-repo",
        type=str,
        default=DEFAULT_DSM_TTS_REPO,
        help="HF repo in which to look for the pretrained models.",
    )
    parser.add_argument(
        "--voice-repo",
        default=DEFAULT_DSM_TTS_VOICE_REPO,
        help="HF repo in which to look for pre-computed voice embeddings.",
    )
    parser.add_argument(
        "--voice",
        default="expresso/ex03-ex01_happy_001_channel1_334s.wav",
        help="The voice to use, relative to the voice repo root. "
        f"See {DEFAULT_DSM_TTS_VOICE_REPO}",
    )
    parser.add_argument(
        "--device",
        type=str,
        default="cuda",
        help="Device on which to run, defaults to 'cuda'.",
    )
    args = parser.parse_args()

    print("Loading model...")
    checkpoint_info = CheckpointInfo.from_hf_repo(args.hf_repo)
    tts_model = TTSModel.from_checkpoint_info(
        checkpoint_info, n_q=32, temp=0.6, device=args.device
    )

    if args.voice.endswith(".safetensors"):
        voice_path = args.voice
    else:
        voice_path = tts_model.get_voice_path(args.voice)
    # CFG coef goes here because the model was trained with CFG distillation,
    # so it's not _actually_ doing CFG at inference time.
    # Also, if you are generating a dialog, you should have two voices in the list.
    condition_attributes = tts_model.make_condition_attributes(
        [voice_path], cfg_coef=2.0
    )

    if sys.stdin.isatty():  # Interactive
        print("Enter text to synthesize (Ctrl+D to end input):")

    if args.out == "-":
        # Stream the audio to the speakers using sounddevice.
        import sounddevice as sd

        pcms = queue.Queue()

        def _on_frame(frame):
            # A frame containing -1 carries no valid audio tokens yet; skip it.
            if (frame != -1).all():
                pcm = tts_model.mimi.decode(frame[:, 1:, :]).cpu().numpy()
                pcms.put_nowait(np.clip(pcm[0, 0], -1, 1))

        def audio_callback(outdata, _a, _b, _c):
            # Runs on the audio thread: play the next chunk, or silence when
            # no audio is buffered yet.
            try:
                pcm_data = pcms.get(block=False)
                outdata[:, 0] = pcm_data
            except queue.Empty:
                outdata[:] = 0

        gen = TTSGen(tts_model, [condition_attributes], on_frame=_on_frame)

        # Bug fix: the original used `with A and B:`, which evaluates
        # `A and B` to a single object (B) and never enters A's context, so
        # the sounddevice stream was never started. Use the two-manager form.
        with sd.OutputStream(
            samplerate=tts_model.mimi.sample_rate,
            blocksize=1920,
            channels=1,
            callback=audio_callback,
        ), tts_model.mimi.streaming(1):
            first_turn = True
            for line in sys.stdin:
                entries = prepare_script(tts_model, line.strip(), first_turn=first_turn)
                first_turn = False
                for entry in entries:
                    gen.append_entry(entry)
                    gen.process()
            gen.process_last()
            # Wait until everything queued has been played back.
            while True:
                if pcms.qsize() == 0:
                    break
                time.sleep(1)
    else:
        pcms = []

        def _on_frame(frame: torch.Tensor):
            if (frame != -1).all():
                pcm = tts_model.mimi.decode(frame[:, 1:, :]).cpu().numpy()
                # Bug fix: np.clip requires explicit bounds (a_min, a_max);
                # clamp to the valid [-1, 1] PCM range like the streaming
                # branch does. The original call raised a TypeError.
                pcms.append(np.clip(pcm[0, 0], -1, 1))

        gen = TTSGen(tts_model, [condition_attributes], on_frame=_on_frame)
        with tts_model.mimi.streaming(1):
            first_turn = True
            for line in sys.stdin:
                entries = prepare_script(tts_model, line.strip(), first_turn=first_turn)
                first_turn = False
                for entry in entries:
                    gen.append_entry(entry)
                    gen.process()
            gen.process_last()
        pcm = np.concatenate(pcms, axis=-1)
        sphn.write_wav(args.out, pcm, tts_model.mimi.sample_rate)


if __name__ == "__main__":
    main()
scripts/tts_rust_server.py ADDED
@@ -0,0 +1,185 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # /// script
2
+ # requires-python = ">=3.12"
3
+ # dependencies = [
4
+ # "msgpack",
5
+ # "numpy",
6
+ # "sphn",
7
+ # "websockets",
8
+ # "sounddevice",
9
+ # "tqdm",
10
+ # ]
11
+ # ///
12
+ import argparse
13
+ import asyncio
14
+ import sys
15
+ from urllib.parse import urlencode
16
+
17
+ import msgpack
18
+ import numpy as np
19
+ import sphn
20
+ import tqdm
21
+ import websockets
22
+
23
+ SAMPLE_RATE = 24000
24
+
25
+ TTS_TEXT = "Hello, this is a test of the moshi text to speech system, this should result in some nicely sounding generated voice."
26
+ DEFAULT_DSM_TTS_VOICE_REPO = "kyutai/tts-voices"
27
+ AUTH_TOKEN = "public_token"
28
+
29
+
30
async def receive_messages(websocket: websockets.ClientConnection, output_queue):
    """Receive msgpack frames from the TTS server.

    Decoded PCM chunks (float32 numpy arrays) are pushed onto
    ``output_queue``; a ``None`` sentinel is enqueued once the server closes
    the stream. A tqdm bar reports whole seconds of audio received.
    """
    with tqdm.tqdm(desc="Receiving audio", unit=" seconds generated") as pbar:
        total_samples = 0
        reported_seconds = 0

        async for raw in websocket:
            msg = msgpack.unpackb(raw)
            if msg["type"] != "Audio":
                continue

            pcm = np.array(msg["pcm"]).astype(np.float32)
            await output_queue.put(pcm)

            total_samples += len(msg["pcm"])
            seconds = total_samples // SAMPLE_RATE
            if seconds > reported_seconds:
                pbar.update(seconds - reported_seconds)
                reported_seconds = seconds

    print("End of audio.")
    await output_queue.put(None)  # Signal end of audio
50
+
51
+
52
async def output_audio(out: str, output_queue: asyncio.Queue[np.ndarray | None]):
    """Consume PCM chunks from ``output_queue``.

    Plays them live via sounddevice when ``out == "-"``, otherwise collects
    them and writes a single wav file. A ``None`` item on the queue marks the
    end of the stream.
    """
    if out == "-":
        # This will fail with "OSError: PortAudio library not found" on servers with no
        # audio output, so only import if the user requests it.
        import sounddevice as sd

        should_exit = False

        def audio_callback(outdata, _a, _b, _c):
            # Runs on the PortAudio callback thread: feed the next chunk, or
            # silence when nothing is buffered yet.
            # NOTE(review): asyncio.Queue is accessed here from a
            # non-event-loop thread; get_nowait seems to work in practice but
            # is not documented as thread-safe — confirm if glitches appear.
            nonlocal should_exit

            try:
                pcm_data = output_queue.get_nowait()
                if pcm_data is not None:
                    outdata[:, 0] = pcm_data
                else:
                    # End-of-stream sentinel: tell the polling loop to stop.
                    should_exit = True
                    outdata[:] = 0
            except asyncio.QueueEmpty:
                outdata[:] = 0

        with sd.OutputStream(
            samplerate=SAMPLE_RATE,
            blocksize=1920,
            channels=1,
            callback=audio_callback,
        ):
            # Poll until the callback has consumed the None sentinel.
            while True:
                if should_exit:
                    break
                await asyncio.sleep(1)
    else:
        # File output: gather every chunk, then concatenate and write once.
        frames = []
        while True:
            item = await output_queue.get()
            if item is None:
                break
            frames.append(item)

        sphn.write_wav(out, np.concat(frames, -1), SAMPLE_RATE)
        print(f"Saved audio to {out}")
93
+
94
+
95
async def read_lines_from_stdin():
    """Asynchronously yield stdin lines (trailing whitespace stripped)
    until EOF."""
    stream_reader = asyncio.StreamReader()
    await asyncio.get_running_loop().connect_read_pipe(
        lambda: asyncio.StreamReaderProtocol(stream_reader), sys.stdin
    )
    # readline() returns b"" at EOF, which ends the loop.
    while raw := await stream_reader.readline():
        yield raw.decode().rstrip()
105
+
106
+
107
async def read_lines_from_file(path: str):
    """Asynchronously yield the lines of ``path`` (newlines preserved).

    The file is read on a worker thread; lines are handed to the event loop
    through an asyncio.Queue, with ``None`` as the end-of-file sentinel.
    """
    line_queue = asyncio.Queue()
    loop = asyncio.get_running_loop()

    def _fill_queue():
        # Runs off-loop: schedule each put() back onto the event loop.
        with open(path, "r", encoding="utf-8") as fh:
            for raw_line in fh:
                asyncio.run_coroutine_threadsafe(line_queue.put(raw_line), loop)
        asyncio.run_coroutine_threadsafe(line_queue.put(None), loop)

    await asyncio.to_thread(_fill_queue)
    while (item := await line_queue.get()) is not None:
        yield item
123
+
124
+
125
async def get_lines(source: str):
    """Yield input lines from ``source``: stdin when "-", otherwise a file."""
    line_iter = (
        read_lines_from_stdin() if source == "-" else read_lines_from_file(source)
    )
    async for line in line_iter:
        yield line
132
+
133
+
134
async def websocket_client():
    """CLI entry point: stream text to the TTS server over a websocket.

    Sends each input word as a msgpack ``Text`` message followed by ``Eos``,
    while concurrently receiving audio and playing it back or saving it.
    """
    parser = argparse.ArgumentParser(description="Use the TTS streaming API")
    parser.add_argument("inp", type=str, help="Input file, use - for stdin.")
    parser.add_argument(
        "out", type=str, help="Output file to generate, use - for playing the audio"
    )
    parser.add_argument(
        "--voice",
        default="expresso/ex03-ex01_happy_001_channel1_334s.wav",
        help="The voice to use, relative to the voice repo root. "
        f"See {DEFAULT_DSM_TTS_VOICE_REPO}",
    )
    parser.add_argument(
        "--url",
        help="The URL of the server to which to send the audio",
        default="ws://127.0.0.1:8080",
    )
    parser.add_argument("--api-key", default="public_token")
    args = parser.parse_args()

    params = {"voice": args.voice, "format": "PcmMessagePack"}
    uri = f"{args.url}/api/tts_streaming?{urlencode(params)}"
    print(uri)

    if args.inp == "-":
        if sys.stdin.isatty():  # Interactive
            print("Enter text to synthesize (Ctrl+D to end input):")
    headers = {"kyutai-api-key": args.api_key}

    # For clients that don't support the `additional_headers` parameter when connecting
    # (notably: JS libraries like react-use-websocket),
    # you can also provide the API key in the query string with the "auth_id" key,
    # i.e. adding "&auth_id=public_token" at the end of `uri`
    async with websockets.connect(uri, additional_headers=headers) as websocket:
        print("connected")

        async def send_loop():
            # Send each word as its own Text message, then an Eos marker.
            print("go send")
            async for line in get_lines(args.inp):
                for word in line.split():
                    await websocket.send(msgpack.packb({"type": "Text", "text": word}))
            await websocket.send(msgpack.packb({"type": "Eos"}))

        # Run sender, receiver and audio output concurrently; receive_messages
        # enqueues a None sentinel so output_audio knows when to stop.
        output_queue = asyncio.Queue()
        receive_task = asyncio.create_task(receive_messages(websocket, output_queue))
        output_audio_task = asyncio.create_task(output_audio(args.out, output_queue))
        send_task = asyncio.create_task(send_loop())
        await asyncio.gather(receive_task, output_audio_task, send_task)


if __name__ == "__main__":
    asyncio.run(websocket_client())
stt-rs/Cargo.lock ADDED
@@ -0,0 +1,3746 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # This file is automatically @generated by Cargo.
2
+ # It is not intended for manual editing.
3
+ version = 4
4
+
5
+ [[package]]
6
+ name = "addr2line"
7
+ version = "0.24.2"
8
+ source = "registry+https://github.com/rust-lang/crates.io-index"
9
+ checksum = "dfbe277e56a376000877090da837660b4427aad530e3028d44e0bffe4f89a1c1"
10
+ dependencies = [
11
+ "gimli",
12
+ ]
13
+
14
+ [[package]]
15
+ name = "adler2"
16
+ version = "2.0.1"
17
+ source = "registry+https://github.com/rust-lang/crates.io-index"
18
+ checksum = "320119579fcad9c21884f5c4861d16174d0e06250625266f50fe6898340abefa"
19
+
20
+ [[package]]
21
+ name = "aho-corasick"
22
+ version = "1.1.3"
23
+ source = "registry+https://github.com/rust-lang/crates.io-index"
24
+ checksum = "8e60d3430d3a69478ad0993f19238d2df97c507009a52b3c10addcd7f6bcb916"
25
+ dependencies = [
26
+ "memchr",
27
+ ]
28
+
29
+ [[package]]
30
+ name = "anstream"
31
+ version = "0.6.19"
32
+ source = "registry+https://github.com/rust-lang/crates.io-index"
33
+ checksum = "301af1932e46185686725e0fad2f8f2aa7da69dd70bf6ecc44d6b703844a3933"
34
+ dependencies = [
35
+ "anstyle",
36
+ "anstyle-parse",
37
+ "anstyle-query",
38
+ "anstyle-wincon",
39
+ "colorchoice",
40
+ "is_terminal_polyfill",
41
+ "utf8parse",
42
+ ]
43
+
44
+ [[package]]
45
+ name = "anstyle"
46
+ version = "1.0.11"
47
+ source = "registry+https://github.com/rust-lang/crates.io-index"
48
+ checksum = "862ed96ca487e809f1c8e5a8447f6ee2cf102f846893800b20cebdf541fc6bbd"
49
+
50
+ [[package]]
51
+ name = "anstyle-parse"
52
+ version = "0.2.7"
53
+ source = "registry+https://github.com/rust-lang/crates.io-index"
54
+ checksum = "4e7644824f0aa2c7b9384579234ef10eb7efb6a0deb83f9630a49594dd9c15c2"
55
+ dependencies = [
56
+ "utf8parse",
57
+ ]
58
+
59
+ [[package]]
60
+ name = "anstyle-query"
61
+ version = "1.1.3"
62
+ source = "registry+https://github.com/rust-lang/crates.io-index"
63
+ checksum = "6c8bdeb6047d8983be085bab0ba1472e6dc604e7041dbf6fcd5e71523014fae9"
64
+ dependencies = [
65
+ "windows-sys 0.59.0",
66
+ ]
67
+
68
+ [[package]]
69
+ name = "anstyle-wincon"
70
+ version = "3.0.9"
71
+ source = "registry+https://github.com/rust-lang/crates.io-index"
72
+ checksum = "403f75924867bb1033c59fbf0797484329750cfbe3c4325cd33127941fabc882"
73
+ dependencies = [
74
+ "anstyle",
75
+ "once_cell_polyfill",
76
+ "windows-sys 0.59.0",
77
+ ]
78
+
79
+ [[package]]
80
+ name = "anyhow"
81
+ version = "1.0.98"
82
+ source = "registry+https://github.com/rust-lang/crates.io-index"
83
+ checksum = "e16d2d3311acee920a9eb8d33b8cbc1787ce4a264e85f964c2404b969bdcd487"
84
+
85
+ [[package]]
86
+ name = "arbitrary"
87
+ version = "1.4.1"
88
+ source = "registry+https://github.com/rust-lang/crates.io-index"
89
+ checksum = "dde20b3d026af13f561bdd0f15edf01fc734f0dafcedbaf42bba506a9517f223"
90
+ dependencies = [
91
+ "derive_arbitrary",
92
+ ]
93
+
94
+ [[package]]
95
+ name = "arrayvec"
96
+ version = "0.7.6"
97
+ source = "registry+https://github.com/rust-lang/crates.io-index"
98
+ checksum = "7c02d123df017efcdfbd739ef81735b36c5ba83ec3c59c80a9d7ecc718f92e50"
99
+
100
+ [[package]]
101
+ name = "atomic-waker"
102
+ version = "1.1.2"
103
+ source = "registry+https://github.com/rust-lang/crates.io-index"
104
+ checksum = "1505bd5d3d116872e7271a6d4e16d81d0c8570876c8de68093a09ac269d8aac0"
105
+
106
+ [[package]]
107
+ name = "audiopus_sys"
108
+ version = "0.2.2"
109
+ source = "registry+https://github.com/rust-lang/crates.io-index"
110
+ checksum = "62314a1546a2064e033665d658e88c620a62904be945f8147e6b16c3db9f8651"
111
+ dependencies = [
112
+ "cmake",
113
+ "log",
114
+ "pkg-config",
115
+ ]
116
+
117
+ [[package]]
118
+ name = "autocfg"
119
+ version = "1.5.0"
120
+ source = "registry+https://github.com/rust-lang/crates.io-index"
121
+ checksum = "c08606f8c3cbf4ce6ec8e28fb0014a2c086708fe954eaa885384a6165172e7e8"
122
+
123
+ [[package]]
124
+ name = "backtrace"
125
+ version = "0.3.75"
126
+ source = "registry+https://github.com/rust-lang/crates.io-index"
127
+ checksum = "6806a6321ec58106fea15becdad98371e28d92ccbc7c8f1b3b6dd724fe8f1002"
128
+ dependencies = [
129
+ "addr2line",
130
+ "cfg-if",
131
+ "libc",
132
+ "miniz_oxide",
133
+ "object",
134
+ "rustc-demangle",
135
+ "windows-targets 0.52.6",
136
+ ]
137
+
138
+ [[package]]
139
+ name = "base64"
140
+ version = "0.22.1"
141
+ source = "registry+https://github.com/rust-lang/crates.io-index"
142
+ checksum = "72b3254f16251a8381aa12e40e3c4d2f0199f8c6508fbecb9d91f575e0fbb8c6"
143
+
144
+ [[package]]
145
+ name = "bindgen_cuda"
146
+ version = "0.1.5"
147
+ source = "registry+https://github.com/rust-lang/crates.io-index"
148
+ checksum = "1f8489af5b7d17a81bffe37e0f4d6e1e4de87c87329d05447f22c35d95a1227d"
149
+ dependencies = [
150
+ "glob",
151
+ "num_cpus",
152
+ "rayon",
153
+ ]
154
+
155
+ [[package]]
156
+ name = "bit-set"
157
+ version = "0.5.3"
158
+ source = "registry+https://github.com/rust-lang/crates.io-index"
159
+ checksum = "0700ddab506f33b20a03b13996eccd309a48e5ff77d0d95926aa0210fb4e95f1"
160
+ dependencies = [
161
+ "bit-vec",
162
+ ]
163
+
164
+ [[package]]
165
+ name = "bit-vec"
166
+ version = "0.6.3"
167
+ source = "registry+https://github.com/rust-lang/crates.io-index"
168
+ checksum = "349f9b6a179ed607305526ca489b34ad0a41aed5f7980fa90eb03160b69598fb"
169
+
170
+ [[package]]
171
+ name = "bitflags"
172
+ version = "1.3.2"
173
+ source = "registry+https://github.com/rust-lang/crates.io-index"
174
+ checksum = "bef38d45163c2f1dde094a7dfd33ccf595c92905c8f8f4fdc18d06fb1037718a"
175
+
176
+ [[package]]
177
+ name = "bitflags"
178
+ version = "2.9.1"
179
+ source = "registry+https://github.com/rust-lang/crates.io-index"
180
+ checksum = "1b8e56985ec62d17e9c1001dc89c88ecd7dc08e47eba5ec7c29c7b5eeecde967"
181
+
182
+ [[package]]
183
+ name = "block"
184
+ version = "0.1.6"
185
+ source = "registry+https://github.com/rust-lang/crates.io-index"
186
+ checksum = "0d8c1fef690941d3e7788d328517591fecc684c084084702d6ff1641e993699a"
187
+
188
+ [[package]]
189
+ name = "bumpalo"
190
+ version = "3.18.1"
191
+ source = "registry+https://github.com/rust-lang/crates.io-index"
192
+ checksum = "793db76d6187cd04dff33004d8e6c9cc4e05cd330500379d2394209271b4aeee"
193
+
194
+ [[package]]
195
+ name = "bytemuck"
196
+ version = "1.23.1"
197
+ source = "registry+https://github.com/rust-lang/crates.io-index"
198
+ checksum = "5c76a5792e44e4abe34d3abf15636779261d45a7450612059293d1d2cfc63422"
199
+ dependencies = [
200
+ "bytemuck_derive",
201
+ ]
202
+
203
+ [[package]]
204
+ name = "bytemuck_derive"
205
+ version = "1.9.3"
206
+ source = "registry+https://github.com/rust-lang/crates.io-index"
207
+ checksum = "7ecc273b49b3205b83d648f0690daa588925572cc5063745bfe547fe7ec8e1a1"
208
+ dependencies = [
209
+ "proc-macro2",
210
+ "quote",
211
+ "syn 2.0.103",
212
+ ]
213
+
214
+ [[package]]
215
+ name = "byteorder"
216
+ version = "1.5.0"
217
+ source = "registry+https://github.com/rust-lang/crates.io-index"
218
+ checksum = "1fd0f2584146f6f2ef48085050886acf353beff7305ebd1ae69500e27c67f64b"
219
+
220
+ [[package]]
221
+ name = "bytes"
222
+ version = "1.10.1"
223
+ source = "registry+https://github.com/rust-lang/crates.io-index"
224
+ checksum = "d71b6127be86fdcfddb610f7182ac57211d4b18a3e9c82eb2d17662f2227ad6a"
225
+
226
+ [[package]]
227
+ name = "candle-core"
228
+ version = "0.9.1"
229
+ source = "registry+https://github.com/rust-lang/crates.io-index"
230
+ checksum = "a9f51e2ecf6efe9737af8f993433c839f956d2b6ed4fd2dd4a7c6d8b0fa667ff"
231
+ dependencies = [
232
+ "byteorder",
233
+ "candle-kernels",
234
+ "candle-metal-kernels",
235
+ "cudarc",
236
+ "gemm 0.17.1",
237
+ "half",
238
+ "memmap2",
239
+ "metal 0.27.0",
240
+ "num-traits",
241
+ "num_cpus",
242
+ "rand",
243
+ "rand_distr",
244
+ "rayon",
245
+ "safetensors",
246
+ "thiserror 1.0.69",
247
+ "ug",
248
+ "ug-cuda",
249
+ "ug-metal",
250
+ "yoke 0.7.5",
251
+ "zip",
252
+ ]
253
+
254
+ [[package]]
255
+ name = "candle-kernels"
256
+ version = "0.9.1"
257
+ source = "registry+https://github.com/rust-lang/crates.io-index"
258
+ checksum = "9fcd989c2143aa754370b5bfee309e35fbd259e83d9ecf7a73d23d8508430775"
259
+ dependencies = [
260
+ "bindgen_cuda",
261
+ ]
262
+
263
+ [[package]]
264
+ name = "candle-metal-kernels"
265
+ version = "0.9.1"
266
+ source = "registry+https://github.com/rust-lang/crates.io-index"
267
+ checksum = "9a323ee9c813707f73b6e59300661b354a70410f31fe4135170c4eda8a061534"
268
+ dependencies = [
269
+ "half",
270
+ "metal 0.27.0",
271
+ "once_cell",
272
+ "thiserror 1.0.69",
273
+ "tracing",
274
+ ]
275
+
276
+ [[package]]
277
+ name = "candle-nn"
278
+ version = "0.9.1"
279
+ source = "registry+https://github.com/rust-lang/crates.io-index"
280
+ checksum = "c1980d53280c8f9e2c6cbe1785855d7ff8010208b46e21252b978badf13ad69d"
281
+ dependencies = [
282
+ "candle-core",
283
+ "candle-metal-kernels",
284
+ "half",
285
+ "metal 0.27.0",
286
+ "num-traits",
287
+ "rayon",
288
+ "safetensors",
289
+ "serde",
290
+ "thiserror 1.0.69",
291
+ ]
292
+
293
+ [[package]]
294
+ name = "candle-transformers"
295
+ version = "0.9.1"
296
+ source = "registry+https://github.com/rust-lang/crates.io-index"
297
+ checksum = "186cb80045dbe47e0b387ea6d3e906f02fb3056297080d9922984c90e90a72b0"
298
+ dependencies = [
299
+ "byteorder",
300
+ "candle-core",
301
+ "candle-nn",
302
+ "fancy-regex",
303
+ "num-traits",
304
+ "rand",
305
+ "rayon",
306
+ "serde",
307
+ "serde_json",
308
+ "serde_plain",
309
+ "tracing",
310
+ ]
311
+
312
+ [[package]]
313
+ name = "cc"
314
+ version = "1.2.27"
315
+ source = "registry+https://github.com/rust-lang/crates.io-index"
316
+ checksum = "d487aa071b5f64da6f19a3e848e3578944b726ee5a4854b82172f02aa876bfdc"
317
+ dependencies = [
318
+ "shlex",
319
+ ]
320
+
321
+ [[package]]
322
+ name = "cfg-if"
323
+ version = "1.0.1"
324
+ source = "registry+https://github.com/rust-lang/crates.io-index"
325
+ checksum = "9555578bc9e57714c812a1f84e4fc5b4d21fcb063490c624de019f7464c91268"
326
+
327
+ [[package]]
328
+ name = "clap"
329
+ version = "4.5.40"
330
+ source = "registry+https://github.com/rust-lang/crates.io-index"
331
+ checksum = "40b6887a1d8685cebccf115538db5c0efe625ccac9696ad45c409d96566e910f"
332
+ dependencies = [
333
+ "clap_builder",
334
+ "clap_derive",
335
+ ]
336
+
337
+ [[package]]
338
+ name = "clap_builder"
339
+ version = "4.5.40"
340
+ source = "registry+https://github.com/rust-lang/crates.io-index"
341
+ checksum = "e0c66c08ce9f0c698cbce5c0279d0bb6ac936d8674174fe48f736533b964f59e"
342
+ dependencies = [
343
+ "anstream",
344
+ "anstyle",
345
+ "clap_lex",
346
+ "strsim",
347
+ ]
348
+
349
+ [[package]]
350
+ name = "clap_derive"
351
+ version = "4.5.40"
352
+ source = "registry+https://github.com/rust-lang/crates.io-index"
353
+ checksum = "d2c7947ae4cc3d851207c1adb5b5e260ff0cca11446b1d6d1423788e442257ce"
354
+ dependencies = [
355
+ "heck",
356
+ "proc-macro2",
357
+ "quote",
358
+ "syn 2.0.103",
359
+ ]
360
+
361
+ [[package]]
362
+ name = "clap_lex"
363
+ version = "0.7.5"
364
+ source = "registry+https://github.com/rust-lang/crates.io-index"
365
+ checksum = "b94f61472cee1439c0b966b47e3aca9ae07e45d070759512cd390ea2bebc6675"
366
+
367
+ [[package]]
368
+ name = "cmake"
369
+ version = "0.1.54"
370
+ source = "registry+https://github.com/rust-lang/crates.io-index"
371
+ checksum = "e7caa3f9de89ddbe2c607f4101924c5abec803763ae9534e4f4d7d8f84aa81f0"
372
+ dependencies = [
373
+ "cc",
374
+ ]
375
+
376
+ [[package]]
377
+ name = "colorchoice"
378
+ version = "1.0.4"
379
+ source = "registry+https://github.com/rust-lang/crates.io-index"
380
+ checksum = "b05b61dc5112cbb17e4b6cd61790d9845d13888356391624cbe7e41efeac1e75"
381
+
382
+ [[package]]
383
+ name = "console"
384
+ version = "0.15.11"
385
+ source = "registry+https://github.com/rust-lang/crates.io-index"
386
+ checksum = "054ccb5b10f9f2cbf51eb355ca1d05c2d279ce1804688d0db74b4733a5aeafd8"
387
+ dependencies = [
388
+ "encode_unicode",
389
+ "libc",
390
+ "once_cell",
391
+ "unicode-width",
392
+ "windows-sys 0.59.0",
393
+ ]
394
+
395
+ [[package]]
396
+ name = "core-foundation"
397
+ version = "0.9.4"
398
+ source = "registry+https://github.com/rust-lang/crates.io-index"
399
+ checksum = "91e195e091a93c46f7102ec7818a2aa394e1e1771c3ab4825963fa03e45afb8f"
400
+ dependencies = [
401
+ "core-foundation-sys",
402
+ "libc",
403
+ ]
404
+
405
+ [[package]]
406
+ name = "core-foundation-sys"
407
+ version = "0.8.7"
408
+ source = "registry+https://github.com/rust-lang/crates.io-index"
409
+ checksum = "773648b94d0e5d620f64f280777445740e61fe701025087ec8b57f45c791888b"
410
+
411
+ [[package]]
412
+ name = "core-graphics-types"
413
+ version = "0.1.3"
414
+ source = "registry+https://github.com/rust-lang/crates.io-index"
415
+ checksum = "45390e6114f68f718cc7a830514a96f903cccd70d02a8f6d9f643ac4ba45afaf"
416
+ dependencies = [
417
+ "bitflags 1.3.2",
418
+ "core-foundation",
419
+ "libc",
420
+ ]
421
+
422
+ [[package]]
423
+ name = "crc32fast"
424
+ version = "1.4.2"
425
+ source = "registry+https://github.com/rust-lang/crates.io-index"
426
+ checksum = "a97769d94ddab943e4510d138150169a2758b5ef3eb191a9ee688de3e23ef7b3"
427
+ dependencies = [
428
+ "cfg-if",
429
+ ]
430
+
431
+ [[package]]
432
+ name = "crossbeam-deque"
433
+ version = "0.8.6"
434
+ source = "registry+https://github.com/rust-lang/crates.io-index"
435
+ checksum = "9dd111b7b7f7d55b72c0a6ae361660ee5853c9af73f70c3c2ef6858b950e2e51"
436
+ dependencies = [
437
+ "crossbeam-epoch",
438
+ "crossbeam-utils",
439
+ ]
440
+
441
+ [[package]]
442
+ name = "crossbeam-epoch"
443
+ version = "0.9.18"
444
+ source = "registry+https://github.com/rust-lang/crates.io-index"
445
+ checksum = "5b82ac4a3c2ca9c3460964f020e1402edd5753411d7737aa39c3714ad1b5420e"
446
+ dependencies = [
447
+ "crossbeam-utils",
448
+ ]
449
+
450
+ [[package]]
451
+ name = "crossbeam-utils"
452
+ version = "0.8.21"
453
+ source = "registry+https://github.com/rust-lang/crates.io-index"
454
+ checksum = "d0a5c400df2834b80a4c3327b3aad3a4c4cd4de0629063962b03235697506a28"
455
+
456
+ [[package]]
457
+ name = "crunchy"
458
+ version = "0.2.3"
459
+ source = "registry+https://github.com/rust-lang/crates.io-index"
460
+ checksum = "43da5946c66ffcc7745f48db692ffbb10a83bfe0afd96235c5c2a4fb23994929"
461
+
462
+ [[package]]
463
+ name = "cudarc"
464
+ version = "0.16.4"
465
+ source = "registry+https://github.com/rust-lang/crates.io-index"
466
+ checksum = "f9574894139a982bf26fbb44473a9d416c015e779c51ef0fbc0789f1a1c17b25"
467
+ dependencies = [
468
+ "half",
469
+ "libloading",
470
+ ]
471
+
472
+ [[package]]
473
+ name = "derive_arbitrary"
474
+ version = "1.4.1"
475
+ source = "registry+https://github.com/rust-lang/crates.io-index"
476
+ checksum = "30542c1ad912e0e3d22a1935c290e12e8a29d704a420177a31faad4a601a0800"
477
+ dependencies = [
478
+ "proc-macro2",
479
+ "quote",
480
+ "syn 2.0.103",
481
+ ]
482
+
483
+ [[package]]
484
+ name = "dirs"
485
+ version = "6.0.0"
486
+ source = "registry+https://github.com/rust-lang/crates.io-index"
487
+ checksum = "c3e8aa94d75141228480295a7d0e7feb620b1a5ad9f12bc40be62411e38cce4e"
488
+ dependencies = [
489
+ "dirs-sys",
490
+ ]
491
+
492
+ [[package]]
493
+ name = "dirs-sys"
494
+ version = "0.5.0"
495
+ source = "registry+https://github.com/rust-lang/crates.io-index"
496
+ checksum = "e01a3366d27ee9890022452ee61b2b63a67e6f13f58900b651ff5665f0bb1fab"
497
+ dependencies = [
498
+ "libc",
499
+ "option-ext",
500
+ "redox_users",
501
+ "windows-sys 0.60.2",
502
+ ]
503
+
504
+ [[package]]
505
+ name = "displaydoc"
506
+ version = "0.2.5"
507
+ source = "registry+https://github.com/rust-lang/crates.io-index"
508
+ checksum = "97369cbbc041bc366949bc74d34658d6cda5621039731c6310521892a3a20ae0"
509
+ dependencies = [
510
+ "proc-macro2",
511
+ "quote",
512
+ "syn 2.0.103",
513
+ ]
514
+
515
+ [[package]]
516
+ name = "dyn-stack"
517
+ version = "0.10.0"
518
+ source = "registry+https://github.com/rust-lang/crates.io-index"
519
+ checksum = "56e53799688f5632f364f8fb387488dd05db9fe45db7011be066fc20e7027f8b"
520
+ dependencies = [
521
+ "bytemuck",
522
+ "reborrow",
523
+ ]
524
+
525
+ [[package]]
526
+ name = "dyn-stack"
527
+ version = "0.13.0"
528
+ source = "registry+https://github.com/rust-lang/crates.io-index"
529
+ checksum = "490bd48eb68fffcfed519b4edbfd82c69cbe741d175b84f0e0cbe8c57cbe0bdd"
530
+ dependencies = [
531
+ "bytemuck",
532
+ ]
533
+
534
+ [[package]]
535
+ name = "either"
536
+ version = "1.15.0"
537
+ source = "registry+https://github.com/rust-lang/crates.io-index"
538
+ checksum = "48c757948c5ede0e46177b7add2e67155f70e33c07fea8284df6576da70b3719"
539
+
540
+ [[package]]
541
+ name = "encode_unicode"
542
+ version = "1.0.0"
543
+ source = "registry+https://github.com/rust-lang/crates.io-index"
544
+ checksum = "34aa73646ffb006b8f5147f3dc182bd4bcb190227ce861fc4a4844bf8e3cb2c0"
545
+
546
+ [[package]]
547
+ name = "encoding_rs"
548
+ version = "0.8.35"
549
+ source = "registry+https://github.com/rust-lang/crates.io-index"
550
+ checksum = "75030f3c4f45dafd7586dd6780965a8c7e8e285a5ecb86713e63a79c5b2766f3"
551
+ dependencies = [
552
+ "cfg-if",
553
+ ]
554
+
555
+ [[package]]
556
+ name = "enum-as-inner"
557
+ version = "0.6.1"
558
+ source = "registry+https://github.com/rust-lang/crates.io-index"
559
+ checksum = "a1e6a265c649f3f5979b601d26f1d05ada116434c87741c9493cb56218f76cbc"
560
+ dependencies = [
561
+ "heck",
562
+ "proc-macro2",
563
+ "quote",
564
+ "syn 2.0.103",
565
+ ]
566
+
567
+ [[package]]
568
+ name = "equivalent"
569
+ version = "1.0.2"
570
+ source = "registry+https://github.com/rust-lang/crates.io-index"
571
+ checksum = "877a4ace8713b0bcf2a4e7eec82529c029f1d0619886d18145fea96c3ffe5c0f"
572
+
573
+ [[package]]
574
+ name = "errno"
575
+ version = "0.3.12"
576
+ source = "registry+https://github.com/rust-lang/crates.io-index"
577
+ checksum = "cea14ef9355e3beab063703aa9dab15afd25f0667c341310c1e5274bb1d0da18"
578
+ dependencies = [
579
+ "libc",
580
+ "windows-sys 0.59.0",
581
+ ]
582
+
583
+ [[package]]
584
+ name = "extended"
585
+ version = "0.1.0"
586
+ source = "registry+https://github.com/rust-lang/crates.io-index"
587
+ checksum = "af9673d8203fcb076b19dfd17e38b3d4ae9f44959416ea532ce72415a6020365"
588
+
589
+ [[package]]
590
+ name = "fancy-regex"
591
+ version = "0.13.0"
592
+ source = "registry+https://github.com/rust-lang/crates.io-index"
593
+ checksum = "531e46835a22af56d1e3b66f04844bed63158bc094a628bec1d321d9b4c44bf2"
594
+ dependencies = [
595
+ "bit-set",
596
+ "regex-automata",
597
+ "regex-syntax",
598
+ ]
599
+
600
+ [[package]]
601
+ name = "fastrand"
602
+ version = "2.3.0"
603
+ source = "registry+https://github.com/rust-lang/crates.io-index"
604
+ checksum = "37909eebbb50d72f9059c3b6d82c0463f2ff062c9e95845c43a6c9c0355411be"
605
+
606
+ [[package]]
607
+ name = "flate2"
608
+ version = "1.1.2"
609
+ source = "registry+https://github.com/rust-lang/crates.io-index"
610
+ checksum = "4a3d7db9596fecd151c5f638c0ee5d5bd487b6e0ea232e5dc96d5250f6f94b1d"
611
+ dependencies = [
612
+ "crc32fast",
613
+ "miniz_oxide",
614
+ ]
615
+
616
+ [[package]]
617
+ name = "fnv"
618
+ version = "1.0.7"
619
+ source = "registry+https://github.com/rust-lang/crates.io-index"
620
+ checksum = "3f9eec918d3f24069decb9af1554cad7c880e2da24a9afd88aca000531ab82c1"
621
+
622
+ [[package]]
623
+ name = "foreign-types"
624
+ version = "0.3.2"
625
+ source = "registry+https://github.com/rust-lang/crates.io-index"
626
+ checksum = "f6f339eb8adc052cd2ca78910fda869aefa38d22d5cb648e6485e4d3fc06f3b1"
627
+ dependencies = [
628
+ "foreign-types-shared 0.1.1",
629
+ ]
630
+
631
+ [[package]]
632
+ name = "foreign-types"
633
+ version = "0.5.0"
634
+ source = "registry+https://github.com/rust-lang/crates.io-index"
635
+ checksum = "d737d9aa519fb7b749cbc3b962edcf310a8dd1f4b67c91c4f83975dbdd17d965"
636
+ dependencies = [
637
+ "foreign-types-macros",
638
+ "foreign-types-shared 0.3.1",
639
+ ]
640
+
641
+ [[package]]
642
+ name = "foreign-types-macros"
643
+ version = "0.2.3"
644
+ source = "registry+https://github.com/rust-lang/crates.io-index"
645
+ checksum = "1a5c6c585bc94aaf2c7b51dd4c2ba22680844aba4c687be581871a6f518c5742"
646
+ dependencies = [
647
+ "proc-macro2",
648
+ "quote",
649
+ "syn 2.0.103",
650
+ ]
651
+
652
+ [[package]]
653
+ name = "foreign-types-shared"
654
+ version = "0.1.1"
655
+ source = "registry+https://github.com/rust-lang/crates.io-index"
656
+ checksum = "00b0228411908ca8685dba7fc2cdd70ec9990a6e753e89b6ac91a84c40fbaf4b"
657
+
658
+ [[package]]
659
+ name = "foreign-types-shared"
660
+ version = "0.3.1"
661
+ source = "registry+https://github.com/rust-lang/crates.io-index"
662
+ checksum = "aa9a19cbb55df58761df49b23516a86d432839add4af60fc256da840f66ed35b"
663
+
664
+ [[package]]
665
+ name = "form_urlencoded"
666
+ version = "1.2.1"
667
+ source = "registry+https://github.com/rust-lang/crates.io-index"
668
+ checksum = "e13624c2627564efccf4934284bdd98cbaa14e79b0b5a141218e507b3a823456"
669
+ dependencies = [
670
+ "percent-encoding",
671
+ ]
672
+
673
+ [[package]]
674
+ name = "futures"
675
+ version = "0.3.31"
676
+ source = "registry+https://github.com/rust-lang/crates.io-index"
677
+ checksum = "65bc07b1a8bc7c85c5f2e110c476c7389b4554ba72af57d8445ea63a576b0876"
678
+ dependencies = [
679
+ "futures-channel",
680
+ "futures-core",
681
+ "futures-executor",
682
+ "futures-io",
683
+ "futures-sink",
684
+ "futures-task",
685
+ "futures-util",
686
+ ]
687
+
688
+ [[package]]
689
+ name = "futures-channel"
690
+ version = "0.3.31"
691
+ source = "registry+https://github.com/rust-lang/crates.io-index"
692
+ checksum = "2dff15bf788c671c1934e366d07e30c1814a8ef514e1af724a602e8a2fbe1b10"
693
+ dependencies = [
694
+ "futures-core",
695
+ "futures-sink",
696
+ ]
697
+
698
+ [[package]]
699
+ name = "futures-core"
700
+ version = "0.3.31"
701
+ source = "registry+https://github.com/rust-lang/crates.io-index"
702
+ checksum = "05f29059c0c2090612e8d742178b0580d2dc940c837851ad723096f87af6663e"
703
+
704
+ [[package]]
705
+ name = "futures-executor"
706
+ version = "0.3.31"
707
+ source = "registry+https://github.com/rust-lang/crates.io-index"
708
+ checksum = "1e28d1d997f585e54aebc3f97d39e72338912123a67330d723fdbb564d646c9f"
709
+ dependencies = [
710
+ "futures-core",
711
+ "futures-task",
712
+ "futures-util",
713
+ ]
714
+
715
+ [[package]]
716
+ name = "futures-io"
717
+ version = "0.3.31"
718
+ source = "registry+https://github.com/rust-lang/crates.io-index"
719
+ checksum = "9e5c1b78ca4aae1ac06c48a526a655760685149f0d465d21f37abfe57ce075c6"
720
+
721
+ [[package]]
722
+ name = "futures-macro"
723
+ version = "0.3.31"
724
+ source = "registry+https://github.com/rust-lang/crates.io-index"
725
+ checksum = "162ee34ebcb7c64a8abebc059ce0fee27c2262618d7b60ed8faf72fef13c3650"
726
+ dependencies = [
727
+ "proc-macro2",
728
+ "quote",
729
+ "syn 2.0.103",
730
+ ]
731
+
732
+ [[package]]
733
+ name = "futures-sink"
734
+ version = "0.3.31"
735
+ source = "registry+https://github.com/rust-lang/crates.io-index"
736
+ checksum = "e575fab7d1e0dcb8d0c7bcf9a63ee213816ab51902e6d244a95819acacf1d4f7"
737
+
738
+ [[package]]
739
+ name = "futures-task"
740
+ version = "0.3.31"
741
+ source = "registry+https://github.com/rust-lang/crates.io-index"
742
+ checksum = "f90f7dce0722e95104fcb095585910c0977252f286e354b5e3bd38902cd99988"
743
+
744
+ [[package]]
745
+ name = "futures-util"
746
+ version = "0.3.31"
747
+ source = "registry+https://github.com/rust-lang/crates.io-index"
748
+ checksum = "9fa08315bb612088cc391249efdc3bc77536f16c91f6cf495e6fbe85b20a4a81"
749
+ dependencies = [
750
+ "futures-channel",
751
+ "futures-core",
752
+ "futures-io",
753
+ "futures-macro",
754
+ "futures-sink",
755
+ "futures-task",
756
+ "memchr",
757
+ "pin-project-lite",
758
+ "pin-utils",
759
+ "slab",
760
+ ]
761
+
762
+ [[package]]
763
+ name = "gemm"
764
+ version = "0.17.1"
765
+ source = "registry+https://github.com/rust-lang/crates.io-index"
766
+ checksum = "6ab24cc62135b40090e31a76a9b2766a501979f3070fa27f689c27ec04377d32"
767
+ dependencies = [
768
+ "dyn-stack 0.10.0",
769
+ "gemm-c32 0.17.1",
770
+ "gemm-c64 0.17.1",
771
+ "gemm-common 0.17.1",
772
+ "gemm-f16 0.17.1",
773
+ "gemm-f32 0.17.1",
774
+ "gemm-f64 0.17.1",
775
+ "num-complex",
776
+ "num-traits",
777
+ "paste",
778
+ "raw-cpuid 10.7.0",
779
+ "seq-macro",
780
+ ]
781
+
782
+ [[package]]
783
+ name = "gemm"
784
+ version = "0.18.2"
785
+ source = "registry+https://github.com/rust-lang/crates.io-index"
786
+ checksum = "ab96b703d31950f1aeddded248bc95543c9efc7ac9c4a21fda8703a83ee35451"
787
+ dependencies = [
788
+ "dyn-stack 0.13.0",
789
+ "gemm-c32 0.18.2",
790
+ "gemm-c64 0.18.2",
791
+ "gemm-common 0.18.2",
792
+ "gemm-f16 0.18.2",
793
+ "gemm-f32 0.18.2",
794
+ "gemm-f64 0.18.2",
795
+ "num-complex",
796
+ "num-traits",
797
+ "paste",
798
+ "raw-cpuid 11.5.0",
799
+ "seq-macro",
800
+ ]
801
+
802
+ [[package]]
803
+ name = "gemm-c32"
804
+ version = "0.17.1"
805
+ source = "registry+https://github.com/rust-lang/crates.io-index"
806
+ checksum = "b9c030d0b983d1e34a546b86e08f600c11696fde16199f971cd46c12e67512c0"
807
+ dependencies = [
808
+ "dyn-stack 0.10.0",
809
+ "gemm-common 0.17.1",
810
+ "num-complex",
811
+ "num-traits",
812
+ "paste",
813
+ "raw-cpuid 10.7.0",
814
+ "seq-macro",
815
+ ]
816
+
817
+ [[package]]
818
+ name = "gemm-c32"
819
+ version = "0.18.2"
820
+ source = "registry+https://github.com/rust-lang/crates.io-index"
821
+ checksum = "f6db9fd9f40421d00eea9dd0770045a5603b8d684654816637732463f4073847"
822
+ dependencies = [
823
+ "dyn-stack 0.13.0",
824
+ "gemm-common 0.18.2",
825
+ "num-complex",
826
+ "num-traits",
827
+ "paste",
828
+ "raw-cpuid 11.5.0",
829
+ "seq-macro",
830
+ ]
831
+
832
+ [[package]]
833
+ name = "gemm-c64"
834
+ version = "0.17.1"
835
+ source = "registry+https://github.com/rust-lang/crates.io-index"
836
+ checksum = "fbb5f2e79fefb9693d18e1066a557b4546cd334b226beadc68b11a8f9431852a"
837
+ dependencies = [
838
+ "dyn-stack 0.10.0",
839
+ "gemm-common 0.17.1",
840
+ "num-complex",
841
+ "num-traits",
842
+ "paste",
843
+ "raw-cpuid 10.7.0",
844
+ "seq-macro",
845
+ ]
846
+
847
+ [[package]]
848
+ name = "gemm-c64"
849
+ version = "0.18.2"
850
+ source = "registry+https://github.com/rust-lang/crates.io-index"
851
+ checksum = "dfcad8a3d35a43758330b635d02edad980c1e143dc2f21e6fd25f9e4eada8edf"
852
+ dependencies = [
853
+ "dyn-stack 0.13.0",
854
+ "gemm-common 0.18.2",
855
+ "num-complex",
856
+ "num-traits",
857
+ "paste",
858
+ "raw-cpuid 11.5.0",
859
+ "seq-macro",
860
+ ]
861
+
862
+ [[package]]
863
+ name = "gemm-common"
864
+ version = "0.17.1"
865
+ source = "registry+https://github.com/rust-lang/crates.io-index"
866
+ checksum = "a2e7ea062c987abcd8db95db917b4ffb4ecdfd0668471d8dc54734fdff2354e8"
867
+ dependencies = [
868
+ "bytemuck",
869
+ "dyn-stack 0.10.0",
870
+ "half",
871
+ "num-complex",
872
+ "num-traits",
873
+ "once_cell",
874
+ "paste",
875
+ "pulp 0.18.22",
876
+ "raw-cpuid 10.7.0",
877
+ "rayon",
878
+ "seq-macro",
879
+ "sysctl 0.5.5",
880
+ ]
881
+
882
+ [[package]]
883
+ name = "gemm-common"
884
+ version = "0.18.2"
885
+ source = "registry+https://github.com/rust-lang/crates.io-index"
886
+ checksum = "a352d4a69cbe938b9e2a9cb7a3a63b7e72f9349174a2752a558a8a563510d0f3"
887
+ dependencies = [
888
+ "bytemuck",
889
+ "dyn-stack 0.13.0",
890
+ "half",
891
+ "libm",
892
+ "num-complex",
893
+ "num-traits",
894
+ "once_cell",
895
+ "paste",
896
+ "pulp 0.21.5",
897
+ "raw-cpuid 11.5.0",
898
+ "rayon",
899
+ "seq-macro",
900
+ "sysctl 0.6.0",
901
+ ]
902
+
903
+ [[package]]
904
+ name = "gemm-f16"
905
+ version = "0.17.1"
906
+ source = "registry+https://github.com/rust-lang/crates.io-index"
907
+ checksum = "7ca4c06b9b11952071d317604acb332e924e817bd891bec8dfb494168c7cedd4"
908
+ dependencies = [
909
+ "dyn-stack 0.10.0",
910
+ "gemm-common 0.17.1",
911
+ "gemm-f32 0.17.1",
912
+ "half",
913
+ "num-complex",
914
+ "num-traits",
915
+ "paste",
916
+ "raw-cpuid 10.7.0",
917
+ "rayon",
918
+ "seq-macro",
919
+ ]
920
+
921
+ [[package]]
922
+ name = "gemm-f16"
923
+ version = "0.18.2"
924
+ source = "registry+https://github.com/rust-lang/crates.io-index"
925
+ checksum = "cff95ae3259432f3c3410eaa919033cd03791d81cebd18018393dc147952e109"
926
+ dependencies = [
927
+ "dyn-stack 0.13.0",
928
+ "gemm-common 0.18.2",
929
+ "gemm-f32 0.18.2",
930
+ "half",
931
+ "num-complex",
932
+ "num-traits",
933
+ "paste",
934
+ "raw-cpuid 11.5.0",
935
+ "rayon",
936
+ "seq-macro",
937
+ ]
938
+
939
+ [[package]]
940
+ name = "gemm-f32"
941
+ version = "0.17.1"
942
+ source = "registry+https://github.com/rust-lang/crates.io-index"
943
+ checksum = "e9a69f51aaefbd9cf12d18faf273d3e982d9d711f60775645ed5c8047b4ae113"
944
+ dependencies = [
945
+ "dyn-stack 0.10.0",
946
+ "gemm-common 0.17.1",
947
+ "num-complex",
948
+ "num-traits",
949
+ "paste",
950
+ "raw-cpuid 10.7.0",
951
+ "seq-macro",
952
+ ]
953
+
954
+ [[package]]
955
+ name = "gemm-f32"
956
+ version = "0.18.2"
957
+ source = "registry+https://github.com/rust-lang/crates.io-index"
958
+ checksum = "bc8d3d4385393304f407392f754cd2dc4b315d05063f62cf09f47b58de276864"
959
+ dependencies = [
960
+ "dyn-stack 0.13.0",
961
+ "gemm-common 0.18.2",
962
+ "num-complex",
963
+ "num-traits",
964
+ "paste",
965
+ "raw-cpuid 11.5.0",
966
+ "seq-macro",
967
+ ]
968
+
969
+ [[package]]
970
+ name = "gemm-f64"
971
+ version = "0.17.1"
972
+ source = "registry+https://github.com/rust-lang/crates.io-index"
973
+ checksum = "aa397a48544fadf0b81ec8741e5c0fba0043008113f71f2034def1935645d2b0"
974
+ dependencies = [
975
+ "dyn-stack 0.10.0",
976
+ "gemm-common 0.17.1",
977
+ "num-complex",
978
+ "num-traits",
979
+ "paste",
980
+ "raw-cpuid 10.7.0",
981
+ "seq-macro",
982
+ ]
983
+
984
+ [[package]]
985
+ name = "gemm-f64"
986
+ version = "0.18.2"
987
+ source = "registry+https://github.com/rust-lang/crates.io-index"
988
+ checksum = "35b2a4f76ce4b8b16eadc11ccf2e083252d8237c1b589558a49b0183545015bd"
989
+ dependencies = [
990
+ "dyn-stack 0.13.0",
991
+ "gemm-common 0.18.2",
992
+ "num-complex",
993
+ "num-traits",
994
+ "paste",
995
+ "raw-cpuid 11.5.0",
996
+ "seq-macro",
997
+ ]
998
+
999
+ [[package]]
1000
+ name = "getrandom"
1001
+ version = "0.2.16"
1002
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1003
+ checksum = "335ff9f135e4384c8150d6f27c6daed433577f86b4750418338c01a1a2528592"
1004
+ dependencies = [
1005
+ "cfg-if",
1006
+ "libc",
1007
+ "wasi 0.11.1+wasi-snapshot-preview1",
1008
+ ]
1009
+
1010
+ [[package]]
1011
+ name = "getrandom"
1012
+ version = "0.3.3"
1013
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1014
+ checksum = "26145e563e54f2cadc477553f1ec5ee650b00862f0a58bcd12cbdc5f0ea2d2f4"
1015
+ dependencies = [
1016
+ "cfg-if",
1017
+ "libc",
1018
+ "r-efi",
1019
+ "wasi 0.14.2+wasi-0.2.4",
1020
+ ]
1021
+
1022
+ [[package]]
1023
+ name = "gimli"
1024
+ version = "0.31.1"
1025
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1026
+ checksum = "07e28edb80900c19c28f1072f2e8aeca7fa06b23cd4169cefe1af5aa3260783f"
1027
+
1028
+ [[package]]
1029
+ name = "glob"
1030
+ version = "0.3.2"
1031
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1032
+ checksum = "a8d1add55171497b4705a648c6b583acafb01d58050a51727785f0b2c8e0a2b2"
1033
+
1034
+ [[package]]
1035
+ name = "h2"
1036
+ version = "0.4.10"
1037
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1038
+ checksum = "a9421a676d1b147b16b82c9225157dc629087ef8ec4d5e2960f9437a90dac0a5"
1039
+ dependencies = [
1040
+ "atomic-waker",
1041
+ "bytes",
1042
+ "fnv",
1043
+ "futures-core",
1044
+ "futures-sink",
1045
+ "http",
1046
+ "indexmap",
1047
+ "slab",
1048
+ "tokio",
1049
+ "tokio-util 0.7.15",
1050
+ "tracing",
1051
+ ]
1052
+
1053
+ [[package]]
1054
+ name = "half"
1055
+ version = "2.6.0"
1056
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1057
+ checksum = "459196ed295495a68f7d7fe1d84f6c4b7ff0e21fe3017b2f283c6fac3ad803c9"
1058
+ dependencies = [
1059
+ "bytemuck",
1060
+ "cfg-if",
1061
+ "crunchy",
1062
+ "num-traits",
1063
+ "rand",
1064
+ "rand_distr",
1065
+ ]
1066
+
1067
+ [[package]]
1068
+ name = "hashbrown"
1069
+ version = "0.15.4"
1070
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1071
+ checksum = "5971ac85611da7067dbfcabef3c70ebb5606018acd9e2a3903a0da507521e0d5"
1072
+
1073
+ [[package]]
1074
+ name = "heck"
1075
+ version = "0.5.0"
1076
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1077
+ checksum = "2304e00983f87ffb38b55b444b5e3b60a884b5d30c0fca7d82fe33449bbe55ea"
1078
+
1079
+ [[package]]
1080
+ name = "hermit-abi"
1081
+ version = "0.5.2"
1082
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1083
+ checksum = "fc0fef456e4baa96da950455cd02c081ca953b141298e41db3fc7e36b1da849c"
1084
+
1085
+ [[package]]
1086
+ name = "hf-hub"
1087
+ version = "0.4.3"
1088
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1089
+ checksum = "629d8f3bbeda9d148036d6b0de0a3ab947abd08ce90626327fc3547a49d59d97"
1090
+ dependencies = [
1091
+ "dirs",
1092
+ "futures",
1093
+ "http",
1094
+ "indicatif",
1095
+ "libc",
1096
+ "log",
1097
+ "native-tls",
1098
+ "num_cpus",
1099
+ "rand",
1100
+ "reqwest",
1101
+ "serde",
1102
+ "serde_json",
1103
+ "thiserror 2.0.12",
1104
+ "tokio",
1105
+ "ureq",
1106
+ "windows-sys 0.60.2",
1107
+ ]
1108
+
1109
+ [[package]]
1110
+ name = "http"
1111
+ version = "1.3.1"
1112
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1113
+ checksum = "f4a85d31aea989eead29a3aaf9e1115a180df8282431156e533de47660892565"
1114
+ dependencies = [
1115
+ "bytes",
1116
+ "fnv",
1117
+ "itoa",
1118
+ ]
1119
+
1120
+ [[package]]
1121
+ name = "http-body"
1122
+ version = "1.0.1"
1123
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1124
+ checksum = "1efedce1fb8e6913f23e0c92de8e62cd5b772a67e7b3946df930a62566c93184"
1125
+ dependencies = [
1126
+ "bytes",
1127
+ "http",
1128
+ ]
1129
+
1130
+ [[package]]
1131
+ name = "http-body-util"
1132
+ version = "0.1.3"
1133
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1134
+ checksum = "b021d93e26becf5dc7e1b75b1bed1fd93124b374ceb73f43d4d4eafec896a64a"
1135
+ dependencies = [
1136
+ "bytes",
1137
+ "futures-core",
1138
+ "http",
1139
+ "http-body",
1140
+ "pin-project-lite",
1141
+ ]
1142
+
1143
+ [[package]]
1144
+ name = "httparse"
1145
+ version = "1.10.1"
1146
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1147
+ checksum = "6dbf3de79e51f3d586ab4cb9d5c3e2c14aa28ed23d180cf89b4df0454a69cc87"
1148
+
1149
+ [[package]]
1150
+ name = "hyper"
1151
+ version = "1.6.0"
1152
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1153
+ checksum = "cc2b571658e38e0c01b1fdca3bbbe93c00d3d71693ff2770043f8c29bc7d6f80"
1154
+ dependencies = [
1155
+ "bytes",
1156
+ "futures-channel",
1157
+ "futures-util",
1158
+ "h2",
1159
+ "http",
1160
+ "http-body",
1161
+ "httparse",
1162
+ "itoa",
1163
+ "pin-project-lite",
1164
+ "smallvec",
1165
+ "tokio",
1166
+ "want",
1167
+ ]
1168
+
1169
+ [[package]]
1170
+ name = "hyper-rustls"
1171
+ version = "0.27.7"
1172
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1173
+ checksum = "e3c93eb611681b207e1fe55d5a71ecf91572ec8a6705cdb6857f7d8d5242cf58"
1174
+ dependencies = [
1175
+ "http",
1176
+ "hyper",
1177
+ "hyper-util",
1178
+ "rustls",
1179
+ "rustls-pki-types",
1180
+ "tokio",
1181
+ "tokio-rustls",
1182
+ "tower-service",
1183
+ ]
1184
+
1185
+ [[package]]
1186
+ name = "hyper-tls"
1187
+ version = "0.6.0"
1188
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1189
+ checksum = "70206fc6890eaca9fde8a0bf71caa2ddfc9fe045ac9e5c70df101a7dbde866e0"
1190
+ dependencies = [
1191
+ "bytes",
1192
+ "http-body-util",
1193
+ "hyper",
1194
+ "hyper-util",
1195
+ "native-tls",
1196
+ "tokio",
1197
+ "tokio-native-tls",
1198
+ "tower-service",
1199
+ ]
1200
+
1201
+ [[package]]
1202
+ name = "hyper-util"
1203
+ version = "0.1.14"
1204
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1205
+ checksum = "dc2fdfdbff08affe55bb779f33b053aa1fe5dd5b54c257343c17edfa55711bdb"
1206
+ dependencies = [
1207
+ "base64",
1208
+ "bytes",
1209
+ "futures-channel",
1210
+ "futures-core",
1211
+ "futures-util",
1212
+ "http",
1213
+ "http-body",
1214
+ "hyper",
1215
+ "ipnet",
1216
+ "libc",
1217
+ "percent-encoding",
1218
+ "pin-project-lite",
1219
+ "socket2",
1220
+ "system-configuration",
1221
+ "tokio",
1222
+ "tower-service",
1223
+ "tracing",
1224
+ "windows-registry",
1225
+ ]
1226
+
1227
+ [[package]]
1228
+ name = "icu_collections"
1229
+ version = "2.0.0"
1230
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1231
+ checksum = "200072f5d0e3614556f94a9930d5dc3e0662a652823904c3a75dc3b0af7fee47"
1232
+ dependencies = [
1233
+ "displaydoc",
1234
+ "potential_utf",
1235
+ "yoke 0.8.0",
1236
+ "zerofrom",
1237
+ "zerovec",
1238
+ ]
1239
+
1240
+ [[package]]
1241
+ name = "icu_locale_core"
1242
+ version = "2.0.0"
1243
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1244
+ checksum = "0cde2700ccaed3872079a65fb1a78f6c0a36c91570f28755dda67bc8f7d9f00a"
1245
+ dependencies = [
1246
+ "displaydoc",
1247
+ "litemap",
1248
+ "tinystr",
1249
+ "writeable",
1250
+ "zerovec",
1251
+ ]
1252
+
1253
+ [[package]]
1254
+ name = "icu_normalizer"
1255
+ version = "2.0.0"
1256
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1257
+ checksum = "436880e8e18df4d7bbc06d58432329d6458cc84531f7ac5f024e93deadb37979"
1258
+ dependencies = [
1259
+ "displaydoc",
1260
+ "icu_collections",
1261
+ "icu_normalizer_data",
1262
+ "icu_properties",
1263
+ "icu_provider",
1264
+ "smallvec",
1265
+ "zerovec",
1266
+ ]
1267
+
1268
+ [[package]]
1269
+ name = "icu_normalizer_data"
1270
+ version = "2.0.0"
1271
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1272
+ checksum = "00210d6893afc98edb752b664b8890f0ef174c8adbb8d0be9710fa66fbbf72d3"
1273
+
1274
+ [[package]]
1275
+ name = "icu_properties"
1276
+ version = "2.0.1"
1277
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1278
+ checksum = "016c619c1eeb94efb86809b015c58f479963de65bdb6253345c1a1276f22e32b"
1279
+ dependencies = [
1280
+ "displaydoc",
1281
+ "icu_collections",
1282
+ "icu_locale_core",
1283
+ "icu_properties_data",
1284
+ "icu_provider",
1285
+ "potential_utf",
1286
+ "zerotrie",
1287
+ "zerovec",
1288
+ ]
1289
+
1290
+ [[package]]
1291
+ name = "icu_properties_data"
1292
+ version = "2.0.1"
1293
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1294
+ checksum = "298459143998310acd25ffe6810ed544932242d3f07083eee1084d83a71bd632"
1295
+
1296
+ [[package]]
1297
+ name = "icu_provider"
1298
+ version = "2.0.0"
1299
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1300
+ checksum = "03c80da27b5f4187909049ee2d72f276f0d9f99a42c306bd0131ecfe04d8e5af"
1301
+ dependencies = [
1302
+ "displaydoc",
1303
+ "icu_locale_core",
1304
+ "stable_deref_trait",
1305
+ "tinystr",
1306
+ "writeable",
1307
+ "yoke 0.8.0",
1308
+ "zerofrom",
1309
+ "zerotrie",
1310
+ "zerovec",
1311
+ ]
1312
+
1313
+ [[package]]
1314
+ name = "idna"
1315
+ version = "1.0.3"
1316
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1317
+ checksum = "686f825264d630750a544639377bae737628043f20d38bbc029e8f29ea968a7e"
1318
+ dependencies = [
1319
+ "idna_adapter",
1320
+ "smallvec",
1321
+ "utf8_iter",
1322
+ ]
1323
+
1324
+ [[package]]
1325
+ name = "idna_adapter"
1326
+ version = "1.2.1"
1327
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1328
+ checksum = "3acae9609540aa318d1bc588455225fb2085b9ed0c4f6bd0d9d5bcd86f1a0344"
1329
+ dependencies = [
1330
+ "icu_normalizer",
1331
+ "icu_properties",
1332
+ ]
1333
+
1334
+ [[package]]
1335
+ name = "indexmap"
1336
+ version = "2.9.0"
1337
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1338
+ checksum = "cea70ddb795996207ad57735b50c5982d8844f38ba9ee5f1aedcfb708a2aa11e"
1339
+ dependencies = [
1340
+ "equivalent",
1341
+ "hashbrown",
1342
+ ]
1343
+
1344
+ [[package]]
1345
+ name = "indicatif"
1346
+ version = "0.17.11"
1347
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1348
+ checksum = "183b3088984b400f4cfac3620d5e076c84da5364016b4f49473de574b2586235"
1349
+ dependencies = [
1350
+ "console",
1351
+ "number_prefix",
1352
+ "portable-atomic",
1353
+ "unicode-width",
1354
+ "web-time",
1355
+ ]
1356
+
1357
+ [[package]]
1358
+ name = "ipnet"
1359
+ version = "2.11.0"
1360
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1361
+ checksum = "469fb0b9cefa57e3ef31275ee7cacb78f2fdca44e4765491884a2b119d4eb130"
1362
+
1363
+ [[package]]
1364
+ name = "iri-string"
1365
+ version = "0.7.8"
1366
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1367
+ checksum = "dbc5ebe9c3a1a7a5127f920a418f7585e9e758e911d0466ed004f393b0e380b2"
1368
+ dependencies = [
1369
+ "memchr",
1370
+ "serde",
1371
+ ]
1372
+
1373
+ [[package]]
1374
+ name = "is_terminal_polyfill"
1375
+ version = "1.70.1"
1376
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1377
+ checksum = "7943c866cc5cd64cbc25b2e01621d07fa8eb2a1a23160ee81ce38704e97b8ecf"
1378
+
1379
+ [[package]]
1380
+ name = "itertools"
1381
+ version = "0.10.5"
1382
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1383
+ checksum = "b0fd2260e829bddf4cb6ea802289de2f86d6a7a690192fbe91b3f46e0f2c8473"
1384
+ dependencies = [
1385
+ "either",
1386
+ ]
1387
+
1388
+ [[package]]
1389
+ name = "itoa"
1390
+ version = "1.0.15"
1391
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1392
+ checksum = "4a5f13b858c8d314ee3e8f639011f7ccefe71f97f96e50151fb991f267928e2c"
1393
+
1394
+ [[package]]
1395
+ name = "js-sys"
1396
+ version = "0.3.77"
1397
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1398
+ checksum = "1cfaf33c695fc6e08064efbc1f72ec937429614f25eef83af942d0e227c3a28f"
1399
+ dependencies = [
1400
+ "once_cell",
1401
+ "wasm-bindgen",
1402
+ ]
1403
+
1404
+ [[package]]
1405
+ name = "kaudio"
1406
+ version = "0.2.1"
1407
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1408
+ checksum = "03fa91d027e02814ae876667542dd1b7cf91cb12f511ef29ea00022d7463699e"
1409
+ dependencies = [
1410
+ "byteorder",
1411
+ "futures-util",
1412
+ "ogg",
1413
+ "opus",
1414
+ "regex",
1415
+ "rubato",
1416
+ "serde",
1417
+ "serde_json",
1418
+ "symphonia",
1419
+ "thiserror 2.0.12",
1420
+ "tokio",
1421
+ ]
1422
+
1423
+ [[package]]
1424
+ name = "kyutai-stt-rs"
1425
+ version = "0.1.0"
1426
+ dependencies = [
1427
+ "anyhow",
1428
+ "candle-core",
1429
+ "candle-nn",
1430
+ "candle-transformers",
1431
+ "clap",
1432
+ "hf-hub",
1433
+ "kaudio",
1434
+ "moshi",
1435
+ "sentencepiece",
1436
+ "serde",
1437
+ "serde_json",
1438
+ ]
1439
+
1440
+ [[package]]
1441
+ name = "lazy_static"
1442
+ version = "1.5.0"
1443
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1444
+ checksum = "bbd2bcb4c963f2ddae06a2efc7e9f3591312473c50c6685e1f298068316e66fe"
1445
+
1446
+ [[package]]
1447
+ name = "libc"
1448
+ version = "0.2.174"
1449
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1450
+ checksum = "1171693293099992e19cddea4e8b849964e9846f4acee11b3948bcc337be8776"
1451
+
1452
+ [[package]]
1453
+ name = "libloading"
1454
+ version = "0.8.8"
1455
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1456
+ checksum = "07033963ba89ebaf1584d767badaa2e8fcec21aedea6b8c0346d487d49c28667"
1457
+ dependencies = [
1458
+ "cfg-if",
1459
+ "windows-targets 0.53.2",
1460
+ ]
1461
+
1462
+ [[package]]
1463
+ name = "libm"
1464
+ version = "0.2.15"
1465
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1466
+ checksum = "f9fbbcab51052fe104eb5e5d351cf728d30a5be1fe14d9be8a3b097481fb97de"
1467
+
1468
+ [[package]]
1469
+ name = "libredox"
1470
+ version = "0.1.3"
1471
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1472
+ checksum = "c0ff37bd590ca25063e35af745c343cb7a0271906fb7b37e4813e8f79f00268d"
1473
+ dependencies = [
1474
+ "bitflags 2.9.1",
1475
+ "libc",
1476
+ ]
1477
+
1478
+ [[package]]
1479
+ name = "linux-raw-sys"
1480
+ version = "0.9.4"
1481
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1482
+ checksum = "cd945864f07fe9f5371a27ad7b52a172b4b499999f1d97574c9fa68373937e12"
1483
+
1484
+ [[package]]
1485
+ name = "litemap"
1486
+ version = "0.8.0"
1487
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1488
+ checksum = "241eaef5fd12c88705a01fc1066c48c4b36e0dd4377dcdc7ec3942cea7a69956"
1489
+
1490
+ [[package]]
1491
+ name = "lock_api"
1492
+ version = "0.4.13"
1493
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1494
+ checksum = "96936507f153605bddfcda068dd804796c84324ed2510809e5b2a624c81da765"
1495
+ dependencies = [
1496
+ "autocfg",
1497
+ "scopeguard",
1498
+ ]
1499
+
1500
+ [[package]]
1501
+ name = "log"
1502
+ version = "0.4.27"
1503
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1504
+ checksum = "13dc2df351e3202783a1fe0d44375f7295ffb4049267b0f3018346dc122a1d94"
1505
+
1506
+ [[package]]
1507
+ name = "malloc_buf"
1508
+ version = "0.0.6"
1509
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1510
+ checksum = "62bb907fe88d54d8d9ce32a3cceab4218ed2f6b7d35617cafe9adf84e43919cb"
1511
+ dependencies = [
1512
+ "libc",
1513
+ ]
1514
+
1515
+ [[package]]
1516
+ name = "memchr"
1517
+ version = "2.7.5"
1518
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1519
+ checksum = "32a282da65faaf38286cf3be983213fcf1d2e2a58700e808f83f4ea9a4804bc0"
1520
+
1521
+ [[package]]
1522
+ name = "memmap2"
1523
+ version = "0.9.5"
1524
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1525
+ checksum = "fd3f7eed9d3848f8b98834af67102b720745c4ec028fcd0aa0239277e7de374f"
1526
+ dependencies = [
1527
+ "libc",
1528
+ "stable_deref_trait",
1529
+ ]
1530
+
1531
+ [[package]]
1532
+ name = "metal"
1533
+ version = "0.27.0"
1534
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1535
+ checksum = "c43f73953f8cbe511f021b58f18c3ce1c3d1ae13fe953293e13345bf83217f25"
1536
+ dependencies = [
1537
+ "bitflags 2.9.1",
1538
+ "block",
1539
+ "core-graphics-types",
1540
+ "foreign-types 0.5.0",
1541
+ "log",
1542
+ "objc",
1543
+ "paste",
1544
+ ]
1545
+
1546
+ [[package]]
1547
+ name = "metal"
1548
+ version = "0.29.0"
1549
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1550
+ checksum = "7ecfd3296f8c56b7c1f6fbac3c71cefa9d78ce009850c45000015f206dc7fa21"
1551
+ dependencies = [
1552
+ "bitflags 2.9.1",
1553
+ "block",
1554
+ "core-graphics-types",
1555
+ "foreign-types 0.5.0",
1556
+ "log",
1557
+ "objc",
1558
+ "paste",
1559
+ ]
1560
+
1561
+ [[package]]
1562
+ name = "mime"
1563
+ version = "0.3.17"
1564
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1565
+ checksum = "6877bb514081ee2a7ff5ef9de3281f14a4dd4bceac4c09388074a6b5df8a139a"
1566
+
1567
+ [[package]]
1568
+ name = "miniz_oxide"
1569
+ version = "0.8.9"
1570
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1571
+ checksum = "1fa76a2c86f704bdb222d66965fb3d63269ce38518b83cb0575fca855ebb6316"
1572
+ dependencies = [
1573
+ "adler2",
1574
+ ]
1575
+
1576
+ [[package]]
1577
+ name = "mio"
1578
+ version = "1.0.4"
1579
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1580
+ checksum = "78bed444cc8a2160f01cbcf811ef18cac863ad68ae8ca62092e8db51d51c761c"
1581
+ dependencies = [
1582
+ "libc",
1583
+ "wasi 0.11.1+wasi-snapshot-preview1",
1584
+ "windows-sys 0.59.0",
1585
+ ]
1586
+
1587
+ [[package]]
1588
+ name = "moshi"
1589
+ version = "0.6.1"
1590
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1591
+ checksum = "f72457c4b5dfbd77f67af691b470ed92ff2a71908d610498b19c82e638cd8ae2"
1592
+ dependencies = [
1593
+ "candle-core",
1594
+ "candle-nn",
1595
+ "candle-transformers",
1596
+ "rayon",
1597
+ "serde",
1598
+ "tracing",
1599
+ ]
1600
+
1601
+ [[package]]
1602
+ name = "native-tls"
1603
+ version = "0.2.14"
1604
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1605
+ checksum = "87de3442987e9dbec73158d5c715e7ad9072fda936bb03d19d7fa10e00520f0e"
1606
+ dependencies = [
1607
+ "libc",
1608
+ "log",
1609
+ "openssl",
1610
+ "openssl-probe",
1611
+ "openssl-sys",
1612
+ "schannel",
1613
+ "security-framework",
1614
+ "security-framework-sys",
1615
+ "tempfile",
1616
+ ]
1617
+
1618
+ [[package]]
1619
+ name = "num"
1620
+ version = "0.4.3"
1621
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1622
+ checksum = "35bd024e8b2ff75562e5f34e7f4905839deb4b22955ef5e73d2fea1b9813cb23"
1623
+ dependencies = [
1624
+ "num-bigint",
1625
+ "num-complex",
1626
+ "num-integer",
1627
+ "num-iter",
1628
+ "num-rational",
1629
+ "num-traits",
1630
+ ]
1631
+
1632
+ [[package]]
1633
+ name = "num-bigint"
1634
+ version = "0.4.6"
1635
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1636
+ checksum = "a5e44f723f1133c9deac646763579fdb3ac745e418f2a7af9cd0c431da1f20b9"
1637
+ dependencies = [
1638
+ "num-integer",
1639
+ "num-traits",
1640
+ ]
1641
+
1642
+ [[package]]
1643
+ name = "num-complex"
1644
+ version = "0.4.6"
1645
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1646
+ checksum = "73f88a1307638156682bada9d7604135552957b7818057dcef22705b4d509495"
1647
+ dependencies = [
1648
+ "bytemuck",
1649
+ "num-traits",
1650
+ ]
1651
+
1652
+ [[package]]
1653
+ name = "num-derive"
1654
+ version = "0.4.2"
1655
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1656
+ checksum = "ed3955f1a9c7c0c15e092f9c887db08b1fc683305fdf6eb6684f22555355e202"
1657
+ dependencies = [
1658
+ "proc-macro2",
1659
+ "quote",
1660
+ "syn 2.0.103",
1661
+ ]
1662
+
1663
+ [[package]]
1664
+ name = "num-integer"
1665
+ version = "0.1.46"
1666
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1667
+ checksum = "7969661fd2958a5cb096e56c8e1ad0444ac2bbcd0061bd28660485a44879858f"
1668
+ dependencies = [
1669
+ "num-traits",
1670
+ ]
1671
+
1672
+ [[package]]
1673
+ name = "num-iter"
1674
+ version = "0.1.45"
1675
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1676
+ checksum = "1429034a0490724d0075ebb2bc9e875d6503c3cf69e235a8941aa757d83ef5bf"
1677
+ dependencies = [
1678
+ "autocfg",
1679
+ "num-integer",
1680
+ "num-traits",
1681
+ ]
1682
+
1683
+ [[package]]
1684
+ name = "num-rational"
1685
+ version = "0.4.2"
1686
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1687
+ checksum = "f83d14da390562dca69fc84082e73e548e1ad308d24accdedd2720017cb37824"
1688
+ dependencies = [
1689
+ "num-bigint",
1690
+ "num-integer",
1691
+ "num-traits",
1692
+ ]
1693
+
1694
+ [[package]]
1695
+ name = "num-traits"
1696
+ version = "0.2.19"
1697
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1698
+ checksum = "071dfc062690e90b734c0b2273ce72ad0ffa95f0c74596bc250dcfd960262841"
1699
+ dependencies = [
1700
+ "autocfg",
1701
+ "libm",
1702
+ ]
1703
+
1704
+ [[package]]
1705
+ name = "num_cpus"
1706
+ version = "1.17.0"
1707
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1708
+ checksum = "91df4bbde75afed763b708b7eee1e8e7651e02d97f6d5dd763e89367e957b23b"
1709
+ dependencies = [
1710
+ "hermit-abi",
1711
+ "libc",
1712
+ ]
1713
+
1714
+ [[package]]
1715
+ name = "num_enum"
1716
+ version = "0.7.3"
1717
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1718
+ checksum = "4e613fc340b2220f734a8595782c551f1250e969d87d3be1ae0579e8d4065179"
1719
+ dependencies = [
1720
+ "num_enum_derive",
1721
+ ]
1722
+
1723
+ [[package]]
1724
+ name = "num_enum_derive"
1725
+ version = "0.7.3"
1726
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1727
+ checksum = "af1844ef2428cc3e1cb900be36181049ef3d3193c63e43026cfe202983b27a56"
1728
+ dependencies = [
1729
+ "proc-macro-crate",
1730
+ "proc-macro2",
1731
+ "quote",
1732
+ "syn 2.0.103",
1733
+ ]
1734
+
1735
+ [[package]]
1736
+ name = "number_prefix"
1737
+ version = "0.4.0"
1738
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1739
+ checksum = "830b246a0e5f20af87141b25c173cd1b609bd7779a4617d6ec582abaf90870f3"
1740
+
1741
+ [[package]]
1742
+ name = "objc"
1743
+ version = "0.2.7"
1744
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1745
+ checksum = "915b1b472bc21c53464d6c8461c9d3af805ba1ef837e1cac254428f4a77177b1"
1746
+ dependencies = [
1747
+ "malloc_buf",
1748
+ "objc_exception",
1749
+ ]
1750
+
1751
+ [[package]]
1752
+ name = "objc_exception"
1753
+ version = "0.1.2"
1754
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1755
+ checksum = "ad970fb455818ad6cba4c122ad012fae53ae8b4795f86378bce65e4f6bab2ca4"
1756
+ dependencies = [
1757
+ "cc",
1758
+ ]
1759
+
1760
+ [[package]]
1761
+ name = "object"
1762
+ version = "0.36.7"
1763
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1764
+ checksum = "62948e14d923ea95ea2c7c86c71013138b66525b86bdc08d2dcc262bdb497b87"
1765
+ dependencies = [
1766
+ "memchr",
1767
+ ]
1768
+
1769
+ [[package]]
1770
+ name = "ogg"
1771
+ version = "0.9.2"
1772
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1773
+ checksum = "fdab8dcd8d4052eaacaf8fb07a3ccd9a6e26efadb42878a413c68fc4af1dee2b"
1774
+ dependencies = [
1775
+ "byteorder",
1776
+ "bytes",
1777
+ "futures-core",
1778
+ "futures-io",
1779
+ "pin-project",
1780
+ "tokio",
1781
+ "tokio-util 0.6.10",
1782
+ ]
1783
+
1784
+ [[package]]
1785
+ name = "once_cell"
1786
+ version = "1.21.3"
1787
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1788
+ checksum = "42f5e15c9953c5e4ccceeb2e7382a716482c34515315f7b03532b8b4e8393d2d"
1789
+
1790
+ [[package]]
1791
+ name = "once_cell_polyfill"
1792
+ version = "1.70.1"
1793
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1794
+ checksum = "a4895175b425cb1f87721b59f0f286c2092bd4af812243672510e1ac53e2e0ad"
1795
+
1796
+ [[package]]
1797
+ name = "openssl"
1798
+ version = "0.10.73"
1799
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1800
+ checksum = "8505734d46c8ab1e19a1dce3aef597ad87dcb4c37e7188231769bd6bd51cebf8"
1801
+ dependencies = [
1802
+ "bitflags 2.9.1",
1803
+ "cfg-if",
1804
+ "foreign-types 0.3.2",
1805
+ "libc",
1806
+ "once_cell",
1807
+ "openssl-macros",
1808
+ "openssl-sys",
1809
+ ]
1810
+
1811
+ [[package]]
1812
+ name = "openssl-macros"
1813
+ version = "0.1.1"
1814
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1815
+ checksum = "a948666b637a0f465e8564c73e89d4dde00d72d4d473cc972f390fc3dcee7d9c"
1816
+ dependencies = [
1817
+ "proc-macro2",
1818
+ "quote",
1819
+ "syn 2.0.103",
1820
+ ]
1821
+
1822
+ [[package]]
1823
+ name = "openssl-probe"
1824
+ version = "0.1.6"
1825
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1826
+ checksum = "d05e27ee213611ffe7d6348b942e8f942b37114c00cc03cec254295a4a17852e"
1827
+
1828
+ [[package]]
1829
+ name = "openssl-sys"
1830
+ version = "0.9.109"
1831
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1832
+ checksum = "90096e2e47630d78b7d1c20952dc621f957103f8bc2c8359ec81290d75238571"
1833
+ dependencies = [
1834
+ "cc",
1835
+ "libc",
1836
+ "pkg-config",
1837
+ "vcpkg",
1838
+ ]
1839
+
1840
+ [[package]]
1841
+ name = "option-ext"
1842
+ version = "0.2.0"
1843
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1844
+ checksum = "04744f49eae99ab78e0d5c0b603ab218f515ea8cfe5a456d7629ad883a3b6e7d"
1845
+
1846
+ [[package]]
1847
+ name = "opus"
1848
+ version = "0.3.0"
1849
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1850
+ checksum = "6526409b274a7e98e55ff59d96aafd38e6cd34d46b7dbbc32ce126dffcd75e8e"
1851
+ dependencies = [
1852
+ "audiopus_sys",
1853
+ "libc",
1854
+ ]
1855
+
1856
+ [[package]]
1857
+ name = "parking_lot"
1858
+ version = "0.12.4"
1859
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1860
+ checksum = "70d58bf43669b5795d1576d0641cfb6fbb2057bf629506267a92807158584a13"
1861
+ dependencies = [
1862
+ "lock_api",
1863
+ "parking_lot_core",
1864
+ ]
1865
+
1866
+ [[package]]
1867
+ name = "parking_lot_core"
1868
+ version = "0.9.11"
1869
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1870
+ checksum = "bc838d2a56b5b1a6c25f55575dfc605fabb63bb2365f6c2353ef9159aa69e4a5"
1871
+ dependencies = [
1872
+ "cfg-if",
1873
+ "libc",
1874
+ "redox_syscall",
1875
+ "smallvec",
1876
+ "windows-targets 0.52.6",
1877
+ ]
1878
+
1879
+ [[package]]
1880
+ name = "paste"
1881
+ version = "1.0.15"
1882
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1883
+ checksum = "57c0d7b74b563b49d38dae00a0c37d4d6de9b432382b2892f0574ddcae73fd0a"
1884
+
1885
+ [[package]]
1886
+ name = "percent-encoding"
1887
+ version = "2.3.1"
1888
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1889
+ checksum = "e3148f5046208a5d56bcfc03053e3ca6334e51da8dfb19b6cdc8b306fae3283e"
1890
+
1891
+ [[package]]
1892
+ name = "pin-project"
1893
+ version = "1.1.10"
1894
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1895
+ checksum = "677f1add503faace112b9f1373e43e9e054bfdd22ff1a63c1bc485eaec6a6a8a"
1896
+ dependencies = [
1897
+ "pin-project-internal",
1898
+ ]
1899
+
1900
+ [[package]]
1901
+ name = "pin-project-internal"
1902
+ version = "1.1.10"
1903
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1904
+ checksum = "6e918e4ff8c4549eb882f14b3a4bc8c8bc93de829416eacf579f1207a8fbf861"
1905
+ dependencies = [
1906
+ "proc-macro2",
1907
+ "quote",
1908
+ "syn 2.0.103",
1909
+ ]
1910
+
1911
+ [[package]]
1912
+ name = "pin-project-lite"
1913
+ version = "0.2.16"
1914
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1915
+ checksum = "3b3cff922bd51709b605d9ead9aa71031d81447142d828eb4a6eba76fe619f9b"
1916
+
1917
+ [[package]]
1918
+ name = "pin-utils"
1919
+ version = "0.1.0"
1920
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1921
+ checksum = "8b870d8c151b6f2fb93e84a13146138f05d02ed11c7e7c54f8826aaaf7c9f184"
1922
+
1923
+ [[package]]
1924
+ name = "pkg-config"
1925
+ version = "0.3.32"
1926
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1927
+ checksum = "7edddbd0b52d732b21ad9a5fab5c704c14cd949e5e9a1ec5929a24fded1b904c"
1928
+
1929
+ [[package]]
1930
+ name = "portable-atomic"
1931
+ version = "1.11.1"
1932
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1933
+ checksum = "f84267b20a16ea918e43c6a88433c2d54fa145c92a811b5b047ccbe153674483"
1934
+
1935
+ [[package]]
1936
+ name = "potential_utf"
1937
+ version = "0.1.2"
1938
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1939
+ checksum = "e5a7c30837279ca13e7c867e9e40053bc68740f988cb07f7ca6df43cc734b585"
1940
+ dependencies = [
1941
+ "zerovec",
1942
+ ]
1943
+
1944
+ [[package]]
1945
+ name = "ppv-lite86"
1946
+ version = "0.2.21"
1947
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1948
+ checksum = "85eae3c4ed2f50dcfe72643da4befc30deadb458a9b590d720cde2f2b1e97da9"
1949
+ dependencies = [
1950
+ "zerocopy",
1951
+ ]
1952
+
1953
+ [[package]]
1954
+ name = "primal-check"
1955
+ version = "0.3.4"
1956
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1957
+ checksum = "dc0d895b311e3af9902528fbb8f928688abbd95872819320517cc24ca6b2bd08"
1958
+ dependencies = [
1959
+ "num-integer",
1960
+ ]
1961
+
1962
+ [[package]]
1963
+ name = "proc-macro-crate"
1964
+ version = "3.3.0"
1965
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1966
+ checksum = "edce586971a4dfaa28950c6f18ed55e0406c1ab88bbce2c6f6293a7aaba73d35"
1967
+ dependencies = [
1968
+ "toml_edit",
1969
+ ]
1970
+
1971
+ [[package]]
1972
+ name = "proc-macro2"
1973
+ version = "1.0.95"
1974
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1975
+ checksum = "02b3e5e68a3a1a02aad3ec490a98007cbc13c37cbe84a3cd7b8e406d76e7f778"
1976
+ dependencies = [
1977
+ "unicode-ident",
1978
+ ]
1979
+
1980
+ [[package]]
1981
+ name = "prost"
1982
+ version = "0.11.9"
1983
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1984
+ checksum = "0b82eaa1d779e9a4bc1c3217db8ffbeabaae1dca241bf70183242128d48681cd"
1985
+ dependencies = [
1986
+ "bytes",
1987
+ "prost-derive",
1988
+ ]
1989
+
1990
+ [[package]]
1991
+ name = "prost-derive"
1992
+ version = "0.11.9"
1993
+ source = "registry+https://github.com/rust-lang/crates.io-index"
1994
+ checksum = "e5d2d8d10f3c6ded6da8b05b5fb3b8a5082514344d56c9f871412d29b4e075b4"
1995
+ dependencies = [
1996
+ "anyhow",
1997
+ "itertools",
1998
+ "proc-macro2",
1999
+ "quote",
2000
+ "syn 1.0.109",
2001
+ ]
2002
+
2003
+ [[package]]
2004
+ name = "pulp"
2005
+ version = "0.18.22"
2006
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2007
+ checksum = "a0a01a0dc67cf4558d279f0c25b0962bd08fc6dec0137699eae304103e882fe6"
2008
+ dependencies = [
2009
+ "bytemuck",
2010
+ "libm",
2011
+ "num-complex",
2012
+ "reborrow",
2013
+ ]
2014
+
2015
+ [[package]]
2016
+ name = "pulp"
2017
+ version = "0.21.5"
2018
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2019
+ checksum = "96b86df24f0a7ddd5e4b95c94fc9ed8a98f1ca94d3b01bdce2824097e7835907"
2020
+ dependencies = [
2021
+ "bytemuck",
2022
+ "cfg-if",
2023
+ "libm",
2024
+ "num-complex",
2025
+ "reborrow",
2026
+ "version_check",
2027
+ ]
2028
+
2029
+ [[package]]
2030
+ name = "quote"
2031
+ version = "1.0.40"
2032
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2033
+ checksum = "1885c039570dc00dcb4ff087a89e185fd56bae234ddc7f056a945bf36467248d"
2034
+ dependencies = [
2035
+ "proc-macro2",
2036
+ ]
2037
+
2038
+ [[package]]
2039
+ name = "r-efi"
2040
+ version = "5.3.0"
2041
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2042
+ checksum = "69cdb34c158ceb288df11e18b4bd39de994f6657d83847bdffdbd7f346754b0f"
2043
+
2044
+ [[package]]
2045
+ name = "rand"
2046
+ version = "0.9.1"
2047
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2048
+ checksum = "9fbfd9d094a40bf3ae768db9361049ace4c0e04a4fd6b359518bd7b73a73dd97"
2049
+ dependencies = [
2050
+ "rand_chacha",
2051
+ "rand_core",
2052
+ ]
2053
+
2054
+ [[package]]
2055
+ name = "rand_chacha"
2056
+ version = "0.9.0"
2057
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2058
+ checksum = "d3022b5f1df60f26e1ffddd6c66e8aa15de382ae63b3a0c1bfc0e4d3e3f325cb"
2059
+ dependencies = [
2060
+ "ppv-lite86",
2061
+ "rand_core",
2062
+ ]
2063
+
2064
+ [[package]]
2065
+ name = "rand_core"
2066
+ version = "0.9.3"
2067
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2068
+ checksum = "99d9a13982dcf210057a8a78572b2217b667c3beacbf3a0d8b454f6f82837d38"
2069
+ dependencies = [
2070
+ "getrandom 0.3.3",
2071
+ ]
2072
+
2073
+ [[package]]
2074
+ name = "rand_distr"
2075
+ version = "0.5.1"
2076
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2077
+ checksum = "6a8615d50dcf34fa31f7ab52692afec947c4dd0ab803cc87cb3b0b4570ff7463"
2078
+ dependencies = [
2079
+ "num-traits",
2080
+ "rand",
2081
+ ]
2082
+
2083
+ [[package]]
2084
+ name = "raw-cpuid"
2085
+ version = "10.7.0"
2086
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2087
+ checksum = "6c297679cb867470fa8c9f67dbba74a78d78e3e98d7cf2b08d6d71540f797332"
2088
+ dependencies = [
2089
+ "bitflags 1.3.2",
2090
+ ]
2091
+
2092
+ [[package]]
2093
+ name = "raw-cpuid"
2094
+ version = "11.5.0"
2095
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2096
+ checksum = "c6df7ab838ed27997ba19a4664507e6f82b41fe6e20be42929332156e5e85146"
2097
+ dependencies = [
2098
+ "bitflags 2.9.1",
2099
+ ]
2100
+
2101
+ [[package]]
2102
+ name = "rayon"
2103
+ version = "1.10.0"
2104
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2105
+ checksum = "b418a60154510ca1a002a752ca9714984e21e4241e804d32555251faf8b78ffa"
2106
+ dependencies = [
2107
+ "either",
2108
+ "rayon-core",
2109
+ ]
2110
+
2111
+ [[package]]
2112
+ name = "rayon-core"
2113
+ version = "1.12.1"
2114
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2115
+ checksum = "1465873a3dfdaa8ae7cb14b4383657caab0b3e8a0aa9ae8e04b044854c8dfce2"
2116
+ dependencies = [
2117
+ "crossbeam-deque",
2118
+ "crossbeam-utils",
2119
+ ]
2120
+
2121
+ [[package]]
2122
+ name = "realfft"
2123
+ version = "3.5.0"
2124
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2125
+ checksum = "f821338fddb99d089116342c46e9f1fbf3828dba077674613e734e01d6ea8677"
2126
+ dependencies = [
2127
+ "rustfft",
2128
+ ]
2129
+
2130
+ [[package]]
2131
+ name = "reborrow"
2132
+ version = "0.5.5"
2133
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2134
+ checksum = "03251193000f4bd3b042892be858ee50e8b3719f2b08e5833ac4353724632430"
2135
+
2136
+ [[package]]
2137
+ name = "redox_syscall"
2138
+ version = "0.5.13"
2139
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2140
+ checksum = "0d04b7d0ee6b4a0207a0a7adb104d23ecb0b47d6beae7152d0fa34b692b29fd6"
2141
+ dependencies = [
2142
+ "bitflags 2.9.1",
2143
+ ]
2144
+
2145
+ [[package]]
2146
+ name = "redox_users"
2147
+ version = "0.5.0"
2148
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2149
+ checksum = "dd6f9d3d47bdd2ad6945c5015a226ec6155d0bcdfd8f7cd29f86b71f8de99d2b"
2150
+ dependencies = [
2151
+ "getrandom 0.2.16",
2152
+ "libredox",
2153
+ "thiserror 2.0.12",
2154
+ ]
2155
+
2156
+ [[package]]
2157
+ name = "regex"
2158
+ version = "1.11.1"
2159
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2160
+ checksum = "b544ef1b4eac5dc2db33ea63606ae9ffcfac26c1416a2806ae0bf5f56b201191"
2161
+ dependencies = [
2162
+ "aho-corasick",
2163
+ "memchr",
2164
+ "regex-automata",
2165
+ "regex-syntax",
2166
+ ]
2167
+
2168
+ [[package]]
2169
+ name = "regex-automata"
2170
+ version = "0.4.9"
2171
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2172
+ checksum = "809e8dc61f6de73b46c85f4c96486310fe304c434cfa43669d7b40f711150908"
2173
+ dependencies = [
2174
+ "aho-corasick",
2175
+ "memchr",
2176
+ "regex-syntax",
2177
+ ]
2178
+
2179
+ [[package]]
2180
+ name = "regex-syntax"
2181
+ version = "0.8.5"
2182
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2183
+ checksum = "2b15c43186be67a4fd63bee50d0303afffcef381492ebe2c5d87f324e1b8815c"
2184
+
2185
+ [[package]]
2186
+ name = "reqwest"
2187
+ version = "0.12.20"
2188
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2189
+ checksum = "eabf4c97d9130e2bf606614eb937e86edac8292eaa6f422f995d7e8de1eb1813"
2190
+ dependencies = [
2191
+ "base64",
2192
+ "bytes",
2193
+ "encoding_rs",
2194
+ "futures-core",
2195
+ "futures-util",
2196
+ "h2",
2197
+ "http",
2198
+ "http-body",
2199
+ "http-body-util",
2200
+ "hyper",
2201
+ "hyper-rustls",
2202
+ "hyper-tls",
2203
+ "hyper-util",
2204
+ "js-sys",
2205
+ "log",
2206
+ "mime",
2207
+ "native-tls",
2208
+ "percent-encoding",
2209
+ "pin-project-lite",
2210
+ "rustls-pki-types",
2211
+ "serde",
2212
+ "serde_json",
2213
+ "serde_urlencoded",
2214
+ "sync_wrapper",
2215
+ "tokio",
2216
+ "tokio-native-tls",
2217
+ "tokio-util 0.7.15",
2218
+ "tower",
2219
+ "tower-http",
2220
+ "tower-service",
2221
+ "url",
2222
+ "wasm-bindgen",
2223
+ "wasm-bindgen-futures",
2224
+ "wasm-streams",
2225
+ "web-sys",
2226
+ ]
2227
+
2228
+ [[package]]
2229
+ name = "ring"
2230
+ version = "0.17.14"
2231
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2232
+ checksum = "a4689e6c2294d81e88dc6261c768b63bc4fcdb852be6d1352498b114f61383b7"
2233
+ dependencies = [
2234
+ "cc",
2235
+ "cfg-if",
2236
+ "getrandom 0.2.16",
2237
+ "libc",
2238
+ "untrusted",
2239
+ "windows-sys 0.52.0",
2240
+ ]
2241
+
2242
+ [[package]]
2243
+ name = "rubato"
2244
+ version = "0.15.0"
2245
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2246
+ checksum = "b5d18b486e7d29a408ef3f825bc1327d8f87af091c987ca2f5b734625940e234"
2247
+ dependencies = [
2248
+ "num-complex",
2249
+ "num-integer",
2250
+ "num-traits",
2251
+ "realfft",
2252
+ ]
2253
+
2254
+ [[package]]
2255
+ name = "rustc-demangle"
2256
+ version = "0.1.25"
2257
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2258
+ checksum = "989e6739f80c4ad5b13e0fd7fe89531180375b18520cc8c82080e4dc4035b84f"
2259
+
2260
+ [[package]]
2261
+ name = "rustfft"
2262
+ version = "6.4.0"
2263
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2264
+ checksum = "c6f140db74548f7c9d7cce60912c9ac414e74df5e718dc947d514b051b42f3f4"
2265
+ dependencies = [
2266
+ "num-complex",
2267
+ "num-integer",
2268
+ "num-traits",
2269
+ "primal-check",
2270
+ "strength_reduce",
2271
+ "transpose",
2272
+ ]
2273
+
2274
+ [[package]]
2275
+ name = "rustix"
2276
+ version = "1.0.7"
2277
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2278
+ checksum = "c71e83d6afe7ff64890ec6b71d6a69bb8a610ab78ce364b3352876bb4c801266"
2279
+ dependencies = [
2280
+ "bitflags 2.9.1",
2281
+ "errno",
2282
+ "libc",
2283
+ "linux-raw-sys",
2284
+ "windows-sys 0.59.0",
2285
+ ]
2286
+
2287
+ [[package]]
2288
+ name = "rustls"
2289
+ version = "0.23.28"
2290
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2291
+ checksum = "7160e3e10bf4535308537f3c4e1641468cd0e485175d6163087c0393c7d46643"
2292
+ dependencies = [
2293
+ "log",
2294
+ "once_cell",
2295
+ "ring",
2296
+ "rustls-pki-types",
2297
+ "rustls-webpki",
2298
+ "subtle",
2299
+ "zeroize",
2300
+ ]
2301
+
2302
+ [[package]]
2303
+ name = "rustls-pki-types"
2304
+ version = "1.12.0"
2305
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2306
+ checksum = "229a4a4c221013e7e1f1a043678c5cc39fe5171437c88fb47151a21e6f5b5c79"
2307
+ dependencies = [
2308
+ "zeroize",
2309
+ ]
2310
+
2311
+ [[package]]
2312
+ name = "rustls-webpki"
2313
+ version = "0.103.3"
2314
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2315
+ checksum = "e4a72fe2bcf7a6ac6fd7d0b9e5cb68aeb7d4c0a0271730218b3e92d43b4eb435"
2316
+ dependencies = [
2317
+ "ring",
2318
+ "rustls-pki-types",
2319
+ "untrusted",
2320
+ ]
2321
+
2322
+ [[package]]
2323
+ name = "rustversion"
2324
+ version = "1.0.21"
2325
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2326
+ checksum = "8a0d197bd2c9dc6e53b84da9556a69ba4cdfab8619eb41a8bd1cc2027a0f6b1d"
2327
+
2328
+ [[package]]
2329
+ name = "ryu"
2330
+ version = "1.0.20"
2331
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2332
+ checksum = "28d3b2b1366ec20994f1fd18c3c594f05c5dd4bc44d8bb0c1c632c8d6829481f"
2333
+
2334
+ [[package]]
2335
+ name = "safetensors"
2336
+ version = "0.4.5"
2337
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2338
+ checksum = "44560c11236a6130a46ce36c836a62936dc81ebf8c36a37947423571be0e55b6"
2339
+ dependencies = [
2340
+ "serde",
2341
+ "serde_json",
2342
+ ]
2343
+
2344
+ [[package]]
2345
+ name = "same-file"
2346
+ version = "1.0.6"
2347
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2348
+ checksum = "93fc1dc3aaa9bfed95e02e6eadabb4baf7e3078b0bd1b4d7b6b0b68378900502"
2349
+ dependencies = [
2350
+ "winapi-util",
2351
+ ]
2352
+
2353
+ [[package]]
2354
+ name = "schannel"
2355
+ version = "0.1.27"
2356
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2357
+ checksum = "1f29ebaa345f945cec9fbbc532eb307f0fdad8161f281b6369539c8d84876b3d"
2358
+ dependencies = [
2359
+ "windows-sys 0.59.0",
2360
+ ]
2361
+
2362
+ [[package]]
2363
+ name = "scopeguard"
2364
+ version = "1.2.0"
2365
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2366
+ checksum = "94143f37725109f92c262ed2cf5e59bce7498c01bcc1502d7b9afe439a4e9f49"
2367
+
2368
+ [[package]]
2369
+ name = "security-framework"
2370
+ version = "2.11.1"
2371
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2372
+ checksum = "897b2245f0b511c87893af39b033e5ca9cce68824c4d7e7630b5a1d339658d02"
2373
+ dependencies = [
2374
+ "bitflags 2.9.1",
2375
+ "core-foundation",
2376
+ "core-foundation-sys",
2377
+ "libc",
2378
+ "security-framework-sys",
2379
+ ]
2380
+
2381
+ [[package]]
2382
+ name = "security-framework-sys"
2383
+ version = "2.14.0"
2384
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2385
+ checksum = "49db231d56a190491cb4aeda9527f1ad45345af50b0851622a7adb8c03b01c32"
2386
+ dependencies = [
2387
+ "core-foundation-sys",
2388
+ "libc",
2389
+ ]
2390
+
2391
+ [[package]]
2392
+ name = "sentencepiece"
2393
+ version = "0.11.3"
2394
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2395
+ checksum = "286451da14703923eeb9d5e9d7717a15cbf236c037923fb7a6ff911ca45f4124"
2396
+ dependencies = [
2397
+ "libc",
2398
+ "num-derive",
2399
+ "num-traits",
2400
+ "prost",
2401
+ "prost-derive",
2402
+ "sentencepiece-sys",
2403
+ "thiserror 1.0.69",
2404
+ ]
2405
+
2406
+ [[package]]
2407
+ name = "sentencepiece-sys"
2408
+ version = "0.11.3"
2409
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2410
+ checksum = "a721500103a50c284cd3908cca6c435fcc6a260a1cead830a040f904a89234fb"
2411
+ dependencies = [
2412
+ "cc",
2413
+ "cmake",
2414
+ "pkg-config",
2415
+ ]
2416
+
2417
+ [[package]]
2418
+ name = "seq-macro"
2419
+ version = "0.3.6"
2420
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2421
+ checksum = "1bc711410fbe7399f390ca1c3b60ad0f53f80e95c5eb935e52268a0e2cd49acc"
2422
+
2423
+ [[package]]
2424
+ name = "serde"
2425
+ version = "1.0.219"
2426
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2427
+ checksum = "5f0e2c6ed6606019b4e29e69dbaba95b11854410e5347d525002456dbbb786b6"
2428
+ dependencies = [
2429
+ "serde_derive",
2430
+ ]
2431
+
2432
+ [[package]]
2433
+ name = "serde_derive"
2434
+ version = "1.0.219"
2435
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2436
+ checksum = "5b0276cf7f2c73365f7157c8123c21cd9a50fbbd844757af28ca1f5925fc2a00"
2437
+ dependencies = [
2438
+ "proc-macro2",
2439
+ "quote",
2440
+ "syn 2.0.103",
2441
+ ]
2442
+
2443
+ [[package]]
2444
+ name = "serde_json"
2445
+ version = "1.0.140"
2446
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2447
+ checksum = "20068b6e96dc6c9bd23e01df8827e6c7e1f2fddd43c21810382803c136b99373"
2448
+ dependencies = [
2449
+ "itoa",
2450
+ "memchr",
2451
+ "ryu",
2452
+ "serde",
2453
+ ]
2454
+
2455
+ [[package]]
2456
+ name = "serde_plain"
2457
+ version = "1.0.2"
2458
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2459
+ checksum = "9ce1fc6db65a611022b23a0dec6975d63fb80a302cb3388835ff02c097258d50"
2460
+ dependencies = [
2461
+ "serde",
2462
+ ]
2463
+
2464
+ [[package]]
2465
+ name = "serde_urlencoded"
2466
+ version = "0.7.1"
2467
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2468
+ checksum = "d3491c14715ca2294c4d6a88f15e84739788c1d030eed8c110436aafdaa2f3fd"
2469
+ dependencies = [
2470
+ "form_urlencoded",
2471
+ "itoa",
2472
+ "ryu",
2473
+ "serde",
2474
+ ]
2475
+
2476
+ [[package]]
2477
+ name = "shlex"
2478
+ version = "1.3.0"
2479
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2480
+ checksum = "0fda2ff0d084019ba4d7c6f371c95d8fd75ce3524c3cb8fb653a3023f6323e64"
2481
+
2482
+ [[package]]
2483
+ name = "signal-hook-registry"
2484
+ version = "1.4.5"
2485
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2486
+ checksum = "9203b8055f63a2a00e2f593bb0510367fe707d7ff1e5c872de2f537b339e5410"
2487
+ dependencies = [
2488
+ "libc",
2489
+ ]
2490
+
2491
+ [[package]]
2492
+ name = "slab"
2493
+ version = "0.4.10"
2494
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2495
+ checksum = "04dc19736151f35336d325007ac991178d504a119863a2fcb3758cdb5e52c50d"
2496
+
2497
+ [[package]]
2498
+ name = "smallvec"
2499
+ version = "1.15.1"
2500
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2501
+ checksum = "67b1b7a3b5fe4f1376887184045fcf45c69e92af734b7aaddc05fb777b6fbd03"
2502
+
2503
+ [[package]]
2504
+ name = "socket2"
2505
+ version = "0.5.10"
2506
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2507
+ checksum = "e22376abed350d73dd1cd119b57ffccad95b4e585a7cda43e286245ce23c0678"
2508
+ dependencies = [
2509
+ "libc",
2510
+ "windows-sys 0.52.0",
2511
+ ]
2512
+
2513
+ [[package]]
2514
+ name = "socks"
2515
+ version = "0.3.4"
2516
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2517
+ checksum = "f0c3dbbd9ae980613c6dd8e28a9407b50509d3803b57624d5dfe8315218cd58b"
2518
+ dependencies = [
2519
+ "byteorder",
2520
+ "libc",
2521
+ "winapi",
2522
+ ]
2523
+
2524
+ [[package]]
2525
+ name = "stable_deref_trait"
2526
+ version = "1.2.0"
2527
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2528
+ checksum = "a8f112729512f8e442d81f95a8a7ddf2b7c6b8a1a6f509a95864142b30cab2d3"
2529
+
2530
+ [[package]]
2531
+ name = "strength_reduce"
2532
+ version = "0.2.4"
2533
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2534
+ checksum = "fe895eb47f22e2ddd4dabc02bce419d2e643c8e3b585c78158b349195bc24d82"
2535
+
2536
+ [[package]]
2537
+ name = "strsim"
2538
+ version = "0.11.1"
2539
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2540
+ checksum = "7da8b5736845d9f2fcb837ea5d9e2628564b3b043a70948a3f0b778838c5fb4f"
2541
+
2542
+ [[package]]
2543
+ name = "subtle"
2544
+ version = "2.6.1"
2545
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2546
+ checksum = "13c2bddecc57b384dee18652358fb23172facb8a2c51ccc10d74c157bdea3292"
2547
+
2548
+ [[package]]
2549
+ name = "symphonia"
2550
+ version = "0.5.4"
2551
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2552
+ checksum = "815c942ae7ee74737bb00f965fa5b5a2ac2ce7b6c01c0cc169bbeaf7abd5f5a9"
2553
+ dependencies = [
2554
+ "lazy_static",
2555
+ "symphonia-bundle-flac",
2556
+ "symphonia-bundle-mp3",
2557
+ "symphonia-codec-aac",
2558
+ "symphonia-codec-adpcm",
2559
+ "symphonia-codec-alac",
2560
+ "symphonia-codec-pcm",
2561
+ "symphonia-codec-vorbis",
2562
+ "symphonia-core",
2563
+ "symphonia-format-caf",
2564
+ "symphonia-format-isomp4",
2565
+ "symphonia-format-mkv",
2566
+ "symphonia-format-ogg",
2567
+ "symphonia-format-riff",
2568
+ "symphonia-metadata",
2569
+ ]
2570
+
2571
+ [[package]]
2572
+ name = "symphonia-bundle-flac"
2573
+ version = "0.5.4"
2574
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2575
+ checksum = "72e34f34298a7308d4397a6c7fbf5b84c5d491231ce3dd379707ba673ab3bd97"
2576
+ dependencies = [
2577
+ "log",
2578
+ "symphonia-core",
2579
+ "symphonia-metadata",
2580
+ "symphonia-utils-xiph",
2581
+ ]
2582
+
2583
+ [[package]]
2584
+ name = "symphonia-bundle-mp3"
2585
+ version = "0.5.4"
2586
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2587
+ checksum = "c01c2aae70f0f1fb096b6f0ff112a930b1fb3626178fba3ae68b09dce71706d4"
2588
+ dependencies = [
2589
+ "lazy_static",
2590
+ "log",
2591
+ "symphonia-core",
2592
+ "symphonia-metadata",
2593
+ ]
2594
+
2595
+ [[package]]
2596
+ name = "symphonia-codec-aac"
2597
+ version = "0.5.4"
2598
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2599
+ checksum = "cdbf25b545ad0d3ee3e891ea643ad115aff4ca92f6aec472086b957a58522f70"
2600
+ dependencies = [
2601
+ "lazy_static",
2602
+ "log",
2603
+ "symphonia-core",
2604
+ ]
2605
+
2606
+ [[package]]
2607
+ name = "symphonia-codec-adpcm"
2608
+ version = "0.5.4"
2609
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2610
+ checksum = "c94e1feac3327cd616e973d5be69ad36b3945f16b06f19c6773fc3ac0b426a0f"
2611
+ dependencies = [
2612
+ "log",
2613
+ "symphonia-core",
2614
+ ]
2615
+
2616
+ [[package]]
2617
+ name = "symphonia-codec-alac"
2618
+ version = "0.5.4"
2619
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2620
+ checksum = "2d8a6666649a08412906476a8b0efd9b9733e241180189e9f92b09c08d0e38f3"
2621
+ dependencies = [
2622
+ "log",
2623
+ "symphonia-core",
2624
+ ]
2625
+
2626
+ [[package]]
2627
+ name = "symphonia-codec-pcm"
2628
+ version = "0.5.4"
2629
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2630
+ checksum = "f395a67057c2ebc5e84d7bb1be71cce1a7ba99f64e0f0f0e303a03f79116f89b"
2631
+ dependencies = [
2632
+ "log",
2633
+ "symphonia-core",
2634
+ ]
2635
+
2636
+ [[package]]
2637
+ name = "symphonia-codec-vorbis"
2638
+ version = "0.5.4"
2639
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2640
+ checksum = "5a98765fb46a0a6732b007f7e2870c2129b6f78d87db7987e6533c8f164a9f30"
2641
+ dependencies = [
2642
+ "log",
2643
+ "symphonia-core",
2644
+ "symphonia-utils-xiph",
2645
+ ]
2646
+
2647
+ [[package]]
2648
+ name = "symphonia-core"
2649
+ version = "0.5.4"
2650
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2651
+ checksum = "798306779e3dc7d5231bd5691f5a813496dc79d3f56bf82e25789f2094e022c3"
2652
+ dependencies = [
2653
+ "arrayvec",
2654
+ "bitflags 1.3.2",
2655
+ "bytemuck",
2656
+ "lazy_static",
2657
+ "log",
2658
+ ]
2659
+
2660
+ [[package]]
2661
+ name = "symphonia-format-caf"
2662
+ version = "0.5.4"
2663
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2664
+ checksum = "e43c99c696a388295a29fe71b133079f5d8b18041cf734c5459c35ad9097af50"
2665
+ dependencies = [
2666
+ "log",
2667
+ "symphonia-core",
2668
+ "symphonia-metadata",
2669
+ ]
2670
+
2671
+ [[package]]
2672
+ name = "symphonia-format-isomp4"
2673
+ version = "0.5.4"
2674
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2675
+ checksum = "abfdf178d697e50ce1e5d9b982ba1b94c47218e03ec35022d9f0e071a16dc844"
2676
+ dependencies = [
2677
+ "encoding_rs",
2678
+ "log",
2679
+ "symphonia-core",
2680
+ "symphonia-metadata",
2681
+ "symphonia-utils-xiph",
2682
+ ]
2683
+
2684
+ [[package]]
2685
+ name = "symphonia-format-mkv"
2686
+ version = "0.5.4"
2687
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2688
+ checksum = "1bb43471a100f7882dc9937395bd5ebee8329298e766250b15b3875652fe3d6f"
2689
+ dependencies = [
2690
+ "lazy_static",
2691
+ "log",
2692
+ "symphonia-core",
2693
+ "symphonia-metadata",
2694
+ "symphonia-utils-xiph",
2695
+ ]
2696
+
2697
+ [[package]]
2698
+ name = "symphonia-format-ogg"
2699
+ version = "0.5.4"
2700
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2701
+ checksum = "ada3505789516bcf00fc1157c67729eded428b455c27ca370e41f4d785bfa931"
2702
+ dependencies = [
2703
+ "log",
2704
+ "symphonia-core",
2705
+ "symphonia-metadata",
2706
+ "symphonia-utils-xiph",
2707
+ ]
2708
+
2709
+ [[package]]
2710
+ name = "symphonia-format-riff"
2711
+ version = "0.5.4"
2712
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2713
+ checksum = "05f7be232f962f937f4b7115cbe62c330929345434c834359425e043bfd15f50"
2714
+ dependencies = [
2715
+ "extended",
2716
+ "log",
2717
+ "symphonia-core",
2718
+ "symphonia-metadata",
2719
+ ]
2720
+
2721
+ [[package]]
2722
+ name = "symphonia-metadata"
2723
+ version = "0.5.4"
2724
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2725
+ checksum = "bc622b9841a10089c5b18e99eb904f4341615d5aa55bbf4eedde1be721a4023c"
2726
+ dependencies = [
2727
+ "encoding_rs",
2728
+ "lazy_static",
2729
+ "log",
2730
+ "symphonia-core",
2731
+ ]
2732
+
2733
+ [[package]]
2734
+ name = "symphonia-utils-xiph"
2735
+ version = "0.5.4"
2736
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2737
+ checksum = "484472580fa49991afda5f6550ece662237b00c6f562c7d9638d1b086ed010fe"
2738
+ dependencies = [
2739
+ "symphonia-core",
2740
+ "symphonia-metadata",
2741
+ ]
2742
+
2743
+ [[package]]
2744
+ name = "syn"
2745
+ version = "1.0.109"
2746
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2747
+ checksum = "72b64191b275b66ffe2469e8af2c1cfe3bafa67b529ead792a6d0160888b4237"
2748
+ dependencies = [
2749
+ "proc-macro2",
2750
+ "quote",
2751
+ "unicode-ident",
2752
+ ]
2753
+
2754
+ [[package]]
2755
+ name = "syn"
2756
+ version = "2.0.103"
2757
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2758
+ checksum = "e4307e30089d6fd6aff212f2da3a1f9e32f3223b1f010fb09b7c95f90f3ca1e8"
2759
+ dependencies = [
2760
+ "proc-macro2",
2761
+ "quote",
2762
+ "unicode-ident",
2763
+ ]
2764
+
2765
+ [[package]]
2766
+ name = "sync_wrapper"
2767
+ version = "1.0.2"
2768
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2769
+ checksum = "0bf256ce5efdfa370213c1dabab5935a12e49f2c58d15e9eac2870d3b4f27263"
2770
+ dependencies = [
2771
+ "futures-core",
2772
+ ]
2773
+
2774
+ [[package]]
2775
+ name = "synstructure"
2776
+ version = "0.13.2"
2777
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2778
+ checksum = "728a70f3dbaf5bab7f0c4b1ac8d7ae5ea60a4b5549c8a5914361c99147a709d2"
2779
+ dependencies = [
2780
+ "proc-macro2",
2781
+ "quote",
2782
+ "syn 2.0.103",
2783
+ ]
2784
+
2785
+ [[package]]
2786
+ name = "sysctl"
2787
+ version = "0.5.5"
2788
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2789
+ checksum = "ec7dddc5f0fee506baf8b9fdb989e242f17e4b11c61dfbb0635b705217199eea"
2790
+ dependencies = [
2791
+ "bitflags 2.9.1",
2792
+ "byteorder",
2793
+ "enum-as-inner",
2794
+ "libc",
2795
+ "thiserror 1.0.69",
2796
+ "walkdir",
2797
+ ]
2798
+
2799
+ [[package]]
2800
+ name = "sysctl"
2801
+ version = "0.6.0"
2802
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2803
+ checksum = "01198a2debb237c62b6826ec7081082d951f46dbb64b0e8c7649a452230d1dfc"
2804
+ dependencies = [
2805
+ "bitflags 2.9.1",
2806
+ "byteorder",
2807
+ "enum-as-inner",
2808
+ "libc",
2809
+ "thiserror 1.0.69",
2810
+ "walkdir",
2811
+ ]
2812
+
2813
+ [[package]]
2814
+ name = "system-configuration"
2815
+ version = "0.6.1"
2816
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2817
+ checksum = "3c879d448e9d986b661742763247d3693ed13609438cf3d006f51f5368a5ba6b"
2818
+ dependencies = [
2819
+ "bitflags 2.9.1",
2820
+ "core-foundation",
2821
+ "system-configuration-sys",
2822
+ ]
2823
+
2824
+ [[package]]
2825
+ name = "system-configuration-sys"
2826
+ version = "0.6.0"
2827
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2828
+ checksum = "8e1d1b10ced5ca923a1fcb8d03e96b8d3268065d724548c0211415ff6ac6bac4"
2829
+ dependencies = [
2830
+ "core-foundation-sys",
2831
+ "libc",
2832
+ ]
2833
+
2834
+ [[package]]
2835
+ name = "tempfile"
2836
+ version = "3.20.0"
2837
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2838
+ checksum = "e8a64e3985349f2441a1a9ef0b853f869006c3855f2cda6862a94d26ebb9d6a1"
2839
+ dependencies = [
2840
+ "fastrand",
2841
+ "getrandom 0.3.3",
2842
+ "once_cell",
2843
+ "rustix",
2844
+ "windows-sys 0.59.0",
2845
+ ]
2846
+
2847
+ [[package]]
2848
+ name = "thiserror"
2849
+ version = "1.0.69"
2850
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2851
+ checksum = "b6aaf5339b578ea85b50e080feb250a3e8ae8cfcdff9a461c9ec2904bc923f52"
2852
+ dependencies = [
2853
+ "thiserror-impl 1.0.69",
2854
+ ]
2855
+
2856
+ [[package]]
2857
+ name = "thiserror"
2858
+ version = "2.0.12"
2859
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2860
+ checksum = "567b8a2dae586314f7be2a752ec7474332959c6460e02bde30d702a66d488708"
2861
+ dependencies = [
2862
+ "thiserror-impl 2.0.12",
2863
+ ]
2864
+
2865
+ [[package]]
2866
+ name = "thiserror-impl"
2867
+ version = "1.0.69"
2868
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2869
+ checksum = "4fee6c4efc90059e10f81e6d42c60a18f76588c3d74cb83a0b242a2b6c7504c1"
2870
+ dependencies = [
2871
+ "proc-macro2",
2872
+ "quote",
2873
+ "syn 2.0.103",
2874
+ ]
2875
+
2876
+ [[package]]
2877
+ name = "thiserror-impl"
2878
+ version = "2.0.12"
2879
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2880
+ checksum = "7f7cf42b4507d8ea322120659672cf1b9dbb93f8f2d4ecfd6e51350ff5b17a1d"
2881
+ dependencies = [
2882
+ "proc-macro2",
2883
+ "quote",
2884
+ "syn 2.0.103",
2885
+ ]
2886
+
2887
+ [[package]]
2888
+ name = "tinystr"
2889
+ version = "0.8.1"
2890
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2891
+ checksum = "5d4f6d1145dcb577acf783d4e601bc1d76a13337bb54e6233add580b07344c8b"
2892
+ dependencies = [
2893
+ "displaydoc",
2894
+ "zerovec",
2895
+ ]
2896
+
2897
+ [[package]]
2898
+ name = "tokio"
2899
+ version = "1.45.1"
2900
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2901
+ checksum = "75ef51a33ef1da925cea3e4eb122833cb377c61439ca401b770f54902b806779"
2902
+ dependencies = [
2903
+ "backtrace",
2904
+ "bytes",
2905
+ "libc",
2906
+ "mio",
2907
+ "parking_lot",
2908
+ "pin-project-lite",
2909
+ "signal-hook-registry",
2910
+ "socket2",
2911
+ "tokio-macros",
2912
+ "windows-sys 0.52.0",
2913
+ ]
2914
+
2915
+ [[package]]
2916
+ name = "tokio-macros"
2917
+ version = "2.5.0"
2918
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2919
+ checksum = "6e06d43f1345a3bcd39f6a56dbb7dcab2ba47e68e8ac134855e7e2bdbaf8cab8"
2920
+ dependencies = [
2921
+ "proc-macro2",
2922
+ "quote",
2923
+ "syn 2.0.103",
2924
+ ]
2925
+
2926
+ [[package]]
2927
+ name = "tokio-native-tls"
2928
+ version = "0.3.1"
2929
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2930
+ checksum = "bbae76ab933c85776efabc971569dd6119c580d8f5d448769dec1764bf796ef2"
2931
+ dependencies = [
2932
+ "native-tls",
2933
+ "tokio",
2934
+ ]
2935
+
2936
+ [[package]]
2937
+ name = "tokio-rustls"
2938
+ version = "0.26.2"
2939
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2940
+ checksum = "8e727b36a1a0e8b74c376ac2211e40c2c8af09fb4013c60d910495810f008e9b"
2941
+ dependencies = [
2942
+ "rustls",
2943
+ "tokio",
2944
+ ]
2945
+
2946
+ [[package]]
2947
+ name = "tokio-util"
2948
+ version = "0.6.10"
2949
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2950
+ checksum = "36943ee01a6d67977dd3f84a5a1d2efeb4ada3a1ae771cadfaa535d9d9fc6507"
2951
+ dependencies = [
2952
+ "bytes",
2953
+ "futures-core",
2954
+ "futures-io",
2955
+ "futures-sink",
2956
+ "log",
2957
+ "pin-project-lite",
2958
+ "tokio",
2959
+ ]
2960
+
2961
+ [[package]]
2962
+ name = "tokio-util"
2963
+ version = "0.7.15"
2964
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2965
+ checksum = "66a539a9ad6d5d281510d5bd368c973d636c02dbf8a67300bfb6b950696ad7df"
2966
+ dependencies = [
2967
+ "bytes",
2968
+ "futures-core",
2969
+ "futures-sink",
2970
+ "pin-project-lite",
2971
+ "tokio",
2972
+ ]
2973
+
2974
+ [[package]]
2975
+ name = "toml_datetime"
2976
+ version = "0.6.11"
2977
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2978
+ checksum = "22cddaf88f4fbc13c51aebbf5f8eceb5c7c5a9da2ac40a13519eb5b0a0e8f11c"
2979
+
2980
+ [[package]]
2981
+ name = "toml_edit"
2982
+ version = "0.22.27"
2983
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2984
+ checksum = "41fe8c660ae4257887cf66394862d21dbca4a6ddd26f04a3560410406a2f819a"
2985
+ dependencies = [
2986
+ "indexmap",
2987
+ "toml_datetime",
2988
+ "winnow",
2989
+ ]
2990
+
2991
+ [[package]]
2992
+ name = "tower"
2993
+ version = "0.5.2"
2994
+ source = "registry+https://github.com/rust-lang/crates.io-index"
2995
+ checksum = "d039ad9159c98b70ecfd540b2573b97f7f52c3e8d9f8ad57a24b916a536975f9"
2996
+ dependencies = [
2997
+ "futures-core",
2998
+ "futures-util",
2999
+ "pin-project-lite",
3000
+ "sync_wrapper",
3001
+ "tokio",
3002
+ "tower-layer",
3003
+ "tower-service",
3004
+ ]
3005
+
3006
+ [[package]]
3007
+ name = "tower-http"
3008
+ version = "0.6.6"
3009
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3010
+ checksum = "adc82fd73de2a9722ac5da747f12383d2bfdb93591ee6c58486e0097890f05f2"
3011
+ dependencies = [
3012
+ "bitflags 2.9.1",
3013
+ "bytes",
3014
+ "futures-util",
3015
+ "http",
3016
+ "http-body",
3017
+ "iri-string",
3018
+ "pin-project-lite",
3019
+ "tower",
3020
+ "tower-layer",
3021
+ "tower-service",
3022
+ ]
3023
+
3024
+ [[package]]
3025
+ name = "tower-layer"
3026
+ version = "0.3.3"
3027
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3028
+ checksum = "121c2a6cda46980bb0fcd1647ffaf6cd3fc79a013de288782836f6df9c48780e"
3029
+
3030
+ [[package]]
3031
+ name = "tower-service"
3032
+ version = "0.3.3"
3033
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3034
+ checksum = "8df9b6e13f2d32c91b9bd719c00d1958837bc7dec474d94952798cc8e69eeec3"
3035
+
3036
+ [[package]]
3037
+ name = "tracing"
3038
+ version = "0.1.41"
3039
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3040
+ checksum = "784e0ac535deb450455cbfa28a6f0df145ea1bb7ae51b821cf5e7927fdcfbdd0"
3041
+ dependencies = [
3042
+ "pin-project-lite",
3043
+ "tracing-attributes",
3044
+ "tracing-core",
3045
+ ]
3046
+
3047
+ [[package]]
3048
+ name = "tracing-attributes"
3049
+ version = "0.1.30"
3050
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3051
+ checksum = "81383ab64e72a7a8b8e13130c49e3dab29def6d0c7d76a03087b3cf71c5c6903"
3052
+ dependencies = [
3053
+ "proc-macro2",
3054
+ "quote",
3055
+ "syn 2.0.103",
3056
+ ]
3057
+
3058
+ [[package]]
3059
+ name = "tracing-core"
3060
+ version = "0.1.34"
3061
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3062
+ checksum = "b9d12581f227e93f094d3af2ae690a574abb8a2b9b7a96e7cfe9647b2b617678"
3063
+ dependencies = [
3064
+ "once_cell",
3065
+ ]
3066
+
3067
+ [[package]]
3068
+ name = "transpose"
3069
+ version = "0.2.3"
3070
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3071
+ checksum = "1ad61aed86bc3faea4300c7aee358b4c6d0c8d6ccc36524c96e4c92ccf26e77e"
3072
+ dependencies = [
3073
+ "num-integer",
3074
+ "strength_reduce",
3075
+ ]
3076
+
3077
+ [[package]]
3078
+ name = "try-lock"
3079
+ version = "0.2.5"
3080
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3081
+ checksum = "e421abadd41a4225275504ea4d6566923418b7f05506fbc9c0fe86ba7396114b"
3082
+
3083
+ [[package]]
3084
+ name = "ug"
3085
+ version = "0.4.0"
3086
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3087
+ checksum = "90b70b37e9074642bc5f60bb23247fd072a84314ca9e71cdf8527593406a0dd3"
3088
+ dependencies = [
3089
+ "gemm 0.18.2",
3090
+ "half",
3091
+ "libloading",
3092
+ "memmap2",
3093
+ "num",
3094
+ "num-traits",
3095
+ "num_cpus",
3096
+ "rayon",
3097
+ "safetensors",
3098
+ "serde",
3099
+ "thiserror 1.0.69",
3100
+ "tracing",
3101
+ "yoke 0.7.5",
3102
+ ]
3103
+
3104
+ [[package]]
3105
+ name = "ug-cuda"
3106
+ version = "0.4.0"
3107
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3108
+ checksum = "14053653d0b7fa7b21015aa9a62edc8af2f60aa6f9c54e66386ecce55f22ed29"
3109
+ dependencies = [
3110
+ "cudarc",
3111
+ "half",
3112
+ "serde",
3113
+ "thiserror 1.0.69",
3114
+ "ug",
3115
+ ]
3116
+
3117
+ [[package]]
3118
+ name = "ug-metal"
3119
+ version = "0.4.0"
3120
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3121
+ checksum = "76daec3c7a32a1b4a0e3307b6b057fa067aa64e750713987410a2c402e5cd731"
3122
+ dependencies = [
3123
+ "half",
3124
+ "metal 0.29.0",
3125
+ "objc",
3126
+ "serde",
3127
+ "thiserror 1.0.69",
3128
+ "ug",
3129
+ ]
3130
+
3131
+ [[package]]
3132
+ name = "unicode-ident"
3133
+ version = "1.0.18"
3134
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3135
+ checksum = "5a5f39404a5da50712a4c1eecf25e90dd62b613502b7e925fd4e4d19b5c96512"
3136
+
3137
+ [[package]]
3138
+ name = "unicode-width"
3139
+ version = "0.2.1"
3140
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3141
+ checksum = "4a1a07cc7db3810833284e8d372ccdc6da29741639ecc70c9ec107df0fa6154c"
3142
+
3143
+ [[package]]
3144
+ name = "untrusted"
3145
+ version = "0.9.0"
3146
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3147
+ checksum = "8ecb6da28b8a351d773b68d5825ac39017e680750f980f3a1a85cd8dd28a47c1"
3148
+
3149
+ [[package]]
3150
+ name = "ureq"
3151
+ version = "2.12.1"
3152
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3153
+ checksum = "02d1a66277ed75f640d608235660df48c8e3c19f3b4edb6a263315626cc3c01d"
3154
+ dependencies = [
3155
+ "base64",
3156
+ "flate2",
3157
+ "log",
3158
+ "native-tls",
3159
+ "once_cell",
3160
+ "rustls",
3161
+ "rustls-pki-types",
3162
+ "serde",
3163
+ "serde_json",
3164
+ "socks",
3165
+ "url",
3166
+ "webpki-roots 0.26.11",
3167
+ ]
3168
+
3169
+ [[package]]
3170
+ name = "url"
3171
+ version = "2.5.4"
3172
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3173
+ checksum = "32f8b686cadd1473f4bd0117a5d28d36b1ade384ea9b5069a1c40aefed7fda60"
3174
+ dependencies = [
3175
+ "form_urlencoded",
3176
+ "idna",
3177
+ "percent-encoding",
3178
+ ]
3179
+
3180
+ [[package]]
3181
+ name = "utf8_iter"
3182
+ version = "1.0.4"
3183
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3184
+ checksum = "b6c140620e7ffbb22c2dee59cafe6084a59b5ffc27a8859a5f0d494b5d52b6be"
3185
+
3186
+ [[package]]
3187
+ name = "utf8parse"
3188
+ version = "0.2.2"
3189
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3190
+ checksum = "06abde3611657adf66d383f00b093d7faecc7fa57071cce2578660c9f1010821"
3191
+
3192
+ [[package]]
3193
+ name = "vcpkg"
3194
+ version = "0.2.15"
3195
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3196
+ checksum = "accd4ea62f7bb7a82fe23066fb0957d48ef677f6eeb8215f372f52e48bb32426"
3197
+
3198
+ [[package]]
3199
+ name = "version_check"
3200
+ version = "0.9.5"
3201
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3202
+ checksum = "0b928f33d975fc6ad9f86c8f283853ad26bdd5b10b7f1542aa2fa15e2289105a"
3203
+
3204
+ [[package]]
3205
+ name = "walkdir"
3206
+ version = "2.5.0"
3207
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3208
+ checksum = "29790946404f91d9c5d06f9874efddea1dc06c5efe94541a7d6863108e3a5e4b"
3209
+ dependencies = [
3210
+ "same-file",
3211
+ "winapi-util",
3212
+ ]
3213
+
3214
+ [[package]]
3215
+ name = "want"
3216
+ version = "0.3.1"
3217
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3218
+ checksum = "bfa7760aed19e106de2c7c0b581b509f2f25d3dacaf737cb82ac61bc6d760b0e"
3219
+ dependencies = [
3220
+ "try-lock",
3221
+ ]
3222
+
3223
+ [[package]]
3224
+ name = "wasi"
3225
+ version = "0.11.1+wasi-snapshot-preview1"
3226
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3227
+ checksum = "ccf3ec651a847eb01de73ccad15eb7d99f80485de043efb2f370cd654f4ea44b"
3228
+
3229
+ [[package]]
3230
+ name = "wasi"
3231
+ version = "0.14.2+wasi-0.2.4"
3232
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3233
+ checksum = "9683f9a5a998d873c0d21fcbe3c083009670149a8fab228644b8bd36b2c48cb3"
3234
+ dependencies = [
3235
+ "wit-bindgen-rt",
3236
+ ]
3237
+
3238
+ [[package]]
3239
+ name = "wasm-bindgen"
3240
+ version = "0.2.100"
3241
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3242
+ checksum = "1edc8929d7499fc4e8f0be2262a241556cfc54a0bea223790e71446f2aab1ef5"
3243
+ dependencies = [
3244
+ "cfg-if",
3245
+ "once_cell",
3246
+ "rustversion",
3247
+ "wasm-bindgen-macro",
3248
+ ]
3249
+
3250
+ [[package]]
3251
+ name = "wasm-bindgen-backend"
3252
+ version = "0.2.100"
3253
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3254
+ checksum = "2f0a0651a5c2bc21487bde11ee802ccaf4c51935d0d3d42a6101f98161700bc6"
3255
+ dependencies = [
3256
+ "bumpalo",
3257
+ "log",
3258
+ "proc-macro2",
3259
+ "quote",
3260
+ "syn 2.0.103",
3261
+ "wasm-bindgen-shared",
3262
+ ]
3263
+
3264
+ [[package]]
3265
+ name = "wasm-bindgen-futures"
3266
+ version = "0.4.50"
3267
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3268
+ checksum = "555d470ec0bc3bb57890405e5d4322cc9ea83cebb085523ced7be4144dac1e61"
3269
+ dependencies = [
3270
+ "cfg-if",
3271
+ "js-sys",
3272
+ "once_cell",
3273
+ "wasm-bindgen",
3274
+ "web-sys",
3275
+ ]
3276
+
3277
+ [[package]]
3278
+ name = "wasm-bindgen-macro"
3279
+ version = "0.2.100"
3280
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3281
+ checksum = "7fe63fc6d09ed3792bd0897b314f53de8e16568c2b3f7982f468c0bf9bd0b407"
3282
+ dependencies = [
3283
+ "quote",
3284
+ "wasm-bindgen-macro-support",
3285
+ ]
3286
+
3287
+ [[package]]
3288
+ name = "wasm-bindgen-macro-support"
3289
+ version = "0.2.100"
3290
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3291
+ checksum = "8ae87ea40c9f689fc23f209965b6fb8a99ad69aeeb0231408be24920604395de"
3292
+ dependencies = [
3293
+ "proc-macro2",
3294
+ "quote",
3295
+ "syn 2.0.103",
3296
+ "wasm-bindgen-backend",
3297
+ "wasm-bindgen-shared",
3298
+ ]
3299
+
3300
+ [[package]]
3301
+ name = "wasm-bindgen-shared"
3302
+ version = "0.2.100"
3303
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3304
+ checksum = "1a05d73b933a847d6cccdda8f838a22ff101ad9bf93e33684f39c1f5f0eece3d"
3305
+ dependencies = [
3306
+ "unicode-ident",
3307
+ ]
3308
+
3309
+ [[package]]
3310
+ name = "wasm-streams"
3311
+ version = "0.4.2"
3312
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3313
+ checksum = "15053d8d85c7eccdbefef60f06769760a563c7f0a9d6902a13d35c7800b0ad65"
3314
+ dependencies = [
3315
+ "futures-util",
3316
+ "js-sys",
3317
+ "wasm-bindgen",
3318
+ "wasm-bindgen-futures",
3319
+ "web-sys",
3320
+ ]
3321
+
3322
+ [[package]]
3323
+ name = "web-sys"
3324
+ version = "0.3.77"
3325
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3326
+ checksum = "33b6dd2ef9186f1f2072e409e99cd22a975331a6b3591b12c764e0e55c60d5d2"
3327
+ dependencies = [
3328
+ "js-sys",
3329
+ "wasm-bindgen",
3330
+ ]
3331
+
3332
+ [[package]]
3333
+ name = "web-time"
3334
+ version = "1.1.0"
3335
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3336
+ checksum = "5a6580f308b1fad9207618087a65c04e7a10bc77e02c8e84e9b00dd4b12fa0bb"
3337
+ dependencies = [
3338
+ "js-sys",
3339
+ "wasm-bindgen",
3340
+ ]
3341
+
3342
+ [[package]]
3343
+ name = "webpki-roots"
3344
+ version = "0.26.11"
3345
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3346
+ checksum = "521bc38abb08001b01866da9f51eb7c5d647a19260e00054a8c7fd5f9e57f7a9"
3347
+ dependencies = [
3348
+ "webpki-roots 1.0.0",
3349
+ ]
3350
+
3351
+ [[package]]
3352
+ name = "webpki-roots"
3353
+ version = "1.0.0"
3354
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3355
+ checksum = "2853738d1cc4f2da3a225c18ec6c3721abb31961096e9dbf5ab35fa88b19cfdb"
3356
+ dependencies = [
3357
+ "rustls-pki-types",
3358
+ ]
3359
+
3360
+ [[package]]
3361
+ name = "winapi"
3362
+ version = "0.3.9"
3363
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3364
+ checksum = "5c839a674fcd7a98952e593242ea400abe93992746761e38641405d28b00f419"
3365
+ dependencies = [
3366
+ "winapi-i686-pc-windows-gnu",
3367
+ "winapi-x86_64-pc-windows-gnu",
3368
+ ]
3369
+
3370
+ [[package]]
3371
+ name = "winapi-i686-pc-windows-gnu"
3372
+ version = "0.4.0"
3373
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3374
+ checksum = "ac3b87c63620426dd9b991e5ce0329eff545bccbbb34f3be09ff6fb6ab51b7b6"
3375
+
3376
+ [[package]]
3377
+ name = "winapi-util"
3378
+ version = "0.1.9"
3379
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3380
+ checksum = "cf221c93e13a30d793f7645a0e7762c55d169dbb0a49671918a2319d289b10bb"
3381
+ dependencies = [
3382
+ "windows-sys 0.59.0",
3383
+ ]
3384
+
3385
+ [[package]]
3386
+ name = "winapi-x86_64-pc-windows-gnu"
3387
+ version = "0.4.0"
3388
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3389
+ checksum = "712e227841d057c1ee1cd2fb22fa7e5a5461ae8e48fa2ca79ec42cfc1931183f"
3390
+
3391
+ [[package]]
3392
+ name = "windows-link"
3393
+ version = "0.1.3"
3394
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3395
+ checksum = "5e6ad25900d524eaabdbbb96d20b4311e1e7ae1699af4fb28c17ae66c80d798a"
3396
+
3397
+ [[package]]
3398
+ name = "windows-registry"
3399
+ version = "0.5.3"
3400
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3401
+ checksum = "5b8a9ed28765efc97bbc954883f4e6796c33a06546ebafacbabee9696967499e"
3402
+ dependencies = [
3403
+ "windows-link",
3404
+ "windows-result",
3405
+ "windows-strings",
3406
+ ]
3407
+
3408
+ [[package]]
3409
+ name = "windows-result"
3410
+ version = "0.3.4"
3411
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3412
+ checksum = "56f42bd332cc6c8eac5af113fc0c1fd6a8fd2aa08a0119358686e5160d0586c6"
3413
+ dependencies = [
3414
+ "windows-link",
3415
+ ]
3416
+
3417
+ [[package]]
3418
+ name = "windows-strings"
3419
+ version = "0.4.2"
3420
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3421
+ checksum = "56e6c93f3a0c3b36176cb1327a4958a0353d5d166c2a35cb268ace15e91d3b57"
3422
+ dependencies = [
3423
+ "windows-link",
3424
+ ]
3425
+
3426
+ [[package]]
3427
+ name = "windows-sys"
3428
+ version = "0.52.0"
3429
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3430
+ checksum = "282be5f36a8ce781fad8c8ae18fa3f9beff57ec1b52cb3de0789201425d9a33d"
3431
+ dependencies = [
3432
+ "windows-targets 0.52.6",
3433
+ ]
3434
+
3435
+ [[package]]
3436
+ name = "windows-sys"
3437
+ version = "0.59.0"
3438
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3439
+ checksum = "1e38bc4d79ed67fd075bcc251a1c39b32a1776bbe92e5bef1f0bf1f8c531853b"
3440
+ dependencies = [
3441
+ "windows-targets 0.52.6",
3442
+ ]
3443
+
3444
+ [[package]]
3445
+ name = "windows-sys"
3446
+ version = "0.60.2"
3447
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3448
+ checksum = "f2f500e4d28234f72040990ec9d39e3a6b950f9f22d3dba18416c35882612bcb"
3449
+ dependencies = [
3450
+ "windows-targets 0.53.2",
3451
+ ]
3452
+
3453
+ [[package]]
3454
+ name = "windows-targets"
3455
+ version = "0.52.6"
3456
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3457
+ checksum = "9b724f72796e036ab90c1021d4780d4d3d648aca59e491e6b98e725b84e99973"
3458
+ dependencies = [
3459
+ "windows_aarch64_gnullvm 0.52.6",
3460
+ "windows_aarch64_msvc 0.52.6",
3461
+ "windows_i686_gnu 0.52.6",
3462
+ "windows_i686_gnullvm 0.52.6",
3463
+ "windows_i686_msvc 0.52.6",
3464
+ "windows_x86_64_gnu 0.52.6",
3465
+ "windows_x86_64_gnullvm 0.52.6",
3466
+ "windows_x86_64_msvc 0.52.6",
3467
+ ]
3468
+
3469
+ [[package]]
3470
+ name = "windows-targets"
3471
+ version = "0.53.2"
3472
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3473
+ checksum = "c66f69fcc9ce11da9966ddb31a40968cad001c5bedeb5c2b82ede4253ab48aef"
3474
+ dependencies = [
3475
+ "windows_aarch64_gnullvm 0.53.0",
3476
+ "windows_aarch64_msvc 0.53.0",
3477
+ "windows_i686_gnu 0.53.0",
3478
+ "windows_i686_gnullvm 0.53.0",
3479
+ "windows_i686_msvc 0.53.0",
3480
+ "windows_x86_64_gnu 0.53.0",
3481
+ "windows_x86_64_gnullvm 0.53.0",
3482
+ "windows_x86_64_msvc 0.53.0",
3483
+ ]
3484
+
3485
+ [[package]]
3486
+ name = "windows_aarch64_gnullvm"
3487
+ version = "0.52.6"
3488
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3489
+ checksum = "32a4622180e7a0ec044bb555404c800bc9fd9ec262ec147edd5989ccd0c02cd3"
3490
+
3491
+ [[package]]
3492
+ name = "windows_aarch64_gnullvm"
3493
+ version = "0.53.0"
3494
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3495
+ checksum = "86b8d5f90ddd19cb4a147a5fa63ca848db3df085e25fee3cc10b39b6eebae764"
3496
+
3497
+ [[package]]
3498
+ name = "windows_aarch64_msvc"
3499
+ version = "0.52.6"
3500
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3501
+ checksum = "09ec2a7bb152e2252b53fa7803150007879548bc709c039df7627cabbd05d469"
3502
+
3503
+ [[package]]
3504
+ name = "windows_aarch64_msvc"
3505
+ version = "0.53.0"
3506
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3507
+ checksum = "c7651a1f62a11b8cbd5e0d42526e55f2c99886c77e007179efff86c2b137e66c"
3508
+
3509
+ [[package]]
3510
+ name = "windows_i686_gnu"
3511
+ version = "0.52.6"
3512
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3513
+ checksum = "8e9b5ad5ab802e97eb8e295ac6720e509ee4c243f69d781394014ebfe8bbfa0b"
3514
+
3515
+ [[package]]
3516
+ name = "windows_i686_gnu"
3517
+ version = "0.53.0"
3518
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3519
+ checksum = "c1dc67659d35f387f5f6c479dc4e28f1d4bb90ddd1a5d3da2e5d97b42d6272c3"
3520
+
3521
+ [[package]]
3522
+ name = "windows_i686_gnullvm"
3523
+ version = "0.52.6"
3524
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3525
+ checksum = "0eee52d38c090b3caa76c563b86c3a4bd71ef1a819287c19d586d7334ae8ed66"
3526
+
3527
+ [[package]]
3528
+ name = "windows_i686_gnullvm"
3529
+ version = "0.53.0"
3530
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3531
+ checksum = "9ce6ccbdedbf6d6354471319e781c0dfef054c81fbc7cf83f338a4296c0cae11"
3532
+
3533
+ [[package]]
3534
+ name = "windows_i686_msvc"
3535
+ version = "0.52.6"
3536
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3537
+ checksum = "240948bc05c5e7c6dabba28bf89d89ffce3e303022809e73deaefe4f6ec56c66"
3538
+
3539
+ [[package]]
3540
+ name = "windows_i686_msvc"
3541
+ version = "0.53.0"
3542
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3543
+ checksum = "581fee95406bb13382d2f65cd4a908ca7b1e4c2f1917f143ba16efe98a589b5d"
3544
+
3545
+ [[package]]
3546
+ name = "windows_x86_64_gnu"
3547
+ version = "0.52.6"
3548
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3549
+ checksum = "147a5c80aabfbf0c7d901cb5895d1de30ef2907eb21fbbab29ca94c5b08b1a78"
3550
+
3551
+ [[package]]
3552
+ name = "windows_x86_64_gnu"
3553
+ version = "0.53.0"
3554
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3555
+ checksum = "2e55b5ac9ea33f2fc1716d1742db15574fd6fc8dadc51caab1c16a3d3b4190ba"
3556
+
3557
+ [[package]]
3558
+ name = "windows_x86_64_gnullvm"
3559
+ version = "0.52.6"
3560
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3561
+ checksum = "24d5b23dc417412679681396f2b49f3de8c1473deb516bd34410872eff51ed0d"
3562
+
3563
+ [[package]]
3564
+ name = "windows_x86_64_gnullvm"
3565
+ version = "0.53.0"
3566
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3567
+ checksum = "0a6e035dd0599267ce1ee132e51c27dd29437f63325753051e71dd9e42406c57"
3568
+
3569
+ [[package]]
3570
+ name = "windows_x86_64_msvc"
3571
+ version = "0.52.6"
3572
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3573
+ checksum = "589f6da84c646204747d1270a2a5661ea66ed1cced2631d546fdfb155959f9ec"
3574
+
3575
+ [[package]]
3576
+ name = "windows_x86_64_msvc"
3577
+ version = "0.53.0"
3578
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3579
+ checksum = "271414315aff87387382ec3d271b52d7ae78726f5d44ac98b4f4030c91880486"
3580
+
3581
+ [[package]]
3582
+ name = "winnow"
3583
+ version = "0.7.11"
3584
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3585
+ checksum = "74c7b26e3480b707944fc872477815d29a8e429d2f93a1ce000f5fa84a15cbcd"
3586
+ dependencies = [
3587
+ "memchr",
3588
+ ]
3589
+
3590
+ [[package]]
3591
+ name = "wit-bindgen-rt"
3592
+ version = "0.39.0"
3593
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3594
+ checksum = "6f42320e61fe2cfd34354ecb597f86f413484a798ba44a8ca1165c58d42da6c1"
3595
+ dependencies = [
3596
+ "bitflags 2.9.1",
3597
+ ]
3598
+
3599
+ [[package]]
3600
+ name = "writeable"
3601
+ version = "0.6.1"
3602
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3603
+ checksum = "ea2f10b9bb0928dfb1b42b65e1f9e36f7f54dbdf08457afefb38afcdec4fa2bb"
3604
+
3605
+ [[package]]
3606
+ name = "yoke"
3607
+ version = "0.7.5"
3608
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3609
+ checksum = "120e6aef9aa629e3d4f52dc8cc43a015c7724194c97dfaf45180d2daf2b77f40"
3610
+ dependencies = [
3611
+ "serde",
3612
+ "stable_deref_trait",
3613
+ "yoke-derive 0.7.5",
3614
+ "zerofrom",
3615
+ ]
3616
+
3617
+ [[package]]
3618
+ name = "yoke"
3619
+ version = "0.8.0"
3620
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3621
+ checksum = "5f41bb01b8226ef4bfd589436a297c53d118f65921786300e427be8d487695cc"
3622
+ dependencies = [
3623
+ "serde",
3624
+ "stable_deref_trait",
3625
+ "yoke-derive 0.8.0",
3626
+ "zerofrom",
3627
+ ]
3628
+
3629
+ [[package]]
3630
+ name = "yoke-derive"
3631
+ version = "0.7.5"
3632
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3633
+ checksum = "2380878cad4ac9aac1e2435f3eb4020e8374b5f13c296cb75b4620ff8e229154"
3634
+ dependencies = [
3635
+ "proc-macro2",
3636
+ "quote",
3637
+ "syn 2.0.103",
3638
+ "synstructure",
3639
+ ]
3640
+
3641
+ [[package]]
3642
+ name = "yoke-derive"
3643
+ version = "0.8.0"
3644
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3645
+ checksum = "38da3c9736e16c5d3c8c597a9aaa5d1fa565d0532ae05e27c24aa62fb32c0ab6"
3646
+ dependencies = [
3647
+ "proc-macro2",
3648
+ "quote",
3649
+ "syn 2.0.103",
3650
+ "synstructure",
3651
+ ]
3652
+
3653
+ [[package]]
3654
+ name = "zerocopy"
3655
+ version = "0.8.26"
3656
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3657
+ checksum = "1039dd0d3c310cf05de012d8a39ff557cb0d23087fd44cad61df08fc31907a2f"
3658
+ dependencies = [
3659
+ "zerocopy-derive",
3660
+ ]
3661
+
3662
+ [[package]]
3663
+ name = "zerocopy-derive"
3664
+ version = "0.8.26"
3665
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3666
+ checksum = "9ecf5b4cc5364572d7f4c329661bcc82724222973f2cab6f050a4e5c22f75181"
3667
+ dependencies = [
3668
+ "proc-macro2",
3669
+ "quote",
3670
+ "syn 2.0.103",
3671
+ ]
3672
+
3673
+ [[package]]
3674
+ name = "zerofrom"
3675
+ version = "0.1.6"
3676
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3677
+ checksum = "50cc42e0333e05660c3587f3bf9d0478688e15d870fab3346451ce7f8c9fbea5"
3678
+ dependencies = [
3679
+ "zerofrom-derive",
3680
+ ]
3681
+
3682
+ [[package]]
3683
+ name = "zerofrom-derive"
3684
+ version = "0.1.6"
3685
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3686
+ checksum = "d71e5d6e06ab090c67b5e44993ec16b72dcbaabc526db883a360057678b48502"
3687
+ dependencies = [
3688
+ "proc-macro2",
3689
+ "quote",
3690
+ "syn 2.0.103",
3691
+ "synstructure",
3692
+ ]
3693
+
3694
+ [[package]]
3695
+ name = "zeroize"
3696
+ version = "1.8.1"
3697
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3698
+ checksum = "ced3678a2879b30306d323f4542626697a464a97c0a07c9aebf7ebca65cd4dde"
3699
+
3700
+ [[package]]
3701
+ name = "zerotrie"
3702
+ version = "0.2.2"
3703
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3704
+ checksum = "36f0bbd478583f79edad978b407914f61b2972f5af6fa089686016be8f9af595"
3705
+ dependencies = [
3706
+ "displaydoc",
3707
+ "yoke 0.8.0",
3708
+ "zerofrom",
3709
+ ]
3710
+
3711
+ [[package]]
3712
+ name = "zerovec"
3713
+ version = "0.11.2"
3714
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3715
+ checksum = "4a05eb080e015ba39cc9e23bbe5e7fb04d5fb040350f99f34e338d5fdd294428"
3716
+ dependencies = [
3717
+ "yoke 0.8.0",
3718
+ "zerofrom",
3719
+ "zerovec-derive",
3720
+ ]
3721
+
3722
+ [[package]]
3723
+ name = "zerovec-derive"
3724
+ version = "0.11.1"
3725
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3726
+ checksum = "5b96237efa0c878c64bd89c436f661be4e46b2f3eff1ebb976f7ef2321d2f58f"
3727
+ dependencies = [
3728
+ "proc-macro2",
3729
+ "quote",
3730
+ "syn 2.0.103",
3731
+ ]
3732
+
3733
+ [[package]]
3734
+ name = "zip"
3735
+ version = "1.1.4"
3736
+ source = "registry+https://github.com/rust-lang/crates.io-index"
3737
+ checksum = "9cc23c04387f4da0374be4533ad1208cbb091d5c11d070dfef13676ad6497164"
3738
+ dependencies = [
3739
+ "arbitrary",
3740
+ "crc32fast",
3741
+ "crossbeam-utils",
3742
+ "displaydoc",
3743
+ "indexmap",
3744
+ "num_enum",
3745
+ "thiserror 1.0.69",
3746
+ ]
stt-rs/Cargo.toml ADDED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [package]
2
+ name = "kyutai-stt-rs"
3
+ version = "0.1.0"
4
+ edition = "2024"
5
+
6
+ [dependencies]
7
+ anyhow = "1.0"
8
+ candle = { version = "0.9.1", package = "candle-core" }
9
+ candle-nn = "0.9.1"
10
+ candle-transformers = "0.9.1"
11
+ clap = { version = "4.4.12", features = ["derive"] }
12
+ hf-hub = "0.4.3"
13
+ kaudio = "0.2.1"
14
+ moshi = "0.6.1"
15
+ sentencepiece = "0.11.3"
16
+ serde = { version = "1.0.210", features = ["derive"] }
17
+ serde_json = "1.0.115"
18
+
19
+ [features]
20
+ default = []
21
+ cuda = ["candle/cuda", "candle-nn/cuda"]
22
+ cudnn = ["candle/cudnn", "candle-nn/cudnn"]
23
+ metal = ["candle/metal", "candle-nn/metal"]
24
+
25
+ [profile.release]
26
+ debug = true
27
+
28
+ [profile.release-no-debug]
29
+ inherits = "release"
30
+ debug = false
31
+
stt-rs/src/main.rs ADDED
@@ -0,0 +1,260 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ // Copyright (c) Kyutai, all rights reserved.
2
+ // This source code is licensed under the license found in the
3
+ // LICENSE file in the root directory of this source tree.
4
+
5
+ use anyhow::Result;
6
+ use candle::{Device, Tensor};
7
+ use clap::Parser;
8
+
9
+ #[derive(Debug, Parser)]
10
+ struct Args {
11
+ /// The audio input file, in wav/mp3/ogg/... format.
12
+ in_file: String,
13
+
14
+ /// The repo where to get the model from.
15
+ #[arg(long, default_value = "kyutai/stt-1b-en_fr-candle")]
16
+ hf_repo: String,
17
+
18
+ /// Path to the model file in the repo.
19
+ #[arg(long, default_value = "model.safetensors")]
20
+ model_path: String,
21
+
22
+ /// Run the model on cpu.
23
+ #[arg(long)]
24
+ cpu: bool,
25
+
26
+ /// Display word level timestamps.
27
+ #[arg(long)]
28
+ timestamps: bool,
29
+
30
+ /// Display the level of voice activity detection (VAD).
31
+ #[arg(long)]
32
+ vad: bool,
33
+ }
34
+
35
+ fn device(cpu: bool) -> Result<Device> {
36
+ if cpu {
37
+ Ok(Device::Cpu)
38
+ } else if candle::utils::cuda_is_available() {
39
+ Ok(Device::new_cuda(0)?)
40
+ } else if candle::utils::metal_is_available() {
41
+ Ok(Device::new_metal(0)?)
42
+ } else {
43
+ Ok(Device::Cpu)
44
+ }
45
+ }
46
+
47
+ #[derive(Debug, serde::Deserialize)]
48
+ struct SttConfig {
49
+ audio_silence_prefix_seconds: f64,
50
+ audio_delay_seconds: f64,
51
+ }
52
+
53
+ #[derive(Debug, serde::Deserialize)]
54
+ struct Config {
55
+ mimi_name: String,
56
+ tokenizer_name: String,
57
+ card: usize,
58
+ text_card: usize,
59
+ dim: usize,
60
+ n_q: usize,
61
+ context: usize,
62
+ max_period: f64,
63
+ num_heads: usize,
64
+ num_layers: usize,
65
+ causal: bool,
66
+ stt_config: SttConfig,
67
+ }
68
+
69
+ impl Config {
70
+ fn model_config(&self, vad: bool) -> moshi::lm::Config {
71
+ let lm_cfg = moshi::transformer::Config {
72
+ d_model: self.dim,
73
+ num_heads: self.num_heads,
74
+ num_layers: self.num_layers,
75
+ dim_feedforward: self.dim * 4,
76
+ causal: self.causal,
77
+ norm_first: true,
78
+ bias_ff: false,
79
+ bias_attn: false,
80
+ layer_scale: None,
81
+ context: self.context,
82
+ max_period: self.max_period as usize,
83
+ use_conv_block: false,
84
+ use_conv_bias: true,
85
+ cross_attention: None,
86
+ gating: Some(candle_nn::Activation::Silu),
87
+ norm: moshi::NormType::RmsNorm,
88
+ positional_embedding: moshi::transformer::PositionalEmbedding::Rope,
89
+ conv_layout: false,
90
+ conv_kernel_size: 3,
91
+ kv_repeat: 1,
92
+ max_seq_len: 4096 * 4,
93
+ shared_cross_attn: false,
94
+ };
95
+ let extra_heads = if vad {
96
+ Some(moshi::lm::ExtraHeadsConfig {
97
+ num_heads: 4,
98
+ dim: 6,
99
+ })
100
+ } else {
101
+ None
102
+ };
103
+ moshi::lm::Config {
104
+ transformer: lm_cfg,
105
+ depformer: None,
106
+ audio_vocab_size: self.card + 1,
107
+ text_in_vocab_size: self.text_card + 1,
108
+ text_out_vocab_size: self.text_card,
109
+ audio_codebooks: self.n_q,
110
+ conditioners: Default::default(),
111
+ extra_heads,
112
+ }
113
+ }
114
+ }
115
+
116
+ struct Model {
117
+ state: moshi::asr::State,
118
+ text_tokenizer: sentencepiece::SentencePieceProcessor,
119
+ timestamps: bool,
120
+ vad: bool,
121
+ config: Config,
122
+ dev: Device,
123
+ }
124
+
125
+ impl Model {
126
+ fn load_from_hf(args: &Args, dev: &Device) -> Result<Self> {
127
+ // Retrieve the model files from the Hugging Face Hub
128
+ let api = hf_hub::api::sync::Api::new()?;
129
+ let repo = api.model(args.hf_repo.to_string());
130
+ let config_file = repo.get("config.json")?;
131
+ let config: Config = serde_json::from_str(&std::fs::read_to_string(&config_file)?)?;
132
+ let tokenizer_file = repo.get(&config.tokenizer_name)?;
133
+ let model_file = repo.get(&args.model_path)?;
134
+ let mimi_file = repo.get(&config.mimi_name)?;
135
+ let is_quantized = model_file.to_str().unwrap().ends_with(".gguf");
136
+
137
+ let text_tokenizer = sentencepiece::SentencePieceProcessor::open(&tokenizer_file)?;
138
+
139
+ let lm = if is_quantized {
140
+ let vb_lm = candle_transformers::quantized_var_builder::VarBuilder::from_gguf(
141
+ &model_file,
142
+ dev,
143
+ )?;
144
+ moshi::lm::LmModel::new(
145
+ &config.model_config(args.vad),
146
+ moshi::nn::MaybeQuantizedVarBuilder::Quantized(vb_lm),
147
+ )?
148
+ } else {
149
+ let dtype = dev.bf16_default_to_f32();
150
+ let vb_lm = unsafe {
151
+ candle_nn::VarBuilder::from_mmaped_safetensors(&[&model_file], dtype, dev)?
152
+ };
153
+ moshi::lm::LmModel::new(
154
+ &config.model_config(args.vad),
155
+ moshi::nn::MaybeQuantizedVarBuilder::Real(vb_lm),
156
+ )?
157
+ };
158
+
159
+ let audio_tokenizer = moshi::mimi::load(mimi_file.to_str().unwrap(), Some(32), dev)?;
160
+ let asr_delay_in_tokens = (config.stt_config.audio_delay_seconds * 12.5) as usize;
161
+ let state = moshi::asr::State::new(1, asr_delay_in_tokens, 0., audio_tokenizer, lm)?;
162
+ Ok(Model {
163
+ state,
164
+ config,
165
+ text_tokenizer,
166
+ timestamps: args.timestamps,
167
+ vad: args.vad,
168
+ dev: dev.clone(),
169
+ })
170
+ }
171
+
172
+ fn run(&mut self, mut pcm: Vec<f32>) -> Result<()> {
173
+ use std::io::Write;
174
+
175
+ // Add the silence prefix to the audio.
176
+ if self.config.stt_config.audio_silence_prefix_seconds > 0.0 {
177
+ let silence_len =
178
+ (self.config.stt_config.audio_silence_prefix_seconds * 24000.0) as usize;
179
+ pcm.splice(0..0, vec![0.0; silence_len]);
180
+ }
181
+ // Add some silence at the end to ensure all the audio is processed.
182
+ let suffix = (self.config.stt_config.audio_delay_seconds * 24000.0) as usize;
183
+ pcm.resize(pcm.len() + suffix + 24000, 0.0);
184
+
185
+ let mut last_word = None;
186
+ let mut printed_eot = false;
187
+ for pcm in pcm.chunks(1920) {
188
+ let pcm = Tensor::new(pcm, &self.dev)?.reshape((1, 1, ()))?;
189
+ let asr_msgs = self.state.step_pcm(pcm, None, &().into(), |_, _, _| ())?;
190
+ for asr_msg in asr_msgs.iter() {
191
+ match asr_msg {
192
+ moshi::asr::AsrMsg::Step { prs, .. } => {
193
+ // prs is the probability of having no voice activity for different time
194
+ // horizons.
195
+ // In kyutai/stt-1b-en_fr-candle, these horizons are 0.5s, 1s, 2s, and 3s.
196
+ if self.vad && prs[2][0] > 0.5 && !printed_eot {
197
+ printed_eot = true;
198
+ if !self.timestamps {
199
+ print!(" <endofturn pr={}>", prs[2][0]);
200
+ } else {
201
+ println!("<endofturn pr={}>", prs[2][0]);
202
+ }
203
+ }
204
+ }
205
+ moshi::asr::AsrMsg::EndWord { stop_time, .. } => {
206
+ printed_eot = false;
207
+ #[allow(clippy::collapsible_if)]
208
+ if self.timestamps {
209
+ if let Some((word, start_time)) = last_word.take() {
210
+ println!("[{start_time:5.2}-{stop_time:5.2}] {word}");
211
+ }
212
+ }
213
+ }
214
+ moshi::asr::AsrMsg::Word {
215
+ tokens, start_time, ..
216
+ } => {
217
+ printed_eot = false;
218
+ let word = self
219
+ .text_tokenizer
220
+ .decode_piece_ids(tokens)
221
+ .unwrap_or_else(|_| String::new());
222
+ if !self.timestamps {
223
+ print!(" {word}");
224
+ std::io::stdout().flush()?
225
+ } else {
226
+ if let Some((word, prev_start_time)) = last_word.take() {
227
+ println!("[{prev_start_time:5.2}-{start_time:5.2}] {word}");
228
+ }
229
+ last_word = Some((word, *start_time));
230
+ }
231
+ }
232
+ }
233
+ }
234
+ }
235
+ if let Some((word, start_time)) = last_word.take() {
236
+ println!("[{start_time:5.2}- ] {word}");
237
+ }
238
+ println!();
239
+ Ok(())
240
+ }
241
+ }
242
+
243
+ fn main() -> Result<()> {
244
+ let args = Args::parse();
245
+ let device = device(args.cpu)?;
246
+ println!("Using device: {:?}", device);
247
+
248
+ println!("Loading audio file from: {}", args.in_file);
249
+ let (pcm, sample_rate) = kaudio::pcm_decode(&args.in_file)?;
250
+ let pcm = if sample_rate != 24_000 {
251
+ kaudio::resample(&pcm, sample_rate as usize, 24_000)?
252
+ } else {
253
+ pcm
254
+ };
255
+ println!("Loading model from repository: {}", args.hf_repo);
256
+ let mut model = Model::load_from_hf(&args, &device)?;
257
+ println!("Running inference");
258
+ model.run(pcm)?;
259
+ Ok(())
260
+ }
stt_pytorch.ipynb ADDED
@@ -0,0 +1,237 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": null,
6
+ "metadata": {
7
+ "colab": {
8
+ "base_uri": "https://localhost:8080/"
9
+ },
10
+ "id": "gJEMjPgeI-rw",
11
+ "outputId": "7491c067-b1be-4505-b3f5-19ba4c00a593"
12
+ },
13
+ "outputs": [],
14
+ "source": [
15
+ "!pip install moshi"
16
+ ]
17
+ },
18
+ {
19
+ "cell_type": "code",
20
+ "execution_count": null,
21
+ "metadata": {
22
+ "colab": {
23
+ "base_uri": "https://localhost:8080/"
24
+ },
25
+ "id": "CA4K5iDFJcqJ",
26
+ "outputId": "b609843a-a193-4729-b099-5f8780532333"
27
+ },
28
+ "outputs": [],
29
+ "source": [
30
+ "!wget https://github.com/kyutai-labs/moshi/raw/refs/heads/main/data/sample_fr_hibiki_crepes.mp3"
31
+ ]
32
+ },
33
+ {
34
+ "cell_type": "code",
35
+ "execution_count": null,
36
+ "metadata": {
37
+ "id": "VA3Haix3IZ8Q"
38
+ },
39
+ "outputs": [],
40
+ "source": [
41
+ "from dataclasses import dataclass\n",
42
+ "import time\n",
43
+ "import sentencepiece\n",
44
+ "import sphn\n",
45
+ "import textwrap\n",
46
+ "import torch\n",
47
+ "\n",
48
+ "from moshi.models import loaders, MimiModel, LMModel, LMGen"
49
+ ]
50
+ },
51
+ {
52
+ "cell_type": "code",
53
+ "execution_count": null,
54
+ "metadata": {
55
+ "id": "9AK5zBMTI9bw"
56
+ },
57
+ "outputs": [],
58
+ "source": [
59
+ "@dataclass\n",
60
+ "class InferenceState:\n",
61
+ " mimi: MimiModel\n",
62
+ " text_tokenizer: sentencepiece.SentencePieceProcessor\n",
63
+ " lm_gen: LMGen\n",
64
+ "\n",
65
+ " def __init__(\n",
66
+ " self,\n",
67
+ " mimi: MimiModel,\n",
68
+ " text_tokenizer: sentencepiece.SentencePieceProcessor,\n",
69
+ " lm: LMModel,\n",
70
+ " batch_size: int,\n",
71
+ " device: str | torch.device,\n",
72
+ " ):\n",
73
+ " self.mimi = mimi\n",
74
+ " self.text_tokenizer = text_tokenizer\n",
75
+ " self.lm_gen = LMGen(lm, temp=0, temp_text=0, use_sampling=False)\n",
76
+ " self.device = device\n",
77
+ " self.frame_size = int(self.mimi.sample_rate / self.mimi.frame_rate)\n",
78
+ " self.batch_size = batch_size\n",
79
+ " self.mimi.streaming_forever(batch_size)\n",
80
+ " self.lm_gen.streaming_forever(batch_size)\n",
81
+ "\n",
82
+ " def run(self, in_pcms: torch.Tensor):\n",
83
+ " ntokens = 0\n",
84
+ " first_frame = True\n",
85
+ " chunks = [\n",
86
+ " c\n",
87
+ " for c in in_pcms.split(self.frame_size, dim=2)\n",
88
+ " if c.shape[-1] == self.frame_size\n",
89
+ " ]\n",
90
+ " start_time = time.time()\n",
91
+ " all_text = []\n",
92
+ " for chunk in chunks:\n",
93
+ " codes = self.mimi.encode(chunk)\n",
94
+ " if first_frame:\n",
95
+ " # Ensure that the first slice of codes is properly seen by the transformer\n",
96
+ " # as otherwise the first slice is replaced by the initial tokens.\n",
97
+ " tokens = self.lm_gen.step(codes)\n",
98
+ " first_frame = False\n",
99
+ " tokens = self.lm_gen.step(codes)\n",
100
+ " if tokens is None:\n",
101
+ " continue\n",
102
+ " assert tokens.shape[1] == 1\n",
103
+ " one_text = tokens[0, 0].cpu()\n",
104
+ " if one_text.item() not in [0, 3]:\n",
105
+ " text = self.text_tokenizer.id_to_piece(one_text.item())\n",
106
+ " text = text.replace(\"▁\", \" \")\n",
107
+ " all_text.append(text)\n",
108
+ " ntokens += 1\n",
109
+ " dt = time.time() - start_time\n",
110
+ " print(\n",
111
+ " f\"processed {ntokens} steps in {dt:.0f}s, {1000 * dt / ntokens:.2f}ms/step\"\n",
112
+ " )\n",
113
+ " return \"\".join(all_text)"
114
+ ]
115
+ },
116
+ {
117
+ "cell_type": "code",
118
+ "execution_count": null,
119
+ "metadata": {
120
+ "colab": {
121
+ "base_uri": "https://localhost:8080/",
122
+ "height": 353,
123
+ "referenced_widgets": [
124
+ "0a5f6f887e2b4cd1990a0e9ec0153ed9",
125
+ "f7893826fcba4bdc87539589d669249b",
126
+ "8805afb12c484781be85082ff02dad13",
127
+ "97679c0d9ab44bed9a3456f2fcb541fd",
128
+ "d73c0321bed54a52b5e1da0a7788e32a",
129
+ "d67be13a920d4fc89e5570b5b29fc1d2",
130
+ "6b377c2d7bf945fb89e46c39d246a332",
131
+ "b82ff365c78e41ad8094b46daf79449d",
132
+ "477aa7fa82dc42d5bce6f1743c45d626",
133
+ "cbd288510c474430beb66f346f382c45",
134
+ "aafc347cdf28428ea6a7abe5b46b726f",
135
+ "fca09acd5d0d45468c8b04bfb2de7646",
136
+ "79e35214b51b4a9e9b3f7144b0b34f7b",
137
+ "89e9a37f69904bd48b954d627bff6687",
138
+ "57028789c78248a7b0ad4f031c9545c9",
139
+ "1150fcb427994c2984d4d0f4e4745fe5",
140
+ "e24b1fc52f294f849019c9b3befb613f",
141
+ "8724878682cf4c3ca992667c45009398",
142
+ "36a22c977d5242008871310133b7d2af",
143
+ "5b3683cad5cb4877b43fadd003edf97f",
144
+ "703f98272e4d469d8f27f5a465715dd8",
145
+ "9dbe02ef5fac41cfaee3d02946e65c88",
146
+ "37faa87ad03a4271992c21ce6a629e18",
147
+ "570c547e48cd421b814b2c5e028e4c0b",
148
+ "b173768580fc4c0a8e3abf272e4c363a",
149
+ "e57d1620f0a9427b85d8b4885ef4e8e3",
150
+ "5dd4474df70743498b616608182714dd",
151
+ "cc907676a65f4ad1bf68a77b4a00e89b",
152
+ "a34abc3b118e4305951a466919c28ff6",
153
+ "a77ccfcdb90146c7a63b4b2d232bc494",
154
+ "f7313e6e3a27475993cab3961d6ae363",
155
+ "39b47fad9c554839868fe9e4bbf7def2",
156
+ "14e9511ea0bd44c49f0cf3abf1a6d40e",
157
+ "a4ea8e0c4cac4d5e88b7e3f527e4fe90",
158
+ "571afc0f4b2840c9830d6b5a307ed1f9",
159
+ "6ec593cab5b64f0ea638bb175b9daa5c",
160
+ "77a52aed00ae408bb24524880e19ec8a",
161
+ "0b2de4b29b4b44fe9d96361a40c793d0",
162
+ "3c5b5fb1a5ac468a89c1058bd90cfb58",
163
+ "e53e0a2a240e43cfa562c89b3d703dea",
164
+ "35966343cf9249ef8bc028a0d5c5f97d",
165
+ "e36a37e0d41c47ccb8bc6d56c19fb17c",
166
+ "279ccf7de43847a1a6579c9182a46cc8",
167
+ "41b5d6ab0b7d43c790a55f125c0e7494"
168
+ ]
169
+ },
170
+ "id": "UsQJdAgkLp9n",
171
+ "outputId": "9b7131c3-69c5-4323-8312-2ce7621d8869"
172
+ },
173
+ "outputs": [],
174
+ "source": [
175
+ "device = \"cuda\"\n",
176
+ "# Use the en+fr low latency model, an alternative is kyutai/stt-2.6b-en\n",
177
+ "checkpoint_info = loaders.CheckpointInfo.from_hf_repo(\"kyutai/stt-1b-en_fr\")\n",
178
+ "mimi = checkpoint_info.get_mimi(device=device)\n",
179
+ "text_tokenizer = checkpoint_info.get_text_tokenizer()\n",
180
+ "lm = checkpoint_info.get_moshi(device=device)\n",
181
+ "in_pcms, _ = sphn.read(\"sample_fr_hibiki_crepes.mp3\", sample_rate=mimi.sample_rate)\n",
182
+ "in_pcms = torch.from_numpy(in_pcms).to(device=device)\n",
183
+ "\n",
184
+ "stt_config = checkpoint_info.stt_config\n",
185
+ "pad_left = int(stt_config.get(\"audio_silence_prefix_seconds\", 0.0) * 24000)\n",
186
+ "pad_right = int((stt_config.get(\"audio_delay_seconds\", 0.0) + 1.0) * 24000)\n",
187
+ "in_pcms = torch.nn.functional.pad(in_pcms, (pad_left, pad_right), mode=\"constant\")\n",
188
+ "in_pcms = in_pcms[None, 0:1].expand(1, -1, -1)\n",
189
+ "\n",
190
+ "state = InferenceState(mimi, text_tokenizer, lm, batch_size=1, device=device)\n",
191
+ "text = state.run(in_pcms)\n",
192
+ "print(textwrap.fill(text, width=100))"
193
+ ]
194
+ },
195
+ {
196
+ "cell_type": "code",
197
+ "execution_count": null,
198
+ "metadata": {
199
+ "colab": {
200
+ "base_uri": "https://localhost:8080/",
201
+ "height": 75
202
+ },
203
+ "id": "CIAXs9oaPrtj",
204
+ "outputId": "94cc208c-2454-4dd4-a64e-d79025144af5"
205
+ },
206
+ "outputs": [],
207
+ "source": [
208
+ "from IPython.display import Audio\n",
209
+ "\n",
210
+ "Audio(\"sample_fr_hibiki_crepes.mp3\")"
211
+ ]
212
+ },
213
+ {
214
+ "cell_type": "code",
215
+ "execution_count": null,
216
+ "metadata": {
217
+ "id": "qkUZ6CBKOdTa"
218
+ },
219
+ "outputs": [],
220
+ "source": []
221
+ }
222
+ ],
223
+ "metadata": {
224
+ "accelerator": "GPU",
225
+ "colab": {
226
+ "gpuType": "L4",
227
+ "provenance": []
228
+ },
229
+ "kernelspec": {
230
+ "display_name": "Python 3 (ipykernel)",
231
+ "language": "python",
232
+ "name": "python3"
233
+ }
234
+ },
235
+ "nbformat": 4,
236
+ "nbformat_minor": 0
237
+ }
tts_pytorch.ipynb ADDED
@@ -0,0 +1,139 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": null,
6
+ "id": "0",
7
+ "metadata": {},
8
+ "outputs": [],
9
+ "source": [
10
+ "# Fast install, might break in the future.\n",
11
+ "!pip install 'safetensors<0.6'\n",
12
+ "!pip install 'sphn<0.2'\n",
13
+ "!pip install --no-deps \"moshi==0.2.11\"\n",
14
+ "# Slow install (will download torch and cuda), but future proof.\n",
15
+ "# !pip install \"moshi==0.2.11\""
16
+ ]
17
+ },
18
+ {
19
+ "cell_type": "code",
20
+ "execution_count": null,
21
+ "id": "1",
22
+ "metadata": {},
23
+ "outputs": [],
24
+ "source": [
25
+ "import numpy as np\n",
26
+ "import torch\n",
27
+ "from moshi.models.loaders import CheckpointInfo\n",
28
+ "from moshi.models.tts import DEFAULT_DSM_TTS_REPO, DEFAULT_DSM_TTS_VOICE_REPO, TTSModel\n",
29
+ "\n",
30
+ "from IPython.display import display, Audio"
31
+ ]
32
+ },
33
+ {
34
+ "cell_type": "code",
35
+ "execution_count": null,
36
+ "id": "2",
37
+ "metadata": {},
38
+ "outputs": [],
39
+ "source": [
40
+ "# Configuration\n",
41
+ "text = \"Hey there! How are you? I had the craziest day today.\"\n",
42
+ "voice = \"expresso/ex03-ex01_happy_001_channel1_334s.wav\"\n",
43
+ "print(f\"See https://huggingface.co/{DEFAULT_DSM_TTS_VOICE_REPO} for available voices.\")"
44
+ ]
45
+ },
46
+ {
47
+ "cell_type": "code",
48
+ "execution_count": null,
49
+ "id": "3",
50
+ "metadata": {},
51
+ "outputs": [],
52
+ "source": [
53
+ "# Set everything up\n",
54
+ "checkpoint_info = CheckpointInfo.from_hf_repo(DEFAULT_DSM_TTS_REPO)\n",
55
+ "tts_model = TTSModel.from_checkpoint_info(\n",
56
+ " checkpoint_info, n_q=32, temp=0.6, device=torch.device(\"cuda\")\n",
57
+ ")\n",
58
+ "\n",
59
+ "# If you want to make a dialog, you can pass more than one turn [text_speaker_1, text_speaker_2, text_2_speaker_1, ...]\n",
60
+ "entries = tts_model.prepare_script([text], padding_between=1)\n",
61
+ "voice_path = tts_model.get_voice_path(voice)\n",
62
+ "# CFG coef goes here because the model was trained with CFG distillation,\n",
63
+ "# so it's not _actually_ doing CFG at inference time.\n",
64
+ "# Also, if you are generating a dialog, you should have two voices in the list.\n",
65
+ "condition_attributes = tts_model.make_condition_attributes([voice_path], cfg_coef=2.0)"
66
+ ]
67
+ },
68
+ {
69
+ "cell_type": "code",
70
+ "execution_count": null,
71
+ "id": "4",
72
+ "metadata": {},
73
+ "outputs": [],
74
+ "source": [
75
+ "print(\"Generating audio...\")\n",
76
+ "\n",
77
+ "pcms = []\n",
78
+ "\n",
79
+ "\n",
80
+ "def _on_frame(frame):\n",
81
+ " print(\"Step\", len(pcms), end=\"\\r\")\n",
82
+ " if (frame != -1).all():\n",
83
+ " pcm = tts_model.mimi.decode(frame[:, 1:, :]).cpu().numpy()\n",
84
+ " pcms.append(np.clip(pcm[0, 0], -1, 1))\n",
85
+ "\n",
86
+ "\n",
87
+ "# You could also generate multiple audios at once by extending the following lists.\n",
88
+ "all_entries = [entries]\n",
89
+ "all_condition_attributes = [condition_attributes]\n",
90
+ "with tts_model.mimi.streaming(len(all_entries)):\n",
91
+ " result = tts_model.generate(\n",
92
+ " all_entries, all_condition_attributes, on_frame=_on_frame\n",
93
+ " )\n",
94
+ "\n",
95
+ "print(\"Done generating.\")\n",
96
+ "audio = np.concatenate(pcms, axis=-1)"
97
+ ]
98
+ },
99
+ {
100
+ "cell_type": "code",
101
+ "execution_count": null,
102
+ "id": "5",
103
+ "metadata": {},
104
+ "outputs": [],
105
+ "source": [
106
+ "display(Audio(audio, rate=tts_model.mimi.sample_rate, autoplay=True))"
107
+ ]
108
+ },
109
+ {
110
+ "cell_type": "code",
111
+ "execution_count": null,
112
+ "id": "6",
113
+ "metadata": {},
114
+ "outputs": [],
115
+ "source": []
116
+ }
117
+ ],
118
+ "metadata": {
119
+ "kernelspec": {
120
+ "display_name": "Python 3 (ipykernel)",
121
+ "language": "python",
122
+ "name": "python3"
123
+ },
124
+ "language_info": {
125
+ "codemirror_mode": {
126
+ "name": "ipython",
127
+ "version": 3
128
+ },
129
+ "file_extension": ".py",
130
+ "mimetype": "text/x-python",
131
+ "name": "python",
132
+ "nbconvert_exporter": "python",
133
+ "pygments_lexer": "ipython3",
134
+ "version": "3.13.2"
135
+ }
136
+ },
137
+ "nbformat": 4,
138
+ "nbformat_minor": 5
139
+ }