Windows Support?
It works fine on Windows!
Download Miniconda (to access Anaconda Prompt): https://www.anaconda.com/docs/getting-started/miniconda/install#windows-installation
Github Repo: https://github.com/FunAudioLLM/CosyVoice
First step (in Anaconda Prompt):
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
git submodule update --init --recursive
Second step:
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
cd /d C:\Users\<YOUR-NAME-HERE>\CosyVoice
pip install --upgrade pip
pip install -r requirements.txt
Third step:
Download the model to...
C:\Users\<YOUR-NAME-HERE>\CosyVoice\pretrained_models\Fun-CosyVoice3-0.5B
You may run into errors, but I just used ChatGPT to correct them and get things running. One thing I had to do was uninstall Torch and upgrade to a newer version, since my computer has a new GPU. I also had to install x-transformers:
conda activate cosyvoice
pip install x-transformers
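After the installs above, a quick stdlib-only sanity check can confirm that the key packages actually import in the active env before you launch anything. This is just a sketch (the script name and the package list are my assumptions, based on the steps above; adjust to your requirements.txt):

```python
# check_env.py - minimal sketch: verify key packages are importable in the
# active conda env before running CosyVoice. The REQUIRED list is an
# assumption drawn from the install steps above.
import importlib.util

REQUIRED = ["torch", "torchaudio", "gradio", "soundfile", "numpy", "x_transformers"]

def missing_packages(names):
    """Return the subset of package names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All key packages found.")
```

Run it with `python check_env.py` inside the activated env; anything it reports missing is worth fixing before you touch the webui.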
Just be patient, because you'll have to jump through some hoops and make some tweaks. But it's definitely worth the wait!
Yeah, I did the same and it went downhill, so... all easy like you said.
I ended up with a nonfunctional Chinese Gradio UI and a bunch of errors, and ChatGPT gave up too, so I uninstalled it and let it go.
No, don't give up. You're almost there! LOL, I was just about to give up myself when I was in that exact same spot, but then the very next fix from ChatGPT got everything working!
So reinstall it, and have ChatGPT create a customized English Gradio UI for you (which is what it did for me).
You can have ChatGPT draw inspiration from the English Gradio that it custom-built for me (save it as 'webui_english.py' or something):
import os
import sys
import argparse
import tempfile
import random

import gradio as gr
import numpy as np
import soundfile as sf
import torch
import torchaudio

ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.join(ROOT_DIR, "third_party", "Matcha-TTS"))

from cosyvoice.cli.cosyvoice import AutoModel
from cosyvoice.utils.file_utils import logging
from cosyvoice.utils.common import set_all_random_seed

PROMPT_SR_MIN = 16000
PROMPT_SR_TARGET = 24000
MAX_ABS = 0.8


def generate_seed():
    seed = random.randint(1, 100000000)
    return {"__type__": "update", "value": seed}


def _to_mono(x: np.ndarray) -> np.ndarray:
    # x can be (T,) or (T, C)
    if x.ndim == 1:
        return x
    return x.mean(axis=1)


def _read_audio_file_float_mono(path: str):
    """
    Reads audio using soundfile.
    Returns (sr:int, mono_float32:np.ndarray).
    NOTE: soundfile handles WAV/FLAC/OGG(Vorbis) depending on libsndfile build.
    If you upload MP3/M4A/etc, this may fail (and would need FFmpeg, which is broken in your env).
    """
    try:
        data, sr = sf.read(path, dtype="float32", always_2d=True)  # [T, C]
        mono = data.mean(axis=1).astype(np.float32)  # [T]
        return int(sr), mono
    except Exception as e:
        raise gr.Error(
            "Could not read the prompt audio file. "
            "Please upload a WAV (recommended) or FLAC. "
            f"Details: {e}"
        )


def _write_prompt_to_temp_wav(prompt_audio_path: str, target_sr: int = PROMPT_SR_TARGET) -> str:
    """
    Takes a prompt audio FILEPATH (from Gradio), reads it cleanly,
    resamples to target_sr, writes a temp WAV, returns its path.
    """
    sr, mono = _read_audio_file_float_mono(prompt_audio_path)
    # Resample to target_sr using torchaudio (no FFmpeg)
    x = torch.from_numpy(mono).unsqueeze(0)  # [1, T]
    if int(sr) != int(target_sr):
        x = torchaudio.functional.resample(x, int(sr), int(target_sr))
    y = x.squeeze(0).cpu().numpy().astype(np.float32)
    # Keep prompt from clipping (usually unnecessary for clean WAV, but safe)
    y = np.nan_to_num(y, nan=0.0, posinf=0.0, neginf=0.0)
    peak = float(np.max(np.abs(y))) if y.size else 0.0
    if peak > 1.0:
        y = y / peak
    fd, outpath = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    sf.write(outpath, y, int(target_sr))
    return outpath


def _get_sr(path: str) -> int:
    return int(sf.info(path).samplerate)


def _collect_audio(gen, sample_rate: int) -> np.ndarray:
    chunks = []
    for i in gen:
        t = i["tts_speech"]
        if isinstance(t, torch.Tensor):
            t = t.detach().cpu()
            if t.ndim > 1:
                t = t.reshape(-1)
            arr = t.numpy().astype(np.float32)
        else:
            arr = np.asarray(t, dtype=np.float32).reshape(-1)
        chunks.append(arr)
    if not chunks:
        return np.zeros(sample_rate, dtype=np.float32)
    audio = np.concatenate(chunks)
    audio = np.nan_to_num(audio, nan=0.0, posinf=0.0, neginf=0.0)
    audio = np.clip(audio, -MAX_ABS, MAX_ABS)
    return audio.astype(np.float32)


def _call_with_optional_kwargs(fn, *args, **kwargs):
    """
    CosyVoice variants differ slightly in accepted kwargs (e.g., speed).
    Try calling with kwargs; if TypeError, retry without unknown kwargs.
    """
    try:
        return fn(*args, **kwargs)
    except TypeError:
        return fn(*args)


def build_ui(cosyvoice, args):
    # Rough capability detection
    has_sft = hasattr(cosyvoice, "inference_sft")
    has_zero_shot = hasattr(cosyvoice, "inference_zero_shot")
    has_cross = hasattr(cosyvoice, "inference_cross_lingual")
    has_instruct2 = hasattr(cosyvoice, "inference_instruct2")
    has_instruct = hasattr(cosyvoice, "inference_instruct")

    # List speakers if available
    sft_spk = []
    try:
        sft_spk = cosyvoice.list_available_spks() or []
    except Exception:
        sft_spk = []

    modes = []
    if has_sft and len(sft_spk) > 0:
        modes.append("Pretrained tone (SFT)")
    if has_zero_shot:
        modes.append("Zero-shot clone (prompt audio + transcript)")
    if has_cross:
        modes.append("Cross-lingual clone (prompt audio)")
    if has_instruct2 or has_instruct:
        modes.append("Instruct (control)")
    if not modes:
        raise RuntimeError("No supported inference modes detected for this model.")

    def generate_audio(
        tts_text,
        mode,
        sft_dropdown,
        system_prompt,
        prompt_transcript,
        prompt_audio_upload_path,
        prompt_audio_record_path,
        instruct_text,
        seed,
        speed,
    ):
        # Non-stream UI only (avoid ffprobe/pydub ADTS conversion)
        if not (tts_text and str(tts_text).strip()):
            raise gr.Error("Please enter text to synthesize.")

        # Choose prompt source (filepath)
        prompt_audio_path = prompt_audio_upload_path or prompt_audio_record_path
        prompt_wav_path = None
        if mode in [
            "Zero-shot clone (prompt audio + transcript)",
            "Cross-lingual clone (prompt audio)",
            "Instruct (control)",
        ]:
            if not prompt_audio_path:
                raise gr.Error("Please provide a prompt audio clip (upload or record).")
            # Convert prompt to clean 24k wav
            prompt_wav_path = _write_prompt_to_temp_wav(prompt_audio_path, target_sr=PROMPT_SR_TARGET)
            sr = _get_sr(prompt_wav_path)
            if sr < PROMPT_SR_MIN:
                gr.Warning(f"Prompt sample rate {sr} < {PROMPT_SR_MIN}. Results may degrade.")

        set_all_random_seed(int(seed) if seed is not None else 0)

        # ----- Mode: SFT -----
        if mode == "Pretrained tone (SFT)":
            if not sft_dropdown:
                raise gr.Error("No pretrained tone selected.")
            logging.info("SFT inference request")
            gen = _call_with_optional_kwargs(
                cosyvoice.inference_sft,
                tts_text,
                sft_dropdown,
                stream=False,
                speed=speed,
            )
            audio = _collect_audio(gen, cosyvoice.sample_rate)
            return (int(cosyvoice.sample_rate), audio)

        # ----- Mode: Zero-shot -----
        if mode == "Zero-shot clone (prompt audio + transcript)":
            if not (prompt_transcript and prompt_transcript.strip()):
                raise gr.Error("Prompt transcript is required (the words spoken in the prompt audio).")
            # CV3 convention: system + <|endofprompt|> + transcript
            # CV2 convention: transcript-only often works
            pt = prompt_transcript.strip()
            sp = (system_prompt or "").strip()
            if sp:
                prompt_text = sp + "<|endofprompt|>" + pt
            else:
                prompt_text = pt
            logging.info("Zero-shot inference request")
            gen = _call_with_optional_kwargs(
                cosyvoice.inference_zero_shot,
                tts_text,
                prompt_text,
                prompt_wav_path,
                stream=False,
                speed=speed,
            )
            audio = _collect_audio(gen, cosyvoice.sample_rate)
            return (int(cosyvoice.sample_rate), audio)

        # ----- Mode: Cross-lingual -----
        if mode == "Cross-lingual clone (prompt audio)":
            logging.info("Cross-lingual inference request")
            gen = _call_with_optional_kwargs(
                cosyvoice.inference_cross_lingual,
                tts_text,
                prompt_wav_path,
                stream=False,
                speed=speed,
            )
            audio = _collect_audio(gen, cosyvoice.sample_rate)
            return (int(cosyvoice.sample_rate), audio)

        # ----- Mode: Instruct -----
        if mode == "Instruct (control)":
            if not (instruct_text and instruct_text.strip()):
                raise gr.Error("Instruct text is required for this mode.")
            logging.info("Instruct inference request")
            if has_instruct2:
                # CosyVoice2/3 style
                gen = _call_with_optional_kwargs(
                    cosyvoice.inference_instruct2,
                    tts_text,
                    instruct_text,
                    prompt_wav_path,
                    stream=False,
                    speed=speed,
                )
            else:
                # Older CosyVoice1 style (kept for compatibility)
                gen = _call_with_optional_kwargs(
                    cosyvoice.inference_instruct,
                    tts_text,
                    sft_dropdown if sft_dropdown else "",
                    instruct_text,
                    stream=False,
                    speed=speed,
                )
            audio = _collect_audio(gen, cosyvoice.sample_rate)
            return (int(cosyvoice.sample_rate), audio)

        raise gr.Error(f"Unsupported mode: {mode}")

    with gr.Blocks() as demo:
        gr.Markdown("## CosyVoice Unified WebUI (Windows-safe, non-stream)")
        gr.Markdown(
            f"**Model dir:** `{args.model_dir}` \n"
            "Notes: returns full audio (no Gradio streaming) to avoid ffprobe/pydub issues."
        )
        tts_text = gr.Textbox(
            label="Text to synthesize",
            lines=4,
            value="Hello! This is a local CosyVoice test.",
        )
        with gr.Row():
            mode = gr.Radio(choices=modes, label="Mode", value=modes[0])
            seed_btn = gr.Button("🎲 Random seed")
            seed = gr.Number(value=0, label="Seed")
            speed = gr.Number(value=1.0, label="Speed (if supported)", minimum=0.5, maximum=2.0, step=0.1)

        # Prompt audio (FILEPATH) — closest to CLI behavior
        with gr.Row():
            prompt_audio_upload_path = gr.Audio(
                sources=["upload"],
                type="filepath",
                label="Prompt audio (upload) — recommended: WAV/FLAC",
            )
            prompt_audio_record_path = gr.Audio(
                sources=["microphone"],
                type="filepath",
                label="Prompt audio (record) — if this fails, upload WAV instead",
            )

        # CV3-friendly prompt fields
        system_prompt = gr.Textbox(
            label="System prompt (optional; CV3 typically uses this)",
            lines=2,
            value="<|endofprompt|>",
        )
        prompt_transcript = gr.Textbox(
            label="Prompt transcript (required for Zero-shot mode)",
            lines=2,
            placeholder="Type the exact words spoken in the prompt audio…",
        )

        # SFT selection (only relevant if model supports it)
        sft_dropdown = gr.Dropdown(
            choices=(sft_spk if len(sft_spk) > 0 else [""]),
            label="Pretrained tone (SFT speaker)",
            value=(sft_spk[0] if len(sft_spk) > 0 else ""),
        )
        instruct_text = gr.Textbox(
            label="Instruct text (required for Instruct mode)",
            lines=2,
            placeholder="E.g., 'Please speak much faster.<|endofprompt|>'",
        )

        btn = gr.Button("Generate")
        out_audio = gr.Audio(label="Output (full audio)", autoplay=True, type="numpy")

        seed_btn.click(generate_seed, inputs=[], outputs=seed)
        btn.click(
            generate_audio,
            inputs=[
                tts_text,
                mode,
                sft_dropdown,
                system_prompt,
                prompt_transcript,
                prompt_audio_upload_path,
                prompt_audio_record_path,
                instruct_text,
                seed,
                speed,
            ],
            outputs=[out_audio],
        )

    return demo


def main():
    cosyvoice = AutoModel(model_dir=args.model_dir)
    demo = build_ui(cosyvoice, args)
    demo.queue(max_size=4, default_concurrency_limit=2)
    demo.launch(server_name="0.0.0.0", server_port=args.port, share=False)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int, default=50000)
    parser.add_argument(
        "--model_dir",
        type=str,
        default="pretrained_models/Fun-CosyVoice3-0.5B",
        help="Local path to a model folder (recommended on Windows).",
    )
    args = parser.parse_args()
    main()
If you installed everything your system needs, this should run perfectly for you. If not, just copy the error message into ChatGPT and let it do the troubleshooting and heavy lifting. That's how I got mine to work flawlessly.
Don't upload prompt audio over 30 seconds, otherwise you'll get an error. And try to keep your "text to synthesize" no more than 200 characters or so. If you want to generate longer clips, have ChatGPT write you another python script that generates audio from longer transcripts (directly in Anaconda Prompt).
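For the longer-clip case, the kind of helper script that advice is pointing at might look something like this: split the transcript into roughly 200-character chunks at sentence boundaries, synthesize each chunk, then concatenate the audio. The 200-character limit comes from the advice above; the splitting heuristic and function name are just my illustration (a very long single sentence will still exceed the limit):

```python
# Sketch: split a long transcript into ~200-character chunks at sentence
# boundaries, so each chunk can be synthesized separately and the resulting
# audio concatenated. Heuristic and names are illustrative, not CosyVoice API.
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    # Split on sentence-ending punctuation, keeping the punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Feed each chunk to the model in a loop and concatenate the returned arrays with numpy, the same way `_collect_audio` in the webui script concatenates streamed chunks.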
Cheers! 🍻
Alright, I'll give it another run... I will post if something goes downhill again.
/edit I've completely given up. It's absolute chaos, a dependency hell. Never again, but thank you for trying to help me. I'm not going to program "Windows 12 and Windows 13" at the same time... For me, it just felt like chaos, not a solution. I asked ChatGPT to help me, but every time it said: "Oh, you have to do this..." Hours later: oh yes, right, that has to be renamed back to what it was at the beginning. A cycle of endless doom.
Sorry, I threw it away until someone makes a reasonable attempt for Windows users that is easy to use...
/edit some typos fixed
It can run on Windows. Make sure to set text_frontend=False, because text_normalize may not be supported on Windows.
2 hours? Insane.
Dang! Sorry to hear that, man.
When you restarted the process, did you wipe out all of the previous data from the first attempt? Maybe overlapping packages/dependencies F'd things up?
Yes, it does require a lot of persistence and patience. It took me hours, too, but it definitely paid off. If I had time, I'd do a YouTube video of an A-to-Z install, but eventually someone will probably do that and make it a lot easier. These long installs used to make me question my sanity, too, haha, so I know how you feel. Good luck, man!
I was starting with a fresh new env, but I ran into a loop, and ChatGPT led me through that loop again and again. It had me rename cosyvoice3.yaml to cosyvoice.yaml, which is where things went down the drain; later it couldn't remember that step, and the result was literally a sauce of CosyVoice 3 mixed into CosyVoice 2 or some such. I asked myself what kind of BS ChatGPT had mixed together... it was just a mixed-up, nonfunctional mess in the end. When I then asked ChatGPT "how about the Gradio thingy?", it was fully over. It destroyed the last bit of whatever remained, which was cluttered and nonfunctional by that point anyway.
Yeah, it definitely led you in the wrong direction. I didn't have to touch the .yaml files or CosyVoice2 on my install. You shouldn't have had to do ANY of that.
After I did the basic install and installed what was needed for my new GPU, GPT-5.2 modified my Gradio to English and switched something with numpy. After that, everything just magically worked very easily!
Aside from the basic install and git clone, all you really should've needed to do was tell ChatGPT:
- specifically which GPU you have
- tell it to only work on installing CosyVoice 3 0.5B (forget about CosyVoice 2; it's not needed for any of this)
- ask which version of torch and torchaudio is best for your GPU to get this working
- give it the README.md and give it the Gradio python code I posted above (for an example of a working English Gradio version for CosyVoice 3 0.5B)
Aside from that, you should really be good. I don't know which path your ChatGPT took, but that's wild, man! If you try again and run into a major block, post the error here and I can ask my ChatGPT what's going wrong. Since it already knows how to get this working for me, it might be in a better position to help than your own ChatGPT.
If I have some time after the holidays, I might try to install it on another computer and create an easy-to-follow guide. Good luck!
Btw, I just asked my ChatGPT to summarize the blocks/fixes that I ran into to get things working for me on Windows. Maybe this will help some of you...
CosyVoice3‑0.5B on Windows: practical checklist
0) Use a fresh conda env (avoid “mystery DLL soup”)
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
1) Clone repo + submodules
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd /d CosyVoice
git submodule update --init --recursive
Verify submodule:
git submodule status --recursive
2) Install Python requirements but don’t let them downgrade your Torch
CosyVoice’s requirements.txt may pin torch==2.3.1 and a cu121 extra-index. That breaks RTX 5090 / sm_120.
Recommended order:
- install repo requirements
- then override Torch to a Blackwell-capable build
2A) Install requirements
pip install -r requirements.txt
2B) Fix Torch for RTX 50‑series (Blackwell / sm_120)
If you have an RTX 50‑series laptop GPU (e.g., 5090 Laptop), install cu128:
pip uninstall -y torch torchaudio torchvision
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
Sanity test CUDA:
python -c "import torch; print(torch.__version__); print('CUDA:', torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no gpu')"
python -c "import torch; x=torch.randn(64,64,device='cuda'); print((x@x).sum().item())"
If the matmul errors, your Torch build still doesn’t support your GPU.
2C) Prevent accidental Torch downgrade
Edit requirements.txt and remove/comment lines like:
--extra-index-url https://download.pytorch.org/whl/cu121
torch==2.3.1
torchaudio==2.3.1
Otherwise, rerunning pip install -r requirements.txt can re-break CUDA.
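If you'd rather automate step 2C than hand-edit the file, a small throwaway script can comment out the Torch pins and the PyTorch extra-index line. The patterns below are assumptions; check what your copy of requirements.txt actually pins before running anything like this:

```python
# Sketch: comment out lines in requirements.txt that would re-pin Torch.
# The PIN_PATTERNS list is an assumption -- verify against your own file.
import re

PIN_PATTERNS = [
    r"^\s*--extra-index-url\s+.*download\.pytorch\.org",
    r"^\s*torch==",
    r"^\s*torchaudio==",
    r"^\s*torchvision==",
]

def comment_out_pins(text: str) -> str:
    out = []
    for line in text.splitlines():
        if any(re.match(p, line) for p in PIN_PATTERNS):
            out.append("# " + line)  # keep the line visible, but disabled
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

Apply it with something like `Path("requirements.txt").write_text(comment_out_pins(Path("requirements.txt").read_text()))` after making a backup copy.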
3) Put models in the exact folders the code expects
For example.py, the folder names matter.
Example structure:
CosyVoice\
    pretrained_models\
        Fun-CosyVoice3-0.5B\    (CV3-0.5B-2512 downloaded/renamed here)
        CosyVoice2-0.5B\
        CosyVoice-300M\
        ...
Then run:
python example.py
4) Common Windows blocker: TorchAudio 2.9 + TorchCodec + FFmpeg
On some Windows setups, TorchAudio 2.9 routes audio I/O through TorchCodec/FFmpeg and that can explode with DLL errors.
Symptoms
- ImportError: TorchCodec is required...
- RuntimeError: Could not load libtorchcodec...
- "ffmpeg.exe - Entry Point Not Found" popups
- Gradio audio streaming crashes (pydub/ffprobe)
Working workaround (what you used): patch torchaudio.load/save/info to use soundfile
Create or overwrite:
%CONDA_PREFIX%\Lib\site-packages\sitecustomize.py
Use this (safe and simple):
import warnings

try:
    import torch
    import torchaudio
    import soundfile as sf
    import numpy as np
    import types
except Exception as e:
    warnings.warn(f"sitecustomize torchaudio patch not applied (imports failed): {e}")
else:
    def _load_soundfile(uri, frame_offset=0, num_frames=-1, channels_first=True, **kwargs):
        data, sr = sf.read(uri, dtype="float32", always_2d=True)  # [T, C]
        if frame_offset:
            data = data[int(frame_offset):]
        if num_frames is not None and int(num_frames) > 0:
            data = data[: int(num_frames)]
        if channels_first:
            data = np.transpose(data, (1, 0))  # [C, T]
        return torch.from_numpy(data), int(sr)

    def _save_soundfile(uri, src, sample_rate, channels_first=True, **kwargs):
        if isinstance(src, torch.Tensor):
            src = src.detach().cpu()
        x = src.numpy()
        y = x if x.ndim == 1 else (np.transpose(x, (1, 0)) if channels_first else x)
        y = np.nan_to_num(y, nan=0.0, posinf=0.0, neginf=0.0)
        y = np.clip(y, -1.0, 1.0)
        sf.write(uri, y, int(sample_rate))

    def _info_soundfile(uri, *args, **kwargs):
        i = sf.info(uri)
        return types.SimpleNamespace(sample_rate=int(i.samplerate), num_frames=int(i.frames), num_channels=int(i.channels))

    torchaudio.load = _load_soundfile
    torchaudio.save = _save_soundfile
    if not hasattr(torchaudio, "info"):
        torchaudio.info = _info_soundfile

    warnings.warn("Patched torchaudio.load/save/info to use soundfile (bypassing TorchCodec/FFmpeg).")
Restart the terminal and verify:
python -c "import torchaudio; print(hasattr(torchaudio,'info')); print(torchaudio.info(r'.\asset\zero_shot_prompt.wav').sample_rate)"
5) Gradio UI: use prompt audio as filepath, not numpy
A major “why does cloning sound wrong?” issue on Gradio is prompt audio scaling.
Symptom
- CLI cloning is perfect
- UI cloning sounds generic / not like the speaker
Fix
In the UI:
- use gr.Audio(type="filepath") for prompt audio,
- read it with soundfile yourself,
- resample to 24k.
(Using type="numpy" can cause amplitude scaling/clipping unless you normalize carefully.)
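If you do end up with type="numpy" (Gradio hands you a (sr, array) tuple, often int16), the "normalize carefully" part looks roughly like this. This is a sketch of the idea, not CosyVoice's own code:

```python
# Sketch: convert a Gradio type="numpy" prompt tuple (sr, array) into
# float32 mono in [-1, 1]. Illustrates "normalize carefully" -- this is
# not CosyVoice's own code.
import numpy as np

def to_float_mono(sr: int, data: np.ndarray) -> tuple[int, np.ndarray]:
    x = np.asarray(data)
    if np.issubdtype(x.dtype, np.integer):
        # e.g. int16 PCM: scale by the dtype's max absolute value
        scale = float(np.iinfo(x.dtype).max)
        x = x.astype(np.float32) / scale
    else:
        x = x.astype(np.float32)
    if x.ndim > 1:                     # (T, C) -> mono
        x = x.mean(axis=1)
    peak = float(np.max(np.abs(x))) if x.size else 0.0
    if peak > 1.0:                     # guard against clipping
        x = x / peak
    return sr, x
```

Even with this, the filepath route stays closer to CLI behavior, which is why the checklist recommends it.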
6) CV3 zero-shot cloning requires prompt transcript
For best similarity:
- prompt WAV alone is not enough
- supply the exact words spoken in the prompt clip (prompt transcript)
A reliable CV3 prompt format is:
<|endofprompt|> + (prompt transcript)
You discovered system prompts can strongly steer accent/style; keep them stable.
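As a tiny standalone sketch of that prompt convention (the function name is mine, not part of the CosyVoice API):

```python
# Sketch of the CV3 prompt-text convention described above:
# (optional system prompt) + "<|endofprompt|>" + prompt transcript.
# Function name is illustrative, not a CosyVoice API.
def build_cv3_prompt(prompt_transcript: str, system_prompt: str = "") -> str:
    sp = (system_prompt or "").strip()
    return sp + "<|endofprompt|>" + prompt_transcript.strip()
```

The transcript must be the exact words spoken in the prompt WAV; a stable system prompt keeps accent/style steering consistent across generations.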
7) Port in use error
If Gradio fails with:
[Errno 10048] ... port already in use
Pick a new port:
python webui_unified.py --port 50001 --model_dir pretrained_models/Fun-CosyVoice3-0.5B
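Instead of guessing at 50001, you can also ask the OS for a free port (pure stdlib) and pass that to --port:

```python
# Sketch: let the OS pick a free TCP port (bind to port 0), then pass the
# result to the webui via --port instead of hand-picking one.
import socket

def find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))   # port 0 = "any free port"
        return s.getsockname()[1]

if __name__ == "__main__":
    print(find_free_port())
```

Note there is a small race: another process could grab the port between this check and Gradio binding it, but in practice it avoids the 10048 error.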
“Top 5 gotchas” summary
- RTX 50‑series needs cu128 Torch (cu121/cu124 can fail with sm_120).
- Don’t let requirements.txt downgrade Torch.
- If TorchCodec/FFmpeg breaks, patch torchaudio.* to use soundfile.
- In Gradio, use a prompt audio filepath, not numpy (or normalize carefully).
- Always provide the prompt transcript for zero-shot similarity.
Here is mine, and now I'm outta here.
Thanks for your help, even if it didn't help me.
1️⃣ Initial Situation
You wanted to install CosyVoice3-0.5B with Gradio on Windows, on an RTX 3060 (12 GB).
You tried conda and pip.
You ran into a series of DLL and module import errors, including:
_C.pyd procedure entry point missing
vcruntime140.dll checks
Torch failing to load (shm.dll)
numpy.core.multiarray import failing
HuggingFace / Modelscope / Transformers / Whisper / tiktoken / kaldifst / wetext errors
2️⃣ Actions Taken
Checked for Visual C++ runtime (vcruntime140.dll) — it existed.
Tried reinstalling PyTorch + CUDA.
Tried installing missing packages (huggingface_hub, transformers, modelscope, tiktoken, wetext, kaldifst…).
Attempted version fixes for Transformers, PyTorch, Modelscope, etc.
Tried creating a clean Conda environment multiple times.
Discussed manually editing llm.py to replace Qwen2ForCausalLM with AutoModelForCausalLM.
3️⃣ Core Issues Identified
Package version conflicts were rampant.
Qwen2ForCausalLM does not exist in Transformers releases — CosyVoice code is incompatible with current Transformers.
Even after fixing all other missing packages and DLLs, CosyVoice itself requires a code change to run properly.
The environment and dependency issues are compounded by the hardcoded imports and assumptions in CosyVoice’s source code.
4️⃣ Honest Assessment
Despite fixing all Python packages, DLLs, and dependencies, CosyVoice will still not run on your machine without editing its source code (llm.py) to replace Qwen2ForCausalLM with a compatible model-loading approach.
Therefore, no amount of environment tweaking, package reinstalling, or DLL fixes alone will make CosyVoice work.
All the previous troubleshooting (DLLs, Torch, Modelscope, Transformers, etc.) did not actually solve the blocking problem.
⚠️ Conclusion
Clear Statement:
I could not help you get CosyVoice3-0.5B running without modifying the source code. All dependency fixes and environment resets are insufficient; the code itself is incompatible with the current Transformers library.
If you want, I can make a ready-to-run patched version of llm.py for your environment so it will start without further errors — that’s the only way it will actually run.
Do you want me to do that? (Edit from me: NOOOOOOOOO, thanks)
And after the rewrite of llm.py, the next 100 steps await you 🤪😂 Welcome to the ChatGPT madhouse.
Gotcha! I pasted your previous posts into my ChatGPT, and here's what it said:
Yeah — based on that list, I strongly suspect their install is failing due to a “poisoned” Python environment (DLL / ABI conflicts + mixed installs), not because CosyVoice “requires editing llm.py”.
1) The claim “Qwen2ForCausalLM doesn’t exist” is almost certainly false
Hugging Face’s Transformers docs for Qwen2 exist and Qwen2 was added to Transformers long ago; if they can’t import Qwen2ForCausalLM, they’re almost certainly on the wrong Transformers version (too old, or a broken install), not on “current Transformers”. (huggingface.co)
So the right advice to them is: don’t patch CosyVoice code first—verify their transformers is sane and matches the repo pin.
What I’d suggest to that user (CynicalSpore) to get unstuck
A) Don’t debug in a dirty env: create a clean one and verify it’s actually being used
In Anaconda Prompt:
conda create -n cosyvoice_win -y python=3.10
conda activate cosyvoice_win
where python
python --version
where pip
pip --version
If where python shows anything outside the env (or multiple Pythons), stop and fix PATH first.
B) Install CosyVoice with the pinned versions (especially transformers)
Inside the CosyVoice repo:
pip install -r requirements.txt
Then immediately verify the exact thing they claimed was missing:
python -c "import transformers; print(transformers.__version__); from transformers import Qwen2ForCausalLM; print('Qwen2ForCausalLM OK')"
If that fails:
- they’re not actually in the env they think they’re in, or
- transformers is corrupted, or
- requirements.txt didn’t install correctly.
Fix attempt:
pip uninstall -y transformers tokenizers
pip install transformers==4.51.3 tokenizers==0.21.4
python -c "from transformers import Qwen2ForCausalLM; print('OK')"
(Those versions match what CosyVoice commonly pins.)
C) Don’t chase 100 missing packages manually — it’s usually one ABI/DLL issue
Their errors like:
- numpy.core.multiarray failing
- _C.pyd entry point missing
- torch shm.dll failing to load
- random runtime DLL problems
…are classic signs of binary incompatibility from mixing:
- pip wheels + conda packages (especially numpy/scipy/torch),
- multiple channels (defaults + conda-forge) with no strict priority,
- or installing stuff into base + envs + user site-packages.
Quick “hard reset” inside the env:
pip uninstall -y numpy scipy
pip install numpy==1.26.4
python -c "import numpy as np; print(np.__version__)"
If numpy still fails there, the Windows Python install is seriously contaminated (often fixed by reinstalling Miniconda or using Miniforge).
D) Torch choice for RTX 3060 (12GB) is NOT the same as RTX 5090
For an RTX 3060 (Ampere, sm_86), they can usually stick with the repo’s pinned torch (often torch==2.3.1+cu121) and it should work.
Verify CUDA works:
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no gpu')"
python -c "import torch; x=torch.randn(64,64,device='cuda'); print((x@x).sum().item())"
If CUDA is false, they may need:
- newer NVIDIA driver
- or reinstall torch with an appropriate CUDA wheel.
E) Use CLI first; Gradio later
A ton of “Windows CosyVoice problems” are actually Gradio audio conversion problems (ffprobe/ffmpeg/pydub), not model inference problems.
Best path:
- python example.py (or your CLI script) first
- then the UI
If they insist on the UI:
- avoid Gradio streaming audio
- use prompt audio as filepath instead of numpy to avoid scaling/clipping issues (you saw this firsthand).
F) Don’t install “extra stuff” until you have one successful generation
Tell them to stop installing Whisper/tiktoken/onnxruntime-gpu/etc. until they can run:
python example.py
Once that works, add optional dependencies.
The shortest “tell me what to paste” request for them
If you can ask them for just these outputs, you can diagnose quickly:
where python
python --version
python -c "import torch; print('torch', torch.__version__); print('cuda', torch.cuda.is_available())"
python -c "import transformers; print('transformers', transformers.__version__); from transformers import Qwen2ForCausalLM; print('qwen2 ok')"
python -c "import numpy as np; print('numpy', np.__version__)"
If they paste those, you can tell whether they have:
- wrong env
- wrong transformers
- broken numpy
- broken torch
…without a “next 100 steps” spiral.
If you want, paste the exact CosyVoice requirements.txt lines around torch/transformers from your repo version, and I’ll write a single clean install recipe for RTX 3060 that avoids your 5090-specific Torch workaround.
Regarding your post before your last: I followed the steps and checked everything with "ChatGPT MadBro" after cleaning up the dirty environment. The madhouse gave me the nonsense that I posted. I double-checked everything four times, and every time a missing dependency popped up. I tried three times, and three times it was such a mess that it's kind of weird that everything else I run works great. I've installed other environments with Triton, Sage, Flash, whatever you want, but this one is just a mess. At the beginning I was told not to mix pip and conda, and three steps later the first pip command was presented by the crazy GPT, as I now call it... and it went the same way every time. The only option is a complete hard reset of Miniconda and a new install, but that would destroy all my environments, and unfortunately someone once told me you should never change a running system. This repo is not fully designed for Windows; the crazy GPT told me the same thing every time, that it should run better on Linux without any problems. Thank you for your kind words, but I'm not going to destroy my entire system over a TTS. I'll wait until some ComfyUI implementation comes up, or some other way. Ty buddy, but I'm out @dolphinfan
I used the RTX 3060 Torch builds, of course, because Blackwell builds don't run on sm_86 Ampere hardware; same story with the Comfy and Triton/Sage stuff...
CUDA 12.0/12.4/12.8
As I said, CUDA worked; I tested everything. All that was left was to rewrite the _C.pyd file, and that was my cue to say goodbye to all the nonsense I mentioned in previous posts. I'm not here to program "Windows 12" or "Windows 13". If a repo doesn't work, I'm not going to reinvent the wheel for something that's supposed to work as advertised, so I left. I'm not a programmer; I'm just someone who looks into things I want to try. If it's feasible, okay; if not, goodbye. I don't have 5 hours or more to debug things.
Quote:
wrong env
wrong transformers
broken numpy
broken torch
And yeah, all in all, that was the last thing ChatGPT mentioned to me, after it had broken and destroyed my install.
Hi, do not use ChatGPT to modify the code. Just install the requirements according to the README, and set text_frontend=False during inference.