Windows Support?
It works fine on Windows!
Download Miniconda (to access Anaconda Prompt): https://www.anaconda.com/docs/getting-started/miniconda/install#windows-installation
Github Repo: https://github.com/FunAudioLLM/CosyVoice
First step (in Anaconda Prompt):
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
git submodule update --init --recursive
Second step:
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
cd /d C:\Users\<YOUR-NAME-HERE>\CosyVoice
pip install --upgrade pip
pip install -r requirements.txt
Third step:
Download the model to...
C:\Users\<YOUR-NAME-HERE>\CosyVoice\pretrained_models\Fun-CosyVoice3-0.5B
You may run into errors, but I just used ChatGPT to correct them and get things running. One thing I had to do was uninstall Torch and upgrade to a newer version, since my computer has a new GPU. I also had to install x-transformers:
conda activate cosyvoice
pip install x-transformers
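After the installs above, a quick stdlib-only sanity check can confirm that the key packages actually import in the active env before you launch anything. This is just a sketch (the script name and the package list are my assumptions, based on the steps above; adjust to your requirements.txt):

```python
# check_env.py - minimal sketch: verify key packages are importable in the
# active conda env before running CosyVoice. The REQUIRED list is an
# assumption drawn from the install steps above.
import importlib.util

REQUIRED = ["torch", "torchaudio", "gradio", "soundfile", "numpy", "x_transformers"]

def missing_packages(names):
    """Return the subset of package names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All key packages found.")
```

Run it with `python check_env.py` inside the activated env; anything it reports missing is worth fixing before you touch the webui.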
Just be patient, because you'll have to jump through some hoops and make some tweaks. But it's definitely worth the wait!
Yeah, I did the same and it went downhill, so... all easy like you said.
I ended up with a nonfunctional Chinese Gradio UI and a bunch of errors, and ChatGPT gave up too, so I uninstalled it and let it go.
No, don't give up. You're almost there! LOL, I was just about to give up myself when I was in that exact same spot, but then the very next fix from ChatGPT got everything working!
So reinstall it, and have ChatGPT create a customized English Gradio UI for you (which is what it did for me).
You can have ChatGPT draw inspiration from the English Gradio that it custom-built for me (save it as 'webui_english.py' or something):
import os
import sys
import argparse
import tempfile
import random

import gradio as gr
import numpy as np
import soundfile as sf
import torch
import torchaudio

ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.join(ROOT_DIR, "third_party", "Matcha-TTS"))

from cosyvoice.cli.cosyvoice import AutoModel
from cosyvoice.utils.file_utils import logging
from cosyvoice.utils.common import set_all_random_seed

PROMPT_SR_MIN = 16000
PROMPT_SR_TARGET = 24000
MAX_ABS = 0.8


def generate_seed():
    seed = random.randint(1, 100000000)
    return {"__type__": "update", "value": seed}


def _to_mono(x: np.ndarray) -> np.ndarray:
    # x can be (T,) or (T, C)
    if x.ndim == 1:
        return x
    return x.mean(axis=1)


def _read_audio_file_float_mono(path: str):
    """
    Reads audio using soundfile.
    Returns (sr:int, mono_float32:np.ndarray).
    NOTE: soundfile handles WAV/FLAC/OGG(Vorbis) depending on libsndfile build.
    If you upload MP3/M4A/etc, this may fail (and would need FFmpeg, which is broken in your env).
    """
    try:
        data, sr = sf.read(path, dtype="float32", always_2d=True)  # [T, C]
        mono = data.mean(axis=1).astype(np.float32)  # [T]
        return int(sr), mono
    except Exception as e:
        raise gr.Error(
            "Could not read the prompt audio file. "
            "Please upload a WAV (recommended) or FLAC. "
            f"Details: {e}"
        )


def _write_prompt_to_temp_wav(prompt_audio_path: str, target_sr: int = PROMPT_SR_TARGET) -> str:
    """
    Takes a prompt audio FILEPATH (from Gradio), reads it cleanly,
    resamples to target_sr, writes a temp WAV, returns its path.
    """
    sr, mono = _read_audio_file_float_mono(prompt_audio_path)
    # Resample to target_sr using torchaudio (no FFmpeg)
    x = torch.from_numpy(mono).unsqueeze(0)  # [1, T]
    if int(sr) != int(target_sr):
        x = torchaudio.functional.resample(x, int(sr), int(target_sr))
    y = x.squeeze(0).cpu().numpy().astype(np.float32)
    # Keep prompt from clipping (usually unnecessary for clean WAV, but safe)
    y = np.nan_to_num(y, nan=0.0, posinf=0.0, neginf=0.0)
    peak = float(np.max(np.abs(y))) if y.size else 0.0
    if peak > 1.0:
        y = y / peak
    fd, outpath = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    sf.write(outpath, y, int(target_sr))
    return outpath


def _get_sr(path: str) -> int:
    return int(sf.info(path).samplerate)


def _collect_audio(gen, sample_rate: int) -> np.ndarray:
    chunks = []
    for i in gen:
        t = i["tts_speech"]
        if isinstance(t, torch.Tensor):
            t = t.detach().cpu()
            if t.ndim > 1:
                t = t.reshape(-1)
            arr = t.numpy().astype(np.float32)
        else:
            arr = np.asarray(t, dtype=np.float32).reshape(-1)
        chunks.append(arr)
    if not chunks:
        return np.zeros(sample_rate, dtype=np.float32)
    audio = np.concatenate(chunks)
    audio = np.nan_to_num(audio, nan=0.0, posinf=0.0, neginf=0.0)
    audio = np.clip(audio, -MAX_ABS, MAX_ABS)
    return audio.astype(np.float32)


def _call_with_optional_kwargs(fn, *args, **kwargs):
    """
    CosyVoice variants differ slightly in accepted kwargs (e.g., speed).
    Try calling with kwargs; if TypeError, retry without unknown kwargs.
    """
    try:
        return fn(*args, **kwargs)
    except TypeError:
        return fn(*args)


def build_ui(cosyvoice, args):
    # Rough capability detection
    has_sft = hasattr(cosyvoice, "inference_sft")
    has_zero_shot = hasattr(cosyvoice, "inference_zero_shot")
    has_cross = hasattr(cosyvoice, "inference_cross_lingual")
    has_instruct2 = hasattr(cosyvoice, "inference_instruct2")
    has_instruct = hasattr(cosyvoice, "inference_instruct")

    # List speakers if available
    sft_spk = []
    try:
        sft_spk = cosyvoice.list_available_spks() or []
    except Exception:
        sft_spk = []

    modes = []
    if has_sft and len(sft_spk) > 0:
        modes.append("Pretrained tone (SFT)")
    if has_zero_shot:
        modes.append("Zero-shot clone (prompt audio + transcript)")
    if has_cross:
        modes.append("Cross-lingual clone (prompt audio)")
    if has_instruct2 or has_instruct:
        modes.append("Instruct (control)")
    if not modes:
        raise RuntimeError("No supported inference modes detected for this model.")

    def generate_audio(
        tts_text,
        mode,
        sft_dropdown,
        system_prompt,
        prompt_transcript,
        prompt_audio_upload_path,
        prompt_audio_record_path,
        instruct_text,
        seed,
        speed,
    ):
        # Non-stream UI only (avoid ffprobe/pydub ADTS conversion)
        if not (tts_text and str(tts_text).strip()):
            raise gr.Error("Please enter text to synthesize.")

        # Choose prompt source (filepath)
        prompt_audio_path = prompt_audio_upload_path or prompt_audio_record_path
        prompt_wav_path = None
        if mode in [
            "Zero-shot clone (prompt audio + transcript)",
            "Cross-lingual clone (prompt audio)",
            "Instruct (control)",
        ]:
            if not prompt_audio_path:
                raise gr.Error("Please provide a prompt audio clip (upload or record).")
            # Convert prompt to clean 24k wav
            prompt_wav_path = _write_prompt_to_temp_wav(prompt_audio_path, target_sr=PROMPT_SR_TARGET)
            sr = _get_sr(prompt_wav_path)
            if sr < PROMPT_SR_MIN:
                gr.Warning(f"Prompt sample rate {sr} < {PROMPT_SR_MIN}. Results may degrade.")

        set_all_random_seed(int(seed) if seed is not None else 0)

        # ----- Mode: SFT -----
        if mode == "Pretrained tone (SFT)":
            if not sft_dropdown:
                raise gr.Error("No pretrained tone selected.")
            logging.info("SFT inference request")
            gen = _call_with_optional_kwargs(
                cosyvoice.inference_sft,
                tts_text,
                sft_dropdown,
                stream=False,
                speed=speed,
            )
            audio = _collect_audio(gen, cosyvoice.sample_rate)
            return (int(cosyvoice.sample_rate), audio)

        # ----- Mode: Zero-shot -----
        if mode == "Zero-shot clone (prompt audio + transcript)":
            if not (prompt_transcript and prompt_transcript.strip()):
                raise gr.Error("Prompt transcript is required (the words spoken in the prompt audio).")
            # CV3 convention: system + <|endofprompt|> + transcript
            # CV2 convention: transcript-only often works
            pt = prompt_transcript.strip()
            sp = (system_prompt or "").strip()
            if sp:
                prompt_text = sp + "<|endofprompt|>" + pt
            else:
                prompt_text = pt
            logging.info("Zero-shot inference request")
            gen = _call_with_optional_kwargs(
                cosyvoice.inference_zero_shot,
                tts_text,
                prompt_text,
                prompt_wav_path,
                stream=False,
                speed=speed,
            )
            audio = _collect_audio(gen, cosyvoice.sample_rate)
            return (int(cosyvoice.sample_rate), audio)

        # ----- Mode: Cross-lingual -----
        if mode == "Cross-lingual clone (prompt audio)":
            logging.info("Cross-lingual inference request")
            gen = _call_with_optional_kwargs(
                cosyvoice.inference_cross_lingual,
                tts_text,
                prompt_wav_path,
                stream=False,
                speed=speed,
            )
            audio = _collect_audio(gen, cosyvoice.sample_rate)
            return (int(cosyvoice.sample_rate), audio)

        # ----- Mode: Instruct -----
        if mode == "Instruct (control)":
            if not (instruct_text and instruct_text.strip()):
                raise gr.Error("Instruct text is required for this mode.")
            logging.info("Instruct inference request")
            if has_instruct2:
                # CosyVoice2/3 style
                gen = _call_with_optional_kwargs(
                    cosyvoice.inference_instruct2,
                    tts_text,
                    instruct_text,
                    prompt_wav_path,
                    stream=False,
                    speed=speed,
                )
            else:
                # Older CosyVoice1 style (kept for compatibility)
                gen = _call_with_optional_kwargs(
                    cosyvoice.inference_instruct,
                    tts_text,
                    sft_dropdown if sft_dropdown else "",
                    instruct_text,
                    stream=False,
                    speed=speed,
                )
            audio = _collect_audio(gen, cosyvoice.sample_rate)
            return (int(cosyvoice.sample_rate), audio)

        raise gr.Error(f"Unsupported mode: {mode}")

    with gr.Blocks() as demo:
        gr.Markdown("## CosyVoice Unified WebUI (Windows-safe, non-stream)")
        gr.Markdown(
            f"**Model dir:** `{args.model_dir}` \n"
            "Notes: returns full audio (no Gradio streaming) to avoid ffprobe/pydub issues."
        )
        tts_text = gr.Textbox(
            label="Text to synthesize",
            lines=4,
            value="Hello! This is a local CosyVoice test.",
        )
        with gr.Row():
            mode = gr.Radio(choices=modes, label="Mode", value=modes[0])
            seed_btn = gr.Button("🎲 Random seed")
            seed = gr.Number(value=0, label="Seed")
            speed = gr.Number(value=1.0, label="Speed (if supported)", minimum=0.5, maximum=2.0, step=0.1)

        # Prompt audio (FILEPATH) — closest to CLI behavior
        with gr.Row():
            prompt_audio_upload_path = gr.Audio(
                sources=["upload"],
                type="filepath",
                label="Prompt audio (upload) — recommended: WAV/FLAC",
            )
            prompt_audio_record_path = gr.Audio(
                sources=["microphone"],
                type="filepath",
                label="Prompt audio (record) — if this fails, upload WAV instead",
            )

        # CV3-friendly prompt fields
        system_prompt = gr.Textbox(
            label="System prompt (optional; CV3 typically uses this)",
            lines=2,
            value="<|endofprompt|>",
        )
        prompt_transcript = gr.Textbox(
            label="Prompt transcript (required for Zero-shot mode)",
            lines=2,
            placeholder="Type the exact words spoken in the prompt audio…",
        )

        # SFT selection (only relevant if model supports it)
        sft_dropdown = gr.Dropdown(
            choices=(sft_spk if len(sft_spk) > 0 else [""]),
            label="Pretrained tone (SFT speaker)",
            value=(sft_spk[0] if len(sft_spk) > 0 else ""),
        )
        instruct_text = gr.Textbox(
            label="Instruct text (required for Instruct mode)",
            lines=2,
            placeholder="E.g., 'Please speak much faster.<|endofprompt|>'",
        )

        btn = gr.Button("Generate")
        out_audio = gr.Audio(label="Output (full audio)", autoplay=True, type="numpy")

        seed_btn.click(generate_seed, inputs=[], outputs=seed)
        btn.click(
            generate_audio,
            inputs=[
                tts_text,
                mode,
                sft_dropdown,
                system_prompt,
                prompt_transcript,
                prompt_audio_upload_path,
                prompt_audio_record_path,
                instruct_text,
                seed,
                speed,
            ],
            outputs=[out_audio],
        )

    return demo


def main():
    cosyvoice = AutoModel(model_dir=args.model_dir)
    demo = build_ui(cosyvoice, args)
    demo.queue(max_size=4, default_concurrency_limit=2)
    demo.launch(server_name="0.0.0.0", server_port=args.port, share=False)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int, default=50000)
    parser.add_argument(
        "--model_dir",
        type=str,
        default="pretrained_models/Fun-CosyVoice3-0.5B",
        help="Local path to a model folder (recommended on Windows).",
    )
    args = parser.parse_args()
    main()
If you installed everything your system needs, this should run perfectly for you. If not, just copy the error message into ChatGPT and let it do the troubleshooting and heavy lifting. That's how I got mine to work flawlessly.
Don't upload prompt audio over 30 seconds, otherwise you'll get an error. And try to keep your "text to synthesize" no more than 200 characters or so. If you want to generate longer clips, have ChatGPT write you another python script that generates audio from longer transcripts (directly in Anaconda Prompt).
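For the longer-clip case, the kind of helper script that advice is pointing at might look something like this: split the transcript into roughly 200-character chunks at sentence boundaries, synthesize each chunk, then concatenate the audio. The 200-character limit comes from the advice above; the splitting heuristic and function name are just my illustration (a very long single sentence will still exceed the limit):

```python
# Sketch: split a long transcript into ~200-character chunks at sentence
# boundaries, so each chunk can be synthesized separately and the resulting
# audio concatenated. Heuristic and names are illustrative, not CosyVoice API.
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    # Split on sentence-ending punctuation, keeping the punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Feed each chunk to the model in a loop and concatenate the returned arrays with numpy, the same way `_collect_audio` in the webui script concatenates streamed chunks.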
Cheers! 🍻
Alright, I'll give it another run... I will post if something goes downhill again.
/edit I've completely given up. It's absolute chaos, a dependency hell. Never again, but thank you for trying to help me. I'm not going to program "Windows 12 and Windows 13" at the same time... For me, it just felt like chaos, not a solution. I asked ChatGPT to help me, but every time it said: "Oh, you have to do this..." Hours later: oh yes, right, that has to be renamed back to what it was at the beginning. A cycle of endless doom.
Sorry, I threw it away until someone makes a reasonable attempt for Windows users that is easy to use...
/edit some typos fixed
It can run on Windows. Make sure to set text_frontend=False, because text_normalize may not be supported on Windows.
2 hours? Insane.
Dang! Sorry to hear that, man.
When you restarted the process, did you wipe out all of the previous data from the first attempt? Maybe overlapping packages/dependencies F'd things up?
Yes, it does require a lot of persistence and patience. It took me hours, too, but it definitely paid off. If I had time, I'd do a YouTube video of an A-to-Z install, but eventually someone will probably do that and make it a lot easier. These long installs used to make me question my sanity, too, haha, so I know how you feel. Good luck, man!
I was starting with a fresh new env, but I ran into a loop, and ChatGPT led me through that loop again and again. It had me rename cosyvoice3.yaml to cosyvoice.yaml, which is where things went down the drain; later it couldn't remember that step, and the result was literally a sauce of CosyVoice 3 mixed into CosyVoice 2 or some such. I asked myself what kind of BS ChatGPT had mixed together... it was just a mixed-up, nonfunctional mess in the end. When I then asked ChatGPT "how about the Gradio thingy?", it was fully over. It destroyed the last bit of whatever remained, which was cluttered and nonfunctional by that point anyway.
Yeah, it definitely led you in the wrong direction. I didn't have to touch the .yaml files or CosyVoice2 on my install. You shouldn't have had to do ANY of that.
After I did the basic install and installed what was needed for my new GPU, GPT-5.2 modified my Gradio to English and switched something with numpy. After that, everything just magically worked very easily!
Aside from the basic install and git clone, all you really should've needed to do was tell ChatGPT:
- specifically which GPU you have
- tell it to only work on installing CosyVoice 3 0.5B (forget about CosyVoice 2; it's not needed for any of this)
- ask which version of torch and torchaudio is best for your GPU to get this working
- give it the README.md and give it the Gradio python code I posted above (for an example of a working English Gradio version for CosyVoice 3 0.5B)
Aside from that, you should really be good. I don't know which path your ChatGPT took, but that's wild, man! If you try again and run into a major block, post the error here and I can ask my ChatGPT what's going wrong. Since it already knows how to get this working for me, it might be in a better position to help than your own ChatGPT.
If I have some time after the holidays, I might try to install it on another computer and create an easy-to-follow guide. Good luck!
Btw, I just asked my ChatGPT to summarize the blocks/fixes that I ran into to get things working for me on Windows. Maybe this will help some of you...
CosyVoice3‑0.5B on Windows: practical checklist
0) Use a fresh conda env (avoid “mystery DLL soup”)
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
1) Clone repo + submodules
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd /d CosyVoice
git submodule update --init --recursive
Verify submodule:
git submodule status --recursive
2) Install Python requirements but don’t let them downgrade your Torch
CosyVoice’s requirements.txt may pin torch==2.3.1 and a cu121 extra-index. That breaks RTX 5090 / sm_120.
Recommended order:
- install repo requirements
- then override Torch to a Blackwell-capable build
2A) Install requirements
pip install -r requirements.txt
2B) Fix Torch for RTX 50‑series (Blackwell / sm_120)
If you have an RTX 50‑series laptop GPU (e.g., 5090 Laptop), install cu128:
pip uninstall -y torch torchaudio torchvision
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
Sanity test CUDA:
python -c "import torch; print(torch.__version__); print('CUDA:', torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no gpu')"
python -c "import torch; x=torch.randn(64,64,device='cuda'); print((x@x).sum().item())"
If the matmul errors, your Torch build still doesn’t support your GPU.
2C) Prevent accidental Torch downgrade
Edit requirements.txt and remove/comment lines like:
--extra-index-url https://download.pytorch.org/whl/cu121
torch==2.3.1
torchaudio==2.3.1
Otherwise, rerunning pip install -r requirements.txt can re-break CUDA.
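If you'd rather automate step 2C than hand-edit the file, a small throwaway script can comment out the Torch pins and the PyTorch extra-index line. The patterns below are assumptions; check what your copy of requirements.txt actually pins before running anything like this:

```python
# Sketch: comment out lines in requirements.txt that would re-pin Torch.
# The PIN_PATTERNS list is an assumption -- verify against your own file.
import re

PIN_PATTERNS = [
    r"^\s*--extra-index-url\s+.*download\.pytorch\.org",
    r"^\s*torch==",
    r"^\s*torchaudio==",
    r"^\s*torchvision==",
]

def comment_out_pins(text: str) -> str:
    out = []
    for line in text.splitlines():
        if any(re.match(p, line) for p in PIN_PATTERNS):
            out.append("# " + line)  # keep the line visible, but disabled
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

Apply it with something like `Path("requirements.txt").write_text(comment_out_pins(Path("requirements.txt").read_text()))` after making a backup copy.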
3) Put models in the exact folders the code expects
For example.py, the folder names matter.
Example structure:
CosyVoice\
    pretrained_models\
        Fun-CosyVoice3-0.5B\    (CV3-0.5B-2512 downloaded/renamed here)
        CosyVoice2-0.5B\
        CosyVoice-300M\
        ...
Then run:
python example.py
4) Common Windows blocker: TorchAudio 2.9 + TorchCodec + FFmpeg
On some Windows setups, TorchAudio 2.9 routes audio I/O through TorchCodec/FFmpeg and that can explode with DLL errors.
Symptoms
- ImportError: TorchCodec is required...
- RuntimeError: Could not load libtorchcodec...
- "ffmpeg.exe - Entry Point Not Found" popups
- Gradio audio streaming crashes (pydub/ffprobe)
Working workaround (what you used): patch torchaudio.load/save/info to use soundfile
Create or overwrite:
%CONDA_PREFIX%\Lib\site-packages\sitecustomize.py
Use this (safe and simple):
import warnings

try:
    import torch
    import torchaudio
    import soundfile as sf
    import numpy as np
    import types
except Exception as e:
    warnings.warn(f"sitecustomize torchaudio patch not applied (imports failed): {e}")
else:
    def _load_soundfile(uri, frame_offset=0, num_frames=-1, channels_first=True, **kwargs):
        data, sr = sf.read(uri, dtype="float32", always_2d=True)  # [T, C]
        if frame_offset:
            data = data[int(frame_offset):]
        if num_frames is not None and int(num_frames) > 0:
            data = data[: int(num_frames)]
        if channels_first:
            data = np.transpose(data, (1, 0))  # [C, T]
        return torch.from_numpy(data), int(sr)

    def _save_soundfile(uri, src, sample_rate, channels_first=True, **kwargs):
        if isinstance(src, torch.Tensor):
            src = src.detach().cpu()
        x = src.numpy()
        y = x if x.ndim == 1 else (np.transpose(x, (1, 0)) if channels_first else x)
        y = np.nan_to_num(y, nan=0.0, posinf=0.0, neginf=0.0)
        y = np.clip(y, -1.0, 1.0)
        sf.write(uri, y, int(sample_rate))

    def _info_soundfile(uri, *args, **kwargs):
        i = sf.info(uri)
        return types.SimpleNamespace(sample_rate=int(i.samplerate), num_frames=int(i.frames), num_channels=int(i.channels))

    torchaudio.load = _load_soundfile
    torchaudio.save = _save_soundfile
    if not hasattr(torchaudio, "info"):
        torchaudio.info = _info_soundfile

    warnings.warn("Patched torchaudio.load/save/info to use soundfile (bypassing TorchCodec/FFmpeg).")
Restart the terminal and verify:
python -c "import torchaudio; print(hasattr(torchaudio,'info')); print(torchaudio.info(r'.\asset\zero_shot_prompt.wav').sample_rate)"
5) Gradio UI: use prompt audio as filepath, not numpy
A major “why does cloning sound wrong?” issue on Gradio is prompt audio scaling.
Symptom
- CLI cloning is perfect
- UI cloning sounds generic / not like the speaker
Fix
In the UI:
- use gr.Audio(type="filepath") for prompt audio,
- read it with soundfile yourself,
- resample to 24k.
(Using type="numpy" can cause amplitude scaling/clipping unless you normalize carefully.)
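If you do end up with type="numpy" (Gradio hands you a (sr, array) tuple, often int16), the "normalize carefully" part looks roughly like this. This is a sketch of the idea, not CosyVoice's own code:

```python
# Sketch: convert a Gradio type="numpy" prompt tuple (sr, array) into
# float32 mono in [-1, 1]. Illustrates "normalize carefully" -- this is
# not CosyVoice's own code.
import numpy as np

def to_float_mono(sr: int, data: np.ndarray) -> tuple[int, np.ndarray]:
    x = np.asarray(data)
    if np.issubdtype(x.dtype, np.integer):
        # e.g. int16 PCM: scale by the dtype's max absolute value
        scale = float(np.iinfo(x.dtype).max)
        x = x.astype(np.float32) / scale
    else:
        x = x.astype(np.float32)
    if x.ndim > 1:                     # (T, C) -> mono
        x = x.mean(axis=1)
    peak = float(np.max(np.abs(x))) if x.size else 0.0
    if peak > 1.0:                     # guard against clipping
        x = x / peak
    return sr, x
```

Even with this, the filepath route stays closer to CLI behavior, which is why the checklist recommends it.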
6) CV3 zero-shot cloning requires prompt transcript
For best similarity:
- prompt WAV alone is not enough
- supply the exact words spoken in the prompt clip (prompt transcript)
A reliable CV3 prompt format is:
<|endofprompt|> + (prompt transcript)
You discovered system prompts can strongly steer accent/style; keep them stable.
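As a tiny standalone sketch of that prompt convention (the function name is mine, not part of the CosyVoice API):

```python
# Sketch of the CV3 prompt-text convention described above:
# (optional system prompt) + "<|endofprompt|>" + prompt transcript.
# Function name is illustrative, not a CosyVoice API.
def build_cv3_prompt(prompt_transcript: str, system_prompt: str = "") -> str:
    sp = (system_prompt or "").strip()
    return sp + "<|endofprompt|>" + prompt_transcript.strip()
```

The transcript must be the exact words spoken in the prompt WAV; a stable system prompt keeps accent/style steering consistent across generations.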
7) Port in use error
If Gradio fails with:
[Errno 10048] ... port already in use
Pick a new port:
python webui_unified.py --port 50001 --model_dir pretrained_models/Fun-CosyVoice3-0.5B
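Instead of guessing at 50001, you can also ask the OS for a free port (pure stdlib) and pass that to --port:

```python
# Sketch: let the OS pick a free TCP port (bind to port 0), then pass the
# result to the webui via --port instead of hand-picking one.
import socket

def find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))   # port 0 = "any free port"
        return s.getsockname()[1]

if __name__ == "__main__":
    print(find_free_port())
```

Note there is a small race: another process could grab the port between this check and Gradio binding it, but in practice it avoids the 10048 error.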
“Top 5 gotchas” summary
- RTX 50‑series needs cu128 Torch (cu121/cu124 can fail with sm_120).
- Don’t let requirements.txt downgrade Torch.
- If TorchCodec/FFmpeg breaks, patch torchaudio.* to use soundfile.
- In Gradio, use a prompt audio filepath, not numpy (or normalize carefully).
- Always provide the prompt transcript for zero-shot similarity.
Here is mine, and now I'm outta here.
Thanks for your help, even if it didn't help me.
1️⃣ Initial Situation
You wanted to install CosyVoice3-0.5B with Gradio on Windows, on an RTX 3060 (12 GB).
You tried conda and pip.
You ran into a series of DLL and module import errors, including:
_C.pyd procedure entry point missing
vcruntime140.dll checks
Torch failing to load (shm.dll)
numpy.core.multiarray import failing
HuggingFace / Modelscope / Transformers / Whisper / tiktoken / kaldifst / wetext errors
2️⃣ Actions Taken
Checked for Visual C++ runtime (vcruntime140.dll) — it existed.
Tried reinstalling PyTorch + CUDA.
Tried installing missing packages (huggingface_hub, transformers, modelscope, tiktoken, wetext, kaldifst…).
Attempted version fixes for Transformers, PyTorch, Modelscope, etc.
Tried creating a clean Conda environment multiple times.
Discussed manually editing llm.py to replace Qwen2ForCausalLM with AutoModelForCausalLM.
3️⃣ Core Issues Identified
Package version conflicts were rampant.
Qwen2ForCausalLM does not exist in Transformers releases — CosyVoice code is incompatible with current Transformers.
Even after fixing all other missing packages and DLLs, CosyVoice itself requires a code change to run properly.
The environment and dependency issues are compounded by the hardcoded imports and assumptions in CosyVoice’s source code.
4️⃣ Honest Assessment
Despite fixing all Python packages, DLLs, and dependencies, CosyVoice will still not run on your machine without editing its source code (llm.py) to replace Qwen2ForCausalLM with a compatible model-loading approach.
Therefore, no amount of environment tweaking, package reinstalling, or DLL fixes alone will make CosyVoice work.
All the previous troubleshooting (DLLs, Torch, Modelscope, Transformers, etc.) did not actually solve the blocking problem.
⚠️ Conclusion
Clear Statement:
I could not help you get CosyVoice3-0.5B running without modifying the source code. All dependency fixes and environment resets are insufficient; the code itself is incompatible with the current Transformers library.
If you want, I can make a ready-to-run patched version of llm.py for your environment so it will start without further errors — that’s the only way it will actually run.
Do you want me to do that? (Edit from me: NOOOOOOOOO, thanks)
And after the rewrite of llm.py, the next 100 steps await you 🤪😂 Welcome to the ChatGPT madhouse.
Gotcha! I pasted your previous posts into my ChatGPT, and here's what it said:
Yeah — based on that list, I strongly suspect their install is failing due to a “poisoned” Python environment (DLL / ABI conflicts + mixed installs), not because CosyVoice “requires editing llm.py”.
1) The claim “Qwen2ForCausalLM doesn’t exist” is almost certainly false
Hugging Face’s Transformers docs for Qwen2 exist and Qwen2 was added to Transformers long ago; if they can’t import Qwen2ForCausalLM, they’re almost certainly on the wrong Transformers version (too old, or a broken install), not on “current Transformers”. (huggingface.co)
So the right advice to them is: don’t patch CosyVoice code first—verify their transformers is sane and matches the repo pin.
What I’d suggest to that user (CynicalSpore) to get unstuck
A) Don’t debug in a dirty env: create a clean one and verify it’s actually being used
In Anaconda Prompt:
conda create -n cosyvoice_win -y python=3.10
conda activate cosyvoice_win
where python
python --version
where pip
pip --version
If where python shows anything outside the env (or multiple Pythons), stop and fix PATH first.
B) Install CosyVoice with the pinned versions (especially transformers)
Inside the CosyVoice repo:
pip install -r requirements.txt
Then immediately verify the exact thing they claimed was missing:
python -c "import transformers; print(transformers.__version__); from transformers import Qwen2ForCausalLM; print('Qwen2ForCausalLM OK')"
If that fails:
- they’re not actually in the env they think they’re in, or
- transformers is corrupted, or
- requirements.txt didn’t install correctly.
Fix attempt:
pip uninstall -y transformers tokenizers
pip install transformers==4.51.3 tokenizers==0.21.4
python -c "from transformers import Qwen2ForCausalLM; print('OK')"
(Those versions match what CosyVoice commonly pins.)
C) Don’t chase 100 missing packages manually — it’s usually one ABI/DLL issue
Their errors like:
- numpy.core.multiarray failing
- _C.pyd entry point missing
- torch shm.dll failing to load
- random runtime DLL problems
…are classic signs of binary incompatibility from mixing:
- pip wheels + conda packages (especially numpy/scipy/torch),
- multiple channels (defaults + conda-forge) with no strict priority,
- or installing stuff into base + envs + user site-packages.
Quick “hard reset” inside the env:
pip uninstall -y numpy scipy
pip install numpy==1.26.4
python -c "import numpy as np; print(np.__version__)"
If numpy still fails there, the Windows Python install is seriously contaminated (often fixed by reinstalling Miniconda or using Miniforge).
D) Torch choice for RTX 3060 (12GB) is NOT the same as RTX 5090
For an RTX 3060 (Ampere, sm_86), they can usually stick with the repo’s pinned torch (often torch==2.3.1+cu121) and it should work.
Verify CUDA works:
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no gpu')"
python -c "import torch; x=torch.randn(64,64,device='cuda'); print((x@x).sum().item())"
If CUDA is false, they may need:
- newer NVIDIA driver
- or reinstall torch with an appropriate CUDA wheel.
E) Use CLI first; Gradio later
A ton of “Windows CosyVoice problems” are actually Gradio audio conversion problems (ffprobe/ffmpeg/pydub), not model inference problems.
Best path:
- python example.py (or your CLI script) first
- then the UI
If they insist on the UI:
- avoid Gradio streaming audio
- use prompt audio as filepath instead of numpy to avoid scaling/clipping issues (you saw this firsthand).
F) Don’t install “extra stuff” until you have one successful generation
Tell them to stop installing Whisper/tiktoken/onnxruntime-gpu/etc. until they can run:
python example.py
Once that works, add optional dependencies.
The shortest “tell me what to paste” request for them
If you can ask them for just these outputs, you can diagnose quickly:
where python
python --version
python -c "import torch; print('torch', torch.__version__); print('cuda', torch.cuda.is_available())"
python -c "import transformers; print('transformers', transformers.__version__); from transformers import Qwen2ForCausalLM; print('qwen2 ok')"
python -c "import numpy as np; print('numpy', np.__version__)"
If they paste those, you can tell whether they have:
- wrong env
- wrong transformers
- broken numpy
- broken torch
…without a “next 100 steps” spiral.
If you want, paste the exact CosyVoice requirements.txt lines around torch/transformers from your repo version, and I’ll write a single clean install recipe for RTX 3060 that avoids your 5090-specific Torch workaround.
Regarding your post before your last: I followed the steps and checked everything with "ChatGPT MadBro" after cleaning up the dirty environment. The madhouse gave me the nonsense that I posted. I double-checked everything four times, and every time a missing dependency popped up. I tried three times, and three times it was such a mess that it's kind of weird that everything else I run works great. I've installed other environments with Triton, Sage, Flash, whatever you want, but this one is just a mess. At the beginning I was told not to mix pip and conda, and three steps later the first pip command was presented by the crazy GPT, as I now call it... and it went the same way every time. The only option is a complete hard reset of Miniconda and a new install, but that would destroy all my environments, and unfortunately someone once told me you should never change a running system. This repo is not fully designed for Windows; the crazy GPT told me the same thing every time, that it should run better on Linux without any problems. Thank you for your kind words, but I'm not going to destroy my entire system over a TTS. I'll wait until some ComfyUI implementation comes up, or some other way. Ty buddy, but I'm out @dolphinfan
I used the RTX 3060 Torch builds, of course, because Blackwell builds don't run on sm_86 Ampere hardware; same story with the Comfy and Triton/Sage stuff...
CUDA 12.0/12.4/12.8
As I said, CUDA worked; I tested everything. All that was left was to rewrite the _C.pyd file, and that was my cue to say goodbye to all the nonsense I mentioned in previous posts. I'm not here to program "Windows 12" or "Windows 13". If a repo doesn't work, I'm not going to reinvent the wheel for something that's supposed to work as advertised, so I left. I'm not a programmer; I'm just someone who looks into things I want to try. If it's feasible, okay; if not, goodbye. I don't have 5 hours or more to debug things.
Quote:
wrong env
wrong transformers
broken numpy
broken torch
And yeah, all in all, that was the last thing ChatGPT mentioned to me, after it had broken and destroyed my install.
Hi, do not use ChatGPT to modify the code. Just install the requirements according to the README, and set text_frontend=False during inference.