Daniel777 committed on
Commit e7be22c · verified · 1 Parent(s): 0e6579f

Update README.md

Files changed (1)
  1. README.md +42 -25
README.md CHANGED
@@ -11,51 +11,68 @@ tags:

  # VTS

- ![VTS overview](./Thumbnail.png)

- VTS (Voice To Sound) generates sound effects from:
-
- - a short vocal sketch
  - a text prompt

- This repository hosts the pretrained checkpoint files for the older `voice_cond` VTS pipeline.

  ## Files

- - `model_voice_1030_24.pth`: main diffusion checkpoint
- - `vae_weight.pth`: VAE checkpoint used for decoding

  ## Download

  ```bash
  pip install -U "huggingface_hub"
- hf download Daniel777/VTS model_voice_1030_24.pth vae_weight.pth --local-dir ./checkpoints
  ```

  ## Usage

- Use these checkpoints with the companion `voice_text_sfx` codebase.

  ```bash
- python3 scripts/infer.py \
- --model-ckpt ./checkpoints/model_voice_1030_24.pth \
- --ae-ckpt ./checkpoints/vae_weight.pth \
- --prompt-audio /path/to/prompt.wav \
- --text "glassy swipe with rising pitch" \
- --output /tmp/generated.wav \
- --duration 3.0 \
- --steps 100 \
- --cfg-scale 6.0 \
  --device cuda
  ```
 
- ## Notes

- - This checkpoint matches the older `voice_cond` path.
- - It is not a drop-in checkpoint for later `script_embed` or `voice_prompt` variants.
- - This is a research checkpoint, not a packaged Hugging Face Inference API model.

- ## SHA256

- - `model_voice_1030_24.pth`: `a061bfb5e4fca61d8857c3056245304d0a421b55d4f86deca3b47442b08f5287`
- - `vae_weight.pth`: `45e2d5ab17e5bbb22dc533cd70798bb4ed96dbbe3487f6f20f5528fc9915558e`
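
The digests listed for the two older checkpoint files can be checked after download. The following is a minimal sketch and not part of the VTS codebase: the helper names `sha256_of` and `verify_checkpoints` are hypothetical, and the `./checkpoints` directory is assumed from the download command in the removed README.

```python
import hashlib
from pathlib import Path

# Digests copied from the SHA256 section of the removed README.
EXPECTED_SHA256 = {
    "model_voice_1030_24.pth": "a061bfb5e4fca61d8857c3056245304d0a421b55d4f86deca3b47442b08f5287",
    "vae_weight.pth": "45e2d5ab17e5bbb22dc533cd70798bb4ed96dbbe3487f6f20f5528fc9915558e",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoints(checkpoint_dir="./checkpoints"):
    """Return {file name: True/False}; a missing file counts as False."""
    results = {}
    for name, expected in EXPECTED_SHA256.items():
        path = Path(checkpoint_dir) / name
        results[name] = path.is_file() and sha256_of(path) == expected
    return results
```

Calling `verify_checkpoints("./checkpoints")` returns a mapping from file name to `True` or `False`, so a mismatch or a missing file can be reported without raising.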
 
  # VTS

+ VTS generates sound effects from:

+ - a short voice or audio sketch
  - a text prompt

+ This model repository hosts the pretrained checkpoint for the VTS inference
+ codebase.

  ## Files

+ - `dynamic_v3_0415.ckpt`: main VTS checkpoint
+
+ The companion inference repository downloads additional frozen components at
+ runtime, including `google/flan-t5-base` and vocoder files used by the local
+ `vts/vocos_custom` implementation.

  ## Download

  ```bash
  pip install -U "huggingface_hub"
+ hf download <your-user-or-org>/<your-model-repo> dynamic_v3_0415.ckpt --local-dir ./checkpoints
  ```
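
For scripted setups, the same download can also be done from Python. This is a sketch under assumptions: `fetch_checkpoint` is a hypothetical helper, the repo id is left as a parameter because the README only gives a placeholder, and the download relies on `hf_hub_download` from `huggingface_hub` with its `repo_id`, `filename`, and `local_dir` arguments.

```python
from pathlib import Path

# File name taken from the Files section of the new README.
CHECKPOINT_NAME = "dynamic_v3_0415.ckpt"


def fetch_checkpoint(repo_id, local_dir="./checkpoints"):
    """Download the VTS checkpoint into local_dir and return its path.

    repo_id must be the real "<user-or-org>/<model-repo>" string; the
    README's angle-bracket placeholder is rejected before any network call.
    """
    if "<" in repo_id or ">" in repo_id:
        raise ValueError("replace the placeholder repo id with the real one")
    # Imported lazily so the guard above works without the package installed.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id=repo_id,
        filename=CHECKPOINT_NAME,
        local_dir=local_dir,
    )
    return Path(path)
```

The returned path can then be passed to `infer.py` as the `--model-path` value.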
 
  ## Usage

+ Use this checkpoint with the companion `vts_inference` repository.

  ```bash
+ python -u infer.py \
+ --input-audio ./examples/voice.wav \
+ --text "scifi cannon charging and shooting" \
+ --temperature 0.7 \
+ --model-path ./checkpoints/dynamic_v3_0415.ckpt \
+ --output-dir ./outputs \
  --device cuda
  ```

+ ## Temperature Behavior
+
+ For normal inference, use `--temperature 0.7`. This keeps the original dynamic
+ conditioning from the input audio and runs the standard `generate` path.
+
+ - `< 0.6`: weak dynamic conditioning + `generate`
+ - `0.6 <= temperature < 0.8`: full dynamic conditioning + `generate`
+ - `>= 0.8`: input-audio latent mixing + `variation`
+
+ The input audio is not treated as a speaker embedding. It is converted into
+ frame-level dynamic features and, for high-temperature variation, also encoded
+ into the vocoder latent space.
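
The temperature thresholds listed above can be read as a three-way branch. The sketch below is illustrative only: `temperature_mode` is a hypothetical helper, not a function from the `vts_inference` repository, and the returned strings simply echo the README's wording.

```python
from typing import Tuple


def temperature_mode(temperature: float) -> Tuple[str, str]:
    """Map --temperature to (conditioning, sampling path) per the README thresholds."""
    if temperature < 0.6:
        return ("weak dynamic conditioning", "generate")
    if temperature < 0.8:
        return ("full dynamic conditioning", "generate")
    # 0.8 and above: the input audio is also mixed in as a latent.
    return ("input-audio latent mixing", "variation")
```

With the recommended `--temperature 0.7`, this falls in the middle branch: full dynamic conditioning on the `generate` path.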
+
+ ## Intended Use
+
+ This checkpoint is intended for research and creative sound-effect generation
+ from vocal sketches or short audio sketches plus text prompts.
+
+ ## Limitations

+ - The model is optimized for short sound-effect style clips.
+ - Output quality depends on checkpoint quality, input audio, prompt text, and
+ sampling settings.
+ - This is not packaged as a Hugging Face Inference API pipeline.

+ ## License

+ MIT.