ecker committed
Commit 6a76916 · 1 parent: be6c4d5

Update README.md

Files changed (1): README.md (+2 −0)
README.md CHANGED
@@ -96,6 +96,8 @@ This repo contains the following configurations under `./models/`:
  * Speedups are immense compared to the `ar+nar-llama-8`, as the entire audio output is decoded in parallel rather than causally.
  * Throughput and memory usage should be constant between inferencing steps.
  * The model only needs to be invoked about 5+25+7 times (duration inferencing + RVQ level 0 inferencing + remaining RVQ levels) instead.
+ * Seems to absolutely require classifier-free guidance to keep the output stable.
+ * The "confidence" issue on voices it hasn't seen, or hasn't seen much of, is much more noticeable, as RVQ level 0 is much more susceptible to it.
  * Unlike the base model, this is trained on the current dataset without iteratively dripfeeding additional sources (like tacking on Emilia afterwards).
  * ...except for STT: this received no STT training out of fear of botching the model.
  * Weights will be added as the model is trained.
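The classifier-free guidance the diff mentions is, in its usual formulation, a blend of conditional and unconditional logits at each sampling step. The following is a minimal sketch of that standard formulation only; the function name and the guidance scale are illustrative assumptions, not this repo's actual API or settings:

```python
import numpy as np

def cfg_logits(cond_logits: np.ndarray, uncond_logits: np.ndarray, scale: float = 3.0) -> np.ndarray:
    """Classifier-free guidance (standard formulation, illustrative only).

    Pushes the sampling distribution toward the conditioned prediction by
    extrapolating away from the unconditional one:
        guided = uncond + scale * (cond - uncond)
    scale = 1.0 recovers the plain conditional logits; larger values
    trade diversity for stability, which matches the note above that the
    parallel model needs CFG to keep its output stable.
    """
    return uncond_logits + scale * (cond_logits - uncond_logits)

# Toy example: two-token vocabulary, guidance sharpens the gap
# between what the conditioned and unconditioned passes prefer.
cond = np.array([1.0, 2.0])
uncond = np.array([0.0, 1.0])
guided = cfg_logits(cond, uncond, scale=2.0)  # -> [2.0, 3.0]
```

In practice this costs a second forward pass per step (one with the conditioning prompt, one without), which is part of why the fixed ~5+25+7 invocation budget above still comes out far cheaper than fully causal decoding.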