mrq committed
Commit · 72dfea2
1 Parent(s): 6d9b328
- README.md +4 -26
- models/ckpt/fp32.sft +0 -3
README.md CHANGED
@@ -105,7 +105,7 @@ This repo contains the following configurations under `./models/`:
 * These weights are mostly an experiment to ensure that a pure NAR model works (through demasking inferencing).
 * A ***lot*** of pain went into getting something working, from implementation issues to dumb mistakes.
 * Throughput and memory usage should be constant between inferencing steps.
-* The model only needs to be invoked about 5+(25+7)
+* The model only needs to be invoked about 5+(25+7)\*2 times (duration inferencing + RVQ level 0 inferencing + remaining RVQ levels, doubled for CFG) instead; see the budget sketch below.
 * Seems to absolutely require classifier-free guidance >= 2.0 to keep the output stable (but this does replace the need for rep pen + low temp, even for the normal AR+NAR); a guidance sketch also follows below.
 * Unlike the base model, this is trained on the current dataset without iteratively drip-feeding additional sources (like tacking on Emilia afterwards).
 * This *was* slated as a dud until the final oversight was squashed in the inferencing code, but it works *almost-decently* as a TTS model.
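As a quick sanity check of the `5+(25+7)\*2` figure in the hunk above, here is a minimal back-of-the-envelope sketch; the constant names are illustrative assumptions, not the repo's actual API:

```python
# Back-of-the-envelope forward-pass budget for the pure NAR model, per the
# README's 5+(25+7)*2 figure; names are illustrative, not vall-e's API.
duration_steps   = 5   # passes to infer the output duration
demask_steps     = 25  # demasking passes for RVQ level 0
remaining_levels = 7   # one pass for each remaining RVQ level
cfg_passes       = 2   # conditional + unconditional pass for classifier-free guidance

total_passes = duration_steps + (demask_steps + remaining_levels) * cfg_passes
print(total_passes)  # 5 + (25 + 7) * 2 = 69 model invocations per utterance
```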
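The classifier-free guidance requirement above boils down to the usual two-pass logit combination. A minimal sketch, assuming plain `torch` tensors; `cfg_logits` is a hypothetical helper, not the repo's sampling code:

```python
import torch

def cfg_logits(cond: torch.Tensor, uncond: torch.Tensor, scale: float = 2.0) -> torch.Tensor:
    """Classifier-free guidance: push logits away from the unconditional pass.

    scale >= 2.0 per the note above; scale == 1.0 would reduce to the
    conditional logits alone.
    """
    return uncond + scale * (cond - uncond)

# Usage: run the model twice per step (with and without the text/prompt
# conditioning), then sample from the guided logits instead of leaning on
# repetition penalty + low temperature.
# guided = cfg_logits(cond_logits, uncond_logits, scale=2.0)
```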
@@ -118,37 +118,15 @@ This repo contains the following configurations under `./models/`:
 * `nvidia/audio-codec-44khz` as the audio codec, used for 44KHz audio instead of EnCodec for 24KHz.
 * `descript-audio-codec` can be revisited as an RVQ-based codec, since it's easier to train a model around an RVQ codec than an FSQ one
 * an EnCodec variant can also be revisited, as it's rather quick to get a model to have speech emerge
-* 
-* Zero-shot performance / speaker similarity is piss-poor
-* More training is required, but I ***suppose*** these models can be used to finetune on a speaker in the meantime
-* The `proms_emb` might be weak compared to the `resps_emb`. I'm not sure of the best remedy for this.
-* The actual gain in audio quality doesn't seem all that apparent
-* I imagine a post-training step similar to RLHF with actual 44KHz audio rather than upsampled-from-24KHz audio could fix this
-* The `smaller` model has the same inference speed as the `larger` model on my V100
-* This could simply be because a V100 doesn't benefit from a narrow model, or because there's a bottleneck elsewhere
+* Notes can be found in the [implementation documentation](https://github.com/e-c-k-e-r/vall-e/blob/master/docs/models_v2.md).
 
 Some additional configurations have been explored, but experiments have not been fruitful:
 * Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.
 * Mamba / Mamba2-based models have shown that it's ***really*** hard to have an AR+NAR model. I really do not want to throw compute at another ~~meme~~ arch where I can't easily make use of all the other tech.
 * a model using [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec/):
-  + the
-  + the 44KHz model was erroneously assumed to be an even 44KHz, when in reality it's 44.1KHz. *All* of my audio has to be requantized, as there's some stuttering in it.
-  + Because of this, training losses are high and it's having a hard time converging.
-  + It has *sub-serviceable* output for the first 4 RVQ levels, but it's massive cope to try to use it as a model.
-  + ~~I believe there's hope to use it when I requantize my audio properly.~~
-  + Addendum: even after properly processing my audio, the loss is actually *worse* than before. I imagine DAC just cannot be used as an intermediary for an LM.
+  + the new implementation should be able to handle it.
 * a model with a causal size >1 (sampling more than one token for the AR; see the sampling sketch below):
-
-  + there's an inherent periodic stutter that doesn't seem able to be trained out, though it might respond to exotic sampling methods.
-  + unfortunately it requires either:
-    + something similar to Medusa heads, where additional parameters perform speculative sampling, or
-    + a solution similar to what VALL-E 2 uses with grouped token embeddings or whatever, which *will* harm the NAR tasks in an AR+NAR model.
-  + I just don't understand where the issue lies, since parallel decoding does work, as evidenced by the NAR.
+  * this seems a bit unnecessary, as the NAR-len modality addresses the downsides of the AR+NAR modality.
-
-Some current "architectural features" are in use, but their effects need further experimentation (a sketch of the first two follows below):
-* `split_classifier_heads`: still a mystery whether it's truly helpful or not (each RVQ level gets its own output head).
-* `audio_embeddings_sum`: also a mystery whether it matters if each later RVQ level should "see" the past levels through summed embeddings, or if skipping that is preferable.
-* Disabling `unified_position_ids` seems to help quality more often than not, but I'm still unsure if it's beneficial in practice.
 
 ## LoRAs
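For the "causal size >1" experiment noted in the hunk above, sampling more than one AR token per step could look like the following sketch; the `[batch, k, vocab]` logit shape and the function name are assumptions for illustration, not the repo's implementation:

```python
import torch

def sample_k_tokens(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Sample k next tokens from a single AR forward pass.

    Assumes the model's head emits k stacked next-token distributions per
    step, i.e. logits shaped [batch, k, vocab] (Medusa-style heads or
    grouped token embeddings would produce something of this shape).
    """
    probs = torch.softmax(logits, dim=-1)            # [batch, k, vocab]
    flat = probs.reshape(-1, probs.shape[-1])        # [batch * k, vocab]
    tokens = torch.multinomial(flat, num_samples=1)  # one draw per distribution
    return tokens.reshape(logits.shape[0], k)        # k tokens appended per step
```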
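The removed notes above mention `split_classifier_heads` (each RVQ level gets its own output head) and `audio_embeddings_sum` (later levels "see" prior levels through summed embeddings). A minimal sketch of what those two options describe, with all sizes and module names assumed rather than taken from the repo:

```python
import torch
import torch.nn as nn

N_LEVELS, N_TOKENS, DIM = 8, 1024, 1024  # assumed RVQ levels / codebook size / width

# split_classifier_heads: one Linear head per RVQ level instead of a shared one.
heads = nn.ModuleList([nn.Linear(DIM, N_TOKENS) for _ in range(N_LEVELS)])

# One embedding table per RVQ level.
embs = nn.ModuleList([nn.Embedding(N_TOKENS, DIM) for _ in range(N_LEVELS)])

def embed_codes(codes: torch.Tensor, level: int, sum_levels: bool) -> torch.Tensor:
    """codes: [batch, seq, level + 1] token ids for RVQ levels 0..level.

    With audio_embeddings_sum enabled, the input for `level` is the sum of
    the embeddings of every level up to and including it; otherwise only
    the current level's embedding is used.
    """
    if sum_levels:
        return sum(embs[l](codes[..., l]) for l in range(level + 1))
    return embs[level](codes[..., level])
```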
models/ckpt/fp32.sft DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:c7121cf968016aea3e752182724dec63a4eae25d4dba08188c425fa051b813ff
-size 493527168