mrq committed on
Commit 72dfea2 · 1 Parent(s): 6d9b328
Files changed (2)
  1. README.md +4 -26
  2. models/ckpt/fp32.sft +0 -3
README.md CHANGED
@@ -105,7 +105,7 @@ This repo contains the following configurations under `./models/`:
  * These weights are mostly an experiment to ensure that a pure NAR model works (through demasking inferencing).
  * A ***lot*** of pain was put into trying to get something working, from implementation issues to dumb mistakes.
  * Throughput and memory usage should be constant between inferencing steps.
- * The model only needs to be invoked about 5+(25+7)*2 times (duration inferencing + RVQ level 0 inferencing + remaining RVQ levels, doubled for CFG) instead.
+ * The model only needs to be invoked about 5+(25+7)\*2 times (duration inferencing + RVQ level 0 inferencing + remaining RVQ levels, doubled for CFG) instead.
  * Seems to absolutely require classifier-free guidance >= 2.0 to keep the output stable (but this does replace the need for rep pen + low temp, even for the normal AR+NAR).
  * Unlike the base model, this is trained on the current dataset without iteratively drip-feeding additional sources (like tacking on Emilia afterwards).
  * This *was* slated as a dud until the final oversight was squashed in the inferencing code, but it works *almost-decently* as a TTS model.
@@ -118,37 +118,15 @@ This repo contains the following configurations under `./models/`:
  * `nvidia/audio-codec-44khz` as the audio codec, used for 44KHz audio instead of EnCodec for 24KHz.
  * `descript-audio-codec` can be revisited as an RVQ-based codec, since an RVQ codec is easier to train a model around than an FSQ codec.
  * an EnCodec variant can also be revisited, as it's rather quick to get a model to have speech emerge.
- * At the moment, it doesn't seem to offer any benefits over the reference model.
- * Zero-shot performance / speaker similarity is piss-poor.
- * More training is required, but I ***suppose*** these models can be used to finetune on a speaker in the meantime.
- * The `proms_emb` might be weak compared to the `resps_emb`. I'm not sure of the best remedy for this.
- * Actual audio quality doesn't seem all that apparent.
- * I imagine a post-training step similar to RLHF, with actual 44KHz audio rather than upsampled-from-24KHz audio, could fix this.
- * The `smaller` model has the same inference speed as the `larger` model on my V100.
- * This could either be because a V100 doesn't benefit from a narrow model, or because there's a bottleneck elsewhere.
+ * Notes can be found in the [implementation documentation](https://github.com/e-c-k-e-r/vall-e/blob/master/docs/models_v2.md).
 
  Some additional configurations have been explored, but experiments have not been fruitful:
  * Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.
  * Mamba / Mamba2-based models have shown that it's ***really*** hard to have an AR+NAR model. I really do not want to bother throwing the compute at another ~~meme~~ arch where I can't easily make use of all the other tech I'd throw at it.
  * a model using [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec/):
- + the 24KHz model will *not* converge no matter what. However, naively using just the first 8 RVQ levels might not be good enough, as there are too many codebooks for viable use.
- + the 44KHz model was erroneously assumed to be an even 44KHz, when in reality it's 44.1KHz. *All* of my audio had to be requantized, as there was some stuttering in it.
- + Because of this, training losses are high and it's having a hard time converging.
- + It has *sub-serviceable* output for the first 4 RVQ levels, but it's massive cope to try and use it as a model.
- + ~~I believe there's hope to use it when I requantize my audio properly.~~
- + Addendum: even after properly processing my audio, the loss is actually *worse* than before. I imagine DAC just cannot be used as an intermediary for an LM.
+ + the new implementation should be able to handle it.
  * a model with a causal size >1 (sampling more than one token for the AR):
- + re-using an existing model or training from scratch does not have fruitful results.
- + there's an inherent periodic stutter that doesn't seem able to be trained out, though it might just require exotic sampling methods.
- + unfortunately it requires either:
- + something similar to Medusa heads, where additional parameters perform speculative sampling, or
- + a solution similar to what VALL-E 2 uses with group token embeddings or whatever, which *will* harm the NAR tasks in an AR+NAR model.
- + I just don't understand where the issue lies, since parallel decoding does work, as evidenced with the NAR.
-
- Some current "architectural features" are in use, but their effects need further experimentation:
- * `split_classifier_heads` is still a mystery as to whether it's truly helpful or not (each RVQ level gets its own output head).
- * `audio_embeddings_sum` is also a mystery as to whether it matters that each later RVQ level "sees" the past levels through summed embeddings, or if not doing so is preferable.
- * Disabling `unified_position_ids` seems to help quality more often than not, but I'm still unsure if it's beneficial in practice.
+ * this seems a bit unnecessary, as the NAR-len modality addresses the downsides of the AR+NAR modality.
 
  ## LoRAs
 
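As context for the classifier-free-guidance requirement mentioned in the README hunk above (CFG >= 2.0 for stable output), here is a minimal, hypothetical sketch of how CFG is typically applied to logits; the function and values are illustrative, not taken from the repo's code:

```python
def cfg_logits(cond, uncond, scale=2.0):
    """Classifier-free guidance over logits: push the conditional
    prediction further away from the unconditional one.
    scale >= 2.0 is what the README reports as necessary for stability."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

# Toy logits over a 4-token codebook (hypothetical values).
cond = [2.0, 0.5, 1.5, -1.0]   # logits with the conditioning prompt
uncond = [1.0, 1.0, 1.0, 1.0]  # logits with the conditioning dropped
guided = cfg_logits(cond, uncond, scale=2.0)
```

Note that this is why CFG doubles the per-step cost: each inference step runs one conditional and one unconditional forward pass, matching the `(25+7)*2` factor in the invocation count above.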
models/ckpt/fp32.sft DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:c7121cf968016aea3e752182724dec63a4eae25d4dba08188c425fa051b813ff
- size 493527168
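The "demasking inferencing" that the README credits for making a pure NAR model work can be sketched as a MaskGIT-style loop: start from a fully masked sequence, and at each step keep the most confident predictions while re-masking the rest. The function names and the confidence schedule below are hypothetical illustrations, not the repo's actual implementation:

```python
MASK = -1  # hypothetical sentinel for a masked position

def demask_infer(seq_len, steps, predict):
    """Iteratively unmask a sequence over `steps` passes.

    `predict(tokens)` stands in for a NAR forward pass: it returns a
    (token, confidence) pair for every position. Each step keeps a
    progressively larger share of the most confident predictions.
    """
    tokens = [MASK] * seq_len
    for step in range(steps):
        preds = predict(tokens)
        # linear unmasking schedule: keep (step+1)/steps of positions
        keep = int(seq_len * (step + 1) / steps)
        order = sorted(range(seq_len), key=lambda i: -preds[i][1])
        tokens = [MASK] * seq_len
        for i in order[:keep]:
            tokens[i] = preds[i][0]
    return tokens

# Usage with a dummy predictor that always proposes token (i % 4)
# with full confidence, so 4 steps fully populate 8 positions.
out = demask_infer(8, 4, lambda toks: [(i % 4, 1.0) for i in range(len(toks))])
```

Because every position is predicted in parallel each step, the per-step throughput and memory usage stay constant, which is the behavior the README describes.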