mrq committed
Commit · 72dfea2
1 Parent(s): 6d9b328
- README.md +4 -26
- models/ckpt/fp32.sft +0 -3
README.md CHANGED
@@ -105,7 +105,7 @@ This repo contains the following configurations under `./models/`:
 * These weights are mostly an experiment to ensure that a pure NAR model works (through demasking inferencing).
 * A ***lot*** of pain went into getting something working, from implementation issues to dumb mistakes.
 * Throughput and memory usage should be constant between inferencing steps.
-* The model only needs to be invoked about 5+(25+7)
+* The model only needs to be invoked about 5+(25+7)\*2 times (duration inferencing + RVQ level 0 inferencing + remaining RVQ levels, doubled for CFG) instead; see the budget sketch below.
 * Seems to absolutely require classifier-free guidance >= 2.0 to keep the output stable (but this does replace the need for rep pen + low temp, even for the normal AR+NAR); a guidance sketch also follows below.
 * Unlike the base model, this is trained on the current dataset without iteratively drip-feeding additional sources (like tacking on Emilia afterwards).
 * This *was* slated as a dud until the final oversight was squashed in the inferencing code, but it works *almost-decently* as a TTS model.
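As a quick sanity check of the `5+(25+7)\*2` figure in the hunk above, here is a minimal back-of-the-envelope sketch; the constant names are illustrative assumptions, not the repo's actual API:

```python
# Back-of-the-envelope forward-pass budget for the pure NAR model, per the
# README's 5+(25+7)*2 figure; names are illustrative, not vall-e's API.
duration_steps   = 5   # passes to infer the output duration
demask_steps     = 25  # demasking passes for RVQ level 0
remaining_levels = 7   # one pass for each remaining RVQ level
cfg_passes       = 2   # conditional + unconditional pass for classifier-free guidance

total_passes = duration_steps + (demask_steps + remaining_levels) * cfg_passes
print(total_passes)  # 5 + (25 + 7) * 2 = 69 model invocations per utterance
```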
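The classifier-free guidance requirement above boils down to the usual two-pass logit combination. A minimal sketch, assuming plain `torch` tensors; `cfg_logits` is a hypothetical helper, not the repo's sampling code:

```python
import torch

def cfg_logits(cond: torch.Tensor, uncond: torch.Tensor, scale: float = 2.0) -> torch.Tensor:
    """Classifier-free guidance: push logits away from the unconditional pass.

    scale >= 2.0 per the note above; scale == 1.0 would reduce to the
    conditional logits alone.
    """
    return uncond + scale * (cond - uncond)

# Usage: run the model twice per step (with and without the text/prompt
# conditioning), then sample from the guided logits instead of leaning on
# repetition penalty + low temperature.
# guided = cfg_logits(cond_logits, uncond_logits, scale=2.0)
```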
@@ -118,37 +118,15 @@ This repo contains the following configurations under `./models/`:
 * `nvidia/audio-codec-44khz` as the audio codec, used for 44KHz audio instead of EnCodec for 24KHz.
 * `descript-audio-codec` can be revisited as an RVQ-based codec, since it's easier to train a model around an RVQ codec than an FSQ one
 * an EnCodec variant can also be revisited, as it's rather quick to get a model to have speech emerge
-* 
-* Zero-shot performance / speaker similarity is piss-poor
-* More training is required, but I ***suppose*** these models can be used to finetune on a speaker in the meantime
-* The `proms_emb` might be weak compared to the `resps_emb`. I'm not sure of the best remedy for this.
-* The actual gain in audio quality doesn't seem all that apparent
-* I imagine a post-training step similar to RLHF with actual 44KHz audio rather than upsampled-from-24KHz audio could fix this
-* The `smaller` model has the same inference speed as the `larger` model on my V100
-* This could simply be because a V100 doesn't benefit from a narrow model, or because there's a bottleneck elsewhere
+* Notes can be found in the [implementation documentation](https://github.com/e-c-k-e-r/vall-e/blob/master/docs/models_v2.md).
 
 Some additional configurations have been explored, but experiments have not been fruitful:
 * Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.
 * Mamba / Mamba2-based models have shown that it's ***really*** hard to have an AR+NAR model. I really do not want to throw compute at another ~~meme~~ arch where I can't easily make use of all the other tech.
 * a model using [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec/):
-  + the
-  + the 44KHz model was erroneously assumed to be an even 44KHz, when in reality it's 44.1KHz. *All* of my audio has to be requantized, as there's some stuttering in it.
-  + Because of this, training losses are high and it's having a hard time converging.
-  + It has *sub-serviceable* output for the first 4 RVQ levels, but it's massive cope to try to use it as a model.
-  + ~~I believe there's hope to use it when I requantize my audio properly.~~
-  + Addendum: even after properly processing my audio, the loss is actually *worse* than before. I imagine DAC just cannot be used as an intermediary for an LM.
+  + the new implementation should be able to handle it.
 * a model with a causal size >1 (sampling more than one token for the AR; see the sampling sketch below):
-
-  + there's an inherent periodic stutter that doesn't seem able to be trained out, though it might respond to exotic sampling methods.
-  + unfortunately it requires either:
-    + something similar to Medusa heads, where additional parameters perform speculative sampling, or
-    + a solution similar to what VALL-E 2 uses with grouped token embeddings or whatever, which *will* harm the NAR tasks in an AR+NAR model.
-  + I just don't understand where the issue lies, since parallel decoding does work, as evidenced by the NAR.
+  * this seems a bit unnecessary, as the NAR-len modality addresses the downsides of the AR+NAR modality.
-
-Some current "architectural features" are in use, but their effects need further experimentation (a sketch of the first two follows below):
-* `split_classifier_heads`: still a mystery whether it's truly helpful or not (each RVQ level gets its own output head).
-* `audio_embeddings_sum`: also a mystery whether it matters if each later RVQ level should "see" the past levels through summed embeddings, or if skipping that is preferable.
-* Disabling `unified_position_ids` seems to help quality more often than not, but I'm still unsure if it's beneficial in practice.
 
 ## LoRAs
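For the "causal size >1" experiment noted in the hunk above, sampling more than one AR token per step could look like the following sketch; the `[batch, k, vocab]` logit shape and the function name are assumptions for illustration, not the repo's implementation:

```python
import torch

def sample_k_tokens(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Sample k next tokens from a single AR forward pass.

    Assumes the model's head emits k stacked next-token distributions per
    step, i.e. logits shaped [batch, k, vocab] (Medusa-style heads or
    grouped token embeddings would produce something of this shape).
    """
    probs = torch.softmax(logits, dim=-1)            # [batch, k, vocab]
    flat = probs.reshape(-1, probs.shape[-1])        # [batch * k, vocab]
    tokens = torch.multinomial(flat, num_samples=1)  # one draw per distribution
    return tokens.reshape(logits.shape[0], k)        # k tokens appended per step
```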
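The removed notes above mention `split_classifier_heads` (each RVQ level gets its own output head) and `audio_embeddings_sum` (later levels "see" prior levels through summed embeddings). A minimal sketch of what those two options describe, with all sizes and module names assumed rather than taken from the repo:

```python
import torch
import torch.nn as nn

N_LEVELS, N_TOKENS, DIM = 8, 1024, 1024  # assumed RVQ levels / codebook size / width

# split_classifier_heads: one Linear head per RVQ level instead of a shared one.
heads = nn.ModuleList([nn.Linear(DIM, N_TOKENS) for _ in range(N_LEVELS)])

# One embedding table per RVQ level.
embs = nn.ModuleList([nn.Embedding(N_TOKENS, DIM) for _ in range(N_LEVELS)])

def embed_codes(codes: torch.Tensor, level: int, sum_levels: bool) -> torch.Tensor:
    """codes: [batch, seq, level + 1] token ids for RVQ levels 0..level.

    With audio_embeddings_sum enabled, the input for `level` is the sum of
    the embeddings of every level up to and including it; otherwise only
    the current level's embedding is used.
    """
    if sum_levels:
        return sum(embs[l](codes[..., l]) for l in range(level + 1))
    return embs[level](codes[..., level])
```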
models/ckpt/fp32.sft DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:c7121cf968016aea3e752182724dec63a4eae25d4dba08188c425fa051b813ff
-size 493527168