mrq committed · 337f9d2
1 Parent(s): 9a25eda
update readme
README.md CHANGED

@@ -112,6 +112,21 @@ This repo contains the following configurations under `./models/`:
* The output quality itself leaves a lot to be desired.
* Training is finalized for this model, as dedicating the training time to extending the base model with NAR-len capabilities is the better use of compute, but these weights will remain available for whom it may concern.

* `nemo-smaller-44khz-llama-8` / `nemo-larger-44khz-llama-8`: fully non-autoregressive, fully parallel models.
  * These models utilize:
    * a new implementation that aims to operate on *all* codebook levels at once, instead of one codebook level at a time (a rough sketch of the idea follows after this list).
    * `nvidia/audio-codec-44khz` as the audio codec, for 44KHz audio instead of EnCodec's 24KHz (a usage sketch also follows below).
      * `descript-audio-codec` can be revisited as an RVQ-based codec, since an RVQ codec is easier to train a model around than an FSQ one.
      * an EnCodec variant can also be revisited, as it's rather quick to get speech to emerge from a model trained against it.
  * At the moment, this approach doesn't seem to offer any benefits over the reference model.
    * Zero-shot performance / speaker similarity is piss-poor.
      * More training is required, but I ***suppose*** these models can be used to finetune on a speaker in the meantime.
      * The `proms_emb` might be weak compared to the `resps_emb`. I'm not sure of the best remedy for this.
    * Actual gains in audio quality don't seem all that apparent.
      * I imagine a post-training step similar to RLHF, with actual 44KHz audio rather than audio upsampled from 24KHz, could fix this.
    * The `smaller` model has the same inference speed as the `larger` model on my V100.
      * This could simply be because a V100 doesn't benefit from a narrower model, or because there's a bottleneck elsewhere (a crude timing probe is sketched below).

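For the unfamiliar, operating on all codebook levels at once roughly amounts to summing per-level token embeddings into one sequence and emitting logits for every level in a single non-causal forward pass. A minimal sketch of that shape, with hypothetical names rather than this repo's actual classes:

```python
# Minimal sketch (hypothetical names, not this repo's actual classes) of
# predicting all RVQ codebook levels in parallel: per-level embeddings are
# summed into one sequence, and per-level heads emit logits for every level
# in a single, fully parallel (non-causal) forward pass.
import torch
import torch.nn as nn

class ParallelCodebookModel(nn.Module):
    def __init__(self, n_levels=8, n_tokens=1000, d_model=1024, n_heads=16, n_layers=12):
        super().__init__()
        # one embedding table per codebook level (+1 slot for a mask token,
        # assuming NAR-len style masked training)
        self.embs = nn.ModuleList(
            nn.Embedding(n_tokens + 1, d_model) for _ in range(n_levels)
        )
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        # encoder with no causal mask: every position attends everywhere
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # one classification head per level, so all levels come out in one pass
        self.heads = nn.ModuleList(
            nn.Linear(d_model, n_tokens) for _ in range(n_levels)
        )

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: [batch, n_levels, seq_len] -> logits: [batch, n_levels, seq_len, n_tokens]
        x = sum(emb(codes[:, level]) for level, emb in enumerate(self.embs))
        h = self.encoder(x)
        return torch.stack([head(h) for head in self.heads], dim=1)
```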
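For reference, `nvidia/audio-codec-44khz` is loaded through NeMo's `AudioCodecModel`. A rough sketch of the encode/decode round trip, following the model card's usage (exact method signatures may differ across NeMo versions):

```python
# Rough sketch of tokenizing 44KHz audio with NeMo's AudioCodecModel,
# per the nvidia/audio-codec-44khz model card; signatures may vary by version.
import torch
import torchaudio
from nemo.collections.tts.models import AudioCodecModel

codec = AudioCodecModel.from_pretrained("nvidia/audio-codec-44khz").eval()

wav, sr = torchaudio.load("utterance.wav")            # [channels, samples]
wav = wav.mean(dim=0, keepdim=True)                   # mono; doubles as batch of 1
wav = torchaudio.functional.resample(wav, sr, 44100)  # codec expects 44.1KHz
wav_len = torch.tensor([wav.shape[-1]])

with torch.no_grad():
    # tokens: [batch, n_codebooks, frames] -- the sequence the LM trains on
    tokens, tokens_len = codec.encode(audio=wav, audio_len=wav_len)
    recon, recon_len = codec.decode(tokens=tokens, tokens_len=tokens_len)
```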
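As for the `smaller` / `larger` speed parity on the V100, a crude way to separate the forward pass from everything else is to time it in isolation with explicit synchronization. A throwaway probe (the `time_forward` helper is hypothetical, not part of the repo):

```python
# Throwaway latency probe (hypothetical helper): if `smaller` and `larger`
# report near-identical times here too, the forward pass itself is the
# limit; if they diverge, the equal end-to-end speed points at a bottleneck
# elsewhere (e.g. CPU-side sampling or data movement).
import time
import torch

@torch.no_grad()
def time_forward(model, batch, warmup=5, iters=20):
    for _ in range(warmup):
        model(batch)                # warm up kernels / allocator
    torch.cuda.synchronize()        # GPU work is async; sync before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(batch)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters
```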
Some additional configurations have been explored, but the experiments have not been fruitful:
* Exotic wrappers like `BitNet` somehow seemed to yield little gain at inference time. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters.
* Mamba / Mamba2-based models have shown that it's ***really*** hard to have an AR+NAR model. I really do not want to bother throwing compute at another ~~meme~~ arch where I can't easily make use of all the other tech I'd want to throw at it.