Update README.md

README.md

@@ -119,12 +119,11 @@ This repo contains the following configurations under `./models/`:
 * `descript-audio-codec` can be revisited as an RVQ-based codec, since RVQ codecs are easier to train a model around than an FSQ codec
 * an EnCodec variant can also be revisited, as it's rather quick to get speech to emerge from a model
 * Notes can be found in the [implementation documentation](https://github.com/e-c-k-e-r/vall-e/blob/master/docs/models_v2.md).
+* A proof-of-concept LoRA is provided.
 
 Some additional configurations have been explored, but experiments with them have not been fruitful:
 * Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are largely unnecessary, as the models are already manageable at ~200M parameters.
 * Mamba / Mamba2-based models have shown that it's ***really*** hard to get an AR+NAR model working. I really do not want to bother throwing compute at another ~~meme~~ arch where I can't easily apply all the other tech.
-* a model using [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec/):
-  * the new implementation should be able to handle it.
 * a model with a causal size >1 (sampling more than one token per step in the AR):
   * this seems a bit unnecessary, as the NAR-len modality addresses the downsides of the AR+NAR modality.
 
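The "causal size >1" idea (sampling more than one token per AR forward pass) can be illustrated with a toy sketch. This is not this repo's API: `ar_decode` and `toy_logits` are hypothetical stand-ins, with greedy sampling for simplicity.

```python
import numpy as np

VOCAB = 10

def ar_decode(logits_fn, prompt, steps, causal_size=1):
    """Autoregressive decode that emits `causal_size` tokens per forward pass."""
    seq = list(prompt)
    for _ in range(steps):
        logits = logits_fn(np.array(seq))   # stand-in model: (causal_size, VOCAB)
        assert logits.shape[0] == causal_size
        seq.extend(int(t) for t in logits.argmax(axis=-1))  # greedy, for illustration
    return seq

def toy_logits(seq, causal_size):
    """Deterministic stand-in 'model': the next tokens count upward modulo VOCAB."""
    nxt = (seq[-1] + 1 + np.arange(causal_size)) % VOCAB
    out = np.zeros((causal_size, VOCAB))
    out[np.arange(causal_size), nxt] = 1.0
    return out

# causal_size=2 emits two tokens per forward pass, halving the number of steps
print(ar_decode(lambda s: toy_logits(s, 2), [0], steps=3, causal_size=2))
# → [0, 1, 2, 3, 4, 5, 6]
```

The trade-off the bullet above alludes to: fewer forward passes per generated second of audio, at the cost of each step predicting positions it hasn't yet conditioned on.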
@@ -138,3 +137,5 @@ The only caveat is that my original dataset *does* contain (most of) these samples
 * However, the base model already has *almost adequate* output from these speakers, but not enough to be satisfactory.
 
 LoRAs under `ckpt[ar+nar-old-llama-8]` are LoRAs married to an older checkpoint, while `ckpt` *should* work under the reference model.
+
+LoRAs under `ckpt[nemo-larger-44khz-llama-8]` are LoRAs married to the `nemo-larger-44khz-llama-8` model.
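For reference, "married to a checkpoint" here means the LoRA's low-rank delta only makes sense on top of the exact base weights it was trained against. A generic sketch of how such a delta folds into a base weight (hypothetical shapes, not this repo's loader):

```python
import numpy as np

def merge_lora(W, A, B, alpha=1.0):
    """Fold a low-rank adapter into a base weight: W' = W + alpha * (B @ A)."""
    return W + alpha * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, rank = 16, 32, 4             # hypothetical layer shapes
W = rng.normal(size=(d_out, d_in))        # base checkpoint weight
A = rng.normal(size=(rank, d_in)) * 0.01  # LoRA "down" projection
B = np.zeros((d_out, rank))               # LoRA "up" projection, zero-initialized
merged = merge_lora(W, A, B)
# with B zero-initialized, merging is a no-op until the adapter is trained
```

Because the delta is relative to `W`, applying the same `A`/`B` to a different base checkpoint shifts its weights in directions that were never trained, which is why the `ckpt[...]` naming above pins each LoRA to its base model.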
The updated section reads:

* `descript-audio-codec` can be revisited as an RVQ-based codec, since RVQ codecs are easier to train a model around than an FSQ codec
* an EnCodec variant can also be revisited, as it's rather quick to get speech to emerge from a model
* Notes can be found in the [implementation documentation](https://github.com/e-c-k-e-r/vall-e/blob/master/docs/models_v2.md).
* A proof-of-concept LoRA is provided.

Some additional configurations have been explored, but experiments with them have not been fruitful:
* Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are largely unnecessary, as the models are already manageable at ~200M parameters.
* Mamba / Mamba2-based models have shown that it's ***really*** hard to get an AR+NAR model working. I really do not want to bother throwing compute at another ~~meme~~ arch where I can't easily apply all the other tech.
* a model with a causal size >1 (sampling more than one token per step in the AR):
  * this seems a bit unnecessary, as the NAR-len modality addresses the downsides of the AR+NAR modality.

[...]

* However, the base model already has *almost adequate* output from these speakers, but not enough to be satisfactory.

LoRAs under `ckpt[ar+nar-old-llama-8]` are LoRAs married to an older checkpoint, while `ckpt` *should* work under the reference model.

LoRAs under `ckpt[nemo-larger-44khz-llama-8]` are LoRAs married to the `nemo-larger-44khz-llama-8` model.
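The RVQ-vs-FSQ distinction behind the codec bullets above can be sketched in a few lines. This is a toy illustration with made-up shapes and random codebooks, not DAC's or EnCodec's actual implementation: RVQ learns codebooks and quantizes successive residuals, while FSQ simply bounds each latent dimension and rounds it to a fixed grid.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """RVQ: each codebook quantizes the residual left over by the previous one."""
    residual = x.copy()
    codes = []
    for cb in codebooks:                        # cb: (codebook_size, dim)
        idx = int(((residual - cb) ** 2).sum(axis=1).argmin())
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def fsq_encode(x, levels):
    """FSQ: bound each dimension and round it to a fixed grid; no learned codebook."""
    bounded = np.tanh(x) * (np.asarray(levels) - 1) / 2
    return np.round(bounded).astype(int)        # one small integer per dimension

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]  # 3 RVQ levels, toy sizes
x = rng.normal(size=4)
print(rvq_encode(x, codebooks))     # one code index per RVQ level
print(fsq_encode(x, [5, 5, 5, 5]))  # per-dimension codes in [-2, 2]
```

The coarse-to-fine structure of RVQ codes (each level refines the last) is what the first bullet suggests is easier to train a model around than FSQ's flat, independent per-dimension codes.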