This is [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) with a Llama-3 vocabulary.

The intended use is as a draft model for Llama-3-70B-Instruct. Llama3-8B-Instruct works for this purpose, but it's on
the heavier side for drafting.
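
As a rough illustration of that use (not something from the original setup), the sketch below plugs the model in as the draft ("assistant") model through the `transformers` assisted-generation interface. The model IDs, dtypes and generation settings are assumptions, and backends such as exllamav2 expose speculative decoding through their own APIs instead.

```python
# Hypothetical usage sketch: speculative decoding with Qwama as the draft/assistant model.
# Model IDs, dtypes and max_new_tokens are illustrative, not settings from this repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
# Qwama shares the Llama-3 vocabulary, which is what lets it serve as the draft model here.
draft = AutoModelForCausalLM.from_pretrained(
    "turboderp/Qwama-0.5B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Explain speculative decoding in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(target.device)
output = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```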

The secondary purpose is to explore the feasibility of vocabulary swaps, either for adapting small models like
Qwen2-0.5B to produce drafts for other models, or for interoperability between dissimilar language models in general.
The conclusion in this regard is that the method works, but, since finetuning is required, it will be expensive for
larger models. It would be interesting to explore low-rank or quantized finetuning as an alternative.

## Procedure

The vocabulary was swapped by creating a new embedding layer (the original model uses tied embeddings, so the output layer is
the same) and initializing it as follows:

- every L3 token that is an exact match for a Qwen2 token is initialized with the corresponding embedding
- every L3 token that decodes and re-encodes to multiple Qwen2 tokens is initialized with the mean of those embeddings
- there are no L3 tokens that cannot be translated to one or more Qwen2 tokens (both vocabularies are complete)
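
A rough sketch of that initialization logic is given below. It is not the original conversion script: the exact-match case is approximated by checking whether a decoded L3 token re-encodes to a single Qwen2 token, and the fallback for tokens that re-encode to nothing (some control tokens, for instance) is an assumption.

```python
# Illustrative sketch only: build a Llama-3-sized embedding table from Qwen2-0.5B-Instruct's
# embeddings according to the rules listed above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

qwen = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
qwen_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
l3_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

old_emb = qwen.get_input_embeddings().weight.data
new_emb = torch.empty(len(l3_tok), old_emb.shape[1], dtype=old_emb.dtype)

for l3_id in range(len(l3_tok)):
    text = l3_tok.decode([l3_id])                               # decode the L3 token to text
    qwen_ids = qwen_tok.encode(text, add_special_tokens=False)  # re-encode with the Qwen2 tokenizer
    if len(qwen_ids) == 1:
        new_emb[l3_id] = old_emb[qwen_ids[0]]                   # exact (single-token) match
    elif len(qwen_ids) > 1:
        new_emb[l3_id] = old_emb[qwen_ids].mean(dim=0)          # mean of the Qwen2 embeddings
    else:
        new_emb[l3_id] = old_emb.mean(dim=0)                    # assumption: unmapped tokens get the mean embedding

# Swap in the new table; the model uses tied embeddings, so the output layer follows suit.
qwen.resize_token_embeddings(len(l3_tok))
qwen.get_input_embeddings().weight.data.copy_(new_emb)
qwen.tie_weights()
qwen.save_pretrained("Qwama-0.5B-init")
l3_tok.save_pretrained("Qwama-0.5B-init")
```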

Swapping the vocabulary with the above method yields a mostly coherent but still very confused model. It especially
struggles with numbers, and of course the embeddings for the Llama-3 control tokens do not have the significance they
would in an instruct-tuned model.

This is remedied by subsequent finetuning, first on
[this 2.41 million row sample from Common Crawl](https://huggingface.co/datasets/agentlans/common-crawl-sample), and
subsequently for 3 epochs on about 25,000 instruct-formatted completions produced by Llama3-8B-Instruct, included
[here](https://huggingface.co/turboderp/Qwama-0.5B-Instruct/blob/main/llama3-instruct-prompts.json) for reference.
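
For reference, a hypothetical sketch of what the first (plain-text) finetuning stage could look like with off-the-shelf tooling; the hyperparameters, sequence length, and the assumption that the Common Crawl sample exposes a `text` column are illustrative guesses rather than the actual training setup.

```python
# Hypothetical finetuning sketch for the Common Crawl stage; none of these settings are
# claimed to be the ones actually used for this model.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# "Qwama-0.5B-init" is the vocab-swapped checkpoint from the previous sketch.
tok = AutoTokenizer.from_pretrained("Qwama-0.5B-init")
if tok.pad_token is None:
    tok.pad_token = tok.eos_token                    # needed for padding in the collator
model = AutoModelForCausalLM.from_pretrained("Qwama-0.5B-init")

# Assumption: the dataset exposes a plain "text" column.
ds = load_dataset("agentlans/common-crawl-sample", split="train")
ds = ds.map(lambda batch: tok(batch["text"], truncation=True, max_length=1024),
            batched=True, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qwama-cc-ft", per_device_train_batch_size=8,
                           learning_rate=5e-5, num_train_epochs=1, bf16=True),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),   # causal-LM objective
)
trainer.train()
```
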
I did try tuning just the tied embeddings, but this didn't achieve good results.

## EXL2 Quants

EXL2 quants uploaded [here](https://huggingface.co/turboderp/Qwama-0.5B-Instruct-exl2).