This is [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) with a Llama-3 vocabulary.

The intended use is as a draft model for Llama-3-70B-Instruct. Llama3-8B-Instruct works for this purpose, but it's on
the heavier side for drafting.
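
As a rough illustration of that use (not something from the original setup), the sketch below plugs the model in as the draft ("assistant") model through the `transformers` assisted-generation interface. The model IDs, dtypes and generation settings are assumptions, and backends such as exllamav2 expose speculative decoding through their own APIs instead.

```python
# Hypothetical usage sketch: speculative decoding with Qwama as the draft/assistant model.
# Model IDs, dtypes and max_new_tokens are illustrative, not settings from this repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
# Qwama shares the Llama-3 vocabulary, which is what lets it serve as the draft model here.
draft = AutoModelForCausalLM.from_pretrained(
    "turboderp/Qwama-0.5B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Explain speculative decoding in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(target.device)
output = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```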

The secondary purpose is to explore the feasibility of vocabulary swaps, either for adapting small models like
Qwen2-0.5B to produce drafts for other models, or for interoperability between dissimilar language models in general.
The conclusion in this regard is that the method works, but, since finetuning is required, it will be expensive for
larger models. It would be interesting to explore low-rank or quantized finetuning as an alternative.

## Procedure

The vocabulary was swapped by creating a new embedding layer (the original model uses tied embeddings, so the output layer is
the same) and initializing it as follows:

- every L3 token that is an exact match for a Qwen2 token is initialized with the corresponding embedding
- every L3 token that decodes and re-encodes to multiple Qwen2 tokens is initialized with the mean of those embeddings
- there are no L3 tokens that cannot be translated to one or more Qwen2 tokens (both vocabularies are complete)
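
A rough sketch of that initialization logic is given below. It is not the original conversion script: the exact-match case is approximated by checking whether a decoded L3 token re-encodes to a single Qwen2 token, and the fallback for tokens that re-encode to nothing (some control tokens, for instance) is an assumption.

```python
# Illustrative sketch only: build a Llama-3-sized embedding table from Qwen2-0.5B-Instruct's
# embeddings according to the rules listed above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

qwen = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
qwen_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
l3_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

old_emb = qwen.get_input_embeddings().weight.data
new_emb = torch.empty(len(l3_tok), old_emb.shape[1], dtype=old_emb.dtype)

for l3_id in range(len(l3_tok)):
    text = l3_tok.decode([l3_id])                               # decode the L3 token to text
    qwen_ids = qwen_tok.encode(text, add_special_tokens=False)  # re-encode with the Qwen2 tokenizer
    if len(qwen_ids) == 1:
        new_emb[l3_id] = old_emb[qwen_ids[0]]                   # exact (single-token) match
    elif len(qwen_ids) > 1:
        new_emb[l3_id] = old_emb[qwen_ids].mean(dim=0)          # mean of the Qwen2 embeddings
    else:
        new_emb[l3_id] = old_emb.mean(dim=0)                    # assumption: unmapped tokens get the mean embedding

# Swap in the new table; the model uses tied embeddings, so the output layer follows suit.
qwen.resize_token_embeddings(len(l3_tok))
qwen.get_input_embeddings().weight.data.copy_(new_emb)
qwen.tie_weights()
qwen.save_pretrained("Qwama-0.5B-init")
l3_tok.save_pretrained("Qwama-0.5B-init")
```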

Swapping the vocabulary with the above method yields a mostly coherent but still very confused model. It especially
struggles with numbers, and of course the embeddings for the Llama-3 control tokens do not have the significance they
would in an instruct-tuned model.

This is remedied by subsequent finetuning, first on
[this 2.41 million row sample from Common Crawl](https://huggingface.co/datasets/agentlans/common-crawl-sample), and
subsequently for 3 epochs on about 25,000 instruct-formatted completions produced by Llama3-8B-Instruct, included
[here](https://huggingface.co/turboderp/Qwama-0.5B-Instruct/blob/main/llama3-instruct-prompts.json) for reference.
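
For reference, a hypothetical sketch of what the first (plain-text) finetuning stage could look like with off-the-shelf tooling; the hyperparameters, sequence length, and the assumption that the Common Crawl sample exposes a `text` column are illustrative guesses rather than the actual training setup.

```python
# Hypothetical finetuning sketch for the Common Crawl stage; none of these settings are
# claimed to be the ones actually used for this model.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# "Qwama-0.5B-init" is the vocab-swapped checkpoint from the previous sketch.
tok = AutoTokenizer.from_pretrained("Qwama-0.5B-init")
if tok.pad_token is None:
    tok.pad_token = tok.eos_token                    # needed for padding in the collator
model = AutoModelForCausalLM.from_pretrained("Qwama-0.5B-init")

# Assumption: the dataset exposes a plain "text" column.
ds = load_dataset("agentlans/common-crawl-sample", split="train")
ds = ds.map(lambda batch: tok(batch["text"], truncation=True, max_length=1024),
            batched=True, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qwama-cc-ft", per_device_train_batch_size=8,
                           learning_rate=5e-5, num_train_epochs=1, bf16=True),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),   # causal-LM objective
)
trainer.train()
```
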
I did try tuning just the tied embeddings, but this didn't achieve good results.

## EXL2 Quants

EXL2 quants uploaded [here](https://huggingface.co/turboderp/Qwama-0.5B-Instruct-exl2).