- every L3 token that decodes and re-encodes to multiple Qwen2 tokens is initialized with the mean of those embeddings
- there are no L3 tokens that cannot be translated to one or more Qwen2 tokens (both vocabularies are complete).

```python
for idx in range(target_vocab_size):
    # Decode each Llama-3 token ID to its string form (including special tokens)
    decode = tokenizer_target.decode(torch.tensor(idx, dtype = torch.long), decode_special_tokens = True)
    # Re-encode that string with the Qwen2 tokenizer
    encode = tokenizer_source.encode(decode, add_special_tokens = False, return_tensors = "pt")
    # New embedding and output-head rows are the mean of the corresponding Qwen2 rows
    new_emb[idx] = old_emb[encode.flatten()].mean(dim = 0)
    new_head[idx] = old_head[encode.flatten()].mean(dim = 0)
```

Full script is [here](https://huggingface.co/turboderp/Qwama-0.5B-Instruct/blob/main/vocab_transplant.py).

Swapping the vocabulary with the above method yields a mostly coherent but still very confused model. It especially
struggles with numbers, and of course the embeddings for the Llama-3 control tokens do not have the significance they
would in an instruct-tuned model.
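As a quick, self-contained illustration of the mean-initialization step, here is a toy sketch; the tensor shapes and the hard-coded `encode` result are made up for illustration (the real script derives `encode` from the tokenizers as above):

```python
import torch

# Toy stand-ins for the real matrices: a 4-row source (Qwen2) embedding
# matrix of width 3, and a 2-row target (Llama-3) matrix to fill in.
old_emb = torch.arange(12, dtype = torch.float).reshape(4, 3)
new_emb = torch.zeros(2, 3)

# Pretend target token 0 re-encodes to source tokens 1 and 3
encode = torch.tensor([[1, 3]])
new_emb[0] = old_emb[encode.flatten()].mean(dim = 0)

print(new_emb[0])  # mean of rows [3,4,5] and [9,10,11] -> tensor([6., 7., 8.])
```

Tokens with a one-to-one mapping reduce to the same operation with a single-element `encode`, so the mean is just a copy of the source row.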