chongli17/TokAlign-Pythia-1b-LLaMA3-Tokenizer


chongli17 committed on Jun 4, 2025

Commit 2742f38 · verified · 1 Parent(s): a769010

Update README.md

Files changed (1)

README.md +24 -3

README.md CHANGED

@@ -1,3 +1,24 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+---
+# Model Card for TokAlign-Pythia-1b-LLaMA3-Tokenizer
+The model is initialized from [Pythia-1b](https://huggingface.co/EleutherAI/pythia-1b), its tokenizer is replaced with the [LLaMA3 tokenizer](https://huggingface.co/meta-llama/Llama-3.1-8B), and it is fine-tuned for 5k steps for vocabulary adaptation.
+# Code
+The code used to train this model is available in the [GitHub](https://github.com/ZNLP/TokAlign) repo.
+# Citation
+```
+@inproceedings{li-etal-2025-TokAlign,
+  author    = {Chong Li and
+               Jiajun Zhang and
+               Chengqing Zong},
+  title = "TokAlign: Efficient Vocabulary Adaptation via Token Alignment",
+  booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
+  year = "2025",
+  address = "Vienna, Austria",
+  publisher = "Association for Computational Linguistics",
+}
+```
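
The README in this commit does not include a usage snippet. As a minimal sketch, and not part of the commit itself, the following shows one way the checkpoint might be loaded with the `transformers` library. It assumes the repository ships standard `transformers`-compatible weights and tokenizer files, which the diff does not confirm; the helper names `load_tokalign` and `generate` are hypothetical and chosen only for illustration.

```python
# Repository id taken from the page header of this commit.
REPO_ID = "chongli17/TokAlign-Pythia-1b-LLaMA3-Tokenizer"


def load_tokalign(repo_id: str = REPO_ID):
    """Return (tokenizer, model) for the given Hub repository.

    Assumption: the repo contains files loadable through the Auto* classes.
    """
    # Imported lazily so that defining these helpers does not require
    # transformers (or network access) until they are actually called.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tokenizer, model


def generate(prompt: str, max_new_tokens: int = 20) -> str:
    """Hypothetical helper: continue `prompt` with the adapted model."""
    tokenizer, model = load_tokalign()
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("Hello, world")` would download the checkpoint on first use. Because the model was fine-tuned with the LLaMA3 vocabulary, the tokenizer should be loaded from this repository rather than from the original Pythia-1b repository.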