---
license: apache-2.0
library_name: transformers
pipeline_tag: feature-extraction
tags:
- tokenizer
- embeddings
- unicode
- feature-extraction
---

# bvv241-2-3: Unicode-Based Tokenizer with Precomputed Frozen Embeddings

This model is a tokenizer and associated precomputed frozen embeddings presented in the paper [Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations](https://huggingface.co/papers/2507.04886).

Code: https://github.com/AVBochkov/Embeddings

## Tokenizer Description

This tokenizer is based on a hybrid vocabulary with a strictly structured Unicode mapping scheme:

- Plane 0 (0–65535): all single Unicode code points (monograms) are mapped 1:1 to token codes, directly matching the standard Unicode BMP.
- Private and unused code ranges within Plane 0 (e.g., 0xE000–0xF8FF):
  - All multi-character tokens (bigrams, trigrams) are placed exclusively in these ranges.
  - This design achieves total, lossless coverage of Unicode text, with all multi-symbol tokens isolated above the core Unicode range.
- Bigrams and trigrams are data-driven, selected by token co-occurrence on Wikipedia.
- Vocabulary size: 65,536 tokens.
- Embedding dimension: 1024.

The associated `normalized_embeddings_weights.pt` file contains a `[vocab_size x embed_dim]` matrix of precomputed, L2-normalized, frozen embeddings. No semantic information is encoded; the embeddings remain fixed throughout LM pretraining.

This tokenizer and embedding set is ideal for exploring semantic emergence and modular/fusion LM training over frozen, surface-level representations, enabling reproducible experiments and research.

## How to Get Started with the Tokenizer

```python
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
import torch

tokenizer = AutoTokenizer.from_pretrained('Bochkov/bvv241-2-3')

emb_path = hf_hub_download(
    repo_id="Bochkov/bvv241-2-3",
    filename="normalized_embeddings_weights.pt"
)
embeddings = torch.load(emb_path)
```

For one way to combine the tokenizer output with the frozen embedding matrix, see the sketch at the end of this card.

## 🧑‍🔬 Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

```bibtex
@article{bochkov2025emergent,
  title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
  author={Andrey Bochkov},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=Odh8IynO1o},
  note={}
}

@misc{bochkov2025growingtransformersmodularcomposition,
  title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
  author={A. Bochkov},
  year={2025},
  eprint={2507.07129},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.07129},
}
```

This work demonstrates that transformer blocks, not token embeddings, carry the semantic burden in LLMs, a step toward modular, fusable, multilingual LMs.
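
## Example: Indexing the Frozen Embedding Matrix

The snippet below is a minimal sketch, not part of the original card, showing one way to combine the tokenizer output with the precomputed matrix: the `[65536 x 1024]` tensor is wrapped in a non-trainable `torch.nn.Embedding` layer and indexed by token IDs. The sample text and variable names are illustrative; how the frozen substrate is wired into a full LM follows the paper and the repository linked above.

```python
import torch
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# Load the tokenizer and the precomputed, L2-normalized embedding matrix
tokenizer = AutoTokenizer.from_pretrained("Bochkov/bvv241-2-3")
emb_path = hf_hub_download(
    repo_id="Bochkov/bvv241-2-3",
    filename="normalized_embeddings_weights.pt",
)
embeddings = torch.load(emb_path)  # assumed shape: [65536, 1024]

# Wrap the matrix as a frozen (non-trainable) embedding layer,
# so it stays fixed throughout LM pretraining as described above.
frozen_embedding = torch.nn.Embedding.from_pretrained(embeddings, freeze=True)

# Tokenize text and look up the frozen, surface-level representations
token_ids = tokenizer("Hello, world!", return_tensors="pt")["input_ids"]
token_vectors = frozen_embedding(token_ids)  # shape: [1, seq_len, 1024]
print(token_ids.shape, token_vectors.shape)
```

These vectors carry no learned semantics; in the paper's setup, semantic structure emerges in the transformer blocks trained on top of this frozen substrate.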