---
license: apache-2.0
library_name: transformers
pipeline_tag: feature-extraction
tags:
- tokenizer
- embeddings
- unicode
- feature-extraction
---

# bvv241-2-3: Unicode-based Tokenizer with Precomputed Frozen Embeddings

This repository provides the tokenizer and associated precomputed frozen embeddings presented in the paper [Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations](https://huggingface.co/papers/2507.04886).

Code: https://github.com/AVBochkov/Embeddings

## Tokenizer Description

This tokenizer is built on a hybrid vocabulary with a strictly structured Unicode mapping scheme:

- Plane 0 (0–65535): all single Unicode code points (monograms) are mapped 1:1 to token codes, directly matching the standard Unicode BMP.
- Private and unused code ranges (Plane 0, e.g., 0xE000–0xF8FF):
  - All multi-character tokens (bigrams and trigrams) are placed exclusively in these ranges.
  - This design achieves total, lossless Unicode text coverage, with all multi-symbol tokens isolated above the core Unicode range.
- Data-driven bigrams and trigrams are drawn from Wikipedia token co-occurrence statistics.
- Vocabulary size: 65,536 tokens.
- Embedding dimension: 1024.
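The mapping scheme above can be sketched as follows. This is an illustrative toy reconstruction, not the shipped tokenizer code; `monogram_id`, `multi_tokens`, and the specific n-grams are hypothetical:

```python
# Illustrative sketch of the mapping scheme (not the actual tokenizer code).
PUA_START = 0xE000  # start of the Plane 0 Private Use Area

def monogram_id(ch: str) -> int:
    """A single BMP character's token id is simply its Unicode code point."""
    cp = ord(ch)
    assert cp <= 0xFFFF, "monograms cover Plane 0 (the BMP) only"
    return cp

# Multi-character tokens (bigrams/trigrams) receive ids from the reserved
# private-use range instead, keeping them isolated from real code points.
multi_tokens = ["th", "he", "ing"]  # hypothetical data-driven n-grams
multi_ids = {tok: PUA_START + i for i, tok in enumerate(multi_tokens)}

print(monogram_id("A"))   # 65 (== ord("A"))
print(multi_ids["ing"])   # 57346 (== 0xE000 + 2)
```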

The associated `normalized_embeddings_weights.pt` file contains a [vocab_size x embed_dim] matrix of precomputed, L2-normalized, frozen embeddings. No semantic information is encoded; the embeddings remain fixed throughout LM pretraining.
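Used downstream, such a matrix is typically wrapped as a non-trainable embedding layer. A minimal sketch, using a small random stand-in matrix (the real file holds a [65536 x 1024] tensor):

```python
import torch
import torch.nn as nn

# Small random stand-in for the precomputed matrix; rows are
# L2-normalized here, as in the released embeddings.
vocab_size, embed_dim = 256, 16
weights = torch.randn(vocab_size, embed_dim)
weights = weights / weights.norm(dim=-1, keepdim=True)

# freeze=True disables gradient updates, keeping the embeddings
# fixed throughout pretraining, as described above.
emb = nn.Embedding.from_pretrained(weights, freeze=True)
vecs = emb(torch.tensor([7, 42, 100]))  # lookup -> shape [3, 16]
```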

This tokenizer and embedding set is ideal for exploring semantic emergence and modular/fusion LM training over frozen, surface-level representations, enabling reproducible experiments and research.

## How to Get Started with the Tokenizer

```python
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
import torch

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('Bochkov/bvv241-2-3')

# Download the precomputed frozen embedding matrix ([vocab_size x embed_dim])
emb_path = hf_hub_download(
    repo_id="Bochkov/bvv241-2-3",
    filename="normalized_embeddings_weights.pt"
)
embeddings = torch.load(emb_path)
```

## 🧑‍🔬 Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

```
@article{bochkov2025emergent,
  title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
  author={Andrey Bochkov},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=Odh8IynO1o}
}

@misc{bochkov2025growingtransformersmodularcomposition,
  title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
  author={A. Bochkov},
  year={2025},
  eprint={2507.07129},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.07129}
}
```

This work demonstrates that transformer blocks, not token embeddings, carry the semantic burden in LLMs: a step toward modular, fusable, multilingual LMs.