---
license: apache-2.0
library_name: transformers
pipeline_tag: feature-extraction
---
# bvv241-max: Unified Unicode Tokenizer (SOTA Intersection) with Frozen Embeddings

## Tokenizer Description

This repository contains the tokenizer and associated resources from the paper:

[📚 Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations)](https://huggingface.co/papers/2507.04886)

[💻 Code](https://github.com/AVBochkov/Embeddings)

The tokenizer is built on a hybrid vocabulary with a strictly structured Unicode mapping scheme:

- Plane 0 (0–65535): every single Unicode code point (monogram) is mapped 1:1 to its token id, directly matching the Unicode Basic Multilingual Plane (BMP).
- Private-use and otherwise unused ranges (high Plane 0, e.g. 0xE000–0xF8FF, plus the supplementary range 65536–131071): all multi-character tokens (bigrams, trigrams, and SOTA model token strings) are placed exclusively here.
- This design achieves total, lossless coverage of Unicode text, with all multi-symbol tokens isolated above the core Unicode range.
- The multi-character tokens are drawn from the intersection of token strings across leading SOTA vocabularies: o200k_base, cl100k_base, Mistral-Nemo, QwQ-32B, DeepSeek-R1, and Qwen3-32B.
- Vocabulary size: 131,072 tokens.
- Embedding dimension: 1024.
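A minimal sketch of the Plane-0 identity mapping described above, in plain Python. This is an illustration of the scheme, not the actual tokenizer implementation; `monogram_token_id` is a hypothetical helper, not part of this repo:

```python
def monogram_token_id(ch: str) -> int:
    """Under the scheme above, a single BMP character's token id
    is simply its Unicode code point."""
    cp = ord(ch)
    if cp > 0xFFFF:
        raise ValueError("outside Plane 0; covered by supplementary-range tokens")
    return cp

print(monogram_token_id("A"))  # 65, same as ord("A")
print(monogram_token_id("Ж"))  # 1046, Cyrillic Zhe
```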

The associated `normalized_embeddings_weights.pt` file contains a `[vocab_size x embed_dim]` matrix of precomputed, L2-normalized, frozen embeddings.
These embeddings encode no semantic information and remain fixed throughout LM pretraining.
No training or adaptation is required, making them suitable for plug-and-play use in research on embedding-free semantic emergence and modular LMs.
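What "L2-normalized" buys you: every row has unit Euclidean length, so a plain dot product between two rows is already their cosine similarity. A toy illustration in pure Python (tiny dimensions standing in for the shipped 131,072 x 1024 matrix):

```python
import math

def l2_normalize(row):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row]

# Toy 3-token, 4-dim stand-in for the real embedding matrix.
table = [l2_normalize(r) for r in ([1.0, 0.0, 2.0, 2.0],
                                   [0.0, 3.0, 4.0, 0.0],
                                   [1.0, 1.0, 1.0, 1.0])]

# Every row now has unit norm, so dot(a, b) is a cosine similarity directly.
for row in table:
    assert abs(sum(x * x for x in row) - 1.0) < 1e-9
```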


## How to Get Started with the Tokenizer

```python
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
import torch

# Load the tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained('Bochkov/bvv241-max')

# Download the precomputed frozen embedding matrix [vocab_size x embed_dim]
emb_path = hf_hub_download(
    repo_id="Bochkov/bvv241-max",
    filename="normalized_embeddings_weights.pt"
)
embeddings = torch.load(emb_path, map_location="cpu")
```
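To plug such a matrix into a model as a non-trainable lookup table, `torch.nn.Embedding.from_pretrained` with `freeze=True` keeps it fixed during training, matching the frozen-embedding setup described above. A sketch using a small random stand-in matrix (the real file is 131,072 x 1024):

```python
import torch
import torch.nn as nn

# Small random stand-in for the downloaded matrix; rows L2-normalized
# to mirror normalized_embeddings_weights.pt (real shape: 131072 x 1024).
weights = torch.nn.functional.normalize(torch.randn(100, 16), dim=1)

# freeze=True sets requires_grad=False, so the rows never update in training.
embed = nn.Embedding.from_pretrained(weights, freeze=True)

ids = torch.tensor([5, 42, 7])
vecs = embed(ids)  # shape [3, 16]: one frozen row per token id
assert not embed.weight.requires_grad
```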

## 🧑‍🔬 Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

```
@article{bochkov2025emergent,
      title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
      author={Andrey Bochkov},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2025},
      url={https://openreview.net/forum?id=Odh8IynO1o}
}

@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate}, 
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129}, 
}
```

This work demonstrates that transformer blocks, not token embeddings, carry the semantic burden in LLMs: a step toward modular, fusable, multilingual LMs.