---
license: apache-2.0
library_name: transformers
pipeline_tag: feature-extraction
tags:
- tokenizer
- embeddings
- unicode
- feature-extraction
---

# bvv241-2-3: Unicode-based Tokenizer with Precomputed Frozen Embeddings

This repository provides the tokenizer and associated precomputed frozen embeddings presented in the paper [Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations](https://huggingface.co/papers/2507.04886).

Code: https://github.com/AVBochkov/Embeddings

## Tokenizer Description

This tokenizer is built on a hybrid vocabulary with a strictly structured Unicode mapping scheme:

- Plane 0 (0–65535): all single Unicode code points (monograms) are mapped 1:1 to token codes, directly matching the standard Unicode BMP.
- Private and unused code ranges (Plane 0, e.g., 0xE000–0xF8FF):
  - All multi-character tokens (bigrams and trigrams) are placed exclusively in these ranges.
  - This design achieves total, lossless Unicode text coverage, with all multi-symbol tokens isolated above the core Unicode range.
- Data-driven bigrams and trigrams are drawn from Wikipedia token co-occurrence statistics.
- Vocabulary size: 65,536 tokens.
- Embedding dimension: 1024.
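The mapping scheme above can be sketched as follows. This is an illustrative toy reconstruction, not the shipped tokenizer code; `monogram_id`, `multi_tokens`, and the specific n-grams are hypothetical:

```python
# Illustrative sketch of the mapping scheme (not the actual tokenizer code).
PUA_START = 0xE000  # start of the Plane 0 Private Use Area

def monogram_id(ch: str) -> int:
    """A single BMP character's token id is simply its Unicode code point."""
    cp = ord(ch)
    assert cp <= 0xFFFF, "monograms cover Plane 0 (the BMP) only"
    return cp

# Multi-character tokens (bigrams/trigrams) receive ids from the reserved
# private-use range instead, keeping them isolated from real code points.
multi_tokens = ["th", "he", "ing"]  # hypothetical data-driven n-grams
multi_ids = {tok: PUA_START + i for i, tok in enumerate(multi_tokens)}

print(monogram_id("A"))   # 65 (== ord("A"))
print(multi_ids["ing"])   # 57346 (== 0xE000 + 2)
```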

The associated `normalized_embeddings_weights.pt` file contains a [vocab_size x embed_dim] matrix of precomputed, L2-normalized, frozen embeddings. No semantic information is encoded; the embeddings remain fixed throughout LM pretraining.
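Used downstream, such a matrix is typically wrapped as a non-trainable embedding layer. A minimal sketch, using a small random stand-in matrix (the real file holds a [65536 x 1024] tensor):

```python
import torch
import torch.nn as nn

# Small random stand-in for the precomputed matrix; rows are
# L2-normalized here, as in the released embeddings.
vocab_size, embed_dim = 256, 16
weights = torch.randn(vocab_size, embed_dim)
weights = weights / weights.norm(dim=-1, keepdim=True)

# freeze=True disables gradient updates, keeping the embeddings
# fixed throughout pretraining, as described above.
emb = nn.Embedding.from_pretrained(weights, freeze=True)
vecs = emb(torch.tensor([7, 42, 100]))  # lookup -> shape [3, 16]
```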

This tokenizer and embedding set is ideal for exploring semantic emergence and modular/fusion LM training over frozen, surface-level representations, enabling reproducible experiments and research.

## How to Get Started with the Tokenizer

```python
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
import torch

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('Bochkov/bvv241-2-3')

# Download the precomputed frozen embedding matrix ([vocab_size x embed_dim])
emb_path = hf_hub_download(
    repo_id="Bochkov/bvv241-2-3",
    filename="normalized_embeddings_weights.pt"
)
embeddings = torch.load(emb_path)
```

## 🧑‍🔬 Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

```
@article{bochkov2025emergent,
  title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
  author={Andrey Bochkov},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=Odh8IynO1o}
}

@misc{bochkov2025growingtransformersmodularcomposition,
  title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
  author={A. Bochkov},
  year={2025},
  eprint={2507.07129},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.07129}
}
```

This work demonstrates that transformer blocks, not token embeddings, carry the semantic burden in LLMs: a step toward modular, fusable, multilingual LMs.