# Model Card for abs-bvv-4
## Model Description
`abs-bvv-4` is a 1.9 billion parameter decoder-only Transformer model. It is the 4th model in the **Progressive Growth Transformers (PGT)** series, designed to explore how linguistic and reasoning capabilities emerge as a function of model depth.

This model was not trained monolithically. Instead, it was "grown" constructively, one layer at a time, upon a foundation of **frozen, non-semantic visual embeddings**, as introduced in the papers:

[📄 Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations)](https://huggingface.co/papers/2507.04886)

[📄 Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate)](https://huggingface.co/papers/2507.07129)

[💻 Code](https://github.com/AVBochkov/Embeddings)

The core idea is to demonstrate an alternative, more modular and resource-efficient paradigm for building LLMs. The PGT series shows that:
1. Semantic understanding can emerge without trainable embeddings.
## Training Details
- **Architecture:** 4-layer decoder-only Transformer (n_layer=4, d_model=4096, n_head=32).
- **Embeddings:** The token embedding layer is frozen and derived from visual representations of Unicode glyphs. It is never updated during training.
- **Training Method:** Progressive layer-wise growth. The model was built by training one layer at a time: Layer 1 was trained to convergence, then frozen; Layer 2 was added and trained, and so on. For the deeper models in the series (layers 5 and 6), LoRA was used to fine-tune all existing layers simultaneously with the new layer to ensure global coherence.
- **Parameters:** 1.9B total.
- **Data:** A ~9B token mix of Wikipedia and SFT datasets (10%).
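The recipe above can be sketched in a few lines of PyTorch. This is a toy illustration under assumed sizes, with random tensors standing in for the real glyph features and a generic Transformer block standing in for the model's custom causal block — not the actual training code:

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration; the real model uses d_model=4096, n_head=32.
VOCAB, D_MODEL, N_HEAD = 1000, 64, 4

# Frozen, non-semantic embeddings: built from precomputed visual glyph
# features (random here as a stand-in) and never updated during training.
glyph_features = torch.randn(VOCAB, D_MODEL)
embedding = nn.Embedding.from_pretrained(glyph_features, freeze=True)

# Progressive layer-wise growth: freeze every previously trained block,
# then append one new trainable block.
layers = nn.ModuleList()

def grow_one_layer() -> nn.Module:
    for layer in layers:                      # freeze everything trained so far
        for p in layer.parameters():
            p.requires_grad = False
    block = nn.TransformerEncoderLayer(D_MODEL, N_HEAD, batch_first=True)
    layers.append(block)
    return block                              # only this block receives gradients

block1 = grow_one_layer()   # train to convergence, then...
block2 = grow_one_layer()   # ...block1 is frozen while block2 trains
```

For the deeper models in the series, the same pattern applies, except LoRA adapters let the frozen stack co-adapt with the new layer instead of staying fully fixed.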
## Limitations and Bias

This model is a research prototype and has several limitations:

- **Not instruction-tuned:** It is a base model and will not follow instructions or engage in dialogue reliably.
- **Potential for hallucinations:** Like all LLMs, it can generate factually incorrect or nonsensical text.
- **Data bias:** Trained primarily on Wikipedia, it will reflect the biases present in that corpus.
- **Limited scope:** The model was trained on a relatively small dataset (~9B tokens) compared to state-of-the-art models. Its performance is intended to be evaluated relative to its own baseline (trainable embeddings) and shallower versions, not against giant commercial models.
## 🧑‍🔬 Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

```bibtex
@misc{bochkov2025emergentsemanticstokenembeddings,
      title={Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations},
      author={A. Bochkov},
      year={2025},
      eprint={2507.04886},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.04886},
}

@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2507.07129},
}
```
This work demonstrates that transformer blocks, not token embeddings, carry the semantic burden in LLMs: a step toward modular, fusable, multilingual LMs.
## How to Use
The model can be loaded using the `transformers` library. Note that `trust_remote_code=True` is required as it uses a custom model architecture.
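A minimal loading-and-generation sketch, assuming this model card's Hub id is `Bochkov/abs-bvv-4` (adjust if the repository is named differently):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Bochkov/abs-bvv-4"  # assumed Hub id; replace with the actual repo name

# trust_remote_code=True is required because the model uses a custom architecture.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

As a base model, it works best with plain text-completion prompts like the one above rather than chat-style instructions.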