---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- i3-architecture
---

# i3-tiny: The i3 Architecture (v1)

## Model Description

**i3-tiny** is an experimental Large Language Model (LLM) built on the proprietary **i3 (Integrated Intelligent Infrastructure)** Architecture. It is designed for ultra-efficiency and a low memory footprint.

This model is a **character-level autoregressive decoder** trained on a small English corpus.
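
If the repository ships custom modeling and tokenizer code for the i3 architecture, loading it through `transformers` might look like the sketch below. The repository id is a placeholder and `trust_remote_code=True` is an assumption, since the i3 block is not a stock `transformers` architecture.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hypothetical repository id -- replace with the actual model repo.
repo_id = "your-username/i3-tiny"

# trust_remote_code is assumed to be required because the i3 block is a
# custom architecture rather than a built-in transformers model class.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

# The model is character-level and trained on lowercased text,
# so keep prompts lowercase.
inputs = tokenizer("the quick brown fox", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```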

---

## Key Architectural Features

The **i3 Block** is a novel, single-layer design that achieves high efficiency by minimizing parameter and compute costs. It integrates the following mechanisms for sequence processing (an illustrative sketch of the low-rank components follows this list):

* **Proprietary Recurrence Mechanism:**
  A specialized hybrid recurrence component manages sequential dependencies efficiently, avoiding the quadratic complexity of standard self-attention.

* **Low-Rank Attention:**
  Attention uses highly factorized, low-rank projections, significantly reducing the memory and compute costs associated with the Query, Key, and Value matrices.

* **Low-Rank Feedforward:**
  The standard FFN is replaced by proprietary low-rank factorization layers to maximize parameter efficiency throughout the block.
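
The i3 block implementation itself is proprietary and is not reproduced in this card. The sketch below only illustrates the low-rank factorization idea from the second and third bullets, using the `d_model`, `n_heads`, and rank values from the scale table further down; the class names and the feedforward hidden size are illustrative, and the recurrence mechanism is omitted because its details are not public.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankLinear(nn.Module):
    """Projection factorized as two thin matrices: W ~= up @ down, with rank r << d."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)  # d_in -> r
        self.up = nn.Linear(rank, d_out, bias=False)   # r -> d_out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class LowRankSelfAttention(nn.Module):
    """Causal multi-head self-attention with low-rank Q/K/V/output projections."""

    def __init__(self, d_model: int = 256, n_heads: int = 8, rank: int = 8):
        super().__init__()
        self.n_heads = n_heads
        self.q = LowRankLinear(d_model, d_model, rank)
        self.k = LowRankLinear(d_model, d_model, rank)
        self.v = LowRankLinear(d_model, d_model, rank)
        self.out = LowRankLinear(d_model, d_model, rank)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        h = self.n_heads
        q, k, v = (
            proj(x).view(b, t, h, d // h).transpose(1, 2)
            for proj in (self.q, self.k, self.v)
        )
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # needs PyTorch 2.x
        return self.out(y.transpose(1, 2).reshape(b, t, d))


class LowRankFeedForward(nn.Module):
    """FFN in which both projections are low-rank factorized."""

    def __init__(self, d_model: int = 256, d_hidden: int = 1024, rank: int = 8):
        super().__init__()
        self.fc1 = LowRankLinear(d_model, d_hidden, rank)
        self.fc2 = LowRankLinear(d_hidden, d_model, rank)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.gelu(self.fc1(x)))


# Quick shape check with the v1 dimensions (batch 2, sequence length 128).
x = torch.randn(2, 128, 256)
print(LowRankSelfAttention()(x).shape)  # torch.Size([2, 128, 256])
```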

---

## Model Scale (v1 Configuration)

| **Parameter** | **Value** | **Notes** |
| ------------------- | ------------------------- | -------------------------------------------------- |
| **Model Size** | Approx. 40–50M parameters | Exact count is printed during training initiation. |
| **d_model** | 256 | Hidden dimension size. |
| **n_layers** | 6 | Number of hybrid i3 blocks. |
| **n_heads** | 8 | Number of attention heads. |
| **Recurrence Rank** | 16 (d_state for Mamba) | Size of the proprietary recurrence state. |
| **Low-Rank Rank** | 8 | Rank used for low-rank factorizations. |
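
For reference, the v1 configuration above can be summarized as a plain Python object. The field names below are illustrative and do not necessarily match those used in the actual training script.

```python
from dataclasses import dataclass


@dataclass
class I3TinyConfig:
    # Values taken from the v1 configuration table above;
    # field names are illustrative, not the official ones.
    d_model: int = 256    # hidden dimension
    n_layers: int = 6     # number of hybrid i3 blocks
    n_heads: int = 8      # attention heads
    d_state: int = 16     # recurrence state size ("Recurrence Rank")
    low_rank: int = 8     # rank for low-rank factorizations
    vocab_size: int = 35  # approximate character vocabulary (32-35)


print(I3TinyConfig())
```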

---

## Intended Uses and Limitations

### Intended Uses

* **Benchmarking & Research:** Exploring the training speed and final training loss of the i3 architecture against fully Transformer-based models of similar scale.
* **Proof of Concept:** Demonstrating an ultra-efficient training and inference paradigm.

### Out-of-Scope Use and Limitations

* **Production Use:** *Do not* use this model for real-world text generation, translation, or conversation.
* **General Language Tasks:** Because the training dataset is extremely small and repetitive (even with 10× repetition), the model has a very limited grasp of grammar, syntax, and semantics. It will primarily generate repetitive, fragmented text that mirrors patterns in the corpus.

---

## Training Details

### Training Data

| **Aspect** | **Value** | **Notes** |
| ----------------- | ------------------------- | ------------------------------------------------------------------------- |
| **Source** | Public Domain Sample Text | The original sample text provided in the source code. |
| **Volume** | 10× Repetition | The original text was repeated 10 times to increase training data volume. |
| **Tokenization** | Character-level | Vocabulary size is determined by the unique characters (≈32–35). |
| **Preprocessing** | Lowercased | All training data is normalized to lowercase characters. |
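
The actual data pipeline is not included in this card; the snippet below is an illustrative sketch of the character-level, lowercased scheme described in the table, using a placeholder sample text.

```python
# Illustrative character-level preprocessing, following the table above.
sample_text = "The quick brown fox jumps over the lazy dog."  # placeholder corpus

text = sample_text.lower() * 10             # lowercase, then repeat 10x for volume
chars = sorted(set(text))                   # unique characters define the vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}


def encode(s: str) -> list[int]:
    return [stoi[c] for c in s.lower()]


def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)


print(f"vocab size: {len(chars)}")          # roughly 32-35 for the real corpus
print(decode(encode("the lazy dog")))
```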

---

### Hyperparameters

| **Parameter** | **Value** |
| ------------------- | ------------------ |
| **Optimizer** | AdamW |
| **Learning Rate** | 3e-4 |
| **Max Iterations** | 2000 |
| **Batch Size** | 2 |
| **Sequence Length** | 128 |
| **Loss Function** | Cross-Entropy Loss |
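
A minimal training loop consistent with these hyperparameters is sketched below. The stand-in model, placeholder corpus, and helper names are illustrative; this is not the original i3-tiny training script.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hyperparameters from the table above.
learning_rate = 3e-4
max_iters = 2000
batch_size = 2
seq_len = 128

# Placeholder corpus and stand-in model; the real i3 model and dataset
# are not reproduced here.
text = ("the quick brown fox jumps over the lazy dog. " * 50).lower()
chars = sorted(set(text))
data = torch.tensor([chars.index(c) for c in text], dtype=torch.long)
model = nn.Sequential(nn.Embedding(len(chars), 64), nn.Linear(64, len(chars)))

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)


def get_batch():
    """Sample random (input, target) windows of length seq_len."""
    ix = torch.randint(len(data) - seq_len - 1, (batch_size,))
    x = torch.stack([data[i : i + seq_len] for i in ix])
    y = torch.stack([data[i + 1 : i + seq_len + 1] for i in ix])
    return x, y


for step in range(max_iters):
    xb, yb = get_batch()
    logits = model(xb)                                   # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 200 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```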

---

## Performance Metrics

* **Initial Loss (Expected):**
  The model should start with a cross-entropy loss between **3.0** and **4.0**, depending on the final character vocabulary size (roughly ln V for an untrained, near-uniform predictor; a worked example follows this list).

* **Target Loss:**
  With the increased data volume (10×), the training loss should drop well below **2.0** and ideally approach **1.0** for the model to be considered successfully trained and to show that it is using its increased capacity.
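
As a quick sanity check on the expected starting point: a model that predicts characters uniformly at random over a vocabulary of V characters has a cross-entropy of ln V. For the vocabulary sizes quoted above:

```python
import math

# Expected initial loss when predictions are uniform over V characters: ln(V).
for vocab_size in (32, 35):
    print(f"V = {vocab_size}: ln(V) = {math.log(vocab_size):.2f}")

# V = 32: ln(V) = 3.47
# V = 35: ln(V) = 3.56
```

Both values fall inside the 3.0–4.0 band above, so a starting loss in that range indicates the model is initialized sensibly.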