FlameF0X committed on
Commit d3bd287 · verified · 1 Parent(s): 7434a85

Update README.md

Files changed (1):
  1. README.md +94 -3
README.md CHANGED
@@ -1,3 +1,94 @@
- ---
- license: apache-2.0
- ---

---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- i3-arhitecture
---
# i3-tiny: The i3 Architecture (v1)

## Model Description

**i3-tiny** is an experimental Large Language Model (LLM) built on the proprietary **i3 (Integrated Intelligent Infrastructure)** architecture. It is designed for ultra-efficiency and a low memory footprint.

This model is a **character-level autoregressive decoder** trained on a small English corpus.
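
Since the card declares `library_name: transformers` and `pipeline_tag: text-generation`, a loading sketch along the following lines should apply. The repo id `FlameF0X/i3-tiny` is an assumption inferred from the author and model name, and a custom architecture like this one typically requires `trust_remote_code=True` together with modeling code shipped in the repository; adjust as needed.

```python
# Hypothetical usage sketch; the repo id and the trust_remote_code requirement are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "FlameF0X/i3-tiny"  # assumed repo id, replace with the actual one

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

# The model is character-level and was trained on lowercased text, so lowercase the prompt.
inputs = tokenizer("the quick brown fox", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(output_ids[0]))
```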

---

## Key Architectural Features

The **i3 Block** is a novel, single-layer design that achieves high efficiency by minimizing parameter and computational costs. It integrates the following mechanisms for sequence processing (a generic sketch of the low-rank factorization idea appears after the list):

* **Proprietary Recurrence Mechanism:**
  A specialized hybrid recurrence component manages sequential dependencies efficiently, avoiding the quadratic complexity of standard self-attention.

* **Low-Rank Attention:**
  The attention mechanism uses highly factorized, low-rank projections, significantly reducing the memory and compute costs of the Key, Query, and Value matrices.

* **Low-Rank Feedforward:**
  The standard FFN is replaced by proprietary low-rank factorization layers to maximize parameter efficiency throughout the block.
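
The i3 implementation itself is not public, so the following is only a generic PyTorch sketch of the low-rank factorization idea, with illustrative names and shapes: a dense `d_model × d_model` projection is replaced by two thin projections through a rank-`r` bottleneck, cutting the weight count from `d²` to `2·d·r`.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Toy low-rank projection: factorizes d_in x d_out as (d_in x r) @ (r x d_out)."""
    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)  # d_in * rank weights
        self.up = nn.Linear(rank, d_out, bias=False)   # rank * d_out weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

# With d_model = 256 and rank = 8 (the v1 settings), one projection shrinks from
# 256 * 256 = 65,536 weights to 2 * 256 * 8 = 4,096 weights.
proj = LowRankLinear(256, 256, rank=8)
x = torch.randn(2, 128, 256)                       # (batch, seq_len, d_model)
print(proj(x).shape)                               # torch.Size([2, 128, 256])
print(sum(p.numel() for p in proj.parameters()))   # 4096
```

The same factorization applies in spirit to the Q/K/V projections and the feedforward layers; the recurrence component is omitted here because nothing about it is public beyond the `d_state = 16` figure in the table below.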

---

## Model Scale (v1 Configuration)

| **Parameter** | **Value** | **Notes** |
| ------------------- | ------------------------- | -------------------------------------------------- |
| **Model Size** | Approx. 40–50M parameters | The exact count is printed when training starts. |
| **d_model** | 256 | Hidden dimension size. |
| **n_layers** | 6 | Number of hybrid i3 blocks. |
| **n_heads** | 8 | Number of attention heads. |
| **Recurrence Rank** | 16 | Size of the proprietary recurrence state (`d_state` in the Mamba-style recurrence). |
| **Low-Rank Factorization Rank** | 8 | Rank used for the low-rank factorizations. |
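
For reference, the configuration in the table can be collected into a small config object. The field names below are illustrative (the actual i3 code is not public); only the values are taken from the table.

```python
from dataclasses import dataclass

@dataclass
class I3TinyConfig:
    # Hypothetical field names; values mirror the v1 table above.
    vocab_size: int = 33   # character vocabulary, roughly 32-35 symbols
    d_model: int = 256     # hidden dimension
    n_layers: int = 6      # number of hybrid i3 blocks
    n_heads: int = 8       # attention heads
    d_state: int = 16      # recurrence state size
    low_rank: int = 8      # rank of the low-rank factorizations
    seq_len: int = 128     # training sequence length

print(I3TinyConfig())
```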

---

## Intended Uses and Limitations

### Intended Uses

* **Benchmarking & Research:** Comparing the training speed and final loss of the i3 architecture against fully Transformer-based models of similar scale.
* **Proof of Concept:** Demonstrating an ultra-efficient training and inference paradigm.

### Out-of-Scope Use and Limitations

* **Production Use:** *Do not* use this model for real-world text generation, translation, or conversation.
* **General Language Tasks:** Because the training dataset is extremely small and repetitive (even with 10× repetition), the model has a very limited grasp of grammar, syntax, and semantics. It will primarily generate repetitive and fragmented text based on patterns in the corpus.

---

## Training Details

### Training Data

| **Aspect** | **Value** | **Notes** |
| ----------------- | ------------------------- | ------------------------------------------------------------------------- |
| **Source** | Public-domain sample text | The original sample text provided in the source code. |
| **Volume** | 10× repetition | The original text was repeated 10 times to increase the training data volume. |
| **Tokenization** | Character-level | Vocabulary size is determined by the unique characters (≈32–35); a short sketch follows this table. |
| **Preprocessing** | Lowercased | All training data is normalized to lowercase. |
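
The character-level, lowercased scheme above takes only a few lines to reproduce. This is a generic sketch of the approach with a stand-in corpus, not the project's exact preprocessing code.

```python
# Build a character-level vocabulary from a lowercased, 10x-repeated corpus.
sample_text = "The quick brown fox jumps over the lazy dog. "  # stand-in for the real sample text
corpus = sample_text.lower() * 10                              # 10x repetition

chars = sorted(set(corpus))                    # unique characters form the vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(text: str) -> list[int]:
    return [stoi[ch] for ch in text.lower()]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

print(len(chars))                      # vocabulary size (~32-35 on the real corpus)
print(decode(encode("The lazy dog")))  # -> "the lazy dog"
```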

---

### Hyperparameters

| **Parameter** | **Value** |
| ------------------- | ------------------ |
| **Optimizer** | AdamW |
| **Learning Rate** | 3e-4 |
| **Max Iterations** | 2000 |
| **Batch Size** | 2 |
| **Sequence Length** | 128 |
| **Loss Function** | Cross-entropy |
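
A schematic training loop using these hyperparameters might look as follows. The tiny embedding-plus-linear network below is only a stand-in so the loop runs end to end; the real i3 network and training script are not public.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in corpus and model; the actual i3 network replaces `model` in practice.
corpus = ("The quick brown fox jumps over the lazy dog. ".lower()) * 10
chars = sorted(set(corpus))
stoi = {ch: i for i, ch in enumerate(chars)}
data = torch.tensor([stoi[c] for c in corpus], dtype=torch.long)

vocab_size, d_model = len(chars), 256
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))

batch_size, seq_len = 2, 128
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)   # AdamW, lr 3e-4

def get_batch():
    # Random windows; targets are the inputs shifted by one character.
    starts = torch.randint(0, len(data) - seq_len - 1, (batch_size,))
    x = torch.stack([data[s : s + seq_len] for s in starts])
    y = torch.stack([data[s + 1 : s + seq_len + 1] for s in starts])
    return x, y

for step in range(2000):                                      # max iterations
    x, y = get_batch()
    logits = model(x)                                         # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 200 == 0:
        print(step, round(loss.item(), 3))
```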

---

## Performance Metrics

* **Initial Loss (Expected):**
  The model should start with a cross-entropy loss between **3.0** and **4.0**, depending on the final character vocabulary size; for a vocabulary of V characters, the expected initial loss is approximately ln V (see the check below).

* **Target Loss:**
  With the increased data volume (10× repetition), the training loss should drop well below **2.0**, and ideally approach **1.0**, for the model to be considered successfully trained and to show that it is making use of its capacity.
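
A quick check of the ≈ ln V claim for a 32–35-character vocabulary:

```python
import math

# Initial cross-entropy when the model is no better than uniform over V characters.
for vocab_size in (32, 33, 35):
    print(vocab_size, round(math.log(vocab_size), 2))
# 32 3.47
# 33 3.5
# 35 3.56
```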