---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- i3-architecture
---

# i3-tiny: The i3 Architecture (v1)

## Model Description

**i3-tiny** is an experimental Large Language Model (LLM) built on the proprietary **i3 (Integrated Intelligent Infrastructure)** Architecture. It is designed for ultra-efficiency and a low memory footprint.

This model is a **character-level autoregressive decoder** trained on a small English corpus.
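
If the repository ships custom modeling and tokenizer code for the i3 architecture, loading it through `transformers` might look like the sketch below. The repository id is a placeholder and `trust_remote_code=True` is an assumption, since the i3 block is not a stock `transformers` architecture.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hypothetical repository id -- replace with the actual model repo.
repo_id = "your-username/i3-tiny"

# trust_remote_code is assumed to be required because the i3 block is a
# custom architecture rather than a built-in transformers model class.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

# The model is character-level and trained on lowercased text,
# so keep prompts lowercase.
inputs = tokenizer("the quick brown fox", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```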

---

## Key Architectural Features

The **i3 Block** is a novel, single-layer design that achieves high efficiency by minimizing parameter and compute costs. It integrates the following mechanisms for sequence processing (an illustrative sketch of the low-rank components follows this list):

* **Proprietary Recurrence Mechanism:**
  A specialized hybrid recurrence component manages sequential dependencies efficiently, avoiding the quadratic complexity of standard self-attention.

* **Low-Rank Attention:**
  Attention uses highly factorized, low-rank projections, significantly reducing the memory and compute costs associated with the Query, Key, and Value matrices.

* **Low-Rank Feedforward:**
  The standard FFN is replaced by proprietary low-rank factorization layers to maximize parameter efficiency throughout the block.
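
The i3 block implementation itself is proprietary and is not reproduced in this card. The sketch below only illustrates the low-rank factorization idea from the second and third bullets, using the `d_model`, `n_heads`, and rank values from the scale table further down; the class names and the feedforward hidden size are illustrative, and the recurrence mechanism is omitted because its details are not public.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankLinear(nn.Module):
    """Projection factorized as two thin matrices: W ~= up @ down, with rank r << d."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)  # d_in -> r
        self.up = nn.Linear(rank, d_out, bias=False)   # r -> d_out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class LowRankSelfAttention(nn.Module):
    """Causal multi-head self-attention with low-rank Q/K/V/output projections."""

    def __init__(self, d_model: int = 256, n_heads: int = 8, rank: int = 8):
        super().__init__()
        self.n_heads = n_heads
        self.q = LowRankLinear(d_model, d_model, rank)
        self.k = LowRankLinear(d_model, d_model, rank)
        self.v = LowRankLinear(d_model, d_model, rank)
        self.out = LowRankLinear(d_model, d_model, rank)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        h = self.n_heads
        q, k, v = (
            proj(x).view(b, t, h, d // h).transpose(1, 2)
            for proj in (self.q, self.k, self.v)
        )
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # needs PyTorch 2.x
        return self.out(y.transpose(1, 2).reshape(b, t, d))


class LowRankFeedForward(nn.Module):
    """FFN in which both projections are low-rank factorized."""

    def __init__(self, d_model: int = 256, d_hidden: int = 1024, rank: int = 8):
        super().__init__()
        self.fc1 = LowRankLinear(d_model, d_hidden, rank)
        self.fc2 = LowRankLinear(d_hidden, d_model, rank)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.gelu(self.fc1(x)))


# Quick shape check with the v1 dimensions (batch 2, sequence length 128).
x = torch.randn(2, 128, 256)
print(LowRankSelfAttention()(x).shape)  # torch.Size([2, 128, 256])
```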

---

## Model Scale (v1 Configuration)

| **Parameter** | **Value** | **Notes** |
| ------------------- | ------------------------- | -------------------------------------------------- |
| **Model Size** | Approx. 40–50M parameters | Exact count is printed during training initiation. |
| **d_model** | 256 | Hidden dimension size. |
| **n_layers** | 6 | Number of hybrid i3 blocks. |
| **n_heads** | 8 | Number of attention heads. |
| **Recurrence Rank** | 16 (d_state for Mamba) | Size of the proprietary recurrence state. |
| **Low-Rank Rank** | 8 | Rank used for low-rank factorizations. |
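
For reference, the v1 configuration above can be summarized as a plain Python object. The field names below are illustrative and do not necessarily match those used in the actual training script.

```python
from dataclasses import dataclass


@dataclass
class I3TinyConfig:
    # Values taken from the v1 configuration table above;
    # field names are illustrative, not the official ones.
    d_model: int = 256    # hidden dimension
    n_layers: int = 6     # number of hybrid i3 blocks
    n_heads: int = 8      # attention heads
    d_state: int = 16     # recurrence state size ("Recurrence Rank")
    low_rank: int = 8     # rank for low-rank factorizations
    vocab_size: int = 35  # approximate character vocabulary (32-35)


print(I3TinyConfig())
```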

---

## Intended Uses and Limitations

### Intended Uses

* **Benchmarking & Research:** Exploring the training speed and final training loss of the i3 architecture against fully Transformer-based models of similar scale.
* **Proof of Concept:** Demonstrating an ultra-efficient training and inference paradigm.

### Out-of-Scope Use and Limitations

* **Production Use:** *Do not* use this model for real-world text generation, translation, or conversation.
* **General Language Tasks:** Because the training dataset is extremely small and repetitive (even with 10× repetition), the model has a very limited grasp of grammar, syntax, and semantics. It will primarily generate repetitive, fragmented text that mirrors patterns in the corpus.

---

## Training Details

### Training Data

| **Aspect** | **Value** | **Notes** |
| ----------------- | ------------------------- | ------------------------------------------------------------------------- |
| **Source** | Public Domain Sample Text | The original sample text provided in the source code. |
| **Volume** | 10× Repetition | The original text was repeated 10 times to increase training data volume. |
| **Tokenization** | Character-level | Vocabulary size is determined by the unique characters (≈32–35). |
| **Preprocessing** | Lowercased | All training data is normalized to lowercase characters. |
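
The actual data pipeline is not included in this card; the snippet below is an illustrative sketch of the character-level, lowercased scheme described in the table, using a placeholder sample text.

```python
# Illustrative character-level preprocessing, following the table above.
sample_text = "The quick brown fox jumps over the lazy dog."  # placeholder corpus

text = sample_text.lower() * 10             # lowercase, then repeat 10x for volume
chars = sorted(set(text))                   # unique characters define the vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}


def encode(s: str) -> list[int]:
    return [stoi[c] for c in s.lower()]


def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)


print(f"vocab size: {len(chars)}")          # roughly 32-35 for the real corpus
print(decode(encode("the lazy dog")))
```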

---

### Hyperparameters

| **Parameter** | **Value** |
| ------------------- | ------------------ |
| **Optimizer** | AdamW |
| **Learning Rate** | 3e-4 |
| **Max Iterations** | 2000 |
| **Batch Size** | 2 |
| **Sequence Length** | 128 |
| **Loss Function** | Cross-Entropy Loss |
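
A minimal training loop consistent with these hyperparameters is sketched below. The stand-in model, placeholder corpus, and helper names are illustrative; this is not the original i3-tiny training script.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hyperparameters from the table above.
learning_rate = 3e-4
max_iters = 2000
batch_size = 2
seq_len = 128

# Placeholder corpus and stand-in model; the real i3 model and dataset
# are not reproduced here.
text = ("the quick brown fox jumps over the lazy dog. " * 50).lower()
chars = sorted(set(text))
data = torch.tensor([chars.index(c) for c in text], dtype=torch.long)
model = nn.Sequential(nn.Embedding(len(chars), 64), nn.Linear(64, len(chars)))

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)


def get_batch():
    """Sample random (input, target) windows of length seq_len."""
    ix = torch.randint(len(data) - seq_len - 1, (batch_size,))
    x = torch.stack([data[i : i + seq_len] for i in ix])
    y = torch.stack([data[i + 1 : i + seq_len + 1] for i in ix])
    return x, y


for step in range(max_iters):
    xb, yb = get_batch()
    logits = model(xb)                                   # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 200 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```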

---

## Performance Metrics

* **Initial Loss (Expected):**
  The model should start with a cross-entropy loss between **3.0** and **4.0**, depending on the final character vocabulary size (roughly ln V for an untrained, near-uniform predictor; a worked example follows this list).

* **Target Loss:**
  With the increased data volume (10×), the training loss should drop well below **2.0** and ideally approach **1.0** for the model to be considered successfully trained and to show that it is using its increased capacity.
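
As a quick sanity check on the expected starting point: a model that predicts characters uniformly at random over a vocabulary of V characters has a cross-entropy of ln V. For the vocabulary sizes quoted above:

```python
import math

# Expected initial loss when predictions are uniform over V characters: ln(V).
for vocab_size in (32, 35):
    print(f"V = {vocab_size}: ln(V) = {math.log(vocab_size):.2f}")

# V = 32: ln(V) = 3.47
# V = 35: ln(V) = 3.56
```

Both values fall inside the 3.0–4.0 band above, so a starting loss in that range indicates the model is initialized sensibly.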