Update README.md

README.md (CHANGED)

@@ -7,88 +7,84 @@ library_name: transformers
tags:
- i3-arhitecture
---
# i3-tiny: The i3 Architecture (v1)
A specialized hybrid recurrence component manages sequential dependencies efficiently, avoiding the quadratic complexity of standard self-attention.
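
The exact i3 recurrence is proprietary, so the following PyTorch sketch is only a generic illustration of the pattern: a gated linear recurrence that processes the sequence in a single O(T) scan over a fixed-size state instead of forming a T×T attention matrix. Names such as `SimpleRecurrence` and the default `d_state=16` (taken from the configuration table below) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimpleRecurrence(nn.Module):
    """Illustrative gated linear recurrence: O(T) in sequence length,
    carrying a fixed-size state instead of attending over all past tokens."""

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_state)   # token -> candidate state update
        self.gate = nn.Linear(d_model, d_state)      # token -> per-step decay gate
        self.out_proj = nn.Linear(d_state, d_model)  # state -> model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape
        state = x.new_zeros(batch, self.in_proj.out_features)
        u = self.in_proj(x)                 # candidate updates for every step
        a = torch.sigmoid(self.gate(x))     # gates in (0, 1), one per step and state channel
        outputs = []
        for t in range(seq_len):            # single left-to-right scan, no quadratic attention
            state = a[:, t] * state + (1 - a[:, t]) * u[:, t]
            outputs.append(self.out_proj(state))
        return torch.stack(outputs, dim=1)  # (batch, seq_len, d_model)

block = SimpleRecurrence(d_model=256)
print(block(torch.randn(2, 128, 256)).shape)  # torch.Size([2, 128, 256])
```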
Attention mechanisms use highly factorized, low-rank projections, significantly reducing memory and compute costs associated with Key, Query, and Value matrices.
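
As a generic sketch of this technique (not the actual i3 layers), a d_model × d_model projection can be factorized into two thin matrices of rank r; with the rank of 8 listed in the configuration table below, each Q/K/V projection shrinks from 65,536 to 4,096 weights:

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Factorizes a (d_in x d_out) matrix as (d_in x r) @ (r x d_out),
    cutting parameters from d_in*d_out down to r*(d_in + d_out)."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)
        self.up = nn.Linear(rank, d_out, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

d_model, rank = 256, 8
q_proj = LowRankLinear(d_model, d_model, rank)        # same construction for k_proj and v_proj
q = q_proj(torch.randn(2, 128, d_model))              # (batch, seq, d_model), ready to split into heads
print(d_model * d_model, rank * (d_model + d_model))  # 65536 vs 4096 parameters per projection
```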
The standard FFN is replaced by proprietary low-rank factorization layers to maximize parameter efficiency throughout the block.
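
The proprietary layers themselves are not shown here; the sketch below is one standard way to build such a block, replacing each dense FFN matrix with a pair of rank-8 factors. The 4× expansion (`d_ff = 1024`) and the function name are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def factorized_ffn(d_model: int = 256, d_ff: int = 1024, rank: int = 8) -> nn.Sequential:
    """Feed-forward block in which every dense matrix is a product of two rank-`rank` factors."""
    return nn.Sequential(
        nn.Linear(d_model, rank, bias=False), nn.Linear(rank, d_ff, bias=False),  # W1 ~ A1 @ B1
        nn.GELU(),
        nn.Linear(d_ff, rank, bias=False), nn.Linear(rank, d_model, bias=False),  # W2 ~ A2 @ B2
    )

ffn = factorized_ffn()
dense_params = 2 * 256 * 1024                              # a standard two-matrix FFN
low_rank_params = sum(p.numel() for p in ffn.parameters())
print(dense_params, low_rank_params)                       # 524288 vs 20480
print(ffn(torch.randn(2, 128, 256)).shape)                 # (batch, seq, d_model) preserved
```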

## Model Configuration

| **Parameter** | **Value** | **Notes** |
| ------------------- | ------------------------- | -------------------------------------------------- |
| **Model Size** | Approx. 40–50M parameters | Exact count is printed when training starts. |
| **d_model** | 256 | Hidden dimension size. |
| **n_layers** | 6 | Number of hybrid i3 blocks. |
| **n_heads** | 8 | Number of attention heads. |
| **Recurrence Rank** | 16 (d_state for Mamba) | Size of the proprietary recurrence state. |
| **Low-Rank Rank** | 8 | Rank used for low-rank factorizations. |
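
For readers who prefer code, the table above maps to a configuration object along the following lines; the field names are illustrative and are not taken from the i3 source.

```python
from dataclasses import dataclass

@dataclass
class I3TinyConfig:
    # Values from the table above; field names are illustrative, not the actual i3 code.
    d_model: int = 256     # hidden dimension size
    n_layers: int = 6      # number of hybrid i3 blocks
    n_heads: int = 8       # attention heads
    d_state: int = 16      # recurrence state size
    low_rank: int = 8      # rank of the low-rank factorizations
    vocab_size: int = 35   # character-level vocabulary, roughly 32-35 unique characters

print(I3TinyConfig())
```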

---

## Intended Use & Limitations
* **Proof of Concept:** Demonstrating an ultra-efficient training and inference paradigm.
* **General Language Tasks:** Due to the extremely small and repetitive training dataset (even with 10× repetition), the model has a very limited understanding of grammar, syntax, and semantics. It will primarily generate repetitive and fragmented text based on corpus patterns.

---

## Training Data

| **Attribute** | **Value** | **Notes** |
| ----------------- | ------------------------- | ------------------------------------------------------------------------- |
| **Source** | Public Domain Sample Text | The original sample text provided in the source code. |
| **Volume** | 10× Repetition | The original text was repeated 10 times to increase training data volume. |
| **Tokenization** | Character-level | Vocabulary size is determined by unique characters (≈32–35). |
| **Preprocessing** | Lowercased | All training data is normalized to lowercase characters. |
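
A character-level pipeline matching the table above can be sketched as follows; the sample string and helper names are placeholders, not the actual corpus or tokenizer code.

```python
# Build the vocabulary from the lowercased corpus: every unique character gets an id.
text = ("Public domain sample text used for the prototype. " * 10).lower()  # 10x repetition

chars = sorted(set(text))                        # unique characters define the vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}     # char -> id
itos = {i: ch for ch, i in stoi.items()}         # id -> char

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s.lower()]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

print(len(chars))                  # vocabulary size for this placeholder text
print(decode(encode("sample")))    # round-trips to "sample"
```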

---

## Training Configuration

| **Parameter** | **Value** | **Notes** |
| ------------------- | ------------------ | --------- |
| **Max Iterations** | 2000 | |
| **Batch Size** | 2 | |
| **Sequence Length** | 128 | |
| **Loss Function** | Cross-Entropy Loss | |
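
Combined with the optimizer settings listed in the updated card further below (AdamW at a 3e-4 learning rate), the configuration above corresponds to a training loop of roughly the following shape. The model class here is a deliberately trivial stand-in, since the actual i3 architecture is not public.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hyperparameters from the table above; AdamW and 3e-4 come from the updated card below.
max_iters, batch_size, seq_len, lr, vocab_size = 2000, 2, 128, 3e-4, 35

class StandInLM(nn.Module):
    """Placeholder causal LM (embedding + linear head) used only to make the loop runnable."""
    def __init__(self, vocab_size: int, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:  # idx: (batch, seq_len)
        return self.head(self.embed(idx))                  # logits: (batch, seq_len, vocab)

data = torch.randint(0, vocab_size, (10_000,))              # placeholder token stream

def get_batch():
    ix = torch.randint(0, len(data) - seq_len - 1, (batch_size,))
    x = torch.stack([data[i:i + seq_len] for i in ix])
    y = torch.stack([data[i + 1:i + seq_len + 1] for i in ix])  # next-token targets
    return x, y

model = StandInLM(vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)

for step in range(max_iters):
    x, y = get_batch()
    logits = model(x)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), y.reshape(-1))  # cross-entropy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 500 == 0:
        print(step, round(loss.item(), 3))
```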

---

## Expected Initial Loss
The model should start with a Cross-Entropy Loss between **3.0** and **4.0**, depending on the final character vocabulary size (≈ ln V).
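
The estimate can be checked directly: a model that starts out assigning roughly uniform probability to V characters has an expected cross-entropy of ln V.

```python
import math

# Expected initial loss for a near-uniform prediction over V characters is ln(V).
for vocab_size in (32, 35):
    print(vocab_size, round(math.log(vocab_size), 2))
# 32 -> 3.47, 35 -> 3.56, both inside the expected 3.0-4.0 range
```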

---
# Model Card: i3Model (Hybrid Efficient LLM)

## Overview
**i3Model** is a research-focused, ultra-efficient large language model prototype designed for exploring advanced hybrid architectures that balance **performance, scalability, and memory efficiency**. It integrates several experimental mechanisms for sequence modeling, low-rank parameterization, and quantization-aware training to achieve strong performance under resource constraints.

This model was developed for experimentation in lightweight large language modeling, particularly for tasks such as:

* Character- or token-level language modeling
* Text generation and continuation (a minimal sampling sketch follows this list)
* Research into efficient training and deployment techniques
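
The generation task above reduces to a plain autoregressive sampling loop. The sketch below is a minimal, hypothetical version in PyTorch; `model` stands for any causal LM that returns per-position logits and is not the actual i3 implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx: torch.Tensor, max_new_tokens: int = 100, temperature: float = 1.0) -> torch.Tensor:
    """Autoregressive sampling: feed the sequence so far, sample the next id, append, repeat.

    `model` is assumed to map (batch, seq) token ids to (batch, seq, vocab) logits.
    """
    for _ in range(max_new_tokens):
        logits = model(idx)                       # (batch, seq, vocab)
        logits = logits[:, -1, :] / temperature   # keep only the final position
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)    # append the sampled token and continue
    return idx
```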

> **Note:** Architectural details are proprietary and are intentionally omitted.

---

## Intended Use

The model is intended for:

* Academic research on hybrid recurrent-transformer architectures
* Prototyping efficient LLMs for low-resource environments
* Studying low-rank adaptation and quantization for model compression

It is **not** optimized or tested for production deployment, safety-critical applications, or real-world text generation beyond controlled research settings.

---

## Key Features

* **Hybrid Recurrent–Sequence Modeling:** Combines sequence-mixing and dynamic state-space mechanisms for temporal reasoning.
* **Low-Rank Parameterization:** Reduces parameter footprint while maintaining expressivity.
* **Quantization-Aware Design:** Uses a 4-bit quantization scheme with FP32 master weights for training stability (a minimal sketch follows this list).
* **Causal Autoregressive Training:** Enables next-token prediction and controlled text generation.
* **Modular and Extensible:** Supports layer-wise experimentation and scalable configuration.
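
The exact 4-bit scheme is not documented here; the sketch below shows one standard way such a design can work: "fake" 4-bit quantization of the weights during the forward pass, with gradients flowing straight through to FP32 master weights that the optimizer updates. The class and function names are illustrative, not the i3 code.

```python
import torch
import torch.nn as nn

def fake_quant_4bit(w: torch.Tensor) -> torch.Tensor:
    """Symmetric 4-bit fake quantization: the forward pass sees weights rounded to 15 levels,
    while gradients pass straight through to the underlying FP32 tensor."""
    scale = w.abs().max().clamp(min=1e-8) / 7            # symmetric int4-style range [-7, 7]
    w_q = torch.clamp(torch.round(w / scale), -7, 7) * scale
    return w + (w_q - w).detach()                        # straight-through estimator

class QuantLinear(nn.Linear):
    """Linear layer that keeps FP32 master weights but computes with quantized weights."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, fake_quant_4bit(self.weight), self.bias)

layer = QuantLinear(256, 256)
out = layer(torch.randn(2, 128, 256))
out.sum().backward()                   # gradients reach the FP32 master weights
print(layer.weight.grad is not None)   # True
```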

---

## Training Details

* **Training Objective:** Next-token prediction (causal language modeling)
* **Dataset:** Custom small-scale character-level corpus derived from public domain text passages
* **Sequence Length:** 128 tokens (for prototype training)
* **Optimization:** AdamW optimizer with weight decay
* **Learning Rate:** 3e-4
* **Training Duration:** ~2000 iterations on a small dataset
* **Batch Size:** 2

The model was trained primarily for demonstration and performance measurement purposes rather than benchmark-level convergence.

---

## Evaluation

Evaluation focused on:

* **Training stability**
* **Generation coherence at small scale**
* **Speed and memory performance metrics**

While not benchmarked on large-scale NLP datasets, the model demonstrates promising early results in lightweight text generation with efficient runtime characteristics.

---

## Limitations

* The model is trained on a very limited dataset and may produce incoherent or repetitive outputs.
* It lacks fine-tuning for alignment, safety, or factual consistency.
* It is unsuitable for deployment in sensitive or user-facing contexts.
* Generation quality is constrained by vocabulary and training corpus diversity.

---

## Ethical Considerations

This model is intended solely for **research and educational use**. Users should:

* Avoid using it to generate misleading or harmful content.
* Not deploy it in systems interacting with the public without additional alignment and safety layers.
* Attribute the model appropriately if adapted or redistributed.