FlameF0X committed
Commit 1b930d8 · verified · 1 Parent(s): 8b425f7

Update README.md

Files changed (1):
  1. README.md +49 -53
README.md CHANGED
@@ -7,88 +7,84 @@ library_name: transformers
  tags:
  - i3-arhitecture
  ---

- # i3-tiny: The i3 Architecture (v1)

- ## Model Description

- **i3-tiny** is an experimental language model (LM) built on the proprietary **i3** architecture. It is designed for ultra-efficiency and a low memory footprint.

- This model is a **character-level autoregressive decoder** trained on a small English corpus.

- ---

- ## Key Architectural Features

- The **i3 Block** is a novel, single-layer design that achieves high efficiency by minimizing parameter and computational costs. It integrates advanced mechanisms for sequence processing:

- * **Proprietary Recurrence Mechanism:**
-   A specialized hybrid recurrence component manages sequential dependencies efficiently, avoiding the quadratic complexity of standard self-attention.
- * **Low-Rank Attention:**
-   The attention mechanism uses highly factorized, low-rank projections, significantly reducing the memory and compute costs of the key, query, and value matrices (see the sketch after this list).
- * **Low-Rank Feedforward:**
-   The standard FFN is replaced by proprietary low-rank factorization layers to maximize parameter efficiency throughout the block.
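
To make the low-rank idea in the last two bullets concrete, here is a minimal sketch of a factorized linear layer in PyTorch. It illustrates the general technique only; the class name, rank value, and bias placement are assumptions for illustration, not the proprietary i3 implementation.

```python
# Minimal low-rank factorized linear layer (illustrative only; not the
# proprietary i3 implementation). A dense d_in x d_out weight matrix is
# replaced by two skinny factors of rank r, cutting the weight count from
# d_in * d_out down to r * (d_in + d_out).
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)  # project d_in -> r
        self.up = nn.Linear(rank, d_out, bias=True)    # project r -> d_out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

# With d_model = 256 and rank = 8 (the v1 values below), a 256x256 projection
# shrinks from 65,536 weights to 8 * (256 + 256) = 4,096.
proj = LowRankLinear(256, 256, rank=8)
x = torch.randn(2, 128, 256)  # (batch, seq_len, d_model)
print(proj(x).shape)          # torch.Size([2, 128, 256])
```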

- ---

- ## Model Scale (v1 Configuration)

- | **Parameter** | **Value** | **Notes** |
- | ------------------- | ------------------------- | ---------------------------------------------------- |
- | **Model Size** | Approx. 40–50M parameters | The exact count is printed at the start of training. |
- | **d_model** | 256 | Hidden dimension size. |
- | **n_layers** | 6 | Number of hybrid i3 blocks. |
- | **n_heads** | 8 | Number of attention heads. |
- | **Recurrence Rank** | 16 (d_state for Mamba) | Size of the proprietary recurrence state. |
- | **Low-Rank Rank** | 8 | Rank used for low-rank factorizations. |
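
For reference in code, the table can be mirrored by a small config object. This is a hypothetical convenience sketch (the class and field names are assumptions), not an interface the released model exposes:

```python
# Hypothetical config mirroring the v1 scale table above; names are
# illustrative assumptions, not the model's actual API.
from dataclasses import dataclass

@dataclass
class I3TinyConfig:
    d_model: int = 256    # hidden dimension
    n_layers: int = 6     # number of hybrid i3 blocks
    n_heads: int = 8      # attention heads
    d_state: int = 16     # recurrence state size ("Recurrence Rank")
    low_rank: int = 8     # rank for low-rank factorizations
    vocab_size: int = 33  # character vocabulary, ≈32–35 in practice
    seq_len: int = 128    # training sequence length

print(I3TinyConfig())
```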
  ---

- ## Intended Uses and Limitations

- ### Intended Uses

- * **Benchmarking & Research:** Comparing the training speed and final loss of the i3 architecture against fully Transformer-based models of similar scale.
- * **Proof of Concept:** Demonstrating an ultra-efficient training and inference paradigm.

- ### Out-of-Scope Use and Limitations

- * **Production Use:** *Do not* use this model for real-world text generation, translation, or conversation.
- * **General Language Tasks:** Because the training dataset is extremely small and repetitive (even with 10× repetition), the model has a very limited grasp of grammar, syntax, and semantics. It will mostly generate repetitive, fragmented text that mirrors corpus patterns.
  ---

- ## Training Details

- ### Training Data

- | **Parameter** | **Value** | **Notes** |
- | ----------------- | ------------------------- | -------------------------------------------------------------------------- |
- | **Source** | Public Domain Sample Text | The original sample text provided in the source code. |
- | **Volume** | 10× Repetition | The original text was repeated 10 times to increase training data volume. |
- | **Tokenization** | Character-level | Vocabulary size is the number of unique characters (≈32–35). |
- | **Preprocessing** | Lowercased | All training data is normalized to lowercase characters. |
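
The character-level pipeline in this table is easy to reproduce. A minimal sketch, assuming only that the vocabulary is built from the unique characters of the lowercased corpus (the sample text here is a stand-in):

```python
# Character-level tokenization sketch matching the table above: lowercase the
# corpus, then build the vocabulary from its unique characters.
text = ("this is a tiny public-domain sample text. " * 10).lower()  # 10x repetition

chars = sorted(set(text))                     # unique characters -> vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> char

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s.lower()]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

assert decode(encode("sample")) == "sample"
print(len(chars))  # vocabulary size; ≈32–35 for the real small English corpus
```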
  ---

- ### Hyperparameters

- | **Parameter** | **Value** |
- | ------------------- | ------------------ |
- | **Optimization** | AdamW |
- | **Learning Rate** | 3e-4 |
- | **Max Iterations** | 2000 |
- | **Batch Size** | 2 |
- | **Sequence Length** | 128 |
- | **Loss Function** | Cross-Entropy Loss |
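
Wired together, these hyperparameters yield a conventional training loop. The sketch below uses a stand-in model and random token batches, since the i3 block itself is proprietary; only the optimizer, learning rate, batch size, sequence length, iteration count, and loss function come from the table.

```python
# Training-loop sketch using the hyperparameters above. The model is a
# placeholder (the real i3 stack is proprietary) and batches are random.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, seq_len, batch_size = 33, 256, 128, 2

model = nn.Sequential(                # stand-in for the i3 decoder stack
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(2000):              # "Max Iterations"
    x = torch.randint(0, vocab_size, (batch_size, seq_len))  # input ids
    y = torch.randint(0, vocab_size, (batch_size, seq_len))  # target ids
    logits = model(x)                                        # (B, T, V)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 500 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```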
  ---

- ## Performance Metrics

- * **Initial Loss (Expected):**
-   Training should begin with a cross-entropy loss between **3.0** and **4.0**, depending on the final character vocabulary size (the expected initial loss is ≈ ln V).

- * **Target Loss:**
-   With the increased data volume (10×), the training loss should drop well below **2.0**, ideally approaching **1.0**, for the model to count as successfully trained and to show it is using its increased capacity.
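
As a quick sanity check on that 3.0–4.0 range: a model that assigns uniform probability 1/V to every character has cross-entropy −ln(1/V) = ln V, so both ends of the expected vocabulary range land inside it:

```python
# Expected initial loss of a uniform predictor over V characters: ln V.
import math

for V in (32, 35):
    print(V, round(math.log(V), 2))  # 32 -> 3.47, 35 -> 3.56
```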
 
 
+ # Model Card: i3Model (Hybrid Efficient LLM)

+ ## Overview

+ **i3Model** is a research-focused, ultra-efficient large language model prototype designed for exploring advanced hybrid architectures that balance **performance, scalability, and memory efficiency**. It integrates several experimental mechanisms for sequence modeling, low-rank parameterization, and quantization-aware training to achieve strong performance under resource constraints.

+ This model was developed for experimentation in lightweight large language modeling, particularly for tasks such as:

+ * Character- or token-level language modeling
+ * Text generation and continuation
+ * Research into efficient training and deployment techniques

+ > **Note:** Architectural details are proprietary and are intentionally omitted.

+ ---
 
+ ## Intended Use

+ The model is intended for:

+ * Academic research on hybrid recurrent-transformer architectures
+ * Prototyping efficient LLMs for low-resource environments
+ * Studying low-rank adaptation and quantization for model compression

+ It is **not** optimized or tested for production deployment, safety-critical applications, or real-world text generation beyond controlled research settings.

  ---

+ ## Key Features

+ * **Hybrid Recurrent–Sequence Modeling:** Combines sequence-mixing and dynamic state-space mechanisms for temporal reasoning.
+ * **Low-Rank Parameterization:** Reduces the parameter footprint while maintaining expressivity.
+ * **Quantization-Aware Design:** Uses a 4-bit quantization scheme with FP32 master weights for training stability (a generic sketch of this pattern follows the list).
+ * **Causal Autoregressive Training:** Enables next-token prediction and controlled text generation.
+ * **Modular and Extensible:** Supports layer-wise experimentation and scalable configuration.
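
The quantization pattern named above (low-precision weights in the forward pass, FP32 master weights receiving the updates) is commonly implemented with fake quantization and a straight-through estimator. The sketch below shows that generic pattern; it is an assumption-laden illustration, not i3Model's actual scheme.

```python
# Generic quantization-aware-training sketch: the forward pass sees 4-bit
# fake-quantized weights, while the optimizer updates FP32 master weights.
# Standard straight-through-estimator pattern; NOT i3Model's actual scheme.
import torch
import torch.nn as nn

def fake_quant_4bit(w: torch.Tensor) -> torch.Tensor:
    scale = w.abs().max().clamp(min=1e-8) / 7.0     # symmetric, 16 levels
    w_q = (w / scale).round().clamp(-8, 7) * scale  # quantize-dequantize
    return w + (w_q - w).detach()  # straight-through: gradient flows to w

class QuantLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)  # FP32 master

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ fake_quant_4bit(self.weight).t()

layer = QuantLinear(256, 256)
layer(torch.randn(2, 256)).sum().backward()
print(layer.weight.grad.shape)  # gradients land on the FP32 master weights
```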

+ ---

+ ## Training Details

+ * **Training Objective:** Next-token prediction (causal language modeling; see the batching sketch after this list)
+ * **Dataset:** Custom small-scale character-level corpus derived from public domain text passages
+ * **Sequence Length:** 128 tokens (for prototype training)
+ * **Optimization:** AdamW optimizer with weight decay
+ * **Learning Rate:** 3e-4
+ * **Training Duration:** ~2000 iterations on a small dataset
+ * **Batch Size:** 2
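
For the next-token objective, each target sequence is just the input shifted one position to the right. A minimal sketch of that batch construction; the `data` tensor here stands in for the encoded character corpus:

```python
# Next-token prediction batches: targets are the inputs shifted by one.
import torch

data = torch.randint(0, 33, (10_000,))  # stand-in for the encoded corpus
seq_len, batch_size = 128, 2

def get_batch() -> tuple[torch.Tensor, torch.Tensor]:
    starts = torch.randint(0, len(data) - seq_len - 1, (batch_size,)).tolist()
    x = torch.stack([data[s : s + seq_len] for s in starts])          # inputs
    y = torch.stack([data[s + 1 : s + seq_len + 1] for s in starts])  # targets
    return x, y

x, y = get_batch()
print(x.shape, y.shape)  # torch.Size([2, 128]) torch.Size([2, 128])
```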
+ The model was trained primarily for demonstration and performance measurement purposes rather than benchmark-level convergence.
 
  ---

+ ## Evaluation

+ Evaluation focused on:

+ * **Training stability**
+ * **Generation coherence at small scale**
+ * **Speed and memory performance metrics**

+ While not benchmarked on large-scale NLP datasets, the model demonstrates promising early results in lightweight text generation with efficient runtime characteristics.

  ---

+ ## Limitations

+ * The model is trained on a very limited dataset and may produce incoherent or repetitive outputs.
+ * It lacks fine-tuning for alignment, safety, or factual consistency.
+ * It is unsuitable for deployment in sensitive or user-facing contexts.
+ * Generation quality is constrained by vocabulary and training corpus diversity.

  ---

+ ## Ethical Considerations

+ This model is intended solely for **research and educational use**. Users should:

+ * Avoid using it to generate misleading or harmful content.
+ * Not deploy it in systems interacting with the public without additional alignment and safety layers.
+ * Attribute the model appropriately if adapted or redistributed.