---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- i3-architecture
- custom_code
---

# i3-tiny

**i3-tiny** is a compact, efficient character-level language model designed for experimentation and exploration in text generation. Despite its small size, it can generate sequences that are quirky, unpredictable, and full of "human-like" character-level errors.

---

## Model Overview

i3-tiny is trained to predict the next character in a sequence, making it ideal for **character-level language modeling**, **creative text generation**, and **research on lightweight, efficient models**. Its small footprint allows rapid experimentation, even on modest hardware, and it provides a playground for studying how models learn patterns in sequences of characters.
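Next-character prediction can be illustrated with a minimal sketch. This is not the i3-tiny training code: the helper names (`build_vocab`, `encode`, `decode`, `next_char_pairs`) and the toy corpus are hypothetical, chosen only to show how a lowercased character vocabulary and shifted input/target windows fit together.

```python
# Illustrative sketch only; not the actual i3-tiny data pipeline.

def build_vocab(text):
    """Map each distinct lowercased character to an integer id, and back."""
    chars = sorted(set(text.lower()))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text, stoi):
    """Turn a string into a list of character ids (lowercasing first)."""
    return [stoi[ch] for ch in text.lower()]


def decode(ids, itos):
    """Turn a list of character ids back into a string."""
    return "".join(itos[i] for i in ids)


def next_char_pairs(ids, seq_len):
    """Yield (input, target) windows; the target is the input shifted by one."""
    for start in range(len(ids) - seq_len):
        x = ids[start : start + seq_len]
        y = ids[start + 1 : start + seq_len + 1]
        yield x, y


corpus = "The quick brown fox"  # toy stand-in for the real training text
stoi, itos = build_vocab(corpus)
ids = encode(corpus, stoi)

x, y = next(next_char_pairs(ids, seq_len=8))
print(repr(decode(x, itos)), "->", repr(decode(y, itos)))
# 'the quic' -> 'he quick'
```

At every position the model sees the characters so far and is scored on the single character that follows, which is why each target window is just the input window moved one step to the right.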

The model is **intentionally experimental**: it is not aligned, fact-checked, or polished. Outputs may be coherent, partially readable, or amusingly garbled.

---

## Architecture: i3

The **i3 architecture** (pronounced "i-three") is a novel hybrid design optimized for extreme efficiency on resource-constrained hardware. The name reflects its design goal: to enable language model training on modest consumer CPUs, including Intel Core i3 processors.

### Key Design Principles

i3 combines multiple efficiency techniques to keep training memory under 1 GB:

- **Hybrid sequence modeling**: Blends different approaches to long-range dependency capture, balancing expressiveness with computational efficiency
- **Low-rank parameterization**: Strategic use of matrix factorization reduces memory footprint while maintaining model capacity
- **Factorized attention mechanisms**: Efficient approximations that preserve attention's ability to model relationships without quadratic memory costs
- **Linear-time operations**: Emphasis on operations that scale linearly with sequence length rather than quadratically
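To make the low-rank idea concrete, the sketch below compares the parameter count of a dense weight matrix with that of a rank-`r` factorization. The layer sizes and rank are hypothetical; the model card does not state the actual dimensions or how i3 applies the factorization.

```python
# Illustrative parameter counting; the sizes below are assumptions,
# not values taken from the i3 architecture.

def dense_params(d_in, d_out):
    """Weights in a full d_in x d_out matrix W."""
    return d_in * d_out


def low_rank_params(d_in, d_out, rank):
    """Weights when W is approximated as A @ B,
    with A of shape (d_in, rank) and B of shape (rank, d_out)."""
    return rank * (d_in + d_out)


d_in, d_out, rank = 256, 256, 16  # hypothetical layer sizes
full = dense_params(d_in, d_out)
factored = low_rank_params(d_in, d_out, rank)

print(f"dense:    {full:,} weights")
print(f"low-rank: {factored:,} weights ({full / factored:.0f}x fewer)")
```

The saving grows as the rank shrinks relative to the layer width, at the cost of restricting the layer to rank-`r` transformations.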

### Efficiency Characteristics

- **Training memory**: < 1 GB RAM total (including model, gradients, and optimizer state)
- **Model size**: 711,106 parameters (~2.7 MB in FP32)
- **Training speed**: ~450 ms per iteration on modest CPU hardware
- **Sequence processing**: Linear complexity enables longer context windows on limited hardware
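The model-size figure can be checked with simple arithmetic. The sketch below also estimates the memory for gradients and optimizer state, assuming everything is stored in FP32 and that AdamW keeps two moment buffers per parameter; these storage assumptions are not stated in the model card.

```python
# Back-of-the-envelope memory estimate for the reported parameter count.
N_PARAMS = 711_106   # from the model card
BYTES_FP32 = 4


def mib(n_bytes):
    """Convert bytes to mebibytes (2**20 bytes)."""
    return n_bytes / 2**20


weights = N_PARAMS * BYTES_FP32
grads = weights              # assumption: one FP32 gradient per weight
adamw_state = 2 * weights    # assumption: two FP32 moment buffers per weight
total = weights + grads + adamw_state

print(f"weights only:            {mib(weights):.2f} MiB")
print(f"weights + grads + AdamW: {mib(total):.2f} MiB")
# weights only:            2.71 MiB
# weights + grads + AdamW: 10.85 MiB
```

This estimate excludes activations and framework overhead, so it is a lower bound on actual training memory rather than a measurement.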

The architecture is designed from the ground up for CPU-friendly training, making it accessible for experimentation and research without requiring specialized hardware.

---

## Training Details

- **Dataset:** ~45,830 characters (a curated text corpus repeated for exposure)
- **Vocabulary:** 34 characters (all lowercased)
- **Sequence length:** 128
- **Training iterations:** 2,000
- **Batch size:** 2
- **Optimizer:** AdamW, learning rate 3e-4
- **Model parameters:** 711,106
- **Hardware:** Trained on free-tier CPU compute (Kaggle)
- **Performance notes:** Each iteration takes roughly 400–500 ms; 100 iterations take ~45 s on average. Loss steadily decreased from 3.53 to 2.15 over training.
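The figures above imply a rough training budget, worked out in the sketch below. It assumes that every batch is completely filled with full-length sequences and that ~45,830 characters is the size of the text being sampled from; neither detail is confirmed by the model card.

```python
# Rough training budget derived from the listed hyperparameters.
ITERATIONS = 2_000
BATCH_SIZE = 2
SEQ_LEN = 128
DATASET_CHARS = 45_830          # approximate, per the model card
SECONDS_PER_100_ITERS = 45      # average reported above

# Characters processed over the whole run (assumes full batches).
chars_seen = ITERATIONS * BATCH_SIZE * SEQ_LEN

# Equivalent number of passes over the dataset (assumes the listed
# size is the text actually sampled from).
passes = chars_seen / DATASET_CHARS

# Total wall-clock time implied by the per-100-iteration average.
total_seconds = ITERATIONS / 100 * SECONDS_PER_100_ITERS

print(f"characters seen: {chars_seen:,}")
print(f"dataset passes:  ~{passes:.1f}")
print(f"training time:   ~{total_seconds / 60:.0f} min")
```

Under these assumptions the run processes 512,000 characters, roughly 11 passes over the text, in about 15 minutes.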

### Training Analysis

The charts below illustrate the model's performance over the 2,000 training iterations.

The **Training Loss Over Iterations** plot shows a clear learning trend: the 50-iteration moving average (red line) falls steadily from ~3.5 to ~2.1 in cross-entropy loss. The **Training Time Performance** plot shows a consistent time per 100-iteration block, so cumulative training time grows almost linearly, which indicates stable and predictable training.

![Training loss over iterations and training time performance plots](https://cdn-uploads.huggingface.co/production/uploads/6615494716917dfdc645c44e/Z0r9xl1cY5KZo3ztnmS7Z.png)
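The reported loss values are easier to interpret as perplexities. The sketch below assumes the loss is cross-entropy measured in nats (natural logarithm), which the model card does not state explicitly; the starting loss of 3.53 is at least consistent with that, since a uniform guess over 34 characters gives ln(34) ≈ 3.53.

```python
import math

VOCAB_SIZE = 34                      # from the model card
INITIAL_LOSS, FINAL_LOSS = 3.53, 2.15  # from the model card

# Loss of a model that assigns equal probability to every character.
uniform_loss = math.log(VOCAB_SIZE)

# Perplexity: the effective number of equally likely next characters.
# Assumes the loss is in nats.
initial_ppl = math.exp(INITIAL_LOSS)
final_ppl = math.exp(FINAL_LOSS)

print(f"uniform baseline loss: {uniform_loss:.3f}")
print(f"initial perplexity:    {initial_ppl:.1f}")
print(f"final perplexity:      {final_ppl:.1f}")
```

Read this way, the model starts out no better than guessing uniformly among all 34 characters and ends up about as uncertain as a choice among roughly 8 to 9 characters at each step.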

**Example generation (iteration 1200):**

```
Prompt: "The quick"
Generated: the quick efehn. dethe cans the fice the fpeens antary of eathetint, an thadat hitimes the and cow thig, and
```

These outputs capture the **chaotic creativity** of a character-level model: a mixture of readable words, invented forms, and surprising sequences.

---

## Use Cases

- **Educational research**: Study how tiny models learn language patterns
- **Creative text generation**: Experiment with character-level generation
- **Efficiency benchmarking**: Test memory-constrained training scenarios
- **Architecture research**: Explore novel approaches to efficient language modeling

---

## Limitations

- Character-level modeling only (no subword tokenization)
- Small vocabulary (34 characters)
- Limited training data and iterations
- Not suitable for production use or factual tasks
- Outputs are experimental and unfiltered

---

## Citation

If you use this model or the i3 architecture in your research, please cite:

```bibtex
@misc{i3tiny2024,
  author = {FlameF0X},
  title = {i3-tiny: Ultra-Efficient Character-Level Language Model},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/FlameF0X/i3-tiny}}
}
```