---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- i3-architecture
- custom_code
---
# i3-tiny
**i3-tiny** is a compact, efficient character-level language model designed for experimentation and exploration in text generation. Despite its small size, it can generate sequences that are quirky, unpredictable, and full of "human-like" character-level errors.
---
## Model Overview
i3-tiny is trained to predict the next character in a sequence, making it ideal for **character-level language modeling**, **creative text generation**, and **research on lightweight, efficient models**. Its small footprint allows rapid experimentation, even on modest hardware, and it provides a playground for studying how models learn patterns in sequences of characters.
The model is **intentionally experimental** — it's not aligned, fact-checked, or polished. Outputs may be coherent, partially readable, or amusingly garbled.
---
## Architecture: i3
The **i3 architecture** (pronounced "i-three") is a novel hybrid design optimized for extreme efficiency on resource-constrained hardware. The name reflects its design goal: to enable language model training on modest consumer CPUs, including Intel Core i3 processors.
### Key Design Principles
i3 combines multiple efficiency techniques to achieve sub-1GB memory usage during training:
- **Hybrid sequence modeling**: Blends different approaches to long-range dependency capture, balancing expressiveness with computational efficiency
- **Low-rank parameterization**: Strategic use of matrix factorization reduces memory footprint while maintaining model capacity (see the sketch after this list)
- **Factorized attention mechanisms**: Efficient approximations that preserve attention's ability to model relationships without quadratic memory costs
- **Linear-time operations**: Emphasis on operations that scale linearly with sequence length rather than quadratically
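To make the low-rank idea concrete, here is a minimal, generic sketch of a rank-factorized linear layer in PyTorch. It illustrates the parameter saving in principle only; the class name and dimensions are illustrative and are not taken from the i3 codebase.
```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Rank-r factorization of a dense linear layer: W is approximated as U @ V.

    Weight count drops from d_in * d_out to r * (d_in + d_out), which is the
    kind of saving low-rank parameterization relies on.
    """
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)  # V: d_in -> r
        self.up = nn.Linear(rank, d_out, bias=True)    # U: r -> d_out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

# Example: a 256 -> 256 projection at rank 32 stores
# 32 * (256 + 256) = 16,384 weights (plus bias) instead of 65,536.
layer = LowRankLinear(256, 256, rank=32)
print(sum(p.numel() for p in layer.parameters()))
```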
### Efficiency Characteristics
- **Training memory**: < 1 GB RAM total (including model, gradients, and optimizer state)
- **Model size**: 711,106 parameters (~2.7 MB in FP32)
- **Training speed**: ~450 ms per iteration on modest CPU hardware
- **Sequence processing**: Linear complexity enables longer context windows on limited hardware
The architecture is designed from the ground up for CPU-friendly training, making it accessible for experimentation and research without requiring specialized hardware.
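The FP32 size figure above follows directly from the parameter count: 711,106 parameters × 4 bytes ≈ 2.7 MiB. A quick way to reproduce that estimate for any PyTorch module (the helper name is ours, not part of the repository):
```python
import torch.nn as nn

def fp32_size_mib(model: nn.Module) -> float:
    """Approximate FP32 parameter footprint in MiB (4 bytes per parameter)."""
    return sum(p.numel() for p in model.parameters()) * 4 / (1024 ** 2)

# Using the reported count directly:
print(711_106 * 4 / (1024 ** 2))  # ~2.71 MiB
```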
---
## Training Details
* **Dataset:** ~45,830 characters (a curated text corpus repeated for exposure)
* **Vocabulary:** 34 characters (all lowercased)
* **Sequence length:** 128
* **Training iterations:** 2,000
* **Batch size:** 2
* **Optimizer:** AdamW, learning rate 3e-4 (see the training-loop sketch after this list)
* **Model parameters:** 711,106
* **Hardware:** Trained on free-tier CPU compute (Kaggle)
* **Performance notes:** Each iteration takes roughly 400–500 ms; 100 iterations take ~45 s on average. Loss steadily decreased from 3.53 to 2.15 over training.
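Read together, these settings imply a conventional next-character training loop. The sketch below is an assumed reconstruction of that loop using the published hyperparameters (sequence length 128, batch size 2, AdamW at 3e-4, 2,000 iterations); the toy corpus and the stand-in GRU model are placeholders so the snippet runs on its own, not the real i3 network or dataset.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder corpus and stand-in model; only the hyperparameters
# (seq_len=128, batch_size=2, AdamW @ 3e-4) come from this model card.
text = ("the quick brown fox jumps over the lazy dog. " * 300).lower()
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

vocab_size, seq_len, batch_size = len(chars), 128, 2

class TinyCharModel(nn.Module):
    """Small GRU stand-in for the i3 network, used only to make the loop runnable."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, 64)
        self.rnn = nn.GRU(64, 128, batch_first=True)
        self.head = nn.Linear(128, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.head(h)

model = TinyCharModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(2_000):
    # Sample random character windows and their one-step-shifted targets.
    ix = torch.randint(len(data) - seq_len - 1, (batch_size,)).tolist()
    x = torch.stack([data[i : i + seq_len] for i in ix])
    y = torch.stack([data[i + 1 : i + seq_len + 1] for i in ix])

    logits = model(x)                                  # (batch, seq, vocab)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), y.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```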
### Training Analysis
The charts below illustrate the model's performance over the 2,000 training iterations.
The **Training Loss Over Iterations** plot shows a clear learning trend: the 50-iteration moving average (red line) confirms a steady decrease in cross-entropy loss from ~3.5 to ~2.1. The **Training Time Performance** plot shows a consistent block time per 100 iterations, resulting in a nearly linear increase in cumulative training time and demonstrating stable, predictable training execution.
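The red smoothing line mentioned above is a plain 50-iteration moving average. For readers who want to reproduce the plot from their own logs (the `losses` array here is placeholder data, not the released training log), one minimal approach:
```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder per-iteration cross-entropy values; substitute your own log.
losses = np.random.uniform(2.1, 3.5, size=2000)

window = 50
smoothed = np.convolve(losses, np.ones(window) / window, mode="valid")

plt.plot(losses, alpha=0.3, label="raw loss")
plt.plot(range(window - 1, len(losses)), smoothed, color="red",
         label="50-iteration moving average")
plt.xlabel("iteration")
plt.ylabel("cross-entropy loss")
plt.legend()
plt.show()
```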

**Example generation (iteration 1200):**
```
Prompt: "The quick"
Generated: the quick efehn. dethe cans the fice the fpeens antary of eathetint, an thadat hitimes the and cow thig, and
```
These outputs capture the **chaotic creativity** of a character-level model: a mixture of readable words, invented forms, and surprising sequences.
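Since the card declares `library_name: transformers` with the `custom_code` tag, loading presumably goes through the auto classes with remote code enabled. The snippet below is an assumed usage pattern, not confirmed by the repository; check the repo's custom modeling files for the actual entry points.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "FlameF0X/i3-tiny"

# Assumption: the repository registers its custom i3 classes for the auto API
# and ships a character-level tokenizer compatible with AutoTokenizer.
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)

inputs = tokenizer("the quick", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=True)
print(tokenizer.decode(output_ids[0]))
```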
---
## Use Cases
- **Educational research**: Study how tiny models learn language patterns
- **Creative text generation**: Experiment with character-level generation
- **Efficiency benchmarking**: Test memory-constrained training scenarios
- **Architecture research**: Explore novel approaches to efficient language modeling
---
## Limitations
- Character-level modeling only (no tokenization)
- Small vocabulary (34 characters)
- Limited training data and iterations
- Not suitable for production use or factual tasks
- Outputs are experimental and unfiltered
---
## Citation
If you use this model or the i3 architecture in your research, please cite:
```bibtex
@misc{i3tiny2024,
  author       = {FlameF0X},
  title        = {i3-tiny: Ultra-Efficient Character-Level Language Model},
  year         = {2024},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/FlameF0X/i3-tiny}}
}
```