---
license: apache-2.0
tags:
- conversational
- efficient
- i3-architecture
- custom_code
datasets:
- starhopp3r/TinyChat
language:
- en
pipeline_tag: text-generation
library_name: transformers
---
# i3 Model - Ultra-Efficient Pretraining Language Model

## Model Description

The **i3 Model** is designed to optimize **pretraining efficiency** while retaining core language modeling capabilities. Its architecture allows training on **memory-constrained hardware**, including CPU-only setups, without sacrificing sequence modeling performance.

> [!NOTE]
> The i3 architecture is built for highly efficient pretraining: it is designed to **reduce memory usage**, **speed up training**, and allow pretraining from scratch on tiny hardware. Internal details are abstracted for simplicity.

---

## Use

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="FlameF0X/i3-12m")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

---

## Model Statistics

* **Vocabulary Size:** 4,466 (variable-length chunks)
* **Hidden Dimension:** 512
* **Number of Layers:** 12
* **Max Sequence Length:** 256
* **Total Parameters:** 12,691,186
* **Tokenization:** Memory-efficient variable-length chunking (2–3 characters)
* **Total tokens:** 334,524,736

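The chunking scheme itself is not documented in this card; the sketch below shows one plausible greedy variant that prefers known 3-character chunks, falls back to 2-character chunks, and emits single characters otherwise. The `chunk_tokenize` helper and the example `vocab` contents are hypothetical, not the actual i3 tokenizer:

```python
# Hypothetical sketch of variable-length chunk tokenization (2-3 characters).
# Greedy longest-match: try a 3-char chunk, then 2-char, then a single char.

def chunk_tokenize(text, vocab):
    """Split `text` into variable-length chunks drawn from `vocab`."""
    tokens = []
    i = 0
    while i < len(text):
        for size in (3, 2, 1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += len(piece)
                break
    return tokens

vocab = {"the", "he", "ing", "lo"}
print(chunk_tokenize("hello", vocab))  # ['he', 'l', 'lo']
```

A chunk vocabulary of this kind stays small (4,466 entries here) because most 2–3 character combinations never occur in the training text.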
---

## Key Features

1. **Memory-Optimized:** Designed to train on tiny hardware with minimal RAM usage
2. **Pretraining-Focused Architecture:** i3 layers provide efficient sequence modeling, low-rank linear updates, and factorized attention
3. **Variable-Length Tokenization:** 2–3 character chunks for compact embeddings
4. **Conversational Readiness:** Optimized for dialogue and text generation

---

## i3 Architecture (Abstract Overview)

### Design Philosophy

The i3 model targets **CPU-friendly, memory-constrained pretraining**, emphasizing:

* Long-range sequence modeling
* Low-rank weight updates for memory savings
* Efficient factorized attention
* 4-bit weights and microbatching for a minimal memory footprint

## Open-Source Technologies Used in the i3 Architecture

* [Low-Rank Pre-training](https://github.com/FlameF0X/Low-Rank-Pretraining) - LoRA-style factorization applied to pretraining.

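The low-rank idea above can be sketched in a few lines: instead of a dense `d_out × d_in` weight, train two thin factors, cutting the parameter count from `d_out·d_in` to `r·(d_in + d_out)`. The class below is an illustrative NumPy sketch, not the actual i3 or Low-Rank Pre-training implementation:

```python
import numpy as np

# Minimal sketch of a low-rank linear layer: y = B @ (A @ x).
# A (r x d_in) projects down to the rank-r bottleneck; B (d_out x r)
# projects back up. Shapes and init scale here are assumptions.

class LowRankLinear:
    def __init__(self, d_in, d_out, rank, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(0, 0.02, size=(rank, d_in))   # down-projection
        self.B = rng.normal(0, 0.02, size=(d_out, rank))  # up-projection

    def __call__(self, x):
        # x: (batch, d_in) -> (batch, d_out)
        return x @ self.A.T @ self.B.T

    def num_params(self):
        return self.A.size + self.B.size

layer = LowRankLinear(d_in=512, d_out=512, rank=32)
dense_params = 512 * 512          # 262,144 for a full-rank layer
print(layer.num_params())         # 32*(512+512) = 32,768 (an 8x reduction)
```

At the model's hidden dimension of 512, a rank-32 factorization stores one eighth of the weights of a dense layer, which is where much of the memory saving comes from.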
### Conceptual Layout

```
Input Tokens
      ↓
+-----------------+
| Embedding Layer |
+-----------------+
      ↓
+-----------------+
| i3 Architecture |
+-----------------+
      ↓
+------------------------+
| KQV Low-Rank Attention |
+------------------------+
      ↓
+-----------------------+
| LayerNorm + Residuals |
+-----------------------+
      ↓
+-------------------+
| Output Projection |
+-------------------+
      ↓
Predicted Tokens
```

> Key idea: Every component is optimized for **memory efficiency** and **pretraining speed** on small hardware, while preserving essential transformer dynamics.

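One way to read the "KQV Low-Rank Attention" box: the Q, K, and V projections pass through a rank-r bottleneck before standard scaled dot-product attention. Since the card abstracts the internals, the shared-bottleneck layout and all shapes below are assumptions for illustration:

```python
import numpy as np

# Sketch of factorized (low-rank) KQV attention for a single head:
# one shared down-projection to rank r, then separate up-projections
# for Q, K, and V, followed by ordinary scaled dot-product attention.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_attention(x, down, up_q, up_k, up_v, out):
    # x: (seq, d_model); down: (d_model, r); up_*: (r, d_model)
    h = x @ down                       # shared low-rank bottleneck
    q, k, v = h @ up_q, h @ up_k, h @ up_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v @ out   # out: (d_model, d_model)

rng = np.random.default_rng(0)
d, r, seq = 512, 32, 8
x = rng.normal(size=(seq, d))
y = low_rank_attention(
    x,
    rng.normal(0, 0.02, (d, r)),
    *(rng.normal(0, 0.02, (r, d)) for _ in range(3)),
    rng.normal(0, 0.02, (d, d)),
)
print(y.shape)  # (8, 512)
```

The attention math is unchanged; only the projection matrices are factorized, which keeps the transformer dynamics the diagram describes while shrinking the weights.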
---

## Training Details

* **Sequence length:** 128–512 tokens
* **Model size:** ~12M parameters (CPU-friendly)
* **Optimizer:** AdamW or Lion (4-bit / mixed precision)
* **Dataset:** TinyChat (~50–200 MB)
* **Training loop:** gradient checkpointing + recomputation
* **Objective:** token prediction / text generation

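The microbatching mentioned under Design Philosophy amounts to gradient accumulation: average gradients over several small microbatches, then take one optimizer step, so peak memory is bounded by the microbatch size rather than the full batch. The actual i3 loop is not published; this is a toy least-squares illustration of the pattern:

```python
import numpy as np

# Microbatching via gradient accumulation on a toy linear model.
# Gradients are accumulated over microbatches of 16 rows, so only a
# small slice of activations is live at any time, then one SGD step
# is taken per pass over the data.

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

w = np.zeros(4)
lr, micro = 0.1, 16
for epoch in range(200):
    grad = np.zeros_like(w)
    for start in range(0, len(X), micro):       # accumulate over microbatches
        xb, yb = X[start:start + micro], y[start:start + micro]
        grad += xb.T @ (xb @ w - yb) / len(X)   # scaled so the sum equals the full-batch gradient
    w -= lr * grad                              # one optimizer step per epoch
print(np.round(w, 2))
```

Gradient checkpointing with recomputation pushes the same trade-off further: intermediate activations are dropped in the forward pass and recomputed during backward, exchanging compute for memory.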
## Citation

```bibtex
@software{lorpt2025,
  title={LoRPt: Low-Rank Pretraining for Resource-Efficient Language Models},
  author={FlameF0X},
  year={2025},
  url={https://github.com/FlameF0X/Low-Rank-Pretraining}
}
```