Instructions to use i3-lab/i3-12m with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use i3-lab/i3-12m with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="i3-lab/i3-12m", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("i3-lab/i3-12m", trust_remote_code=True, dtype="auto")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use i3-lab/i3-12m with vLLM:
Install from pip and serve the model:
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "i3-lab/i3-12m"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "i3-lab/i3-12m",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
Use Docker:
```shell
docker model run hf.co/i3-lab/i3-12m
```
- SGLang
How to use i3-lab/i3-12m with SGLang:
Install from pip and serve the model:
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "i3-lab/i3-12m" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "i3-lab/i3-12m",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
Use Docker images:
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "i3-lab/i3-12m" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "i3-lab/i3-12m",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
- Docker Model Runner
How to use i3-lab/i3-12m with Docker Model Runner:
```shell
docker model run hf.co/i3-lab/i3-12m
```
i3 Model - Ultra-Efficient Pretraining Language Model
Model Description
The i3 Model is designed to optimize pretraining efficiency while retaining core language modeling capabilities.
Its architecture allows training on memory-constrained hardware, including CPU-only setups, without sacrificing sequence modeling performance.
The i3 architecture is built for highly efficient pretraining: it reduces memory usage, speeds up training, and makes pretraining from scratch feasible on tiny hardware. Internal details are abstracted for simplicity.
Use
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="FlameF0X/i3-12m")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```
Model Statistics
- Vocabulary Size: 4,466 (variable-length chunks)
- Hidden Dimension: 512
- Number of Layers: 12
- Max Sequence Length: 256
- Total Parameters: 12,691,186
- Tokenization: memory-efficient variable-length chunking (2–3 characters)
- Total tokens: 334,524,736
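The variable-length chunking above can be illustrated with a minimal sketch. This is a hypothetical greedy longest-match tokenizer over 2–3 character chunks, not the actual i3 tokenizer; the `vocab` and helper name are illustrative only.

```python
# Hypothetical sketch of variable-length chunk tokenization (2-3 characters),
# assuming a greedy longest-match strategy with a single-character fallback.

def chunk_tokenize(text, vocab):
    """Greedily emit the longest known chunk (3, then 2 chars), else 1 char."""
    tokens = []
    i = 0
    while i < len(text):
        for size in (3, 2, 1):
            chunk = text[i:i + size]
            # single characters always tokenize, so the loop cannot stall
            if len(chunk) == size and (chunk in vocab or size == 1):
                tokens.append(chunk)
                i += size
                break
    return tokens

vocab = {"the", "he", "llo", "lo"}
print(chunk_tokenize("hello", vocab))  # ['he', 'llo']
```

Short multi-character chunks like these keep the vocabulary small (4,466 entries here) while still compressing text better than pure character-level tokenization.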
Key Features
- Memory-Optimized: Designed to train on tiny hardware with minimal RAM usage
- Pretraining-Focused Architecture: i3 layers provide efficient sequence modeling, low-rank linear updates, and factorized attention
- Variable-Length Tokenization: 2–3 character chunks for compact embeddings
- Conversational Readiness: Optimized for dialogue and text generation
i3 Architecture (Abstract Overview)
Design Philosophy
The i3 model targets CPU-friendly, memory-constrained pretraining, emphasizing:
- Long-range sequence modeling
- Low-rank weight updates for memory savings
- Efficient factorized attention
- 4-bit weights and microbatching for minimal memory footprint
Technologies used in the i3 architecture that I have open-sourced:
- Low-Rank Pretraining (LoRPt): LoRA-style low-rank factorization applied during pretraining rather than fine-tuning.
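The idea behind low-rank pretraining can be sketched as follows. This is a minimal illustration of a rank-factored linear layer, assuming the LoRPt approach factors full weight matrices through a rank bottleneck; the class name, rank, and initialization scale are illustrative, not the model's actual values.

```python
import numpy as np

# Sketch: store two factors A (d_in x r) and B (r x d_out) instead of a full
# d_in x d_out matrix, cutting parameters from d_in*d_out to r*(d_in + d_out).

class LowRankLinear:
    def __init__(self, d_in, d_out, rank, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((d_in, rank)) * 0.02
        self.B = rng.standard_normal((rank, d_out)) * 0.02

    def __call__(self, x):
        # x: (batch, d_in) -> (batch, d_out), through the rank-r bottleneck
        return (x @ self.A) @ self.B

    def n_params(self):
        return self.A.size + self.B.size

layer = LowRankLinear(512, 512, rank=32)
print(layer.n_params(), "vs full", 512 * 512)  # 32768 vs full 262144
```

At the card's hidden dimension of 512, a rank-32 factorization stores an eighth of the parameters of a dense layer, which is where the memory savings for tiny-hardware pretraining come from.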
Conceptual Layout
```
Input Tokens
      ↓
+-----------------+
| Embedding Layer |
+-----------------+
      ↓
+-----------------+
| i3 Architecture |
+-----------------+
      ↓
+------------------------+
| KQV Low-Rank Attention |
+------------------------+
      ↓
+-----------------------+
| LayerNorm + Residuals |
+-----------------------+
      ↓
+-------------------+
| Output Projection |
+-------------------+
      ↓
Predicted Tokens
```
Key idea: Every component is optimized for memory efficiency and pretraining speed on small hardware, while preserving essential transformer dynamics.
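The "KQV Low-Rank Attention" stage of the layout can be sketched as below. This is a single-head illustration in which each of the Q, K, and V projections is factored through a rank bottleneck; the parameter names, rank, and sequence length are assumptions for the example, not the model's internals.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_attention(x, params):
    # x: (seq, d_model); each projection is a pair (A: (d, r), B: (r, d))
    q = x @ params["Aq"] @ params["Bq"]
    k = x @ params["Ak"] @ params["Bk"]
    v = x @ params["Av"] @ params["Bv"]
    d = q.shape[-1]
    scores = softmax(q @ k.T / np.sqrt(d))  # (seq, seq) attention weights
    return scores @ v                        # (seq, d_model)

rng = np.random.default_rng(0)
d, r, seq = 512, 32, 8
params = {name: rng.standard_normal(shape) * 0.02
          for name, shape in [("Aq", (d, r)), ("Bq", (r, d)),
                              ("Ak", (d, r)), ("Bk", (r, d)),
                              ("Av", (d, r)), ("Bv", (r, d))]}
out = low_rank_attention(rng.standard_normal((seq, d)), params)
print(out.shape)  # (8, 512)
```

The attention math is the standard scaled dot-product; only the projection matrices are factored, so the memory saving is in parameters, not in the attention map itself.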
Training Details
- Sequence length: 128–512 tokens
- Model size: ~12M parameters (CPU-friendly)
- Optimizer: AdamW or Lion (4-bit / mixed precision)
- Dataset: TinyChat (~50–200 MB)
- Training loop: gradient checkpointing + recomputation
- Objective: token prediction / text generation
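The microbatching mentioned among the memory-saving techniques can be sketched with a toy example: split a logical batch into small microbatches and accumulate gradients so the update matches a full-batch step while only one microbatch is live at a time. The 1-D model, function names, and learning rate here are illustrative, not the actual training loop.

```python
# Toy model: predict y = w * x; squared-error gradient with respect to w.
def grad_loss(w, x, y):
    return 2 * (w * x - y) * x

def accumulate_step(w, batch, micro_size, lr=0.01):
    grad = 0.0
    for i in range(0, len(batch), micro_size):
        micro = batch[i:i + micro_size]
        # only this microbatch needs to be held in memory at once
        grad += sum(grad_loss(w, x, y) for x, y in micro)
    # averaged update is identical to a full-batch gradient step
    return w - lr * grad / len(batch)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
w = 0.0
for _ in range(200):
    w = accumulate_step(w, data, micro_size=2)
print(round(w, 3))  # converges to 2.0
```

Gradient checkpointing (listed above as "gradient checkpointing + recomputation") is the complementary trick: instead of shrinking the batch, it drops intermediate activations in the forward pass and recomputes them during backpropagation.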
Citation
```bibtex
@software{lorpt2025,
  title={LoRPt: Low-Rank Pretraining for Resource-Efficient Language Models},
  author={FlameF0X},
  year={2025},
  url={https://github.com/FlameF0X/Low-Rank-Pretraining}
}
```