# Tressa GPT 🧠

## Model Details

### Model Description
Tressa GPT is a fully custom, 51-million-parameter GPT-style autoregressive language model built entirely from scratch in PyTorch. It was pre-trained on a 5-billion-token portion of the Hugging Face FineWeb-Edu dataset.
The architecture features 6 Transformer blocks with Multi-Head Causal Attention and incorporates design choices such as Sinusoidal Positional Embeddings, a 5x Feed-Forward Network expansion factor, and weight tying to optimize parameter efficiency.
During training, the model uses a Curriculum Learning pipeline that dynamically expands its context window from 128 up to 1024 tokens mid-training, progressively reducing the batch size to stay within VRAM limits without halting the data stream.
- Model type: Autoregressive Language Model
- Language(s) (NLP): English
- License: Apache 2.0 (or equivalent open-source)
- Developed by: Custom built from scratch
### Model Architecture Configurations
- Parameters: ~51 Million (38.5M saved via weight tying)
- Transformer Blocks: 6
- Attention Heads: 6
- Embedding Dimension: 384
- Feed-Forward Network (FFN) Expansion: 5x
- Vocabulary Size: 100,277
- Max Sequence Length: 1024
- Positional Embeddings: Native Sinusoidal
- Weight-tying: Token embedding matrix tied with the final LM projection head.
- Dropout: 0.1
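For reference, here is a minimal sketch of a configuration object matching these numbers, including how weight tying links the token embedding to the LM head. The field names are assumptions; the actual `GPTConfig` in `config.py` may differ.

```python
from dataclasses import dataclass

import torch.nn as nn


@dataclass
class GPTConfig:  # hypothetical field names; see config.py for the real ones
    n_layers: int = 6          # Transformer blocks
    n_heads: int = 6           # attention heads
    d_model: int = 384         # embedding dimension
    ffn_mult: int = 5          # FFN expansion factor (5 * 384 = 1920 hidden units)
    vocab_size: int = 100_277  # cl100k_base vocabulary
    max_seq_len: int = 1024    # maximum context length
    dropout: float = 0.1


# Weight tying: the LM head reuses the token-embedding matrix, so the
# 100,277 x 384 projection (~38.5M parameters) is stored only once.
cfg = GPTConfig()
tok_emb = nn.Embedding(cfg.vocab_size, cfg.d_model)
lm_head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)
lm_head.weight = tok_emb.weight  # shapes match: (vocab_size, d_model)
```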
## Training Details

### Training Data
The model is trained on a high-quality educational dataset stream.
- Dataset: `HuggingFaceFW/fineweb-edu`
- Subset: `sample-10BT`
- Filtering Rules: Documents with `language_score` < 0.7 or a generic `score` < 3 are dynamically filtered out.
- Tokens Processed: 5,000,000,000 (exactly 5 billion)
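A minimal sketch of how such streaming filtering could look with the `datasets` library; the field names and thresholds follow the rules above, but the surrounding code is an assumption, not the actual training pipeline.

```python
from datasets import load_dataset

# Stream the subset so the 10BT sample never has to fit on disk.
stream = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    streaming=True,
)

# Drop documents below the quality thresholds described above.
filtered = stream.filter(
    lambda doc: doc["language_score"] >= 0.7 and doc["score"] >= 3
)
```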
### Tokenizer

- Tokenizer: `cl100k_base` (via OpenAI's `tiktoken`)
- Type: Byte-Pair Encoding (BPE)
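For example, the tokenizer can be loaded directly from `tiktoken`:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Hello, Tressa!")  # list of token IDs
text = enc.decode(ids)              # round-trips back to the string
print(enc.n_vocab)                  # 100277, matching the model's vocab size
```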
### Optimizer and Schedule
- Optimizer: AdamW
- Learning Rate: 3e-4
- Total Training Steps: 1,220,703
- Precision: TF32 (TensorFloat-32) enabled for maximized matrix-multiplication performance (`torch.set_float32_matmul_precision('high')`).
- Gradient Clipping: Max norm limit of 1.0.
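Put together, a minimal sketch of this setup in PyTorch; `model` is a stand-in for a `TressaGPTModel` instance, and the training loop itself is omitted.

```python
import torch

# Enable TF32 matmuls (Ampere-class GPUs such as the A40)
torch.set_float32_matmul_precision("high")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Inside the training loop, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip to 1.0
optimizer.step()
optimizer.zero_grad()
```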
### Curriculum Learning (Dynamic Context Growth)
Tressa GPT integrates a custom PyTorch `IterableDataset` that dynamically slices the token stream to conserve memory and implements Curriculum Learning (a rough sketch follows the list):

- The `block_size` starts small and climbs systematically to the `max_seq_len` (1024) across the training stages.
- The dataset's state dictionaries cleanly carry the byte stream across these phases without resyncing or skipping context.
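As a rough sketch of the idea (the stage boundaries, class name, and token source are assumptions, not the actual training schedule), the context length grows while the batch size shrinks so the tokens-per-step budget stays roughly constant:

```python
import torch
from torch.utils.data import IterableDataset

# Hypothetical curriculum: (block_size, batch_size) per stage.
# Doubling the context while halving the batch keeps tokens/step constant.
STAGES = [(128, 64), (256, 32), (512, 16), (1024, 8)]


class CurriculumTokenDataset(IterableDataset):
    """Slices one long token stream into blocks of the current stage's size."""

    def __init__(self, tokens: torch.Tensor, stage: int = 0):
        self.tokens = tokens
        self.stage = stage  # persisted in a state dict so resumes keep position

    def __iter__(self):
        block_size, _ = STAGES[self.stage]
        # Yield (input, target) pairs shifted by one token.
        for i in range(0, len(self.tokens) - block_size - 1, block_size):
            chunk = self.tokens[i : i + block_size + 1]
            yield chunk[:-1], chunk[1:]
```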
### Hardware Details
- Hardware: Batch and context sizes partitioned to fit within A40 GPU memory constraints.
- Fault Tolerance: Robust state saving every 5,000 steps, covering the model weights, optimizer state, and streaming-dataset position so training can resume cleanly.
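For illustration, a checkpoint bundling all three states might look like this; apart from `model_state_dict`, which the loading example below relies on, the function and key names are assumptions.

```python
import torch


def save_checkpoint(step, model, optimizer, dataset, path="checkpoints"):
    """Persist everything needed to resume training from this exact point."""
    torch.save(
        {
            "model_state_dict": model.state_dict(),          # weights
            "optimizer_state_dict": optimizer.state_dict(),  # AdamW moments
            "dataset_state": dataset.state_dict(),           # stream position (assumed API)
            "step": step,
        },
        f"{path}/ckpt_step_{step}.pt",
    )
```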
## How to use

You can use the model right away with basic PyTorch operations, or run the `chat.py` interface included in the source code to prompt the model directly.
```python
import torch

from config import GPTConfig
from model import TressaGPTModel

# Set up the config to match the standard Tressa GPT hyperparameters
config = GPTConfig()

# Initialize the model and load the pre-trained weights
model = TressaGPTModel(config)
checkpoint = torch.load("checkpoints/gpt_model_5B_tokens.pt", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Ready for autoregressive generation!
```
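From here, a simple greedy decoding loop might look as follows. This assumes the model's forward pass returns next-token logits of shape `(batch, seq, vocab)`; adapt it to the actual `TressaGPTModel` interface, or use `chat.py` instead.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = torch.tensor([enc.encode("The water cycle begins when")])

with torch.no_grad():
    for _ in range(50):  # generate up to 50 new tokens
        logits = model(ids[:, -1024:])  # stay within max_seq_len
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        ids = torch.cat([ids, next_id], dim=1)

print(enc.decode(ids[0].tolist()))
```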
## Intended Uses & Limitations

This model serves primarily as an educational, proof-of-concept milestone for custom model engineering and scaling.

Limitations: Due to its relatively small parameter count (51M) compared to state-of-the-art foundation models (7B+), it may struggle with deeply complex logic or highly specific factual-retrieval tasks. Its generation focuses on structure, mimicking contextual grammar rather than exhibiting robust, generalized intelligence.