# Tressa GPT 🧠

## Model Details

### Model Description
Tressa GPT is a fully custom, 51-million-parameter GPT-style autoregressive language model built entirely from scratch in PyTorch. It was pre-trained on a 5-billion-token portion of the Hugging Face FineWeb-Edu dataset.
The architecture features 6 Transformer blocks with Multi-Head Causal Attention and incorporates design choices such as Sinusoidal Positional Embeddings, a 5x Feed-Forward Network expansion factor, and weight tying to optimize parameter efficiency.
During training, the model uses a Curriculum Learning pipeline that dynamically expands its context window from 128 up to 1024 tokens mid-training, progressively reducing the batch size to stay within VRAM limits without halting the data stream.
- Model type: Autoregressive Language Model
- Language(s) (NLP): English
- License: Apache 2.0 (or equivalent open-source)
- Developed by: Custom built from scratch
### Model Architecture Configurations
- Parameters: ~51 Million (38.5M saved via weight tying)
- Transformer Blocks: 6
- Attention Heads: 6
- Embedding Dimension: 384
- Feed-Forward Network (FFN) Expansion: 5x
- Vocabulary Size: 100,277
- Max Sequence Length: 1024
- Positional Embeddings: Native Sinusoidal
- Weight-tying: Token embedding matrix tied with the final LM projection head.
- Dropout: 0.1
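For reference, here is a minimal sketch of a configuration object matching these numbers, including how weight tying links the token embedding to the LM head. The field names are assumptions; the actual `GPTConfig` in `config.py` may differ.

```python
from dataclasses import dataclass

import torch.nn as nn


@dataclass
class GPTConfig:  # hypothetical field names; see config.py for the real ones
    n_layers: int = 6          # Transformer blocks
    n_heads: int = 6           # attention heads
    d_model: int = 384         # embedding dimension
    ffn_mult: int = 5          # FFN expansion factor (5 * 384 = 1920 hidden units)
    vocab_size: int = 100_277  # cl100k_base vocabulary
    max_seq_len: int = 1024    # maximum context length
    dropout: float = 0.1


# Weight tying: the LM head reuses the token-embedding matrix, so the
# 100,277 x 384 projection (~38.5M parameters) is stored only once.
cfg = GPTConfig()
tok_emb = nn.Embedding(cfg.vocab_size, cfg.d_model)
lm_head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)
lm_head.weight = tok_emb.weight  # shapes match: (vocab_size, d_model)
```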
## Training Details

### Training Data
The model is trained on a high-quality educational dataset stream.
- Dataset: `HuggingFaceFW/fineweb-edu`
- Subset: `sample-10BT`
- Filtering Rules: Documents with `language_score` < 0.7 or a generic `score` < 3 are dynamically filtered out.
- Tokens Processed: 5,000,000,000 (exactly 5 billion)
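A minimal sketch of how such streaming filtering could look with the `datasets` library; the field names and thresholds follow the rules above, but the surrounding code is an assumption, not the actual training pipeline.

```python
from datasets import load_dataset

# Stream the subset so the 10BT sample never has to fit on disk.
stream = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    streaming=True,
)

# Drop documents below the quality thresholds described above.
filtered = stream.filter(
    lambda doc: doc["language_score"] >= 0.7 and doc["score"] >= 3
)
```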
### Tokenizer

- Tokenizer: `cl100k_base` (via OpenAI's `tiktoken`)
- Type: Byte-Pair Encoding (BPE)
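For example, the tokenizer can be loaded directly from `tiktoken`:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Hello, Tressa!")  # list of token IDs
text = enc.decode(ids)              # round-trips back to the string
print(enc.n_vocab)                  # 100277, matching the model's vocab size
```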
### Optimizer and Schedule
- Optimizer: AdamW
- Learning Rate: 3e-4
- Total Training Steps: 1,220,703
- Precision: TF32 (TensorFloat-32) enabled for maximized matrix-multiplication performance (`torch.set_float32_matmul_precision('high')`).
- Gradient Clipping: Max norm limit of 1.0.
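Put together, a minimal sketch of this setup in PyTorch; `model` is a stand-in for a `TressaGPTModel` instance, and the training loop itself is omitted.

```python
import torch

# Enable TF32 matmuls (Ampere-class GPUs such as the A40)
torch.set_float32_matmul_precision("high")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Inside the training loop, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip to 1.0
optimizer.step()
optimizer.zero_grad()
```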
### Curriculum Learning (Dynamic Context Growth)
Tressa GPT integrates a custom PyTorch `IterableDataset` that dynamically slices the token stream to conserve memory and implements Curriculum Learning (a rough sketch follows the list):

- The `block_size` starts small and climbs systematically to the `max_seq_len` (1024) across the training stages.
- The dataset's state dictionaries cleanly carry the byte stream across these phases without resyncing or skipping context.
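As a rough sketch of the idea (the stage boundaries, class name, and token source are assumptions, not the actual training schedule), the context length grows while the batch size shrinks so the tokens-per-step budget stays roughly constant:

```python
import torch
from torch.utils.data import IterableDataset

# Hypothetical curriculum: (block_size, batch_size) per stage.
# Doubling the context while halving the batch keeps tokens/step constant.
STAGES = [(128, 64), (256, 32), (512, 16), (1024, 8)]


class CurriculumTokenDataset(IterableDataset):
    """Slices one long token stream into blocks of the current stage's size."""

    def __init__(self, tokens: torch.Tensor, stage: int = 0):
        self.tokens = tokens
        self.stage = stage  # persisted in a state dict so resumes keep position

    def __iter__(self):
        block_size, _ = STAGES[self.stage]
        # Yield (input, target) pairs shifted by one token.
        for i in range(0, len(self.tokens) - block_size - 1, block_size):
            chunk = self.tokens[i : i + block_size + 1]
            yield chunk[:-1], chunk[1:]
```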
### Hardware Details
- Hardware: Batch and context sizes partitioned to fit within A40 GPU memory constraints.
- Fault Tolerance: Robust state saving every 5,000 steps, covering the model weights, optimizer state, and streaming-dataset position so training can resume cleanly.
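For illustration, a checkpoint bundling all three states might look like this; apart from `model_state_dict`, which the loading example below relies on, the function and key names are assumptions.

```python
import torch


def save_checkpoint(step, model, optimizer, dataset, path="checkpoints"):
    """Persist everything needed to resume training from this exact point."""
    torch.save(
        {
            "model_state_dict": model.state_dict(),          # weights
            "optimizer_state_dict": optimizer.state_dict(),  # AdamW moments
            "dataset_state": dataset.state_dict(),           # stream position (assumed API)
            "step": step,
        },
        f"{path}/ckpt_step_{step}.pt",
    )
```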
## How to use

You can use the model right away with basic PyTorch operations, or run the `chat.py` interface included in the source code to prompt the model directly.
```python
import torch

from config import GPTConfig
from model import TressaGPTModel

# Set up the config to match the standard Tressa GPT hyperparameters
config = GPTConfig()

# Initialize the model and load the pre-trained weights
model = TressaGPTModel(config)
checkpoint = torch.load("checkpoints/gpt_model_5B_tokens.pt", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Ready for autoregressive generation!
```
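From here, a simple greedy decoding loop might look as follows. This assumes the model's forward pass returns next-token logits of shape `(batch, seq, vocab)`; adapt it to the actual `TressaGPTModel` interface, or use `chat.py` instead.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = torch.tensor([enc.encode("The water cycle begins when")])

with torch.no_grad():
    for _ in range(50):  # generate up to 50 new tokens
        logits = model(ids[:, -1024:])  # stay within max_seq_len
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        ids = torch.cat([ids, next_id], dim=1)

print(enc.decode(ids[0].tolist()))
```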
## Intended Uses & Limitations

This model serves primarily as an educational, proof-of-concept milestone for custom model engineering and scaling.

Limitations: Due to its relatively small parameter count (51M) compared to state-of-the-art foundation models (7B+), it may struggle with deeply complex logic or highly specific factual-retrieval tasks. Its generation focuses on structure, mimicking contextual grammar rather than exhibiting robust, generalized intelligence.