Tressa GPT 🧠

Model Details

Model Description

Tressa GPT is a custom, 51-million-parameter GPT-style autoregressive language model built from scratch in PyTorch. It was pre-trained on a 5-billion-token portion of the Hugging Face FineWeb-Edu dataset.

The architecture is a 6-layer Transformer with multi-head causal attention, and incorporates design choices such as sinusoidal positional embeddings, a 5x feed-forward network expansion factor, and weight tying between the token embedding and the output projection to improve parameter efficiency.

Training uses a curriculum learning pipeline that dynamically expands the context window from 128 up to 1024 tokens mid-training, progressively reducing the batch size to stay within VRAM limits without halting the data stream.

  • Model type: Autoregressive Language Model
  • Language(s) (NLP): English
  • License: Apache 2.0 (or equivalent open-source)
  • Developed by: Custom built from scratch

Model Architecture Configurations

  • Parameters: ~51 Million (38.5M saved via weight tying)
  • Transformer Blocks: 6
  • Attention Heads: 6
  • Embedding Dimension: 384
  • Feed-Forward Network (FFN) Expansion: 5x
  • Vocabulary Size: 100,277
  • Max Sequence Length: 1024
  • Positional Embeddings: Native Sinusoidal
  • Weight-tying: Token embedding matrix shared with the final LM projection head.
  • Dropout: 0.1
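Two of the choices above, fixed sinusoidal positional embeddings and weight tying, can be sketched in PyTorch as follows. The function and variable names here are illustrative, not the model's actual code:

```python
import math
import torch

def sinusoidal_positions(max_len: int, d_model: int) -> torch.Tensor:
    # Standard fixed sinusoidal positional embeddings (Vaswani et al., 2017):
    # even dimensions use sine, odd dimensions use cosine.
    position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_positions(1024, 384)  # (max_seq_len, embedding_dim)

# Weight tying: the LM head reuses the token embedding matrix,
# saving 100,277 x 384 ≈ 38.5M parameters.
tok_emb = torch.nn.Embedding(100_277, 384)
lm_head = torch.nn.Linear(384, 100_277, bias=False)
lm_head.weight = tok_emb.weight
```

Because the embeddings are fixed functions of position rather than learned parameters, they add no trainable weights, and the tied head accounts for the ~38.5M parameter saving noted above.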

Training Details

Training Data

The model is trained on a high-quality educational dataset stream.

  • Dataset: HuggingFaceFW/fineweb-edu
  • Subset: sample-10BT
  • Filtering Rules: Documents with language_score < 0.7 or an educational score < 3 are filtered out on the fly.
  • Tokens Processed: 5,000,000,000 (5 billion)
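The filtering rule above amounts to a simple streaming predicate. A minimal sketch, assuming the FineWeb-Edu column names `language_score` and `score`:

```python
def keep_document(doc: dict) -> bool:
    # Drop documents whose language-identification confidence or
    # educational-quality score falls below the thresholds above.
    return doc.get("language_score", 0.0) >= 0.7 and doc.get("score", 0) >= 3

# Toy examples standing in for streamed FineWeb-Edu records.
docs = [
    {"text": "kept", "language_score": 0.95, "score": 4},
    {"text": "low-lang", "language_score": 0.50, "score": 4},
    {"text": "low-score", "language_score": 0.90, "score": 2},
]
kept = [d for d in docs if keep_document(d)]  # only the first survives
```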

Tokenizer

  • Tokenizer: cl100k_base (via OpenAI's tiktoken)
  • Type: Byte-Pair Encoding (BPE)

Optimizer and Schedule

  • Optimizer: AdamW
  • Learning Rate: 3e-4
  • Total Training Steps: 1,220,703
  • Precision: TF32 (TensorFloat-32) enabled for faster matrix multiplication (torch.set_float32_matmul_precision('high')).
  • Gradient Clipping: Max norm limit of 1.0.
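A minimal sketch of the optimizer setup described above, with a small stand-in module in place of the full model:

```python
import torch

torch.set_float32_matmul_precision("high")  # enable TF32 matmuls on Ampere+ GPUs

model = torch.nn.Linear(384, 384)  # stand-in for the full Tressa GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# One illustrative training step with a dummy loss.
x = torch.randn(8, 384)
loss = model(x).pow(2).mean()

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip to norm 1.0
optimizer.step()
optimizer.zero_grad()
```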

Curriculum Learning (Dynamic Context Growth)

Tressa GPT uses a custom PyTorch IterableDataset that dynamically slices the token stream to conserve memory and implements curriculum learning:

  • The block_size starts small and grows in stages up to max_seq_len (1024) over the course of training.
  • The dataset's state dictionary carries the stream position across these stages, so no data is re-read or skipped.
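A hypothetical sketch of the idea (not the model's actual dataset class): a single token stream is sliced into blocks whose size grows per stage, with the read position carried forward so nothing is re-read or skipped:

```python
from torch.utils.data import IterableDataset

class CurriculumStream(IterableDataset):
    """Illustrative only: slice one token stream into blocks that grow per stage."""

    def __init__(self, tokens, stages):
        self.tokens = tokens  # flat sequence of token ids
        self.stages = stages  # list of (num_blocks, block_size) pairs

    def __iter__(self):
        pos = 0  # stream position persists across stages
        for num_blocks, block_size in self.stages:
            for _ in range(num_blocks):
                if pos + block_size + 1 > len(self.tokens):
                    return
                chunk = self.tokens[pos : pos + block_size + 1]
                pos += block_size  # advance without skipping any context
                yield chunk[:-1], chunk[1:]  # (inputs, next-token targets)

# Two stages: four 128-token blocks, then two 1024-token blocks.
stream = CurriculumStream(list(range(10_000)), stages=[(4, 128), (2, 1024)])
batches = list(stream)
```

Note that the second stage picks up exactly where the first left off, which is the "without resyncing or skipping context" property described above.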

Hardware Details

  • Hardware: Training configured to fit within NVIDIA A40 GPU memory constraints.
  • Fault Tolerance: Checkpoints saved every 5000 steps, including optimizer state and streaming-dataset state for clean resumption.
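A hypothetical sketch of such a checkpoint; the exact keys in the real checkpoints are assumptions here, apart from model_state_dict, which appears in the loading example under "How to use":

```python
import os
import tempfile
import torch

def save_checkpoint(path, step, model, optimizer, dataset_state):
    # Persist everything needed to resume training: model weights,
    # optimizer moments, and the streaming dataset's position/stage.
    torch.save({
        "step": step,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "dataset_state": dataset_state,
    }, path)

model = torch.nn.Linear(4, 4)  # stand-in for the full model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

path = os.path.join(tempfile.gettempdir(), "tressa_ckpt_demo.pt")
save_checkpoint(path, step=5000, model=model, optimizer=optimizer,
                dataset_state={"stream_pos": 123})
ckpt = torch.load(path, map_location="cpu")
```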

How to use

You can use the model directly with basic PyTorch operations, or run the chat.py interface included in the source code to prompt the model interactively.

```python
import torch
from config import GPTConfig
from model import TressaGPTModel

# Set up the config matching the Tressa GPT architecture above
config = GPTConfig()

# Initialize the model and load the trained weights
model = TressaGPTModel(config)
checkpoint = torch.load("checkpoints/gpt_model_5B_tokens.pt", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])

model.eval()

# Ready for autoregressive generation
```
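Once loaded, generation is a standard autoregressive sampling loop. A hypothetical sketch, assuming the model's forward pass returns logits of shape (batch, time, vocab); the real TressaGPTModel interface may differ:

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=1024, temperature=1.0):
    # idx: (batch, time) tensor of token ids.
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                   # crop to the context window
        logits = model(idx_cond)[:, -1, :] / temperature  # logits at the last position
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample one token
        idx = torch.cat([idx, next_id], dim=1)             # append and continue
    return idx

# Demo with a toy stand-in model (an embedding mapping ids straight to logits).
class ToyLM(torch.nn.Module):
    def __init__(self, vocab=10):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, vocab)

    def forward(self, idx):
        return self.emb(idx)  # (batch, time, vocab)

out = generate(ToyLM(), torch.zeros(1, 1, dtype=torch.long), max_new_tokens=5)
```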

Intended Uses & Limitations

This model serves primarily as an educational, proof-of-concept milestone in engineering and scaling a custom model from scratch.

Limitations: Due to its small size (51M parameters) relative to state-of-the-art foundation models (7B+), it may struggle with complex reasoning or specific factual retrieval. Its generations capture structure and contextual grammar rather than robust, generalized intelligence.
