MiniGPT Shakespeare

A lightweight GPT-style transformer trained from scratch on Shakespeare text and integrated with Hugging Face.


Features

  • Custom decoder-only transformer
  • Built entirely in PyTorch
  • Hugging Face compatible (loads via AutoModel)
  • Byte-level BPE tokenizer
  • Text generation with temperature and top-k sampling

Model Details

  • Architecture: Decoder-only Transformer (GPT-style)
  • Layers: 6
  • Heads: 4
  • Embedding Size: 256
  • Vocab Size: 1000
  • Max Sequence Length: 256
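
As a rough back-of-the-envelope check, these hyperparameters put the model at around five million parameters. The sketch below assumes the common 4x feed-forward expansion, learned positional embeddings, and a tied output head, and ignores biases and layer norms; none of these details are confirmed by the repository.

d, n_layers, vocab, seq = 256, 6, 1000, 256
attn = 4 * d * d                  # Q, K, V, and output projections
mlp = 2 * d * (4 * d)             # assumed 4x feed-forward expansion
emb = vocab * d + seq * d         # token + assumed learned positional embeddings
total = n_layers * (attn + mlp) + emb
print(f"~{total / 1e6:.1f}M parameters")  # ~5.0M under these assumptions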

Usage

Load Model

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "flamingo44333/mini-gpt-shakespeare",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    "flamingo44333/mini-gpt-shakespeare",
    trust_remote_code=True
)
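
A quick sanity check after loading (attribute names on a custom config are not guaranteed, but max_seq_len is the one the generation code below relies on):

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
print(model.config.max_seq_len)  # expected: 256, per Model Details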

Generate Text

import torch
import torch.nn.functional as F

def generate(
    model,
    tokenizer,
    prompt,
    max_new_tokens=100,
    temperature=0.5,
    top_k=40,
    device="cuda" if torch.cuda.is_available() else "cpu"
):
    model.eval()
    model.to(device)

    # Encode (match training behavior)
    input_ids = tokenizer.encode(prompt, add_special_tokens=False)
    input_ids = torch.tensor(input_ids, dtype=torch.long).unsqueeze(0).to(device)

    # Handle DataParallel safely
    model_to_use = model.module if hasattr(model, "module") else model

    with torch.no_grad():
        for _ in range(max_new_tokens):

            # Crop context to the model's window; use the unwrapped model's config
            input_crop = input_ids[:, -model_to_use.config.max_seq_len:]

            out = model_to_use(input_crop)
            logits = out["logits"][:, -1, :]  # logits for the last position

            # Block pad/unk tokens from being sampled
            pad_id = tokenizer.pad_token_id
            unk_id = tokenizer.unk_token_id
            if pad_id is not None:
                logits[:, pad_id] = float("-inf")
            if unk_id is not None:
                logits[:, unk_id] = float("-inf")

            logits = logits / temperature

            if top_k is not None:
                values, indices = torch.topk(logits, top_k)
                probs = F.softmax(values, dim=-1)
                next_token = indices.gather(-1, torch.multinomial(probs, 1))
            else:
                probs = F.softmax(logits, dim=-1)
                next_token = torch.multinomial(probs, 1)

            input_ids = torch.cat([input_ids, next_token], dim=1)

    return tokenizer.decode(
        input_ids[0].tolist(),
        clean_up_tokenization_spaces=False
    )
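
For example (output will differ from run to run, since sampling is stochastic):

print(generate(model, tokenizer, "To be, or not to be", max_new_tokens=80))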

Notes

  • No KV cache: attention is recomputed over the full context at every step, so generation is slower than in production LLMs
  • Uses temperature and top-k sampling to keep outputs stable
  • Requires trust_remote_code=True because the architecture is custom

Future Improvements

  • Add .generate() API support
  • Implement KV caching
  • Add top-p (nucleus) sampling (see the sketch after this list)
  • Train on larger datasets
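
Top-p is not implemented yet; below is a minimal sketch of the nucleus filtering step that could replace the top-k branch inside generate above. It is an illustration of the idea, not code from this repository.

import torch
import torch.nn.functional as F

def top_p_filter(logits, top_p=0.9):
    # Sort logits so cumulative probabilities can be computed left to right
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    probs = F.softmax(sorted_logits, dim=-1)
    cum = torch.cumsum(probs, dim=-1)
    # Drop tokens whose cumulative mass *before* them already exceeds
    # top_p, so the nucleus always keeps at least one token
    sorted_logits[(cum - probs) > top_p] = float("-inf")
    # Scatter the filtered logits back to their original positions
    return torch.empty_like(logits).scatter_(-1, sorted_idx, sorted_logits)

# Usage inside the sampling loop:
#   probs = F.softmax(top_p_filter(logits), dim=-1)
#   next_token = torch.multinomial(probs, 1)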

Purpose

This project was built to understand:

  • Transformer internals
  • Attention mechanisms
  • Hugging Face model integration

Author

Praful Srinivasan
