---
license: mit
language:
- en
tags:
- pytorch
- causal-lm
- gpt2
- text-generation
- transformers
library_name: pytorch
pipeline_tag: text-generation
---
# GPT-2 Style Language Model

This is a GPT-2 style autoregressive language model trained from scratch in PyTorch.

## Model Description

This model implements the GPT-2 architecture with a causal self-attention mechanism for next-token prediction. It was trained on custom text data to learn language patterns and generate coherent text sequences.
### Model Architecture

- **Model Type**: Causal Language Model (Decoder-only Transformer)
- **Architecture**: GPT-2
- **Framework**: PyTorch
- **Parameters**:
  - Number of Layers: {model.config.n_layer}
  - Number of Attention Heads: {model.config.n_head}
  - Embedding Dimension: {model.config.n_embd}
  - Vocabulary Size: {model.config.vocab_size}
  - Maximum Sequence Length: {model.config.block_size} tokens
  - Total Parameters: ~{sum(p.numel() for p in model.parameters()) / 1e6:.2f}M
### Training Details

- **Training Steps**: 500
- **Batch Size**: 4
- **Sequence Length**: 32 tokens
- **Optimizer**: AdamW
- **Learning Rate**: 3e-4
- **Final Training Loss**: {loss.item():.4f}
- **Tokenizer**: GPT-2 BPE tokenizer (tiktoken)
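For reference, these hyperparameters correspond to a training loop along the following lines. This is a minimal, hypothetical sketch rather than the actual training script: `GPT`, `GPTConfig`, and the raw `text` corpus are assumptions (the `GPT` class is the one referenced in the Usage section below).

```python
import torch
import tiktoken

# Hypothetical sketch of the training loop using the hyperparameters listed above.
# `GPT`, `GPTConfig`, and `text` (the raw training corpus) are assumptions.
enc = tiktoken.get_encoding('gpt2')
data = torch.tensor(enc.encode(text), dtype=torch.long)

batch_size, block_size = 4, 32

def get_batch():
    # Sample batch_size random windows of block_size tokens plus their shifted targets.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

model = GPT(GPTConfig())  # assumes the GPT class from this repository
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

model.train()
for step in range(500):
    xb, yb = get_batch()
    logits, loss = model(xb, yb)  # model returns (logits, loss) when targets are given
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
```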
### Intended Use

This model is intended for:

- Text generation tasks
- Educational purposes and research
- Experimentation with language model fine-tuning
- Understanding transformer architectures

### Limitations

- Trained on a limited dataset for only 500 steps
- May not generalize well to all text domains
- Can produce biased or nonsensical outputs
- Not suitable for production use without further training
- Limited context window of {model.config.block_size} tokens (see the cropping note below)
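Because of the limited window, any prompt (or running generation) longer than the context length has to be cropped to the most recent `block_size` tokens before each forward pass. A minimal sketch, assuming the `model` and input tensor `x` created as in the Usage section below:

```python
# Keep only the most recent block_size tokens before each forward pass.
block_size = model.config.block_size  # assumes the config is exposed as model.config
x_cond = x if x.size(1) <= block_size else x[:, -block_size:]
logits, _ = model(x_cond)
```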
## Usage

### Requirements

```bash
pip install torch tiktoken huggingface_hub
```
### Loading the Model

```python
import torch
import torch.nn as nn
from torch.nn import functional as F
from dataclasses import dataclass
from huggingface_hub import hf_hub_download

# Download the checkpoint from the Hub
# (replace "{repo_id}" with the id of this repository)
model_path = hf_hub_download(repo_id="{repo_id}", filename="model.pt")

# Load the checkpoint
checkpoint = torch.load(model_path, map_location='cpu')

# Print the saved model configuration
print("Model Configuration:", checkpoint['config'])

# To use the model, you'll need to define the GPT class
# (see the model architecture code in the repository)
@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50257
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768

# Recreate the model with the saved configuration
config = GPTConfig(**checkpoint['config'])
model = GPT(config)  # requires the full GPT class definition from the repository
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

print(f"Model loaded successfully with {sum(p.numel() for p in model.parameters()):,} parameters")
```
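Note: if the saved `config` is stored as a `GPTConfig` object rather than a plain dict, newer PyTorch releases (where `torch.load` defaults to `weights_only=True`) may refuse to unpickle it; in that case pass `weights_only=False` to `torch.load`, and only do so for checkpoints you trust.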
### Text Generation

```python
import torch
import tiktoken
from torch.nn import functional as F

# Initialize the GPT-2 BPE tokenizer
enc = tiktoken.get_encoding('gpt2')

# Prepare the input prompt
prompt = "Once upon a time"
tokens = enc.encode(prompt)
x = torch.tensor(tokens).unsqueeze(0)  # add batch dimension

# Generate text with top-k sampling
model.eval()
max_length = 50
with torch.no_grad():
    while x.size(1) < max_length:
        logits, _ = model(x)
        logits = logits[:, -1, :]  # logits for the last position
        probs = F.softmax(logits, dim=-1)
        # Top-k sampling: keep the 50 most probable tokens and sample among them
        topk_probs, topk_indices = torch.topk(probs, 50, dim=-1)
        ix = torch.multinomial(topk_probs, 1)
        xcol = torch.gather(topk_indices, -1, ix)
        x = torch.cat((x, xcol), dim=1)

# Decode and print the generated sequence
generated_tokens = x[0].tolist()
generated_text = enc.decode(generated_tokens)
print(generated_text)
```
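The loop above uses top-k sampling with k = 50: at each step only the 50 most probable tokens are kept and renormalized before a token is drawn. Lowering k makes the output more conservative; raising it, or dividing the logits by a temperature before the softmax, makes it more diverse.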
## Training Data

The model was trained on custom text data using the GPT-2 tokenizer. Please refer to the training script for specific dataset details.

## Evaluation

This model checkpoint represents an early training stage (500 steps) and should be considered experimental. For production use, significantly more training is recommended.

## Citation

If you use this model, please cite:
```bibtex
@misc{gpt2_tokeniser_2025,
  author = {agileabhi},
  title = {GPT-2 Style Language Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/{repo_id}}}
}
```
## Model Card Authors

- agileabhi

## License

MIT License - See LICENSE file for details