# GPT-2 XL Fine-Tuning on EU_ACT Dataset

## Overview
This project fine-tunes the GPT-2 XL model (`openai-community/gpt2-xl`) on text extracted from `EU_ACT.pdf`. The fine-tuned model is then uploaded to the Hugging Face Hub for easy access and deployment.
## Features
- Uses Hugging Face Transformers for model training
- Data preprocessing: extracts text from the PDF and cleans it
- Tokenizer: GPT-2 XL tokenizer with padding (see the padding note below)
- Fine-tuning on the extracted dataset
- Mixed precision training (fp16) for faster computation
- Uploads the model to the Hugging Face Hub
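On the padding point above: GPT-2 ships without a pad token, so one has to be assigned before batched tokenization will work. A minimal sketch, assuming the common workaround of reusing the EOS token (the actual script is not shown, so this is illustrative):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("openai-community/gpt2-xl")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token by default

# Batched inputs can now be padded to a common length
batch = tokenizer(
    ["Article 1", "Article 2 of the regulation"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # padded shape, e.g. torch.Size([2, 5])
```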
## Project Structure

```
.
├── EU_ACT.pdf           # Dataset (PDF format)
├── gpt2xl_finetune.py   # Fine-tuning script
├── gpt2-xl-euact1/      # Trained model output
└── README.md            # Documentation
```
## Installation
Make sure you have Python installed. Then, install the required libraries:
```bash
pip install transformers datasets torch huggingface_hub PyPDF2
```
## Usage
Run the script to fine-tune GPT-2 XL:
```bash
python gpt2xl_finetune.py
```
This will:
- Extract text from `EU_ACT.pdf`
- Tokenize and preprocess the data
- Fine-tune GPT-2 XL
- Save and upload the model to the Hugging Face Hub
## Model Training Pipeline
- Load and Preprocess Data
  - Extract text from the PDF
  - Clean the text (remove special characters, extra whitespace, etc.)
- Tokenization
  - Convert text to tokens using the GPT-2 XL tokenizer
- Fine-Tuning
  - Train using the `Trainer` API with `TrainingArguments`
- Save & Upload
  - Save the model locally and upload it to the Hugging Face Hub
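Putting these four steps together, here is a condensed sketch of what `gpt2xl_finetune.py` plausibly does. The block size, hyperparameters, and cleaning regex are illustrative assumptions, not values taken from the actual script:

```python
# Illustrative end-to-end sketch; file paths, hyperparameters, and the
# repo id are assumptions, not the script's actual values.
import re

from PyPDF2 import PdfReader
from datasets import Dataset
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
)

# 1. Load and preprocess: pull raw text out of the PDF and normalize it.
reader = PdfReader("EU_ACT.pdf")
raw_text = "\n".join(page.extract_text() or "" for page in reader.pages)
clean_text = re.sub(r"\s+", " ", raw_text).strip()  # collapse whitespace

# 2. Tokenization: GPT-2 has no pad token, so reuse EOS for padding,
#    then split the token stream into fixed-length blocks for causal LM training.
tokenizer = GPT2TokenizerFast.from_pretrained("openai-community/gpt2-xl")
tokenizer.pad_token = tokenizer.eos_token
block_size = 512  # assumed value
ids = tokenizer(clean_text)["input_ids"]
blocks = [ids[i : i + block_size] for i in range(0, len(ids) - block_size, block_size)]
dataset = Dataset.from_dict({"input_ids": blocks})

# 3. Fine-tuning with the Trainer API; fp16 enables mixed precision on GPU.
model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2-xl")
args = TrainingArguments(
    output_dir="gpt2-xl-euact1",
    num_train_epochs=3,              # assumed value
    per_device_train_batch_size=1,   # GPT-2 XL is large; keep batches small
    gradient_accumulation_steps=8,   # assumed value
    fp16=True,                       # mixed precision training
    save_strategy="epoch",
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
trainer = Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator)
trainer.train()

# 4. Save locally and push to the Hub (requires `huggingface-cli login` first).
trainer.save_model("gpt2-xl-euact1")
tokenizer.save_pretrained("gpt2-xl-euact1")
model.push_to_hub("sssdddwd/gpt2-xl-Transfer-euact1")
tokenizer.push_to_hub("sssdddwd/gpt2-xl-Transfer-euact1")
```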
## Model Upload Link
Once training is complete, the model will be available at:
https://huggingface.co/sssdddwd/gpt2-xl-Transfer-euact1
## Example Code to Use the Fine-Tuned Model
```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the fine-tuned model and its tokenizer from the Hub
tokenizer = GPT2Tokenizer.from_pretrained("sssdddwd/gpt2-xl-Transfer-euact1")
model = GPT2LMHeadModel.from_pretrained("sssdddwd/gpt2-xl-Transfer-euact1")

# Generate a continuation of the prompt
text = "EU regulations state that"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
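Note that `generate` defaults to greedy decoding and `max_length` counts the prompt tokens toward the limit; for longer or more varied completions, consider `max_new_tokens` together with sampling options such as `do_sample=True` and `top_p`.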
Author: Shreyash Darade

Last Updated: Feb 2025

Powered by Hugging Face & Transformers