# GPT-2 XL Fine-Tuning on EU_ACT Dataset

## Overview
This project fine-tunes the GPT-2 XL model (`openai-community/gpt2-xl`) on text extracted from `EU_ACT.pdf`. The fine-tuned model is then uploaded to the Hugging Face Hub for easy access and deployment.
## Features
- Uses Hugging Face Transformers for model training
- Data preprocessing: extracts text from the PDF and cleans it
- Tokenizer: GPT-2 XL tokenizer with padding (see the padding note below)
- Fine-tuning on the extracted dataset
- Mixed precision training (fp16) for faster computation
- Uploads the model to the Hugging Face Hub
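On the padding point above: GPT-2 ships without a pad token, so one has to be assigned before batched tokenization will work. A minimal sketch, assuming the common workaround of reusing the EOS token (the actual script is not shown, so this is illustrative):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("openai-community/gpt2-xl")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token by default

# Batched inputs can now be padded to a common length
batch = tokenizer(
    ["Article 1", "Article 2 of the regulation"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # padded shape, e.g. torch.Size([2, 5])
```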
## Project Structure

```
.
├── EU_ACT.pdf           # Dataset (PDF format)
├── gpt2xl_finetune.py   # Fine-tuning script
├── gpt2-xl-euact1/      # Trained model output
└── README.md            # Documentation
```
## Installation
Make sure you have Python installed. Then, install the required libraries:
```bash
pip install transformers datasets torch huggingface_hub PyPDF2
```
## Usage
Run the script to fine-tune GPT-2 XL:
```bash
python gpt2xl_finetune.py
```
This will:
- Extract text from `EU_ACT.pdf`
- Tokenize and preprocess the data
- Fine-tune GPT-2 XL
- Save and upload the model to the Hugging Face Hub
## Model Training Pipeline
- Load and Preprocess Data
  - Extract text from the PDF
  - Clean the text (remove special characters, extra whitespace, etc.)
- Tokenization
  - Convert text to tokens using the GPT-2 XL tokenizer
- Fine-Tuning
  - Train using the `Trainer` API with `TrainingArguments`
- Save & Upload
  - Save the model locally and upload it to the Hugging Face Hub
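Putting these four steps together, here is a condensed sketch of what `gpt2xl_finetune.py` plausibly does. The block size, hyperparameters, and cleaning regex are illustrative assumptions, not values taken from the actual script:

```python
# Illustrative end-to-end sketch; file paths, hyperparameters, and the
# repo id are assumptions, not the script's actual values.
import re

from PyPDF2 import PdfReader
from datasets import Dataset
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
)

# 1. Load and preprocess: pull raw text out of the PDF and normalize it.
reader = PdfReader("EU_ACT.pdf")
raw_text = "\n".join(page.extract_text() or "" for page in reader.pages)
clean_text = re.sub(r"\s+", " ", raw_text).strip()  # collapse whitespace

# 2. Tokenization: GPT-2 has no pad token, so reuse EOS for padding,
#    then split the token stream into fixed-length blocks for causal LM training.
tokenizer = GPT2TokenizerFast.from_pretrained("openai-community/gpt2-xl")
tokenizer.pad_token = tokenizer.eos_token
block_size = 512  # assumed value
ids = tokenizer(clean_text)["input_ids"]
blocks = [ids[i : i + block_size] for i in range(0, len(ids) - block_size, block_size)]
dataset = Dataset.from_dict({"input_ids": blocks})

# 3. Fine-tuning with the Trainer API; fp16 enables mixed precision on GPU.
model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2-xl")
args = TrainingArguments(
    output_dir="gpt2-xl-euact1",
    num_train_epochs=3,              # assumed value
    per_device_train_batch_size=1,   # GPT-2 XL is large; keep batches small
    gradient_accumulation_steps=8,   # assumed value
    fp16=True,                       # mixed precision training
    save_strategy="epoch",
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
trainer = Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator)
trainer.train()

# 4. Save locally and push to the Hub (requires `huggingface-cli login` first).
trainer.save_model("gpt2-xl-euact1")
tokenizer.save_pretrained("gpt2-xl-euact1")
model.push_to_hub("sssdddwd/gpt2-xl-Transfer-euact1")
tokenizer.push_to_hub("sssdddwd/gpt2-xl-Transfer-euact1")
```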
## Model Upload Link
Once training is complete, the model will be available at:
https://huggingface.co/sssdddwd/gpt2-xl-Transfer-euact1
## Example Code to Use the Fine-Tuned Model
```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the fine-tuned model and its tokenizer from the Hub
tokenizer = GPT2Tokenizer.from_pretrained("sssdddwd/gpt2-xl-Transfer-euact1")
model = GPT2LMHeadModel.from_pretrained("sssdddwd/gpt2-xl-Transfer-euact1")

# Generate a continuation of the prompt
text = "EU regulations state that"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
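Note that `generate` defaults to greedy decoding and `max_length` counts the prompt tokens toward the limit; for longer or more varied completions, consider `max_new_tokens` together with sampling options such as `do_sample=True` and `top_p`.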
Author: Shreyash Darade

Last Updated: Feb 2025

Powered by Hugging Face & Transformers