---
license: mit
language:
- en
library_name: transformers
tags:
- gpt2
- question-answering
- reinforcement-learning
- ppo
- squad
- fine-tuned
datasets:
- rajpurkar/squad
base_model: openai-community/gpt2
pipeline_tag: text-generation
---

# GPT2 Fine-tuned with Reinforcement Learning for Question Answering

This model is a GPT2 model (`openai-community/gpt2`) fine-tuned with **Reinforcement Learning (PPO)** on the **SQuAD dataset** for question answering.

## Model Description

- **Base Model:** [openai-community/gpt2](https://huggingface.co/openai-community/gpt2)
- **Training Method:** Proximal Policy Optimization (PPO)
- **Dataset:** [SQuAD (Stanford Question Answering Dataset)](https://huggingface.co/datasets/rajpurkar/squad)
- **Task:** Question Answering with formatted responses
- **Language:** English

## Training Details

### Reinforcement Learning Approach
This model was trained with PPO (Proximal Policy Optimization), using shaped rewards that encourage a specific response format:

**Response Format:**
- Starts with: `"That is a great question! "`
- Ends with: `" Let me know if you have any other questions."`

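As a minimal sketch of how this format could be checked or stripped after generation, the helpers below test for the prefix and suffix and return the text between them. The names `follows_format` and `extract_answer` are hypothetical and are not part of this model's code.

```python
# Prefix and suffix copied from the response format described above.
PREFIX = "That is a great question! "
SUFFIX = " Let me know if you have any other questions."


def follows_format(response: str) -> bool:
    """Return True if the response has both the expected prefix and suffix."""
    return (
        response.startswith(PREFIX)
        and response.endswith(SUFFIX)
        # Guard against the prefix and suffix sharing the same single space.
        and len(response) >= len(PREFIX) + len(SUFFIX)
    )


def extract_answer(response: str) -> str:
    """Strip the prefix and suffix when present and return the remaining text."""
    body = response
    if body.startswith(PREFIX):
        body = body[len(PREFIX):]
    if body.endswith(SUFFIX):
        body = body[: -len(SUFFIX)]
    return body.strip()


example = (
    "That is a great question! The capital of France is Paris."
    " Let me know if you have any other questions."
)
print(follows_format(example))
print(extract_answer(example))
```

A helper like this may be useful downstream, since the limitations of this model include responses that do not always follow the format.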
### Reward Shaping
| Reward | Condition |
|--------|-----------|
| +5 | Response starts with correct prefix |
| +5 | Response ends with correct suffix |
| +3 | Contains meaningful content |
| +5 | Contains reference answer |
| -3 | Missing prefix or suffix |

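The reward table can be read as a scoring function. The sketch below is one possible interpretation, not the actual training code. It makes several assumptions that the table does not specify: "meaningful content" is approximated by a hypothetical minimum word count (`MIN_CONTENT_WORDS`), the reference answer is matched case-insensitively as a substring, and the -3 penalty is applied once when either the prefix or the suffix is missing.

```python
PREFIX = "That is a great question! "
SUFFIX = " Let me know if you have any other questions."
MIN_CONTENT_WORDS = 3  # hypothetical threshold for "meaningful content"


def shaped_reward(response: str, reference_answer: str) -> float:
    """Score a response according to one reading of the reward table."""
    reward = 0.0
    has_prefix = response.startswith(PREFIX)
    has_suffix = response.endswith(SUFFIX)

    if has_prefix:
        reward += 5.0
    if has_suffix:
        reward += 5.0
    # Assumption: a single penalty if either piece is missing.
    if not (has_prefix and has_suffix):
        reward -= 3.0

    # Text between the prefix and suffix (or the whole response if absent).
    start = len(PREFIX) if has_prefix else 0
    end = len(response) - len(SUFFIX) if has_suffix else len(response)
    body = response[start:end].strip()

    # Assumption: "meaningful content" means at least a few words.
    if len(body.split()) >= MIN_CONTENT_WORDS:
        reward += 3.0
    # Assumption: case-insensitive substring match on the reference answer.
    if reference_answer and reference_answer.lower() in body.lower():
        reward += 5.0
    return reward


good = PREFIX + "The capital of France is Paris." + SUFFIX
print(shaped_reward(good, "Paris"))     # 18.0
print(shaped_reward("Paris", "Paris"))  # 2.0
```

Under this reading, the maximum reward is 18, and a bare correct answer without the format scores far lower. This is consistent with the table weighting format compliance heavily.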
### Training Configuration
- **Epochs:** 3
- **Batch Size:** 1
- **Learning Rate:** 1e-5
- **Max Sequence Length:** 128
- **Training Samples:** 300 (from SQuAD)

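To make the scale of this run concrete, the configuration above can be written as a small data structure. This is only a sketch: the `TrainingConfig` class and its method names are hypothetical, and the card does not state how many PPO optimization steps were taken per batch, so the code counts batches processed rather than gradient updates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters listed in the training configuration above."""

    epochs: int = 3
    batch_size: int = 1
    learning_rate: float = 1e-5
    max_sequence_length: int = 128
    training_samples: int = 300

    def batches_per_epoch(self) -> int:
        # Ceiling division: a final partial batch still counts as one batch.
        return -(-self.training_samples // self.batch_size)

    def total_batches(self) -> int:
        return self.epochs * self.batches_per_epoch()


config = TrainingConfig()
print(config.total_batches())  # 900
```

With a batch size of 1, the model sees 900 single-sample batches in total, which is a very small training budget.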
## Usage

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("StevenHuo/StevenHuo-gpt2-squad-rl")
model = AutoModelForCausalLM.from_pretrained("StevenHuo/StevenHuo-gpt2-squad-rl")

# Prepare input
question = "What is the capital of France?"
context = "France is a country in Western Europe. Its capital is Paris, which is known for the Eiffel Tower."
prompt = f"Question: {question}\nContext: {context}\nAnswer: That is a great question! "

# Generate
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.8,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id
)

# Decode (the decoded text includes the prompt followed by the generated continuation)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Example Output

**Input:**

```
Question: What is the capital of France?
Context: France is a country in Western Europe. Its capital is Paris, which is known for the Eiffel Tower.
```

**Output:**

```
That is a great question! The capital of France is Paris. Let me know if you have any other questions.
```

## Intended Use

- Fine-tuning of language models
- Question-answering tasks with formatted responses
- Learning about PPO (Proximal Policy Optimization) for NLP

## Limitations

- Based on GPT2 (124M parameters), which has limited reasoning capabilities
- Generated responses do not always follow the expected format
- Trained on a small subset of SQuAD (300 samples)
- Best suited for simple factual questions