🧠 T5-Small · Automatic Question Generator

Answer-aware question generation fine-tuned on SQuAD 1.1


📌 Overview

This is a fully fine-tuned T5-small (60M parameters) model for automatic question generation (AQG) from educational text passages. Given a passage and a highlighted answer span, the model generates a grammatically correct, context-aware question.

This model was developed as part of an MPhil Data Science semester research project at Punjab University, Lahore, Pakistan, focusing on NLP applications in educational technology.


🏗️ Model Architecture

Base Model      →  google-t5/t5-small  (60M parameters)
Task            →  Text-to-Text Generation (Seq2Seq)
Input format    →  "generate question: {passage with <hl> answer <hl>}"
Output format   →  "{generated question}"
Max input len   →  512 tokens
Max output len  →  64 tokens
Decoding        →  Beam search (num_beams=4)
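
To make the input format concrete, here is a minimal sketch of how the prompt string can be assembled. The helper name `build_qg_input` and the choice to raise an error when the answer is missing are illustrative assumptions, not part of the released code.

```python
def build_qg_input(passage: str, answer: str) -> str:
    """Wrap the first occurrence of `answer` in <hl> markers and add the task prefix."""
    if not answer or answer not in passage:
        # Assumption: fail loudly, since without a highlighted span
        # the model has no answer to condition on.
        raise ValueError("answer span not found in passage")
    highlighted = passage.replace(answer, f"<hl> {answer} <hl>", 1)
    return f"generate question: {highlighted}"


print(build_qg_input("Chlorophyll absorbs sunlight.", "sunlight"))
# generate question: Chlorophyll absorbs <hl> sunlight <hl>.
```

Only the first occurrence of the answer is highlighted, so an answer that appears several times in the passage is always marked at its earliest position.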

📊 Training Details

Hyperparameter               Value
Base model                   google-t5/t5-small
Dataset                      SQuAD 1.1
Training samples             20,000
Validation samples           2,000
Epochs                       3
Batch size                   4
Gradient accumulation steps  2 (effective batch = 8)
Learning rate                3e-4
Weight decay                 0.01
Warmup steps                 500
Optimizer                    AdamW
Precision                    FP16 (mixed precision)
Hardware                     Google Colab T4 GPU (16 GB)
Framework                    HuggingFace Transformers + PyTorch
Training time                ~75 minutes
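
The hyperparameters above imply a fixed number of optimizer steps. The following is a small worked calculation, assuming a single GPU and that no partial batches are dropped; the variable names are illustrative.

```python
import math

train_samples = 20_000
per_device_batch = 4
grad_accum_steps = 2
epochs = 3
warmup_steps = 500

effective_batch = per_device_batch * grad_accum_steps         # samples per optimizer step
steps_per_epoch = math.ceil(train_samples / effective_batch)  # optimizer steps per epoch
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(effective_batch, steps_per_epoch, total_steps)  # 8 2500 7500
print(f"{warmup_fraction:.1%}")                       # 6.7%
```

Under these assumptions the 500 warmup steps cover roughly the first 7% of training.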

📈 Evaluation Results

Evaluated on 200 samples from the SQuAD 1.1 validation split:

Metric   Score  Description
ROUGE-1  0.47   Unigram overlap with reference questions
ROUGE-2  0.22   Bigram overlap (phrase-level similarity)
ROUGE-L  0.44   Longest common subsequence (fluency)
BLEU     0.16   N-gram precision score
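
For intuition about what the ROUGE rows measure, here is a simplified pure-Python sketch of ROUGE-1 and ROUGE-L F1 on whitespace tokens. It is not the script used to produce the scores above. Standard ROUGE packages apply their own tokenization and optional stemming, so their numbers will differ. All function names are illustrative.

```python
from collections import Counter


def _tokens(text: str) -> list[str]:
    # Simplification: lowercase and split on whitespace only.
    return text.lower().split()


def _f1(overlap: int, pred_len: int, ref_len: int) -> float:
    if overlap == 0:
        return 0.0
    precision, recall = overlap / pred_len, overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge1_f1(prediction: str, reference: str) -> float:
    """F1 over clipped unigram overlap."""
    pred, ref = _tokens(prediction), _tokens(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return _f1(overlap, len(pred), len(ref))


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rougeL_f1(prediction: str, reference: str) -> float:
    """F1 over the longest common subsequence, which rewards matching word order."""
    pred, ref = _tokens(prediction), _tokens(reference)
    return _f1(lcs_length(pred, ref), len(pred), len(ref))


generated = "Where does photosynthesis occur?"
reference = "Where does photosynthesis take place?"
print(round(rouge1_f1(generated, reference), 3))  # 0.667
print(round(rougeL_f1(generated, reference), 3))  # 0.667
```

ROUGE-1 ignores word order while ROUGE-L does not: for the pair "b a" and "a b", this sketch gives a ROUGE-1 of 1.0 but a ROUGE-L of 0.5.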

🚀 Quick Start

Installation

pip install transformers torch sentencepiece

Basic Usage

from transformers import T5ForConditionalGeneration, AutoTokenizer

model_id  = "Hamzasajjad38/t5-small-qg"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model     = T5ForConditionalGeneration.from_pretrained(model_id)

def generate_question(passage: str, answer: str) -> str:
    """Generate a question given a passage and answer span."""
    highlighted = passage.replace(answer, f"<hl> {answer} <hl>", 1)
    input_text  = f"generate question: {highlighted}"
    inputs  = tokenizer(
        input_text,
        return_tensors="pt",
        max_length=512,
        truncation=True
    )
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        num_beams=4,
        early_stopping=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

Example

passage = """
Photosynthesis is a process used by plants to convert light energy 
into chemical energy stored in glucose. It occurs mainly in the 
chloroplasts using chlorophyll pigment to absorb sunlight.
"""

examples = [
    ("plants",       "What organism uses photosynthesis?"),
    ("chloroplasts", "Where does photosynthesis take place?"),
    ("chlorophyll",  "What pigment absorbs sunlight in photosynthesis?"),
]

for answer, expected in examples:
    generated = generate_question(passage.strip(), answer)
    print(f"Answer   : {answer}")
    print(f"Generated: {generated}")
    print(f"Expected : {expected}")
    print("─" * 50)

Batch Generation with spaCy

import spacy
from transformers import T5ForConditionalGeneration, AutoTokenizer

nlp       = spacy.load("en_core_web_sm")
model_id  = "Hamzasajjad38/t5-small-qg"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model     = T5ForConditionalGeneration.from_pretrained(model_id)

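def generate_all_questions(passage: str, num_questions: int = 3) -> list[dict]:
    """Generate questions for answer candidates that spaCy finds in the passage.

    Relies on generate_question() from the Basic Usage section.
    """
    doc        = nlp(passage)
    # Candidate answers: named entities first, then noun chunks
    candidates = [e.text for e in doc.ents]
    candidates += [c.text for c in doc.noun_chunks]

    # Deduplicate case-insensitively and keep only spans of 2 to 5 words
    # (single-word candidates such as "Brazil" are skipped)
    seen, unique = set(), []
    for c in candidates:
        if c.lower() not in seen and 1 < len(c.split()) <= 5:
            seen.add(c.lower())
            unique.append(c)

    results = []
    for answer in unique[:num_questions]:
        q = generate_question(passage, answer)
        results.append({"question": q, "answer": answer})

    return results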

passage = "The Amazon River is the largest river by discharge in the world, \
flowing through Brazil, Peru, and Colombia."

for item in generate_all_questions(passage, num_questions=3):
    print(f"Q: {item['question']}")
    print(f"A: {item['answer']}\n")
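
The candidate filter in the function above can be tried without loading spaCy or the model. This sketch reproduces the same deduplication and length rule as a standalone helper; the name `filter_candidates` and the `min_words`/`max_words` parameters are illustrative assumptions.

```python
def filter_candidates(candidates, min_words=2, max_words=5):
    """Drop case-insensitive duplicates and spans outside the word-count range."""
    seen, unique = set(), []
    for text in candidates:
        key = text.lower()
        if key not in seen and min_words <= len(text.split()) <= max_words:
            seen.add(key)
            unique.append(text)
    return unique


spans = ["The Amazon River", "Brazil", "the amazon river", "the largest river"]
print(filter_candidates(spans))
# ['The Amazon River', 'the largest river']
print(filter_candidates(spans, min_words=1))
# ['The Amazon River', 'Brazil', 'the largest river']
```

With the default range, single-word answers such as country names never reach the model. Lowering `min_words` to 1 keeps them, at the cost of more generic candidates from noun chunks.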

📁 Repository Structure

Hamzasajjad38/t5-small-qg/
├── config.json              # Model architecture config
├── generation_config.json   # Generation hyperparameters
├── model.safetensors        # Fine-tuned model weights (242 MB)
├── tokenizer.json           # Fast tokenizer vocab
├── tokenizer_config.json    # Tokenizer configuration
├── spiece.model             # SentencePiece vocabulary
└── training_args.bin        # Saved training arguments
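
As a rough sanity check on the 242 MB weight file, the size follows from the parameter count and the storage type. The figures below are assumptions for illustration: about 60.5 million parameters (the card rounds this to 60M) stored as 32-bit floats, ignoring the small safetensors header.

```python
num_params = 60_500_000   # assumed approximate parameter count for t5-small
bytes_per_fp32 = 4        # assumed storage type: 32-bit floats
bytes_per_fp16 = 2

fp32_mb = num_params * bytes_per_fp32 / 1e6
fp16_mb = num_params * bytes_per_fp16 / 1e6

print(f"{fp32_mb:.0f} MB")  # 242 MB
print(f"{fp16_mb:.0f} MB")  # 121 MB
```

Loading the weights in half precision would roughly halve the memory footprint under the same assumptions.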

🎯 Intended Use

✅ Suitable for:

  • Generating comprehension questions from textbook passages
  • Educational assessment and quiz creation
  • E-learning platforms requiring automated question banks
  • Research in NLP and educational technology
  • Prototype development for EdTech applications

❌ Not suitable for:

  • Questions requiring external knowledge beyond the passage
  • Non-English text (use mT5 for multilingual support)
  • High-stakes exam generation without human review
  • Very short passages (< 30 words)

⚠️ Limitations

  • Language: English only, since the model was trained exclusively on English Wikipedia text
  • Domain: Best performance on factual, encyclopedic-style passages similar to Wikipedia
  • Size: The 60M-parameter T5-small limits generation quality relative to larger variants (T5-base, T5-large)
  • Answer extraction: Model requires answer spans to be provided; it does not extract answers automatically
  • Repetition: May generate similar questions for passages with repetitive content
  • Hallucination: Occasionally generates questions not fully grounded in the passage
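
One way to soften the repetition issue is to post-filter generated questions. The sketch below uses Jaccard similarity over lowercase word sets, a technique chosen here for illustration rather than something the model or demo is known to use. The helper names and the 0.8 threshold are assumptions.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def _token_set(question: str) -> set:
    # Lowercase and strip punctuation so "occur?" and "occur" compare equal.
    return set(question.lower().translate(_PUNCT_TABLE).split())


def drop_near_duplicates(questions, threshold=0.8):
    """Keep a question only if its Jaccard similarity to every kept one is below threshold."""
    kept, kept_sets = [], []
    for question in questions:
        tokens = _token_set(question)
        is_duplicate = any(
            len(tokens & other) / len(tokens | other) >= threshold
            for other in kept_sets
            if tokens | other  # skip the empty-vs-empty case to avoid dividing by zero
        )
        if not is_duplicate:
            kept.append(question)
            kept_sets.append(tokens)
    return kept


questions = [
    "Where does photosynthesis occur?",
    "where does photosynthesis occur",
    "What pigment absorbs sunlight?",
]
print(drop_near_duplicates(questions))
# ['Where does photosynthesis occur?', 'What pigment absorbs sunlight?']
```

This only catches surface-level repeats. Paraphrased duplicates with different wording would still pass through and need human review.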

🔬 Research Context

This model is part of ongoing research into NLP-powered educational tools. Potential future extensions include:

  • 🌐 Urdu language support using mT5 fine-tuning for Pakistani educational content
  • 📚 Bloom's Taxonomy alignment: categorizing questions by cognitive level
  • 🎯 MCQ distractor generation using WordNet and sense2vec
  • 📊 Human evaluation beyond automated ROUGE/BLEU metrics

🖥️ Live Demo

Try the model without any code on Hugging Face Spaces:

👉 https://huggingface.co/spaces/Hamzasajjad38/Automatic-Question-Generator

Features:

  • Paste any passage and generate questions instantly
  • Adjustable number of questions (1–5)
  • Short answer and True/False question types
  • Download generated questions as .txt

📜 Citation

If you use this model in your research or project, please cite:

@misc{sajjad2025t5qg,
  author       = {Muhammad Hamza Sajjad},
  title        = {T5-Small Fine-tuned for Automatic Question Generation on SQuAD 1.1},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/Hamzasajjad38/t5-small-qg}}
}

👤 Author

Muhammad Hamza Sajjad

  • 🎓 MPhil Data Science, Punjab University, Lahore
  • 🔬 Research interests: NLP, Computer Vision, Educational AI
  • 🤗 Hugging Face

Fine-tuned with ❤️ using HuggingFace Transformers · PyTorch · Google Colab