---
language: en
license: apache-2.0
tags:
- text-generation
- markdown
- education
- syllabus-generation
- codet5
- fine-tuned
- pedagogical-ai
datasets:
- custom
metrics:
- validity-rate
- pedagogical-quality
widget:
- text: 'Generate course syllabus: {"title": "Introduction to Python", "domain": "computer_science", "level": "beginner", "duration": "semester"}'
---
|
|
|
|
|
# CodeT5 Syllabus Generator for Educational Content Creation

## Model Description

This is a fine-tuned [Salesforce/codet5-small](https://huggingface.co/Salesforce/codet5-small) model trained to generate structured markdown syllabi with component selection indices. The model takes course requirements as input and generates a markdown-formatted syllabus whose modules, activities, and assessments are index-based references to pre-defined educational components.

**Key Features:**

- Generates well-structured markdown syllabi
- Selects appropriate components using index notation [0], [1], [2] (see the sketch after this list)
- Understands educational domain concepts (learning objectives, Bloom's taxonomy, difficulty progression)
- Produces prerequisite-aware module sequences
- Trained with pedagogical quality metrics
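
To make the index notation concrete, here is a minimal sketch of resolving a bracketed index against a component list. The inline `modules` data is a simplified assumption for illustration; the project's actual component files (e.g., `data/components/modules.json`) may use a richer schema.

```python
import re

# Hypothetical, simplified component library; the real modules.json
# used by this project may contain more fields.
modules = [
    {"title": "Introduction to Machine Learning", "duration": "8 weeks"},
    {"title": "Supervised Learning Algorithms", "duration": "12 weeks"},
]

line = "### Module 2: Supervised Learning Algorithms [1]"

# The model emits a bracketed index; extract it and look up the component.
match = re.search(r"\[(\d+)\]", line)
if match:
    component = modules[int(match.group(1))]
    print(component["title"], "-", component["duration"])
```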
|
|
|
|
|
## Training Data

- **Training Examples:** 1,300 curated course-to-syllabus pairs
- **Epochs:** 20
- **Data Quality:** High-quality examples covering:
  - Multiple difficulty levels (beginner, intermediate, advanced)
  - Various domains (computer science, data science, business, arts)
  - Diverse course structures and pedagogical approaches
  - Bloom's taxonomy alignment
  - Assessment types and learning activities
|
|
|
|
|
## Training Configuration

```yaml
Model: Salesforce/codet5-small (60M parameters)
Tokenizer: RobertaTokenizer
Batch Size: 16
Gradient Accumulation: 2 (effective batch size: 32)
Learning Rate: 3e-4
Weight Decay: 0.01
Label Smoothing: 0.1
Max Input Length: 640 tokens
Max Output Length: 536 tokens
```
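
For reference, these hyperparameters map onto Hugging Face `Seq2SeqTrainingArguments` roughly as shown below. This is a minimal sketch assuming a standard `Seq2SeqTrainer` setup; the output directory is a hypothetical placeholder, and the exact training script is not reproduced here.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the configuration above expressed as training arguments.
training_args = Seq2SeqTrainingArguments(
    output_dir="codet5-syllabus",       # hypothetical path
    num_train_epochs=20,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,      # effective batch size: 32
    learning_rate=3e-4,
    weight_decay=0.01,
    label_smoothing_factor=0.1,
    gradient_checkpointing=True,        # used here to fit CPU training
    predict_with_generate=True,
)
```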
|
|
|
|
|
## Usage

### Quick Start

```python
import json

import torch
from transformers import RobertaTokenizer, T5ForConditionalGeneration

# Load model and tokenizer
model_id = "dewyn/educraft-t5-function-call"
tokenizer = RobertaTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Prepare input
requirements = {
    "title": "Machine Learning Fundamentals",
    "domain": "computer_science",
    "level": "intermediate",
    "duration": "semester",
    "description": "Introduction to machine learning algorithms and applications",
    "learning_objectives": [
        "Understand supervised learning algorithms",
        "Implement neural networks",
        "Evaluate model performance",
    ],
}

input_text = f"Generate course syllabus: {json.dumps(requirements)}"

# Tokenize the request
input_ids = tokenizer(
    input_text,
    return_tensors="pt",
    max_length=640,
    truncation=True,
).input_ids

# Generate the markdown syllabus
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=536,
        num_beams=4,
        no_repeat_ngram_size=2,
    )

generated_markdown = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_markdown)
```
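
Alternatively, the `text2text-generation` pipeline should wrap the same tokenize/generate/decode steps; a minimal sketch, using the prompt from the widget example:

```python
from transformers import pipeline

# Convenience wrapper around the manual steps shown above.
generator = pipeline("text2text-generation", model="dewyn/educraft-t5-function-call")

prompt = 'Generate course syllabus: {"title": "Introduction to Python", "domain": "computer_science", "level": "beginner", "duration": "semester"}'
result = generator(prompt, max_length=536, num_beams=4)
print(result[0]["generated_text"])
```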
|
|
|
|
|
### Expected Output

```markdown
# Machine Learning Fundamentals

**Domain:** Computer Science
**Level:** Intermediate
**Duration:** Semester

## Course Description
Introduction to machine learning algorithms and applications

## Learning Objectives
- Understand supervised learning algorithms
- Implement neural networks
- Evaluate model performance

## Modules

### Module 1: Introduction to Machine Learning [0]
**Duration:** 8 weeks

### Module 2: Supervised Learning Algorithms [1]
**Duration:** 12 weeks
**Prerequisites:** Module 1

### Module 3: Neural Networks [2]
**Duration:** 16 weeks
**Prerequisites:** Module 2

## Activities
- Hands-on ML Exercise [0] - Apply level
- Neural Network Workshop [1] - Create level

## Assessments
- Final Project [0] - Project type
```
|
|
|
|
|
### Integration with Parsing Pipeline

```python
from scripts.markdown_syllabus_parser import MarkdownSyllabusParser

# Parse generated markdown
parser = MarkdownSyllabusParser(
    modules_file="data/components/modules.json",
    activities_file="data/components/activities.json",
    assessments_file="data/components/assessments.json",
)

syllabus = parser.parse_markdown(generated_markdown)

# Result is a complete syllabus dictionary with resolved components
```
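
Because the markdown references components by index, it can help to check for out-of-range module indices before full parsing. A minimal sketch, assuming `modules.json` is a JSON array and module headings follow the `### Module N: Title [i]` pattern shown above:

```python
import json
import re

def out_of_range_module_indices(markdown: str, modules_file: str) -> list:
    """Return module indices referenced in headings but absent from the library."""
    with open(modules_file) as f:
        modules = json.load(f)  # assumed to be a JSON array of components
    pattern = r"^### Module \d+: .+ \[(\d+)\]$"
    referenced = {int(i) for i in re.findall(pattern, markdown, flags=re.MULTILINE)}
    return sorted(i for i in referenced if i >= len(modules))
```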
|
|
|
|
|
## Model Details

**Base Model:** Salesforce/codet5-small

- Pre-trained on 8.35M code functions (Python, Java, Go, JavaScript, Ruby, PHP)
- 60M parameters
- Encoder-decoder transformer architecture

**Why CodeT5 vs T5:**

- CodeT5 is pre-trained primarily on **code** (paired with its natural-language documentation) rather than general web text
- Understands programming syntax and structural patterns
- Better suited to emitting strictly formatted output, such as the bracketed component indices
- Less prone to hallucination or syntax errors
|
|
|
|
|
## Limitations

- Optimized specifically for educational content generation
- Requires a structured input format (JSON with specific keys; see the pre-flight check below)
- Generated syllabi assume the project's component libraries and parsing pipeline (the SyllabusBuilder API) are available
- May need post-processing for edge cases or unusual course structures
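
A light pre-flight check can catch malformed requests before generation. The required-key set below is inferred from the widget example and is an assumption, not an official schema:

```python
REQUIRED_KEYS = {"title", "domain", "level", "duration"}  # inferred, not an official schema

def check_requirements(requirements: dict) -> None:
    """Raise if a course request is missing keys the model appears to expect."""
    missing = REQUIRED_KEYS - requirements.keys()
    if missing:
        raise ValueError(f"Missing required keys: {sorted(missing)}")
```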
|
|
|
|
|
## Citation

If you use this model, please cite:

```bibtex
@misc{codet5-syllabus-generator,
  author       = {EduCraft MSc AI Capstone Project},
  title        = {CodeT5 Function Call Generator for Educational Syllabus Creation},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/dewyn/educraft-t5-function-call}}
}
```
|
|
|
|
|
## License

Apache 2.0 (same as the base CodeT5 model)
|
|
|
|
|
## Training Framework

- PyTorch
- Transformers (Hugging Face)
- Trained on CPU (WSL2) with gradient checkpointing
- Training time: ~2.5 hours (20 epochs)
|
|
|
|
|
## Contact

For questions or issues, please open an issue on the [project repository](https://github.com/dewynl/msc-ai-capstone-project).
|
|
|