---
language: en
license: apache-2.0
tags:
- text-generation
- markdown
- education
- syllabus-generation
- codet5
- fine-tuned
- pedagogical-ai
datasets:
- custom
metrics:
- validity-rate
- pedagogical-quality
widget:
- text: 'Generate course syllabus: {"title": "Introduction to Python", "domain": "computer_science", "level": "beginner", "duration": "semester"}'
---
# CodeT5 Syllabus Generator for Educational Content Creation
## Model Description
This is a fine-tuned [Salesforce/codet5-small](https://huggingface.co/Salesforce/codet5-small) model trained to generate structured markdown syllabi with component selection indices. The model takes course requirements as input and generates markdown-formatted syllabi with index-based references to pre-defined educational components.
**Key Features:**
- Generates well-structured markdown syllabi
- Selects appropriate components using index notation [0], [1], [2] (see the extraction sketch after this list)
- Understands educational domain concepts (learning objectives, Bloom's taxonomy, difficulty progression)
- Produces prerequisite-aware module sequences
- Trained with pedagogical quality metrics
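The index notation lets the model reference pre-defined components instead of generating their full content. As a minimal illustration (not part of the released code), the bracketed indices can be pulled out of a generated line with a regular expression:

```python
import re

def extract_component_indices(line: str) -> list[int]:
    """Return the bracketed component indices, e.g. [0], found in one markdown line."""
    return [int(m) for m in re.findall(r"\[(\d+)\]", line)]

# Example: a module heading produced by the model
extract_component_indices("### Module 2: Supervised Learning Algorithms [1]")  # -> [1]
```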
## Training Data
- **Training Examples:** 1300 curated course-to-code pairs
- **Epochs:** 20
- **Data Quality:** High-quality examples covering:
- Multiple difficulty levels (beginner, intermediate, advanced)
- Various domains (computer science, data science, business, arts)
- Diverse course structures and pedagogical approaches
- Bloom's taxonomy alignment
- Assessment types and learning activities
## Training Configuration
```text
Model: Salesforce/codet5-small (60M parameters)
Tokenizer: RobertaTokenizer
Batch Size: 16
Gradient Accumulation: 2 (effective batch size: 32)
Learning Rate: 3e-4
Weight Decay: 0.01
Label Smoothing: 0.1
Max Input Length: 640 tokens
Max Output Length: 536 tokens
```
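The training script itself is not published here, but as a sketch, these hyperparameters map onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows (the output directory is a placeholder; the max input/output lengths are enforced at tokenization and generation time rather than here):

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only; the actual training script may differ.
training_args = Seq2SeqTrainingArguments(
    output_dir="./codet5-syllabus",   # placeholder path
    per_device_train_batch_size=16,   # Batch Size: 16
    gradient_accumulation_steps=2,    # effective batch size: 32
    learning_rate=3e-4,
    weight_decay=0.01,
    label_smoothing_factor=0.1,
    num_train_epochs=20,              # see "Training Data" above
    gradient_checkpointing=True,      # noted under "Training Framework" below
)
```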
## Usage
### Quick Start
```python
import json

import torch
from transformers import RobertaTokenizer, T5ForConditionalGeneration

# Load model and tokenizer
model_id = "dewyn/educraft-t5-function-call"
tokenizer = RobertaTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Prepare input
requirements = {
    "title": "Machine Learning Fundamentals",
    "domain": "computer_science",
    "level": "intermediate",
    "duration": "semester",
    "description": "Introduction to machine learning algorithms and applications",
    "learning_objectives": [
        "Understand supervised learning algorithms",
        "Implement neural networks",
        "Evaluate model performance",
    ],
}
input_text = f"Generate course syllabus: {json.dumps(requirements)}"

# Generate the syllabus markdown
inputs = tokenizer(
    input_text,
    return_tensors="pt",
    max_length=640,
    truncation=True,
    padding=True,
)
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_length=536,
        num_beams=4,
        early_stopping=False,
        no_repeat_ngram_size=2,
    )

generated_markdown = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_markdown)
```
### Expected Output
```markdown
# Machine Learning Fundamentals
**Domain:** Computer Science
**Level:** Intermediate
**Duration:** Semester
## Course Description
Introduction to machine learning algorithms and applications
## Learning Objectives
- Understand supervised learning algorithms
- Implement neural networks
- Evaluate model performance
## Modules
### Module 1: Introduction to Machine Learning [0]
**Duration:** 8 weeks
### Module 2: Supervised Learning Algorithms [1]
**Duration:** 12 weeks
**Prerequisites:** Module 1
### Module 3: Neural Networks [2]
**Duration:** 16 weeks
**Prerequisites:** Module 2
## Activities
- Hands-on ML Exercise [0] - Apply level
- Neural Network Workshop [1] - Create level
## Assessments
- Final Project [0] - Project type
```
### Integration with Parsing Pipeline
```python
from scripts.markdown_syllabus_parser import MarkdownSyllabusParser

# Parse the markdown produced in the quick start above
parser = MarkdownSyllabusParser(
    modules_file="data/components/modules.json",
    activities_file="data/components/activities.json",
    assessments_file="data/components/assessments.json",
)
syllabus = parser.parse_markdown(generated_markdown)
# Result is a complete syllabus dictionary with resolved components
```
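For readers without the project repository, the following standalone sketch illustrates the core idea behind the parser: each bracketed index in the generated markdown is resolved against the corresponding component list loaded from JSON. The file path and JSON structure are assumptions based on the constructor arguments above; the real `MarkdownSyllabusParser` is more thorough.

```python
import json
import re

def resolve_modules(markdown: str, modules_file: str = "data/components/modules.json") -> list:
    """Map each '[N]' on a module heading to the N-th entry in modules.json.

    Assumes modules.json holds a JSON list of component definitions (a sketch,
    not the project's actual parsing logic).
    """
    with open(modules_file) as f:
        modules = json.load(f)
    resolved = []
    for match in re.finditer(r"^### Module \d+: .*\[(\d+)\]", markdown, re.MULTILINE):
        resolved.append(modules[int(match.group(1))])
    return resolved
```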
## Model Details
**Base Model:** Salesforce/codet5-small
- Pre-trained on 8.35M code functions (Python, Java, Go, JavaScript, Ruby, PHP)
- 60M parameters
- Encoder-decoder transformer architecture
**Why CodeT5 vs T5:**
- CodeT5 is pre-trained on **code**, not natural language
- Understands programming syntax and patterns
- Better at generating valid Python function calls
- Less prone to hallucination or syntax errors
## Limitations
- Optimized for educational content generation specifically
- Requires structured input format (JSON with specific keys; see the validation sketch after this list)
- Generated code assumes SyllabusBuilder API availability
- May need post-processing for edge cases or unusual course structures
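Because the model expects particular JSON keys, it can help to validate a request before generation. The required-key set below is inferred from the usage example and widget prompt; the model card does not publish a formal schema:

```python
# Keys inferred from the examples in this card, not from a published schema.
REQUIRED_KEYS = {"title", "domain", "level", "duration"}

def validate_requirements(requirements: dict) -> None:
    """Raise if the request is missing keys the examples always supply."""
    missing = REQUIRED_KEYS - requirements.keys()
    if missing:
        raise ValueError(f"Missing required keys: {sorted(missing)}")
```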
## Citation
If you use this model, please cite:
```bibtex
@misc{codet5-syllabus-generator,
author = {EduCraft MSc AI Capstone Project},
title = {CodeT5 Function Call Generator for Educational Syllabus Creation},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/dewyn/educraft-t5-function-call}}
}
```
## License
Apache 2.0 (same as the base CodeT5 model)
## Training Framework
- PyTorch
- Transformers (Hugging Face)
- Trained on CPU (WSL2) with gradient checkpointing
- Training time: ~2.5 hours (20 epochs)
## Contact
For questions or issues, please open an issue on the [project repository](https://github.com/dewynl/msc-ai-capstone-project).