---
language: en
license: apache-2.0
tags:
- text-generation
- markdown
- education
- syllabus-generation
- codet5
- fine-tuned
- pedagogical-ai
datasets:
- custom
metrics:
- validity-rate
- pedagogical-quality
widget:
- text: 'Generate course syllabus: {"title": "Introduction to Python", "domain": "computer_science", "level": "beginner", "duration": "semester"}'
---

# CodeT5 Syllabus Generator for Educational Content Creation

## Model Description

This is a fine-tuned [Salesforce/codet5-small](https://huggingface.co/Salesforce/codet5-small) model trained to generate structured markdown syllabi with component selection indices. The model takes course requirements as input and generates markdown-formatted syllabi with index-based references to pre-defined educational components.

**Key Features:**
- Generates well-structured markdown syllabi
- Selects appropriate components using index notation [0], [1], [2]
- Understands educational domain concepts (learning objectives, Bloom's taxonomy, difficulty progression)
- Produces prerequisite-aware module sequences
- Trained with pedagogical quality metrics

## Training Data

- **Training Examples:** 1,300 curated requirements-to-syllabus pairs
- **Epochs:** 20
- **Data Quality:** High-quality examples covering:
  - Multiple difficulty levels (beginner, intermediate, advanced)
  - Various domains (computer science, data science, business, arts)
  - Diverse course structures and pedagogical approaches
  - Bloom's taxonomy alignment
  - Assessment types and learning activities

## Training Configuration

```python
Model: Salesforce/codet5-small (60M parameters)
Tokenizer: RobertaTokenizer
Batch Size: 16
Gradient Accumulation: 2 (effective batch size: 32)
Learning Rate: 3e-4
Weight Decay: 0.01
Label Smoothing: 0.1
Max Input Length: 640 tokens
Max Output Length: 536 tokens
```
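These hyperparameters map onto standard Hugging Face `Seq2SeqTrainingArguments` parameter names. A sketch collecting them as a plain config (values come from this card; anything not listed above is assumed left at library defaults):

```python
# Sketch: the training configuration above, using the parameter names from
# Hugging Face's Seq2SeqTrainingArguments. Values are taken from this card.
training_config = {
    "num_train_epochs": 20,
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 2,   # effective batch size: 16 * 2 = 32
    "learning_rate": 3e-4,
    "weight_decay": 0.01,
    "label_smoothing_factor": 0.1,
    "gradient_checkpointing": True,     # used here to fit training in CPU memory
}

effective_batch = (training_config["per_device_train_batch_size"]
                   * training_config["gradient_accumulation_steps"])
print(effective_batch)  # → 32
```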

## Usage

### Quick Start

```python
from transformers import RobertaTokenizer, T5ForConditionalGeneration
import torch

# Load model and tokenizer
model_id = "dewyn/educraft-t5-function-call"
tokenizer = RobertaTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Prepare input
requirements = {
    "title": "Machine Learning Fundamentals",
    "domain": "computer_science",
    "level": "intermediate",
    "duration": "semester",
    "description": "Introduction to machine learning algorithms and applications",
    "learning_objectives": [
        "Understand supervised learning algorithms",
        "Implement neural networks",
        "Evaluate model performance"
    ]
}

import json
input_text = f"Generate course syllabus: {json.dumps(requirements)}"

# Generate the markdown syllabus
input_ids = tokenizer(
    input_text,
    return_tensors="pt",
    max_length=640,
    truncation=True,
    padding=True
).input_ids

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=536,
        num_beams=4,
        early_stopping=False,
        no_repeat_ngram_size=2,
    )

generated_markdown = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_markdown)
```

### Expected Output

```markdown
# Machine Learning Fundamentals

**Domain:** Computer Science
**Level:** Intermediate
**Duration:** Semester

## Course Description
Introduction to machine learning algorithms and applications

## Learning Objectives
- Understand supervised learning algorithms
- Implement neural networks
- Evaluate model performance

## Modules

### Module 1: Introduction to Machine Learning [0]
**Duration:** 8 weeks

### Module 2: Supervised Learning Algorithms [1]
**Duration:** 12 weeks
**Prerequisites:** Module 1

### Module 3: Neural Networks [2]
**Duration:** 16 weeks
**Prerequisites:** Module 2

## Activities
- Hands-on ML Exercise [0] - Apply level
- Neural Network Workshop [1] - Create level

## Assessments
- Final Project [0] - Project type
```
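The bracketed numbers are positional indices into the component libraries, so resolving one is a simple list lookup. A minimal sketch (the component entries below are illustrative stand-ins, not the project's actual data):

```python
import re

# Illustrative stand-ins for entries loaded from data/components/modules.json;
# the real component data will differ.
modules = [
    {"title": "Introduction to Machine Learning"},
    {"title": "Supervised Learning Algorithms"},
    {"title": "Neural Networks"},
]

line = "### Module 2: Supervised Learning Algorithms [1]"
index = int(re.search(r"\[(\d+)\]", line).group(1))
print(modules[index]["title"])  # → Supervised Learning Algorithms
```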

### Integration with Parsing Pipeline

```python
from scripts.markdown_syllabus_parser import MarkdownSyllabusParser

# Parse generated markdown
parser = MarkdownSyllabusParser(
    modules_file="data/components/modules.json",
    activities_file="data/components/activities.json",
    assessments_file="data/components/assessments.json"
)

syllabus = parser.parse_markdown(generated_markdown)

# Result is a complete syllabus dictionary with resolved components
```
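If the project's parser module is not on your path, the index references can be pulled out of the generated markdown directly. A simplified sketch (it only gathers indices per section; it is not a substitute for `MarkdownSyllabusParser`'s full component resolution):

```python
import re

def extract_indexed_references(markdown: str) -> dict:
    """Collect the [n] component indices listed under each top-level section.

    Simplified stand-in for the project's MarkdownSyllabusParser: it only
    gathers indices and does not resolve them against the component files.
    """
    refs = {"modules": [], "activities": [], "assessments": []}
    section = None
    for line in markdown.splitlines():
        if line.startswith("## "):                  # top-level section header
            name = line[3:].strip().lower()
            section = name if name in refs else None
        elif section is not None:
            refs[section] += [int(i) for i in re.findall(r"\[(\d+)\]", line)]
    return refs

sample = """## Modules
### Module 1: Introduction to Machine Learning [0]
### Module 2: Supervised Learning Algorithms [1]
## Activities
- Hands-on ML Exercise [0] - Apply level
"""
print(extract_indexed_references(sample))
# → {'modules': [0, 1], 'activities': [0], 'assessments': []}
```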

## Model Details

**Base Model:** Salesforce/codet5-small
- Pre-trained on 8.35M code functions (Python, Java, Go, JavaScript, Ruby, PHP)
- 60M parameters
- Encoder-decoder transformer architecture

**Why CodeT5 vs. T5:**
- CodeT5 is pre-trained on **code**, not only natural language
- Understands programming syntax and structured patterns
- Better at producing well-formed structured output (here, markdown with index notation)
- Less prone to hallucinated identifiers or syntax errors

## Limitations

- Optimized for educational content generation specifically
- Requires structured input format (JSON with specific keys)
- Generated index references assume the matching component files (modules, activities, assessments) are available for resolution
- May need post-processing for edge cases or unusual course structures

## Citation

If you use this model, please cite:

```bibtex
@misc{codet5-syllabus-generator,
  author = {EduCraft MSc AI Capstone Project},
  title = {CodeT5 Function Call Generator for Educational Syllabus Creation},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/dewyn/educraft-t5-function-call}}
}
```

## License

Apache 2.0 (same as base CodeT5 model)

## Training Framework

- PyTorch
- Transformers (Hugging Face)
- Trained on CPU (WSL2) with gradient checkpointing
- Training time: ~2.5 hours (20 epochs)

## Contact

For questions or issues, please open an issue on the [project repository](https://github.com/dewynl/msc-ai-capstone-project).