---
language: en
license: apache-2.0
tags:
- text-generation
- markdown
- education
- syllabus-generation
- codet5
- fine-tuned
- pedagogical-ai
datasets:
- custom
metrics:
- validity-rate
- pedagogical-quality
widget:
- text: 'Generate course syllabus: {"title": "Introduction to Python", "domain": "computer_science", "level": "beginner", "duration": "semester"}'
---

# CodeT5 Syllabus Generator for Educational Content Creation

## Model Description

This is a fine-tuned [Salesforce/codet5-small](https://huggingface.co/Salesforce/codet5-small) model that generates structured markdown syllabi. Given course requirements as JSON input, it produces a markdown-formatted syllabus whose modules, activities, and assessments reference pre-defined educational components by index.

**Key Features:**

- Generates well-structured markdown syllabi
- Selects appropriate components using index notation: `[0]`, `[1]`, `[2]`
- Understands educational domain concepts (learning objectives, Bloom's taxonomy, difficulty progression)
- Produces prerequisite-aware module sequences
- Trained with pedagogical quality metrics

## Training Data

- **Training Examples:** 1,300 curated course-to-syllabus pairs
- **Epochs:** 20
- **Data Quality:** High-quality examples covering:
  - Multiple difficulty levels (beginner, intermediate, advanced)
  - Various domains (computer science, data science, business, arts)
  - Diverse course structures and pedagogical approaches
  - Bloom's taxonomy alignment
  - Assessment types and learning activities

## Training Configuration

```
Model: Salesforce/codet5-small (60M parameters)
Tokenizer: RobertaTokenizer
Batch Size: 16
Gradient Accumulation: 2 (effective batch size: 32)
Learning Rate: 3e-4
Weight Decay: 0.01
Label Smoothing: 0.1
Max Input Length: 640 tokens
Max Output Length: 536 tokens
```

## Usage

### Quick Start

```python
import json

import torch
from transformers import RobertaTokenizer, T5ForConditionalGeneration

# Load model and tokenizer
model_id = "dewyn/educraft-t5-function-call"
tokenizer = RobertaTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Prepare input
requirements = {
    "title": "Machine Learning Fundamentals",
    "domain": "computer_science",
    "level": "intermediate",
    "duration": "semester",
    "description": "Introduction to machine learning algorithms and applications",
    "learning_objectives": [
        "Understand supervised learning algorithms",
        "Implement neural networks",
        "Evaluate model performance",
    ],
}
input_text = f"Generate course syllabus: {json.dumps(requirements)}"

# Generate the markdown syllabus
input_ids = tokenizer(
    input_text,
    return_tensors="pt",
    max_length=640,
    truncation=True,
    padding=True,
).input_ids

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=536,
        num_beams=4,
        early_stopping=False,
        no_repeat_ngram_size=2,
    )

generated_markdown = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_markdown)
```

### Expected Output

```markdown
# Machine Learning Fundamentals

**Domain:** Computer Science
**Level:** Intermediate
**Duration:** Semester

## Course Description
Introduction to machine learning algorithms and applications

## Learning Objectives
- Understand supervised learning algorithms
- Implement neural networks
- Evaluate model performance

## Modules

### Module 1: Introduction to Machine Learning [0]
**Duration:** 8 weeks

### Module 2: Supervised Learning Algorithms [1]
**Duration:** 12 weeks
**Prerequisites:** Module 1

### Module 3: Neural Networks [2]
**Duration:** 16 weeks
**Prerequisites:** Module 2

## Activities
- Hands-on ML Exercise [0] - Apply level
- Neural Network Workshop [1] - Create level

## Assessments
- Final Project [0] - Project type
```

### Integration with Parsing Pipeline

```python
from scripts.markdown_syllabus_parser import MarkdownSyllabusParser

# Parse generated markdown
parser = MarkdownSyllabusParser(
    modules_file="data/components/modules.json",
    activities_file="data/components/activities.json",
    assessments_file="data/components/assessments.json",
)

syllabus = parser.parse_markdown(generated_markdown)
# Result is a complete syllabus dictionary with resolved components
```

## Model Details

**Base Model:** Salesforce/codet5-small

- Pre-trained on 8.35M code functions (Python, Java, Go, JavaScript, Ruby, PHP)
- 60M parameters
- Encoder-decoder transformer architecture

**Why CodeT5 vs T5:**

- CodeT5 is pre-trained on **code**, not only natural language
- Understands programming syntax and patterns
- Better at emitting well-formed structured output, such as the `[n]` index notation used here
- Less prone to hallucination or syntax errors

## Limitations

- Optimized specifically for educational content generation
- Requires a structured input format (JSON with specific keys)
- Generated output assumes SyllabusBuilder API availability
- May need post-processing for edge cases or unusual course structures

## Citation

If you use this model, please cite:

```bibtex
@misc{codet5-syllabus-generator,
  author = {EduCraft MSc AI Capstone Project},
  title = {CodeT5 Function Call Generator for Educational Syllabus Creation},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{dewyn/educraft-t5-function-call}}
}
```

## License

Apache 2.0 (same as the base CodeT5 model)

## Training Framework

- PyTorch
- Transformers (Hugging Face)
- Trained on CPU (WSL2) with gradient checkpointing
- Training time: ~2.5 hours (20 epochs)

## Contact

For questions or issues, please open an issue on the [project repository](https://github.com/dewynl/msc-ai-capstone-project).
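The parsing pipeline described above resolves the bracketed component indices against the component JSON files. As a rough, self-contained illustration of what that index-resolution step involves, here is a minimal sketch; `extract_component_indices` is a hypothetical helper written for this card, not part of the project's actual `MarkdownSyllabusParser`, and the heading/index conventions are assumed from the expected-output example:

```python
import re


def extract_component_indices(markdown: str) -> dict:
    """Collect bracketed component indices under each '## ' section.

    For the expected output above this yields, e.g.,
    {"Modules": [0, 1, 2], "Activities": [0, 1], "Assessments": [0]}.
    """
    indices: dict = {}
    section = None
    for line in markdown.splitlines():
        # A '## ' heading (but not '### ') starts a new top-level section.
        heading = re.match(r"^## (.+)$", line)
        if heading:
            section = heading.group(1).strip()
            continue
        if section:
            # Collect every [n] reference appearing under the current section.
            for match in re.finditer(r"\[(\d+)\]", line):
                indices.setdefault(section, []).append(int(match.group(1)))
    return indices
```

A post-processing step like this can also double as a validity check: any extracted index that falls outside the range of the corresponding component list signals a hallucinated reference that needs correction before the syllabus is assembled.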