---
language: en
license: apache-2.0
tags:
- text-generation
- markdown
- education
- syllabus-generation
- codet5
- fine-tuned
- pedagogical-ai
datasets:
- custom
metrics:
- validity-rate
- pedagogical-quality
widget:
- text: 'Generate course syllabus: {"title": "Introduction to Python", "domain": "computer_science", "level": "beginner", "duration": "semester"}'
---
|
|
|
|
|
# CodeT5 Syllabus Generator for Educational Content Creation

## Model Description

This is a fine-tuned [Salesforce/codet5-small](https://huggingface.co/Salesforce/codet5-small) model trained to generate structured markdown syllabi with component selection indices. The model takes course requirements as input and generates a markdown-formatted syllabus whose modules, activities, and assessments are index-based references to pre-defined educational components.

**Key Features:**

- Generates well-structured markdown syllabi
- Selects appropriate components using index notation [0], [1], [2] (see the sketch after this list)
- Understands educational domain concepts (learning objectives, Bloom's taxonomy, difficulty progression)
- Produces prerequisite-aware module sequences
- Trained with pedagogical quality metrics
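
To make the index notation concrete, here is a minimal sketch of resolving a bracketed index against a component list. The inline `modules` data is a simplified assumption for illustration; the project's actual component files (e.g., `data/components/modules.json`) may use a richer schema.

```python
import re

# Hypothetical, simplified component library; the real modules.json
# used by this project may contain more fields.
modules = [
    {"title": "Introduction to Machine Learning", "duration": "8 weeks"},
    {"title": "Supervised Learning Algorithms", "duration": "12 weeks"},
]

line = "### Module 2: Supervised Learning Algorithms [1]"

# The model emits a bracketed index; extract it and look up the component.
match = re.search(r"\[(\d+)\]", line)
if match:
    component = modules[int(match.group(1))]
    print(component["title"], "-", component["duration"])
```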
|
|
|
|
|
## Training Data

- **Training Examples:** 1,300 curated course-to-syllabus pairs
- **Epochs:** 20
- **Data Quality:** High-quality examples covering:
  - Multiple difficulty levels (beginner, intermediate, advanced)
  - Various domains (computer science, data science, business, arts)
  - Diverse course structures and pedagogical approaches
  - Bloom's taxonomy alignment
  - Assessment types and learning activities
|
|
|
|
|
## Training Configuration

```yaml
Model: Salesforce/codet5-small (60M parameters)
Tokenizer: RobertaTokenizer
Batch Size: 16
Gradient Accumulation: 2 (effective batch size: 32)
Learning Rate: 3e-4
Weight Decay: 0.01
Label Smoothing: 0.1
Max Input Length: 640 tokens
Max Output Length: 536 tokens
```
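
For reference, these hyperparameters map onto Hugging Face `Seq2SeqTrainingArguments` roughly as shown below. This is a minimal sketch assuming a standard `Seq2SeqTrainer` setup; the output directory is a hypothetical placeholder, and the exact training script is not reproduced here.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the configuration above expressed as training arguments.
training_args = Seq2SeqTrainingArguments(
    output_dir="codet5-syllabus",       # hypothetical path
    num_train_epochs=20,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,      # effective batch size: 32
    learning_rate=3e-4,
    weight_decay=0.01,
    label_smoothing_factor=0.1,
    gradient_checkpointing=True,        # used here to fit CPU training
    predict_with_generate=True,
)
```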
|
|
|
|
|
## Usage

### Quick Start

```python
import json

import torch
from transformers import RobertaTokenizer, T5ForConditionalGeneration

# Load model and tokenizer
model_id = "dewyn/educraft-t5-function-call"
tokenizer = RobertaTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Prepare input
requirements = {
    "title": "Machine Learning Fundamentals",
    "domain": "computer_science",
    "level": "intermediate",
    "duration": "semester",
    "description": "Introduction to machine learning algorithms and applications",
    "learning_objectives": [
        "Understand supervised learning algorithms",
        "Implement neural networks",
        "Evaluate model performance",
    ],
}

input_text = f"Generate course syllabus: {json.dumps(requirements)}"

# Tokenize the request
input_ids = tokenizer(
    input_text,
    return_tensors="pt",
    max_length=640,
    truncation=True,
).input_ids

# Generate the markdown syllabus
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=536,
        num_beams=4,
        no_repeat_ngram_size=2,
    )

generated_markdown = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_markdown)
```
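
Alternatively, the `text2text-generation` pipeline should wrap the same tokenize/generate/decode steps; a minimal sketch, using the prompt from the widget example:

```python
from transformers import pipeline

# Convenience wrapper around the manual steps shown above.
generator = pipeline("text2text-generation", model="dewyn/educraft-t5-function-call")

prompt = 'Generate course syllabus: {"title": "Introduction to Python", "domain": "computer_science", "level": "beginner", "duration": "semester"}'
result = generator(prompt, max_length=536, num_beams=4)
print(result[0]["generated_text"])
```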
|
|
|
|
|
### Expected Output

```markdown
# Machine Learning Fundamentals

**Domain:** Computer Science
**Level:** Intermediate
**Duration:** Semester

## Course Description
Introduction to machine learning algorithms and applications

## Learning Objectives
- Understand supervised learning algorithms
- Implement neural networks
- Evaluate model performance

## Modules

### Module 1: Introduction to Machine Learning [0]
**Duration:** 8 weeks

### Module 2: Supervised Learning Algorithms [1]
**Duration:** 12 weeks
**Prerequisites:** Module 1

### Module 3: Neural Networks [2]
**Duration:** 16 weeks
**Prerequisites:** Module 2

## Activities
- Hands-on ML Exercise [0] - Apply level
- Neural Network Workshop [1] - Create level

## Assessments
- Final Project [0] - Project type
```
|
|
|
|
|
### Integration with Parsing Pipeline

```python
from scripts.markdown_syllabus_parser import MarkdownSyllabusParser

# Parse generated markdown
parser = MarkdownSyllabusParser(
    modules_file="data/components/modules.json",
    activities_file="data/components/activities.json",
    assessments_file="data/components/assessments.json",
)

syllabus = parser.parse_markdown(generated_markdown)

# Result is a complete syllabus dictionary with resolved components
```
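
Because the markdown references components by index, it can help to check for out-of-range module indices before full parsing. A minimal sketch, assuming `modules.json` is a JSON array and module headings follow the `### Module N: Title [i]` pattern shown above:

```python
import json
import re

def out_of_range_module_indices(markdown: str, modules_file: str) -> list:
    """Return module indices referenced in headings but absent from the library."""
    with open(modules_file) as f:
        modules = json.load(f)  # assumed to be a JSON array of components
    pattern = r"^### Module \d+: .+ \[(\d+)\]$"
    referenced = {int(i) for i in re.findall(pattern, markdown, flags=re.MULTILINE)}
    return sorted(i for i in referenced if i >= len(modules))
```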
|
|
|
|
|
## Model Details

**Base Model:** Salesforce/codet5-small

- Pre-trained on 8.35M code functions (Python, Java, Go, JavaScript, Ruby, PHP)
- 60M parameters
- Encoder-decoder transformer architecture

**Why CodeT5 vs T5:**

- CodeT5 is pre-trained primarily on **code** (paired with its natural-language documentation) rather than general web text
- Understands programming syntax and structural patterns
- Better suited to emitting strictly formatted output, such as the bracketed component indices
- Less prone to hallucination or syntax errors
|
|
|
|
|
## Limitations

- Optimized specifically for educational content generation
- Requires a structured input format (JSON with specific keys; see the pre-flight check below)
- Generated syllabi assume the project's component libraries and parsing pipeline (the SyllabusBuilder API) are available
- May need post-processing for edge cases or unusual course structures
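
A light pre-flight check can catch malformed requests before generation. The required-key set below is inferred from the widget example and is an assumption, not an official schema:

```python
REQUIRED_KEYS = {"title", "domain", "level", "duration"}  # inferred, not an official schema

def check_requirements(requirements: dict) -> None:
    """Raise if a course request is missing keys the model appears to expect."""
    missing = REQUIRED_KEYS - requirements.keys()
    if missing:
        raise ValueError(f"Missing required keys: {sorted(missing)}")
```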
|
|
|
|
|
## Citation

If you use this model, please cite:

```bibtex
@misc{codet5-syllabus-generator,
  author       = {EduCraft MSc AI Capstone Project},
  title        = {CodeT5 Function Call Generator for Educational Syllabus Creation},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/dewyn/educraft-t5-function-call}}
}
```
|
|
|
|
|
## License

Apache 2.0 (same as the base CodeT5 model)
|
|
|
|
|
## Training Framework

- PyTorch
- Transformers (Hugging Face)
- Trained on CPU (WSL2) with gradient checkpointing
- Training time: ~2.5 hours (20 epochs)
|
|
|
|
|
## Contact

For questions or issues, please open an issue on the [project repository](https://github.com/dewynl/msc-ai-capstone-project).
|
|
|