---
language: en
license: apache-2.0
tags:
- text-generation
- markdown
- education
- syllabus-generation
- codet5
- fine-tuned
- pedagogical-ai
datasets:
- custom
metrics:
- validity-rate
- pedagogical-quality
widget:
- text: 'Generate course syllabus: {"title": "Introduction to Python", "domain": "computer_science", "level": "beginner", "duration": "semester"}'
---

# CodeT5 Syllabus Generator for Educational Content Creation

## Model Description

This is a fine-tuned [Salesforce/codet5-small](https://huggingface.co/Salesforce/codet5-small) model that generates structured markdown syllabi. Given course requirements as JSON input, it produces a markdown-formatted syllabus whose modules, activities, and assessments reference pre-defined educational components by index.

**Key Features:**

- Generates well-structured markdown syllabi
- Selects appropriate components using index notation: `[0]`, `[1]`, `[2]`
- Understands educational domain concepts (learning objectives, Bloom's taxonomy, difficulty progression)
- Produces prerequisite-aware module sequences
- Trained with pedagogical quality metrics

## Training Data

- **Training Examples:** 1,300 curated course-to-syllabus pairs
- **Epochs:** 20
- **Data Quality:** High-quality examples covering:
  - Multiple difficulty levels (beginner, intermediate, advanced)
  - Various domains (computer science, data science, business, arts)
  - Diverse course structures and pedagogical approaches
  - Bloom's taxonomy alignment
  - Assessment types and learning activities

## Training Configuration

```
Model: Salesforce/codet5-small (60M parameters)
Tokenizer: RobertaTokenizer
Batch Size: 16
Gradient Accumulation: 2 (effective batch size: 32)
Learning Rate: 3e-4
Weight Decay: 0.01
Label Smoothing: 0.1
Max Input Length: 640 tokens
Max Output Length: 536 tokens
```

## Usage

### Quick Start

```python
import json

import torch
from transformers import RobertaTokenizer, T5ForConditionalGeneration

# Load model and tokenizer
model_id = "dewyn/educraft-t5-function-call"
tokenizer = RobertaTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Prepare input
requirements = {
    "title": "Machine Learning Fundamentals",
    "domain": "computer_science",
    "level": "intermediate",
    "duration": "semester",
    "description": "Introduction to machine learning algorithms and applications",
    "learning_objectives": [
        "Understand supervised learning algorithms",
        "Implement neural networks",
        "Evaluate model performance",
    ],
}
input_text = f"Generate course syllabus: {json.dumps(requirements)}"

# Generate the markdown syllabus
input_ids = tokenizer(
    input_text,
    return_tensors="pt",
    max_length=640,
    truncation=True,
    padding=True,
).input_ids

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=536,
        num_beams=4,
        early_stopping=False,
        no_repeat_ngram_size=2,
    )

generated_markdown = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_markdown)
```

### Expected Output

```markdown
# Machine Learning Fundamentals

**Domain:** Computer Science
**Level:** Intermediate
**Duration:** Semester

## Course Description
Introduction to machine learning algorithms and applications

## Learning Objectives
- Understand supervised learning algorithms
- Implement neural networks
- Evaluate model performance

## Modules

### Module 1: Introduction to Machine Learning [0]
**Duration:** 8 weeks

### Module 2: Supervised Learning Algorithms [1]
**Duration:** 12 weeks
**Prerequisites:** Module 1

### Module 3: Neural Networks [2]
**Duration:** 16 weeks
**Prerequisites:** Module 2

## Activities
- Hands-on ML Exercise [0] - Apply level
- Neural Network Workshop [1] - Create level

## Assessments
- Final Project [0] - Project type
```

### Integration with Parsing Pipeline

```python
from scripts.markdown_syllabus_parser import MarkdownSyllabusParser

# Parse generated markdown
parser = MarkdownSyllabusParser(
    modules_file="data/components/modules.json",
    activities_file="data/components/activities.json",
    assessments_file="data/components/assessments.json",
)

syllabus = parser.parse_markdown(generated_markdown)
# Result is a complete syllabus dictionary with resolved components
```

## Model Details

**Base Model:** Salesforce/codet5-small

- Pre-trained on 8.35M code functions (Python, Java, Go, JavaScript, Ruby, PHP)
- 60M parameters
- Encoder-decoder transformer architecture

**Why CodeT5 vs T5:**

- CodeT5 is pre-trained on **code**, not only natural language
- Understands programming syntax and patterns
- Better at emitting well-formed structured output, such as the `[n]` index notation used here
- Less prone to hallucination or syntax errors

## Limitations

- Optimized specifically for educational content generation
- Requires a structured input format (JSON with specific keys)
- Generated output assumes SyllabusBuilder API availability
- May need post-processing for edge cases or unusual course structures

## Citation

If you use this model, please cite:

```bibtex
@misc{codet5-syllabus-generator,
  author = {EduCraft MSc AI Capstone Project},
  title = {CodeT5 Function Call Generator for Educational Syllabus Creation},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{dewyn/educraft-t5-function-call}}
}
```

## License

Apache 2.0 (same as the base CodeT5 model)

## Training Framework

- PyTorch
- Transformers (Hugging Face)
- Trained on CPU (WSL2) with gradient checkpointing
- Training time: ~2.5 hours (20 epochs)

## Contact

For questions or issues, please open an issue on the [project repository](https://github.com/dewynl/msc-ai-capstone-project).
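The parsing pipeline described above resolves the bracketed component indices against the component JSON files. As a rough, self-contained illustration of what that index-resolution step involves, here is a minimal sketch; `extract_component_indices` is a hypothetical helper written for this card, not part of the project's actual `MarkdownSyllabusParser`, and the heading/index conventions are assumed from the expected-output example:

```python
import re


def extract_component_indices(markdown: str) -> dict:
    """Collect bracketed component indices under each '## ' section.

    For the expected output above this yields, e.g.,
    {"Modules": [0, 1, 2], "Activities": [0, 1], "Assessments": [0]}.
    """
    indices: dict = {}
    section = None
    for line in markdown.splitlines():
        # A '## ' heading (but not '### ') starts a new top-level section.
        heading = re.match(r"^## (.+)$", line)
        if heading:
            section = heading.group(1).strip()
            continue
        if section:
            # Collect every [n] reference appearing under the current section.
            for match in re.finditer(r"\[(\d+)\]", line):
                indices.setdefault(section, []).append(int(match.group(1)))
    return indices
```

A post-processing step like this can also double as a validity check: any extracted index that falls outside the range of the corresponding component list signals a hallucinated reference that needs correction before the syllabus is assembled.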