AI Course Assessment Generator - Functionality Report
Table of Contents
- Overview
- System Architecture
- Data Models
- Application Entry Point
- User Interface Structure
- Complete Workflow
- Detailed Component Functionality
- Quality Standards and Prompts
Overview
The AI Course Assessment Generator is a sophisticated educational tool that automates the creation of learning objectives and multiple-choice questions from course materials. It leverages OpenAI's language models with structured output generation to produce high-quality educational assessments that adhere to specified quality standards and Bloom's Taxonomy levels.
Key Capabilities
- Multi-format Content Processing: Accepts .vtt, .srt (subtitle files), and .ipynb (Jupyter notebooks)
- AI-Powered Generation: Uses OpenAI's GPT models with configurable parameters
- Quality Assurance: Implements LLM-based quality assessment and ranking
- Source Tracking: Maintains XML-tagged references from source materials to generated content
- Iterative Improvement: Supports feedback-based regeneration and enhancement
- Parallel Processing: Generates questions concurrently for improved performance
System Architecture
Architectural Patterns
1. Orchestrator Pattern
Both LearningObjectiveGenerator and QuizGenerator act as orchestrators that coordinate calls to specialized generation functions rather than implementing generation logic directly.
2. Modular Prompt System
The prompts/ directory contains reusable prompt components that are imported and combined in generation modules, allowing for consistent quality standards across different generation tasks.
3. Structured Output Generation
All LLM interactions use Pydantic models with the instructor library to ensure consistent, validated output formats using OpenAI's structured output API.
4. Source Tracking via XML Tags
Content is wrapped in XML tags (e.g., <source file="example.ipynb">content</source>) throughout the pipeline to maintain traceability from source files to generated questions.
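As a minimal sketch of this tagging scheme (the helper names here are hypothetical, not the project's actual functions), wrapping content and recovering filenames might look like:

```python
import re

def wrap_in_source_tag(filename: str, content: str) -> str:
    """Wrap processed file content in a <source> tag for traceability."""
    return f"<source file='{filename}'>{content}</source>"

def extract_source_filenames(tagged_contents: list[str]) -> list[str]:
    """Recover the originating filenames from tagged content strings."""
    pattern = re.compile(r"<source file='([^']+)'>")
    return [m.group(1) for c in tagged_contents for m in pattern.finditer(c)]
```

Because the tag survives every pipeline stage, any generated objective or question can be traced back to the file it came from.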
Technology Stack
- Python 3.8+
- Gradio 5.29.0+: Web-based UI framework
- Pydantic 2.8.0+: Data validation and schema management
- OpenAI 1.52.0+: LLM API integration
- Instructor 1.7.9+: Structured output generation
- nbformat 5.9.2: Jupyter notebook parsing
- python-dotenv 1.0.0: Environment variable management
Data Models
Learning Objectives Progression
The system uses a hierarchical progression of learning objective models:
1. BaseLearningObjectiveWithoutCorrectAnswer
- id: int
- learning_objective: str
- source_reference: Union[List[str], str]
Initial generation without correct answers.
2. BaseLearningObjective
- id: int
- learning_objective: str
- source_reference: Union[List[str], str]
- correct_answer: str
Base objectives with correct answers added.
3. LearningObjective
- id: int
- learning_objective: str
- source_reference: Union[List[str], str]
- correct_answer: str
- incorrect_answer_options: Union[List[str], str]
- in_group: Optional[bool]
- group_members: Optional[List[int]]
- best_in_group: Optional[bool]
Enhanced with incorrect answer suggestions and grouping metadata.
4. GroupedLearningObjective
(All fields from LearningObjective)
- in_group: bool (required)
- group_members: List[int] (required)
- best_in_group: bool (required)
Fully grouped and ranked objectives.
Question Models Progression
1. MultipleChoiceOption
- option_text: str
- is_correct: bool
- feedback: str
2. MultipleChoiceQuestion
- id: int
- question_text: str
- options: List[MultipleChoiceOption]
- learning_objective_id: int
- learning_objective: str
- correct_answer: str
- source_reference: Union[List[str], str]
- judge_feedback: Optional[str]
- approved: Optional[bool]
3. RankedMultipleChoiceQuestion
(All fields from MultipleChoiceQuestion)
- rank: int
- ranking_reasoning: str
- in_group: bool
- group_members: List[int]
- best_in_group: bool
4. Assessment
- learning_objectives: List[LearningObjective]
- questions: List[RankedMultipleChoiceQuestion]
Final output containing both objectives and questions.
Configuration Models
MODELS
Available OpenAI models: ["o3-mini", "o1", "gpt-4.1", "gpt-4o", "gpt-4o-mini", "gpt-4", "gpt-3.5-turbo", "gpt-5", "gpt-5-mini", "gpt-5-nano"]
TEMPERATURE_UNAVAILABLE
Dictionary mapping models to temperature availability (some models like o1, o3-mini, and gpt-5 variants don't support temperature settings).
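The gating this table enables (mirroring the `.get(model, True)` check shown later in question_generation.py) can be sketched as follows; the table values here are illustrative, not the configuration module's exact contents:

```python
# Illustrative excerpt: True means the temperature parameter is NOT supported.
TEMPERATURE_UNAVAILABLE = {
    "o1": True,
    "o3-mini": True,
    "gpt-5": True,
    "gpt-4o": False,
    "gpt-4o-mini": False,
}

def build_request_params(model: str, temperature: float, prompt: str) -> dict:
    """Attach temperature only when the model is known to support it."""
    params = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    # Unknown models default to "unavailable", so temperature is omitted.
    if not TEMPERATURE_UNAVAILABLE.get(model, True):
        params["temperature"] = temperature
    return params
```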
Application Entry Point
app.py
The root-level entry point that:
- Loads environment variables from .env file
- Checks for OPENAI_API_KEY presence
- Creates the Gradio UI via ui.app.create_ui()
- Launches the web interface at http://127.0.0.1:7860
# Workflow:
load_dotenv() → Check API key → create_ui() → app.launch()
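The startup guard can be isolated as a small pure function (a sketch mirroring the described check, not the actual app.py code):

```python
def api_key_present(env: dict) -> bool:
    """Mirror app.py's startup guard: a non-empty OPENAI_API_KEY must be set."""
    return bool(env.get("OPENAI_API_KEY"))

# Startup flow, mirroring the workflow above:
#   load_dotenv()                        # populate os.environ from .env
#   if not api_key_present(os.environ):  # abort with an error message
#   app = create_ui()                    # build the Gradio interface
#   app.launch()                         # serve at http://127.0.0.1:7860
```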
User Interface Structure
ui/app.py - Gradio Interface
The UI is organized into 3 main tabs:
Tab 1: Generate Learning Objectives
Input Components:
- File uploader (accepts .ipynb, .vtt, .srt)
- Number of objectives per run (slider: 1-20, default: 3)
- Number of generation runs (dropdown: 1-5, default: 3)
- Model selection (dropdown, default: "gpt-5")
- Incorrect answer model selection (dropdown, default: "gpt-5")
- Temperature setting (dropdown: 0.0-1.0, default: 1.0)
- Generate button
- Feedback input textbox
- Regenerate button
Output Components:
- Status textbox
- Best-in-Group Learning Objectives (JSON)
- All Grouped Learning Objectives (JSON)
- Raw Ungrouped Learning Objectives (JSON) - for debugging
Event Handler: process_files() from objective_handlers.py
Tab 2: Generate Questions
Input Components:
- Learning Objectives JSON (auto-populated from Tab 1)
- Model selection
- Temperature setting
- Number of question generation runs (slider: 1-5, default: 1)
- Generate Questions button
Output Components:
- Status textbox
- Ranked Best-in-Group Questions (JSON)
- All Grouped Questions (JSON)
- Formatted Quiz (human-readable format)
Event Handler: generate_questions() from question_handlers.py
Tab 3: Propose/Edit Question
Input Components:
- Question guidance/feedback textbox
- Model selection
- Temperature setting
- Generate Question button
Output Components:
- Status textbox
- Generated Question (JSON)
Event Handler: propose_question_handler() from feedback_handlers.py
Complete Workflow
Phase 1: File Upload and Content Processing
Step 1.1: File Upload
User uploads one or more files (.vtt, .srt, .ipynb) through the Gradio interface.
Step 1.2: File Path Extraction (objective_handlers._extract_file_paths())
# Handles different input formats:
- List of file paths
- Single file path string
- File objects with .name attribute
Step 1.3: Content Processing (ui/content_processor.py)
For Subtitle Files (.vtt, .srt):
1. Read file with UTF-8 encoding
2. Split into lines
3. Filter out:
- Empty lines
- Numeric timestamp indicators
- Lines containing '-->' (timestamps)
- 'WEBVTT' header lines
4. Combine remaining text lines
5. Wrap in XML tags: <source file='filename.vtt'>content</source>
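The filtering steps above can be sketched as a single function (a minimal illustration; joining the kept lines with spaces is an assumption about the actual implementation):

```python
def extract_subtitle_text(raw: str) -> str:
    """Strip WEBVTT headers, cue numbers, and timestamp lines from subtitle text."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue                    # empty lines
        if line.isdigit():
            continue                    # numeric cue indices
        if "-->" in line:
            continue                    # timestamp lines
        if line.startswith("WEBVTT"):
            continue                    # file header
        kept.append(line)
    return " ".join(kept)
```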
For Jupyter Notebooks (.ipynb):
1. Validate JSON format
2. Parse with nbformat.read()
3. Extract from cells:
- Markdown cells: [Markdown]\n{content}
- Code cells: [Code]\n```python\n{content}\n```
4. Combine all cell content
5. Wrap in XML tags: <source file='filename.ipynb'>content</source>
Error Handling:
- Invalid JSON: Wraps raw content in code blocks
- Parsing failures: Falls back to plain text extraction
- All errors logged to console
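The notebook path, including the invalid-JSON fallback, can be sketched with the standard json module standing in for nbformat (a simplification; the real code uses nbformat.read with additional fallbacks):

```python
import json

def extract_notebook_text(raw: str, filename: str) -> str:
    """Simplified cell extraction mirroring the steps above."""
    fence = "`" * 3
    try:
        nb = json.loads(raw)
        parts = []
        for cell in nb.get("cells", []):
            source = "".join(cell.get("source", []))
            if cell.get("cell_type") == "markdown":
                parts.append(f"[Markdown]\n{source}")
            elif cell.get("cell_type") == "code":
                parts.append(f"[Code]\n{fence}python\n{source}\n{fence}")
        content = "\n".join(parts)
    except json.JSONDecodeError:
        # Invalid JSON: wrap the raw content in a code block instead
        content = f"{fence}\n{raw}\n{fence}"
    return f"<source file='{filename}'>{content}</source>"
```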
Step 1.4: State Storage
Processed content stored in global state (ui/state.py):
processed_file_contents = [tagged_content_1, tagged_content_2, ...]
Phase 2: Learning Objective Generation
Step 2.1: Multi-Run Base Generation
Process: objective_handlers._generate_multiple_runs()
For each run (user-specified, typically 3 runs):
Call: QuizGenerator.generate_base_learning_objectives()

Workflow:
generate_base_learning_objectives()
  → generate_base_learning_objectives_without_correct_answers()
    → Creates prompt with:
      - BASE_LEARNING_OBJECTIVES_PROMPT
      - BLOOMS_TAXONOMY_LEVELS
      - LEARNING_OBJECTIVE_EXAMPLES_WITHOUT_ANSWERS
      - Combined file contents
    → Calls OpenAI API with structured output
    → Returns List[BaseLearningObjectiveWithoutCorrectAnswer]
  → generate_correct_answers_for_objectives()
    → For each objective:
      - Creates prompt with objective and course content
      - Calls OpenAI API (unstructured text response)
      - Extracts correct answer
    → Returns List[BaseLearningObjective]

ID Assignment (temporary IDs by run):
- Run 1: 1001, 1002, 1003
- Run 2: 2001, 2002, 2003
- Run 3: 3001, 3002, 3003

Aggregation: All objectives from all runs are combined into a single list.
Example: 3 runs × 3 objectives = 9 total base objectives
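The run-based ID scheme implies an assignment like the following (a hypothetical helper, assuming run N contributes IDs N*1000+1 onward):

```python
def assign_run_ids(runs: list[list[dict]]) -> list[dict]:
    """Give each objective a temporary ID of (run_number * 1000) + position."""
    flat = []
    for run_number, objectives in enumerate(runs, start=1):
        for position, objective in enumerate(objectives, start=1):
            objective["id"] = run_number * 1000 + position
            flat.append(objective)
    return flat
```

Because every run's first objective ends in 001, the grouping step can later identify and merge the "primary" objectives across runs.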
Step 2.2: Grouping and Ranking
Process: objective_handlers._group_base_objectives_add_incorrect_answers()
Step 2.2.1: Group Base Objectives
QuizGenerator.group_base_learning_objectives()
  → learning_objective_generator/grouping_and_ranking.py
  → group_base_learning_objectives()
Grouping Logic:
Creates prompt containing:
- Original generation criteria
- All base objectives with IDs
- Course content for context
- Grouping instructions
Special Rule: All objectives with IDs ending in 1 (1001, 2001, 3001) are grouped together and ONE is marked as best-in-group (this becomes the primary/first objective)
LLM Call:
- Model: gpt-5-mini
- Response format: GroupedBaseLearningObjectivesResponse
- Returns: Grouped objectives with metadata
Output Structure:
{
  "all_grouped": [all objectives with group metadata],
  "best_in_group": [objectives marked as best in their groups]
}
Step 2.2.2: ID Reassignment (_reassign_objective_ids())
1. Find best objective from the 001 group
2. Assign it ID = 1
3. Assign remaining objectives IDs starting from 2
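Assuming temporary IDs of the form N001 for each run's primary objective, the reassignment can be sketched as (a hypothetical helper mirroring the three steps above):

```python
def reassign_ids(grouped: list[dict]) -> list[dict]:
    """Promote the best objective from the 001 group to ID 1; renumber the rest from 2."""
    def is_primary(obj: dict) -> bool:
        return obj["id"] % 1000 == 1 and obj.get("best_in_group", False)

    primary = next(obj for obj in grouped if is_primary(obj))
    rest = [obj for obj in grouped if obj is not primary]
    primary["id"] = 1
    for new_id, obj in enumerate(rest, start=2):
        obj["id"] = new_id
    return [primary] + rest
```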
Step 2.2.3: Generate Incorrect Answer Options
Only for best-in-group objectives:
QuizGenerator.generate_lo_incorrect_answer_options()
  → learning_objective_generator/enhancement.py
  → generate_incorrect_answer_options()
Process:
For each best-in-group objective:
- Creates prompt with:
  - Objective and correct answer
  - INCORRECT_ANSWER_PROMPT guidelines
  - INCORRECT_ANSWER_EXAMPLES
  - Course content
- Calls OpenAI API (with optional model override)
- Generates 5 plausible incorrect answer options
Returns: List[LearningObjective] with incorrect_answer_options populated
Step 2.2.4: Improve Incorrect Answers
learning_objective_generator.regenerate_incorrect_answers()
  → learning_objective_generator/suggestion_improvement.py
Quality Check Process:
For each objective's incorrect answers:
- Checks for red flags (contradictory phrases, absolute terms)
- Examples of red flags:
- "but not necessarily"
- "at the expense of"
- "rather than"
- "always", "never", "exclusively"
If problems found:
- Logs issue to incorrect_suggestion_debug/ directory
- Regenerates incorrect answers with additional constraints
- Updates objective with improved answers
Step 2.2.5: Final Assembly
Creates final list where:
- Best-in-group objectives have enhanced incorrect answers
- Non-best-in-group objectives have an empty incorrect_answer_options: []
Step 2.3: Display Results
Three output formats:
Best-in-Group Objectives (primary output):
- Only objectives marked as best_in_group
- Includes incorrect answer options
- Sorted by ID
- Formatted as JSON
All Grouped Objectives:
- All objectives with grouping metadata
- Shows group_members arrays
- Best-in-group flags visible
Raw Ungrouped (debug):
- Original objectives from all runs
- No grouping metadata
- Original temporary IDs
Step 2.4: State Update
set_learning_objectives(grouped_result["all_grouped"])
set_processed_contents(file_contents) # Already set, but persisted
Phase 3: Question Generation
Step 3.1: Parse Learning Objectives
Process: question_handlers._parse_learning_objectives()
1. Parse JSON from Tab 1 output
2. Create LearningObjective objects from dictionaries
3. Validate required fields
4. Return List[LearningObjective]
Step 3.2: Multi-Run Question Generation
Process: question_handlers._generate_questions_multiple_runs()
For each run (user-specified, typically 1 run):
QuizGenerator.generate_questions_in_parallel()
  → quiz_generator/assessment.py
  → generate_questions_in_parallel()
Parallel Generation Process:
Thread Pool Setup:
max_workers = min(len(learning_objectives), 5)
ThreadPoolExecutor(max_workers=max_workers)

For Each Learning Objective (in parallel):

Step 3.2.1: Question Generation (quiz_generator/question_generation.py → generate_multiple_choice_question())

a) Source Content Matching:
- Extract source_reference from objective
- Search file_contents for matching XML tags
- Exact match: <source file='filename.vtt'>
- Fallback: Partial filename match
- Last resort: Use all file contents combined

b) Multi-Source Handling:
if len(source_references) > 1:
    Add special instruction: "Question should synthesize information across sources"

c) Prompt Construction:
Combines:
- Learning objective
- Correct answer
- Incorrect answer options from objective
- GENERAL_QUALITY_STANDARDS
- MULTIPLE_CHOICE_STANDARDS
- EXAMPLE_QUESTIONS
- QUESTION_SPECIFIC_QUALITY_STANDARDS
- CORRECT_ANSWER_SPECIFIC_QUALITY_STANDARDS
- INCORRECT_ANSWER_EXAMPLES_WITH_EXPLANATION
- ANSWER_FEEDBACK_QUALITY_STANDARDS
- Matched course content

d) API Call:
- Model: User-selected (default: gpt-5)
- Temperature: User-selected (if supported by model)
- Response format: MultipleChoiceQuestion
- Returns: Question with 4 options, each with feedback

e) Post-Processing:
- Set question ID = learning_objective ID
- Verify all options have feedback
- Add default feedback if missing

Step 3.2.2: Quality Assessment (quiz_generator/question_improvement.py → judge_question_quality())

Quality Judging Process:
1. Creates evaluation prompt with:
   - Question text and all options
   - Quality criteria from prompts
   - Evaluation instructions
2. LLM evaluates question for:
   - Clarity and unambiguity
   - Alignment with learning objective
   - Quality of incorrect options
   - Feedback quality
   - Appropriate difficulty
3. Returns:
   - approved: bool
   - feedback: str (reasoning for judgment)
4. Updates question:
   question.approved = approved
   question.judge_feedback = feedback

Results Collection:
- Questions collected as futures complete
- IDs assigned sequentially across runs
- All questions aggregated into single list

Example: 3 objectives × 1 run = 3 questions generated in parallel
Step 3.3: Grouping Questions
Process: quiz_generator/question_ranking.py → group_questions()
1. Creates prompt with:
- All generated questions
- Grouping instructions
- Example format
2. LLM identifies:
- Questions testing same concept (same learning_objective_id)
- Groups of similar questions
- Best question in each group
3. Model: gpt-5-mini
Response format: GroupedMultipleChoiceQuestionsResponse
4. Returns:
{
"grouped": [all questions with group metadata],
"best_in_group": [best questions from each group]
}
Step 3.4: Ranking Questions
Process: quiz_generator/question_ranking.py → rank_questions()
Only ranks best-in-group questions:
1. Creates prompt with:
- RANK_QUESTIONS_PROMPT
- All quality standards
- Best-in-group questions only
- Course content for context
2. Ranking Criteria:
- Question clarity and unambiguity
- Alignment with learning objective
- Quality of incorrect options
- Feedback quality
- Appropriate difficulty (prefers simple English)
- Adherence to all guidelines
- Avoidance of absolute terms
3. Special Instructions:
- NEVER change question with ID=1
- Each question gets unique rank (2, 3, 4, ...)
- Rank 1 is reserved
- All questions must be returned
4. Model: User-selected
Response format: RankedMultipleChoiceQuestionsResponse
5. Returns:
{
"ranked": [questions with rank and ranking_reasoning]
}
Step 3.5: Format Results
Process: question_handlers._format_question_results()
Three outputs:
Best-in-Group Ranked Questions:
- Sorted by rank
- Includes all question data
- Includes rank and ranking_reasoning
- Includes group metadata
- Formatted as JSON

All Grouped Questions:
- All questions with group metadata
- No ranking information
- Shows which questions are in groups
- Formatted as JSON

Formatted Quiz:
format_quiz_for_ui() creates a human-readable format:

**Question 1 [Rank: 2]:** What is...
Ranking Reasoning: ...
• A [Correct]: Option text
  Feedback: Correct feedback
• B: Option text
  Feedback: Incorrect feedback
[continues for all questions]
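A simplified re-implementation of this layout (not the project's format_quiz_for_ui, and assuming dict-shaped question records) could look like:

```python
def format_quiz_for_display(questions: list[dict]) -> str:
    """Render ranked questions in the human-readable layout shown above."""
    lines = []
    for i, q in enumerate(questions, start=1):
        lines.append(f"**Question {i} [Rank: {q['rank']}]:** {q['question_text']}")
        lines.append(f"Ranking Reasoning: {q['ranking_reasoning']}")
        for letter, opt in zip("ABCD", q["options"]):
            marker = " [Correct]" if opt["is_correct"] else ""
            lines.append(f"• {letter}{marker}: {opt['option_text']}")
            lines.append(f"  Feedback: {opt['feedback']}")
        lines.append("")
    return "\n".join(lines)
```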
Phase 4: Custom Question Generation (Optional)
Tab 3 Workflow:
Step 4.1: User Input
User provides:
- Free-form guidance/feedback text
- Model selection
- Temperature setting
Step 4.2: Generation
Process: feedback_handlers.propose_question_handler()
QuizGenerator.generate_multiple_choice_question_from_feedback()
  → quiz_generator/feedback_questions.py
Workflow:
1. Retrieves processed file contents from state
2. Creates prompt combining:
- User feedback/guidance
- All quality standards
- Course content
- Generation criteria
3. Model generates:
- Single question
- With learning objective inferred from guidance
- 4 options with feedback
- Source references
4. Returns: MultipleChoiceQuestionFromFeedback object
(includes user feedback as metadata)
5. Formatted as JSON for display
Phase 5: Assessment Export (Automated)
The final assessment can be saved using:
QuizGenerator.save_assessment_to_json()
  → quiz_generator/assessment.py → save_assessment_to_json()
Process:
1. Convert Assessment object to dictionary
assessment_dict = assessment.model_dump()
2. Write to JSON file with indent=2
Default filename: "assessment.json"
3. Contains:
- All learning objectives (best-in-group)
- All ranked questions
- Complete metadata
Detailed Component Functionality
Content Processor (ui/content_processor.py)
Class: ContentProcessor
Methods:
process_files(file_paths: List[str]) -> List[str]
- Main entry point for processing multiple files
- Returns list of XML-tagged content strings
- Stores results in self.file_contents

process_file(file_path: str) -> List[str]
- Routes to appropriate handler based on file extension
- Returns single-item list with tagged content

_process_subtitle_file(file_path: str) -> List[str]
- Filters out timestamps and metadata
- Preserves actual subtitle text
- Wraps in <source file='...'> tags

_process_notebook_file(file_path: str) -> List[str]
- Validates JSON structure
- Parses with nbformat
- Extracts markdown and code cells
- Falls back to raw text on parsing errors
- Wraps in <source file='...'> tags
Learning Objective Generator (learning_objective_generator/)
generator.py - LearningObjectiveGenerator Class
Orchestrator that delegates to specialized modules:
Methods:
generate_base_learning_objectives()
- Delegates to base_generation.py
- Returns base objectives with correct answers

group_base_learning_objectives()
- Delegates to grouping_and_ranking.py
- Groups similar objectives
- Identifies best in each group

generate_incorrect_answer_options()
- Delegates to enhancement.py
- Adds 5 incorrect answer suggestions per objective

regenerate_incorrect_answers()
- Delegates to suggestion_improvement.py
- Quality-checks and improves incorrect answers

generate_and_group_learning_objectives()
- Complete workflow method
- Combines: base generation → grouping → incorrect answers
- Returns dict with all_grouped and best_in_group
base_generation.py
Key Functions:
generate_base_learning_objectives()
- Wrapper that calls two separate functions
- First: Generate objectives without correct answers
- Second: Generate correct answers for those objectives
generate_base_learning_objectives_without_correct_answers()
Process:
1. Extract source filenames from XML tags
2. Combine all file contents
3. Create prompt with:
- BASE_LEARNING_OBJECTIVES_PROMPT
- BLOOMS_TAXONOMY_LEVELS
- LEARNING_OBJECTIVE_EXAMPLES_WITHOUT_ANSWERS
- Course content
4. API call:
- Model: User-selected
- Temperature: User-selected (if supported)
- Response format: BaseLearningObjectivesWithoutCorrectAnswerResponse
5. Post-process:
- Assign sequential IDs
- Normalize source_reference (extract basenames)
6. Returns: List[BaseLearningObjectiveWithoutCorrectAnswer]
generate_correct_answers_for_objectives()
Process:
1. For each objective without answer:
- Create prompt with objective + course content
- Call OpenAI API (text response, not structured)
- Extract correct answer
- Create BaseLearningObjective with answer
2. Error handling: Add "[Error generating correct answer]" on failure
3. Returns: List[BaseLearningObjective]
Quality Guidelines in Prompt:
- Objectives must be assessable via multiple-choice
- Start with action verbs (identify, describe, define, list, compare)
- One goal per objective
- Derived directly from course content
- Tool/framework agnostic (focus on principles, not specific implementations)
- First objective should be relatively easy recall question
- Avoid objectives about "building" or "creating" (not MC-assessable)
grouping_and_ranking.py
Key Functions:
group_base_learning_objectives()
Process:
1. Format objectives for display in prompt
2. Create grouping prompt with:
- Original generation criteria
- All base objectives
- Course content
- Grouping instructions
3. Special rule:
- All objectives with IDs ending in 1 grouped together
- Best one selected from this group
- Will become primary objective (ID=1)
4. API call:
- Model: "gpt-5-mini" (hardcoded for efficiency)
- Response format: GroupedBaseLearningObjectivesResponse
5. Post-process:
- Normalize best_in_group to Python bool
- Filter for best-in-group objectives
6. Returns:
{
"all_grouped": List[GroupedBaseLearningObjective],
"best_in_group": List[GroupedBaseLearningObjective]
}
Grouping Criteria:
- Topic overlap
- Similarity of concepts
- Quality based on original generation criteria
- Clarity and specificity
- Alignment with course content
enhancement.py
Key Function: generate_incorrect_answer_options()
Process:
1. For each base objective:
- Create prompt with:
- Learning objective and correct answer
- INCORRECT_ANSWER_PROMPT (detailed guidelines)
- INCORRECT_ANSWER_EXAMPLES
- Course content
- Request 5 plausible incorrect options
2. API call:
- Model: model_override or default
- Temperature: User-selected (if supported)
- Response format: LearningObjective (includes incorrect_answer_options)
3. Returns: List[LearningObjective] with all fields populated
Incorrect Answer Quality Principles:
- Create common misunderstandings
- Maintain identical structure to correct answer
- Use course terminology correctly but in wrong contexts
- Include partially correct information
- Avoid obviously wrong answers
- Mirror detail level and style of correct answer
- Avoid absolute terms ("always", "never", "exclusively")
- Avoid contradictory second clauses
suggestion_improvement.py
Key Function: regenerate_incorrect_answers()
Process:
1. For each learning objective:
- Call should_regenerate_incorrect_answers()
2. should_regenerate_incorrect_answers():
- Creates evaluation prompt with:
- Objective and all incorrect options
- IMMEDIATE_RED_FLAGS checklist
- RULES_FOR_SECOND_CLAUSES
- LLM evaluates each option
- Returns: needs_regeneration: bool
3. If regeneration needed:
- Logs to incorrect_suggestion_debug/{id}.txt
- Creates new prompt with additional constraints
- Regenerates incorrect answers
- Validates again
4. Returns: List[LearningObjective] with improved incorrect answers
Red Flags Checked:
- Contradictory second clauses ("but not necessarily")
- Explicit negations ("without automating")
- Opposite descriptions ("fixed steps" for flexible systems)
- Absolute/comparative terms
- Hedging that creates limitations
- Trade-off language creating false dichotomies
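While the pipeline itself asks an LLM to judge each option, the red-flag list above suggests a cheap lexical pre-screen (an illustrative shortcut, not the project's actual check):

```python
import re

RED_FLAG_PHRASES = [
    "but not necessarily", "at the expense of", "rather than",
    "always", "never", "exclusively",
]

def find_red_flags(option_text: str) -> list[str]:
    """Return any red-flag phrases appearing as whole words in an option."""
    lowered = option_text.lower()
    return [
        phrase for phrase in RED_FLAG_PHRASES
        if re.search(rf"\b{re.escape(phrase)}\b", lowered)
    ]
```

A non-empty result would mark the option as a regeneration candidate before spending an LLM call on it.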
Quiz Generator (quiz_generator/)
generator.py - QuizGenerator Class
Orchestrator with LearningObjectiveGenerator embedded:
Initialization:
def __init__(self, api_key, model="gpt-5", temperature=1.0):
self.client = OpenAI(api_key=api_key)
self.model = model
self.temperature = temperature
self.learning_objective_generator = LearningObjectiveGenerator(
api_key=api_key, model=model, temperature=temperature
)
Methods (delegates to specialized modules):
- generate_base_learning_objectives() → delegates to LearningObjectiveGenerator
- generate_lo_incorrect_answer_options() → delegates to LearningObjectiveGenerator
- group_base_learning_objectives() → delegates to grouping_and_ranking.py
- generate_multiple_choice_question() → delegates to question_generation.py
- generate_questions_in_parallel() → delegates to assessment.py
- group_questions() → delegates to question_ranking.py
- rank_questions() → delegates to question_ranking.py
- judge_question_quality() → delegates to question_improvement.py
- regenerate_incorrect_answers() → delegates to question_improvement.py
- generate_multiple_choice_question_from_feedback() → delegates to feedback_questions.py
- save_assessment_to_json() → delegates to assessment.py
question_generation.py
Key Function: generate_multiple_choice_question()
Detailed Process:
1. Source Content Matching:
source_references = learning_objective.source_reference
if isinstance(source_references, str):
    source_references = [source_references]

combined_content = ""
for source_file in source_references:
    found = False
    # Try exact match: <source file='filename'>
    for file_content in file_contents:
        if f"<source file='{source_file}'>" in file_content:
            combined_content += file_content
            found = True
            break
    # Fallback: partial match
    if not found:
        for file_content in file_contents:
            if source_file in file_content:
                combined_content += file_content
                break

# Last resort: use all content
if not combined_content:
    combined_content = "\n\n".join(file_contents)
2. Multi-Source Instruction:
if len(source_references) > 1:
Add special instruction:
"This learning objective spans multiple sources.
Your question should:
1. Synthesize information across these sources
2. Test understanding of overarching themes
3. Require knowledge from multiple sources"
3. Prompt Construction: Combines extensive quality standards:
- Learning objective
- Correct answer
- Incorrect answer options from objective
- GENERAL_QUALITY_STANDARDS
- MULTIPLE_CHOICE_STANDARDS
- EXAMPLE_QUESTIONS
- QUESTION_SPECIFIC_QUALITY_STANDARDS
- CORRECT_ANSWER_SPECIFIC_QUALITY_STANDARDS
- INCORRECT_ANSWER_EXAMPLES_WITH_EXPLANATION
- ANSWER_FEEDBACK_QUALITY_STANDARDS
- Multi-source instruction (if applicable)
- Matched course content
4. API Call:
params = {
"model": model,
"messages": [
{"role": "system", "content": "Expert educational assessment creator"},
{"role": "user", "content": prompt}
],
"response_format": MultipleChoiceQuestion
}
if not TEMPERATURE_UNAVAILABLE.get(model, True):
params["temperature"] = temperature
response = client.beta.chat.completions.parse(**params)
5. Post-Processing:
- Set response.id = learning_objective.id
- Set response.learning_objective_id = learning_objective.id
- Set response.learning_objective = learning_objective.learning_objective
- Set response.source_reference = learning_objective.source_reference
- Verify all options have feedback
- Add default feedback if missing
6. Error Handling:
On exception:
- Create fallback question with 4 generic options
- Include error message in question_text
- Mark as questionable quality
question_ranking.py
Key Functions:
group_questions(questions, file_contents)
Process:
1. Create prompt with:
- GROUP_QUESTIONS_PROMPT
- All questions with complete data
- Grouping instructions
2. Grouping Logic:
- Questions with same learning_objective_id are similar
- Group by topic overlap
- Mark best_in_group within each group
- Single-member groups: best_in_group = true by default
3. API call:
- Model: User-selected
- Response format: GroupedMultipleChoiceQuestionsResponse
4. Critical Instructions:
- MUST return ALL questions
- Each question must have group metadata
- best_in_group set appropriately
5. Returns:
{
"grouped": List[GroupedMultipleChoiceQuestion],
"best_in_group": [questions where best_in_group=true]
}
rank_questions(questions, file_contents)
Process:
1. Create prompt with:
- RANK_QUESTIONS_PROMPT
- ALL quality standards (comprehensive)
- Best-in-group questions only
- Course content
2. Ranking Criteria (from prompt):
- Question clarity and unambiguity
- Alignment with learning objective
- Quality of incorrect options
- Feedback quality
- Appropriate difficulty (simple English preferred)
- Adherence to all guidelines
- Avoidance of problematic words/phrases
3. Special Instructions:
- DO NOT change question with ID=1
- Rank starting from 2 (rank 1 reserved)
- Each question gets unique rank
- Must return ALL questions
4. API call:
- Model: User-selected
- Response format: RankedMultipleChoiceQuestionsResponse
5. Returns:
{
"ranked": List[RankedMultipleChoiceQuestion]
(includes rank and ranking_reasoning for each)
}
Simple vs Complex English Examples (from ranking criteria):
Simple: "AI engineers create computer programs that can learn from data"
Complex: "AI engineering practitioners architect computational paradigms
exhibiting autonomous erudition capabilities"
question_improvement.py
Key Functions:
judge_question_quality(client, model, temperature, question)
Process:
1. Create evaluation prompt with:
- Question text
- All options with feedback
- Quality criteria
- Evaluation instructions
2. LLM evaluates:
- Clarity and lack of ambiguity
- Alignment with learning objective
- Quality of distractors (incorrect options)
- Feedback quality and helpfulness
- Appropriate difficulty level
- Adherence to all standards
3. API call:
- Unstructured text response
- LLM returns: APPROVED or NOT APPROVED + reasoning
4. Parsing:
approved = "APPROVED" in response.upper()
feedback = full response text
5. Returns: (approved: bool, feedback: str)
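Note that a bare substring check also matches a response containing "NOT APPROVED"; a stricter parse (an illustrative alternative, not the project's code) could be:

```python
def parse_verdict(response_text: str) -> tuple[bool, str]:
    """Approve only when 'APPROVED' appears without a preceding 'NOT'."""
    upper = response_text.upper()
    approved = "APPROVED" in upper and "NOT APPROVED" not in upper
    return approved, response_text
```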
should_regenerate_incorrect_answers(client, question, file_contents, model_name)
Process:
1. Extract incorrect options from question
2. Create evaluation prompt with:
- Each incorrect option
- IMMEDIATE_RED_FLAGS checklist
- Course content for context
3. LLM checks each option for:
- Contradictory second clauses
- Explicit negations
- Absolute terms
- Opposite descriptions
- Trade-off language
4. Returns: needs_regeneration: bool
5. If true:
- Log to wrong_answer_debug/ directory
- Provides detailed feedback on issues
regenerate_incorrect_answers(client, model, temperature, questions, file_contents)
Process:
1. For each question:
- Check if regeneration needed
- If yes:
a. Create new prompt with stricter constraints
b. Include original question for context
c. Add specific rules about avoiding red flags
d. Regenerate options
e. Validate again
- If no: keep original
2. Returns: List of questions with improved incorrect answers
feedback_questions.py
Key Function: generate_multiple_choice_question_from_feedback()
Process:
1. Accept user feedback/guidance as free-form text
2. Create prompt combining:
- User feedback
- All quality standards
- Course content
- Standard generation criteria
3. LLM infers:
- Learning objective from feedback
- Appropriate question
- 4 options with feedback
- Source references
4. API call:
- Model: User-selected
- Response format: MultipleChoiceQuestionFromFeedback
5. Includes user feedback as metadata in response
6. Returns: Single question object
assessment.py
Key Functions:
generate_questions_in_parallel()
Parallel Processing Details:
1. Setup:
max_workers = min(len(learning_objectives), 5)
# Limits to 5 concurrent threads
2. Thread Pool Executor:
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
3. For each objective (in separate thread):
Worker function:
def generate_question_for_objective(objective, idx):
- Generate question
- Judge quality
- Update with approval and feedback
- Handle errors gracefully
- Return complete question
4. Submit all tasks:
future_to_idx = {
executor.submit(generate_question_for_objective, obj, i): i
for i, obj in enumerate(learning_objectives)
}
5. Collect results as completed:
for future in concurrent.futures.as_completed(future_to_idx):
question = future.result()
questions.append(question)
print progress
6. Error handling:
- Individual failures don't stop other threads
- Placeholder questions created on error
- All errors logged
7. Returns: List[MultipleChoiceQuestion] with quality judgments
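The steps above can be condensed into a runnable sketch, where the `generate_one` callable stands in for question generation plus quality judging (the placeholder shape on error is an assumption):

```python
import concurrent.futures

def generate_all(learning_objectives: list[dict], generate_one) -> list[dict]:
    """Fan question generation out over a bounded thread pool; a single
    failure yields a placeholder instead of aborting the whole batch."""
    max_workers = min(len(learning_objectives), 5)
    questions = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_idx = {
            executor.submit(generate_one, obj): i
            for i, obj in enumerate(learning_objectives)
        }
        for future in concurrent.futures.as_completed(future_to_idx):
            idx = future_to_idx[future]
            try:
                questions.append(future.result())
            except Exception as exc:
                # Placeholder question on error; other threads keep running
                questions.append({"id": idx, "question_text": f"[Error: {exc}]"})
    return questions
```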
save_assessment_to_json(assessment, output_path)
1. Convert Pydantic model to dict:
assessment_dict = assessment.model_dump()
2. Write to JSON file:
with open(output_path, "w") as f:
json.dump(assessment_dict, f, indent=2)
3. File contains:
{
"learning_objectives": [...],
"questions": [...]
}
State Management (ui/state.py)
Global State Variables:
processed_file_contents = [] # List of XML-tagged content strings
generated_learning_objectives = [] # List of learning objective objects
Functions:
- get_processed_contents() → retrieves file contents
- set_processed_contents(contents) → stores file contents
- get_learning_objectives() → retrieves objectives
- set_learning_objectives(objectives) → stores objectives
- clear_state() → resets both variables
Purpose:
- Persists data between UI tabs
- Allows Tab 2 to access content processed in Tab 1
- Allows Tab 3 to access content for custom questions
- Enables regeneration with feedback
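The state module described above reduces to a small amount of code. The variable and function names come from the document; the bodies are assumptions consistent with the stated behavior.

```python
# ui/state.py sketch: module-level state shared across Gradio tabs.
processed_file_contents = []        # XML-tagged content strings
generated_learning_objectives = []  # learning objective objects

def get_processed_contents():
    return processed_file_contents

def set_processed_contents(contents):
    global processed_file_contents
    processed_file_contents = contents

def get_learning_objectives():
    return generated_learning_objectives

def set_learning_objectives(objectives):
    global generated_learning_objectives
    generated_learning_objectives = objectives

def clear_state():
    """Reset both variables, e.g. when a new set of files is uploaded."""
    global processed_file_contents, generated_learning_objectives
    processed_file_contents = []
    generated_learning_objectives = []
```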
UI Handlers
objective_handlers.py
process_files(files, num_objectives, num_runs, model_name, incorrect_answer_model_name, temperature)
Complete Workflow:
1. Validate inputs (files exist, API key present)
2. Extract file paths from Gradio file objects
3. Process files → get XML-tagged content
4. Store in state
5. Create QuizGenerator
6. Generate multiple runs of base objectives
7. Group and rank objectives
8. Generate incorrect answers for best-in-group
9. Improve incorrect answers
10. Reassign IDs (best from 001 group → ID=1)
11. Format results for display
12. Store in state
13. Return 4 outputs: status, best-in-group, all-grouped, raw
regenerate_objectives(objectives_json, feedback, num_objectives, num_runs, model_name, temperature)
Workflow:
1. Retrieve processed contents from state
2. Append feedback to content:
file_contents_with_feedback.append(f"FEEDBACK: {feedback}")
3. Generate new objectives with feedback context
4. Group and rank
5. Return regenerated objectives
_reassign_objective_ids(grouped_objectives)
ID Assignment Logic:
1. Find all objectives with IDs ending in 001 (1001, 2001, etc.)
2. Identify their groups
3. Find best_in_group objective from these groups
4. Assign it ID = 1
5. Assign all other objectives sequential IDs starting from 2
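Under the assumption that objectives are carried as dicts with `id`, `group_id`, and `best_in_group` fields (the field names are assumptions), the logic above might be implemented as below. Picking the first matching group when several contain a "001" ID is also an assumption.

```python
def reassign_objective_ids(grouped_objectives):
    """Assign ID=1 to the best-in-group objective whose group contains an
    original ID ending in 001 (1001, 2001, ...); number the rest from 2."""
    # Step 1-2: find the groups that contain an ID ending in 001
    first_groups = {o["group_id"] for o in grouped_objectives if o["id"] % 1000 == 1}
    # Step 3: the best_in_group objective from those groups becomes the first question
    first = next(
        (o for o in grouped_objectives
         if o["group_id"] in first_groups and o.get("best_in_group")),
        None,
    )
    # Steps 4-5: assign ID=1, then sequential IDs from 2
    next_id = 2 if first is not None else 1
    for obj in grouped_objectives:
        if obj is first:
            obj["id"] = 1
        else:
            obj["id"] = next_id
            next_id += 1
    return grouped_objectives
```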
_format_objective_results(grouped_result, all_learning_objectives)
Formatting:
1. Sort by ID
2. Create dictionaries from Pydantic objects
3. Include all metadata fields
4. Convert to JSON with indent=2
5. Return 3 formatted outputs + status message
question_handlers.py
generate_questions(objectives_json, model_name, temperature, num_runs)
Complete Workflow:
1. Validate inputs
2. Parse objectives JSON → create LearningObjective objects
3. Retrieve processed contents from state
4. Create QuizGenerator
5. Generate questions (multiple runs in parallel)
6. Group questions by similarity
7. Rank best-in-group questions
8. Optionally improve incorrect answers (currently commented out)
9. Format results
10. Return 4 outputs: status, best-ranked, all-grouped, formatted
_generate_questions_multiple_runs()
For each run:
1. Call generate_questions_in_parallel()
2. Assign unique IDs across runs:
start_id = len(all_questions) + 1
for i, q in enumerate(run_questions):
q.id = start_id + i
3. Aggregate all questions
_group_and_rank_questions()
1. Group all questions → get grouped and best_in_group
2. Rank only best_in_group questions
3. Return:
{
"grouped": all with group metadata,
"best_in_group_ranked": best with ranks
}
feedback_handlers.py
propose_question_handler(guidance, model_name, temperature)
Workflow:
1. Validate state (processed contents available)
2. Create QuizGenerator
3. Call generate_multiple_choice_question_from_feedback()
- Passes user guidance and course content
- LLM infers learning objective
- Generates complete question
4. Format as JSON
5. Return status and question JSON
Formatting Utilities (ui/formatting.py)
format_quiz_for_ui(questions_json)
Process:
1. Parse JSON to list of question dictionaries
2. Sort by rank if available
3. For each question:
- Add header: "**Question N [Rank: X]:** {question_text}"
- Add ranking reasoning if available
- For each option:
- Add letter (A, B, C, D)
- Mark correct option
- Include option text
- Include feedback indented
4. Return formatted string with markdown
Output Example:
**Question 1 [Rank: 2]:** What is the primary purpose of AI agents?
Ranking Reasoning: Clear question that tests fundamental understanding...
• A [Correct]: To automate tasks and make decisions
  ◦ Feedback: Correct! AI agents are designed to automate tasks...
• B: To replace human workers entirely
  ◦ Feedback: While AI agents can automate tasks, they are not...
[continues...]
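A minimal sketch of the formatter that produces output like the example above; the dictionary field names (`rank`, `ranking_reasoning`, `options`, `is_correct`, `feedback`) are assumptions inferred from the structure described.

```python
def format_quiz_for_ui(questions):
    """Render question dicts as markdown in the style shown above."""
    letters = "ABCD"
    lines = []
    # Sort by rank when available; unranked questions sort last.
    for n, q in enumerate(sorted(questions, key=lambda q: q.get("rank", 10**9)), 1):
        rank = f" [Rank: {q['rank']}]" if "rank" in q else ""
        lines.append(f"**Question {n}{rank}:** {q['question']}")
        if q.get("ranking_reasoning"):
            lines.append(f"Ranking Reasoning: {q['ranking_reasoning']}")
        for letter, opt in zip(letters, q["options"]):
            marker = " [Correct]" if opt.get("is_correct") else ""
            lines.append(f"• {letter}{marker}: {opt['text']}")
            lines.append(f"  ◦ Feedback: {opt['feedback']}")  # feedback indented
    return "\n".join(lines)
```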
Quality Standards and Prompts
Learning Objectives Quality Standards
From prompts/learning_objectives.py:
BASE_LEARNING_OBJECTIVES_PROMPT - Key Requirements:
Assessability:
- Must be testable via multiple-choice questions
- Cannot be about "building", "creating", "developing"
- Should use verbs like: identify, list, describe, define, compare
Specificity:
- One goal per objective
- Don't combine multiple action verbs
- Example of what NOT to do: "identify X and explain Y"
Source Alignment:
- Derived DIRECTLY from course content
- No topics not covered in content
- Appropriate difficulty level for course
Independence:
- Each objective stands alone
- No dependencies on other objectives
- No context required from other objectives
Focus:
- Address "why" over "what" when possible
- Critical knowledge over trivial facts
- Principles over specific implementation details
Tool/Framework Agnosticism:
- Don't mention specific tools/frameworks
- Focus on underlying principles
- Example: Don't ask about "Pandas DataFrame methods", ask about "data filtering concepts"
First Objective Rule:
- Should be relatively easy recall question
- Address main topic/concept of course
- Format: "Identify what X is" or "Explain why X is important"
Answer Length:
- Aim for ≤20 words in correct answer
- Avoid unnecessary elaboration
- No compound sentences with extra consequences
BLOOMS_TAXONOMY_LEVELS:
Levels from lowest to highest:
- Recall: Retention of key concepts (not trivialities)
- Comprehension: Connect ideas, demonstrate understanding
- Application: Apply concept to new but similar scenario
- Analysis: Examine parts, determine relationships, make inferences
- Evaluation: Make judgments requiring critical thinking
LEARNING_OBJECTIVE_EXAMPLES:
Includes 7 high-quality examples with:
- Appropriate action verbs
- Clear learning objectives
- Concise correct answers (mostly <20 words)
- Multiple source references
- Framework-agnostic language
Question Quality Standards
From prompts/questions.py:
GENERAL_QUALITY_STANDARDS:
- Overall goal: Set learner up for success
- Perfect score attainable for thoughtful students
- Aligned with course content
- Aligned with learning objective and correct answer
- No references to manual intervention (software/AI course)
MULTIPLE_CHOICE_STANDARDS:
- EXACTLY ONE correct answer per question
- Clear, unambiguous correct answer
- Plausible distractors representing common misconceptions
- Not obviously wrong distractors
- All options similar length and detail
- Mutually exclusive options
- Avoid "all/none of the above"
- Typically 4 options (A, B, C, D)
- Don't start feedback with "Correct" or "Incorrect"
QUESTION_SPECIFIC_QUALITY_STANDARDS:
Questions must:
- Match language and tone of course
- Match difficulty level of course
- Assess only course information
- Not teach as part of quiz
- Use clear, concise language
- Not induce confusion
- Provide slight (not major) challenge
- Be easily interpreted and unambiguous
- Have proper grammar and sentence structure
- Be thoughtful and specific (not broad and ambiguous)
- Be complete in wording (understanding the question shouldn't itself be part of the assessment)
CORRECT_ANSWER_SPECIFIC_QUALITY_STANDARDS:
Correct answers must:
- Be factually correct and unambiguous
- Match course language and tone
- Be complete sentences
- Match course difficulty level
- Contain only course information
- Not teach during quiz
- Use clear, concise language
- Be thoughtful and specific
- Be complete (identifying the correct answer shouldn't require interpretation)
INCORRECT_ANSWER_SPECIFIC_QUALITY_STANDARDS:
Incorrect answers should:
- Represent reasonable potential misconceptions
- Sound plausible to non-experts
- Require thought even from diligent learners
- Not be obviously wrong
- Use incorrect_answer_suggestions from objective (as starting point)
Avoid:
- Obviously wrong options anyone can eliminate
- Absolute terms: "always", "never", "only", "exclusively"
- Phrases like "used exclusively for scenarios where..."
ANSWER_FEEDBACK_QUALITY_STANDARDS:
For Incorrect Answers:
- Be informational and encouraging (not punitive)
- Single sentence, concise
- Do NOT say "Incorrect" or "Wrong"
For Correct Answers:
- Be informational and encouraging
- Single sentence, concise
- Do NOT say "Correct!" (redundant after "Correct: " prefix)
Incorrect Answer Generation Guidelines
From prompts/incorrect_answers.py:
Core Principles:
Create Common Misunderstandings:
- Represent how students actually misunderstand
- Confuse related concepts
- Mix up terminology
Maintain Identical Structure:
- Match grammatical pattern of correct answer
- Same length and complexity
- Same formatting style
Use Course Terminology Correctly but in Wrong Contexts:
- Apply correct terms incorrectly
- Confuse with related concepts
- Example: an option that claims to describe backpropagation but actually describes forward propagation
Include Partially Correct Information:
- First part correct, second part wrong
- Correct process but wrong application
- Correct concept but incomplete
Avoid Obviously Wrong Answers:
- No contradictions with basic knowledge
- Not immediately eliminable
- Require course knowledge to reject
Mirror Detail Level and Style:
- Match technical depth
- Match tone
- Same level of specificity
For Lists, Maintain Consistency:
- Same number of items
- Same format
- Mix some correct with incorrect items
AVOID ABSOLUTE TERMS:
- "always", "never", "exclusively", "primarily"
- "all", "every", "none", "nothing", "only"
- "must", "required", "impossible"
- "rather than", "as opposed to", "instead of"
IMMEDIATE_RED_FLAGS (triggers regeneration):
Contradictory Second Clauses:
- "but not necessarily"
- "at the expense of"
- "rather than [core concept]"
- "ensuring X rather than Y"
- "without necessarily"
- "but has no impact on"
- "but cannot", "but prevents", "but limits"
Explicit Negations:
- "without automating", "without incorporating"
- "preventing [main benefit]"
- "limiting [main capability]"
Opposite Descriptions:
- "fixed steps" (for flexible systems)
- "manual intervention" (for automation)
- "simple question answering" (for complex processing)
Hedging Creating Limitations:
- "sometimes", "occasionally", "might"
- "to some extent", "partially", "somewhat"
INCORRECT_ANSWER_EXAMPLES:
Includes 10 detailed examples showing:
- Learning objective
- Correct answer
- 3 plausible incorrect suggestions
- Explanation of why each is plausible but wrong
- Consistent formatting across all options
Ranking and Grouping
RANK_QUESTIONS_PROMPT:
Criteria:
- Question clarity and unambiguity
- Alignment with learning objective
- Quality of incorrect options
- Quality of feedback
- Appropriate difficulty (simple English preferred)
- Adherence to all guidelines
Critical Instructions:
- DO NOT change question with ID=1
- Rank starting from 2
- Each question unique rank
- Must return ALL questions
- No omissions
- No duplicate ranks
Simple vs Complex English:
Simple: "AI engineers create computer programs that learn from data"
Complex: "AI engineering practitioners architect computational paradigms
exhibiting autonomous erudition capabilities"
GROUP_QUESTIONS_PROMPT:
Grouping Logic:
- Questions with same learning_objective_id are similar
- Identify topic overlap
- Mark best_in_group within each group
- Single-member groups: best_in_group = true
Critical Instructions:
- Must return ALL questions
- Each question needs group metadata
- No omissions
- Best in group marked appropriately
Summary of Data Flow
Complete End-to-End Flow
User Uploads Files
↓
ContentProcessor extracts and tags content
↓
[Stored in global state]
↓
Generate Base Objectives (multiple runs)
↓
Group Base Objectives (by similarity)
↓
Generate Incorrect Answers (for best-in-group only)
↓
Improve Incorrect Answers (quality check)
↓
Reassign IDs (best from 001 group → ID=1)
↓
[Objectives displayed in UI, stored in state]
↓
Generate Questions (parallel, multiple runs)
↓
Judge Question Quality (parallel)
↓
Group Questions (by similarity)
↓
Rank Questions (best-in-group only)
↓
[Questions displayed in UI]
↓
Format for Display
↓
Export to JSON (optional)
Key Optimization Strategies
Multiple Generation Runs:
- Generates variety of objectives/questions
- Grouping identifies best versions
- Reduces risk of poor quality individual outputs
Hierarchical Processing:
- Generate base → Group → Enhance → Improve
- Only enhances best candidates (saves API calls)
- Progressive refinement
Parallel Processing:
- Questions generated concurrently (up to 5 threads)
- Significant time savings for multiple objectives
- Independent evaluations
Quality Gating:
- LLM judges question quality
- Checks for red flags in incorrect answers
- Regenerates problematic content
Source Tracking:
- XML tags preserve origin
- Questions link back to source materials
- Enables accurate content matching
Modular Prompts:
- Reusable quality standards
- Consistent across all generations
- Easy to update centrally
Configuration and Customization
Available Models
Configured in models/config.py:
MODELS = [
"o3-mini", "o1", # Reasoning models (no temperature)
"gpt-4.1", "gpt-4o", # GPT-4 variants
"gpt-4o-mini", "gpt-4",
"gpt-3.5-turbo", # Legacy
"gpt-5", # Latest (no temperature)
"gpt-5-mini", # Efficient (no temperature)
"gpt-5-nano" # Ultra-efficient (no temperature)
]
Temperature Support:
- Models with reasoning (o1, o3-mini, gpt-5 variants): No temperature
- Other models: Temperature 0.0 to 1.0
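One way to express the temperature rule in code; the model names come from the list above, but the helper itself is an assumption, not part of the documented codebase.

```python
# Reasoning models reject the temperature parameter entirely.
NO_TEMPERATURE_MODELS = ("o1", "o3-mini", "gpt-5", "gpt-5-mini", "gpt-5-nano")

def supports_temperature(model: str) -> bool:
    return model not in NO_TEMPERATURE_MODELS

def build_request_kwargs(model: str, temperature: float) -> dict:
    """Attach temperature only for models that accept it, clamped to [0, 1]."""
    kwargs = {"model": model}
    if supports_temperature(model):
        kwargs["temperature"] = max(0.0, min(1.0, temperature))
    return kwargs
```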
Model Selection Strategy:
- Base objectives: User-selected (default: gpt-5)
- Grouping: Hardcoded gpt-5-mini (efficiency)
- Incorrect answers: Separate user selection (default: gpt-5)
- Questions: User-selected (default: gpt-5)
- Quality judging: User-selected or gpt-5-mini
Environment Variables
Required:
OPENAI_API_KEY=your_api_key_here
Configured via a .env file in the project root
Customization Points
Quality Standards:
- Edit prompts/learning_objectives.py
- Edit prompts/questions.py
- Edit prompts/incorrect_answers.py
- Changes apply to all future generations
Example Questions/Objectives:
- Modify LEARNING_OBJECTIVE_EXAMPLES
- Modify EXAMPLE_QUESTIONS
- Modify INCORRECT_ANSWER_EXAMPLES
- LLM learns from these examples
Generation Parameters:
- Number of objectives per run
- Number of runs (variety)
- Temperature (creativity vs consistency)
- Model selection (quality vs cost/speed)
Parallel Processing:
- max_workers in assessment.py
- Currently: min(len(objectives), 5)
- Adjust for your rate limits
Output Formats:
- Modify formatting.py for display
- Assessment JSON structure in models/assessment.py
Error Handling and Resilience
Content Processing Errors
- Invalid JSON notebooks: Falls back to raw text
- Parse failures: Wraps in code blocks, continues
- Missing files: Logged, skipped
- Encoding issues: UTF-8 fallback
Generation Errors
- API failures: Logged with traceback
- Structured output parse errors: Fallback responses created
- Missing required fields: Default values assigned
- Validation errors: Caught and logged
Parallel Processing Errors
- Individual thread failures: Don't stop other threads
- Placeholder questions: Created on error
- Complete error details: Logged for debugging
- Graceful degradation: Partial results returned
Quality Check Failures
- Regeneration failures: Original kept with warning
- Judge unavailable: Questions marked unapproved
- Validation failures: Detailed logs in debug directories
Debug and Logging
Debug Directories
incorrect_suggestion_debug/
- Created during objective enhancement
- Contains logs of problematic incorrect answers
- Format: {objective_id}.txt
- Includes: Original suggestions, identified issues, regeneration attempts
wrong_answer_debug/
- Created during question improvement
- Logs question-level incorrect answer issues
- Regeneration history
Console Logging
Extensive logging throughout:
- File processing status
- Generation progress (run numbers)
- Parallel thread activity (thread IDs)
- API call results
- Error messages with tracebacks
- Timing information (start/end times)
Example Log Output:
DEBUG - Processing 3 files: ['file1.vtt', 'file2.ipynb', 'file3.srt']
DEBUG - Found source file: file1.vtt
Generating 3 learning objectives from 3 files
Successfully generated 3 learning objectives without correct answers
Generated correct answer for objective 1
Grouping 9 base learning objectives
Received 9 grouped results
Generating incorrect answer options only for best-in-group objectives...
PARALLEL: Starting ThreadPoolExecutor with 3 workers
PARALLEL: Worker 1 (Thread ID: 12345): Starting work on objective...
Question generation completed in 45.23 seconds
Performance Considerations
API Call Optimization
Calls per Workflow:
For 3 objectives × 3 runs = 9 base objectives:
Learning Objectives:
- Base generation: 3 calls (one per run)
- Correct answers: 9 calls (one per objective)
- Grouping: 1 call
- Incorrect answers: ~3 calls (best-in-group only)
- Improvement checks: ~3 calls
- Total: ~19 calls
Questions (for 3 objectives × 1 run):
- Question generation: 3 calls (parallel)
- Quality judging: 3 calls (parallel)
- Grouping: 1 call
- Ranking: 1 call
- Total: ~8 calls
Total for complete workflow: ~27 API calls
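The tallies above follow directly from the pipeline shape and can be reproduced as simple formulas (the best-in-group count of 3 is taken from this example; in general it depends on how the runs group):

```python
def estimate_objective_calls(n_objectives, n_runs, n_best_in_group):
    base = n_runs                      # one base-generation call per run
    correct = n_objectives * n_runs    # one correct-answer call per generated objective
    grouping = 1
    incorrect = n_best_in_group        # incorrect answers for best-in-group only
    improvement = n_best_in_group      # one improvement check each
    return base + correct + grouping + incorrect + improvement

def estimate_question_calls(n_objectives, n_runs=1):
    generation = n_objectives * n_runs  # parallel, one per objective per run
    judging = n_objectives * n_runs     # one quality judgment each
    return generation + judging + 2     # plus grouping + ranking
```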
Time Estimates
Typical Execution Times:
- File processing: <1 second
- Objective generation (3×3): 30-60 seconds
- Question generation (3×1): 20-40 seconds (with parallelization)
- Total: 1-2 minutes for small course
Factors Affecting Speed:
- Model selection (gpt-5 slower than gpt-5-mini)
- Number of runs
- Number of objectives/questions
- API rate limits
- Network latency
- Parallel worker count
Cost Optimization
Strategies:
- Use gpt-5-mini for grouping/ranking (hardcoded)
- Reduce number of runs (trade-off: variety)
- Generate fewer objectives initially
- Use faster models for initial exploration
- Use premium models for final production
Conclusion
The AI Course Assessment Generator is a sophisticated, multi-stage system that transforms raw course materials into high-quality educational assessments. It employs:
- Modular architecture for maintainability
- Structured output generation for reliability
- Quality-driven iterative refinement for excellence
- Parallel processing for efficiency
- Comprehensive error handling for resilience
The system successfully balances automation with quality control, producing assessments that align with educational best practices and Bloom's Taxonomy while maintaining complete traceability to source materials.