quiz-generator-v3 / APP_FUNCTIONALITY_REPORT.md
ecuartasm · Initial commit: AI Course Assessment Generator · 217abc3


AI Course Assessment Generator - Functionality Report

Table of Contents

  1. Overview
  2. System Architecture
  3. Data Models
  4. Application Entry Point
  5. User Interface Structure
  6. Complete Workflow
  7. Detailed Component Functionality
  8. Quality Standards and Prompts

Overview

The AI Course Assessment Generator is a sophisticated educational tool that automates the creation of learning objectives and multiple-choice questions from course materials. It leverages OpenAI's language models with structured output generation to produce high-quality educational assessments that adhere to specified quality standards and Bloom's Taxonomy levels.

Key Capabilities

  • Multi-format Content Processing: Accepts .vtt, .srt (subtitle files), and .ipynb (Jupyter notebooks)
  • AI-Powered Generation: Uses OpenAI's GPT models with configurable parameters
  • Quality Assurance: Implements LLM-based quality assessment and ranking
  • Source Tracking: Maintains XML-tagged references from source materials to generated content
  • Iterative Improvement: Supports feedback-based regeneration and enhancement
  • Parallel Processing: Generates questions concurrently for improved performance

System Architecture

Architectural Patterns

1. Orchestrator Pattern

Both LearningObjectiveGenerator and QuizGenerator act as orchestrators that coordinate calls to specialized generation functions rather than implementing generation logic directly.

2. Modular Prompt System

The prompts/ directory contains reusable prompt components that are imported and combined in generation modules, allowing for consistent quality standards across different generation tasks.

3. Structured Output Generation

All LLM interactions use Pydantic models with the instructor library to ensure consistent, validated output formats using OpenAI's structured output API.

4. Source Tracking via XML Tags

Content is wrapped in XML tags (e.g., <source file='example.ipynb'>content</source>) throughout the pipeline to maintain traceability from source files to generated questions.
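A minimal helper illustrating this convention (the helper name is hypothetical; only the tag format comes from the codebase):

```python
def tag_source(filename: str, content: str) -> str:
    """Wrap processed file content in a <source> tag for traceability."""
    return f"<source file='{filename}'>{content}</source>"

tagged = tag_source("example.ipynb", "print('hello')")
```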

Technology Stack

  • Python 3.8+
  • Gradio 5.29.0+: Web-based UI framework
  • Pydantic 2.8.0+: Data validation and schema management
  • OpenAI 1.52.0+: LLM API integration
  • Instructor 1.7.9+: Structured output generation
  • nbformat 5.9.2: Jupyter notebook parsing
  • python-dotenv 1.0.0: Environment variable management

Data Models

Learning Objectives Progression

The system uses a hierarchical progression of learning objective models:

1. BaseLearningObjectiveWithoutCorrectAnswer

- id: int
- learning_objective: str
- source_reference: Union[List[str], str]

Initial generation without correct answers.

2. BaseLearningObjective

- id: int
- learning_objective: str
- source_reference: Union[List[str], str]
- correct_answer: str

Base objectives with correct answers added.

3. LearningObjective

- id: int
- learning_objective: str
- source_reference: Union[List[str], str]
- correct_answer: str
- incorrect_answer_options: Union[List[str], str]
- in_group: Optional[bool]
- group_members: Optional[List[int]]
- best_in_group: Optional[bool]

Enhanced with incorrect answer suggestions and grouping metadata.

4. GroupedLearningObjective

(All fields from LearningObjective)
- in_group: bool (required)
- group_members: List[int] (required)
- best_in_group: bool (required)

Fully grouped and ranked objectives.

Question Models Progression

1. MultipleChoiceOption

- option_text: str
- is_correct: bool
- feedback: str

2. MultipleChoiceQuestion

- id: int
- question_text: str
- options: List[MultipleChoiceOption]
- learning_objective_id: int
- learning_objective: str
- correct_answer: str
- source_reference: Union[List[str], str]
- judge_feedback: Optional[str]
- approved: Optional[bool]

3. RankedMultipleChoiceQuestion

(All fields from MultipleChoiceQuestion)
- rank: int
- ranking_reasoning: str
- in_group: bool
- group_members: List[int]
- best_in_group: bool

4. Assessment

- learning_objectives: List[LearningObjective]
- questions: List[RankedMultipleChoiceQuestion]

Final output containing both objectives and questions.

Configuration Models

MODELS

Available OpenAI models: ["o3-mini", "o1", "gpt-4.1", "gpt-4o", "gpt-4o-mini", "gpt-4", "gpt-3.5-turbo", "gpt-5", "gpt-5-mini", "gpt-5-nano"]

TEMPERATURE_UNAVAILABLE

Dictionary mapping each model to whether temperature is unsupported; models such as o1, o3-mini, and the gpt-5 variants don't accept a temperature parameter.
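The gating logic can be sketched as follows; the exact dictionary contents are an assumption based on the model list above:

```python
MODELS = ["o3-mini", "o1", "gpt-4.1", "gpt-4o", "gpt-4o-mini",
          "gpt-4", "gpt-3.5-turbo", "gpt-5", "gpt-5-mini", "gpt-5-nano"]

# True means the model rejects a temperature parameter (assumed mapping).
TEMPERATURE_UNAVAILABLE = {m: m.startswith(("o1", "o3", "gpt-5")) for m in MODELS}

def build_params(model: str, temperature: float) -> dict:
    """Only attach temperature when the model supports it."""
    params = {"model": model}
    if not TEMPERATURE_UNAVAILABLE.get(model, True):  # unknown models: skip
        params["temperature"] = temperature
    return params
```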


Application Entry Point

app.py

The root-level entry point that:

  1. Loads environment variables from .env file
  2. Checks for OPENAI_API_KEY presence
  3. Creates the Gradio UI via ui.app.create_ui()
  4. Launches the web interface at http://127.0.0.1:7860
# Workflow:
load_dotenv() → Check API key → create_ui() → app.launch()
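A minimal sketch of this entry point, assuming the module layout named above (the error handling is illustrative):

```python
import os

def main():
    # Lazy imports so the sketch mirrors app.py without hard dependencies here.
    from dotenv import load_dotenv
    from ui.app import create_ui

    load_dotenv()                                 # 1. read .env
    if not os.getenv("OPENAI_API_KEY"):           # 2. fail fast without a key
        raise SystemExit("OPENAI_API_KEY is not set")
    app = create_ui()                             # 3. build the Gradio UI
    app.launch(server_name="127.0.0.1", server_port=7860)  # 4. serve it

# app.py invokes main() when run directly.
```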

User Interface Structure

ui/app.py - Gradio Interface

The UI is organized into 3 main tabs:

Tab 1: Generate Learning Objectives

Input Components:

  • File uploader (accepts .ipynb, .vtt, .srt)
  • Number of objectives per run (slider: 1-20, default: 3)
  • Number of generation runs (dropdown: 1-5, default: 3)
  • Model selection (dropdown, default: "gpt-5")
  • Incorrect answer model selection (dropdown, default: "gpt-5")
  • Temperature setting (dropdown: 0.0-1.0, default: 1.0)
  • Generate button
  • Feedback input textbox
  • Regenerate button

Output Components:

  • Status textbox
  • Best-in-Group Learning Objectives (JSON)
  • All Grouped Learning Objectives (JSON)
  • Raw Ungrouped Learning Objectives (JSON) - for debugging

Event Handler: process_files() from objective_handlers.py

Tab 2: Generate Questions

Input Components:

  • Learning Objectives JSON (auto-populated from Tab 1)
  • Model selection
  • Temperature setting
  • Number of question generation runs (slider: 1-5, default: 1)
  • Generate Questions button

Output Components:

  • Status textbox
  • Ranked Best-in-Group Questions (JSON)
  • All Grouped Questions (JSON)
  • Formatted Quiz (human-readable format)

Event Handler: generate_questions() from question_handlers.py

Tab 3: Propose/Edit Question

Input Components:

  • Question guidance/feedback textbox
  • Model selection
  • Temperature setting
  • Generate Question button

Output Components:

  • Status textbox
  • Generated Question (JSON)

Event Handler: propose_question_handler() from feedback_handlers.py


Complete Workflow

Phase 1: File Upload and Content Processing

Step 1.1: File Upload

User uploads one or more files (.vtt, .srt, .ipynb) through the Gradio interface.

Step 1.2: File Path Extraction (objective_handlers._extract_file_paths())

# Handles different input formats:
- List of file paths
- Single file path string
- File objects with .name attribute

Step 1.3: Content Processing (ui/content_processor.py)

For Subtitle Files (.vtt, .srt):

1. Read file with UTF-8 encoding
2. Split into lines
3. Filter out:
   - Empty lines
   - Numeric timestamp indicators
   - Lines containing '-->' (timestamps)
   - 'WEBVTT' header lines
4. Combine remaining text lines
5. Wrap in XML tags: <source file='filename.vtt'>content</source>
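The filtering steps above can be sketched as (the function name is hypothetical):

```python
def extract_subtitle_text(raw: str, filename: str) -> str:
    """Sketch of the subtitle-cleaning steps described above (assumed logic)."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:              # empty lines
            continue
        if line == "WEBVTT":      # header line
            continue
        if "-->" in line:         # timestamp lines
            continue
        if line.isdigit():        # numeric cue indices
            continue
        kept.append(line)
    text = " ".join(kept)
    return f"<source file='{filename}'>{text}</source>"

sample = "WEBVTT\n\n1\n00:00:01.000 --> 00:00:04.000\nWelcome to the course.\n"
```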

For Jupyter Notebooks (.ipynb):

1. Validate JSON format
2. Parse with nbformat.read()
3. Extract from cells:
   - Markdown cells: [Markdown]\n{content}
   - Code cells: [Code]\n```python\n{content}\n```
4. Combine all cell content
5. Wrap in XML tags: <source file='filename.ipynb'>content</source>

Error Handling:

  • Invalid JSON: Wraps raw content in code blocks
  • Parsing failures: Falls back to plain text extraction
  • All errors logged to console
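A self-contained approximation of the notebook path, using plain json in place of nbformat so the sketch runs anywhere; the real implementation parses with nbformat.read():

```python
import json

def extract_notebook_text(raw_json: str, filename: str) -> str:
    """Approximation of the notebook-processing steps above (assumed logic)."""
    fence = "`" * 3  # literal code fence, built at runtime to keep this example tidy
    try:
        nb = json.loads(raw_json)
    except json.JSONDecodeError:
        # Invalid JSON: wrap the raw content in a code block, as the report notes.
        return f"<source file='{filename}'>{fence}\n{raw_json}\n{fence}</source>"
    parts = []
    for cell in nb.get("cells", []):
        src = "".join(cell.get("source", []))
        if cell.get("cell_type") == "markdown":
            parts.append(f"[Markdown]\n{src}")
        elif cell.get("cell_type") == "code":
            parts.append(f"[Code]\n{fence}python\n{src}\n{fence}")
    return f"<source file='{filename}'>" + "\n\n".join(parts) + "</source>"
```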

Step 1.4: State Storage

Processed content stored in global state (ui/state.py):

processed_file_contents = [tagged_content_1, tagged_content_2, ...]

Phase 2: Learning Objective Generation

Step 2.1: Multi-Run Base Generation

Process: objective_handlers._generate_multiple_runs()

For each run (user-specified, typically 3 runs):

  1. Call: QuizGenerator.generate_base_learning_objectives()

  2. Workflow:

    generate_base_learning_objectives()
      ↓
    generate_base_learning_objectives_without_correct_answers()
      → Creates prompt with:
         - BASE_LEARNING_OBJECTIVES_PROMPT
         - BLOOMS_TAXONOMY_LEVELS
         - LEARNING_OBJECTIVE_EXAMPLES_WITHOUT_ANSWERS
         - Combined file contents
      → Calls OpenAI API with structured output
      → Returns List[BaseLearningObjectiveWithoutCorrectAnswer]
      ↓
    generate_correct_answers_for_objectives()
      → For each objective:
         - Creates prompt with objective and course content
         - Calls OpenAI API (unstructured text response)
         - Extracts correct answer
      → Returns List[BaseLearningObjective]
    
  3. ID Assignment:

    # Temporary IDs by run:
    Run 1: 1001, 1002, 1003
    Run 2: 2001, 2002, 2003
    Run 3: 3001, 3002, 3003
    
  4. Aggregation: All objectives from all runs combined into single list.

Example: 3 runs × 3 objectives = 9 total base objectives
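The run-scoped ID scheme can be sketched as follows (the run × 1000 + index formula is inferred from the pattern above):

```python
def assign_temp_ids(num_runs: int, per_run: int) -> list[int]:
    """Temporary IDs: run 1 gets 1001..., run 2 gets 2001..., and so on."""
    return [run * 1000 + i
            for run in range(1, num_runs + 1)
            for i in range(1, per_run + 1)]
```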

Step 2.2: Grouping and Ranking

Process: objective_handlers._group_base_objectives_add_incorrect_answers()

Step 2.2.1: Group Base Objectives

QuizGenerator.group_base_learning_objectives()
  ↓
learning_objective_generator/grouping_and_ranking.py
  → group_base_learning_objectives()

Grouping Logic:

  1. Creates prompt containing:

    • Original generation criteria
    • All base objectives with IDs
    • Course content for context
    • Grouping instructions
  2. Special Rule: All objectives with IDs ending in 1 (1001, 2001, 3001) are grouped together and ONE is marked as best-in-group (this becomes the primary/first objective)

  3. LLM Call:

    • Model: gpt-5-mini
    • Response format: GroupedBaseLearningObjectivesResponse
    • Returns: Grouped objectives with metadata
  4. Output Structure:

    {
      "all_grouped": [all objectives with group metadata],
      "best_in_group": [objectives marked as best in their groups]
    }
    

Step 2.2.2: ID Reassignment (_reassign_objective_ids())

1. Find best objective from the 001 group
2. Assign it ID = 1
3. Assign remaining objectives IDs starting from 2
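A minimal sketch of this reassignment, using dicts in place of the real Pydantic models:

```python
def reassign_ids(objectives: list[dict]) -> list[dict]:
    """Promote the best objective of the '001' group to ID 1; renumber the rest."""
    def is_primary(o):
        # Best-in-group member of the "001" group (temp IDs 1001, 2001, ...).
        return o["id"] % 1000 == 1 and o.get("best_in_group")

    primary = next(o for o in objectives if is_primary(o))
    primary["id"] = 1
    next_id = 2
    for o in objectives:
        if o is not primary:
            o["id"] = next_id
            next_id += 1
    return objectives
```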

Step 2.2.3: Generate Incorrect Answer Options

Only for best-in-group objectives:

QuizGenerator.generate_lo_incorrect_answer_options()
  ↓
learning_objective_generator/enhancement.py
  → generate_incorrect_answer_options()

Process:

  1. For each best-in-group objective:

    • Creates prompt with:
      • Objective and correct answer
      • INCORRECT_ANSWER_PROMPT guidelines
      • INCORRECT_ANSWER_EXAMPLES
      • Course content
    • Calls OpenAI API (with optional model override)
    • Generates 5 plausible incorrect answer options
  2. Returns: List[LearningObjective] with incorrect_answer_options populated

Step 2.2.4: Improve Incorrect Answers

learning_objective_generator.regenerate_incorrect_answers()
  ↓
learning_objective_generator/suggestion_improvement.py

Quality Check Process:

  1. For each objective's incorrect answers:

    • Checks for red flags (contradictory phrases, absolute terms)
    • Examples of red flags:
      • "but not necessarily"
      • "at the expense of"
      • "rather than"
      • "always", "never", "exclusively"
  2. If problems found:

    • Logs issue to incorrect_suggestion_debug/ directory
    • Regenerates incorrect answers with additional constraints
    • Updates objective with improved answers

Step 2.2.5: Final Assembly

Creates final list where:

  • Best-in-group objectives have enhanced incorrect answers
  • Non-best-in-group objectives have empty incorrect_answer_options: []

Step 2.3: Display Results

Three output formats:

  1. Best-in-Group Objectives (primary output):

    • Only objectives marked as best_in_group
    • Includes incorrect answer options
    • Sorted by ID
    • Formatted as JSON
  2. All Grouped Objectives:

    • All objectives with grouping metadata
    • Shows group_members arrays
    • Best-in-group flags visible
  3. Raw Ungrouped (debug):

    • Original objectives from all runs
    • No grouping metadata
    • Original temporary IDs

Step 2.4: State Update

set_learning_objectives(grouped_result["all_grouped"])
set_processed_contents(file_contents)  # Already set, but persisted

Phase 3: Question Generation

Step 3.1: Parse Learning Objectives

Process: question_handlers._parse_learning_objectives()

1. Parse JSON from Tab 1 output
2. Create LearningObjective objects from dictionaries
3. Validate required fields
4. Return List[LearningObjective]
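A trimmed-down sketch of this parsing step; the real code builds Pydantic LearningObjective models, approximated here with a dataclass so the example is self-contained:

```python
import json
from dataclasses import dataclass, field

@dataclass
class LearningObjective:
    id: int
    learning_objective: str
    source_reference: object          # str or list of filenames
    correct_answer: str
    incorrect_answer_options: list = field(default_factory=list)

def parse_learning_objectives(raw_json: str) -> list:
    """Parse Tab 1's JSON output and validate required fields."""
    data = json.loads(raw_json)
    required = ("id", "learning_objective", "source_reference", "correct_answer")
    for item in data:
        missing = [k for k in required if k not in item]
        if missing:
            raise ValueError(f"objective {item.get('id')} is missing {missing}")
    return [LearningObjective(**item) for item in data]
```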

Step 3.2: Multi-Run Question Generation

Process: question_handlers._generate_questions_multiple_runs()

For each run (user-specified, typically 1 run):

QuizGenerator.generate_questions_in_parallel()
  ↓
quiz_generator/assessment.py
  → generate_questions_in_parallel()

Parallel Generation Process:

  1. Thread Pool Setup:

    max_workers = min(len(learning_objectives), 5)
    ThreadPoolExecutor(max_workers=max_workers)
    
  2. For Each Learning Objective (in parallel):

    Step 3.2.1: Question Generation (quiz_generator/question_generation.py)

    generate_multiple_choice_question()
    

    a) Source Content Matching:

    - Extract source_reference from objective
    - Search file_contents for matching XML tags
    - Exact match: <source file='filename.vtt'>
    - Fallback: Partial filename match
    - Last resort: Use all file contents combined
    

    b) Multi-Source Handling:

    if len(source_references) > 1:
        Add special instruction:
        "Question should synthesize information across sources"
    

    c) Prompt Construction:

    Combines:
    - Learning objective
    - Correct answer
    - Incorrect answer options from objective
    - GENERAL_QUALITY_STANDARDS
    - MULTIPLE_CHOICE_STANDARDS
    - EXAMPLE_QUESTIONS
    - QUESTION_SPECIFIC_QUALITY_STANDARDS
    - CORRECT_ANSWER_SPECIFIC_QUALITY_STANDARDS
    - INCORRECT_ANSWER_EXAMPLES_WITH_EXPLANATION
    - ANSWER_FEEDBACK_QUALITY_STANDARDS
    - Matched course content
    

    d) API Call:

    - Model: User-selected (default: gpt-5)
    - Temperature: User-selected (if supported by model)
    - Response format: MultipleChoiceQuestion
    - Returns: Question with 4 options, each with feedback
    

    e) Post-Processing:

    - Set question ID = learning_objective ID
    - Verify all options have feedback
    - Add default feedback if missing
    

    Step 3.2.2: Quality Assessment (quiz_generator/question_improvement.py)

    judge_question_quality()
    

    Quality Judging Process:

    1. Creates evaluation prompt with:
       - Question text and all options
       - Quality criteria from prompts
       - Evaluation instructions
    
    2. LLM evaluates question for:
       - Clarity and unambiguity
       - Alignment with learning objective
       - Quality of incorrect options
       - Feedback quality
       - Appropriate difficulty
    
    3. Returns:
       - approved: bool
       - feedback: str (reasoning for judgment)
    
    4. Updates question:
       question.approved = approved
       question.judge_feedback = feedback
    
  3. Results Collection:

    - Questions collected as futures complete
    - IDs assigned sequentially across runs
    - All questions aggregated into single list
    

Example: 3 objectives × 1 run = 3 questions generated in parallel
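The parallel fan-out can be sketched as follows, with generate_one standing in for the per-objective generation-plus-judging pipeline (the sorting step is an assumption for stable display order):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def generate_questions_in_parallel(objectives, generate_one):
    """Run one worker per learning objective, capped at 5 concurrent threads."""
    max_workers = min(len(objectives), 5)
    questions = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(generate_one, o): o for o in objectives}
        for future in as_completed(futures):   # collect as each finishes
            questions.append(future.result())
    return sorted(questions, key=lambda q: q["id"])
```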

Step 3.3: Grouping Questions

Process: quiz_generator/question_ranking.py → group_questions()

1. Creates prompt with:
   - All generated questions
   - Grouping instructions
   - Example format

2. LLM identifies:
   - Questions testing same concept (same learning_objective_id)
   - Groups of similar questions
   - Best question in each group

3. Model: gpt-5-mini
   Response format: GroupedMultipleChoiceQuestionsResponse

4. Returns:
   {
     "grouped": [all questions with group metadata],
     "best_in_group": [best questions from each group]
   }

Step 3.4: Ranking Questions

Process: quiz_generator/question_ranking.py → rank_questions()

Only ranks best-in-group questions:

1. Creates prompt with:
   - RANK_QUESTIONS_PROMPT
   - All quality standards
   - Best-in-group questions only
   - Course content for context

2. Ranking Criteria:
   - Question clarity and unambiguity
   - Alignment with learning objective
   - Quality of incorrect options
   - Feedback quality
   - Appropriate difficulty (prefers simple English)
   - Adherence to all guidelines
   - Avoidance of absolute terms

3. Special Instructions:
   - NEVER change question with ID=1
   - Each question gets unique rank (2, 3, 4, ...)
   - Rank 1 is reserved
   - All questions must be returned

4. Model: User-selected
   Response format: RankedMultipleChoiceQuestionsResponse

5. Returns:
   {
     "ranked": [questions with rank and ranking_reasoning]
   }

Step 3.5: Format Results

Process: question_handlers._format_question_results()

Three outputs:

  1. Best-in-Group Ranked Questions:

    - Sorted by rank
    - Includes all question data
    - Includes rank and ranking_reasoning
    - Includes group metadata
    - Formatted as JSON
    
  2. All Grouped Questions:

    - All questions with group metadata
    - No ranking information
    - Shows which questions are in groups
    - Formatted as JSON
    
  3. Formatted Quiz:

    format_quiz_for_ui() creates human-readable format:
    
    **Question 1 [Rank: 2]:** What is...
    
    Ranking Reasoning: ...
    
    • A [Correct]: Option text
      ◦ Feedback: Correct feedback
    
    • B: Option text
      ◦ Feedback: Incorrect feedback
    
    [continues for all questions]
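A sketch of format_quiz_for_ui() reproducing the layout shown above; question dicts follow the field names from the data-model section of this report:

```python
def format_quiz_for_ui(questions: list[dict]) -> str:
    """Render ranked questions in the human-readable layout shown above."""
    lines = []
    for q in questions:
        lines.append(f"**Question {q['id']} [Rank: {q['rank']}]:** {q['question_text']}")
        lines.append(f"\nRanking Reasoning: {q['ranking_reasoning']}\n")
        for letter, opt in zip("ABCD", q["options"]):
            marker = " [Correct]" if opt["is_correct"] else ""
            lines.append(f"• {letter}{marker}: {opt['option_text']}")
            lines.append(f"  ◦ Feedback: {opt['feedback']}")
        lines.append("")
    return "\n".join(lines)
```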
    

Phase 4: Custom Question Generation (Optional)

Tab 3 Workflow:

Step 4.1: User Input

User provides:

  • Free-form guidance/feedback text
  • Model selection
  • Temperature setting

Step 4.2: Generation

Process: feedback_handlers.propose_question_handler()

QuizGenerator.generate_multiple_choice_question_from_feedback()
  ↓
quiz_generator/feedback_questions.py

Workflow:

1. Retrieves processed file contents from state

2. Creates prompt combining:
   - User feedback/guidance
   - All quality standards
   - Course content
   - Generation criteria

3. Model generates:
   - Single question
   - With learning objective inferred from guidance
   - 4 options with feedback
   - Source references

4. Returns: MultipleChoiceQuestionFromFeedback object
   (includes user feedback as metadata)

5. Formatted as JSON for display

Phase 5: Assessment Export (Automated)

The final assessment can be saved using:

QuizGenerator.save_assessment_to_json()
  ↓
quiz_generator/assessment.py → save_assessment_to_json()

Process:

1. Convert Assessment object to dictionary
   assessment_dict = assessment.model_dump()

2. Write to JSON file with indent=2
   Default filename: "assessment.json"

3. Contains:
   - All learning objectives (best-in-group)
   - All ranked questions
   - Complete metadata
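A sketch of the export step; the Assessment stub below mimics Pydantic's model_dump() so the example is self-contained:

```python
import json

class Assessment:
    """Stand-in for the Pydantic Assessment model described above."""
    def __init__(self, learning_objectives, questions):
        self.learning_objectives = learning_objectives
        self.questions = questions

    def model_dump(self):
        return {"learning_objectives": self.learning_objectives,
                "questions": self.questions}

def save_assessment_to_json(assessment, filename="assessment.json"):
    """Serialize the assessment to pretty-printed JSON (indent=2)."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(assessment.model_dump(), f, indent=2)
    return filename
```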

Detailed Component Functionality

Content Processor (ui/content_processor.py)

Class: ContentProcessor

Methods:

  1. process_files(file_paths: List[str]) -> List[str]

    • Main entry point for processing multiple files
    • Returns list of XML-tagged content strings
    • Stores results in self.file_contents
  2. process_file(file_path: str) -> List[str]

    • Routes to appropriate handler based on file extension
    • Returns single-item list with tagged content
  3. _process_subtitle_file(file_path: str) -> List[str]

    • Filters out timestamps and metadata
    • Preserves actual subtitle text
    • Wraps in <source file='...'> tags
  4. _process_notebook_file(file_path: str) -> List[str]

    • Validates JSON structure
    • Parses with nbformat
    • Extracts markdown and code cells
    • Falls back to raw text on parsing errors
    • Wraps in <source file='...'> tags

Learning Objective Generator (learning_objective_generator/)

generator.py - LearningObjectiveGenerator Class

Orchestrator that delegates to specialized modules:

Methods:

  1. generate_base_learning_objectives()

    • Delegates to base_generation.py
    • Returns base objectives with correct answers
  2. group_base_learning_objectives()

    • Delegates to grouping_and_ranking.py
    • Groups similar objectives
    • Identifies best in each group
  3. generate_incorrect_answer_options()

    • Delegates to enhancement.py
    • Adds 5 incorrect answer suggestions per objective
  4. regenerate_incorrect_answers()

    • Delegates to suggestion_improvement.py
    • Quality-checks and improves incorrect answers
  5. generate_and_group_learning_objectives()

    • Complete workflow method
    • Combines: base generation → grouping → incorrect answers
    • Returns dict with all_grouped and best_in_group

base_generation.py

Key Functions:

generate_base_learning_objectives()

  • Wrapper that calls two separate functions
  • First: Generate objectives without correct answers
  • Second: Generate correct answers for those objectives

generate_base_learning_objectives_without_correct_answers()

Process:

1. Extract source filenames from XML tags
2. Combine all file contents
3. Create prompt with:
   - BASE_LEARNING_OBJECTIVES_PROMPT
   - BLOOMS_TAXONOMY_LEVELS
   - LEARNING_OBJECTIVE_EXAMPLES_WITHOUT_ANSWERS
   - Course content
4. API call:
   - Model: User-selected
   - Temperature: User-selected (if supported)
   - Response format: BaseLearningObjectivesWithoutCorrectAnswerResponse
5. Post-process:
   - Assign sequential IDs
   - Normalize source_reference (extract basenames)
6. Returns: List[BaseLearningObjectiveWithoutCorrectAnswer]

generate_correct_answers_for_objectives()

Process:

1. For each objective without answer:
   - Create prompt with objective + course content
   - Call OpenAI API (text response, not structured)
   - Extract correct answer
   - Create BaseLearningObjective with answer
2. Error handling: Add "[Error generating correct answer]" on failure
3. Returns: List[BaseLearningObjective]

Quality Guidelines in Prompt:

  • Objectives must be assessable via multiple-choice
  • Start with action verbs (identify, describe, define, list, compare)
  • One goal per objective
  • Derived directly from course content
  • Tool/framework agnostic (focus on principles, not specific implementations)
  • The first objective should be a relatively easy recall question
  • Avoid objectives about "building" or "creating" (not MC-assessable)

grouping_and_ranking.py

Key Functions:

group_base_learning_objectives()

Process:

1. Format objectives for display in prompt
2. Create grouping prompt with:
   - Original generation criteria
   - All base objectives
   - Course content
   - Grouping instructions
3. Special rule:
   - All objectives with IDs ending in 1 grouped together
   - Best one selected from this group
   - Will become primary objective (ID=1)
4. API call:
   - Model: "gpt-5-mini" (hardcoded for efficiency)
   - Response format: GroupedBaseLearningObjectivesResponse
5. Post-process:
   - Normalize best_in_group to Python bool
   - Filter for best-in-group objectives
6. Returns:
   {
     "all_grouped": List[GroupedBaseLearningObjective],
     "best_in_group": List[GroupedBaseLearningObjective]
   }

Grouping Criteria:

  • Topic overlap
  • Similarity of concepts
  • Quality based on original generation criteria
  • Clarity and specificity
  • Alignment with course content

enhancement.py

Key Function: generate_incorrect_answer_options()

Process:

1. For each base objective:
   - Create prompt with:
     - Learning objective and correct answer
     - INCORRECT_ANSWER_PROMPT (detailed guidelines)
     - INCORRECT_ANSWER_EXAMPLES
     - Course content
   - Request 5 plausible incorrect options
2. API call:
   - Model: model_override or default
   - Temperature: User-selected (if supported)
   - Response format: LearningObjective (includes incorrect_answer_options)
3. Returns: List[LearningObjective] with all fields populated

Incorrect Answer Quality Principles:

  • Create common misunderstandings
  • Maintain identical structure to correct answer
  • Use course terminology correctly but in wrong contexts
  • Include partially correct information
  • Avoid obviously wrong answers
  • Mirror detail level and style of correct answer
  • Avoid absolute terms ("always", "never", "exclusively")
  • Avoid contradictory second clauses

suggestion_improvement.py

Key Function: regenerate_incorrect_answers()

Process:

1. For each learning objective:
   - Call should_regenerate_incorrect_answers()

2. should_regenerate_incorrect_answers():
   - Creates evaluation prompt with:
     - Objective and all incorrect options
     - IMMEDIATE_RED_FLAGS checklist
     - RULES_FOR_SECOND_CLAUSES
   - LLM evaluates each option
   - Returns: needs_regeneration: bool

3. If regeneration needed:
   - Logs to incorrect_suggestion_debug/{id}.txt
   - Creates new prompt with additional constraints
   - Regenerates incorrect answers
   - Validates again

4. Returns: List[LearningObjective] with improved incorrect answers

Red Flags Checked:

  • Contradictory second clauses ("but not necessarily")
  • Explicit negations ("without automating")
  • Opposite descriptions ("fixed steps" for flexible systems)
  • Absolute/comparative terms
  • Hedging that creates limitations
  • Trade-off language creating false dichotomies
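Although the actual red-flag check is performed by an LLM, a literal substring scan over the phrases listed above illustrates the idea (the phrase list comes from this report; the function is hypothetical):

```python
RED_FLAGS = ["but not necessarily", "at the expense of", "rather than",
             "always", "never", "exclusively", "without automating"]

def flagged_phrases(option_text: str) -> list[str]:
    """Return the red-flag phrases present in an incorrect-answer option."""
    lowered = option_text.lower()
    return [phrase for phrase in RED_FLAGS if phrase in lowered]
```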

Quiz Generator (quiz_generator/)

generator.py - QuizGenerator Class

Orchestrator with LearningObjectiveGenerator embedded:

Initialization:

def __init__(self, api_key, model="gpt-5", temperature=1.0):
    self.client = OpenAI(api_key=api_key)
    self.model = model
    self.temperature = temperature
    self.learning_objective_generator = LearningObjectiveGenerator(
        api_key=api_key, model=model, temperature=temperature
    )

Methods (delegates to specialized modules):

  1. generate_base_learning_objectives() → delegates to LearningObjectiveGenerator
  2. generate_lo_incorrect_answer_options() → delegates to LearningObjectiveGenerator
  3. group_base_learning_objectives() → delegates to grouping_and_ranking.py
  4. generate_multiple_choice_question() → delegates to question_generation.py
  5. generate_questions_in_parallel() → delegates to assessment.py
  6. group_questions() → delegates to question_ranking.py
  7. rank_questions() → delegates to question_ranking.py
  8. judge_question_quality() → delegates to question_improvement.py
  9. regenerate_incorrect_answers() → delegates to question_improvement.py
  10. generate_multiple_choice_question_from_feedback() → delegates to feedback_questions.py
  11. save_assessment_to_json() → delegates to assessment.py

question_generation.py

Key Function: generate_multiple_choice_question()

Detailed Process:

1. Source Content Matching:

source_references = learning_objective.source_reference
if isinstance(source_references, str):
    source_references = [source_references]

combined_content = ""
for source_file in source_references:
    found = False
    # Try exact match: <source file='filename'>
    for file_content in file_contents:
        if f"<source file='{source_file}'>" in file_content:
            combined_content += file_content
            found = True
            break

    # Fallback: partial match
    if not found:
        for file_content in file_contents:
            if source_file in file_content:
                combined_content += file_content
                break

# Last resort: use all content
if not combined_content:
    combined_content = "\n\n".join(file_contents)

2. Multi-Source Instruction:

if len(source_references) > 1:
    Add special instruction:
    "This learning objective spans multiple sources.
     Your question should:
     1. Synthesize information across these sources
     2. Test understanding of overarching themes
     3. Require knowledge from multiple sources"

3. Prompt Construction: Combines extensive quality standards:

- Learning objective
- Correct answer
- Incorrect answer options from objective
- GENERAL_QUALITY_STANDARDS
- MULTIPLE_CHOICE_STANDARDS
- EXAMPLE_QUESTIONS
- QUESTION_SPECIFIC_QUALITY_STANDARDS
- CORRECT_ANSWER_SPECIFIC_QUALITY_STANDARDS
- INCORRECT_ANSWER_EXAMPLES_WITH_EXPLANATION
- ANSWER_FEEDBACK_QUALITY_STANDARDS
- Multi-source instruction (if applicable)
- Matched course content

4. API Call:

params = {
    "model": model,
    "messages": [
        {"role": "system", "content": "Expert educational assessment creator"},
        {"role": "user", "content": prompt}
    ],
    "response_format": MultipleChoiceQuestion
}
if not TEMPERATURE_UNAVAILABLE.get(model, True):
    params["temperature"] = temperature

response = client.beta.chat.completions.parse(**params)

5. Post-Processing:

- Set response.id = learning_objective.id
- Set response.learning_objective_id = learning_objective.id
- Set response.learning_objective = learning_objective.learning_objective
- Set response.source_reference = learning_objective.source_reference
- Verify all options have feedback
- Add default feedback if missing

6. Error Handling:

On exception:
- Create fallback question with 4 generic options
- Include error message in question_text
- Mark as questionable quality

question_ranking.py

Key Functions:

group_questions(questions, file_contents)

Process:

1. Create prompt with:
   - GROUP_QUESTIONS_PROMPT
   - All questions with complete data
   - Grouping instructions

2. Grouping Logic:
   - Questions with same learning_objective_id are similar
   - Group by topic overlap
   - Mark best_in_group within each group
   - Single-member groups: best_in_group = true by default

3. API call:
   - Model: User-selected
   - Response format: GroupedMultipleChoiceQuestionsResponse

4. Critical Instructions:
   - MUST return ALL questions
   - Each question must have group metadata
   - best_in_group set appropriately

5. Returns:
   {
     "grouped": List[GroupedMultipleChoiceQuestion],
     "best_in_group": [questions where best_in_group=true]
   }

rank_questions(questions, file_contents)

Process:

1. Create prompt with:
   - RANK_QUESTIONS_PROMPT
   - ALL quality standards (comprehensive)
   - Best-in-group questions only
   - Course content

2. Ranking Criteria (from prompt):
   - Question clarity and unambiguity
   - Alignment with learning objective
   - Quality of incorrect options
   - Feedback quality
   - Appropriate difficulty (simple English preferred)
   - Adherence to all guidelines
   - Avoidance of problematic words/phrases

3. Special Instructions:
   - DO NOT change question with ID=1
   - Rank starting from 2 (rank 1 reserved)
   - Each question gets unique rank
   - Must return ALL questions

4. API call:
   - Model: User-selected
   - Response format: RankedMultipleChoiceQuestionsResponse

5. Returns:
   {
     "ranked": List[RankedMultipleChoiceQuestion]
              (includes rank and ranking_reasoning for each)
   }

Simple vs Complex English Examples (from ranking criteria):

Simple: "AI engineers create computer programs that can learn from data"
Complex: "AI engineering practitioners architect computational paradigms
          exhibiting autonomous erudition capabilities"

question_improvement.py

Key Functions:

judge_question_quality(client, model, temperature, question)

Process:

1. Create evaluation prompt with:
   - Question text
   - All options with feedback
   - Quality criteria
   - Evaluation instructions

2. LLM evaluates:
   - Clarity and lack of ambiguity
   - Alignment with learning objective
   - Quality of distractors (incorrect options)
   - Feedback quality and helpfulness
   - Appropriate difficulty level
   - Adherence to all standards

3. API call:
   - Unstructured text response
   - LLM returns: APPROVED or NOT APPROVED + reasoning

4. Parsing:
   approved = "APPROVED" in response.upper()
   feedback = full response text

5. Returns: (approved: bool, feedback: str)
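Note that a bare substring check also matches "NOT APPROVED"; a more defensive parse (an illustrative variant, not the shipped code) checks the negative verdict first:

```python
def parse_judgment(response_text: str) -> tuple[bool, str]:
    """Defensive variant: test for the negative verdict before the positive,
    since 'APPROVED' is a substring of 'NOT APPROVED'."""
    upper = response_text.upper()
    approved = "APPROVED" in upper and "NOT APPROVED" not in upper
    return approved, response_text
```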

should_regenerate_incorrect_answers(client, question, file_contents, model_name)

Process:

1. Extract incorrect options from question

2. Create evaluation prompt with:
   - Each incorrect option
   - IMMEDIATE_RED_FLAGS checklist
   - Course content for context

3. LLM checks each option for:
   - Contradictory second clauses
   - Explicit negations
   - Absolute terms
   - Opposite descriptions
   - Trade-off language

4. Returns: needs_regeneration: bool

5. If true:
   - Log to wrong_answer_debug/ directory
   - Provide detailed feedback on the issues found

regenerate_incorrect_answers(client, model, temperature, questions, file_contents)

Process:

1. For each question:
   - Check if regeneration needed
   - If yes:
     a. Create new prompt with stricter constraints
     b. Include original question for context
     c. Add specific rules about avoiding red flags
     d. Regenerate options
     e. Validate again
   - If no: keep original

2. Returns: List of questions with improved incorrect answers

feedback_questions.py

Key Function: generate_multiple_choice_question_from_feedback()

Process:

1. Accept user feedback/guidance as free-form text

2. Create prompt combining:
   - User feedback
   - All quality standards
   - Course content
   - Standard generation criteria

3. LLM infers:
   - Learning objective from feedback
   - Appropriate question
   - 4 options with feedback
   - Source references

4. API call:
   - Model: User-selected
   - Response format: MultipleChoiceQuestionFromFeedback

5. Includes user feedback as metadata in response

6. Returns: Single question object
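The structured call pattern used here (and throughout the app) can be sketched with `instructor` and Pydantic. The model and field names below are illustrative, not the app's exact schema, and `client` is assumed to be an `instructor`-patched OpenAI client (e.g. `instructor.from_openai(OpenAI())`):

```python
from pydantic import BaseModel

class Option(BaseModel):
    text: str
    is_correct: bool
    feedback: str

class MultipleChoiceQuestionFromFeedback(BaseModel):
    learning_objective: str   # inferred by the LLM from the feedback
    question: str
    options: list[Option]
    user_feedback: str        # echoed back as metadata

def generate_from_feedback(client, model: str, guidance: str, content: str):
    """client is assumed to be instructor.from_openai(OpenAI())."""
    return client.chat.completions.create(
        model=model,
        response_model=MultipleChoiceQuestionFromFeedback,
        messages=[
            {"role": "system", "content": "You write quiz questions."},
            {"role": "user",
             "content": f"Guidance: {guidance}\n\nCourse content:\n{content}"},
        ],
    )
```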

assessment.py

Key Functions:

generate_questions_in_parallel()

Parallel Processing Details:

1. Setup:
   max_workers = min(len(learning_objectives), 5)
   # Limits to 5 concurrent threads

2. Thread Pool Executor:
   with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:

3. For each objective (in separate thread):

   Worker function:
   def generate_question_for_objective(objective, idx):
       - Generate question
       - Judge quality
       - Update with approval and feedback
       - Handle errors gracefully
       - Return complete question

4. Submit all tasks:
   future_to_idx = {
       executor.submit(generate_question_for_objective, obj, i): i
       for i, obj in enumerate(learning_objectives)
   }

5. Collect results as completed:
   for future in concurrent.futures.as_completed(future_to_idx):
       question = future.result()
       questions.append(question)
       print progress

6. Error handling:
   - Individual failures don't stop other threads
   - Placeholder questions created on error
   - All errors logged

7. Returns: List[MultipleChoiceQuestion] with quality judgments
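The steps above can be combined into one runnable sketch, with the per-objective LLM work stubbed out (the stub body is illustrative, not the app's actual generation logic):

```python
import concurrent.futures

def generate_question_for_objective(objective: str, idx: int) -> dict:
    # Placeholder for: generate question, judge quality, attach feedback.
    return {"objective": objective, "idx": idx, "approved": True}

def generate_questions_in_parallel(learning_objectives: list[str]) -> list[dict]:
    questions = []
    max_workers = min(len(learning_objectives), 5)  # mirrors the app's cap of 5
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_idx = {
            executor.submit(generate_question_for_objective, obj, i): i
            for i, obj in enumerate(learning_objectives)
        }
        for future in concurrent.futures.as_completed(future_to_idx):
            idx = future_to_idx[future]
            try:
                questions.append(future.result())
            except Exception as exc:  # one failure must not stop the others
                questions.append({"idx": idx, "error": str(exc), "approved": False})
    return questions
```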

save_assessment_to_json(assessment, output_path)

1. Convert Pydantic model to dict:
   assessment_dict = assessment.model_dump()

2. Write to JSON file:
   with open(output_path, "w") as f:
       json.dump(assessment_dict, f, indent=2)

3. File contains:
   {
     "learning_objectives": [...],
     "questions": [...]
   }

State Management (ui/state.py)

Global State Variables:

processed_file_contents = []  # List of XML-tagged content strings
generated_learning_objectives = []  # List of learning objective objects

Functions:

  • get_processed_contents() β†’ retrieves file contents
  • set_processed_contents(contents) β†’ stores file contents
  • get_learning_objectives() β†’ retrieves objectives
  • set_learning_objectives(objectives) β†’ stores objectives
  • clear_state() β†’ resets both variables

Purpose:

  • Persists data between UI tabs
  • Allows Tab 2 to access content processed in Tab 1
  • Allows Tab 3 to access content for custom questions
  • Enables regeneration with feedback

UI Handlers

objective_handlers.py

process_files(files, num_objectives, num_runs, model_name, incorrect_answer_model_name, temperature)

Complete Workflow:

1. Validate inputs (files exist, API key present)
2. Extract file paths from Gradio file objects
3. Process files → get XML-tagged content
4. Store in state
5. Create QuizGenerator
6. Generate multiple runs of base objectives
7. Group and rank objectives
8. Generate incorrect answers for best-in-group
9. Improve incorrect answers
10. Reassign IDs (best from 001 group → ID=1)
11. Format results for display
12. Store in state
13. Return 4 outputs: status, best-in-group, all-grouped, raw

regenerate_objectives(objectives_json, feedback, num_objectives, num_runs, model_name, temperature)

Workflow:

1. Retrieve processed contents from state
2. Append feedback to content:
   file_contents_with_feedback.append(f"FEEDBACK: {feedback}")
3. Generate new objectives with feedback context
4. Group and rank
5. Return regenerated objectives

_reassign_objective_ids(grouped_objectives)

ID Assignment Logic:

1. Find all objectives with IDs ending in 001 (1001, 2001, etc.)
2. Identify their groups
3. Find best_in_group objective from these groups
4. Assign it ID = 1
5. Assign all other objectives sequential IDs starting from 2
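A hedged sketch of this logic, assuming each grouped objective is a dict with `id`, `group_id`, and `best_in_group` fields (the actual app uses Pydantic objects):

```python
def reassign_objective_ids(grouped_objectives: list[dict]) -> list[dict]:
    """Reserve ID=1 for the best objective in the groups that contain
    the run-first objectives (IDs ending in 001: 1001, 2001, ...)."""
    first_groups = {o["group_id"] for o in grouped_objectives
                    if o["id"] % 1000 == 1}
    best_first = next(
        (o for o in grouped_objectives
         if o["group_id"] in first_groups and o["best_in_group"]),
        None,
    )
    next_id = 2
    for obj in grouped_objectives:
        if obj is best_first:
            obj["id"] = 1            # reserved rank-1 slot
        else:
            obj["id"] = next_id      # everything else: sequential from 2
            next_id += 1
    return grouped_objectives
```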

_format_objective_results(grouped_result, all_learning_objectives)

Formatting:

1. Sort by ID
2. Create dictionaries from Pydantic objects
3. Include all metadata fields
4. Convert to JSON with indent=2
5. Return 3 formatted outputs + status message

question_handlers.py

generate_questions(objectives_json, model_name, temperature, num_runs)

Complete Workflow:

1. Validate inputs
2. Parse objectives JSON → create LearningObjective objects
3. Retrieve processed contents from state
4. Create QuizGenerator
5. Generate questions (multiple runs in parallel)
6. Group questions by similarity
7. Rank best-in-group questions
8. Optionally improve incorrect answers (currently commented out)
9. Format results
10. Return 4 outputs: status, best-ranked, all-grouped, formatted

_generate_questions_multiple_runs()

For each run:
1. Call generate_questions_in_parallel()
2. Assign unique IDs across runs:
   start_id = len(all_questions) + 1
   for i, q in enumerate(run_questions):
       q.id = start_id + i
3. Aggregate all questions

_group_and_rank_questions()

1. Group all questions → get grouped and best_in_group
2. Rank only best_in_group questions
3. Return:
   {
     "grouped": all with group metadata,
     "best_in_group_ranked": best with ranks
   }

feedback_handlers.py

propose_question_handler(guidance, model_name, temperature)

Workflow:

1. Validate state (processed contents available)
2. Create QuizGenerator
3. Call generate_multiple_choice_question_from_feedback()
   - Passes user guidance and course content
   - LLM infers learning objective
   - Generates complete question
4. Format as JSON
5. Return status and question JSON

Formatting Utilities (ui/formatting.py)

format_quiz_for_ui(questions_json)

Process:

1. Parse JSON to list of question dictionaries
2. Sort by rank if available
3. For each question:
   - Add header: "**Question N [Rank: X]:** {question_text}"
   - Add ranking reasoning if available
   - For each option:
     - Add letter (A, B, C, D)
     - Mark correct option
     - Include option text
     - Include feedback indented
4. Return formatted string with markdown

Output Example:

**Question 1 [Rank: 2]:** What is the primary purpose of AI agents?

Ranking Reasoning: Clear question that tests fundamental understanding...

    • A [Correct]: To automate tasks and make decisions
      ◦ Feedback: Correct! AI agents are designed to automate tasks...

    • B: To replace human workers entirely
      ◦ Feedback: While AI agents can automate tasks, they are not...

[continues...]
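The formatting steps above can be sketched as follows (a simplified dict schema is assumed; the app operates on parsed JSON):

```python
def format_quiz_for_ui(questions: list[dict]) -> str:
    """Render ranked questions as the markdown layout shown above."""
    lines = []
    # Unranked questions sort to the end.
    questions = sorted(questions, key=lambda q: q.get("rank", 10**6))
    for n, q in enumerate(questions, start=1):
        rank = f" [Rank: {q['rank']}]" if "rank" in q else ""
        lines.append(f"**Question {n}{rank}:** {q['question']}\n")
        if q.get("ranking_reasoning"):
            lines.append(f"Ranking Reasoning: {q['ranking_reasoning']}\n")
        for letter, opt in zip("ABCD", q["options"]):
            marker = " [Correct]" if opt["is_correct"] else ""
            lines.append(f"    • {letter}{marker}: {opt['text']}")
            lines.append(f"      ◦ Feedback: {opt['feedback']}\n")
    return "\n".join(lines)
```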

Quality Standards and Prompts

Learning Objectives Quality Standards

From prompts/learning_objectives.py:

BASE_LEARNING_OBJECTIVES_PROMPT - Key Requirements:

  1. Assessability:

    • Must be testable via multiple-choice questions
    • Cannot be about "building", "creating", "developing"
    • Should use verbs like: identify, list, describe, define, compare
  2. Specificity:

    • One goal per objective
    • Don't combine multiple action verbs
    • Example of what NOT to do: "identify X and explain Y"
  3. Source Alignment:

    • Derived DIRECTLY from course content
    • No topics not covered in content
    • Appropriate difficulty level for course
  4. Independence:

    • Each objective stands alone
    • No dependencies on other objectives
    • No context required from other objectives
  5. Focus:

    • Address "why" over "what" when possible
    • Critical knowledge over trivial facts
    • Principles over specific implementation details
  6. Tool/Framework Agnosticism:

    • Don't mention specific tools/frameworks
    • Focus on underlying principles
    • Example: Don't ask about "Pandas DataFrame methods", ask about "data filtering concepts"
  7. First Objective Rule:

    • Should be a relatively easy recall question
    • Addresses the main topic/concept of the course
    • Format: "Identify what X is" or "Explain why X is important"
  8. Answer Length:

    • Aim for ≤20 words in the correct answer
    • Avoid unnecessary elaboration
    • No compound sentences with extra consequences

BLOOMS_TAXONOMY_LEVELS:

Levels from lowest to highest:

  • Recall: Retention of key concepts (not trivialities)
  • Comprehension: Connect ideas, demonstrate understanding
  • Application: Apply concept to new but similar scenario
  • Analysis: Examine parts, determine relationships, make inferences
  • Evaluation: Make judgments requiring critical thinking

LEARNING_OBJECTIVE_EXAMPLES:

Includes 7 high-quality examples with:

  • Appropriate action verbs
  • Clear learning objectives
  • Concise correct answers (mostly <20 words)
  • Multiple source references
  • Framework-agnostic language

Question Quality Standards

From prompts/questions.py:

GENERAL_QUALITY_STANDARDS:

  • Overall goal: Set learner up for success
  • Perfect score attainable for thoughtful students
  • Aligned with course content
  • Aligned with learning objective and correct answer
  • No references to manual intervention (software/AI course)

MULTIPLE_CHOICE_STANDARDS:

  • EXACTLY ONE correct answer per question
  • Clear, unambiguous correct answer
  • Plausible distractors representing common misconceptions
  • Not obviously wrong distractors
  • All options similar length and detail
  • Mutually exclusive options
  • Avoid "all/none of the above"
  • Typically 4 options (A, B, C, D)
  • Don't start feedback with "Correct" or "Incorrect"

QUESTION_SPECIFIC_QUALITY_STANDARDS:

Questions must:

  • Match language and tone of course
  • Match difficulty level of course
  • Assess only course information
  • Not teach as part of quiz
  • Use clear, concise language
  • Not induce confusion
  • Provide slight (not major) challenge
  • Be easily interpreted and unambiguous
  • Have proper grammar and sentence structure
  • Be thoughtful and specific (not broad and ambiguous)
  • Be complete in wording (understanding question shouldn't be part of assessment)

CORRECT_ANSWER_SPECIFIC_QUALITY_STANDARDS:

Correct answers must:

  • Be factually correct and unambiguous
  • Match course language and tone
  • Be complete sentences
  • Match course difficulty level
  • Contain only course information
  • Not teach during quiz
  • Use clear, concise language
  • Be thoughtful and specific
  • Be complete (identifying correct answer shouldn't require interpretation)

INCORRECT_ANSWER_SPECIFIC_QUALITY_STANDARDS:

Incorrect answers should:

  • Represent reasonable potential misconceptions
  • Sound plausible to non-experts
  • Require thought even from diligent learners
  • Not be obviously wrong
  • Use incorrect_answer_suggestions from objective (as starting point)

Avoid:

  • Obviously wrong options anyone can eliminate
  • Absolute terms: "always", "never", "only", "exclusively"
  • Phrases like "used exclusively for scenarios where..."

ANSWER_FEEDBACK_QUALITY_STANDARDS:

For Incorrect Answers:

  • Be informational and encouraging (not punitive)
  • Single sentence, concise
  • Do NOT say "Incorrect" or "Wrong"

For Correct Answers:

  • Be informational and encouraging
  • Single sentence, concise
  • Do NOT say "Correct!" (redundant after "Correct: " prefix)

Incorrect Answer Generation Guidelines

From prompts/incorrect_answers.py:

Core Principles:

  1. Create Common Misunderstandings:

    • Represent how students actually misunderstand
    • Confuse related concepts
    • Mix up terminology
  2. Maintain Identical Structure:

    • Match the grammatical pattern of the correct answer
    • Same length and complexity
    • Same formatting style
  3. Use Course Terminology Correctly but in Wrong Contexts:

    • Apply correct terms incorrectly
    • Confuse with related concepts
    • Example: an option that claims to describe backpropagation but actually describes forward propagation
  4. Include Partially Correct Information:

    • First part correct, second part wrong
    • Correct process but wrong application
    • Correct concept but incomplete
  5. Avoid Obviously Wrong Answers:

    • No contradictions with basic knowledge
    • Not immediately eliminable
    • Require course knowledge to reject
  6. Mirror Detail Level and Style:

    • Match technical depth
    • Match tone
    • Same level of specificity
  7. For Lists, Maintain Consistency:

    • Same number of items
    • Same format
    • Mix some correct with incorrect items
  8. AVOID ABSOLUTE TERMS:

    • "always", "never", "exclusively", "primarily"
    • "all", "every", "none", "nothing", "only"
    • "must", "required", "impossible"
    • "rather than", "as opposed to", "instead of"

IMMEDIATE_RED_FLAGS (triggers regeneration):

Contradictory Second Clauses:

  • "but not necessarily"
  • "at the expense of"
  • "rather than [core concept]"
  • "ensuring X rather than Y"
  • "without necessarily"
  • "but has no impact on"
  • "but cannot", "but prevents", "but limits"

Explicit Negations:

  • "without automating", "without incorporating"
  • "preventing [main benefit]"
  • "limiting [main capability]"

Opposite Descriptions:

  • "fixed steps" (for flexible systems)
  • "manual intervention" (for automation)
  • "simple question answering" (for complex processing)

Hedging Creating Limitations:

  • "sometimes", "occasionally", "might"
  • "to some extent", "partially", "somewhat"

INCORRECT_ANSWER_EXAMPLES:

Includes 10 detailed examples showing:

  • Learning objective
  • Correct answer
  • 3 plausible incorrect suggestions
  • Explanation of why each is plausible but wrong
  • Consistent formatting across all options

Ranking and Grouping

RANK_QUESTIONS_PROMPT:

Criteria:

  1. Question clarity and unambiguity
  2. Alignment with learning objective
  3. Quality of incorrect options
  4. Quality of feedback
  5. Appropriate difficulty (simple English preferred)
  6. Adherence to all guidelines

Critical Instructions:

  • DO NOT change question with ID=1
  • Rank starting from 2
  • Each question unique rank
  • Must return ALL questions
  • No omissions
  • No duplicate ranks

Simple vs Complex English:

Simple: "AI engineers create computer programs that learn from data"
Complex: "AI engineering practitioners architect computational paradigms
          exhibiting autonomous erudition capabilities"

GROUP_QUESTIONS_PROMPT:

Grouping Logic:

  • Questions with same learning_objective_id are similar
  • Identify topic overlap
  • Mark best_in_group within each group
  • Single-member groups: best_in_group = true

Critical Instructions:

  • Must return ALL questions
  • Each question needs group metadata
  • No omissions
  • Best in group marked appropriately

Summary of Data Flow

Complete End-to-End Flow

User Uploads Files
      ↓
ContentProcessor extracts and tags content
      ↓
[Stored in global state]
      ↓
Generate Base Objectives (multiple runs)
      ↓
Group Base Objectives (by similarity)
      ↓
Generate Incorrect Answers (for best-in-group only)
      ↓
Improve Incorrect Answers (quality check)
      ↓
Reassign IDs (best from 001 group → ID=1)
      ↓
[Objectives displayed in UI, stored in state]
      ↓
Generate Questions (parallel, multiple runs)
      ↓
Judge Question Quality (parallel)
      ↓
Group Questions (by similarity)
      ↓
Rank Questions (best-in-group only)
      ↓
[Questions displayed in UI]
      ↓
Format for Display
      ↓
Export to JSON (optional)

Key Optimization Strategies

  1. Multiple Generation Runs:

    • Generates a variety of objectives/questions
    • Grouping identifies the best versions
    • Reduces the risk of poor-quality individual outputs
  2. Hierarchical Processing:

    • Generate base → Group → Enhance → Improve
    • Only enhances the best candidates (saves API calls)
    • Progressive refinement
  3. Parallel Processing:

    • Questions generated concurrently (up to 5 threads)
    • Significant time savings for multiple objectives
    • Independent evaluations
  4. Quality Gating:

    • LLM judges question quality
    • Checks for red flags in incorrect answers
    • Regenerates problematic content
  5. Source Tracking:

    • XML tags preserve origin
    • Questions link back to source materials
    • Enables accurate content matching
  6. Modular Prompts:

    • Reusable quality standards
    • Consistent across all generations
    • Easy to update centrally

Configuration and Customization

Available Models

Configured in models/config.py:

MODELS = [
    "o3-mini", "o1",           # Reasoning models (no temperature)
    "gpt-4.1", "gpt-4o",       # GPT-4 variants
    "gpt-4o-mini", "gpt-4",
    "gpt-3.5-turbo",           # Legacy
    "gpt-5",                   # Latest (no temperature)
    "gpt-5-mini",              # Efficient (no temperature)
    "gpt-5-nano"               # Ultra-efficient (no temperature)
]

Temperature Support:

  • Models with reasoning (o1, o3-mini, gpt-5 variants): No temperature
  • Other models: Temperature 0.0 to 1.0

Model Selection Strategy:

  • Base objectives: User-selected (default: gpt-5)
  • Grouping: Hardcoded gpt-5-mini (efficiency)
  • Incorrect answers: Separate user selection (default: gpt-5)
  • Questions: User-selected (default: gpt-5)
  • Quality judging: User-selected or gpt-5-mini

Environment Variables

Required:

OPENAI_API_KEY=your_api_key_here

Configured via .env file in project root

Customization Points

  1. Quality Standards:

    • Edit prompts/learning_objectives.py
    • Edit prompts/questions.py
    • Edit prompts/incorrect_answers.py
    • Changes apply to all future generations
  2. Example Questions/Objectives:

    • Modify LEARNING_OBJECTIVE_EXAMPLES
    • Modify EXAMPLE_QUESTIONS
    • Modify INCORRECT_ANSWER_EXAMPLES
    • LLM learns from these examples
  3. Generation Parameters:

    • Number of objectives per run
    • Number of runs (variety)
    • Temperature (creativity vs consistency)
    • Model selection (quality vs cost/speed)
  4. Parallel Processing:

    • max_workers in assessment.py
    • Currently: min(len(objectives), 5)
    • Adjust for your rate limits
  5. Output Formats:

    • Modify formatting.py for display
    • Assessment JSON structure in models/assessment.py

Error Handling and Resilience

Content Processing Errors

  • Invalid JSON notebooks: Falls back to raw text
  • Parse failures: Wraps in code blocks, continues
  • Missing files: Logged, skipped
  • Encoding issues: UTF-8 fallback

Generation Errors

  • API failures: Logged with traceback
  • Structured output parse errors: Fallback responses created
  • Missing required fields: Default values assigned
  • Validation errors: Caught and logged

Parallel Processing Errors

  • Individual thread failures: Don't stop other threads
  • Placeholder questions: Created on error
  • Complete error details: Logged for debugging
  • Graceful degradation: Partial results returned

Quality Check Failures

  • Regeneration failures: Original kept with warning
  • Judge unavailable: Questions marked unapproved
  • Validation failures: Detailed logs in debug directories

Debug and Logging

Debug Directories

  1. incorrect_suggestion_debug/

    • Created during objective enhancement
    • Contains logs of problematic incorrect answers
    • Format: {objective_id}.txt
    • Includes: original suggestions, identified issues, regeneration attempts
  2. wrong_answer_debug/

    • Created during question improvement
    • Logs question-level incorrect answer issues
    • Regeneration history

Console Logging

Extensive logging throughout:

  • File processing status
  • Generation progress (run numbers)
  • Parallel thread activity (thread IDs)
  • API call results
  • Error messages with tracebacks
  • Timing information (start/end times)

Example Log Output:

DEBUG - Processing 3 files: ['file1.vtt', 'file2.ipynb', 'file3.srt']
DEBUG - Found source file: file1.vtt
Generating 3 learning objectives from 3 files
Successfully generated 3 learning objectives without correct answers
Generated correct answer for objective 1
Grouping 9 base learning objectives
Received 9 grouped results
Generating incorrect answer options only for best-in-group objectives...
PARALLEL: Starting ThreadPoolExecutor with 3 workers
PARALLEL: Worker 1 (Thread ID: 12345): Starting work on objective...
Question generation completed in 45.23 seconds

Performance Considerations

API Call Optimization

Calls per Workflow:

For 3 objectives × 3 runs = 9 base objectives:

  1. Learning Objectives:

    • Base generation: 3 calls (one per run)
    • Correct answers: 9 calls (one per objective)
    • Grouping: 1 call
    • Incorrect answers: ~3 calls (best-in-group only)
    • Improvement checks: ~3 calls
    • Total: ~19 calls
  2. Questions (for 3 objectives × 1 run):

    • Question generation: 3 calls (parallel)
    • Quality judging: 3 calls (parallel)
    • Grouping: 1 call
    • Ranking: 1 call
    • Total: ~8 calls

Total for complete workflow: ~27 API calls
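The totals above can be reproduced as simple arithmetic:

```python
# Call counts for the example workflow: 3 objectives x 3 runs.
num_objectives, num_runs = 3, 3
objective_calls = (
    num_runs                        # base generation, one call per run
    + num_objectives * num_runs     # correct answers, one per objective
    + 1                             # grouping
    + num_objectives                # incorrect answers (best-in-group only)
    + num_objectives                # improvement checks
)
question_calls = num_objectives + num_objectives + 1 + 1  # gen, judge, group, rank
total_calls = objective_calls + question_calls
```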

Time Estimates

Typical Execution Times:

  • File processing: <1 second
  • Objective generation (3Γ—3): 30-60 seconds
  • Question generation (3Γ—1): 20-40 seconds (with parallelization)
  • Total: 1-2 minutes for small course

Factors Affecting Speed:

  • Model selection (gpt-5 slower than gpt-5-mini)
  • Number of runs
  • Number of objectives/questions
  • API rate limits
  • Network latency
  • Parallel worker count

Cost Optimization

Strategies:

  1. Use gpt-5-mini for grouping/ranking (hardcoded)
  2. Reduce number of runs (trade-off: variety)
  3. Generate fewer objectives initially
  4. Use faster models for initial exploration
  5. Use premium models for final production

Conclusion

The AI Course Assessment Generator is a sophisticated, multi-stage system that transforms raw course materials into high-quality educational assessments. It employs:

  • Modular architecture for maintainability
  • Structured output generation for reliability
  • Quality-driven iterative refinement for excellence
  • Parallel processing for efficiency
  • Comprehensive error handling for resilience

The system successfully balances automation with quality control, producing assessments that align with educational best practices and Bloom's Taxonomy while maintaining complete traceability to source materials.